
I want to use grep to search a second file for the patterns listed in a first file. My pattern file looks like this:
K02217
K07448
KO8980
The file to search is:
>aai:AARI_24510 proP; proline/betaine transporter; K03762 MFS transporter, MHS family, proline/betaine transporter
>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
>aai:AARI_28260 hypothetical protein
>aai:AARI_29060 ABC drug resistance transporter, inner membrane subunit; K09686 antibiotic transport system permease protein
>aai:AARI_29070 ABC drug resistance transporter, ATP-binding subunit (EC:3.6.3.-); K09687 antibiotic transport system ATP-binding protein
>aai:AARI_29650 hypothetical protein
>aai:AARI_32480 iron-siderophore ABC transporter ATP-binding subunit (EC:3.6.3.-); K02013 iron complex transport system ATP-binding protein [EC:3.6.3.34]
>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
The command I tried is:
fgrep --file=pattern.txt file.txt >> output.txt
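This prints the lines of file.txt in which a pattern was found. I need it to also print a column containing the pattern that matched. So something like this: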
K07448 mrr; restriction system protein Mrr; K07448 restriction system
K02217 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
Can anyone suggest how I could do this?
Answer 1
If you don't mind an extra column in the output, you can do this with join and grep.
$ join <(grep -of patterns.txt file.txt | nl) \
<(grep -f patterns.txt file.txt | nl)
1 KO3322 proteinaseK (KO3322)
2 KO3435 Xxxxx KO3435;folding factor
3 KO3435 Yyyyy KO3435,xxxx
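To see how the two grep calls line up, here is a rough sketch of the same idea run on part of the question's sample data. It is not from the original answer: the temporary directory and file names are made up for illustration, and it writes the two numbered lists to files instead of using process substitution so that it also runs in a plain POSIX shell.

```shell
# Hypothetical scratch files; sample data copied from the question.
workdir=$(mktemp -d)
printf '%s\n' K02217 K07448 KO8980 > "$workdir/pattern.txt"
cat > "$workdir/file.txt" <<'EOF'
>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
>aai:AARI_28260 hypothetical protein
>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
EOF

# Left side: only the matched text, numbered by nl.
grep -of "$workdir/pattern.txt" "$workdir/file.txt" | nl > "$workdir/matched"
# Right side: the full matching lines, numbered the same way.
grep -f "$workdir/pattern.txt" "$workdir/file.txt" | nl > "$workdir/lines"

# join pairs rows that share the same line number (the extra first column).
joined=$(join "$workdir/matched" "$workdir/lines")
printf '%s\n' "$joined"
rm -rf "$workdir"
```

Note that grep -o prints one line per match, so the numbering only stays in step when every matching line contains exactly one match.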
Answer 2
You can use a shell loop:
$ while read pat; do
    grep "$pat" file |
      while read match; do
        echo -e "$pat\t$match"
      done
  done < patterns
KO3435 Xxxxx KO3435;folding factor
KO3435 Yyyyy KO3435,xxxx
KO3322 proteinaseK (KO3322)
I tested this by running it on the UniProt human flat file (625M) with 1000 UniProt IDs as patterns. It took about 6 minutes on my Pentium i7 laptop. With only 100 patterns, it took about 35 seconds.
As pointed out in the comments below, you can speed this up slightly by skipping echo and using grep with the --label and -H options:
$ while read pat; do
grep "$pat" --label="$pat" -H < file
done < patterns
Running this on the example files produces:
$ while read pat; do
grep "$pat" --label="$pat" -H < kegg.annotations;
done < allKO.IDs.txt > test1
terdon@oregano foo $ cat test1
K02217:>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
K07448:>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
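Since the loop above rereads the whole file once per pattern, a single-pass variant might be worth trying on large inputs. The following is only a sketch and not part of the original answer: it swaps the loop for awk, uses made-up temporary file names, treats every pattern as a fixed string rather than a regular expression, and I have not timed it.

```shell
# Hypothetical scratch files; sample data copied from the question.
workdir=$(mktemp -d)
printf '%s\n' K02217 K07448 KO8980 > "$workdir/pattern.txt"
cat > "$workdir/file.txt" <<'EOF'
>aai:AARI_26600 ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
>aai:AARI_28260 hypothetical protein
>aai:AARI_33320 mrr; restriction system protein Mrr; K07448 restriction system protein
EOF

# First file: remember every non-empty pattern.
# Second file: print "pattern<TAB>line" for each pattern found in the line.
result=$(awk '
    NR == FNR { if ($0 != "") pats[$0]; next }
    { for (p in pats) if (index($0, p)) print p "\t" $0 }
' "$workdir/pattern.txt" "$workdir/file.txt")

printf '%s\n' "$result"
rm -rf "$workdir"
```

The output is tab-separated, like the first loop, and a line that contains several patterns is printed once per pattern.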
Answer 3
You can use ack:
$ ack "$(tr '\n' '|' < pattern.txt | sed -e 's/.$//')" --print0 --output='$& $_' file.txt
KO3322 proteinaseK (KO3322)
KO3435 Xxxxx KO3435;folding factor
KO3435 Yyyyy KO3435,xxxx
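The tr and sed part of that command only turns the pattern file into a single alternation for ack to search with. A small sketch of just that step, using the three patterns from the question fed in through printf as a stand-in for pattern.txt (the variable name is made up):

```shell
# tr joins the lines with "|", and sed drops the trailing "|" left after the last line.
alternation=$(printf '%s\n' K02217 K07448 KO8980 | tr '\n' '|' | sed -e 's/.$//')
printf '%s\n' "$alternation"
```

This should print K02217|K07448|KO8980, which is the expression the ack call receives as its pattern.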