Grep to search for patterns in a file

Answer 1

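If you don't mind an extra column with a number, you can use join and grep to do this. grep -o prints only the matched pattern, nl numbers both outputs, and join pairs them on that number, so this assumes each matching line contains exactly one match.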

$ join <(grep -of patterns.txt file.txt | nl) \
       <(grep -f patterns.txt file.txt | nl)
1 KO3322 proteinaseK (KO3322)
2 KO3435 Xxxxx KO3435;folding factor
3 KO3435 Yyyyy KO3435,xxxx
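
If the number column is unwanted, it can be cut off after the join. The following is a minimal sketch, assuming GNU or BSD grep (for -o) and hypothetical sample data modeled on the output above; tag_matches is an illustrative helper name, and temporary files stand in for process substitution so the sketch also runs under plain sh. Note that join rejoins fields with single spaces, so runs of whitespace in the matched lines are collapsed.

```shell
# Hypothetical sample data, modeled on the output shown above.
workdir=$(mktemp -d)
printf '%s\n' 'KO3322' 'KO3435' > "$workdir/patterns.txt"
printf '%s\n' 'proteinaseK (KO3322)' 'unrelated line' \
    'Xxxxx KO3435;folding factor' 'Yyyyy KO3435,xxxx' > "$workdir/file.txt"

# tag_matches is a hypothetical helper, not part of the original answer.
tag_matches() {
    # $1 = patterns file, $2 = file to search
    grep -of "$1" "$2" | nl > "$workdir/ids.numbered"    # matched IDs, numbered
    grep -f "$1" "$2" | nl > "$workdir/lines.numbered"   # matching lines, numbered
    # Join on the line number, then drop that number column.
    join "$workdir/ids.numbered" "$workdir/lines.numbered" | cut -d' ' -f2-
}

result=$(tag_matches "$workdir/patterns.txt" "$workdir/file.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```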

Answer 2

You can use a shell loop:

$ while IFS= read -r pat; do
    grep "$pat" file |
        while IFS= read -r match; do
            echo -e "$pat\t$match"
        done
  done < patterns
KO3435  Xxxxx KO3435;folding factor
KO3435  Yyyyy KO3435,xxxx
KO3322  proteinaseK (KO3322)

I tested this by running it on the human UniProt flat file (625M), using 1000 UniProt IDs as patterns. It took about 6 minutes on my Pentium i7 laptop, and about 35 seconds when I searched for only 100 patterns.
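
The loop starts one grep per pattern and rereads the file each time, which fits the timings above growing with the number of patterns. One possible alternative, not part of the original answer, is a single pass with awk. The sketch below assumes the patterns are literal substrings rather than regular expressions, uses hypothetical sample data, and orders its output by line instead of by pattern.

```shell
# Hypothetical sample data, modeled on the example output above.
workdir=$(mktemp -d)
printf '%s\n' 'KO3435' 'KO3322' > "$workdir/patterns"
printf '%s\n' 'proteinaseK (KO3322)' 'unrelated line' \
    'Xxxxx KO3435;folding factor' 'Yyyyy KO3435,xxxx' > "$workdir/file"

result=$(awk '
    # First file: remember every pattern, in order.
    NR == FNR { pats[++n] = $0; next }
    # Second file: print each pattern found in the line as a literal substring.
    {
        for (i = 1; i <= n; i++)
            if (index($0, pats[i]))
                printf "%s\t%s\n", pats[i], $0
    }
' "$workdir/patterns" "$workdir/file")

printf '%s\n' "$result"
rm -rf "$workdir"
```

Whether this is actually faster on a given data set is something to measure; it avoids the repeated process start-up and file reads, but still compares every line against every pattern.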

As pointed out in the comments, you can make the loop a bit faster by skipping echo and using grep's --label and -H options:

$ while read pat; do 
    grep "$pat" --label="$pat" -H < file
done < patterns

Running this on your example files produces:

$ while read pat; do 
    grep "$pat" --label="$pat" -H < kegg.annotations; 
  done < allKO.IDs.txt > test1
$ cat test1
K02217:>aai:AARI_26600  ferritin-like protein; K02217 ferritin [EC:1.16.3.1]
K07448:>aai:AARI_33320  mrr; restriction system protein Mrr; K07448 restriction system protein
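
With --label and -H the pattern and the line are separated by a colon rather than a tab. If tab-separated output is preferred, the first colon can be replaced afterwards. This is a minimal sketch on hypothetical sample data; it assumes a grep that supports --label (GNU grep does) and that the patterns themselves contain no colon.

```shell
# Hypothetical sample data, modeled on the earlier example output.
workdir=$(mktemp -d)
printf '%s\n' 'KO3435' 'KO3322' > "$workdir/patterns"
printf '%s\n' 'proteinaseK (KO3322)' 'Xxxxx KO3435;folding factor' \
    'Yyyyy KO3435,xxxx' > "$workdir/file"

tab=$(printf '\t')
result=$(
    while IFS= read -r pat; do
        # -H prints a "file name" before each match; --label sets that name
        # for standard input, so the pattern itself becomes the prefix.
        grep -H --label="$pat" -- "$pat" < "$workdir/file"
    done < "$workdir/patterns" |
        sed "s/:/$tab/"    # replace only the first colon on each line
)

printf '%s\n' "$result"
rm -rf "$workdir"
```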

Answer 3

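You can use ack. The command below joins the patterns into a single alternation; in --output, $& is the matched text and $_ is the whole line: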

$ ack "$(tr '\n' '|' < pattern.txt | sed -e 's/.$//')" --print0 --output='$& $_' file.txt
KO3322 proteinaseK (KO3322)
KO3435 Xxxxx KO3435;folding factor
KO3435 Yyyyy KO3435,xxxx
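
The command substitution passed to ack can be inspected on its own. The sketch below, on hypothetical sample data, shows the alternation that tr and sed build, and checks it with grep -E instead of ack so that it runs without ack installed; it assumes the patterns contain no regular expression metacharacters.

```shell
# Hypothetical sample data, modeled on the example output above.
workdir=$(mktemp -d)
printf '%s\n' 'KO3322' 'KO3435' > "$workdir/pattern.txt"
printf '%s\n' 'proteinaseK (KO3322)' 'unrelated line' \
    'Xxxxx KO3435;folding factor' > "$workdir/file.txt"

# Turn newlines into "|" and drop the trailing "|" left by the last newline.
regex=$(tr '\n' '|' < "$workdir/pattern.txt" | sed -e 's/.$//')
printf '%s\n' "$regex"

# The same alternation is a valid extended regular expression for grep.
matches=$(grep -oE "$regex" "$workdir/file.txt")
printf '%s\n' "$matches"
rm -rf "$workdir"
```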
