Como analisar um arquivo e gravar em colunas separadas

Como analisar um arquivo e gravar em colunas separadas

Estou tentando analisar a saída do antismash para contar o número de BGC. Encontrei os scripts que as pessoas têm com python sobre os quais não sei nada, então estou tentando descobrir isso usando o script bash.

Os arquivos em formato de banco de genes com clusters previstos são assim:

head -30 sca_11_chr8_3_0.region001.gbk
LOCUS       sca_11_chr8_3_0        45390 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  sca_11_chr8_3_0.
ACCESSION   sca_11_chr8_3_0
VERSION     sca_11_chr8_3_0
KEYWORDS    .
SOURCE
  ORGANISM
            .
COMMENT     ##antiSMASH-Data-START##
            Version      :: 6.0.1-a859617(changed)
            Run date     :: 2021-10-31 18:00:02
            NOTE: This is a single cluster extracted from a larger record!
            Orig. start  :: 169481
            Orig. end    :: 214871
            ##antiSMASH-Data-END##
FEATURES             Location/Qualifiers
     protocluster    1..45390
                     /aStool="rule-based-clusters"
                     /contig_edge="False"
                     /core_location="join{[194767:194871](-),
                     [194650:194652](-), [191596:194619](-),
                     [189481:191503](-)}"
                     /cutoff="20000"
                     /detection_rule="cds(Condensation and (AMP-binding or
                     A-OX))"
                     /neighbourhood="20000"
                     /product="NRPS"
                     /protocluster_number="1"
                     /tool="antismash"
     proto_core      complement(join(20001..22022,22116..25138,25170..25171,

cauda sca_11_chr8_3_0.region001.gbk

    44881 ggagcttgtg gagagaagtg agacgtatcg cacgaatgct cttcagcaga tgctgggcag
    44941 ttagaggatt tgcactttag tttcatagag ttgatgtgtc gaggagataa tttgagatac
    45001 cagtatatgt aatttaccta cctacctagt cgagattgga cattgtacaa gagaaataac
    45061 aactaactat acgagacaag cctgatgtgt tgatagtttc attcatgtct ggtgtttgtg
    45121 gcatgtttat gttggagtag ctgtacagaa gataccgcgc tattcccagt gatcatggcc
    45181 cccacgcctc caactcggca cctgaccttg atcccctttg ggaagcatgt ctcagtgtct
    45241 cagccgtgag ccgtagaggc tgcacagcat ggagaagctg tcctgtcaat tcaggggatt
    45301 tgcccacggg ggctatcata tgatgaatct cggacaccct acacgttgtt accgcctttc
    45361 ttagctcctg ctggtagccg tcccctgaac
//

Primeiro concatenei os arquivos gbk em um para que contenha todos os clusters previstos e, em seguida, grep os caracteres que me deram o ID do locus, início e fim do cluster e tipo de cluster.

cat sca_*.gbk > Necha2_SMclusters.gbk
grep "DEFINITION\|Orig\|product=" Necha2_SMclusters.gbk > Necha2_SMclusters_filtered.txt

o que me dá um arquivo como este

DEFINITION  sca_32_chr11_3_0.
            Orig. start  :: 381231
            Orig. end    :: 428233
                     /product="T1PKS"
                     /product="T1PKS"
                     /product="T1PKS"
                     /product="T1PKS"
DEFINITION  sca_32_chr11_3_0.
            Orig. start  :: 464307
            Orig. end    :: 486217
                     /product="terpene"
                     /product="terpene"
                     /product="terpene"
                     /product="terpene"
DEFINITION  sca_33_chr6_1_0.
            Orig. start  :: 140267
            Orig. end    :: 227928
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="NRPS-like"
                     /product="NRPS-like"
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="T1PKS"
                     /product="T1PKS"
DEFINITION  sca_39_chr11_5_0.
            Orig. start  :: 270154
            Orig. end    :: 324310
                     /product="NRPS"
                     /product="NRPS"
                     /product="NRPS"
                     /product="NRPS"

A partir deste arquivo, quero obter um arquivo parecido com este.

Locus name  start   end ClusterType
sca_9_chr7_10_0.    369577  421460  T1PKS,NRPS
sca_33_chr6_1_0.    140267  227928  NRPS-like, T1PKS
sca_32_chr11_3_0    381231  428233  T1PKS

Por enquanto, é disso que preciso: um arquivo com todos os clusters previstos.

Muito obrigado!!

Responder1

Dado este exemplo de entrada:

$ cat file1.gbk
DEFINITION  sca_32_chr11_3_0.
foo
            Orig. start  :: 381231
            Orig. end    :: 428233
                     /product="T1PKS"
                     /product="T1PKS"
bar
                     /product="T1PKS"
                     /product="T1PKS"
//
stuff
DEFINITION  sca_32_chr11_3_0.
            Orig. start  :: 464307
            Orig. end    :: 486217
                     /product="terpene"
nonsense
                     /product="terpene"
                     /product="terpene"
                     /product="terpene"
//
DEFINITION  sca_33_chr6_1_0.
            Orig. start  :: 140267
            Orig. end    :: 227928
                     /product="NRPS-like"
                     /product="T1PKS"
whatever
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="NRPS-like"
                     /product="NRPS-like"
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="T1PKS"
                     /product="T1PKS"

$ cat file2.gbk
here we go
DEFINITION  sca_39_chr11_5_0.
            Orig. start  :: 270154
more irrelevant text
            Orig. end    :: 324310
                     /product="NRPS"
                     /product="NRPS"
                     /product="NRPS"
                     /product="NRPS"

Este roteiro:

$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "DEFINITION"    {
    if ( ++cnt == 1 ) {
        print "Locus name", "start", "end", "ClusterType"
    }
    prt()
    locus = $2
}
/Orig\. start/        { start = $NF }
/Orig\. end/          { end = $NF }
sub(".*/product=","") { gsub(/"/,""); types[$NF] }
END { prt() }

function prt(   ct, type) {
    if ( locus != "" ) {
        for (type in types) {
            ct = (ct=="" ? "" : ct ",") type
        }
        print locus, start, end, ct
    }
    delete types
    locus = ""
}

produzirá esta saída:

$ awk -f tst.awk *.gbk
Locus name      start   end     ClusterType
sca_32_chr11_3_0.       381231  428233  T1PKS
sca_32_chr11_3_0.       464307  486217  terpene
sca_33_chr6_1_0.        140267  227928  T1PKS,NRPS-like
sca_39_chr11_5_0.       270154  324310  NRPS

informação relacionada