Find consecutive lines with a specific string and modify the file according to a table


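I have a text manipulation problem that I can't solve. Say I have a text file like the one below (text.txt). In some cases a line with /locus_tag is followed by a line with /gene, and in other cases it is not. I want to find every /locus_tag line that is not followed by a /gene line, then use a table like the one below (table.txt) to match the /locus_tag to a /gene, and add that /gene line to my text file right after the /locus_tag line.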

Any ideas on how to do this would be great.

/locus_tag="LOCUS_23770"
/note="ABC"
/locus_tag="LOCUS_23780"
/note="DEF"
/locus_tag="LOCUS_23980"
/note="GHI"
/locus_tag="LOCUS_24780"
/gene="BT_4758"
/note="ONP"
/locus_tag="LOCUS_25780"
/gene="BT_4768"
/note="WZX"

Table (table.txt)

/locus_tag       /gene
LOCUS_00010      BT_4578
LOCUS_00020      BT_4577
LOCUS_00030      BT_2429

Answer 1

Using your linked files, this works (note that BEGINFILE requires GNU awk):

awk 'BEGIN{FS="[ =]+"; OFS="="}
     BEGINFILE{fno++}
     fno==1{locus["\""$1"\""]="\""$2"\""; }
     fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$3; print}
    ' table file1

                     /locus_tag="LOCUS_00030"
                     /note="WP_011108293.1 hypothetical protein (Bacteroides

                     /locus_tag="LOCUS_00030"
/gene="BT_2429"
                     /note="WP_011108293.1 hypothetical protein (Bacteroides

Since you are not familiar with awk, here is a walkthrough:

awk 'BEGIN{FS="[ =]+"; OFS="="}
# set up the input field separator as any group of spaces and/or =
# and set the output field separator as =

     BEGINFILE{fno++}
     # Whenever you open a file, increment the file counter fno

     fno==1{locus["\""$1"\""]="\""$2"\""; }
     # if this is the first file (i.e. table) load the array `locus[]`
     # but wrap the fields in "..." so that they are exactly like the data file entries

     fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$3; print}
     # if this is a data file
     # if the current value of old (i.e. the previous line) is a LOCUS
     # and (&&) this line ($0) is not a gene
     # add a gene by indexing into the locus array based upon the value of old
     # because old contains the last LOCUS we found
     # in all cases
     #    set old to the 3rd field on the current line,
     #       which on any LOCUS line is the string "LOCUS_?????" and
     #    print the current line
     # See note below re $2 vs $3 and FS

    ' table file1
    # your input files, table must be first, you can have more data files if you want
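
The commands above rely on BEGINFILE, which is a gawk extension. For an awk that lacks it, a minimal sketch of the same idea can use the common NR==FNR idiom to detect the first file instead. The file names table.txt and text.txt and their contents below are made-up sample data for illustration only, and the idiom assumes the table file is not empty.

```shell
# Hypothetical sample files, modeled on the formats shown in the question.
cat > table.txt <<'EOF'
/locus_tag       /gene
LOCUS_00010      BT_4578
LOCUS_00020      BT_4577
EOF

cat > text.txt <<'EOF'
     /locus_tag="LOCUS_00010"
     /note="ABC"
     /locus_tag="LOCUS_00020"
     /gene="BT_4577"
     /note="DEF"
EOF

# NR==FNR is true only while the first file (the table) is being read,
# so it stands in for the fno==1 test; "next" skips the data-file rule.
out=$(awk 'BEGIN{FS="[ =]+"; OFS="="}
     NR==FNR{locus["\""$1"\""]="\""$2"\""; next}
     {if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$3; print}
    ' table.txt text.txt)
printf '%s\n' "$out"
```

With this sample data the only inserted line is /gene="BT_4578", placed directly after the LOCUS_00010 tag. LOCUS_00020 already has a /gene line, so it is left alone.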

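Alternatively, without the multi-character FS, keep old=$2: a single-character FS of = does not split on the whitespace before the text in the data file, whereas the multi-character FS does.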

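The following sets the field separator according to the file being read, with FS=(fno==1)?" ":"=", that is, a space for the table and = for the data: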

awk 'BEGIN{OFS="="}
     BEGINFILE{fno++;FS=(fno==1)?" ":"="}
     fno==1{locus["\""$1"\""]="\""$2"\""; }
     fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", locus[old]; old=$2; print}
    ' table file1

This assumes the table file is not so large that it exhausts memory.

And with a test that inserts a message where the gene is missing from the table, in case an empty /gene= line does not suit you:

fno>1{if (old ~ /LOCUS/ && $0 !~ /gene/) print "/gene", (old in locus)?locus[old]:"\"GENE_MISSING_AT_LOCUS\""; old=$3; print}

Change the field reference assigned to old so that it matches the version of FS you are using.

                     /locus_tag="LOCUS_00020"
/gene="GENE_MISSING_AT_LOCUS"
                     /note="WP_008765457.1 hypothetical protein (Bacteroides
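
The fallback rule above is only a fragment. Dropped into a complete command, it might look like the sketch below. The file names table_demo.txt and text_demo.txt and their contents are hypothetical, with LOCUS_00030 deliberately left out of the table. NR==FNR replaces BEGINFILE here purely so that the sketch runs on any POSIX awk, and it assumes a non-empty table file.

```shell
# Hypothetical sample files: LOCUS_00030 is deliberately absent from the table.
cat > table_demo.txt <<'EOF'
/locus_tag       /gene
LOCUS_00010      BT_4578
EOF

cat > text_demo.txt <<'EOF'
     /locus_tag="LOCUS_00010"
     /note="ABC"
     /locus_tag="LOCUS_00030"
     /note="DEF"
EOF

# Load the table first, then insert either the looked-up gene or a marker.
result=$(awk 'BEGIN{FS="[ =]+"; OFS="="}
     NR==FNR{locus["\""$1"\""]="\""$2"\""; next}
     {if (old ~ /LOCUS/ && $0 !~ /gene/)
          print "/gene", ((old in locus) ? locus[old] : "\"GENE_MISSING_AT_LOCUS\"")
      old=$3; print}
    ' table_demo.txt text_demo.txt)
printf '%s\n' "$result"
```

Using (old in locus) rather than testing locus[old] directly avoids creating an empty array entry for every tag that is missing from the table.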

Edit

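Looking at the example file you linked to, the only problem was a formatting difference between the example above and the actual data, which threw off the field numbering. old=$2 just needed to be changed to old=$3. This has been corrected above.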
