
I have a tab-delimited file with many columns, in which column 19 looks like this:
gaA
gGg
Att
gtC
gGa
gcC
ccG
cTc
.
.
.
and so on
I only want to grep the uppercase characters, so I used:
cut -f19 1.table | grep -e '[[:upper:]]' -o
The output is:
A
G
A
C
G
C
G
T
.
.
.
and so on
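But I would rather not use cut before grep. I have two questions:
- Is there a way to grep only within column 19 without using cut? Or does grep have an option or parameter for specifying a column?
- How can I add the grep output to the 1.table file as a new column (for example, as column 20)?
Here are some input lines from 1.table (1.table also has a header):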
#CHROM POS ID REF ALT QUAL AC AN AF DP ExonicFunc.refGene Func.refGene AAChange.refGene Gene.refGene GeneDetail.refGene GENEINFO EFF refcodon ExAC_AFR ExAC_ALL ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS gnomAD_exome_AFR gnomAD_exome_ALL gnomAD_exome_AMR gnomAD_exome_ASJ gnomAD_exome_EAS gnomAD_exome_FIN gnomAD_exome_NFE gnomAD_exome_OTH gnomAD_exome_SAS gnomAD_genome_AFR gnomAD_genome_ALL gnomAD_genome_AMR gnomAD_genome_ASJ gnomAD_genome_EAS gnomAD_genome_FIN gnomAD_genome_NFE gnomAD_genome_OTH 1000g2015aug_all esp6500siv2_all CADD_phred CADD_raw CADD_raw_rankscore CAF DANN_rankscore DANN_score Eigen Eigen-PC-raw Eigen-raw Eigen_coding_or_noncoding FATHMM_coding FATHMM_converted_rankscore FATHMM_noncoding FATHMM_pred FATHMM_score FS GTEx_V6_gene GTEx_V6_tissue GWAVA_region_scoreGWAVA_tss_score GWAVA_unmatched_score GenoCanyon_score GenoCanyon_score_rankscore Interpro_domain LRT_converted_rankscore LRT_pred LRT_score MetaLR_pred MetaLR_rankscore MetaLR_score MetaSVM_pred MetaSVM_rankscore MetaSVM_score MutationAssessor_pred MutationAssessor_score MutationAssessor_score_rankscore MutationTaster_converted_rankscore MutationTaster_pred MutationTaster_score PROVEAN_converted_rankscore PROVEAN_pred PROVEAN_score Polyphen2_HDIV_pred Polyphen2_HDIV_rankscore Polyphen2_HDIV_score Polyphen2_HVAR_pred Polyphen2_HVAR_rankscore Polyphen2_HVAR_score QD SIFT_converted_rankscore SIFT_pred SIFT_score SiPhy_29way_logOdds SiPhy_29way_logOdds_rankscore VC VEST3_rankscore VEST3_score WGT avsnp147=rs28410799 integrated_confidence_value integrated_fitCons_score integrated_fitCons_score_rankscorephastCons100way_vertebrate phastCons100way_vertebrate_rankscore phastCons20way_mammalian phastCons20way_mammalian_rankscorephyloP100way_vertebrate phyloP100way_vertebrate_rankscore phyloP20way_mammalian phyloP20way_mammalian_rankscore CLINSIG CLNACC CLNDBN CLNDSDB CLNDSDBID GT AD DP GQ PL GT AD DP GQ PL GT AD DP GQPL
chr1 13115765 rs141111983 C T 2280.92 3 6 0.5 153 synonymous_SNV exonic HNRNPCL2:NM_001136561:exon2:c.G636A:p.E212E HNRNPCL2 0 HNRNPCL2:440563 SYNONYMOUS_CODING(LOW|SILENT|gaG/gaA|E212|293|HNRNPCL2|protein_coding|CODING|ENST00000621994|2|T),NEXT_PROT[coiled-coil_region](LOW||||293|HNRNPCL2|protein_coding|CODING||2|T),INTRON(MODIFIER||||478|WI2-3308P17.2|protein_coding|CODING|ENST00000622351|1|T),INTRON(MODIFIER||||478|PRAMEF26|protein_coding|CODING|ENST00000621259|4|T) gaG gaA E212 0.4772 0.4933 0.4993 0.497 0.5 0.4918 0.4967 0.4996 0.4846 0.4959 0.4998 0.4969 0.499 0.4999 0.4939 0.4969 0.4998 0.4939 0.4125 0.4867 0.1888 0.4981 0.4997 0.3321 0.4604 0 0 0 0 0 0 0 0-0.3847 -0.3847-PC-raw -0.3847-raw 0 0.02308 0 0.87915 0 0 18.131 0 0 0.43 0.23 182 000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 000 0 0 0 0 0 14.91 0 0 0 0 0 SNV 0 0 1 rs141111983=rs28410799 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B00H7EW=0/1 38,18 56 99 722,0,1577 B00H7EX=0/1 31,29 60 99 1166,0,1211 B00H7EY=0/1 26,11 3799 423,0,1098
chr1 13115766 rs150951326 T C 2325.92 3 6 0.5 155 nonsynonymous_SNV exonic HNRNPCL2:NM_001136561:exon2:c.A635G:p.E212G HNRNPCL2 0 HNRNPCL2:440563 NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gAg/gGg|E212G|293|HNRNPCL2|protein_coding|CODING|ENST00000621994|2|C),NEXT_PROT[coiled-coil_region](LOW||||293|HNRNPCL2|protein_coding|CODING||2|C),INTRON(MODIFIER||||478|WI2-3308P17.2|protein_coding|CODING|ENST00000622351|1|C),INTRON(MODIFIER||||478|PRAMEF26|protein_coding|CODING|ENST00000621259|4|C) gAg gGg E212G 0.4775 0.4934 0.4993 0.4972 0.5 0.4919 0.4967 0.4996 0.4851 0.496 0.4998 0.4969 0.4991 0.4999 0.494 0.4969 0.4998 0.494 0.4127 0.4867 0.1875 0.4981 0.4997 0.3323 0.4603 0 0 0.286 -0.453 0.058 0 0.019 0.324 -0.4897 -0.4897-PC-raw -0.4897-raw n 0.01015 0 0.80402 0 0 22.504 0 00.43 0.23 182 0 0.029 0 0 0 0 0 0 0 0 0 0 0 0 000 0 0 0 0 B 0.026 0 B 0.013 0 15.01 0 0 0 0 0 SNV 0.089 0.091 1 rs150951326=rs28410799 0 0.075 0.013 0.947 0.327 0.005 0.09 -0.854 0.044 -0.972 0.023 0 0 0 0 0 B00H7EW=0/1 38,19 57 99 764,0,1574 B00H7EX=0/1 31,30 61 991166,0,1211 B00H7EY=0/1 26,11 37 99 426,0,1056
chr1 13392320 rs767291041 C A 96.12 1 4 0.25 10 nonsynonymous_SNV exonic PRAMEF16:NM_001045480:exon3:c.C1243A:p.P415T,PRAMEF17:NM_001099851:exon3:c.C1243A:p.P415T PRAMEF16,PRAMEF17 0 PRAMEF17:391004 NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cct/Act|P415T|474|PRAMEF17|protein_coding|CODING|ENST00000376098|3|A) Cct Act P415T 0 0.001 0 0 0 0.002 0 0 0 0.0006 0.0016 0 0 0 0.0009 0.0008 0 0 0.0002 0 0 0 0 0.0007 0 0 0 0 0 0 0 0 0 -0.1177 -0.1177-PC-raw -0.1177-raw 0 0.02548 0 0.24739 0 0 0 0 0 0 0 0 0 0 000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 000 0 0 0 10.68 0 0 0 0 0 SNV 0 0 1 rs782058522=rs28410799 000 0 0 0 0 0 0 0 0 0 0 0 0 0 B00H7EW=0/1 4,5 999 123,0,100 B00H7EX=0/0 1,0 1 3 0,3,30 B00H7EY=./. 0,0 0 0 0,0,0
chr1 13392320 rs767291041 C A 70.13 1 6 0.167 37 nonsynonymous_SNV exonic PRAMEF17:NM_001099851:exon3:c.C1243A:p.P415T PRAMEF17 0 PRAMEF17:391004 NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Cct/Act|P415T|474|PRAMEF17|protein_coding|CODING|ENST00000376098|3|A) Cct Act P415T 0.0048 0.0006 0 0 0 0 0 00.0002 0.0005 0.0009 0 0 0 0.0007 0.0006 0 0 0.0002 0 0 0 0 0.0004 0 0022.7 3.18 0.442 0 0.453 0.988 -0.1208 -0.1208-PC-raw -0.1208-raw c 0.0207 0.12 0.11274 T 2.72 000 0 0 0 0 0.061 0 0.843 D 0 T 0.356 0.094 T 0.045 -1.097 H 3.83 0.957 0.09 N 1 0.954 D -7.63 D 0.899 1 D 0.875 0.998 11.69 0.784 D 0.001 5.599 0.165 SNV 0.353 0.293 1 rs767291041=rs28410799 0 0.487 0.133 0.019 0.194 0.031 0.148 1.655 0.367 0.621 0.289 0 0 0 0 0 B00H7EW=0/1 3,3 6 94 101,0,94 B00H7EX=0/0 13,0 13 39 0,39,442 B00H7EY=0/0 18,0 18 48 0,48,720
Answer 1
Does grep have an option or parameter for specifying a column?
No, grep has no field-separator option.
Use awk instead:
awk -F'\t' -v OFS='\t' '{match($19,/[A-Z]+/); $20=substr($19,RSTART,RLENGTH) FS $20}1' 1.table
match($19,/[A-Z]+/)
- finds the first run of uppercase letters in the 19th field
$20=substr($19,RSTART,RLENGTH) FS $20
- extracts the matched uppercase letters from the 19th field and inserts them as the new 20th field, in front of the old 20th field
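As a variation on the command above, here is a minimal sketch that keeps every uppercase letter of field 19 (which is what grep -o reports) rather than only the first run. It assumes header lines start with #. The function name add_upper_col and the header label upper are made up for illustration, and LC_ALL=C is set only so that [A-Z] means the ASCII range.

```shell
# add_upper_col: hypothetical helper name; reads a tab-separated table on stdin.
add_upper_col() {
  LC_ALL=C awk -F'\t' -v OFS='\t' '
    /^#/ { $19 = $19 OFS "upper"; print; next }   # header: add a (made-up) column name
    {
      up = $19
      gsub(/[^A-Z]/, "", up)    # drop everything that is not an ASCII uppercase letter
      $19 = $19 OFS up          # rebuilding the record pushes old fields 20+ one to the right
      print
    }'
}

# Synthetic row: 18 filler fields, "AbC" as field 19, "E212" as field 20.
awk 'BEGIN { for (i = 1; i <= 18; i++) printf "x\t"; print "AbC\tE212" }' |
  add_upper_col | cut -f19-21
```

For the synthetic row this should print AbC, AC and E212 separated by tabs. On the real file it would be called as add_upper_col < 1.table > newfile.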
Answer 2
To answer your literal question about how to do this with grep alone: even though grep is not designed for this, with GNU grep built with PCRE support you can do:
grep -Po '(?:^(?:[^\t]*\t){18}|\G)[^\t]*?\K[[:upper:]]'
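That is, match a sequence of 18 <not-TABs><tab> groups at the start of the line, or resume at the end of the previous match (\G), followed by as few non-TAB characters as possible (so we are still in the 19th field), followed by an uppercase character; \K then discards everything that was matched before that uppercase character.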
So for an input like:
X<tab>X<tab>....<tab>AbC<tab>X<tab>...
it would report:
A
C
just like your cut | grep approach.
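To check that claim on a small scale, the sketch below runs both pipelines on one synthetic line and keeps the results in shell variables. It assumes a grep that supports -P; the variable names and the sample line are made up for illustration.

```shell
# Synthetic one-line sample: 18 filler fields, "AbC" as field 19, "X" as field 20.
sample=$(awk 'BEGIN { for (i = 1; i <= 18; i++) printf "x\t"; print "AbC\tX" }')

# The original cut | grep pipeline.
via_cut=$(printf '%s\n' "$sample" | cut -f19 | grep -o '[[:upper:]]')

# The grep-only version (needs PCRE support, i.e. grep -P).
via_pcre=$(printf '%s\n' "$sample" |
  grep -Po '(?:^(?:[^\t]*\t){18}|\G)[^\t]*?\K[[:upper:]]')

printf 'cut | grep:\n%s\n' "$via_cut"
printf 'grep -P:\n%s\n' "$via_pcre"
```

Both variables should hold A and C on separate lines; the uppercase X in field 20 is not reported because neither ^ nor \G can match past the tab that ends field 19.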
If you are only interested in the first uppercase character of the 19th field, this simplifies to:
grep -Po '^(?:[^\t]*\t){18}[^\t]*?\K[[:upper:]]'
To insert it as the 20th column, you could do:
paste <(cut -f1-19 < file) <(grep ...above < file) <(cut -f20- < file) > newfile
Or to append it as the last column:
grep... < file | paste file - > newfile
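One thing to keep in mind with paste: grep -o prints nothing for a line that has no match (a header line, or a row whose 19th field has no uppercase letter), so the pasted rows only line up if grep emits exactly one line per input line. The sketch below walks through the 20th-column variant on synthetic data using the first-match pattern. The scratch file names are made up, temporary files stand in for process substitution, and grep -P support is assumed.

```shell
# Hypothetical scratch directory and file names, used only for this demo.
tmp=$(mktemp -d)

# Three synthetic rows: 18 filler fields, a codon as field 19, "last" as field 20.
awk 'BEGIN {
  n = split("gaA gGg Att", codon, " ")
  for (r = 1; r <= n; r++) {
    for (i = 1; i <= 18; i++) printf "x\t"
    print codon[r] "\tlast"
  }
}' > "$tmp/in.tsv"

# First uppercase letter of field 19, one output line per matching input line.
grep -Po '^(?:[^\t]*\t){18}[^\t]*?\K[[:upper:]]' "$tmp/in.tsv" > "$tmp/upper.txt"

# Guard: paste pairs lines by position, so the row counts must agree.
if [ $(wc -l < "$tmp/in.tsv") -ne $(wc -l < "$tmp/upper.txt") ]; then
  echo "row counts differ; paste would misalign the new column" >&2
fi

cut -f1-19 "$tmp/in.tsv" > "$tmp/head.tsv"
cut -f20-  "$tmp/in.tsv" > "$tmp/tail.tsv"
paste "$tmp/head.tsv" "$tmp/upper.txt" "$tmp/tail.tsv" > "$tmp/out.tsv"

result=$(cut -f19-21 "$tmp/out.tsv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Each printed row should show the codon, its first uppercase letter, and the old 20th field, for example gaA, A, last.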
Answer 3
With sed you can do
sed '/^#/!s/\([^ ]* *\)\{18\}[a-z]*\([A-Z]\).*/& \2/'
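That is, for all lines that do not start with # (the /^#/! selector), skip 18 groups of non-spaces followed by spaces, capture the uppercase letter with \(\) so it can be referenced later, "replace" the whole line with itself, and append a space and the captured uppercase letter.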
If you prefer extended regular expressions, you can also use
sed -E '/^#/!s/([^ ]* *){18}[a-z]*([A-Z]).*/& \2/'
If the columns are separated by tabs rather than spaces, you can use (with a sed that understands \t, such as GNU sed)
sed -E '/^#/!s/([^\t]*\t){18}[a-z]*([A-Z]).*/&\t\2/'
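Here is a small sketch of the tab-separated version on synthetic data. It differs from the command above in three ways, all of which are assumptions of this sketch rather than part of the original answer: a ^ anchor is added so the 18 groups are always counted from the start of the line, a literal tab from printf is used so that the example does not depend on sed understanding \t, and LC_ALL=C is set so that [a-z] and [A-Z] mean the ASCII ranges.

```shell
tab=$(printf '\t')   # literal tab character for use inside the sed script

# Synthetic data: a '#' header plus two rows with 20 tab-separated fields.
sample=$(awk 'BEGIN {
  printf "#h"; for (i = 2; i <= 20; i++) printf "\th"; print ""
  for (r = 1; r <= 2; r++) {
    for (i = 1; i <= 18; i++) printf "x\t"
    print (r == 1 ? "gaA" : "cTc") "\tlast"
  }
}')

# Skip lines starting with '#'; elsewhere append the first uppercase letter
# of field 19 as a new last column.
result=$(printf '%s\n' "$sample" |
  LC_ALL=C sed -E "/^#/!s/^([^$tab]*$tab){18}[a-z]*([A-Z]).*/&$tab\\2/")

printf '%s\n' "$result" | cut -f19-
```

The header line should come through unchanged, while the two data rows should gain a final column holding A and T respectively.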