
I am working with a CSV produced by concatenating several CSV files, and I am looking for a way to remove the duplicated header lines (the header line repeated from each concatenated CSV is identical). Here is my CSV, which contains duplicates of its first line:
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
1000, lig40, 1, 0.805136, -5.5200, 79
1000, lig868, 1, 0.933209, -5.6100, 42
1000, lig278, 1, 0.933689, -5.7600, 40
1000, lig619, 3, 0.946354, -7.6100, 20
1000, lig211, 1, 0.960048, -5.2800, 39
1000, lig40, 2, 0.971051, -4.9900, 40
1000, lig868, 3, 0.986384, -5.5000, 29
1000, lig12, 3, 0.988506, -6.7100, 16
1000, lig800, 16, 0.995574, -4.5300, 40
1000, lig800, 1, 0.999935, -5.7900, 22
1000, lig619, 1, 1.00876, -7.9000, 3
1000, lig619, 2, 1.02254, -7.6400, 1
1000, lig12, 1, 1.02723, -6.8600, 5
1000, lig12, 2, 1.03273, -6.8100, 4
1000, lig211, 2, 1.03722, -5.2000, 19
1000, lig211, 3, 1.03738, -5.0400, 21
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V1, lig40, 1, 0.513472, -6.4600, 150
10V1, lig211, 2, 0.695981, -6.8200, 91
10V1, lig278, 1, 0.764432, -7.0900, 70
10V1, lig868, 1, 0.787698, -7.3100, 62
10V1, lig211, 1, 0.83416, -6.8800, 54
10V1, lig868, 3, 0.888408, -6.4700, 44
10V1, lig278, 2, 0.915932, -6.6600, 35
10V1, lig12, 1, 0.922741, -9.3600, 19
10V1, lig12, 8, 0.934144, -7.4600, 24
10V1, lig40, 2, 0.949955, -5.9000, 34
10V1, lig800, 5, 0.964194, -5.9200, 30
10V1, lig868, 2, 0.966243, -6.9100, 20
10V1, lig12, 2, 0.972575, -8.3000, 10
10V1, lig619, 6, 0.979168, -8.1600, 9
10V1, lig619, 4, 0.986202, -8.7800, 5
10V1, lig800, 2, 0.989599, -6.2400, 20
10V1, lig619, 1, 0.989725, -9.2900, 3
10V1, lig12, 7, 0.991535, -7.5800, 9
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
10V2, lig40, 1, 0.525767, -6.4600, 146
10V2, lig211, 2, 0.744702, -6.8200, 78
10V2, lig278, 1, 0.749015, -7.0900, 74
10V2, lig868, 1, 0.772025, -7.3100, 66
10V2, lig211, 1, 0.799829, -6.8700, 63
10V2, lig12, 1, 0.899345, -9.1600, 25
10V2, lig12, 4, 0.899606, -7.5500, 32
10V2, lig868, 3, 0.903364, -6.4800, 40
10V2, lig278, 3, 0.913145, -6.6300, 36
10V2, lig800, 5, 0.94576, -5.9100, 35
To post-process this CSV I need to remove the duplicates of the header line
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
keeping the header only at the beginning of the merged CSV (i.e. on its first line!). I tried:
awk '{first=$1;gsub("ID(Prot)","");print first,$0}' mycsv.csv > csv_without_repeats.csv
However, it does not recognize the header line, which means the pattern is not defined correctly.
Assuming my AWK code should then be piped into sort to order the rows after the duplicate filtering, how can I correct it?
awk '{first=$1;gsub(/ID(Prot)?(\([-azA-Z]+\))?/,"");print first,$0}' | LC_ALL=C sort -k4,4g input.csv > sorted_and_without_repeats.csv
Answer 1
Here is an awk script that skips any line starting with ID(Prot), unless it is the first line:
awk 'NR==1 || !/^ID\(Prot\)/' file > newFile
Here is the same idea in perl:
perl -ne 'print if $.==1 || !/^ID\(Prot\)/' file > newFile
Or, to edit the original file in place:
perl -i -ne 'print if $.==1 || !/^ID\(Prot\)/' file
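The question also asks for the rows to be sorted once the duplicate headers are gone. A minimal sketch of that pipeline, assuming GNU sort (for -g) and that dG(rescored) is the 4th comma-separated column; the brace group keeps the header out of the sort:
awk 'NR==1 || !/^ID\(Prot\)/' file |
  { IFS= read -r header; printf '%s\n' "$header"; LC_ALL=C sort -t, -k4,4g; } > sorted_and_without_repeats.csv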
Answer 2
POSIX-compliant sed (tested with GNU sed and busybox sed):
sed '1!{/^ID/d;}' data
On every line except the first, the line is deleted if it starts with ID. Some sed implementations have a -i option to edit the file in place.
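With GNU sed, for example, in-place editing would look like this (a sketch; it modifies data directly, so keep a backup):
sed -i '1!{/^ID/d;}' data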
With awk:
awk 'NR == 1 {h=$0; print} $0 == h {next}1' data
We save the header from the first line and print it; then, for every line we process, if it is equal to the header we skip it, otherwise we print it.
Or the same in perl:
perl -lne '$h = $_ if $. == 1; print if($_ ne $h || $. == 1)' data
Add the -i option to let perl edit the file in place.
Answer 3
Here is a simple way to handle the problem with the awk utility. Note, however, that repeated headers with even slightly different spacing will still end up in the output (a whitespace-insensitive variant is sketched after the block below).
awk '
NR>1&&$0==hdr{next}
NR==1{hdr=$0}1
' file
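A sketch of the whitespace-insensitive variant mentioned above: it compares each line against the header with all whitespace stripped, so headers that differ only in spacing are still dropped (the file name file is as in the answer):
awk '
NR==1 { hdr = $0; gsub(/[[:space:]]+/, "", hdr); print; next }  # remember a normalised copy of the header and print it
{ cur = $0; gsub(/[[:space:]]+/, "", cur) }                     # normalised copy of the current line
cur != hdr                                                      # print only lines that are not a repeated header
' file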
The same approach, but with the stream editor utility sed:
sed -En '
1h;1!G;/^(.*)\n\1$/!P
' file
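Here 1h saves the first line (the header) in the hold space; 1!G appends the held header to the pattern space of every later line; and P prints the current line only when the back-reference test /^(.*)\n\1$/ fails, i.e. when the line is not identical to the saved header.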
Answer 4
$ awk 'NR==1{h=$0; print} $0!=h' file
ID(Prot), ID(lig), ID(cluster), dG(rescored), dG(before), POP(before)
1000, lig40, 1, 0.805136, -5.5200, 79
1000, lig868, 1, 0.933209, -5.6100, 42
1000, lig278, 1, 0.933689, -5.7600, 40
1000, lig619, 3, 0.946354, -7.6100, 20
1000, lig211, 1, 0.960048, -5.2800, 39
1000, lig40, 2, 0.971051, -4.9900, 40
1000, lig868, 3, 0.986384, -5.5000, 29
1000, lig12, 3, 0.988506, -6.7100, 16
1000, lig800, 16, 0.995574, -4.5300, 40
1000, lig800, 1, 0.999935, -5.7900, 22
1000, lig619, 1, 1.00876, -7.9000, 3
1000, lig619, 2, 1.02254, -7.6400, 1
1000, lig12, 1, 1.02723, -6.8600, 5
1000, lig12, 2, 1.03273, -6.8100, 4
1000, lig211, 2, 1.03722, -5.2000, 19
1000, lig211, 3, 1.03738, -5.0400, 21
10V1, lig40, 1, 0.513472, -6.4600, 150
10V1, lig211, 2, 0.695981, -6.8200, 91
10V1, lig278, 1, 0.764432, -7.0900, 70
10V1, lig868, 1, 0.787698, -7.3100, 62
10V1, lig211, 1, 0.83416, -6.8800, 54
10V1, lig868, 3, 0.888408, -6.4700, 44
10V1, lig278, 2, 0.915932, -6.6600, 35
10V1, lig12, 1, 0.922741, -9.3600, 19
10V1, lig12, 8, 0.934144, -7.4600, 24
10V1, lig40, 2, 0.949955, -5.9000, 34
10V1, lig800, 5, 0.964194, -5.9200, 30
10V1, lig868, 2, 0.966243, -6.9100, 20
10V1, lig12, 2, 0.972575, -8.3000, 10
10V1, lig619, 6, 0.979168, -8.1600, 9
10V1, lig619, 4, 0.986202, -8.7800, 5
10V1, lig800, 2, 0.989599, -6.2400, 20
10V1, lig619, 1, 0.989725, -9.2900, 3
10V1, lig12, 7, 0.991535, -7.5800, 9
10V2, lig40, 1, 0.525767, -6.4600, 146
10V2, lig211, 2, 0.744702, -6.8200, 78
10V2, lig278, 1, 0.749015, -7.0900, 74
10V2, lig868, 1, 0.772025, -7.3100, 66
10V2, lig211, 1, 0.799829, -6.8700, 63
10V2, lig12, 1, 0.899345, -9.1600, 25
10V2, lig12, 4, 0.899606, -7.5500, 32
10V2, lig868, 3, 0.903364, -6.4800, 40
10V2, lig278, 3, 0.913145, -6.6300, 36
10V2, lig800, 5, 0.94576, -5.9100, 35
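This works because NR==1{h=$0; print} saves the header and prints it once, and the bare condition $0!=h then prints every later line that differs from the header (awk prints the current line whenever a pattern without an action is true).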