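I have a text file that is about 25 GB in size. I want to delete duplicate lines based on the value in the second column. If duplicates are found in the file, I want to delete all lines with that value in the column and keep only the one line with the highest value in the fourth column. The file is in CSV format and is already sorted.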
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482422,45,0.18,-1
2,10482422,45,0.4,-1
2,10482423,45,0.15,-1
2,10482423,45,0.43,-1
2,10482424,45,0.18,-1
2,10482424,45,0.49,-1
2,10482425,45,0.21,-1
2,10482425,45,0.52,-1
2,10482426,45,0.27,-1
2,10482426,45,0.64,-1
2,10482427,45,0.09,-1
2,10482427,45,0.34,-1
2,10482427,45,0.73,-1
In the above example, I just want one maximum surge value for each Cell_Id, obtained by deleting the other duplicate lines.
The expected output is:
2,10482422,45,0.4,-1
2,10482423,45,0.43,-1
2,10482424,45,0.49,-1
2,10482425,45,0.52,-1
2,10482426,45,0.64,-1
2,10482427,45,0.73,-1
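Answer 1
Since the input appears to be grouped/sorted by the second column already, this should be quite simple and does not require holding and sorting the entire data set in memory, only two records at a time.
I first thought of an Awk solution but found it clumsy to deal with arrays and non-blank field delimiters. Then I decided on a short Python program: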
#!/usr/bin/python3

import sys

DELIMITER = ','


def remove_duplicates(records):
    # Keep only the best record of each run of equal Cell_id values.
    prev = None
    for r in records:
        r = (int(r[0]), int(r[1]), int(r[2]), float(r[3]), int(r[4]))
        if prev is None:
            prev = r
        elif r[1] != prev[1]:
            # Key changed: emit the best record of the finished group.
            yield prev
            prev = r
        elif r[3] > prev[3]:
            # Same key, larger Storm_Surge: remember this record instead.
            prev = r
    if prev is not None:
        yield prev


def main():
    for r in remove_duplicates(
        l.rstrip('\n').rsplit(DELIMITER) for l in sys.stdin
    ):
        print(*r, sep=',')


if __name__ == '__main__':
    main()
On my system it has a throughput of about 250,000 records or 5 MB per CPU second.
Usage
python3 remove-duplicates.py < input.txt > output.txt
The program cannot deal with column headers, so you need to strip them off:
tail -n +2 < input.txt | python3 remove-duplicates.py > output.txt
If you want to add them back to the result:
{ read -r header && printf '%s\n' "$header" && python3 remove-duplicates.py; } < input.txt > output.txt
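The header handling and the per-group maximum can also be folded into one script. The following is only a minimal sketch of that idea, not the program above: it assumes the input is grouped by the second column, has exactly one header line and no blank lines, and it writes the winning line verbatim instead of re-formatting the numbers. The name dedupe_stream and its parameters are made up for illustration.

```python
import io
import itertools


def dedupe_stream(infile, outfile, key_col=1, value_col=3, delimiter=","):
    """Copy the header, then keep one line per run of equal keys: the line
    with the largest numeric value in value_col. Lines are written verbatim."""
    def fields(line):
        return line.rstrip("\n").split(delimiter)

    outfile.write(infile.readline())  # pass the header through unchanged
    # groupby only merges *consecutive* equal keys, so the input must be grouped.
    for _, group in itertools.groupby(infile, key=lambda line: fields(line)[key_col]):
        outfile.write(max(group, key=lambda line: float(fields(line)[value_col])))


# Small in-memory demonstration using rows from the question.
demo_in = io.StringIO(
    "storm_id,Cell_id,Windspeed,Storm_Surge,-1\n"
    "2,10482422,45,0.06,-1\n"
    "2,10482422,45,0.18,-1\n"
    "2,10482422,45,0.4,-1\n"
    "2,10482423,45,0.15,-1\n"
    "2,10482423,45,0.43,-1\n"
)
demo_out = io.StringIO()
dedupe_stream(demo_in, demo_out)
print(demo_out.getvalue(), end="")
# storm_id,Cell_id,Windspeed,Storm_Surge,-1
# 2,10482422,45,0.4,-1
# 2,10482423,45,0.43,-1
```

For a real file, the same function could be called as dedupe_stream(sys.stdin, sys.stdout) after importing sys. Only one group of lines is held in memory at a time.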
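Answer 2
If you had sorted them in decreasing order of the 4th field, you could simply take the first occurrence of each 2nd field value using an associative array or hash, e.g.
awk -F, '!seen[$2]++' file
or
perl -F, -ne 'print $_ unless $seen{$F[1]}++'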
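For readers less familiar with the seen-hash idiom, here is a rough Python equivalent. It is a sketch under the same assumption as the one-liners (within each key the rows arrive in decreasing order of the 4th field, so the first row seen is the one to keep); first_per_key is a hypothetical name.

```python
def first_per_key(lines, key_col=1, delimiter=","):
    """Yield only the first line seen for each distinct value of key_col."""
    seen = set()  # every distinct key is remembered, so memory grows with the key count
    for line in lines:
        key = line.rstrip("\n").split(delimiter)[key_col]
        if key not in seen:
            seen.add(key)
            yield line


# Rows sorted by decreasing Storm_Surge within each Cell_id.
rows = [
    "2,10482422,45,0.4,-1\n",
    "2,10482422,45,0.18,-1\n",
    "2,10482422,45,0.06,-1\n",
    "2,10482423,45,0.43,-1\n",
    "2,10482423,45,0.15,-1\n",
]
print("".join(first_per_key(rows)), end="")
# 2,10482422,45,0.4,-1
# 2,10482423,45,0.43,-1
```

If the header line is passed through as well, it is kept automatically as long as its second field does not collide with a real key, which mirrors what the awk one-liner does.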
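With the values in increasing order, it is a bit trickier to do this in an efficient single pass. You can do so (with a little setup) by printing the previous line each time the key value changes: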
awk -F, '
  NR==1 {print; next}                # print the header line
  NR==2 {key=$2; lastval=$0; next}   # initialize the comparison with the first data line
  $2 != key {
    print lastval; key = $2          # print the last (largest) value of the previous key group
  }
  {lastval = $0}                     # save the current line
  END {if (NR > 1) print lastval}    # clean up: print the final group, if there was any data
' file
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.4,-1
2,10482423,45,0.43,-1
2,10482424,45,0.49,-1
2,10482425,45,0.52,-1
2,10482426,45,0.64,-1
2,10482427,45,0.73,-1
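Answer 3
If you don't have too many unique Cell_ids, you can keep track of the ones already seen in a Perl associative array. If you do have too many (and my Perl script runs out of memory), write a C program that keeps track of the unique ones in bit fields. Here's the Perl.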
#!/usr/bin/perl -w
use strict;
my %seen = ();          # key=Cell_ID, value=1
my @cols=();            # for splitting input

while( <> ) {           # read the files named on the command line, or STDIN
    @cols = split ',',$_;
    next if ( defined $seen{$cols[1]}); # skip if we already saw this Cell_Id
    $seen{$cols[1]} = 1;
    print;
}
Here's my test:
walt@bat:~(0)$ cat u.dat
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482422,45,0.18,-1
2,10482422,45,0.4,-1
2,10482423,45,0.15,-1
2,10482423,45,0.43,-1
2,10482424,45,0.18,-1
2,10482424,45,0.49,-1
2,10482425,45,0.21,-1
2,10482425,45,0.52,-1
2,10482426,45,0.27,-1
2,10482426,45,0.64,-1
2,10482427,45,0.09,-1
2,10482427,45,0.34,-1
2,10482427,45,0.73,-1
walt@bat:~(0)$ perl ./unique.pl u.dat
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482423,45,0.15,-1
2,10482424,45,0.18,-1
2,10482425,45,0.21,-1
2,10482426,45,0.27,-1
2,10482427,45,0.09,-1
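The output above keeps the first line seen for each Cell_id. If the largest Storm_Surge per Cell_id is wanted, as the question asks, the same hash-based idea can store the best line seen so far instead of a seen flag. A minimal sketch in Python, assuming the header has already been removed and that the number of unique keys fits in memory; max_per_key is a hypothetical name, and unlike the streaming approaches this does not depend on the input order.

```python
def max_per_key(lines, key_col=1, value_col=3, delimiter=","):
    """Return one line per distinct key: the one with the largest value_col.
    Output follows the order in which keys first appear (dicts keep insertion order)."""
    best = {}  # key -> (value, line)
    for line in lines:
        cols = line.rstrip("\n").split(delimiter)
        key, value = cols[key_col], float(cols[value_col])
        if key not in best or value > best[key][0]:
            best[key] = (value, line)
    return [line for _, line in best.values()]


# Deliberately unsorted rows to show that order does not matter here.
rows = [
    "2,10482423,45,0.15,-1\n",
    "2,10482422,45,0.4,-1\n",
    "2,10482422,45,0.06,-1\n",
    "2,10482423,45,0.43,-1\n",
]
print("".join(max_per_key(rows)), end="")
# 2,10482423,45,0.43,-1
# 2,10482422,45,0.4,-1
```

The trade-off is memory: one line is held per unique Cell_id, so for a 25 GB file this only works when the number of distinct keys is modest.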