Remove duplicate rows based on a column value

Answer 1

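Since the input appears to be already grouped/sorted by the second column, this should be quite simple and doesn't require holding and sorting the entire data set in memory, only two records at a time.¹

I first thought of an Awk solution, but found it clumsy for dealing with arrays and non-blank field delimiters. So I decided to write a short Python program instead: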

#!/usr/bin/python3
import sys
DELIMITER = ','

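def remove_duplicates(records):
    """For each run of equal second-column values, yield the record with the largest fourth column."""
    prev = None  # best record seen so far in the current group
    for r in records:
        r = (int(r[0]), int(r[1]), int(r[2]), float(r[3]), int(r[4]))
        if prev is None:
            prev = r
        elif r[1] != prev[1]:
            # key changed: emit the winner of the previous group
            yield prev
            prev = r
        elif r[3] > prev[3]:
            prev = r
    if prev is not None:
        yield prev  # emit the winner of the last group

def main():
    for r in remove_duplicates(
        l.rstrip('\n').split(DELIMITER) for l in sys.stdin
    ):
        print(*r, sep=DELIMITER)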

if __name__ == '__main__':
    main()

On my system it has a throughput of about 250,000 records, or 5 MB, per CPU second.

Usage

python3 remove-duplicates.py < input.txt > output.txt

The program cannot handle column headers, so you need to strip them off:

tail -n +2 < input.txt | python3 remove-duplicates.py > output.txt

If you want to add them back to the result:

{ read -r header && printf '%s\n' "$header" && python3 remove-duplicates.py; } < input.txt > output.txt
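If you would rather keep the header handling inside Python, here is a minimal sketch of the same single-pass idea. The function name `dedupe_with_header` and its parameters are my own invention for illustration, and it assumes the first line is always a header and that the fourth column parses as a float. Unlike the program above, it passes the winning lines through untouched instead of re-printing converted values.

```python
def dedupe_with_header(lines, key_col=1, value_col=3, delimiter=','):
    """Yield the header, then for each run of equal keys the line with the largest value."""
    lines = iter(lines)
    header = next(lines, None)
    if header is None:          # empty input: nothing to emit
        return
    yield header
    best_line = best_key = best_value = None
    for line in lines:
        fields = line.rstrip('\n').split(delimiter)
        key, value = fields[key_col], float(fields[value_col])
        if best_line is not None and key != best_key:
            yield best_line     # key changed: emit the winner of the previous run
            best_line = None
        if best_line is None or value > best_value:
            best_line, best_key, best_value = line, key, value
    if best_line is not None:
        yield best_line         # emit the winner of the final run


# Hypothetical sample data in the same layout as the input discussed here.
sample = [
    "storm_id,Cell_id,Windspeed,Storm_Surge,-1\n",
    "2,10482422,45,0.06,-1\n",
    "2,10482422,45,0.4,-1\n",
    "2,10482423,45,0.43,-1\n",
    "2,10482423,45,0.15,-1\n",
]
print(''.join(dedupe_with_header(sample)), end='')
```

To use it as a filter, pass `sys.stdin` instead of `sample` and write the result with `sys.stdout.writelines(...)`. Like the original program, it only holds the current line and the best line of the current group in memory.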

¹ This is in contrast to waltinator's and steeldriver's approaches, and matters for data sets that don't fit into main memory.


Answer 2

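If you sort them in decreasing order of the fourth field, you can simply use an associative array or hash to take the first occurrence of each second-field value, e.g. awk -F, '!seen[$2]++' file or perl -F, -ane 'print $_ unless $seen{$F[1]}++'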
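For comparison, the same "first occurrence wins" idea can be sketched in Python with a set. The name `first_per_key` is hypothetical, and like the one-liners it assumes the rows have already been sorted so that the row you want comes first for each key. The header passes through on its own, because its second field ("Cell_id") is just another key that has not been seen yet.

```python
def first_per_key(lines, key_col=1, delimiter=','):
    """Yield only the first line seen for each distinct value of the key column."""
    seen = set()
    for line in lines:
        key = line.rstrip('\n').split(delimiter)[key_col]
        if key not in seen:
            seen.add(key)
            yield line


# Hypothetical sample, already sorted by decreasing fourth field within each key.
sample = [
    "storm_id,Cell_id,Windspeed,Storm_Surge,-1\n",
    "2,10482422,45,0.4,-1\n",
    "2,10482422,45,0.18,-1\n",
    "2,10482423,45,0.43,-1\n",
    "2,10482423,45,0.15,-1\n",
]
print(''.join(first_per_key(sample)), end='')
```

As with the awk and perl versions, memory use grows with the number of distinct keys, since every key seen so far has to be remembered.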

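With the values in increasing order, doing this in an efficient single pass is a little trickier. You can do it (with a little setup) by printing the previous line each time the key value changes: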

awk -F, '
  NR==1 {print; next}                   # print the header line
  NR==2 {key=$2; lastval=$0; next}      # initialize the comparison with the first record
  $2 != key {
    print lastval; key = $2             # print the last (largest) value of the previous key group
  }
  {lastval = $0}                        # save the current line
  END {if (NR > 1) print lastval}       # flush the final group, if there was any data
' file
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.4,-1
2,10482423,45,0.43,-1
2,10482424,45,0.49,-1
2,10482425,45,0.52,-1
2,10482426,45,0.64,-1
2,10482427,45,0.73,-1


Answer 3

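If you don't have too many unique Cell_ids, you can track the ones you have already seen in a Perl associative array. If you do have too many (and my Perl script runs out of memory), write a C program that keeps track of the seen ones in a bit field. Here's the Perl.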

#!/usr/bin/perl -w
use strict;
my %seen = ();          # key=Cell_ID, value=1
my @cols=();            # for splitting input

while( <> ) {           # read STDIN
  @cols = split ',',$_;
  next if ( defined $seen{$cols[1]}); # skip if we already saw this Cell_Id
  $seen{$cols[1]} = 1;
  print;
}

Here's my test:

walt@bat:~(0)$ cat u.dat
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482422,45,0.18,-1
2,10482422,45,0.4,-1
2,10482423,45,0.15,-1
2,10482423,45,0.43,-1
2,10482424,45,0.18,-1
2,10482424,45,0.49,-1
2,10482425,45,0.21,-1
2,10482425,45,0.52,-1
2,10482426,45,0.27,-1
2,10482426,45,0.64,-1
2,10482427,45,0.09,-1
2,10482427,45,0.34,-1
2,10482427,45,0.73,-1
walt@bat:~(0)$ perl ./unique.pl u.dat
storm_id,Cell_id,Windspeed,Storm_Surge,-1
2,10482422,45,0.06,-1
2,10482423,45,0.15,-1
2,10482424,45,0.18,-1
2,10482425,45,0.21,-1
2,10482426,45,0.27,-1
2,10482427,45,0.09,-1
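The bit-field fallback mentioned at the top could look roughly like the sketch below. It is written in Python rather than C to stay short, and it rests on assumptions that are mine, not something the data guarantees: Cell_ids are non-negative integers, and you know their smallest and largest value in advance (`min_id`, `max_id`). The function name `first_per_cell_id` is hypothetical. Each id costs one bit, so a range of 100 million ids needs about 12.5 MB.

```python
def first_per_cell_id(lines, min_id, max_id, key_col=1, delimiter=','):
    """Yield the first line for each Cell_id, remembering seen ids in a bitmap."""
    bits = bytearray((max_id - min_id) // 8 + 1)   # one bit per possible id
    for line in lines:
        field = line.rstrip('\n').split(delimiter)[key_col]
        if not field.isdigit():
            yield line                 # assumption: non-numeric key means a header line
            continue
        cell_id = int(field)
        if not min_id <= cell_id <= max_id:
            raise ValueError(f"Cell_id {cell_id} outside assumed range")
        offset = cell_id - min_id
        byte, mask = offset >> 3, 1 << (offset & 7)
        if not bits[byte] & mask:      # bit clear: first time we see this id
            bits[byte] |= mask
            yield line


# A few rows from the test file above.
sample = [
    "storm_id,Cell_id,Windspeed,Storm_Surge,-1\n",
    "2,10482422,45,0.06,-1\n",
    "2,10482422,45,0.18,-1\n",
    "2,10482423,45,0.15,-1\n",
    "2,10482423,45,0.43,-1\n",
]
print(''.join(first_per_cell_id(sample, 10482422, 10482427)), end='')
```

Like the Perl script, this keeps the first row it meets for each Cell_id, so the input order decides which row survives.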

