Handling large records/paragraphs

Question 1

If you just want to skip the problematic records:

awk 'BEGIN { ORS=RS="\n\n" } length <= 100*1000' file

This prints each record that is at most 100,000 characters long.
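
To try the filter without a huge file, here is a minimal sketch with the limit lowered to 10 characters. The sample data and the `max` variable are assumptions made for illustration, and it uses POSIX paragraph mode (`RS=""`) so that it behaves the same across awk implementations:

```shell
# Three hypothetical records separated by blank lines.
# Only the middle one is longer than the (illustrative) limit of 10.
result=$(printf 'short\n\nthis record is far too long\n\nalso ok\n' |
  awk -v max=10 'BEGIN { RS=""; ORS="\n\n" } length($0) <= max')

# Prints the two short records, separated by a blank line.
printf '%s\n' "$result"
```
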

To instead remove the fields that start with a particular positive integer when a record is too large:

awk -v number=149 'BEGIN { ORS=RS="\n\n"; OFS=FS="\n" }
    length <= 100*1000 { print; next }
    {
        # This is a too long record.
        # Re-create it without any fields whose first tab-delimited
        # sub-field is the number in the variable number.

        # Split the record into an array of fields, a.
        nf = split($0,a)

        # Empty the record.
        $0 = ""

        # Go through the fields and add back the ones that we
        # want to the output record.
        for (i = 1; i <= nf; ++i) {
            split(a[i],b,"\t")
            if (b[1] != number) $(NF+1) = a[i]
        }

        # Print the output record.
        print
    }' file

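This prints short records unchanged, just as before. Longer records are stripped of every field whose first tab-delimited sub-field is the number in `number` (given on the command line as 149 here).

For a large record, the record is re-created without the fields we don't want. The inner loop rebuilds the output record by splitting each field on tabs and appending only the fields whose first tab-delimited sub-field is not `number`: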

for (i = 1; i <= nf; ++i) {
    split(a[i],b,"\t")
    if (b[1] != number) $(NF+1) = a[i]
}
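
As a worked example of that loop, the sketch below runs the same logic on a tiny record. The sample lines and the `max` variable (set to 20 here in place of 100*1000) are assumptions for illustration, and `RS=""` is used for portability:

```shell
# One hypothetical record of four lines; two of them start with "149<TAB>".
result=$(printf '1\tkeep\n149\tdrop this\n2\tkeep\n149\tdrop too\n' |
  awk -v number=149 -v max=20 'BEGIN { RS=""; ORS="\n\n"; OFS=FS="\n" }
    length($0) <= max { print; next }
    {
        nf = split($0, a)          # fields (lines) of the long record
        $0 = ""                    # start from an empty record
        for (i = 1; i <= nf; ++i) {
            split(a[i], b, "\t")   # b[1] is the first tab-delimited sub-field
            if (b[1] != number) $(NF+1) = a[i]
        }
        print
    }')

# Prints only the lines whose first sub-field is not 149.
printf '%s\n' "$result"
```
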

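Since the POSIX specification for awk leaves it unspecified what happens when `RS` is set to a multi-character value (most implementations treat it as a regular expression), you may want to use `RS=""; ORS="\n\n"` in place of `ORS=RS="\n\n"` with a strictly conforming awk implementation. If you do this, note that multiple blank lines in the data will no longer delimit empty records.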
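
The effect of paragraph mode on runs of blank lines can be checked with a small sketch. The input is a made-up example with three blank lines between two records:

```shell
# Two non-empty records separated by several blank lines.
# In paragraph mode (RS=""), the whole run of blank lines counts as
# a single separator, so no empty records are produced.
count=$(printf 'a\n\n\n\nb\n' | awk 'BEGIN { RS="" } END { print NR }')
printf 'records seen in paragraph mode: %s\n' "$count"
```

With `RS="\n\n"` the count depends on how the awk implementation interprets a multi-character `RS`, which is why it is not shown here.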

Question 2

Another awk approach:

awk -v lim=99999 'BEGIN{RS=""; ORS="\n\n"}\
 {while (length()>=lim) {if (!sub(/\n149\t[^\n]*/,"")) break;}} length()<lim' file

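If the record length exceeds the limit given in the variable `lim`, this successively replaces lines starting with `149` by nothing, until either the limit is met or no further reduction is possible (indicated by `sub()` reporting 0 replacements). It then prints only the records whose final length is below the limit.

Downside: it removes the lines starting with `149` in order from the first one, so if they are individual elements of a contiguous text, that text will become somewhat incomprehensible.

Note: specifying `RS=""` rather than an explicit `RS="\n\n"` is the portable way to use awk in "paragraph mode", because the behaviour of a multi-character `RS` is not defined by the POSIX specification. However, if the file can contain empty records, awk will ignore them, so they will not appear in the output either. If that is not what you want, you may have to use the explicit `RS="\n\n"` notation; most awk implementations treat it as a regular expression and do what one would naively expect.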
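
To see the successive removal in action, here is a minimal sketch on a small record with `lim` lowered to 30. The sample data is hypothetical; the awk program is the one shown above:

```shell
# One hypothetical record: 43 characters long, with two "149<TAB>" lines.
result=$(printf '1\tkeep\n149\taaaaaaaaaa\n149\tbbbbbbbbbb\n2\tkeep\n' |
  awk -v lim=30 'BEGIN { RS=""; ORS="\n\n" }
    { while (length() >= lim) { if (!sub(/\n149\t[^\n]*/, "")) break } }
    length() < lim')

# Removing the first 149 line brings the record down to 28 characters,
# so the loop stops and the second 149 line is kept.
printf '%s\n' "$result"
```

This also illustrates the downside mentioned above: which `149` lines survive depends only on their position in the record.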

Question 3

Whenever you are using `\n\n` as a record separator, think of Perl and its paragraph mode (from `man perlrun`):

-0[octal/hexadecimal]
        specifies the input record separator ($/) as an octal or hexadecimal number.  
   [...]
        The special value 00 will cause Perl to slurp files in paragraph mode.

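Using that, you can do things like the following.

Remove all records that are longer than 100,000 characters (note that this can differ from the number of bytes, depending on the file's encoding):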
```
 perl -00 -ne 'print unless length()>100000' file
```
Trim any record longer than 100000 characters by removing everything after the first 100000 characters:
```
 perl -00 -lne 'print substr($_,0,100000)' file
```

Remove lines that start with 149:

 perl -00 -pe 's/(^|\n)149\s+[^\n]+//g;' file

Remove lines that start with 149, but only if the record is longer than 100000 characters:
```
 perl -00 -pe 's/(^|\n)149\s+[^\n]+//g if length()>100000; ' file
```
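If a record is longer than 100000 characters, remove lines starting with 149 until the record either no longer exceeds 100000 characters or has no more lines starting with 149: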
```
 perl -00 -pe 'while(length()>100000 && /(^|\n)149\s/){s/(^|\n)149\s+[^\n]+//}' file
```

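If a record is longer than 100000 characters, remove lines starting with 149 until the record either no longer exceeds 100000 characters or has no more lines starting with 149; if it is still longer than 100000 characters, print only the first 100000: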

 perl -00 -lne 'while(length()>100000 && /(^|\n)149\s/){
                     s/(^|\n)149\s+[^\n]+//
                }
                print substr($_,0,100000)' file

Finally, as above, but remove entire lines rather than individual characters until the record fits, so that no line is truncated:

 perl -00 -ne 'while(length()>100000 && /(^|\n)149\s/){
                 s/(^|\n)149\s+[^\n]+//
               }
               map{
                 $out.="$_\n" if length($out . "\n$_")<=100000
               }split(/\n/); 
               print "$out\n"; $out="";' file

Question 4

This could probably be more elegant, but here is a solution:

cat records.txt | awk -v RS='' '{if (length>99999) {gsub(/\n149\t[^\n]*\n/,"\n");print $0"\n"} else {print $0"\n"} }'

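(I know this is a useless use of cat; I believe the left-to-right flow is clearer.)

Here 99999 is the threshold size and 149 is the start of the lines (the field name) to remove in this case.

I used the non-greedy `/\n149\t[^\n]*\n/` to remove `^149\t.*$`.

`gsub` replaces the pattern with the specified string and returns the number of substitutions made.

It was inspired by this answer.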
