How to remove duplicate names and print the values after each unique name

Answer 1

More of the same:

awk '$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }' datafile

K00002  gene_65472      gene_212051     gene_403626
K00003  gene_666        gene_5168       gene_7635       gene_12687      gene_175295     gene_647659     gene_663019
K00004  gene_88381
K00005  gene_30485      gene_193699     gene_256294     gene_307497

If you don't want tab-separated output, change the \t to a space.

It works like this:

# Each line is processed in turn. "p" is the previous line's key field value

# Key field isn't the same as before
$1 != p {
    # Flush this line if we have printed something already
    if (p > "") { printf "\n" }

    # Print the key field name and set it as the current key field
    printf "%s", $1; p = $1
}

# Every line, print the second value on the line
{ printf "\t%s", $2 }

# No more input. Flush the line if we have already printed something
END {
    if (p > "") { printf "\n" }
}
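
Because the program only compares each key with the key on the previous line, it relies on identical keys being adjacent in the input. The following is a minimal sketch, assuming a hypothetical unsorted sample and assuming that reordering the rows is acceptable; it groups the keys with sort before running the same awk program. The -s (stable) option is a GNU/BSD extension and is assumed to be available.

```shell
# Hypothetical sample in which the two K00002 rows are not adjacent.
data=$(printf '%s\t%s\n' K00002 gene_65472 K00003 gene_666 K00002 gene_212051)

# Group equal keys together first; -s keeps the values in their original
# relative order (assumption: sort supports the -s extension).
grouped=$(printf '%s\n' "$data" |
    LC_ALL=C sort -s -k1,1 |
    awk '$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }')

printf '%s\n' "$grouped"
```

Without the sort step, K00002 would be printed twice, once for each run of adjacent lines.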

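From the vague comments you are making on everyone's answers, it seems that the underlying problem is that you are using a data file generated on a Windows system and expecting it to work on a UNIX/Linux platform. Don't do that. Or, if you must, convert the file to the correct format first: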

dos2unix < datafile | awk '...'       # As above

tr -d '\r' < datafile | awk '...'     # Also as above
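
To make the effect of the Windows line endings visible, here is a small sketch that assumes a hypothetical two-line sample written with CRLF endings. It runs the same awk program on the raw file and on the output of tr, so the two results can be compared.

```shell
# Hypothetical sample with Windows (CRLF) line endings.
tmp=$(mktemp)
printf 'K00002\tgene_65472\r\nK00002\tgene_212051\r\n' > "$tmp"

prog='$1 != p { if (p>"") {printf "\n"} printf "%s",$1; p=$1 } { printf "\t%s",$2 } END { if(p>"") {printf "\n"} }'

# Raw file: every $2 still carries a trailing carriage return.
raw=$(awk "$prog" "$tmp")

# Converted file: the carriage returns are removed before awk sees the data.
clean=$(tr -d '\r' < "$tmp" | awk "$prog")

# od -c shows the stray \r characters in the raw result.
printf '%s\n' "$raw" | od -c
printf '%s\n' "$clean"
rm -f "$tmp"
```

On a terminal the stray carriage returns move the cursor back to the start of the line, which is why the unconverted output tends to look overwritten or truncated.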

Answer 2

File:

K00002  gene_65472
K00002  gene_212051
K00002  gene_403626
K00003  gene_666
K00003  gene_5168
K00003  gene_7635
K00003  gene_12687
K00003  gene_654221
K00003  gene_663019
K00004  gene_88381
K00005  gene_30485
K00005  gene_193699
K00005  gene_256294

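Using awk:

awk '{ if ($1 in a) { a[$1] = a[$1] " " $2 } else { a[$1] = $2 } } END { for (i in a) { print i, a[i] } }' file

Output (the order in which for (i in a) visits the keys is unspecified, so pipe the result through sort if the keys must come out sorted):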

K00002 gene_65472 gene_212051 gene_403626
K00003 gene_666 gene_5168 gene_7635 gene_12687 gene_654221 gene_663019
K00004 gene_88381
K00005 gene_30485 gene_193699 gene_256294

I used this post as a reference.
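
If the keys should come out in the order in which they first appear, one option is to record that order in a second array. This is only a sketch under assumptions: the sample data and the array names order and n are hypothetical and not part of the answer above.

```shell
# Hypothetical sample, deliberately not sorted by key.
data=$(printf '%s %s\n' K00003 gene_666 K00002 gene_65472 K00003 gene_5168)

ordered=$(printf '%s\n' "$data" | awk '
    # Remember each key the first time it appears and store its first value.
    !($1 in a) { order[++n] = $1; a[$1] = $2; next }
    # Later lines with a known key append their value.
    { a[$1] = a[$1] " " $2 }
    # Walk the keys in first-seen order instead of using for (i in a).
    END { for (i = 1; i <= n; i++) print order[i], a[order[i]] }
')

printf '%s\n' "$ordered"
```

Like the array version above, this works on unsorted input, at the cost of holding all keys and values in memory until the END block.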

Answer 3

Using Miller (http://johnkerl.org/miller/doc) and the command

mlr --csv --implicit-csv-header --headerless-csv-output cat -n -g 1 then label a,b,c then reshape -s a,c then unsparsify --fill-with "" input.csv

and this example CSV input

A,234
A,4945
B,8798
B,8798
B,790

you will get

A,234,4945,
B,8798,8798,790
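
Where Miller is not installed, the same padded output shape can be approximated in awk. This is a sketch, not a description of how Miller works internally, and it assumes plain CSV with no quoted fields or embedded commas; the array names keys, cnt, val and max are hypothetical.

```shell
# The example input from above.
input=$(printf '%s\n' A,234 A,4945 B,8798 B,8798 B,790)

wide=$(printf '%s\n' "$input" | awk -F, '
    # Record each key once, in first-seen order.
    !($1 in cnt) { keys[++n] = $1 }
    # Store the value under (key, position) and track the longest row.
    { val[$1, ++cnt[$1]] = $2; if (cnt[$1] > max) max = cnt[$1] }
    END {
        for (i = 1; i <= n; i++) {
            line = keys[i]
            # Missing positions are empty strings, which pads shorter rows.
            for (j = 1; j <= max; j++) line = line "," val[keys[i], j]
            print line
        }
    }
')

printf '%s\n' "$wide"
```

The padding matters when the result has to stay a rectangular CSV, since every row then has the same number of fields.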

Answer 4

Assuming your values contain no spaces and are space-separated, and assuming your data is in a file named file (see below for a tab-separated version):

for x in $(<file cut -d ' ' -f 1 | sort | uniq); do
    printf '%s %s\n' "$x" "$(grep "$x" file | cut -d ' ' -f 2- | tr '\n' ' ' | sed 's/.$//')"
done

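This will:

Extract the distinct values of the first field:
- cut selects only the first chunk of each line (-f 1), splitting it at every space (-d ' ');
- sort | uniq sorts the first field's values and outputs each one only once (or, shorter and more efficient: sort -u).
For each one:
- extract all related lines from file with grep;
- strip the first field using cut (-f 2- means "take the second and following fields");
- convert the remainder to a list of space-separated values (tr);
- strip the last character (an unwanted space) using sed (yes, this is really inelegant);
- concatenate the result to the first field's value and print it to standard output.

If your input is tab-separated and you want tab-separated output, the code above becomes: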
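
The steps above can also be sketched with slightly stricter matching. This variant is not the answer's own code: it assumes a hypothetical sample file, anchors the grep pattern to the start of the line so that a key cannot match inside another field, uses sort -u, and joins the values with paste instead of tr plus sed. It still assumes the keys contain no regular-expression metacharacters.

```shell
# Hypothetical space-separated sample, not sorted by key.
tmp=$(mktemp)
printf '%s\n' 'K00002 gene_65472' 'K00003 gene_666' 'K00002 gene_212051' > "$tmp"

joined=$(
    cut -d ' ' -f 1 "$tmp" | sort -u | while IFS= read -r x; do
        # "^$x " matches only lines whose first field is exactly $x;
        # paste -s joins the remaining fields without a trailing separator.
        printf '%s %s\n' "$x" "$(grep "^$x " "$tmp" | cut -d ' ' -f 2- | paste -s -d ' ' -)"
    done
)

printf '%s\n' "$joined"
rm -f "$tmp"
```

Reading the keys with while read avoids word splitting of the for loop's command substitution, but the file is still scanned once per key, so the performance caveat for shell loops applies here as well.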

for x in $(<file cut -f 1 | sort | uniq); do
    printf '%s\t%s\n' "$x" "$(grep "$x" file | cut -f 2- | tr '\n' '\t' | sed 's/.$//')"
done

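Notes:

Performance: this approach is significantly slower than the awk-based solutions (I tested roaima's answer), by at least an order of magnitude.
On the other hand, this approach works even if the input file is not sorted.
Although this solution is a quick (and dirty?) way to get the job done, using shell loops to process text is generally not recommended; see "Why is using a shell loop to process text considered bad practice?".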

