How do I remove (delete) rows in a text file that are present in fewer than 10% of the columns?

I'm very new to Bash, so please be patient with my (possibly stupid) question. I have a text file like this (only a small part is shown here):

                       type test    test    test    test    test    test    test    test    test    test    test    test    control control control control control control control control control control control control control control control control
Actinomyces_odontolyticus   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.04306 0   0   0   0   0
Actinomyces_sp_HMSC035G02   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.00575 0   0   0   0   0
Actinomyces_sp_HPA0247  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01802 0   0   0   0   0
Actinomyces_sp_ICM47    0   0   0   0   0   0   0   0   0.00244 0   0   0   0   0   0   0   0   0   0   0   0   0   0.00347 0   0   0   0   0
Actinomyces_sp_S6_Spd3  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01421 0   0   0   0   0
Actinomyces_sp_oral_taxon_181   0   0   0.00045 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01219 0   0   0   0   0
Aeriscardovia_aeriphila 0   0   0.00786 0.00471 0   0   0   0.00118 0.00645 0.00918 0.01208 0   0.00153 0   0   0   0   0.00923 0   0.01527 0   0.00719 0.00423 0.00177 0   0.00468 0.0047  0.01937
Alloscardovia_omnicolens    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Bifidobacterium_adolescentis    0.06235 0.05427 0.78772 0.11693 0.03352 0.17129 0.23957 0.25833 0.16216 0.18002 2.27324 0.23361 0.38109 0   0.59227 0   0.46423 1.06198 0.20985 0   0.26431 0.7178  0   0   0.04301 0.27795 0.06356 0.54188
Bifidobacterium_angulatum   0   0   0   0.02457 0   0.03637 0   0   0   0   0   0   0   0   0.03184 0   0   0   0   0   0   0   0   0   0.00368 0   0   0
Bifidobacterium_bifidum 0   0   0   0   0   0   0   0   0   0.08402 0   0   0   0   0.06594 0   0   0   0   0   0   0   0   0   0   0   0   0

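I would like to remove the rows (bacteria) that are not present in at least 10% of the columns (individuals). For example, if I have 70 individuals, I want to remove every bacterium that is not present in at least 7 of them (not present means the value is 0).

Can anyone help me with some Bash commands?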



awk '{nzeros=0; for(col=2; col<=NF; col++) {if($col == 0) {nzeros++}} {if(nzeros < 0.9 * (NF - 1)) {print $0}}}' file > cleaned_file


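  1. nzeros=0: we initialize a variable that holds the number of zeros in the current row.

  2. for(col=2; col<=NF; col++) {if($col == 0) {nzeros++}}: for each row, we loop from the second column (col=2; the first column is the bacterium name) to the last one (col<=NF; NF is the number of fields, i.e. the total number of columns). If the column's value is 0 (if($col == 0)), we increase nzeros by 1 (nzeros++).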

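  3. if(nzeros < 0.9 * (NF - 1)) {print $0}}: if the number of zeros is less than 90% (0.9) of the number of columns, not counting the first one, we print the row (print $0; $0 is the whole line in awk). Because the comparison is strict, a row in which exactly 90% of the values are zero is removed; use <= instead if such rows should be kept.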

With the sample input from the question, this produces:


                       type test    test    test    test    test    test    test    test    test    test    test    test    control control control control control control control control control control control control control control control control
Aeriscardovia_aeriphila 0   0   0.00786 0.00471 0   0   0   0.00118 0.00645 0.00918 0.01208 0   0.00153 0   0   0   0   0.00923 0   0.01527 0   0.00719 0.00423 0.00177 0   0.00468 0.0047  0.01937
Bifidobacterium_adolescentis    0.06235 0.05427 0.78772 0.11693 0.03352 0.17129 0.23957 0.25833 0.16216 0.18002 2.27324 0.23361 0.38109 0   0.59227 0   0.46423 1.06198 0.20985 0   0.26431 0.7178  0   0   0.04301 0.27795 0.06356 0.54188
Bifidobacterium_angulatum   0   0   0   0.02457 0   0.03637 0   0   0   0   0   0   0   0   0.03184 0   0   0   0   0   0   0   0   0   0.00368 0   0   0
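
If the cutoff needs to change, the same idea can be written with the threshold passed in as a variable. The following is only a minimal sketch and not part of the answer above: the variable name `min_pct` and the file names `sample.txt` and `filtered.txt` are assumptions made for illustration. It counts non-zero values instead of zeros, always keeps the header line, and compares integers with `>=`, so a bacterium present in exactly 10% of the individuals is kept.

```shell
# Build a tiny sample table: 1 header row + 3 bacteria, 10 individuals each
# (hypothetical data, only for illustration)
cat > sample.txt <<'EOF'
type test test test test test control control control control control
Rare_species 0 0 0 0 0 0 0 0 0 0
Edge_species 0 0 0 0 0.5 0 0 0 0 0
Common_species 0.1 0.2 0 0 0.3 0 0.4 0 0 0.5
EOF

# Keep the header (NR == 1) and every row that is non-zero in at least
# min_pct percent of the individuals (NF - 1 value columns)
awk -v min_pct=10 '
NR == 1 { print; next }
{
    present = 0
    for (col = 2; col <= NF; col++)
        if ($col != 0) present++
    # Integer comparison: present / (NF - 1) >= min_pct / 100
    if (present * 100 >= min_pct * (NF - 1)) print
}' sample.txt > filtered.txt

cat filtered.txt
```

Writing the test as `present * 100 >= min_pct * (NF - 1)` keeps both sides as whole numbers, which sidesteps floating-point rounding right at the boundary.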




#!/usr/bin/env bash
# Placeholder file name (an assumption): point this at your own data file
s_DATA_FILE="data.txt"
# Maximum percentage of zero readings allowed per bacterium (50 matches the sample output)
i_PERCENT=50
i_PEOPLE_NUMBER=$(head -n1 "${s_DATA_FILE}" | awk '{print NF-1}')
# Be aware Bash does not support decimals so 10% of 28 people is 2
# I increased the example to 50% to get at least 2 results with your sample file
i_MAX_ZEROS=$(( i_PEOPLE_NUMBER * i_PERCENT / 100 ))
s_BACTERIA_LIST=$(awk 'NR > 1 { print $1 }' "${s_DATA_FILE}")

echo "Found ${i_PEOPLE_NUMBER} People (test and control)"
echo "Max empty readings per Bacteria are ${i_PERCENT}%: ${i_MAX_ZEROS}"

for s_BACTERIA in ${s_BACTERIA_LIST}; do
    # Anchoring the name at the start of the line and requiring whitespace after it
    # avoids matching names that start the same
    # Like if you add Actinomyces_sp_HPA0247 and Actinomyces_sp_HPA0247_2
    # This makes sure Actinomyces_sp_HPA0247 will return only one row
    i_COUNT_ZEROS=$(grep "^${s_BACTERIA}[[:space:]]" "${s_DATA_FILE}" | awk '{for(i=2; i<=NF; i++) if ($i==0) {i_count_zeros++}; print i_count_zeros+0; exit}')
    if [[ $i_COUNT_ZEROS -le $i_MAX_ZEROS ]]; then
        echo "* ${s_BACTERIA} meets the criteria with ${i_COUNT_ZEROS} people not being tested"
    else
        echo "- Not meeting the criteria ${s_BACTERIA} with ${i_COUNT_ZEROS} people not being tested"
    fi
done

Output with the sample file:


Found 28 People (test and control)
Max empty readings per Bacteria are 50%: 14

- Not meeting the criteria Actinomyces_odontolyticus with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_HMSC035G02 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_HPA0247 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_ICM47 with 26 people not being tested
- Not meeting the criteria Actinomyces_sp_S6_Spd3 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_oral_taxon_181 with 26 people not being tested
* Aeriscardovia_aeriphila meets the criteria with 13 people not being tested
- Not meeting the criteria Alloscardovia_omnicolens with 28 people not being tested
* Bifidobacterium_adolescentis meets the criteria with 5 people not being tested
- Not meeting the criteria Bifidobacterium_angulatum with 24 people not being tested
- Not meeting the criteria Bifidobacterium_bifidum with 26 people not being tested
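
The script above rereads the data file once for every bacterium. As a rough alternative sketch (not taken from the answer itself), the same kind of report can be produced in a single `awk` pass. The file names `sample.txt` and `report.txt` and the variable `pct` are assumptions made for illustration, and the threshold is truncated to a whole number in the same way as the "50%: 14" line in the output.

```shell
# Hypothetical 4-individual sample, only for illustration
cat > sample.txt <<'EOF'
type test test control control
Species_a 0 0 0 0.2
Species_b 0.1 0 0.3 0
Species_c 0.1 0.2 0.3 0.4
EOF

awk -v pct=50 '
NR == 1 {
    # Header row: every field except the first is one individual
    people = NF - 1
    max_zeros = int(people * pct / 100)   # truncated, like Bash integer division
    printf "Found %d People (test and control)\n", people
    printf "Max empty readings per Bacteria are %d%%: %d\n", pct, max_zeros
    next
}
{
    # Count the zero readings of this bacterium
    zeros = 0
    for (i = 2; i <= NF; i++)
        if ($i == 0) zeros++
    if (zeros <= max_zeros)
        printf "* %s meets the criteria with %d people not being tested\n", $1, zeros
    else
        printf "- Not meeting the criteria %s with %d people not being tested\n", $1, zeros
}' sample.txt > report.txt

cat report.txt
```

Because each row is classified as it is read, no per-name `grep` is needed and names that share a prefix cannot be confused with each other.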
