How do I delete (remove) rows that are present in less than 10% of the columns of a text file?

I'm new to Bash, so please bear with my (probably silly) question. I have a text file like the following (only a small part of it is shown here):

                       type test    test    test    test    test    test    test    test    test    test    test    test    control control control control control control control control control control control control control control control control
Actinomyces_odontolyticus   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.04306 0   0   0   0   0
Actinomyces_sp_HMSC035G02   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.00575 0   0   0   0   0
Actinomyces_sp_HPA0247  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01802 0   0   0   0   0
Actinomyces_sp_ICM47    0   0   0   0   0   0   0   0   0.00244 0   0   0   0   0   0   0   0   0   0   0   0   0   0.00347 0   0   0   0   0
Actinomyces_sp_S6_Spd3  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01421 0   0   0   0   0
Actinomyces_sp_oral_taxon_181   0   0   0.00045 0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0.01219 0   0   0   0   0
Aeriscardovia_aeriphila 0   0   0.00786 0.00471 0   0   0   0.00118 0.00645 0.00918 0.01208 0   0.00153 0   0   0   0   0.00923 0   0.01527 0   0.00719 0.00423 0.00177 0   0.00468 0.0047  0.01937
Alloscardovia_omnicolens    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
Bifidobacterium_adolescentis    0.06235 0.05427 0.78772 0.11693 0.03352 0.17129 0.23957 0.25833 0.16216 0.18002 2.27324 0.23361 0.38109 0   0.59227 0   0.46423 1.06198 0.20985 0   0.26431 0.7178  0   0   0.04301 0.27795 0.06356 0.54188
Bifidobacterium_angulatum   0   0   0   0.02457 0   0.03637 0   0   0   0   0   0   0   0   0.03184 0   0   0   0   0   0   0   0   0   0.00368 0   0   0
Bifidobacterium_bifidum 0   0   0   0   0   0   0   0   0   0.08402 0   0   0   0   0.06594 0   0   0   0   0   0   0   0   0   0   0   0   0

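I want to remove the rows (bacteria) that are not present in at least 10% of the columns (individuals). For example, if there are 70 individuals, I want to remove every bacterium that is present (i.e. has a non-zero value) in fewer than 7 of them.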

Can anyone help me with a Bash command for this?

Answer 1

You can do this with the following awk command, where file is your initial file and cleaned_file is the resulting file:

awk '{nzeros=0; for(col=2; col<=NF; col++) {if($col == 0) {nzeros++}} {if(nzeros < 0.9 * (NF - 1)) {print $0}}}' file > cleaned_file

Explanation:

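  1. nzeros=0: initializes the variable that stores the number of zeros in the current row.

  2. for(col=2; col<=NF; col++) {if($col == 0) {nzeros++}}: for every row, loop from the second column (col=2; the first column holds the bacterium name) to the last one (col<=NF; NF is the number of fields, i.e. the total number of columns). If the value in a column is 0 (if($col == 0)), increment nzeros by 1 (nzeros++).

  3. if(nzeros < 0.9 * (NF - 1)) {print $0}: if the number of zeros is less than 90% (0.9) of the number of data columns, that is, the total number of columns minus the first one (NF - 1), print the row (print $0; in awk, $0 means the whole row).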
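The three steps above can also be written the other way round, counting non-zero values instead of zeros. The following is only a minimal sketch, not the answer's own command: the function name filter_min_presence and the awk variable pct are illustrative names, and it assumes that line 1 is a header, column 1 is the row name, and fields are separated by spaces or tabs. Because it compares integers, a bacterium present in exactly 10% of the individuals (7 of 70) is kept, whereas with the strict nzeros < 0.9 * (NF - 1) test such a row sits right on the boundary and, depending on floating-point rounding, may be dropped.

```shell
#!/bin/bash
# filter_min_presence PCT FILE   (hypothetical helper name, not from the answer)
# Prints the header plus every row whose non-zero count is at least
# PCT percent of the data columns.
# Example: filter_min_presence 10 file > cleaned_file
filter_min_presence() {
    awk -v pct="$1" '
        NR == 1 { print; next }              # always keep the header line
        {
            present = 0
            for (col = 2; col <= NF; col++)  # skip the name in column 1
                if ($col + 0 != 0) present++
            # same as present / (NF - 1) >= pct / 100, without division
            if (present * 100 >= pct * (NF - 1)) print
        }
    ' "$2"
}
```

Passing the percentage as an argument makes it easy to try other cut-offs, for example 50 instead of 10, without editing the awk program.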

The output for your sample is shown below; the header row is kept because none of its fields equals 0.

                       type test    test    test    test    test    test    test    test    test    test    test    test    control control control control control control control control control control control control control control control control
Aeriscardovia_aeriphila 0   0   0.00786 0.00471 0   0   0   0.00118 0.00645 0.00918 0.01208 0   0.00153 0   0   0   0   0.00923 0   0.01527 0   0.00719 0.00423 0.00177 0   0.00468 0.0047  0.01937
Bifidobacterium_adolescentis    0.06235 0.05427 0.78772 0.11693 0.03352 0.17129 0.23957 0.25833 0.16216 0.18002 2.27324 0.23361 0.38109 0   0.59227 0   0.46423 1.06198 0.20985 0   0.26431 0.7178  0   0   0.04301 0.27795 0.06356 0.54188
Bifidobacterium_angulatum   0   0   0   0.02457 0   0.03637 0   0   0   0   0   0   0   0   0.03184 0   0   0   0   0   0   0   0   0   0.00368 0   0   0

Answer 2

#!/bin/bash

s_DATA_FILE="remove_bacteria_sample_data.txt"

i_PEOPLE_NUMBER=$(head -n1 ${s_DATA_FILE} | awk '{print NF-1}')
# Be aware Bash does not support decimals so 10% of 28 people is 2
# I increased the example to 50% to get at least 2 results with your sample file
i_PERCENT=50
i_MAX_ZEROS=$((i_PERCENT*i_PEOPLE_NUMBER/100))
s_BACTERIA_LIST=$(awk '!(NR==1) { print $1 }' ${s_DATA_FILE})

echo "Found ${i_PEOPLE_NUMBER} People (test and control)"
echo "Max empty readings per Bacteria are ${i_PERCENT}%: ${i_MAX_ZEROS}"
echo

for s_BACTERIA in ${s_BACTERIA_LIST}
do
    # Please be aware that space after ${s_BACTERIA} is required to avoid matching names that start the same
    # Like if you add Actinomyces_sp_HPA0247 and Actinomyces_sp_HPA0247_2
    # Space makes sure Actinomyces_sp_HPA0247 will return only one row
    i_COUNT_ZEROS=$(grep "^${s_BACTERIA} " "${s_DATA_FILE}" | awk '{for(i=1; i<=NF; i++) if ($i==0) {i_count_zeros++}; print i_count_zeros+0; exit}')
    if [[ $i_COUNT_ZEROS -le $i_MAX_ZEROS ]]; then
        echo "* ${s_BACTERIA} meets the criteria with ${i_COUNT_ZEROS} people not being tested"
    else
        echo "- Not meeting the criteria ${s_BACTERIA} with ${i_COUNT_ZEROS} people not being tested"
    fi
done

This returns:

./remove_bacteria.sh 
Found 28 People (test and control)
Max empty readings per Bacteria are 50%: 14

- Not meeting the criteria Actinomyces_odontolyticus with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_HMSC035G02 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_HPA0247 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_ICM47 with 26 people not being tested
- Not meeting the criteria Actinomyces_sp_S6_Spd3 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_oral_taxon_181 with 26 people not being tested
* Aeriscardovia_aeriphila meets the criteria with 13 people not being tested
- Not meeting the criteria Alloscardovia_omnicolens with 28 people not being tested
* Bifidobacterium_adolescentis meets the criteria with 5 people not being tested
- Not meeting the criteria Bifidobacterium_angulatum with 24 people not being tested
- Not meeting the criteria Bifidobacterium_bifidum with 26 people not being tested
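
The comment in the script notes that Bash arithmetic is integer-only, so 10% of 28 truncates to 2. As a small sketch of how such a threshold could be rounded up instead, assuming "at least 10%" should mean at least 3 of 28 individuals, the helper below uses integer ceiling division. The name ceil_percent is hypothetical and not part of the answer, and note that it computes a minimum number of non-zero readings, whereas i_MAX_ZEROS in the script is a maximum number of zeros.

```shell
#!/bin/bash
# ceil_percent PERCENT TOTAL  (hypothetical helper, not part of the answer)
# Prints the smallest integer that is >= PERCENT% of TOTAL, using only
# Bash integer arithmetic.
ceil_percent() {
    local percent=$1 total=$2
    echo $(( (percent * total + 99) / 100 ))
}

echo "truncated:  $(( 10 * 28 / 100 ))"    # integer division drops the fraction: 2
echo "rounded up: $(ceil_percent 10 28)"   # minimum count for 10% of 28: 3
```

With 70 individuals the same helper gives 7, which matches the example in the question.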
