I'm a Bash beginner, so please excuse my (probably silly) question. I have a text file like the following (this is only a small part of it):
type test test test test test test test test test test test test control control control control control control control control control control control control control control control control
Actinomyces_odontolyticus 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.04306 0 0 0 0 0
Actinomyces_sp_HMSC035G02 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00575 0 0 0 0 0
Actinomyces_sp_HPA0247 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.01802 0 0 0 0 0
Actinomyces_sp_ICM47 0 0 0 0 0 0 0 0 0.00244 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00347 0 0 0 0 0
Actinomyces_sp_S6_Spd3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.01421 0 0 0 0 0
Actinomyces_sp_oral_taxon_181 0 0 0.00045 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.01219 0 0 0 0 0
Aeriscardovia_aeriphila 0 0 0.00786 0.00471 0 0 0 0.00118 0.00645 0.00918 0.01208 0 0.00153 0 0 0 0 0.00923 0 0.01527 0 0.00719 0.00423 0.00177 0 0.00468 0.0047 0.01937
Alloscardovia_omnicolens 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Bifidobacterium_adolescentis 0.06235 0.05427 0.78772 0.11693 0.03352 0.17129 0.23957 0.25833 0.16216 0.18002 2.27324 0.23361 0.38109 0 0.59227 0 0.46423 1.06198 0.20985 0 0.26431 0.7178 0 0 0.04301 0.27795 0.06356 0.54188
Bifidobacterium_angulatum 0 0 0 0.02457 0 0.03637 0 0 0 0 0 0 0 0 0.03184 0 0 0 0 0 0 0 0 0 0.00368 0 0 0
Bifidobacterium_bifidum 0 0 0 0 0 0 0 0 0 0.08402 0 0 0 0 0.06594 0 0 0 0 0 0 0 0 0 0 0 0 0
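I would like to remove the rows (bacteria) that are not present in at least 10% of the columns (individuals). For example, with 70 individuals, I want to remove every bacterium that is present (value other than 0) in fewer than 7 individuals.
Could someone help me with a Bash command for this?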
Answer 1
You can do this with the following awk command, where file is your original file and cleaned_file is the resulting file:
awk '{nzeros=0; for(col=2; col<=NF; col++) {if($col == 0) {nzeros++}} {if(nzeros < 0.9 * (NF - 1)) {print $0}}}' file > cleaned_file
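Explanation:
nzeros=0 : initializes the variable that stores the number of zeros in each row.
for(col=2; col<=NF; col++) {if($col == 0) {nzeros++}} : for each row, loops from the second column (col=2, since the first column holds the bacteria name) to the last one (col<=NF, where NF is the number of fields, i.e. the total number of columns). If the value of a column is 0 (if($col == 0)), nzeros is incremented by 1 (nzeros++).
if(nzeros < 0.9 * (NF - 1)) {print $0} : if the number of zeros is less than 90% (0.9) of the number of columns excluding the first one, the row is printed ($0 means the whole line in awk). The header row contains no numeric zeros, so it is always printed.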
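One edge case worth knowing: because the comparison in the command above is strict (<), a bacterium present in exactly 10% of individuals (for example 7 of 70) is removed, while the question asks to keep it. The following is a minimal sketch of an inclusive variant that counts non-zero values and compares with integer arithmetic. The function name filter_by_prevalence, the pct parameter and the ten-sample table are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper: keep rows whose non-zero count is at least PCT% of the samples.
# Usage: filter_by_prevalence PCT FILE
filter_by_prevalence() {
    awk -v pct="$1" '
        NR == 1 { print; next }            # always keep the header row
        {
            present = 0
            for (col = 2; col <= NF; col++)
                if ($col != 0) present++   # count samples where the bacterium was detected
            # same as present / (NF - 1) >= pct / 100, without fractions
            if (100 * present >= pct * (NF - 1)) print
        }
    ' "$2"
}

# Made-up table with 10 samples, for illustration only
sample=$(mktemp)
cat > "$sample" <<'EOF'
type s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
absent 0 0 0 0 0 0 0 0 0 0
one_of_ten 0 0 0.5 0 0 0 0 0 0 0
three_of_ten 0.1 0 0.2 0 0 0.3 0 0 0 0
EOF

# Keeps the header, one_of_ten (exactly 10%) and three_of_ten
filter_by_prevalence 10 "$sample"
```

On the real data this would be called as filter_by_prevalence 10 file > cleaned_file. For the 28-sample excerpt in the question both variants keep the same rows, since 10% of 28 is not a whole number.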
The output for your sample is:
type test test test test test test test test test test test test control control control control control control control control control control control control control control control control
Aeriscardovia_aeriphila 0 0 0.00786 0.00471 0 0 0 0.00118 0.00645 0.00918 0.01208 0 0.00153 0 0 0 0 0.00923 0 0.01527 0 0.00719 0.00423 0.00177 0 0.00468 0.0047 0.01937
Bifidobacterium_adolescentis 0.06235 0.05427 0.78772 0.11693 0.03352 0.17129 0.23957 0.25833 0.16216 0.18002 2.27324 0.23361 0.38109 0 0.59227 0 0.46423 1.06198 0.20985 0 0.26431 0.7178 0 0 0.04301 0.27795 0.06356 0.54188
Bifidobacterium_angulatum 0 0 0 0.02457 0 0.03637 0 0 0 0 0 0 0 0 0.03184 0 0 0 0 0 0 0 0 0 0.00368 0 0 0
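As a quick sanity check on that output, the number of non-zero readings per row can be listed and compared with the threshold (10% of 28 individuals is 2.8, so a row needs at least 3 non-zero values to be kept). This is only a sketch: the function name count_present and the miniature table are invented for illustration.

```shell
# Hypothetical helper: for every data row print name, non-zero count, number of samples
count_present() {
    awk 'NR > 1 {
        n = 0
        for (col = 2; col <= NF; col++)
            if ($col != 0) n++
        print $1, n, NF - 1
    }' "$1"
}

# Made-up miniature table; the real input would be your own file
sample=$(mktemp)
printf '%s\n' \
    'type test test control control' \
    'bact_a 0 0 0.04306 0' \
    'bact_b 0.06235 0.05427 0 0.59227' > "$sample"

count_present "$sample"
# prints:
# bact_a 1 4
# bact_b 3 4
```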
Answer 2
#!/bin/bash
s_DATA_FILE="remove_bacteria_sample_data.txt"
# Number of people = number of fields in the header row minus the label column
i_PEOPLE_NUMBER=$(head -n1 "${s_DATA_FILE}" | awk '{print NF-1}')
# Be aware that Bash only does integer arithmetic, so 10% of 28 people is truncated to 2
# I increased the example to 50% to get at least 2 results with your sample file
i_PERCENT=50
i_MAX_ZEROS=$((i_PERCENT*i_PEOPLE_NUMBER/100))
# All bacteria names (first column), skipping the header row
s_BACTERIA_LIST=$(awk 'NR>1 { print $1 }' "${s_DATA_FILE}")
echo "Found ${i_PEOPLE_NUMBER} People (test and control)"
echo "Max empty readings per Bacteria are ${i_PERCENT}%: ${i_MAX_ZEROS}"
echo
for s_BACTERIA in ${s_BACTERIA_LIST}
do
    # Please be aware that the space after ${s_BACTERIA} is required to avoid matching names that start the same,
    # for example Actinomyces_sp_HPA0247 and Actinomyces_sp_HPA0247_2.
    # The space makes sure Actinomyces_sp_HPA0247 returns only one row;
    # the leading ^ likewise avoids matching names that merely end the same.
    # Fields are scanned from 2 (the values); +0 prints 0 instead of an empty string when no zero is found.
    i_COUNT_ZEROS=$(grep "^${s_BACTERIA} " "${s_DATA_FILE}" | awk '{for(i=2; i<=NF; i++) if ($i==0) {i_count_zeros++}; print i_count_zeros+0; exit}')
    if [[ $i_COUNT_ZEROS -le $i_MAX_ZEROS ]]; then
        echo "* ${s_BACTERIA} meets the criteria with ${i_COUNT_ZEROS} people not being tested"
    else
        echo "- Not meeting the criteria ${s_BACTERIA} with ${i_COUNT_ZEROS} people not being tested"
    fi
done
This returns the following result:
./remove_bacteria.sh
Found 28 People (test and control)
Max empty readings per Bacteria are 50%: 14
- Not meeting the criteria Actinomyces_odontolyticus with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_HMSC035G02 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_HPA0247 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_ICM47 with 26 people not being tested
- Not meeting the criteria Actinomyces_sp_S6_Spd3 with 27 people not being tested
- Not meeting the criteria Actinomyces_sp_oral_taxon_181 with 26 people not being tested
* Aeriscardovia_aeriphila meets the criteria with 13 people not being tested
- Not meeting the criteria Alloscardovia_omnicolens with 28 people not being tested
* Bifidobacterium_adolescentis meets the criteria with 5 people not being tested
- Not meeting the criteria Bifidobacterium_angulatum with 24 people not being tested
- Not meeting the criteria Bifidobacterium_bifidum with 26 people not being tested
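Note that the script's percentage is the maximum share of zeros, while the question states a minimum share of individuals in which the bacterium is present. Since shell arithmetic truncates, one way to express the question's 10% rule is to round the required number of non-zero readings up and derive the allowed zeros from it. A minimal sketch, assuming a hypothetical helper named min_present (not part of the script above):

```shell
# Hypothetical helper: smallest whole number of individuals that is at least PCT% of TOTAL
# Usage: min_present PCT TOTAL
min_present() {
    echo $(( ($1 * $2 + 99) / 100 ))   # integer ceiling of PCT * TOTAL / 100
}

total=28
need=$(min_present 10 "$total")        # ceil(2.8) = 3
max_zeros=$(( total - need ))          # 28 - 3 = 25
echo "need ${need} non-zero readings, so at most ${max_zeros} zeros"
# prints: need 3 non-zero readings, so at most 25 zeros
```

With 70 individuals the same helper gives 7, which matches the example in the question.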