我的gtf
檔案位於 100 多個目錄中。下面我展示了它們的樣子。
SampleA
|___________ SampleA.GRCh38.gtf
SampleB
|___________ SampleB.GRCh38.gtf
這裡我僅顯示兩個gtf
文件作為範例。
SampleA.GRCh38.gtf
如下圖所示:
# stringtie -e -B -p 8 -G /path/stringtie_output/stringtie_merged.gtf -o /path/SampleA.GRCh38.gtf /path/SampleA.sorted.bam
# StringTie version 1.3.3
chr1 StringTie transcript 11594 191502 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1 StringTie exon 11594 14829 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "1"; cov "0.0";
chr1 StringTie exon 14970 15038 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "2"; cov "0.0";
chr1 StringTie exon 15796 16765 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "3"; cov "0.0";
chr1 StringTie exon 16858 17055 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "4"; cov "0.0";
chr1 StringTie exon 17233 17742 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "5"; cov "0.0";
chr1 StringTie exon 17915 18061 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "6"; cov "0.0";
chr1 StringTie exon 18268 19364 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "7"; cov "0.0";
chr1 StringTie exon 189836 191502 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "8"; cov "0.0";
chr1 StringTie transcript 11594 195411 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
chr1 StringTie exon 11594 14829 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "1"; cov "0.0";
chr1 StringTie exon 14970 15236 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "2"; cov "0.0";
chr1 StringTie exon 185758 187287 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "3"; cov "0.0";
chr1 StringTie exon 187376 187577 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "4"; cov "0.0";
chr1 StringTie exon 187755 187890 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "5"; cov "0.0";
chr1 StringTie exon 188130 188266 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "6"; cov "0.0";
chr1 StringTie exon 188439 188584 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "7"; cov "0.0";
chr1 StringTie exon 188791 188902 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "8"; cov "0.0";
chr1 StringTie exon 195263 195411 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "9"; cov "0.0";
chr1 StringTie transcript 11594 197912 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.5"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
如下SampleB.GRCh38.gtf
圖所示:
# stringtie -e -B -p 8 -G /path/stringtie_output/stringtie_merged.gtf -o /path/SampleB.GRCh38.gtf /path/SampleB.sorted.bam
# StringTie version 1.3.3
chr1 StringTie transcript 11594 191502 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; cov "0.0"; FPKM "0.000000"; TPM "1.000000";
chr1 StringTie exon 11594 14829 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "1"; cov "0.0";
chr1 StringTie exon 14970 15038 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "2"; cov "0.0";
chr1 StringTie exon 15796 16765 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "3"; cov "0.0";
chr1 StringTie exon 16858 17055 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "4"; cov "0.0";
chr1 StringTie exon 17233 17742 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "5"; cov "0.0";
chr1 StringTie exon 17915 18061 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "6"; cov "0.0";
chr1 StringTie exon 18268 19364 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "7"; cov "0.0";
chr1 StringTie exon 189836 191502 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.2"; exon_number "8"; cov "0.0";
chr1 StringTie transcript 11594 195411 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; cov "0.0"; FPKM "0.000000"; TPM "3.000000";
chr1 StringTie exon 11594 14829 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "1"; cov "0.0";
chr1 StringTie exon 14970 15236 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "2"; cov "0.0";
chr1 StringTie exon 185758 187287 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "3"; cov "0.0";
chr1 StringTie exon 187376 187577 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "4"; cov "0.0";
chr1 StringTie exon 187755 187890 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "5"; cov "0.0";
chr1 StringTie exon 188130 188266 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "6"; cov "0.0";
chr1 StringTie exon 188439 188584 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "7"; cov "0.0";
chr1 StringTie exon 188791 188902 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "8"; cov "0.0";
chr1 StringTie exon 195263 195411 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.6"; exon_number "9"; cov "0.0";
chr1 StringTie transcript 11594 197912 . - . gene_id "MSTRG.7542"; transcript_id "MSTRG.7542.5"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
我只想transcript
從第三列中提取,transcript_id
哪一列是第十列,TPM
哪一列是最後一列。但TPM
需要是樣本名稱。
我希望輸出如下所示:
Type transcript_id SampleA SampleB
transcript MSTRG.7542.2 0.000000 1.000000
transcript MSTRG.7542.6 0.000000 3.000000
transcript MSTRG.7542.5 0.000000 1.000000
答案1
您將需要從每個文件中提取相關記錄並將結果寫入兩個新的臨時文件(可能使用awk
),可能sort
同時對其進行排序(使用)(示例文件說它們已排序,但可能不在正確的位置)鑰匙)。以下是處理其中一個文件的範例:
awk '$3 == "transcript" {printf("%s %s %s ", $3, $10, $12, $18);}' SampleA.GRCh38.gtf | sort -k 2 > tf1
然後,您可以使用join
合併 產生的兩個臨時/中間文件,awk
以便每個記錄都具有每個文件中的兩個最終列。
join
以下是您可能使用的命令範例:
join -o 1.1,1.2,1.3,2.3 -1 2 -2 2 tf1 tf2
您可能希望在運行之前列印標題行(例如使用命令printf
)join
,並且您可能希望用join
製表符替換輸出中的空格(例如使用sed
),或使用另一個awk
腳本來格式化輸出。
從這些範例中,您應該能夠編寫一個腳本來處理這兩個檔案並產生所需的輸出(並清理臨時檔案等)。
請注意,根據資料檔案的大小,您甚至可以在一個awk
(或python
或perl
等)程式中完成所有操作(即可以輕鬆地將兩個檔案中的所有選定資料同時保存在記憶體中)。
答案2
您可以只刪除join
文件,然後awk
刪除那些包含的文件NF==4
,因為只有您感興趣的行才有第 18 個欄位。所有其他行將只有 2 個字段
也對計算 的路徑做出某些假設SampleB
,但是您可以修改它以適應...
while IFS= read -r -d '' f; do #read the list of SampleA
g=$(echo "$f" | sed "s/pleA/pleB/g") #calculate path to SampleB
if [[ -f "$g" ]]; then #check SampleB exists
echo "$f" | sed "s/.*pleA\.//g" #print sample No
echo "Type transcript_id SampleA SampleB" #print header
#do the join
join -j 12 -o 1.3 -o 1.12 -o 1.18 -o2.18 <(sort -k 12 "$f") <(sort -k 12 "$g") | awk 'NF==4'
fi | sed 's/[;"]//g'| column -t #make it pretty
done < <(find . -type f -iname "*SampleA*" -print0) #NULL separated list of SampleA
答案3
嘗試使用以下命令
步驟1
awk '$3 ~ /transcript/{print $0}' file1|awk '{print $3,substr($12,2,12),substr($NF,2,8)}' > out1
第2步
awk '$3 == "transcript" {print substr($NF,2,8)}' file2 > out2
步驟3
paste out out1.txt | awk 'BEGIN{print "Type transcript_id SampleA SampleB"}{print $0}'
Output
Type transcript_id SampleA SampleB
transcript MSTRG.7542.2 0.000000 1.000000
transcript MSTRG.7542.6 0.000000 3.000000
transcript MSTRG.7542.5 0.000000 0.000000