如何以特定數量切割 fasta 序列並產生 ORF

Question 1

這個問題需要比這個論壇提供的更大的程式設計工作量（我以這種程式為生）。

這DDBJ/ENA/GenBank 檔案格式（問題中的第一個檔案）很複雜，並且允許 CDS（基因組序列的編碼部分）不僅是簡單的或連接的，而且是互補的及其組合。此外，位置座標可以有修飾語對於通用解決方案來說，需要處理這個問題。

您最好詢問當地的生物資訊學家（或程式設計師）或在生物資訊論壇，例如 StackExchange生物資訊網站。他們會向您指出用於執行此類操作的現有工具，或者，了解生物資訊學家，為您提供一些古怪的 BioPerl/BioPython 腳本，這些腳本可能會更有效；-)

一可能的路線是使用GenBank 特徵提取器，但是在線使用它很可能不是除了小數據集之外的任何東西的最佳選擇。

Answer

這個問題需要比這個論壇提供的更大的程式設計工作量（我以這種程式為生）。

這DDBJ/ENA/GenBank 檔案格式（問題中的第一個檔案）很複雜，並且允許 CDS（基因組序列的編碼部分）不僅是簡單的或連接的，而且是互補的及其組合。此外，位置座標可以有修飾語對於通用解決方案來說，需要處理這個問題。

您最好詢問當地的生物資訊學家（或程式設計師）或在生物資訊論壇，例如 StackExchange生物資訊網站。他們會向您指出用於執行此類操作的現有工具，或者，了解生物資訊學家，為您提供一些古怪的 BioPerl/BioPython 腳本，這些腳本可能會更有效；-)

一可能的路線是使用GenBank 特徵提取器，但是在線使用它很可能不是除了小數據集之外的任何東西的最佳選擇。

Question 2

儘管我給出了先前的答案，但我已經研究了從 fasta 檔案中提取特定子序列的子問題。解決方案分為兩部分：

一個sh執行命令列解析並呼叫...的 shell 腳本
awk解析 fasta 檔案的腳本。

我決定將其發佈在這裡，因為它表明

如何在 shell 腳本中對選項進行命令列解析。
可以編寫一個awk腳本，而不僅僅是awk「一句台詞」。

假設：

序列 ID 將直接出現>在標題行的後面，後面接著一個空格字元。
序列資料中沒有任何空格。

該腳本稱為extract.sh.

gi1234若要運行，請取得位置 36 和 183（含）之間的序列 ID ：

$ sh extract.sh -i gi1234 -a 36 -b 183 <sequence.fa
ACAACAACAACACTGGGGGACACACTGGGACAACACTGGGGGACA
GGACACTGTACAACACTGGGTGTGTCGGGACAGTACACATGTTGGGGGGGTGTGTCGGACAACACTGGGGGACATGTGTG
TACAACACTGGGGGACAGTGACG

輸出未格式化。在本例中，我獲取了問題中的數據，並在運行腳本之前在每 80 個字元後插入換行符。

shell 腳本 ( extract.sh):

#!/bin/sh

usage() {
    cat <<USAGE_END
Usage:
    sh $0 -i seq_id -a start -b end <sequence.fa
USAGE_END
}

while getopts "a:b:i:" opt; do
    case $opt in
        a) start_pos="$OPTARG"  ;;
        b) end_pos="$OPTARG"    ;;
        i) seq_id="$OPTARG"     ;;
        *) usage; exit 1        ;;
    esac
done

if [ "x$seq_id" = "x" ]; then
    echo "Missing sequence ID (-i seq_id)"
    usage
    exit 1
fi

if [ "x$start_pos" = "x" -o "x$end_pos" = "x" ]; then
    echo "Missing either start or end coordinates (-a, -b)"
    usage
    exit 1
fi

if [ ! -f "extract.awk" ]; then
    echo "Can not find the extract.awk AWK script!"
    exit 1
fi

awk -v id="$seq_id" -v a="$start_pos" -v b="$end_pos" -f extract.awk

劇本awk（extract.awk）：

# state == 0:   not yet at wanted sequence entry in file
# state == 1:   found sequence entry, not yet at start position
# state == 2:   found start position, not yet at end position

# off:  offset in sequence read so far

BEGIN   { state = 0 }

state == 0 && $0 ~ "^>" id " " {
    # We found the sequence entry we're looking for.
    state = 1;
    off = 0;
    next;
}

state > 0 && /^>/  {
    # New sequence header found while parsing sequence, terminate.
    exit;
}

state == 1 {
    len = length($0);

    if (len + off < a) {
        # Start position is not on this line.
        off += len;
        next;
    }

    state = 2;

    if (len + off >= b) {
        # Whole sequence is found on this line.
        print(substr($0, a - off, b - a + 1))
        exit;
    }

    # Output partial start of sequence.
    print(substr($0, a - off , len - (a - off) + 1))

    off += len;

    next;
}

state == 2 {
    len = length($0);

    if (off > b) {
        # (we should not arrive here)
        exit;
    }

    if (len + off < b) {
        # End position is not on this line.
        print($0);
        off += len;
        next;
    }

    # We found the end position, output line until end position
    # and terminate.
    print(substr($0, 1, b - off));
    exit;
}

Answer