File A has several lines of genes:
A,B,C,D,E
P,Q,R
G,D,V,K
L,Q,X,I,U,G
and so on.
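Taking one line at a time, how can I get output of the following kind?
For the first line:
A,B,C
B,C,D
C,D,E
For the second line:
P,Q,R
For the third line:
G,D,V
D,V,K
Essentially, what I want is to find the "triplets" of genes in each line. The first triplet contains the first three genes. The second triplet contains the second, third, and fourth genes. The last triplet ends with the last gene in the line.
Doing this by hand would be a daunting task. Since I have not yet learned Linux, Perl, or Python scripting, I cannot write a script for this myself, so I would greatly appreciate help from this community!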
Answer 1
Using awk:
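# Print the current window as one OFS-separated record.
function wprint() {
    print w[1], w[2], w[3];
}
# Shift the window left by one and append e.
function wshift(e) {
    w[1] = w[2]; w[2] = w[3]; w[3] = e;
}
BEGIN { FS = OFS = "," }
{
    # Fill the window with the first three fields and print it.
    wshift($1);
    wshift($2);
    wshift($3);
    wprint();
    # Slide over the remaining fields, printing after each shift.
    for (i = 4; i <= NF; ++i) {
        wshift($i);
        wprint();
    }
}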
Then:
$ awk -f script data.in
A,B,C
B,C,D
C,D,E
P,Q,R
G,D,V
D,V,K
L,Q,X
Q,X,I
X,I,U
I,U,G
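The awk script uses a three-element moving window, w. For each input line, it fills the three elements of the window with the first three fields and prints them as a comma-separated list (followed by a newline). It then iterates over the remaining fields on the line, shifting each one into the window and printing the window after every shift.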
If any line in the input data contains fewer than three fields, you will get things like
A,,
or
A,B,
in the output.
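Since the question also mentions Python, here is a minimal Python sketch of the same moving-window idea. It is not part of the original answer: the function name `triplets_padded` and its `sep` parameter are made up for illustration, and the padding of short lines with empty strings is an assumption chosen to mirror what the awk script prints.

```python
def triplets_padded(line, sep=","):
    """Slide a three-element window over the fields of one line.

    Hypothetical helper mirroring the awk script: lines with fewer
    than three fields are padded with empty strings.
    """
    fields = line.split(sep)
    window = ["", "", ""]

    def shift(value):
        # Drop the oldest element and append the new one.
        window[:] = window[1:] + [value]

    # Fill the window from the first three fields ("" if a field is missing).
    for i in range(3):
        shift(fields[i] if i < len(fields) else "")
    out = [sep.join(window)]

    # Shift in each remaining field and record the window every time.
    for value in fields[3:]:
        shift(value)
        out.append(sep.join(window))
    return out


if __name__ == "__main__":
    for sample in ["A,B,C,D,E", "P,Q,R", "A,B"]:
        print("\n".join(triplets_padded(sample)))
```

A short line such as `A,B` comes out as `A,B,` here, which is the same trailing-comma behaviour described above.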
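If you are certain that every input line has at least three fields (or if you want to ignore any line that does not), you can shorten the awk script slightly: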
function wprint() {
    print w[1], w[2], w[3];
}
function wshift(e) {
    w[1] = w[2]; w[2] = w[3]; w[3] = e;
}
BEGIN { FS = OFS = "," }
{
    for (i = 1; i <= NF; ++i) {
        wshift($i);
        # Only print once three fields of this line are in the window.
        if (i >= 3) {
            wprint();
        }
    }
}
A generalization of the first variant of the script to a variable window size n:
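# The extra parameter i makes the loop counter local to each function.
function wprint(i) {
    for (i = 1; i < n; ++i) {
        printf("%s%s", w[i], OFS);
    }
    print w[n]
}
function wshift(e,i) {
    for (i = 1; i < n; ++i) {
        w[i] = w[i + 1];
    }
    w[n] = e;
}
BEGIN { FS = OFS = "," }
{
    # Fill the window with the first n fields and print it.
    for (i = 1; i <= n; ++i) {
        wshift($i);
    }
    wprint();
    # Slide over the remaining fields, printing after each shift.
    for (i = n + 1; i <= NF; ++i) {
        wshift($i);
        wprint();
    }
}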
Using it:
$ awk -v n=4 -f script data.in
A,B,C,D
B,C,D,E
P,Q,R,
G,D,V,K
L,Q,X,I
Q,X,I,U
X,I,U,G
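For comparison, a variable window size can be sketched in Python with `collections.deque` and its `maxlen` argument, which discards the oldest element automatically on append. This is only an illustrative sketch, not a translation of the script above: the function name `windows` is hypothetical, and unlike the awk version it skips lines with fewer than n fields instead of padding them.

```python
from collections import deque


def windows(line, n=3, sep=","):
    """Return every run of n consecutive fields in line, joined by sep.

    Hypothetical helper; lines with fewer than n fields yield nothing.
    """
    window = deque(maxlen=n)  # appending to a full deque drops the oldest item
    out = []
    for field in line.split(sep):
        window.append(field)
        if len(window) == n:
            out.append(sep.join(window))
    return out


if __name__ == "__main__":
    for sample in ["A,B,C,D,E", "P,Q,R", "G,D,V,K", "L,Q,X,I,U,G"]:
        for w in windows(sample, n=4):
            print(w)
```

With n=4, the `P,Q,R` line produces no output at all, whereas the awk script prints `P,Q,R,` for it.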
Answer 2
With perl:
perl -F, -le 'BEGIN { $, = "," } while(@F >= 3) { print @F[0..2]; shift @F }' file
With awk:
awk -F, -v OFS=, 'NF>=3 { for(i=1; i<=NF-2; i++) print $i, $(i+1), $(i+2) }' file
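The index-based approach of the awk one-liner carries over to Python almost directly using list slices. A minimal sketch, assuming comma-separated input; the function name `triplets` is hypothetical, and the sample lines are taken from the question:

```python
def triplets(line, sep=","):
    """Return all runs of three consecutive fields in line.

    Lines with fewer than three fields give an empty list, because
    range() of a non-positive number is empty.
    """
    fields = line.rstrip("\n").split(sep)
    return [sep.join(fields[i:i + 3]) for i in range(len(fields) - 2)]


if __name__ == "__main__":
    # Sample lines from the question; replace with lines read from a file.
    for sample in ["A,B,C,D,E\n", "P,Q,R\n", "G,D,V,K\n"]:
        for t in triplets(sample):
            print(t)
```

As with the `NF>=3` guard in the awk command, lines that are too short are silently skipped.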
Answer 3
Using Perl, we can solve it in any of the following ways:
perl -lne '/(?:([^,]+)(?=((?:,[^,]+){2}))(?{ print $1,$2 }))*$/' yourfile
perl -F, -lne '$,=","; print shift @F, @F[0..1] while @F >= 3' yourfile
perl -F, -lne '$,=","; print splice @F, 0, 3, @F[1,2] while @F >= 3' yourfile
The regex version can be written in expanded form as follows:
perl -lne '
m/
(?: # set up a do-while loop
([^,]+) # first field which shall be deleted after printing
(?=((?:,[^,]+){2})) # lookahead and remember the next 2 fields
(?{ print $1,$2 }) # print the first field + next 2 fields
)* # loop back for more
$ # till we hit the end of line
/x;
' yourfile
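A loosely analogous lookahead approach can be sketched in Python. Python's `re` module has no counterpart to Perl's embedded `(?{ ... })` code blocks, so this sketch collects the matches with `finditer` instead of printing from inside the pattern. The name `regex_triplets` is hypothetical, and the `(?<![^,])` lookbehind is an addition of mine (not in the Perl regex) so that a match can only begin at the start of a field, which matters for multi-character gene names.

```python
import re

# One field at a field boundary, followed (in a lookahead, so it is not
# consumed) by two more comma-prefixed fields.
TRIPLET = re.compile(r"(?<![^,])([^,]+)(?=((?:,[^,]+){2}))")


def regex_triplets(line):
    """Return overlapping triplets of comma-separated fields in line."""
    # Only the first field is consumed per match, so the next search
    # resumes at the following field and the triplets overlap.
    return [m.group(1) + m.group(2) for m in TRIPLET.finditer(line)]


if __name__ == "__main__":
    for sample in ["A,B,C,D,E", "P,Q,R", "G,D,V,K"]:
        print("\n".join(regex_triplets(sample)))
```

Because the two trailing fields sit inside a lookahead, each match advances by a single field, which is what produces the sliding effect.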
With sed, we can combine several of its commands to do this:
sed -e '
/,$/!s/$/,/ # add a dummy comma at the EOL
s/,/\n&/3;ta # while there still are 3 elements in the line jump to label "a"
d # else quit processing this line any further
:a # main action
P # print the leading portion, i.e., that which is left of the first newline in the pattern space
s/\n// # take away the marker
s/,/\n/;tb # get ready to delete the first field
:b
D # delete the first field, and apply the sed code all over from the beginning to what remains in the pattern space
' yourfile
dc can do this as well:
sed -e 's/[^,]*/[&]/g;y/,/ /' gene_data.in |
dc -e '
[q]sq # macro for quitting
[SM z0<a]sa # macro to store stack -> register "M"
[LMd SS zlk>b c]sb # macro to put register "M" -> register "S"
[LS zlk>c]sc # macro to put register "S" -> stack
[n44an dn44an rdn10anr z3!>d]sd # macro to print 1st three stack elements
[zsk lax lbx lcx ldx c]se # macro that initializes & calls all other macros
[?z3>q lex z0=?]s? # while loop to read in file line by line and run macro "e" on each line
l?x # main()
'
Result:
A,B,C
B,C,D
C,D,E
D,E,F
E,F,G
P,Q,R
G,D,V
D,V,K
L,Q,X
Q,X,I
X,I,U
I,U,G