
내가 작업하는 파일은 다음과 같습니다
NAMES n0 n1 n2 n3 n4 n5 n6 n7
REGION chr 1 100000
404 AAAAAAGA
992 TTTTTTTA
1146 CCCCGGCC
1727 CCCCCACC
1778 GCCCCCCC
내가 원하는 출력은 다음과 같습니다(두 번째 줄과 숫자가 어떻게 표시되는지 확인하세요).
file1
NAMES n0 n1 n2 n3 n4 n5 n6 n7
REGION chr 404 992
404 AAAAAAGA
992 TTTTTTTA
file2
NAMES n0 n1 n2 n3 n4 n5 n6 n7
REGION chr 1146 1778
1146 CCCCGGCC
1727 CCCCCACC
1778 GCCCCCCC
나는 awk에서 시도했다
awk 'function print_vals() {
fn="file" c;
print hdr > fn;
print "REGION chr", sn, en >> fn;
for (i in a)
print a[i] >> fn;
} NR == 1 {
hdr=$0;
c=0;
next
} NF==2 && $1 >= 1000000*c {
if (c)
print_vals();
delete a;
i=0;
c++;
sn=$1;
} NF==2 {
a[++i]=$0;
en=$1;
} END {
print print_vals();
}' file
작동한 샘플 데이터의 경우 출력을 얻었지만 실제 데이터 세트의 경우 출력을 얻지 못했습니다. 세트는 여기 있어요https://www.dropbox.com/s/h6ukumbj08cwk99/arg_t1.gz?dl=0 이거 좋을 것 같아
NAMES n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 n15 n16 $
REGION chr 1 10000000
69 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
474 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
584 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
627 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
676 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
690 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
894 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
1104 AAAAAAAAAAAAAAAAA
출력이 어떻게든 전환되는 경우...그렇게 되어야 하는 방식이 아닙니다.
NAMES n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 n15 n16 $
REGION chr 69 999927
561321 AAAAAACAAAAAAAAACAAAAAAAAAAAAAAAAAACCCAAAACAACAAAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAACAAAAACCAACA$
561362 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
562011 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
562029 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
562162 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
562171
누군가 이 문제를 어떻게 해결해야 하는지 말하거나 다른 기능을 제안할 수 있나요?
답변1
연관 배열 에서는 awk
결정되지 않은 순서로 탐색됩니다. 당신의 교체
for (i in a)
~에 의해
n = i
for(i=1;i<=n;i++)
awk를 bash 스크립트로 묶으려면 다음과 같은 것을 사용할 수 있습니다.
#!/bin/bash
for file
do awk -v file="$file" '....' "$file"
done
chmod a+rx
스크립트 파일에서 실행하는지 확인하세요 . 라인도 교체해주세요
fn="file" c;
귀하의 awk 스크립트에서
fn = c "_" file;
이 행은 새 파일 이름이 작성되는 방법입니다. awk 변수는 file
처음에 처리 중인 파일 이름의 값으로 제공됩니다(구문은 awk -v 변수=값). awk 변수 는 새 파일 이름이며 문자 및 파일 이름 변수 와 연결된 숫자를 보유하는 fn
변수입니다 .c
_
여러 파일을 인수로 사용하여 이 bash 명령을 실행할 수 있습니다. awk에 의해 하나씩 처리됩니다.
최종 결과:
#!/bin/bash
for file
do awk -v file="$file" 'function print_vals() {
fn = c "_" file;
print hdr > fn;
print "REGION chr", sn, en >> fn;
n = i
for(i=1;i<=n;i++)
print a[i] >> fn;
} NR == 1 {
hdr=$0;
c=0;
next
} NF==2 && $1 >= 1000000*c {
if (c)
print_vals();
delete a;
i=0;
c++;
sn=$1;
} NF==2 {
a[++i]=$0;
en=$1;
} END {
print print_vals();
}' "$file"
done