특정 패턴으로 파일 분할

특정 패턴으로 파일 분할

내가 작업하는 파일은 다음과 같습니다

NAMES   n0  n1  n2  n3  n4  n5  n6  n7
REGION  chr 1   100000
404 AAAAAAGA
992 TTTTTTTA
1146    CCCCGGCC
1727    CCCCCACC
1778    GCCCCCCC

내가 원하는 출력은 다음과 같습니다(두 번째 줄과 숫자가 어떻게 표시되는지 확인하세요).

file1
 NAMES  n0  n1  n2  n3  n4  n5  n6  n7
    REGION  chr 404 992
    404 AAAAAAGA
    992 TTTTTTTA

file2
 NAMES  n0  n1  n2  n3  n4  n5  n6  n7
     REGION chr 1146    1778
1146 CCCCGGCC
1727 CCCCCACC
1778 GCCCCCCC

나는 awk에서 시도했다

awk 'function print_vals() {
   fn="file" c;
   print hdr > fn;
   print "REGION  chr", sn, en >> fn;
   for (i in a)
      print a[i] >> fn;
} NR == 1 {
   hdr=$0;
   c=0;
   next
} NF==2 && $1 >= 1000000*c {
   if (c)
      print_vals();
   delete a;
   i=0;
   c++;
   sn=$1;
} NF==2 {
   a[++i]=$0;
   en=$1;
} END {
   print print_vals();
 }' file

작동한 샘플 데이터의 경우 출력을 얻었지만 실제 데이터 세트의 경우 출력을 얻지 못했습니다. 세트는 여기 있어요https://www.dropbox.com/s/h6ukumbj08cwk99/arg_t1.gz?dl=0 이거 좋을 것 같아

NAMES   n1  n2  n3  n4  n5  n6  n7  n8  n9  n10     n11     n12     n13     n14     n15     n16 $
REGION  chr     1   10000000
69  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
474     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
584     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
627     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
676     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
690     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
894     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
1104    AAAAAAAAAAAAAAAAA

출력이 어떻게든 전환되는 경우...그렇게 되어야 하는 방식이 아닙니다.

NAMES   n1  n2  n3  n4  n5  n6  n7  n8  n9  n10     n11     n12     n13     n14     n15     n16 $
REGION  chr 69 999927
561321  AAAAAACAAAAAAAAACAAAAAAAAAAAAAAAAAACCCAAAACAACAAAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAACAAAAACCAACA$
561362  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
562011  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
562029  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
562162  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA$
562171

누군가 이 문제를 어떻게 해결해야 하는지 말하거나 다른 기능을 제안할 수 있나요?

답변1

연관 배열 에서는 awk결정되지 않은 순서로 탐색됩니다. 당신의 교체

for (i in a)

~에 의해

n = i
for(i=1;i<=n;i++)

awk를 bash 스크립트로 묶으려면 다음과 같은 것을 사용할 수 있습니다.

#!/bin/bash
for file
do  awk -v file="$file" '....' "$file"
done

chmod a+rx스크립트 파일에서 실행하는지 확인하세요 . 라인도 교체해주세요

fn="file" c;

귀하의 awk 스크립트에서

fn = c "_" file;

이 행은 새 파일 이름이 작성되는 방법입니다. awk 변수는 file처음에 처리 중인 파일 이름의 값으로 제공됩니다(구문은 awk -v 변수=값). awk 변수 는 새 파일 이름이며 문자 및 파일 이름 변수 와 연결된 숫자를 보유하는 fn변수입니다 .c_

여러 파일을 인수로 사용하여 이 bash 명령을 실행할 수 있습니다. awk에 의해 하나씩 처리됩니다.


최종 결과:

#!/bin/bash
for file
do awk -v file="$file" 'function print_vals() {
   fn = c "_" file;
   print hdr > fn;
   print "REGION  chr", sn, en >> fn;
   n = i
   for(i=1;i<=n;i++)
      print a[i] >> fn;
 } NR == 1 {
   hdr=$0;
   c=0;
   next
 } NF==2 && $1 >= 1000000*c {
   if (c)
      print_vals();
   delete a;
   i=0;
   c++;
   sn=$1;
 } NF==2 {
   a[++i]=$0;
   en=$1;
 } END {
   print print_vals();
 }'  "$file"
done

관련 정보