Splitting a large file into chunks without splitting entries

Answer 1

Here's a suggestion using csplit:

Splitting based on line numbers

$ csplit file.txt <num lines> "{repetitions}"

Example

Say I have a file with 1000 lines in it.

$ seq 1000 > file.txt

$ csplit file.txt 100 "{8}"
288
400
400
400
400
400
400
400
400
405

This results in the following files:

$ wc -l xx*
  99 xx00
 100 xx01
 100 xx02
 100 xx03
 100 xx04
 100 xx05
 100 xx06
 100 xx07
 100 xx08
 101 xx09
1000 total

You can get around the static limitation of having to specify the number of repetitions by pre-calculating it from the number of lines in the file.

$ lines=100
$ echo $lines 
100

$ rep=$(( $(wc -l < file.txt) / lines - 2 ))
$ echo $rep
8

$ csplit file.txt "$lines" "{$rep}"
288
400
400
400
400
400
400
400
400
405
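
The `- 2` above fits a line count that is an exact multiple of the chunk size; for other counts it still works but leaves the last piece oversized. As a minimal sketch (the function name `csplit_reps` is hypothetical, not part of csplit), the repetition count can be derived for any line count by counting the multiples of the chunk size that fall strictly before the last line:

```shell
#!/bin/sh
# Hypothetical helper: print the csplit repetition count needed to cut a
# file of $1 lines at every multiple of $2 lines.
csplit_reps() {
    total=$1 chunk=$2
    # Split points are chunk, 2*chunk, ... strictly below the last line.
    # The first one is the pattern itself, the rest are repetitions.
    echo $(( (total - 1) / chunk - 1 ))
}

csplit_reps 1000 100   # 8, matching the example above
csplit_reps 1050 100   # 9
```

It could then be used as `csplit file.txt "$lines" "{$(csplit_reps "$(wc -l < file.txt)" "$lines")}"`. A negative result means the file is no longer than one chunk, so there is nothing to split.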

Splitting based on blank lines

If, on the other hand, you simply want to split a file on the blank lines it contains, you can use this form of csplit:

$ csplit file2.txt '/^$/' "{*}"

Example

Say I've added 4 blank lines to the file.txt above and saved the result as file2.txt. You can see that they've been added manually like so:

$ grep -A1 -B1 "^$" file2.txt
20

21
--
72

73
--
112

113
--
178

179

The above shows that I've added them between the corresponding numbers in my sample file. Now when I run the csplit command:

$ csplit file2.txt '/^$/' "{*}"
51
157
134
265
3290

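You can see that I now have 5 files (xx00 through xx04), split at the 4 blank lines. Each file after the first begins with the blank line it was split on: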

$ grep -A1 -B1 '^$' xx0*
xx01:
xx01-21
--
xx02:
xx02-73
--
xx03:
xx03-113
--
xx04:
xx04-179
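
If the blank separator lines and the byte counts are not wanted, GNU csplit has options for both. The following is a sketch under the assumption that GNU csplit (coreutils 8.22 or later, for `--suppress-matched`) is installed; the temporary directory, the `entries.txt` file, and the `part_` prefix are illustrative choices, not part of the answer above.

```shell
#!/bin/sh
# Sketch assuming GNU csplit; the file and prefix names are illustrative.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Three blank-line-separated entries.
printf 'a\nb\n\nc\n\nd\ne\n' > entries.txt

# -s: do not print byte counts; -z: drop empty output files;
# -f: output file prefix; --suppress-matched: omit the blank separator lines.
csplit -s -z -f part_ --suppress-matched entries.txt '/^$/' '{*}'

wc -l part_*
```

Each `part_NN` file then holds exactly one entry, with no leading blank line.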

Answer 2

If you don't care about the order of the records, you could do:

gawk -vRS= '{printf "%s", $0 RT > "file.out." (NR-1)%15}' file.in

Otherwise, you'd first need to get the number of records, to know how many to put in each output file:

gawk -vRS= -v "n=$(gawk -vRS= 'END {print NR}' file.in)" '
  {printf "%s", $0 RT > "file.out." int((NR-1)*15/n)}' file.in
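
`RT` is specific to gawk. As a rough sketch for other awks, the same contiguous distribution can be done with POSIX features only, assuming it is acceptable to normalize every separator to exactly one blank line. The file names `demo.in` and `demo.out.N` and the chunk count `k=3` are illustrative.

```shell
#!/bin/sh
# Sketch: spread blank-line-separated records over k files, keeping order,
# using only POSIX awk features (no RT). File names are illustrative.
dir=$(mktemp -d) && cd "$dir" || exit 1
k=3

# Seven one-line records separated by blank lines.
printf '%s\n\n' r1 r2 r3 r4 r5 r6 r7 > demo.in

# First pass: count records in paragraph mode (RS="").
n=$(awk -v RS= 'END { print NR }' demo.in)

# Second pass: record i (1-based) goes to file int((i-1)*k/n).
# ORS re-adds one blank line after each record.
awk -v RS= -v ORS='\n\n' -v n="$n" -v k="$k" '
    { print > ("demo.out." int((NR - 1) * k / n)) }' demo.in

wc -l demo.out.*
```

With seven records and three files, the records land as 3, 2, and 2 per file, in their original order.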

Answer 3

Here's a solution that could work:

seq 1 $(((lines=$(wc -l </tmp/file))/16+1)) $lines |
sed 'N;s|\(.*\)\(\n\)\(.*\)|\1d;\1,\3w /tmp/uptoline\3\2\3|;P;$d;D' |
sed -ne :nl -ne '/\n$/!{N;bnl}' -nf - /tmp/file

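This works by letting the first sed write the second sed's script. The second sed first gathers input lines until it encounters a blank line, then writes the gathered lines to a file. The first sed generates the script that tells the second one where to write its output. In my test case that script looked like this: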

1d;1,377w /tmp/uptoline377
377d;377,753w /tmp/uptoline753
753d;753,1129w /tmp/uptoline1129
1129d;1129,1505w /tmp/uptoline1505
1505d;1505,1881w /tmp/uptoline1881
1881d;1881,2257w /tmp/uptoline2257
2257d;2257,2633w /tmp/uptoline2633
2633d;2633,3009w /tmp/uptoline3009
3009d;3009,3385w /tmp/uptoline3385
3385d;3385,3761w /tmp/uptoline3761
3761d;3761,4137w /tmp/uptoline4137
4137d;4137,4513w /tmp/uptoline4513
4513d;4513,4889w /tmp/uptoline4889
4889d;4889,5265w /tmp/uptoline5265
5265d;5265,5641w /tmp/uptoline5641
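
To make the boundaries easier to follow, the listing above can be reproduced with plain shell arithmetic. This is only an illustrative sketch: `gen_script` is a hypothetical name, while the divisor 16 and the `/tmp/uptoline` prefix are taken from the pipeline above.

```shell
#!/bin/sh
# Sketch: rebuild the generated sed script without seq or sed.
# gen_script is a hypothetical helper; $1 is the input's line count.
gen_script() {
    lines=$1
    step=$(( lines / 16 + 1 ))      # same step that seq is given above
    start=1
    while [ $(( start + step )) -le "$lines" ]; do
        end=$(( start + step ))
        # One script line per chunk: a d command on the start line,
        # then a w command over the range start..end.
        printf '%sd;%s,%sw /tmp/uptoline%s\n' "$start" "$start" "$end" "$end"
        start=$end
    done
}

gen_script 6000
```

For the 6000-line test file the step is 376, which yields the 15 script lines shown above.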

I tested it like this:

printf '%s\nand\nmore\nlines\nhere\n\n' $(seq 1000) >/tmp/file

This gave me a file of 6000 lines, which looked like this:

<iteration#>
and
more
lines
here
#blank

...repeated 1000 times.

After running the script above:

set -- /tmp/uptoline*
echo $# total splitfiles
for splitfile do
    echo $splitfile
    wc -l <$splitfile
    tail -n6 $splitfile
done

Output

15 total splitfiles
/tmp/uptoline1129
378
188
and
more
lines
here

/tmp/uptoline1505
372
250
and
more
lines
here

/tmp/uptoline1881
378
313
and
more
lines
here

/tmp/uptoline2257
378
376
and
more
lines
here

/tmp/uptoline2633
372
438
and
more
lines
here

/tmp/uptoline3009
378
501
and
more
lines
here

/tmp/uptoline3385
378
564
and
more
lines
here

/tmp/uptoline3761
372
626
and
more
lines
here

/tmp/uptoline377
372
62
and
more
lines
here

/tmp/uptoline4137
378
689
and
more
lines
here

/tmp/uptoline4513
378
752
and
more
lines
here

/tmp/uptoline4889
372
814
and
more
lines
here

/tmp/uptoline5265
378
877
and
more
lines
here

/tmp/uptoline5641
378
940
and
more
lines
here

/tmp/uptoline753
378
125
and
more
lines
here

Answer 4

Try awk:

awk 'BEGIN{RS=""}{f = FILENAME "." FNR; print $0 > f; close(f)}' big_db.msg
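
Writing one file per record can produce a very large number of files. As a sketch of a middle ground, a fixed number of records can be grouped into each output file; the value `per=2` and the sample data are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: write `per` blank-line-separated records to each output file,
# closing each file once it is full. Names and sizes are illustrative.
dir=$(mktemp -d) && cd "$dir" || exit 1

# Five one-line records separated by blank lines.
printf '%s\n\n' m1 m2 m3 m4 m5 > big_db.msg

awk -v RS= -v ORS='\n\n' -v per=2 '
    (NR - 1) % per == 0 {            # first record of a new chunk
        if (out != "") close(out)
        out = FILENAME "." (++chunk)
    }
    { print > out }' big_db.msg

ls big_db.msg.*
```

Because a new file is only started between records, no entry is ever split across two chunks.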
