How can I compress multiple large, high-entropy but very similar files?

Question 1

Let's start with a 10 MB file of pseudo-random data and make two copies of it:

$ dd if=/dev/urandom of=f1 bs=1M count=10
$ cp f1 f2
$ cp f1 f3

Now let's change those copies so that they are "almost completely identical" (as you put it):

$   # Avoid typos and improve readability
$ alias random='od -t u4 -N 4 /dev/urandom |
  sed -n "1{s/^\S*\s//;s/\s/${fill}/g;p}"'
$ alias randomize='dd if=/dev/urandom bs=1 seek="$(
    echo "scale=0;$(random)$(random)$(random)$(random) % (1024*1024*10)" | bc -l
  )" count="$( echo "scale=0;$(random)$(random) % 512 + 1" |
    bc -l )" conv=notrunc'
$   # In files "f2" and "f3", replace 1 to 512 bytes of data with other
$   #+ pseudo-random data in a pseudo-random position. Do this 3
$   #+ times for each file
$ randomize of=f2
$ randomize of=f2
$ randomize of=f2
$ randomize of=f3
$ randomize of=f3
$ randomize of=f3
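
The aliases above are compact but hard to read. The same idea can be sketched with shell functions. This is only an approximation of what the aliases do: `rand_u32` and `randomize_file` are names made up for this example, the offset is taken modulo the actual file size rather than a hard-coded 10 MiB, and the run is clamped so the file never grows.

```shell
# rand_u32: print one unsigned 32-bit integer read from /dev/urandom.
rand_u32() {
    od -An -N4 -tu4 /dev/urandom | tr -cd '0-9'
}

# randomize_file FILE: overwrite 1 to 512 bytes of FILE, starting at a
# pseudo-random offset, with fresh pseudo-random data.
# (Hypothetical helper; plain POSIX sh, so its variables are global.)
randomize_file() {
    file=$1
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt 0 ] || return 1
    offset=$(( $(rand_u32) % size ))
    count=$(( $(rand_u32) % 512 + 1 ))
    # Clamp the run so the file keeps its original size.
    if [ $(( offset + count )) -gt "$size" ]; then
        count=$(( size - offset ))
    fi
    # conv=notrunc leaves the rest of the file untouched.
    dd if=/dev/urandom of="$file" bs=1 seek="$offset" count="$count" \
        conv=notrunc 2>/dev/null
}
```

With these defined, `randomize_file f2` plays the role of `randomize of=f2`.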

Now we can compress the data in each file and see what happens:

$ xz -1 < f1 > f1.xz
$ xz -1 < f2 > f2.xz
$ xz -1 < f3 > f3.xz
$ ls -lh f{1..3}{,.xz}
-rw-rw-r-- 1 myuser mygroup 10M may 29 09:31 f1
-rw-rw-r-- 1 myuser mygroup 11M may 29 10:07 f1.xz
-rw-rw-r-- 1 myuser mygroup 10M may 29 10:00 f2
-rw-rw-r-- 1 myuser mygroup 11M may 29 10:07 f2.xz
-rw-rw-r-- 1 myuser mygroup 10M may 29 10:05 f3
-rw-rw-r-- 1 myuser mygroup 11M may 29 10:07 f3.xz

As you can see, compressing actually increases the size of the data. Now let's convert the data to human-readable hex (well, sort of) and compress the result:

$ xxd f1 | tee f1.hex | xz -1 > f1.hex.xz
$ xxd f2 | tee f2.hex | xz -1 > f2.hex.xz
$ xxd f3 | tee f3.hex | xz -1 > f3.hex.xz
$ ls -lh f{1..3}.hex*
-rw-rw-r-- 1 myuser mygroup 42M may 29 10:03 f1.hex
-rw-rw-r-- 1 myuser mygroup 22M may 29 10:04 f1.hex.xz
-rw-rw-r-- 1 myuser mygroup 42M may 29 10:04 f2.hex
-rw-rw-r-- 1 myuser mygroup 22M may 29 10:07 f2.hex.xz
-rw-rw-r-- 1 myuser mygroup 42M may 29 10:05 f3.hex
-rw-rw-r-- 1 myuser mygroup 22M may 29 10:07 f3.hex.xz

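The data really did get bigger: about four times as large in hex, and about twice as large once that hex is compressed. Now for the fun part: let's compute the difference between the hex dumps and compress it: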

$ diff f{1,2}.hex | tee f1-f2.diff | xz -1 > f1-f2.diff.xz
$ diff f{1,3}.hex | tee f1-f3.diff | xz -1 > f1-f3.diff.xz
$ ls -lh f1-*
-rw-rw-r-- 1 myuser mygroup 7,8K may 29 10:04 f1-f2.diff
-rw-rw-r-- 1 myuser mygroup 4,3K may 29 10:06 f1-f2.diff.xz
-rw-rw-r-- 1 myuser mygroup 2,6K may 29 10:06 f1-f3.diff
-rw-rw-r-- 1 myuser mygroup 1,7K may 29 10:06 f1-f3.diff.xz
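
With more than two copies, the diff step above can be wrapped in a small helper. What follows is a rough sketch, not a finished tool: `store_delta` is a name made up for this example, it assumes `xxd`, `diff` and `xz` are installed, and it uses temporary files instead of keeping the `.hex` dumps around.

```shell
# store_delta BASE TARGET: write TARGET.diff.xz, the xz-compressed diff
# between the hex dumps of BASE and TARGET.
# (Hypothetical helper; plain POSIX sh, so its variables are global.)
store_delta() {
    base=$1
    target=$2
    tmp=$(mktemp -d) || return 1
    xxd "$base" > "$tmp/base.hex"
    xxd "$target" > "$tmp/target.hex"
    # diff exits with 1 when the inputs differ, which is the expected
    # case here; only a status above 1 signals a real error.
    diff "$tmp/base.hex" "$tmp/target.hex" > "$tmp/delta.diff"
    status=$?
    if [ "$status" -le 1 ]; then
        xz -1 < "$tmp/delta.diff" > "$target.diff.xz"
        status=$?
    fi
    rm -rf "$tmp"
    return "$status"
}

# Example use with the files from this answer:
#   for f in f2 f3; do store_delta f1 "$f"; done
```

The resulting files are named after the target (`f2.diff.xz`) rather than after both files as in the transcript above.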

The compressed diffs are only a few kilobytes, which is really nice. Let's summarize:

$   # All you need to save to disk is this
$ du -cb f1{,-*z}
10485760        f1
4400    f1-f2.diff.xz
1652    f1-f3.diff.xz
10491812        total
$   # This is what you would have had to store
$ du -cb f{1..3}
10485760        f1
10485760        f2
10485760        f3
31457280        total
$   # Compared to "f2"'s original size, this is the percentage
$   #+ of all the new information you need to store about it
$ echo 'scale=4; 4400 * 100 / 10485760' | bc -l
.0419
$   # Compared to "f3"'s original size, this is the percentage
$   #+ of all the new information you need to store about it
$ echo 'scale=4; 1652 * 100 / 10485760' | bc -l
.0157
$   # So, compared to the grand total, this is the percentage
$   #+ of information you need to store
$ echo 'scale=2; 10491812 * 100 / 31457280' | bc -l
33.35
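
If `bc` is not available, the same percentages can be computed with `awk`. A minimal sketch: `pct` is a helper name made up for this example. Note that `printf` rounds where `bc` with a fixed `scale` truncates, so the last digit can differ from the output above.

```shell
# pct PART WHOLE: print PART as a percentage of WHOLE, to four decimals.
# (Hypothetical helper, not a standard command.)
pct() {
    awk -v part="$1" -v whole="$2" \
        'BEGIN { printf "%.4f\n", part * 100 / whole }'
}

pct 4400 10485760        # f2's compressed diff vs. f2's size   -> 0.0420
pct 1652 10485760        # f3's compressed diff vs. f3's size   -> 0.0158
pct 10491812 31457280    # everything stored vs. all three files -> 33.3526
```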

The more files you have, the better this works. To test restoring the data from the compressed diff of "f2":

$ xz -d < f1-f2.diff.xz > f1-f2.diff.restored
$   # Assuming you haven't deleted "f1.hex":
$ patch -o f2.hex.restored f1.hex f1-f2.diff.restored
patching file f1.hex
$ diff f2.hex.restored f2.hex # No diffs will be found unless corrupted
$ xxd -r f2.hex.restored f2.restored # We get the completely restored file
$ diff -q f2 f2.restored # No diffs will be found unless corrupted
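
The restore steps above can be bundled the same way. Again just a sketch under assumptions: `restore_delta` is a made-up name, it expects a compressed diff produced as shown earlier (a `diff` of `xxd` dumps piped through `xz`), and it regenerates the hex dump of the base file on the fly, so only "f1" and the `.diff.xz` files need to be kept.

```shell
# restore_delta BASE DELTA_XZ OUT: rebuild OUT from BASE and an
# xz-compressed diff of their xxd hex dumps.
# (Hypothetical helper; plain POSIX sh, so its variables are global.)
restore_delta() {
    base=$1
    delta_xz=$2
    out=$3
    tmp=$(mktemp -d) || return 1
    xxd "$base" > "$tmp/base.hex" &&
    xz -d < "$delta_xz" > "$tmp/delta.diff" &&
    # -s keeps patch quiet, -o writes the patched hex dump to a new file.
    patch -s -o "$tmp/out.hex" "$tmp/base.hex" "$tmp/delta.diff" &&
    # xxd -r turns the hex dump back into binary data.
    xxd -r "$tmp/out.hex" "$out"
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Example use with the files from this answer:
#   restore_delta f1 f1-f2.diff.xz f2.restored && cmp f2 f2.restored
```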

Remarks

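You don't need some of the files generated here, such as the compressed versions of the original files and the compressed hex dumps. I only created them to make a point.
The success of this method depends heavily on what "almost completely identical" means, so you need to test it on your own data. In my tests it worked very well for many different types of data (database dumps, and even edited images and videos). I actually use it for some of my backups.
A more sophisticated method is to use librsync, but this approach works very well in many situations and runs in almost any *nix environment without installing new software.
On the downside, it may require some scripting.
I am not aware of a single tool that does all of this.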
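
For comparison, the librsync route mentioned above could look roughly like this, using the `rdiff` command-line tool that ships with librsync. This is a sketch I have not benchmarked against the hex-diff method: the wrapper names `rdiff_store` and `rdiff_restore` are made up, and `rdiff` has to be installed separately.

```shell
# rdiff_store BASE TARGET: write TARGET.delta, a binary delta that
# describes TARGET relative to BASE.
# (Hypothetical wrapper around librsync's rdiff tool.)
rdiff_store() {
    base=$1
    target=$2
    tmp=$(mktemp -d) || return 1
    # The signature goes to a path that does not exist yet, because
    # some rdiff versions refuse to overwrite existing files.
    rdiff signature "$base" "$tmp/base.sig" &&
    rdiff delta "$tmp/base.sig" "$target" "$target.delta"
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# rdiff_restore BASE DELTA OUT: rebuild OUT from BASE and DELTA.
rdiff_restore() {
    rdiff patch "$1" "$2" "$3"
}

# Example use with the files from this answer:
#   rdiff_store f1 f2 && rdiff_restore f1 f2.delta f2.restored
```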

Answer