File a.txt
contains about 100k words, one word per line:
july.cpp
windows.exe
ttm.rar
document.zip
File b.txt
contains 150k words, one per line. Some of the words also appear in a.txt, but some are new:
july.cpp
NOVEMBER.txt
windows.exe
ttm.rar
document.zip
diary.txt
How can I merge these files into one, remove all duplicate lines, and keep only the new lines (lines that exist in a.txt but not in b.txt, and vice versa)?
Answer 1
There is a command that does this: comm. As man comm states, it is simple:
comm -3 file1 file2
Print lines in file1 not in file2, and vice versa.
Note that comm requires its input files to be sorted, so you must sort them before calling comm, like this:
sort unsorted-file.txt > sorted-file.txt
To sum up:
sort a.txt > as.txt
sort b.txt > bs.txt
comm -3 as.txt bs.txt > result.txt
After running the commands above, you will find the expected lines in result.txt.
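For readers who would rather avoid the intermediate sorted files, roughly the same result can be sketched in Python with a set symmetric difference. This is only a minimal sketch and not part of the original answer: the helper name symmetric_lines is hypothetical, and unlike comm -3 it prints a single column (comm indents the lines unique to the second file with a tab) and collapses duplicate lines within each file.

```python
#!/usr/bin/env python3
"""Sketch: lines that occur in exactly one of two files (similar in spirit to comm -3)."""
import sys


def symmetric_lines(a_lines, b_lines):
    """Return the sorted lines that appear in exactly one of the two inputs.

    symmetric_lines is a hypothetical helper name, not part of any library.
    """
    a = {line.rstrip('\n') for line in a_lines}
    b = {line.rstrip('\n') for line in b_lines}
    # ^ is the set symmetric difference: elements in a or in b, but not in both.
    return sorted(a ^ b)


if __name__ == '__main__' and len(sys.argv) == 3:
    # Usage: script.py FILE1 FILE2
    with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
        for line in symmetric_lines(fa, fb):
            print(line)
```

Saved under a name such as symdiff.py (a hypothetical filename), it could be run as python3 symdiff.py a.txt b.txt > result.txt. Both files are held in memory as sets, which should be fine at the 100k to 150k line sizes mentioned in the question.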
Answer 2
Here is a short python3 script, based on Germar's answer, which should accomplish this while preserving the unsorted order of b.txt.
#!/usr/bin/python3
with open('a.txt', 'r') as afile:
    a = set(line.rstrip('\n') for line in afile)
with open('b.txt', 'r') as bfile:
    for line in bfile:
        line = line.rstrip('\n')
        if line not in a:
            print(line)
            # Uncomment the following if you also want to remove duplicates:
            # a.add(line)
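Answer 3
#!/usr/bin/env python3
# Load every line of a.txt into a set so membership tests stay fast.
with open('a.txt', 'r') as f:
    a = set(f.read().split('\n'))
# Print each non-empty line of b.txt that does not appear in a.txt.
# Iterating over the file avoids stopping early at a blank line.
with open('b.txt', 'r') as f:
    for line in f:
        b = line.strip('\n ')
        if b and b not in a:
            print(b)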
Answer 4
Take a look at the coreutils comm command (man comm):
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
For example, you can do this:
$ comm -13 <(sort a.txt) <(sort b.txt)
diary.txt
NOVEMBER.txt
(the lines unique to b.txt)
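The three columns described in the man page excerpt map naturally onto set operations. The following is an illustrative sketch only, using the hypothetical helper three_columns. It ignores duplicate lines within a file, which comm itself would keep.

```python
def three_columns(a_lines, b_lines):
    """Split lines into (only in A, only in B, common), like comm's three columns.

    three_columns is a hypothetical helper used for illustration only.
    """
    a = {line.rstrip('\n') for line in a_lines}
    b = {line.rstrip('\n') for line in b_lines}
    only_a = sorted(a - b)   # column 1: what comm -23 would keep
    only_b = sorted(b - a)   # column 2: what comm -13 would keep
    common = sorted(a & b)   # column 3: what comm -12 would keep
    return only_a, only_b, common


if __name__ == '__main__':
    # Sample data taken from the question.
    a_demo = ['july.cpp', 'windows.exe', 'ttm.rar', 'document.zip']
    b_demo = ['july.cpp', 'NOVEMBER.txt', 'windows.exe', 'ttm.rar',
              'document.zip', 'diary.txt']
    only_a, only_b, common = three_columns(a_demo, b_demo)
    print('only in a.txt:', only_a)
    print('only in b.txt:', only_b)
    print('in both:', common)
```

With the sample data from the question, the "only in b.txt" group holds NOVEMBER.txt and diary.txt, matching the comm -13 output shown above.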