Removing duplicate emails (.eml)

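I have a folder containing about 50,000 emails in .eml format. Many of them are duplicates, or even triplicates or quadruplicates; I would guess about 30,000 in total. I tried removing the duplicates with the Mozilla Thunderbird add-on Remove Duplicate Messages (Alternate), but it removed only a small fraction of them (a few hundred). I then tried Windows desktop applications such as Wise Duplicate Finder, Duplicate Cleaner Free, AllDup, Fast Duplicate Finder and Anti-Twin, using byte-by-byte comparison (60% match), but none of them managed to find the duplicates correctly (again, I was able to remove only some of them, this time a few thousand).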

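Below are two sample emails from my collection. Although their source differs slightly (and their filenames differ), they are essentially the same message: they were sent at the same time from the same email address, and they have the same content and file size: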

First email - message-1-34437.eml

Received: from e11mailgw02.isp.com ([212.200.12.195]) by mtain3.isp.com (Sun Java(tm) System Messaging Server 6.3-4.01 (built Aug  3 2007; 32bit)) with ESMTP id <[email protected]> for user@com; Tue, 02 Jun 2009 22:53:58 +0200 (CEST)
Received: from unknown (HELO vps.mafiascene.com) ([69.73.156.173]) by e11mailgw02.isp.com with ESMTP; Tue, 02 Jun 2009 22:53:57 +0200
Received: (qmail 24030 invoked by uid 48); Tue, 02 Jun 2009 16:53:51 -0400
Date: Tue, 02 Jun 2009 16:53:51 -0400
From: "Mafia Scene" <[email protected]>
Subject: Mafia Scene Registration Confirmation
To: <user@com>
X-Priority: 3
X-MSMail-Priority: Normal
Importance: Normal
Message-ID: <[email protected]>
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Au0JAFEuJUpFSZyt/2dsb2JhbACOFhEBsRIRCAMEj2iCMR4IBAwEgSAF
X-IronPort-AV: E=McAfee;i="5300,2777,5634"; a="7766158"
X-MimeOLE: Produced By Microsoft MimeOLE V14.0.8089.726
Old-X-EsetId: 4FAA1F2928B4776950AC1F7F23E634
X-EsetId: 745B6128E6F033696B5D617DE9A773
X-EsetScannerBuild: 6455


Thank you for registering with Mafia Scene!



The details you registered your account with at 4:53pm EDT Tuesday - 2nd June 2009 are as follows:

Username: username 
Password: password

To active your account you MUST visit the following link WITHIN the next 24 HOURS.

http://mafiascene.com/modules.php?name=users&action=activate&id=c284c0e0a7a7aec0772709511b2b8f3e

Regards,

The Mafia Scene Staff


__________ Information from ESET NOD32 Antivirus, version of virus signature database 4124 (20090602) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com





__________ Information from ESET NOD32 Antivirus, version of virus signature database 4801 (20100124) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com

Second email - message-1-54557.eml

Received: from e11mailgw02.com ([212.200.12.195])
 by mtain3.isp.com
 (Sun Java(tm) System Messaging Server 6.3-4.01 (built Aug  3 2007; 32bit))
 with ESMTP id <[email protected]> for
 user@com; Tue, 02 Jun 2009 22:53:58 +0200 (CEST)
Received: from unknown (HELO vps.mafiascene.com) ([69.73.156.173])
 by e11mailgw02.com with ESMTP; Tue, 02 Jun 2009 22:53:57 +0200
Received: (qmail 24030 invoked by uid 48); Tue, 02 Jun 2009 16:53:51 -0400
Date: Tue, 02 Jun 2009 16:53:51 -0400
From: Mafia Scene <[email protected]>
Subject: Mafia Scene Registration Confirmation
To: user@com
Message-id: <[email protected]>
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result:
 Au0JAFEuJUpFSZyt/2dsb2JhbACOFhEBsRIRCAMEj2iCMR4IBAwEgSAF
X-IronPort-AV: E=McAfee;i="5300,2777,5634"; a="7766158"
X-EsetId: 4FAA1F2928B4776950AC1F7F23E634


Thank you for registering with Mafia Scene!



The details you registered your account with at 4:53pm EDT Tuesday - 2nd June 2009 are as follows:

Username: username
Password: password

To active your account you MUST visit the following link WITHIN the next 24 HOURS.

http://mafiascene.com/modules.php?name=users&action=activate&id=c284c0e0a7a7aec0772709511b2b8f3e

Regards,

The Mafia Scene Staff


__________ Information from ESET NOD32 Antivirus, version of virus signature database 4124 (20090602) __________

The message was checked by ESET NOD32 Antivirus.

http://www.eset.com

Is there any way to detect that emails like these are duplicates?

Answer 1

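The headers differ considerably, and the body differs as well (the first message carries an extra ESET footer). Common duplicate-finding tools cannot recognize such files as duplicates.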

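You will have to brew something yourself. For example, you could write a script that extracts the information relevant to you, flags suspected duplicates, and then applies some further check to confirm that each one really is a duplicate. It will probably involve some manual work.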
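As a rough illustration of such a script, the sketch below groups .eml files by a key built from headers that look identical in both sample messages: `Date`, the sender address, and `Subject`. The function names `header_key` and `find_candidates` are hypothetical, and the choice of headers is an assumption based only on the two samples; messages that share a key are merely candidates and still need a second check.

```python
import email
import email.utils
from collections import defaultdict
from pathlib import Path


def header_key(raw: bytes) -> tuple:
    """Build a comparison key from headers that look stable across copies.

    Assumption: Date, sender address and Subject are enough to flag candidates.
    """
    msg = email.message_from_bytes(raw)
    # parseaddr drops the display name, so '"Mafia Scene" <a@b>' and
    # 'Mafia Scene <a@b>' yield the same address.
    _, sender = email.utils.parseaddr(str(msg.get("From", "")))
    # split/join collapses folded (multi-line) header values to one line.
    date = " ".join(str(msg.get("Date", "")).split())
    subject = " ".join(str(msg.get("Subject", "")).split())
    return (date, sender.lower(), subject)


def find_candidates(folder: str) -> list[list[Path]]:
    """Return groups of .eml files in `folder` that share a header key."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.eml")):
        groups[header_key(path.read_bytes())].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_candidates` on the mail folder returns lists of files worth reviewing. The `Date` strings are compared verbatim, so the same instant written in two different time zones would not match.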

An easier first step might be to just strip the headers and run the comparison on what remains.
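A minimal sketch of that idea follows. It drops everything up to the first blank line (the end of the headers), normalizes whitespace, and hashes the rest. The name `body_fingerprint` is hypothetical. Cutting the body at the first ESET footer line is an assumption drawn from the two sample messages, where the footers come after the real text and differ in number. The sketch compares the raw body and does not decode base64 or quoted-printable parts.

```python
import hashlib
import re

# Assumption: scanner footers look like the ESET blocks in the sample
# messages and always follow the real message text.
FOOTER_START = re.compile(
    r"^_{5,} Information from ESET NOD32 Antivirus", re.MULTILINE
)


def body_fingerprint(raw: bytes) -> str:
    """Hash the message body with headers, footers and spacing noise removed."""
    text = raw.decode("utf-8", errors="replace")
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # The header section ends at the first blank line.
    _, _, body = text.partition("\n\n")
    match = FOOTER_START.search(body)
    if match:
        body = body[:match.start()]
    # Strip trailing whitespace and drop blank lines, so a stray space
    # after "Username: username" does not change the hash.
    lines = [line.rstrip() for line in body.split("\n")]
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Files whose fingerprints match have the same normalized body text. It is still worth reviewing a handful of matches by hand before deleting anything.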
