Nick Shubin

Finding Duplicates in the Mail Database

April 14, 2012

You may be one of those who back up all critical information on the computer, including sent and received messages. If you use Apple Mail intensively, your mail database is probably fairly large, and backups can quickly take up a lot of space on your backup drive.

One effective way to reduce the database size is to remove duplicates. You can do this with one of the AppleScripts available on the Internet, but I find them not very effective, especially when you receive many messages with attachments.

This article describes another method, in which all cleaning is performed at the file system level. This method can potentially lead to data loss, so back up the Mail folder before you do anything; its location is given below. The Get Backup program can automate the backup.

The Mailbox Structure

The mail database is located in the user's Home folder:
~/Library/Mail/

Quit Mail.app before performing any operations on files in this folder.
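If you prefer to script the backup mentioned above rather than use a dedicated program, the copy can be sketched in Python. This is only a minimal illustration: the function name `backup_folder` and the timestamped folder naming are my own assumptions, not part of any tool named in this article.

```python
import shutil
import time
from pathlib import Path


def backup_folder(source: Path, backup_root: Path) -> Path:
    """Copy `source` into a new timestamped folder under `backup_root`."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{source.name}-{stamp}"
    # copytree refuses to overwrite an existing folder,
    # so an older backup is never silently replaced.
    shutil.copytree(source, target, symlinks=True)
    return target
```

For example, `backup_folder(Path.home() / "Library" / "Mail", Path("/Volumes/Backup"))` would copy the Mail folder to a drive mounted at /Volumes/Backup (a hypothetical destination; use your own). Quit Mail.app first, as noted above.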

The details here relate to Mac OS X 10.7 Lion. Earlier (and possibly later) versions of the operating system have an identical mailbox structure.

On Mac OS X 10.7 Lion, the Library folder in the user's Home is hidden by default. To open it in the Finder, choose Go > Go to Folder (Cmd-Shift-G) and enter the path.

The mailboxes listed in the left-hand panel of Mail.app are physically stored inside the Mailboxes folder.

Files_1

If a mailbox has sub-mailboxes (for instance, Online_Shopping), the respective folder Online_Shopping.mbox will have subfolders with the same names.

Mailbox

Files_2

Inside each mailbox there is an Info.plist file in XML format containing the name of the mailbox and some settings. Note that if you have mailboxes with the same name (for instance, named by year as in the second picture), their Info.plist files can have identical content. A duplicate finder will then treat these files as copies of each other, which creates a risk of deleting all of them but one. This issue is discussed further below.

If you go deeper, you'll get to the messages and the attachments:

Files_3

Files_4

A folder with attachment(s) has the same name as the respective message. If, for example, a message 580210.emlx includes several attachments, the corresponding folder named 580210 contains the attached files in the original format (images, spreadsheets, and so on).
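The naming rule above can be checked with a short script. The following is a sketch under the assumption stated in this article, namely that an attachment folder carries the same number as its message; the function name `attachment_folders` is hypothetical, and the script only reads the folder tree.

```python
import os
from pathlib import Path


def attachment_folders(mailbox: Path) -> dict[str, list[Path]]:
    """Map each message number to the folders that share its name."""
    message_ids = set()
    candidate_dirs = []
    for dirpath, dirnames, filenames in os.walk(mailbox):
        for name in filenames:
            if name.endswith(".emlx"):
                # "580210.emlx" -> "580210"
                message_ids.add(name[: -len(".emlx")])
        for name in dirnames:
            candidate_dirs.append(Path(dirpath) / name)
    result: dict[str, list[Path]] = {}
    for folder in candidate_dirs:
        if folder.name in message_ids:
            result.setdefault(folder.name, []).append(folder)
    return result
```

Calling it on a mailbox folder returns, for a message such as 580210.emlx, the list of folders named 580210 found anywhere below that mailbox.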

As you can see, Mail.app gives sequential numeric names to the messages, so if there are copies, the Finder cannot help you find them. You need a program that can compare the content of files.

Finding Duplicates

At first, I used an AppleScript from the Mail Scripts package created by Andreas Amann to find duplicates.

Why did I decide to look for something different? An AppleScript does not perform as well as a stand-alone program with identical functionality. The script uses message IDs to find duplicates, whereas I prefer comparing the message content. Also, I'm not sure that the script can find duplicated attachments.

It is therefore better to use a program designed to find duplicate files on the hard drive and to limit its search to the Mailboxes folder. For this task I used Find Duplicate Files by Araxis. It lets you find copies and then select those you wish to delete. The analysis runs quite quickly, and the result is displayed in a table.

FDF_main_window

Identical files are shown in the same color. The program calculates a hash to ensure that the files it finds are really copies. You just need to click "Select Duplicated Items" and then "Delete Selected Items". You can also select or deselect files manually to delete only what you need. Deletion takes a long time; it is much slower than deleting files in the Finder.
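To illustrate the idea of comparing file content by hash, here is a minimal Python sketch. It is not how Find Duplicate Files works internally (I don't know its implementation); it shows one common approach, which is to group files by size first and then compare SHA-256 digests within each group. The names `file_digest` and `find_duplicates` are my own. The script only reports groups of identical files and deletes nothing.

```python
import hashlib
import os
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under `root` that have identical content."""
    by_size = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_file() and not path.is_symlink():
                by_size[path.stat().st_size].append(path)
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Grouping by size first avoids hashing files that cannot possibly match, which matters in a mailbox with many large attachments.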

You have an option to search inside the whole Mailboxes folder, or inside each mailbox independently.

Searching in the entire Mailboxes folder

Pros:
- The process doesn't require much of your attention.
- Since duplicates can be in different mailboxes, you'll find all of them.

Cons:
- If you have sub-mailboxes with the same name, their Info.plist files can be deleted by mistake. This can also happen if you clean several mailboxes. To avoid this, deselect these files in the list. How to restore an accidentally deleted Info.plist is described below.
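If you process a list of duplicates with a script instead of deselecting files by hand, the same precaution can be expressed as a filter. This is a sketch only: `groups` stands for lists of identical files as reported by any duplicate finder, and the names `PROTECTED_NAMES` and `deletion_candidates` are hypothetical.

```python
from pathlib import Path

# File names that must never be removed, even when their content is identical.
PROTECTED_NAMES = {"Info.plist"}


def deletion_candidates(groups: list[list[Path]]) -> list[Path]:
    """Pick all but one file from each duplicate group, skipping protected files."""
    to_delete = []
    for group in groups:
        if any(path.name in PROTECTED_NAMES for path in group):
            continue  # leave groups of Info.plist files untouched
        # Keep the first copy (in sorted order) and mark the rest.
        to_delete.extend(sorted(group)[1:])
    return to_delete
```

The function returns paths only; reviewing that list before deleting anything is the scripted counterpart of checking the selection in the table.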

Searching in each mailbox individually

Pros:
- The job can be split into parts of about 1-2 hours each (depending on the mailbox content).

Cons:
- Duplicates located in mailboxes that you scan in different sessions won't be found.

 

Restoring Info.plist

If you delete this file, the corresponding mailbox will disappear from Mail, but its messages will remain on disk.

Typical content of an Info.plist file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>DisplayInThreadedMode</key>
<string>yes</string>
<key>MailboxName</key>
<string>2012</string>
<key>SortOrder</key>
<string>received-date</string>
<key>SortedDescending</key>
<string>YES</string>
</dict>
</plist>

To restore a deleted Info.plist, copy it from another mailbox and enter the correct mailbox name. In the example above, you need to replace "2012" with the proper name.
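The copy-and-rename step can also be done with Python's standard plistlib module. The sketch below assumes the file has the layout shown above, with the name stored under the MailboxName key; the function name `restore_info_plist` is hypothetical.

```python
import plistlib
from pathlib import Path


def restore_info_plist(template: Path, mailbox_dir: Path, mailbox_name: str) -> Path:
    """Recreate Info.plist in `mailbox_dir` from another mailbox's file."""
    with open(template, "rb") as f:
        settings = plistlib.load(f)
    # Only the name changes; the other settings are copied as they are.
    settings["MailboxName"] = mailbox_name
    target = mailbox_dir / "Info.plist"
    if target.exists():
        raise FileExistsError(f"{target} already exists, nothing to restore")
    with open(target, "wb") as f:
        plistlib.dump(settings, f)
    return target
```

The function refuses to overwrite an existing Info.plist, so it cannot damage a mailbox that is still intact.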

 

Duplicated Attachments: Where Do They Come From?

An *.emlx file (a message) can be opened in a text editor. At the beginning is the full header (you can show it in Mail by choosing View > Message > All Headers). The body of the message follows.

*.emlx files can contain attachments. The binary data of an attached file is encoded as text and looks like a chaotic sequence of letters and numbers:
mgCfAKQAqQCuALIAtwC8AMEAxgDLANAA1QDbAOAA5QDrAPAA9gD7AQEBBwENARMBGQEfASUBKwEy
ATgBPgFFAUwBUgFZAWABZwFuAXUBfAGDAYsBkgGaAaEBqQGxAbkBwQHJAdEB2QHhAekB8gH6AgMC
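Text like this can be turned back into binary data. The sketch below assumes the sample is Base64, the usual text encoding for attachments in e-mail messages; the two lines are only a fragment of an attachment, so the result is a fragment of the original file, not something you can open.

```python
import base64

# The two sample lines quoted above, exactly as they appear in the message.
encoded_lines = [
    "mgCfAKQAqQCuALIAtwC8AMEAxgDLANAA1QDbAOAA5QDrAPAA9gD7AQEBBwENARMBGQEfASUBKwEy",
    "ATgBPgFFAUwBUgFZAWABZwFuAXUBfAGDAYsBkgGaAaEBqQGxAbkBwQHJAdEB2QHhAekB8gH6AgMC",
]

# Line breaks are not part of the data, so join the lines before decoding.
decoded = base64.b64decode("".join(encoded_lines))

# Every 4 characters of text carry 3 bytes of binary data.
print(len("".join(encoded_lines)), "characters ->", len(decoded), "bytes")
```

The 4-to-3 ratio also explains why an attachment stored inside a message takes roughly a third more space than the same file in its original format.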

As soon as you receive a message, Mail saves the attached file in its original format into the Attachments folder, even if you didn't preview or save it manually. From that moment, you have two copies of the attached file: one inside the message and a separate one. Mail can restore files in the Attachments folder if they were deleted; this seems to be triggered by the Mailbox > Rebuild command in the main menu. In my experience, running Rebuild several times created multiple copies of attached files, increasing the mailbox size by 20-30%.
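To see how much of a mailbox is taken up by separately stored attachments, you can measure the folders. The following sketch assumes that these files live under folders named Attachments, as described above; the function name `attachments_share` is hypothetical, and the script only reads file sizes.

```python
import os
from pathlib import Path


def attachments_share(root: Path) -> tuple[int, int]:
    """Return (bytes inside Attachments folders, total bytes) under `root`."""
    attachments = 0
    total = 0
    for dirpath, _dirs, filenames in os.walk(root):
        # True for any folder located at or below a folder named "Attachments".
        inside = "Attachments" in Path(dirpath).relative_to(root).parts
        for name in filenames:
            size = (Path(dirpath) / name).lstat().st_size
            total += size
            if inside:
                attachments += size
    return attachments, total
```

Running it on the same mailbox before and after a Rebuild is one way to check whether the attachment folders have grown.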

 


© 2016 Nick Shubin. All rights reserved.