Recovering from a Corrupted Zimbra Mailstore
Recently, we ran into a problem that started with Nagios alerting us that our mail server, running Zimbra 8, was unreachable. This was on a Friday, at about 5:30 pm.
The server had rebooted unexpectedly and was now running fsck on the partition that holds the Zimbra mailstore. That partition lived on a RAID1 array that had been expanded about a month earlier.
After the third reboot, I decided to remove one of the array members to have a backup plan in case fsck was not able to recover from whatever errors the filesystem had. After several more attempts to fix the filesystem errors, it became clear that there was no recovering from the corruption. df showed just 60MB of used space on a partition that originally had 1.4TB of data.
A couple of hours in, the plan shifted from reviving the service to installing a new instance of Zimbra so that email would be up and running again. Recovering the old emails would come later.
We had an up-to-date sheet of all the email accounts on that mail server, so it was easy to write a bash script that used zmprov to create email accounts with temporary passwords. The team disseminated the temporary passwords to each user while I came up with a plan to recover the lost emails.
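The account-creation step can be sketched roughly as follows. This is not the script we actually ran: the input format (one address per line), the file name accounts.zmprov, and the helper names gen_password and build_zmprov_batch are assumptions made for illustration. It only builds a batch file of zmprov "ca" (createAccount) commands, so the result can be reviewed before anything touches the server.

```shell
#!/usr/bin/env bash
# Sketch only: builds a zmprov batch file from a list of addresses.
# The input format (one address per line) and the file names are assumptions.
set -eu

# Print a random 16-character hexadecimal temporary password.
gen_password() {
    od -An -N8 -tx1 /dev/urandom | tr -d ' \n'
}

# Read addresses on stdin and print one "ca <address> <password>" line per
# valid-looking address. "ca" is zmprov's short form of createAccount.
build_zmprov_batch() {
    tr -d '\r' | while IFS= read -r email || [ -n "$email" ]; do
        case "$email" in
            ''|'#'*) continue ;;    # skip blank lines and comments
            *@*.*)   printf 'ca %s %s\n' "$email" "$(gen_password)" ;;
            *)       printf 'skipping invalid line: %s\n' "$email" >&2 ;;
        esac
    done
}

# Only act when called with an input file, e.g. ./make-accounts.sh accounts.txt
if [ "$#" -ge 1 ]; then
    build_zmprov_batch < "$1" > accounts.zmprov
    echo "Review accounts.zmprov, then run as the zimbra user: zmprov -f accounts.zmprov"
fi
```

The generated file doubles as the list of address and temporary password pairs to hand out, so it should be treated as sensitive and deleted once users have changed their passwords.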
With the mailstore data drive mounted on a fresh installation of Ubuntu Server, I fired up photorec and ran it on the whole disk. The target directory was an NFS share on our file server, since we didn't have a spare drive available at the time.
PhotoRec recovered a lot of .mbox files, which meant that it should be possible to write a script that used php-mime-mail-parser to go through all of these files and build a metadata index.
It was possible to look at the headers and figure out who an email was for in several ways. The most accurate was a header inserted by Zimbra, so something as simple as this would return the address that the email was originally delivered to:

    use Illuminate\Support\Str;

    // Assumes $headers['received'] is an array of Received header strings
    // and that this loop runs inside the function that resolves the owner.
    foreach ($headers['received'] as $header) {
        if (Str::of($header)->contains('@teleserv.com.ph')) {
            return (string) Str::of($header)->between('<', '>');
        }
    }

It worked great, but the header was not present in a lot of emails, so the rest of the time was spent letting the script run, letting it throw an error, then figuring out from the parsed information where else to take the owner from. I ended up inspecting the return-path, then checking whether there was only one recipient and it was a teleserv.com.ph address.
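The fallback order described above can be sketched as a standalone shell script. This is an illustration, not the PHP script that was actually used: the function names, the address pattern, and the reading of the return-path rule (a return-path at the local domain is taken as the owner) are assumptions. It also looks only at the first header block in a file, so it assumes one message per file.

```shell
#!/usr/bin/env bash
# Sketch: guess which local mailbox a recovered message belonged to.
# The domain pattern and the fallback order are assumptions based on the
# description above; this is not the original PHP script.
DOMAIN_RE='teleserv\.com\.ph'
ADDR_RE='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+'

# Print the header block of the message on stdin (everything before the
# first blank line), joining folded continuation lines onto one line.
unfolded_headers() {
    awk '
        { sub(/\r$/, "") }
        /^$/ { exit }
        /^[ \t]/ && started { sub(/^[ \t]+/, " "); line = line $0; next }
        { if (started) print line; line = $0; started = 1 }
        END { if (started) print line }
    '
}

# Print the lowercased addresses found in the given header(s), one per line.
# $1 is an extended regex of header names, e.g. 'To|Cc'; headers on stdin.
header_addresses() {
    grep -Ei "^($1):" | grep -Eo "$ADDR_RE" | tr '[:upper:]' '[:lower:]' | sort -u
}

# Print the guessed owner of the message file $1, or return 1 if unknown.
guess_owner() {
    local headers owner recipients
    headers="$(unfolded_headers < "$1")"

    # 1. Preferred: a Received header carrying "<user@domain>" for our domain.
    owner="$(printf '%s\n' "$headers" | grep -i '^Received:' \
        | grep -Eio "<[A-Za-z0-9._%+-]+@${DOMAIN_RE}>" | head -n 1 \
        | tr -d '<>' | tr '[:upper:]' '[:lower:]')"

    # 2. Fallback: a Return-Path at our domain (assumed to mean a local sender).
    if [ -z "$owner" ]; then
        owner="$(printf '%s\n' "$headers" | header_addresses 'Return-Path' \
            | grep -E "@${DOMAIN_RE}\$" | head -n 1)"
    fi

    # 3. Fallback: exactly one To/Cc recipient, and it is at our domain.
    if [ -z "$owner" ]; then
        recipients="$(printf '%s\n' "$headers" | header_addresses 'To|Cc')"
        if [ "$(printf '%s\n' "$recipients" | grep -c .)" -eq 1 ]; then
            owner="$(printf '%s\n' "$recipients" | grep -E "@${DOMAIN_RE}\$")"
        fi
    fi

    [ -n "$owner" ] && printf '%s\n' "$owner"
}
```

Calling guess_owner with the path to a recovered message prints the guessed address, or returns a non-zero status when none of the three rules match. Those leftovers are the cases that need a closer look by hand.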
The rest of the script then proceeded to extract the sender, subject, other recipients that were CC'd on the message, and a list of attachments that were part of the email.
This metadata was eventually turned into a small web application built on Laravel and Jetstream. Users would register with their email address, the app would use Laravel's built-in email verification, then the app would show each user a searchable list of their old email messages.
The app would decode a message on demand when a user wanted to view its contents. This meant that it was not possible to search the email body, but having date, sender, and subject as filters was enough to make it useful.
The app indexed a total of 5.7 million emails from the 1.4TB of data retrieved by photorec from the corrupted filesystem.
It's worth noting that our mission-critical email accounts are hosted on Google Workspace. The types of email that this Zimbra instance contained are non-critical. Think hourly reports with attachments and announcements.
When I write it down like this, it feels very straightforward. But I don't want to go into all the details of the things that we tried that failed. My intention in posting this is to show that if anyone else ever ends up with a corrupted Zimbra mailstore, there is hope: PhotoRec, PHP MIME Mail Parser, and Laravel can be used to recover the emails and build a quick viewer for them.