Yesterday we experienced an outage affecting the main easyDNS website, the control panel, easyMail and mail forwarding.
The cause was a corrupted database in our core DB cluster. As expedient as it might be to blame Russian state-sponsored hackers, the reality is that it was operator error.
A sysadmin was working on an issue when he fat-fingered a command and wound up corrupting the primary node of our DB cluster.
Ordinarily, no big deal. Upon realizing the situation he proceeded to roll back to a snapshot taken before the failed operation. That should have been the end of it, but for some esoteric reason the rollback repeatedly failed. This rapidly devolved into an “all hands on deck” situation and despite everybody’s best efforts, it was FUBAR-ed.
The ops group then spun up a new DB cluster and had to load another snapshot from earlier in the day (2am EST), which restored services. While that was underway, they also managed to roll forward to the original snapshot we wanted all along.
In other words, Plan A failed, Plan B failed, Plan C worked, then we were able to get back to Plan B.
That said, what should have been a 10 to 30 minute outage was in fact a 3 to 4 hour one, which is sub-optimal. (But as far as completely blowing up one’s database cluster goes, it’s not catastrophic either.)
We apologize for the grief this caused. We know email is a horrible service to lose; ours was down right along with yours.
Rest assured we’re dissecting this to death today and will make appropriate adjustments.