This is the third in our “Data Loss Horror Stories” series, the others being Nightmare on Data Street and Data Razer. It is a dramatic telling of a tale of data loss; as always, the names and dates have been changed to protect the identity of the data loss victims.
Once, as a younger DBA, I was tasked with rebuilding a NoSQL database on a staging cluster. The client had a terrible naming convention and refused to use standard names like PROD, STAGE, etc. Instead, the clusters were identified only by port number, and the numbers were all close together: 16554, 16555, 16556, and so on.
Well, I started off correctly, using some standard scripts to wipe the database on that port, but I got an error saying there were active writes. I checked the usage, and sure enough there was activity. I followed up in the ticket, asking if they were sure it was safe, since I’d seen active writes. They confirmed that it was safe and said they had likely forgotten to turn off a script or two on their end. So I continued, running the script a couple of times to make sure any writes were cleaned out.
At this point an urgent ticket came in asking about the production cluster, as data was missing. Long story short, the port number was off by one and I’d wiped 5TB worth of production data. About 20 hours and 5TB worth of backup files later, the customer was back up and running.
Moral of the Story: Better naming conventions can help reduce errors.
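One way to act on that moral, even when the client insists on port numbers, is to have the wipe script look up a human-readable name and refuse to proceed unless the operator also types the name they expect. The Python sketch below is only an illustration: the port-to-name mapping, the `PROTECTED` set, and the `check_wipe_target` helper are all hypothetical, and the story does not say which port belonged to which cluster.

```python
# Hypothetical mapping from port number to cluster name (assumed for
# illustration; the real assignment is not given in the story).
CLUSTERS = {
    16554: "dev",
    16555: "stage",
    16556: "prod",
}

# Clusters that this script must never wipe.
PROTECTED = {"prod"}


class WipeRefused(Exception):
    """Raised when a wipe request fails a safety check."""


def check_wipe_target(port, expected_name):
    """Return the cluster name if wiping `port` is allowed, else raise.

    The caller must supply both the port and the name they believe it
    belongs to, so an off-by-one port no longer passes silently.
    """
    name = CLUSTERS.get(port)
    if name is None:
        raise WipeRefused(f"port {port} is not a known cluster")
    if name != expected_name:
        raise WipeRefused(
            f"port {port} is '{name}', but you asked for '{expected_name}'"
        )
    if name in PROTECTED:
        raise WipeRefused(f"cluster '{name}' is protected from wipes")
    return name
```

With this in place, `check_wipe_target(16555, "stage")` returns `"stage"`, while `check_wipe_target(16556, "stage")` raises `WipeRefused` instead of letting the wipe continue. An unexpected "active writes" error would be a second signal worth treating as a hard stop rather than something to retry past.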