The story you are about to read is a dramatic telling of a tale of data loss; the names and dates have been changed to protect the identity of the data loss victims.
I am a backend engineer for a large tech company, and I am regularly on call. Our company manages a lot of customer data. We had a contractor come on board to work on a project built around inverted indices to speed up our searches. Let’s call this contractor Pinhead.
As part of this project, Pinhead needed to reindex a bunch of data and decided that, to build the application properly, it was absolutely imperative to use production data to test how the app would work in production. Any time someone mentions testing something in production, I start to twitch.
So, while many of us advised against it, Pinhead was allowed to test the app in production, using our production data. Part of Pinhead’s app called an API to delete data, and that data was supposed to come from a local store. Of course, Pinhead didn’t test anything outside of production before running it in production, and due to an epic-level FAIL on Pinhead’s part, the API call ended up targeting the wrong location and deleting massive amounts of production data from S3.
However, the worst part of all this is that no one knew this was happening. After 3 TB of production data were deleted, our customer data searches started failing left and right since they couldn’t locate any of the data to run the search on. Our entire on-call team started blowing up with notifications at this point.
It wasn’t hard to figure out why the searches were failing once we were notified, but dropping everything to write scripts on the fly and recover whatever data we could from cache was not how I wanted to spend my VERY early Saturday morning.
Let’s just say, I would have loved to stick Pinhead back in his little puzzle box and send him back to the developer hell he came from.
Moral of the story: There are ways to test with production data, without testing IN production. Testing in production is just a bad idea.
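To make the moral a little more concrete, here is a minimal sketch of one guard that might have helped: a delete wrapper that refuses any target outside an explicit allowlist and defaults to a dry run. This is not the code from the incident. The names GuardedDeleter and UnsafeTargetError, the local://scratch/ prefix, and the dict standing in for a data store are all made up for illustration.

```python
from dataclasses import dataclass, field


class UnsafeTargetError(Exception):
    """Raised when a delete is aimed at a location outside the allowlist."""


@dataclass
class GuardedDeleter:
    # Hypothetical allowlist: only locations under these prefixes may be deleted.
    allowed_prefixes: tuple
    # Default to dry-run so a real delete has to be requested explicitly.
    dry_run: bool = True
    deleted: list = field(default_factory=list)

    def delete(self, location: str, store: dict) -> bool:
        """Delete `location` from `store` only if it is allowlisted.

        Returns True if something was actually removed, False on a dry run.
        """
        if not any(location.startswith(p) for p in self.allowed_prefixes):
            raise UnsafeTargetError(f"refusing to delete {location!r}")
        if self.dry_run:
            print(f"[dry-run] would delete {location}")
            return False
        store.pop(location, None)
        self.deleted.append(location)
        return True


if __name__ == "__main__":
    # A plain dict stands in for the data store (an assumption for this sketch).
    store = {"local://scratch/doc1": b"a", "s3://prod-bucket/doc1": b"b"}
    deleter = GuardedDeleter(allowed_prefixes=("local://scratch/",), dry_run=False)
    deleter.delete("local://scratch/doc1", store)
    try:
        deleter.delete("s3://prod-bucket/doc1", store)
    except UnsafeTargetError as err:
        print(err)
```

A guard like this only covers the delete path. Running the test against a copy or sample of the production data, in an environment that has no credentials for the real bucket, is the other half of testing with production data without testing in production.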