Michael Maclean

How to accidentally a MySQL server

A long time ago, in a company far, far away

Our site ran on dedicated hardware, rented from a managed hosting provider that we never really got around to letting do much of the managing. As such, we all had root on all the machines, and were free to do what we wanted. We also didn’t really know a lot about configuration management at that point, so changes were largely ad hoc and unrepeatable.

At some point we found that the DB server was a bit slow for some queries, and that it had a lot more RAM than it appeared to be using. While nowadays I would probably assume that I could bump the innodb_buffer_pool_size up by some gigabytes, at that point what appeared to be a really great idea to past-Michael was to stick the MySQL temporary tables into a RAM disk. This made things fly and seemed like a totally great idea with no possible future negative consequences.
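
For the record, the trick amounted to something like the below. The mount point and size here are illustrative, since the real values are long gone:

mount -t tmpfs -o size=2G tmpfs /mnt/mysql-tmp

# my.cnf: point MySQL's temporary tables at the tmpfs mount
[mysqld]
tmpdir = /mnt/mysql-tmp

Any query that spills to an on-disk temporary table (big sorts, GROUP BYs and the like) then does its spilling to RAM instead.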

In the meantime

We were using the site to show some media that required us to implement a geographic filter for the viewers. This was implemented either by using the geolocation API on the user’s device to supply a latitude and longitude, or by having the user enter a postal code. The site could then look up whether that point fell within a certain bounding box, and if not it would redirect the viewer to another provider.
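
The bounding box test itself is trivial; in SQL it is a couple of BETWEENs. The coordinates below are a rough UK-shaped box I’ve made up for illustration, not the real licensed area:

SELECT (54.5 BETWEEN 49.9 AND 60.9)   -- is the user's latitude inside the box?
   AND (-2.0 BETWEEN -8.2 AND 1.8)    -- is the user's longitude inside the box?
       AS in_area;

The expensive part, it turned out, was elsewhere.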

Eventually

We came to move to AWS. We had everything built out with configuration management, deployed our site to EC2, used memcached provided by ElastiCache, and finally put our MySQL database on RDS.

We had a couple of runs at switching it over from the dedicated host. The second time it finally stuck, after bumping the database size up a bit. That evening we had one of those media events that needed the geographic filter. I had hoped that we wouldn’t need to use it, but it turned out we had to. I hung around in the office to watch what ensued.

About 10 minutes prior to the event, the site fell all the way over. Something was thrashing the database in a way that hadn’t been happening before.

Skip to the end

It turned out our geographic filter was implemented like this:

  1. User provides their latitude and longitude via the geolocation API on a device like an iPad
  2. The site does a query against a table containing a snapshot of the Ordnance Survey Codepoint database. This is a list of every known UK postal code and the coordinates of its centre point, which came to something like 350 000 rows.
  3. This query calculates a pretty accurate great circle distance, using spherical trigonometry, between the user-supplied point and each of the 350 000 rows in the Codepoint table. (This is usually billed as the Haversine formula, but strictly it is the spherical law of cosines; the formula is written out below.)
  4. The query finishes up with ORDER BY distance LIMIT 1, which throws away 349 999 of those rows and returns the nearest postcode. The site can then check whether that postcode falls within the area for which we held the rights.
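
For reference, the spherical law of cosines gives the great circle distance between two points (φ1, λ1) and (φ2, λ2) on a sphere of radius R as:

d = R · acos( sin(φ1)·sin(φ2) + cos(φ1)·cos(φ2)·cos(λ2 - λ1) )

With R = 6371, the Earth’s mean radius in kilometres, that is exactly the expression you can see in the SQL below.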

The query looked like the one below, and I suspect it was taken from this Stack Overflow answer, which is from around the right time period.

SELECT
  postcode,
  -- Spherical law of cosines: with 6371 km as the Earth's radius,
  -- this yields the great circle distance in km from the
  -- user-supplied point to each row's centre point
  ( 6371 * acos( cos( radians(18.204540500000) )
                 * cos( radians( latitude ) )
                 * cos( radians( longitude ) - radians(-66.450958500000) )
                 + sin( radians(18.204540500000) )
                 * sin( radians( latitude ) )
               )
  ) AS distance
FROM codepoint
ORDER BY distance  -- sorts all 350 000 computed distances...
LIMIT 1            -- ...and keeps only the nearest

On our dedicated hardware, with the oh-so-clever RAM disk trick, this query took 0.2 seconds. On RDS, where you are prevented from doing bullshit like that, it took 14 seconds. As it was not even slightly cacheable, the traffic brought down the DB, and hence the site, in pretty short order.

The fix

We moved it to Elasticsearch, where the equivalent query executed in about 0.4ms.
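
For the curious, the Elasticsearch version of the lookup is roughly the below, POSTed to /codepoint/_search. This is a sketch assuming an index called codepoint with a geo_point field called location; I don’t remember the exact mapping we used:

{
  "size": 1,
  "sort": [
    {
      "_geo_distance": {
        "location": { "lat": 18.2045405, "lon": -66.4509585 },
        "order": "asc",
        "unit": "km"
      }
    }
  ]
}

The speedup comes from Elasticsearch answering this from a purpose-built geo index, rather than computing 350 000 great circle distances on every request.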

Postscript

This was probably only the third-worst failure we experienced while I was there.