One day, out of nowhere, we started having serious database connection issues. We reached out to everyone we knew who might be able to help (there aren't many database experts out there, honestly).

On Aug 26, 2019, at 1:47 PM, Ken Sills  wrote:

Hi Brad,

Coursicle is having a complete db disaster right now and Joe (founder, cc'd) could use someone to bounce ideas off of. Joe knows you're spread beyond thin, but if you are interested in someone else's problems for a change...

Thanks,

Ken.

Ken Sills, M.Eng. M.Sc.
CEO and CoFounder
Preteckt, Inc.
901.335.8447
Preteckt
http://www.preteckt.com/
https://www.linkedin.com/in/kensills/


On Mon, Aug 26, 2019 at 3:54 PM Joe Puccio  wrote:
Ken, thanks so much for the introduction. 

Hi Brad, we actually met when I was in Memphis and you did a talk about technical debt. 

We're currently getting some incredibly high, unexpected load on the Coursicle web server, which is causing production outages. Over the past four hours, we've narrowed things down to certain scripts that hit our MySQL database (which is hosted on the web server). The issue is that we can't figure out why these scripts are suddenly causing this, since nothing significant has changed since last week. The load on MySQL (load average is >100 on an 8-core machine) is causing MySQL connections to fail, which means web traffic, and the 1000+ people using Coursicle at any given time of day, is being affected.

We've been doing EXPLAINs on these queries and it looks like MySQL isn't using indexes that we've created even though it should be. As you can see from the screenshots, we have an index called `classID` on the crns table, but it's not being used in the query for some reason. When we added FORCE INDEX for that index, MySQL started using it, but then it stopped using one of the other indexes it had been relying on.
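
To make this concrete, here's a stripped-down Python version of the kind of check we've been running (the real query is more complex; the column name and connection details below are placeholders, not our actual schema):

# Simplified illustration of the EXPLAIN / FORCE INDEX comparison (placeholder names).
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="root", password="...", database="coursicle")
cur = conn.cursor()

# Plain EXPLAIN: possible_keys lists `classID`, but the chosen key is a different index.
cur.execute("EXPLAIN SELECT * FROM crns WHERE classID = %s", (12345,))
print(cur.fetchall())

# With FORCE INDEX, MySQL uses `classID`, but then drops the other index it had been using.
cur.execute("EXPLAIN SELECT * FROM crns FORCE INDEX (classID) WHERE classID = %s", (12345,))
print(cur.fetchall())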

I've also attached our MySQL config file, in case that could be the issue. We honestly haven't had to mess with it up until this point.

Best,

Joe Puccio
Co-founder, Coursicle
joe@coursicle.com





On Aug 26, 2019, at 2:48 PM, Brad Montgomery  wrote:

Hey Joe,

Yeah, I remember you and I remember coursicle! Unfortunately, I'm heading out the door to an event soon, but here are my instant gut reactions to what you've said (keep in mind these are very "just off the top of my head")...
First, I'd disable those scripts for a while if at all possible... anything to try to keep your site up. Then post some sort of message up for your users that you're working on some issues & that some features may be unavailable.
I can't remember what language/framework you guys are using, but I'd check if there's a quick way to enable site-wide caching... i.e. anything that lets you serve (subsequent) GET requests without hitting the DB... these are often backed by memcached or redis; there are commercial tools for this too (https://www.cloudflare.com/ maaaay have something like this). Again... just giving your DB some space so you can debug it.
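For example, a dead-simple cache-aside wrapper (untested, off the top of my head, in Python since that's what I reach for; the key name and TTL are made up) looks something like:

# Cache-aside sketch: serve repeat reads from Redis and only hit MySQL on a miss.
# Assumes a local Redis instance and the redis-py package; names are hypothetical.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def get_sections(course_id, fetch_from_db):
    key = f"sections:{course_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)              # cache hit: no DB query
    rows = fetch_from_db(course_id)            # cache miss: query MySQL once
    cache.setex(key, 300, json.dumps(rows))    # keep it for 5 minutes
    return rows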
This would hopefully buy you some time to figure out what's up w/ your indexes and your queries (I've not really touched Mysql in a long time, so I don't have much insight for you there other than googling: https://www.google.com/search?q=mysql+not+using+index   ... which I'm sure you've done ;))
Lastly (and this may be something to plan for in the future): look into running a read replica of your DB. Essentially, this would let you have a copy of the DB from which you just run reads... while scripts that do a lot of heavy writes would write to the read/write copy of the DB and won't impact those reads as much; sometimes your framework can even help with load-balancing across your DB instances (of course... you then may have to deal with issues like reading stale/old data, but that's often better than not having your site working). At the application level, the routing can be as simple as the sketch below.
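Rough idea of what that read/write split might look like (hostnames and credentials here are placeholders, and your framework may already do this for you):

# Hypothetical read/write routing helper: reads go to the replica, writes to the primary.
import mysql.connector

PRIMARY = {"host": "db-primary.internal", "user": "app", "password": "...", "database": "coursicle"}
REPLICA = {"host": "db-replica.internal", "user": "app", "password": "...", "database": "coursicle"}

def get_connection(readonly=False):
    cfg = REPLICA if readonly else PRIMARY
    return mysql.connector.connect(**cfg)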
Apologies for hastily writing this & good luck... hopefully something in there helps.



Brad Montgomery
Lead Software Engineer
Preteckt, Inc



On Thu, Aug 29, 2019 at 5:12 PM Joe Puccio  wrote:
Hi Brad, 

Sorry for the late reply. Things are finally beginning to normalize now that we're nearing the end of the first week of the semester. I can't thank you enough for taking the time to look this over and send your thoughts. Sys administration can be so anxiety-ridden, but it makes me feel a lot better knowing that I have people to turn to when I'm really stuck.

Much to our surprise, the cause wasn't MySQL failing to use an index, some table growing so large that it could no longer fit in RAM, or our poorly written queries (although those certainly didn't help). It turned out that our neighbors on the VPS decided to be abusive to the CPU that day. When we told our contacts at DigitalOcean what we were seeing, they moved those customers to a different physical machine and things immediately went back to normal. We'll be moving to a VPS with dedicated cores to prevent this sort of contention in the future (we didn't even know this sort of thing was possible; it's the first time it's happened in 6 years with DigitalOcean).

This does seem to explain one peculiar aspect of the crisis we were facing: the load average numbers were huge (100 or higher on an 8-core system), even though `top` didn't really show a whole bunch of processes heavily taxing the CPU. So, I suppose load average is one property of the physical machine that can bleed through the virtualization a bit?
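
(For reference, here's roughly how we were sanity-checking it during the incident, just with Python's standard library; nothing fancy:)

# Quick load-average check relative to core count (standard library only).
import os

load1, load5, load15 = os.getloadavg()   # 1-, 5-, and 15-minute load averages
cores = os.cpu_count()                   # 8 on our Droplet
print(f"1-min load {load1:.1f} across {cores} cores ({load1 / cores:.1f} per core)")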

Anyway, we're using this opportunity to re-examine our database and slow queries. I think a read replica of our database would certainly help since the vast majority of our queries are reads, although we'll probably want to sever our database from our web server first (right now they're on the same machine). Would you expect to see any slowdown after severing the two, since queries will have to be forwarded over the network (granted, within the same datacenter) rather than made locally on the web server? Or is it easy to more than make up for this latency with improved query performance via read replicas or other optimizations?

Thank you again for your help. 

Best,

Joe Puccio
Co-founder, Coursicle
joe@coursicle.com




On Sep 3, 2019, at 6:19 AM, Brad Montgomery  wrote:

Hi Joe,

Thanks for the follow-up; Ken had shared some info about this earlier, and I'm very surprised this ended up being the problem. I'm happy DO support got you sorted out, though. I thought this kind of thing was effectively solved by the hypervisor a long time ago.

I agree that emergency sysadmin work is super stressful; automating as much of your system build/deploy process as possible can help with that (e.g. being able to spin up a new web server in a reliable & repeatable manner in a minute or two is a lofty goal). I do think splitting your DB & web server onto separate hosts is a good idea, and if those are in the same data center, you can often configure them to communicate over a private network. Those networks are often gigabit speeds, so you shouldn't see much in the way of network latency. Once you've done this, a general rule of thumb is:
1. Scale UP your DB server (as much CPU & RAM as you can afford).
2. Scale OUT your web servers; more hosts & a load balancer in front of them will help you serve more HTTP requests. This is easiest when your application servers don't need to store any state.
Another technique for optimizing the connection between your web server and your DB is connection pooling (I have no idea how to do this with MySQL), but honestly... I'd look into that later, if/when you know that latency and/or creating connections has become a problem.

Good luck! 

Brad Montgomery
Lead Software Engineer
Preteckt, Inc




On Tue, Sep 3, 2019 at 3:22 PM Joe Puccio  wrote:
Hi Brad, 

Agreed, we were very surprised as well (although, very relieved to find out we didn't have to continue debugging our SQL queries under pressure). 

Got it, yeah we'll definitely be configuring a private network for our 10 or so Droplets; it's been on the back burner for a while but will be especially relevant when we sever the database from the web server, you're right.

Thanks very much for the advice. Given our database is a mere 6GB (and growing by about 5GB a year at most), I think paying for a Droplet that keeps our database entirely in RAM will still be pretty cheap (< $100/month).

Regarding your point (2.): basically every request that comes into our web server is either static (CSS, JS) or hits a PHP script which pulls data from the database. We may be using PHP session variables in one legacy script (which could be changed rather easily to be a POST parameter). Is this what you mean by stateless application servers (I'm guessing statelessness is important because a user's first 3 requests may get sent to server A by the load balancer, but then requests 4 and 5 get sent to server B)? 

I'll make a note to look at connection pooling this week, thanks. 
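
For what it's worth, from a first look it seems like the Python connector supports pooling out of the box; something like the sketch below (untested, placeholder names/credentials; we'd ultimately need the PHP equivalent, e.g. persistent connections in mysqli/PDO):

# Untested first pass at pooling with mysql-connector-python (placeholder names/credentials).
from mysql.connector import pooling

pool = pooling.MySQLConnectionPool(
    pool_name="coursicle_pool",
    pool_size=10,
    host="127.0.0.1",
    user="app",
    password="...",
    database="coursicle",
)

conn = pool.get_connection()   # borrow an already-open connection
try:
    cur = conn.cursor()
    cur.execute("SELECT 1")
    cur.fetchall()
finally:
    conn.close()               # returns the connection to the pool rather than closing it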

Best,

Joe Puccio
Co-founder, Coursicle
joe@coursicle.com





On Sep 3, 2019, at 3:00 PM, Brad Montgomery  wrote:

Hey Joe,

Being able to fit your DB in RAM is kind of the best possible scenario; if you're only at 6GB you should be able to keep that light & nimble for quite a while...

Regarding stateless apps: basically your web server doesn't have anything that's not part of your app's code repo; 90% of the time this means separating your DB like you're planning to do, but it may also mean finding some shared resource for user-uploaded files (like S3 or DO Spaces). Less common would be the case where your server writes any sort of data to the file system (e.g. some sort of cache or user-specific data). Logs are a typical example, but for a lot of apps, losing logs wouldn't be the end of the world. You can read more about this in https://12factor.net/processes
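
As a rough sketch (the bucket and key layout are made up; DO Spaces exposes an S3-compatible API, so boto3 works against it with a custom endpoint), that might look like:

# Sketch: write user uploads to shared object storage instead of the local filesystem,
# so any web server behind the load balancer can serve any request.
import boto3

# For DO Spaces, pass endpoint_url="https://<region>.digitaloceanspaces.com" to the client.
s3 = boto3.client("s3")

def save_upload(fileobj, filename):
    key = f"uploads/{filename}"                    # hypothetical key layout
    s3.upload_fileobj(fileobj, "coursicle-uploads", key)
    return key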

And you're absolutely right about the potential of a user's requests getting served by different servers during the course of a session.

There are a lot of great guidelines at https://12factor.net/; some are easier to achieve than others, but overall, if you follow those principles you'll find yourself in a place where your app is easier to scale out.



Hey Brad, 

Agreed. I spent the weekend reading all of MySQL's optimization documentation, and wrote a small program to analyze our database's stats based on what I learned. I've made a lot of config changes, many stemming from the fact that we have a rather unconventional setup (4000+ tables, mostly MyISAM, which are read-mostly and only modified twice a day for a minute or two). I've attached the config changes I made and the output of the program. If you have any use for something like this (and you use MySQL), I'd be happy to send you the script.
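
The core of the script is nothing clever: it pulls SHOW GLOBAL STATUS and SHOW GLOBAL VARIABLES and compares the counters against the relevant config values. A trimmed-down sketch (connection details are placeholders):

# Trimmed-down version of the analysis: compare server status counters to config values.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="root", password="...")
cur = conn.cursor()

def fetch(query):
    cur.execute(query)
    return dict(cur.fetchall())   # rows come back as (Variable_name, Value) pairs

status = fetch("SHOW GLOBAL STATUS")
config = fetch("SHOW GLOBAL VARIABLES")

print(f"MySQL has created {status['Created_tmp_tables']} tmp tables in memory.")
print(f"MySQL has created {status['Created_tmp_disk_tables']} tmp tables on disk.")
print(f"\ttmp_table_size: {int(config['tmp_table_size']) / 2**20} MB")
print(f"\tmax_heap_table_size: {int(config['max_heap_table_size']) / 2**20} MB")
print(f"MySQL has created new threads {status['Threads_created']} times "
      f"(thread_cache_size is {config['thread_cache_size']}).")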

Got it, thanks very much for the explanation and sending that link along; we're still lacking on the conventional scaling knowledge for sure. 

Best,

Joe Puccio
Co-founder, Coursicle
joe@coursicle.com



joe@coursicle:~/performance $ python customMySQLPerformanceReccomendations.py
MySQL has created 33642718 tmp tables in memory (many things can cause this, like UNIONs). Should be okay if this is high.
MySQL has created 1423097 tmp tables on disk. Now, if this is high, that's not okay, because it means a lot of disk I/O (super slow).
	This could be happening because you have a very low threshold for when a tmp table can be constructed in memory, which is defined by the lesser of:
	tmp_table_size: 16.0 MB
	max_heap_table_size: 16.0 MB

MySQL currently has 4 threads cached and ready to be used by a new query.
MySQL has created new threads 113801 times because a cached thread wasn't available to be used. The process of creating these new threads caused overhead (for instance, probably plenty of CPU cycles).
	This could be because thread_cache_size is set too low: 8, which means only 8 inactive threads were allowed to sit in the cache waiting for someone to pick them up.

MySQL has evicted 29682746 queries from the query cache because it was full and a new query needed to be cached.
	If this number is big, you'll probably want to increase the size of the query cache (16.0 MB). This should very significantly reduce query time because many queries will hit the cache.
	You may also want to adjust the query_cache_limit, which is 1.0 MB, since any queries whose results are bigger than this value are not eligible to be cached.

MySQL has had 1495 queries stuck in a waiting state due to a table lock (800774830 were immediate)

The size of the cache for indexes on MyISAM tables is 16.0 MB. You may want to increase this if you have a lot of MyISAM tables so that more of their indexes can be cached.
	It's important to note that this does not apply to the data blocks (just index blocks). Data blocks in MyISAM are cached on the OS level.
	You may even consider creating multiple key caches, which reduces contention for the cache (which can exist for the default, sole key cache).
	You may want to run 'LOAD INDEX INTO CACHE' on startup. This will load the indexes into the key cache such that their blocks are contiguous in memory (rather than semi-random), which improves performance.

MySQL is currently configured to allow 151 concurrent connections. Since startup, it's had as many as 151 concurrent connections.
	Since you've used so many of your simultaneous connections (presumably during a peak time), you should raise max_connections.
	On Linux, you can pretty safely set max_connections to 500 or 1,000 (even 10,000 if you have plenty of RAM).
	Note though that increasing the number of concurrent connections will increase the number of file descriptors MySQL may use, so open_files_limit (1024) may need to be increased as well.
	Moreover, you may need to adjust the total number of files that a process can have open (on the OS level).

MySQL currently has opened tables from disk 5017465 times
	If you're not regularly issuing 'FLUSH TABLES' and this value is large or increasing rapidly, then you should increase the table_open_cache value (currently set to 400).
	Note though that you may also need to increase open_files_limit (1024), because more simultaneously open tables means more file descriptors being used.
	And of course, you may also need to increase the OS' limit on how many files can be open at once.
	You'll probably also want to increase table_definition_cache, which is currently set to 400.

MySQL currently has a buffer pool size of 8.0 GB.
	This should be set pretty high, possibly 50-75% of RAM. But this really depends on whether you have a lot of, or large, InnoDB tables.
	It's important to know that the InnoDB buffer is used for both caching the indexes and data from InnoDB tables (note how this differs from MyISAM).
	

The thread concurrency limit for InnoDB tables is set to 0 (no limit).
	You may want to set an explicit limit so that threads aren't constantly fighting each other. Something as high as 32 is reasonable for a high-powered system.
	However, innodb_concurrency_tickets (currently 500) should be set to something reasonable (docs recommend 5,000 or less), to prevent performance degradation due to context switching.
	Specifically, too much concurrency means that threads could be constantly interrupting each other in such a way that none are able to get any work done.


# Joe's modified settings
tmp_table_size = 64M
max_heap_table_size = 64M
innodb_buffer_pool_size = 2G
thread_cache_size = 16
query_cache_size = 1G
query_cache_limit = 20M
key_buffer_size = 512M
max_connections = 300
table_open_cache = 100000
table_definition_cache = 100000
open_files_limit = 300000
innodb_thread_concurrency = 16
innodb_concurrency_tickets = 1000