Handling heavy load in production

chris · January 20, 2016, 8:31pm

Caching

The name of the game in production is caching. You want to pay attention both to how the application’s code is cached, via Emergence’s VFS, and how your application data from SQL is cached, via ActiveRecord. Inadequate caching can prevent a site from handling its load on a capable server, or leave the application exposed to the risk that it can’t handle restarting with a cold cache under load.

VFS caching

Files inherited from a parent site are stored in a local persistent cache. Site::$autoPull configures whether the site will pull needed code/assets from its parent site as needed at runtime. This is quick and convenient for prototyping and development, but a bad idea in production. You might not notice the performance impact when a site has a hot cache already, but restarting a server under high load that has autoPull enabled will lead to a stampede as a large number of user requests get held up waiting for assets to get pulled from the parent site. The effect can be even worse if the parent site is hosted on the same machine, as both loads will share the same php-fpm worker pool.

Before turning $autoPull off, use the /site-admin/precache interface to pre-populate the parent site cache with all files available from the parent. Since this cache is persistent, this process does not need to be repeated after restarts, but it should be part of your regular process for deploying new versions to production.

With autoPull disabled and the parent cache pre-populated, the site will have no runtime dependency or load on its parent site. The parent site could go permanently offline and there would be no way to notice other than trying to run precache or pulltool. Aside from controlling whether autoPull is enabled, the VFS handles caching at all other levels intelligently by itself – ensuring disks are never read twice for the same thing, code is never parsed or compiled twice, and changes get pushed through every level immediately and incrementally.
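To make the trade-off concrete, here is a toy model in Python of a pull-through cache with and without pre-population. It is only a sketch: `LocalSite`, `resolve`, and `precache` are hypothetical names invented for the illustration, and nothing here reflects how the VFS is actually implemented.

```python
class LocalSite:
    """Toy model of a site that inherits files from a parent (not Emergence code)."""

    def __init__(self, parent_files, auto_pull):
        self.parent_files = parent_files   # stand-in for the remote parent site
        self.auto_pull = auto_pull         # plays the role of Site::$autoPull
        self.local_cache = {}              # persistent local copy of inherited files
        self.parent_requests = 0           # how often the parent had to be contacted

    def _pull(self, path):
        self.parent_requests += 1
        return self.parent_files.get(path)

    def precache(self):
        # Deploy-time step: copy everything the parent offers into the local cache.
        for path in self.parent_files:
            self.local_cache[path] = self._pull(path)

    def resolve(self, path):
        # Request-time lookup: only contacts the parent if auto_pull is enabled.
        if path in self.local_cache:
            return self.local_cache[path]
        if not self.auto_pull:
            return None
        content = self._pull(path)
        if content is not None:
            self.local_cache[path] = content
        return content


parent = {"a.php": "<?php /* a */", "b.php": "<?php /* b */"}

# Development style: cold cache, files pulled while requests wait.
dev = LocalSite(parent, auto_pull=True)
dev.resolve("a.php")
dev.resolve("b.php")
dev_runtime_pulls = dev.parent_requests

# Production style: pre-populate at deploy time, then serve with auto_pull off.
prod = LocalSite(parent, auto_pull=False)
prod.precache()
pulls_after_deploy = prod.parent_requests
prod.resolve("a.php")
prod.resolve("b.php")
prod_runtime_pulls = prod.parent_requests - pulls_after_deploy
```

In the second case all contact with the parent happens during the deploy step, so no user request ever waits on it.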

ActiveRecord caching

Any class derived from ActiveRecord can set $useCache true to cache the contents of objects looked up in the database via one of their unique keys. All updates to a record made via the proper ActiveRecord APIs (i.e. setting fields and calling $record->save()) get pushed into the cache immediately and automatically. This is a major performance advantage, especially when making heavy use of one-to-one relationships. This gain is nearly free: all you have to do is make sure that for the given table, the application isn’t making any modifications to the database with raw SQL. If your application does require raw SQL updates to a given table, you should either explicitly disable caching for that model via $useCache = false (so that enabling ActiveRecord::$useCache globally doesn’t affect it), or wrap these SQL updates in model methods that handle making necessary updates to the cache as incrementally as possible.
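The reason raw SQL and row caching don't mix can be shown with a small model. This is a generic Python sketch, not ActiveRecord itself: the `Model` class and its method names are made up for the illustration, with a dict standing in for the table and another for the cache.

```python
class Model:
    """Toy stand-in for a cached model; `db` plays the role of the SQL table."""

    def __init__(self, use_cache):
        self.use_cache = use_cache
        self.db = {}      # primary key -> row dict (the "table")
        self.cache = {}   # primary key -> cached copy of the row

    def save(self, key, row):
        # Proper API path: write the row, then push it into the cache.
        self.db[key] = dict(row)
        if self.use_cache:
            self.cache[key] = dict(row)

    def get(self, key):
        # Lookup by unique key: serve from the cache when enabled and present.
        if self.use_cache and key in self.cache:
            return dict(self.cache[key])
        row = dict(self.db[key])
        if self.use_cache:
            self.cache[key] = dict(row)
        return row

    def raw_update(self, key, **fields):
        # Raw SQL path: changes the table but knows nothing about the cache.
        self.db[key].update(fields)

    def raw_update_with_cache_fix(self, key, **fields):
        # Wrapped raw update: also drops the one affected cache entry.
        self.raw_update(key, **fields)
        self.cache.pop(key, None)


cached = Model(use_cache=True)
cached.save(1, {"title": "old"})
cached.raw_update(1, title="new")
stale_title = cached.get(1)["title"]      # the cache still holds the old row

cached.raw_update_with_cache_fix(1, title="newer")
fresh_title = cached.get(1)["title"]      # entry was dropped, so the table is re-read

uncached = Model(use_cache=False)
uncached.save(1, {"title": "old"})
uncached.raw_update(1, title="new")
uncached_title = uncached.get(1)["title"]  # no cache, so raw updates are always visible
```

The two safe outcomes correspond to the two options above: turn caching off for that model, or wrap the raw update so it touches the cache too.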

A production site should use a partial config file for the ActiveRecord class to set ActiveRecord::$useCache = true for all tables on a site, while staging/development sites leave it off so developers can easily edit data manually. Before doing this, do a search through the application’s code for UPDATE and DELETE and ensure that any tables being modified with raw SQL have $useCache locally overridden to false. General-purpose models should be built to support $useCache but not provide their own setting for it so that they can inherit a site-wide setting from ActiveRecord.

Misc application caching

Aside from the transparent row-level caching provided by ActiveRecord::$useCache, the core Cache class also provides a generic API for caching arbitrary site-level data under unique string keys. Consider it first for caching any data you’re pulling from SQL or other remote sources that’s not joined against user-personalized data. Among those, focus first on queries involving many joins and/or unable to make good use of keys. It also may be possible and advantageous, in cases of user-personalized data queries, to break the data down into two parts: the first being a shared data structure that’s pre-processed as much as possible and stored in the cache, and the second being a bare-bones query for only the user-personalized bits. Custom application code can zip the two together. It might even be worth doing more intensive things than you normally would need to for handling the current request in order to build a cacheable data structure that can eliminate the most future load.
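The two-part split for personalized data might look roughly like this. It's a Python sketch with made-up data and hypothetical function names (`load_articles`, `load_read_ids`, `get_feed`); a plain dict stands in for the site-level cache, so none of this is the Cache class API.

```python
cache = {}        # stand-in for a site-level key/value cache
query_log = []    # records which "queries" actually ran


def load_articles():
    # Pretend this is the expensive multi-join query; it has no user-specific data.
    query_log.append("articles")
    return [
        {"id": 1, "title": "Intro", "author": "ann"},
        {"id": 2, "title": "Caching", "author": "bob"},
    ]


def load_read_ids(user):
    # Bare-bones per-user query: only the personalized bits.
    query_log.append("read_ids:" + user)
    return {"ann": {1}, "bob": {1, 2}}.get(user, set())


def get_feed(user):
    # Shared part: built once, then served from the cache for every user.
    articles = cache.get("feed/articles")
    if articles is None:
        articles = load_articles()
        cache["feed/articles"] = articles
    # Personalized part: cheap query on every request.
    read_ids = load_read_ids(user)
    # Zip the two together without modifying the shared cached structure.
    return [dict(a, read=(a["id"] in read_ids)) for a in articles]


feed_ann = get_feed("ann")
feed_bob = get_feed("bob")
```

After both calls the expensive query has run once, while each user still gets their own `read` flags.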

Where it’s possible to have an event or hook actively push changes to the cached data live, skip specifying a TTL for the cached object and let the caching engine handle aging it out automatically. Where that’s impractical, set a TTL that establishes a reasonable pace of change aggregation. Even a very short TTL like 1 minute or even 1 second can make a significant performance contribution, as it will at the very least put an upper bound on how often a particular set of data needs to be fetched from the database or other remote source. In practice, though, for something like a dynamic content feed embedded on a home page, it is unlikely any users would even notice a delay as small as 1-5 minutes for new content.
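The upper bound a TTL provides is easy to check with a simulation. The Python sketch below uses a hypothetical `TTLCache` with a fake clock, and the traffic figure (100 requests per second for one minute) is an assumption picked for the example, not a measurement.

```python
class TTLCache:
    """Minimal TTL cache with an injectable clock (illustration only)."""

    def __init__(self, clock):
        self.clock = clock     # callable returning the current time in seconds
        self.items = {}        # key -> (expires_at or None, value)

    def store(self, key, value, ttl=None):
        expires_at = None if ttl is None else self.clock() + ttl
        self.items[key] = (expires_at, value)

    def fetch(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if expires_at is not None and self.clock() >= expires_at:
            del self.items[key]   # expired: treat as a miss
            return None
        return value


def simulate(requests_per_second, seconds, ttl):
    """Count how often the backing source is hit under steady traffic."""
    now = [0.0]
    cache = TTLCache(clock=lambda: now[0])
    fetches = 0
    for i in range(requests_per_second * seconds):
        now[0] = i / requests_per_second
        if cache.fetch("feed") is None:
            fetches += 1          # stand-in for the real database query
            cache.store("feed", "rendered feed", ttl=ttl)
    return fetches


total_requests = 100 * 60
fetches_1s = simulate(requests_per_second=100, seconds=60, ttl=1)
fetches_60s = simulate(requests_per_second=100, seconds=60, ttl=60)
```

With these assumed numbers, a 1-second TTL turns 6000 requests into 60 fetches, and a 60-second TTL into a single one.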

Database optimization

Standard SQL optimization practices apply, like using keys, rewriting queries to join instead of using subqueries, and using temporary tables to break big operations down and eliminate repeat searches.

When adding keys to a table managed by ActiveRecord, be sure to apply the change both to the existing table and to the $fields and/or $indexes that will be used next time an instance of that table needs to be generated. If making changes to an application that is redeployed or inherited by other sites, deploy the update to the existing table via a php-migrations script that can be pushed out alongside the update to the model’s configuration to upgrade existing instances.

Server optimization

Production servers should have multiple cores and 2GB+ of memory.

Swap file

Ensure the server has a swapfile and swappiness set to 10 (Digital Ocean has a good guide on this applicable to most servers). While you should always aim to give a server as much memory as it needs and configure applications to stay within available memory, random processes will crash if a machine runs out of memory without a swap file available. It is nearly always preferable that a swap file be available as a last resort to handle temporary overflows smoothly.

Larger PHP worker pool

A multicore production system with 2GB+ memory should be able to amply handle a 150 worker php-fpm pool, and this can be grown with more processing power and memory. Exact numbers can vary based on the load profile of the applications at hand, so you should use a benchmarking tool to see what happens when you max out a given worker pool. New connections should get refused before the server starts to thrash.

The number of PHP workers in the pool is determined by the maxClients setting for the php service under /emergence/config.json, with a default of 50 that’s not written to the initial configuration file. Changes to this value require a restart of the emergence-kernel and then of its php-fpm process to take effect.

Before going live

Use a benchmarking tool like wrk to generate at least 2x expected peak load against a website’s primary pages. If possible, use a performance metrics tool like NewRelic to observe application and server metrics during the benchmark test. You want to verify that as throughput rises gradually, response times plateau rather than growing in turn. A response time that continues to grow with load may indicate slow SQL queries that can be optimized and/or reduced in volume with caching.
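If you record response times at several load levels, the plateau check can be automated. This Python sketch is one rough way to do it: the `latency_plateaus` helper, its 25% tolerance, and both sample series are invented for illustration and don't come from any real benchmark.

```python
def latency_plateaus(samples, tolerance=0.25):
    """Return True if latency stays roughly flat over the upper half of the load range.

    samples: list of (requests_per_second, median_latency_ms) pairs, in any order.
    tolerance: allowed relative latency growth across the upper half (an assumption).
    """
    ordered = sorted(samples)
    if len(ordered) < 3:
        raise ValueError("need at least three load levels")
    upper_half = ordered[len(ordered) // 2:]
    first_latency = upper_half[0][1]
    last_latency = upper_half[-1][1]
    return last_latency <= first_latency * (1 + tolerance)


# Made-up numbers: latency levels off as throughput rises.
healthy = [(50, 40), (100, 55), (200, 60), (400, 62), (800, 65)]

# Made-up numbers: latency keeps growing with load.
degrading = [(50, 40), (100, 60), (200, 110), (400, 230), (800, 520)]

healthy_ok = latency_plateaus(healthy)
degrading_ok = latency_plateaus(degrading)
```

A series like the second one is the cue to go looking for slow queries or missing caches before launch.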

Checklist

[ ] autoPull is disabled
[ ] precache tool has been used to fully pre-populate persistent parent cache
[ ] ActiveRecord caching enabled where possible
[ ] Site::$debug is set to false and Site::$production is set to true
[ ] Site::$webmasterEmail is configured for exception reports
[ ] A server- or application-level performance monitoring service like NewRelic is set up
[ ] A load test has been run to simulate peak traffic against at least the home page

Deploying changes

It is best to deploy changes during an application’s lowest usage period, both to minimize the impact of possible slowdowns or downtime, and to keep high load from complicating operations. Schedule a maintenance window with all stakeholders including any technical support that may be needed. If possible, run a high-load benchmark after deploying a change during off-peak hours to ensure the updated system is ready to handle peak load.

Checklist

[ ] precache tool has been used to fully pre-populate persistent parent cache with any new files
[ ] A load test has been run to simulate peak traffic against at least the home page
[ ] Run any new php-migrations scripts via /site-admin/migrations during the maintenance window