Recovering from service- and host-level failures



The daemon processes that emergence manages (i.e. php-fpm, mysql, nginx, postfix) can fail independently for a variety of reasons: MySQL will hang if the drive runs out of space, php-fpm might crash randomly if a buggy extension is loaded, emergence may fail to restart all processes after an unclean system reboot, etc. Emergence does not currently monitor the processes it starts for independent failures, so when this happens it is necessary to manually reset the state of the system.

Recovering MySQL from a full hard disk

If the system has hung due to a full hard disk, minimize the risk of data loss by freeing up some space and sending MySQL a friendly shutdown signal before attempting to reset anything. First, stop nginx and php-fpm via the emergence control panel or, if necessary, via killall nginx && killall php-fpm on the shell. This ensures that no user traffic or log growth from those processes will impede your recovery of MySQL.

While MySQL is still running and hung, free up some initial hard drive space. Some good places to find space in an emergency are:

  • Clear out etmp directories or other files under /tmp
  • Delete large downloads in home directories
  • Delete or xz compress large access logs under /emergence/sites/*/logs/access.log
  • Delete extraneous database backup dates for large databases within /emergence/sql-backups
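
As a rough sketch of how to find the space worth reclaiming, the helper below lists the largest entries directly under a directory. The function name largest_entries is made up for this example; it relies only on the standard du, sort, and head utilities and deletes nothing itself.

```shell
# Hypothetical helper: print the N largest entries directly under DIR,
# biggest first, with sizes in KiB. Nothing is deleted.
largest_entries() {
    dir=$1
    count=${2:-10}   # default to the top 10 entries
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example survey of the locations listed above (run as root so that
# every file is readable):
#   df -h /emergence /tmp
#   largest_entries /tmp
#   largest_entries /emergence/sql-backups 20
#   ls -lS /emergence/sites/*/logs/access.log
```

Reviewing the output before deleting anything keeps the cleanup deliberate, which matters while MySQL is still holding unflushed state.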

Once you’ve freed up at least a few hundred megabytes of space, try connecting to MySQL via the command line client and see if you can get to its command prompt. Reaching the prompt indicates that the MySQL server is responding and has likely flushed its state to disk. Run RESET MASTER to delete its binary logs, freeing up additional space.
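
The same check can be done non-interactively. The sketch below assumes the stock mysql command line client is on the PATH and that any connection options (user, password, socket) are passed as arguments; the function names and the MYSQL_CLIENT override variable are invented for this example and are not part of emergence.

```shell
# Hypothetical helpers. MYSQL_CLIENT may name an alternate client binary;
# it defaults to "mysql". Extra arguments are passed through as
# connection options.

# Succeeds only if the server answers a trivial query.
mysql_responds() {
    "${MYSQL_CLIENT:-mysql}" "$@" -e 'SELECT 1' >/dev/null 2>&1
}

# RESET MASTER deletes every binary log file, so run it only once the
# server is known to be responding.
purge_binary_logs() {
    "${MYSQL_CLIENT:-mysql}" "$@" -e 'RESET MASTER'
}

# Example (connection options are placeholders):
#   if mysql_responds -u root; then purge_binary_logs -u root; fi
```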

Stop all running processes

The first step in resetting the system is to stop all running service processes so emergence can start fresh. Open emergence’s web-based control panel and issue a stop command through it to each service. It’s ok if any of them remain stuck in the “online” state; we’ll get to them.

Then on the system’s shell, run ps aux | grep 'php\|mysqld\|nginx' to discover if any processes are still running. If you find any, run killall nginx; killall mysqld; killall php-fpm; killall php5-fpm to send them all the SIGTERM signal. They should shut down cleanly after that, as long as the machine isn’t experiencing a hardware failure or a completely full disk. If you absolutely must, killall -9 can be used instead to force the processes to exit with SIGKILL. There is generally no risk in doing this with php-fpm or nginx, but use caution with mysqld, as you may leave some recent data unwritten or a table corrupted. In most cases, though, any tables corrupted by an unclean shutdown can be fixed quickly with MySQL’s built-in repair tools.
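
One way to script the SIGTERM step is sketched below. It swaps in pkill and pgrep (from the procps package) for killall, since they match on the exact process name and make it easy to poll until the processes are gone. The function name term_and_wait and the timeout values are assumptions made for this example.

```shell
# Hypothetical helper: send SIGTERM to every process with the exact
# name NAME, then wait up to TIMEOUT seconds for them to exit.
# Returns 0 once none remain, non-zero if some are still running.
term_and_wait() {
    name=$1
    timeout=${2:-30}
    # pkill exits non-zero when nothing matched; that is not an error here.
    pkill -TERM -x "$name" 2>/dev/null || true
    while [ "$timeout" -gt 0 ]; do
        pgrep -x "$name" >/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    ! pgrep -x "$name" >/dev/null
}

# Example (run as root); give mysqld longer so it can flush to disk:
#   for svc in nginx php-fpm php5-fpm; do term_and_wait "$svc" 15; done
#   term_and_wait mysqld 120 || echo "mysqld is still running"
```

The sketch deliberately never escalates to SIGKILL on its own, so the decision to force mysqld down stays with the operator.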

Reset emergence’s tracking of process state

After all the running daemon processes have been cleared from the system, you need to manually erase any remaining PID files so emergence knows they’re gone. Go into the /emergence/services/run directory and run ls -l *. If you see any files listed, run rm */*.

Next, ensure the kernel.sock file has been removed. Cd into the /emergence/ directory and run sudo rm kernel.sock.
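
The two cleanup steps above can be combined as in the sketch below. The function name clear_stale_state is made up for this example, and the optional base-directory argument exists only so the function can be tried against a scratch directory first; it defaults to /emergence.

```shell
# Hypothetical helper: remove leftover files under services/run/<service>/
# and the kernel socket. Only run this after every daemon has exited.
clear_stale_state() {
    base=${1:-/emergence}
    # -f keeps rm quiet when a pattern matches nothing
    rm -f "$base"/services/run/*/*
    rm -f "$base"/kernel.sock
}

# Example (run as root):
#   ls -l /emergence/services/run/*
#   clear_stale_state
```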

Restart emergence kernel

Finally, restart the emergence kernel. Start by shutting it down and ensuring the kernel’s API socket is erased:

sudo service emergence-kernel stop
sudo rm /emergence/kernel.sock

If that goes well, it’s time to start everything back up: sudo service emergence-kernel start

With the PID files all erased and the daemon processes terminated, emergence should be able to bring everything back up cleanly on its next start, as long as no lower-level problems remain, such as a lack of disk space.
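
To confirm the restart took effect, one option is to wait for the kernel’s API socket to reappear. This sketch assumes the kernel recreates /emergence/kernel.sock when it starts, which the steps above imply but do not state outright; the function name wait_for_socket is invented for this example.

```shell
# Hypothetical helper: wait up to TIMEOUT seconds for a Unix socket to
# exist at SOCK. Returns 0 if it appears, non-zero otherwise.
wait_for_socket() {
    sock=$1
    timeout=${2:-15}
    while [ "$timeout" -gt 0 ]; do
        [ -S "$sock" ] && return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    [ -S "$sock" ]
}

# Example:
#   sudo service emergence-kernel start
#   wait_for_socket /emergence/kernel.sock 30 || echo "kernel socket missing"
```

If the socket never appears, check free disk space again before retrying the earlier steps.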


Installing and Updating SSL Certificates with Emergence