Why Pipes has new Timeouts and a new Pricing Plan: An Incident Report

Summary: Among other optimizations, Pipes had to introduce a limit on how long each free pipe can run, to keep the server from falling over. A new, cheaper paid plan is meant as an option for those who need the longer execution time back but are not covered by the enterprise-targeted plans. The following describes in detail what happened.

The incident

On December 23, Pipes went offline. During the last year this happened just a few times. The server is meant to autorestart Pipes’ Ruby service via supervisor if something goes wrong, but there are several situations where this can fail, e.g. when the Out-Of-Memory (OOM) protection kills the Ruby process of the webserver we use, puma. That OOM situation did happen a bit too often a while ago, but earlier in the year a server admin colleague of mine gave expert advice on how to configure the server for our workload, and since then Pipes had been stable. So when I got notified this time I restarted the service and chalked it up to a load spike and bad luck. Pipes was up and running.

/blog/upload/joshua-aragon-FkjaN-7gWC0-unsplash.jpg

But then it broke again. And again. Restarting the service, or even the whole server, only helped for a short while. Something was wrong and I had to get serious about it, otherwise Pipes could have been unstable over the holidays. And no one wants to work on this while preparing a Christmas dinner ;) But the site being unavailable for that long was not an option either.

Investigation

So this is what I tried, escalating step by step:

  1. I investigated why the Ruby process was dying. At first the logs showed it was indeed the OOM killer just doing its work, but why was it triggered now?
  2. To get an answer, I ran the local development environment with a production-size database. This revealed that the hourly SQLite VACUUM after a cache clean was very heavy with such a slightly bigger database; together with the base load that was enough to be a problem for the server (see the first sketch after this list). Maybe that was the root cause: our database had grown over time and now hit a critical level? Indeed, after removing the VACUUM instruction Pipes came up nicely and worked for a while. But no luck, after a while the site got stuck again.
  3. But this time there was no OOM action in the logs. Related to VACUUM or not, I could finally observe the state the Ruby process was in when it had been killed earlier. First observation: 100% processor load on one core. The only long-running actions Pipes has are downloads (which should not cause high processor load) and running a pipe itself. That was the moment I had to try a strict limit on how long a pipe can run.
  4. Pipes was working now, but very soon after the restart - with the time limit applied - the CPU load spiked to 100% again and stayed there. The nice change: this time Pipes as a whole continued to work somewhat. That improvement had to come from the timeout limiting the total load, but it was not as effective as expected. A flamegraph gathered from the running process via rbspy (big recommendation!) revealed that a whole pipe could still run a lot longer than the target timeout, as long as a single block did not finish. So additional timeout checks during block execution followed; before that, the check was only applied at block transitions (see the second sketch after this list).
  5. With those changes Pipes stayed stable over night, though some bigger pipes will now return fewer or no results.
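
To illustrate step 2: VACUUM reclaims free pages by rewriting the entire database file, so its cost grows with the total database size. A minimal sketch of such an hourly cleanup - the gem, file and table names are assumptions, not the actual Pipes code:

    require "sqlite3"

    # Hypothetical cleanup job; names are made up for illustration.
    db = SQLite3::Database.new("pipes.sqlite3")

    # First drop the expired cache entries...
    db.execute("DELETE FROM cache WHERE created_at < ?", [(Time.now - 3600).to_i])

    # ...then compact the file. VACUUM rewrites the whole database to
    # reclaim the freed pages - cheap while the file is small, but real
    # I/O and CPU work once the database has grown.
    db.execute("VACUUM")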
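
And to make steps 3 and 4 concrete, here is a minimal sketch of the deadline idea, not the actual Pipes code (class and method names are made up): a wall-clock deadline is set when a run starts and checked both between blocks and, after the fix, inside block execution too.

    # Sketch only; the real limit and structure differ.
    class PipeTimeout < StandardError; end

    class PipeRun
      LIMIT = 30 # seconds, hypothetical value

      def initialize(blocks)
        @blocks = blocks
      end

      def check_deadline!
        now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        raise PipeTimeout if now > @deadline
      end

      def run(input)
        @deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + LIMIT
        @blocks.reduce(input) do |data, block|
          check_deadline!                         # the old check, only at block transitions
          block.process(data) { check_deadline! } # new: a block checks inside its own loops too
        end
      end
    end

A block that loops over many feed items would call the passed-in check on every iteration, so even a single slow block now hits the limit instead of running on unnoticed.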

This is the gathered flamegraph minus some details (it’s an SVG, click on it to get the full-sized version):

/blog/upload/rbspy-2020-12-23-rkl5ZWksYv.flamegraph.svg

Letting a single pipe run that long was a mistake in the original design. Pipes is threaded via the puma webserver, but not concurrent via multiple processes. That means it took just a few of those long-running pipes to overload the Ruby program. Also, the nginx webserver would terminate the connection after some seconds anyway, so continuing with the pipe would not help avoid a 504 error (though after one successful run the result would be cached for a bit). Implementing a timeout on the Ruby side was the correct thing to do, even though it means that some pipes that worked before will not work anymore.
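
For illustration, this is roughly what such a setup looks like in a puma config; the values here are hypothetical, not the actual Pipes configuration:

    # config/puma.rb - hypothetical values
    workers 0       # a single Ruby process, all requests share it
    threads 1, 16   # concurrency via threads only: fine for I/O-bound
                    # downloads, but Ruby's GVL keeps CPU-bound pipe runs
                    # from using more than one core in parallel

With a single process, a handful of pipes pinning the CPU starve every other request; spreading that work across processes is what the new architecture in the outlook below aims at.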

Outlook

In the next year, I want to improve this situation.

The first measure, and that offer starts right now, is a new, cheaper supporter plan that provides more pipes than the free plan and raises the pipe time limit back to the original level. Pipes has to manage the server resources better, and limiting the pipes made under a free plan a bit more strictly than the fewer paid pipes helps with that. At the same time, Pipes is supposed to be useful web infrastructure for everyone, so the limitations should not be too severe. The paid plans we had so far were targeted at professionals using Pipes for commercial projects, which certainly does not cover everyone interested in Pipes. The new plan should be more attractive for those not in that category.

The next step will be to work on a new architecture for the software, with the goal of better distributing the workload caused by pipes across multiple processor cores, and of separating the website from the pipe backend. That way, the Ruby program could more easily keep longer-running pipes around and keep load spikes under control. Sadly there is no guarantee that a new architecture will come together and actually be helpful; operations like that can always fail. But for now I am confident that it can provide an improvement.
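
As a rough sketch of that direction - all names hypothetical, and the real design may well differ - the website would only enqueue pipe runs, while separate worker processes execute them, one per core:

    require "etc"

    # Stand-in for a real job queue between the website and the pipe backend.
    QUEUE = Thread::Queue.new

    # One worker thread per core, each handing runs to its own subprocess,
    # so heavy pipes load other cores instead of the web process.
    threads = Etc.nprocessors.times.map do
      Thread.new do
        while (pipe_id = QUEUE.pop)
          pid = Process.spawn("ruby", "run_pipe.rb", pipe_id.to_s) # run_pipe.rb is hypothetical
          Process.wait(pid)
        end
      end
    end
    threads.each(&:join)

    # The web side would just push QUEUE << pipe_id and return,
    # serving a cached result until the run finishes.

This sketch is deliberately simple; whether the real split will look anything like it is exactly the open question mentioned above.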

Finally, moving to a stronger server would help all users by making each pipe finish faster. That would mean higher project costs though, which I hope the additional income from paid plans can balance. It would be awesome if you could support me there, be it via the new supporter plan or one of the other two!

onli,