Stale-While-Revalidate to overcome downtimes

Simplifying existing solutions and creating new ones are among the most important tasks an engineer is supposed to do at a company.

A simpler solution is more effective than a complex solution.

It often involves fewer assumptions.

This sounds obvious when you read it.

But in practice, it is much harder to detect.

And the experience may slip away silently if you never reflect on the things you do.

In this post, I'll explain a specific situation in which I was able to identify this.

When code is needed, I'll use JS-like pseudocode and sequence diagrams.

That said, let's jump into it!

CONTEXT

I'm working on a marketing website, which is visited daily by more than 250k users. The servers receive more than 15M requests per day, according to the CDN stats.

With this kind of load, being up and replying fast is super important, and so is delivering that performance consistently over time and across locations. This notion is nothing new; it is one of the main goals of Site Reliability Engineering (SRE).

For this reason, it often makes sense to have a cache layer between the headless content management system (the CMS, from now on) and the web server.

It is also worth mentioning that we always have different kinds of alerts and monitoring tools in place (Opsgenie alerts from Pingdom, plus Prometheus and Grafana), along with a bunch of metrics to understand the website's performance and load. On top of that, we have the CDN usage stats.

THE PROBLEM WITH THE INITIAL STATE

So, we had a cache.

But there was a problem with it.

Whenever we cleared it, either because we deployed a new version of the app or because an editor needed their content updated right away and requested a cache clear, we had micro-downtimes.

Note: Without monitoring tools in place, this would have been much harder to detect, since it would have required manual tests, some luck, and repeating them over time to find the pattern of the problem. It is important to always have monitoring and observability set up in your app.

INITIAL CACHE PATTERN

The initial version of the website was oriented towards providing an editorial platform, while also providing a basic solution for caching.

Because of this, the initial implementation was ignorant of the micro-downtime problems under high load.

This is what the sequence diagram looks like

sequenceDiagram
  participant Browser
  participant Server
  participant Redis
  participant CMS
  Browser->>Server: request '/' resource
  Server->>Redis: request cached content
  Redis-->>Server: I don't have the content in my cache
  Server->>CMS: request content
  CMS-->>Server: sure, here you go
  Server-->>Browser: here is the HTML
  Server->>Redis: put content

Clients usually call the cache like this

cache.set(key, async () => {
  const value = await cms.get() // fetch the fresh content from the CMS
  return value
})

And the cache.set implementation was roughly something like this

async function set(key, getFunction) {
  let value = await cache.get(key);
  if (!value) {
    value = await getFunction(); // cache miss: wait for the CMS
    cache.set(key, value); // fire and forget
  }
  return value;
}

This approach was enough for some time, but then some problems started to arise.

Because the keys are not found after a cache clear, we have to wait for the CMS response, and the Node.js process keeps the pending requests waiting for a reply.

And we had thousands of requests pending.

So they all wait.

They wait at least 600 ms because, in the most basic API request, we hit 3 CMS endpoints (with cache, of course), and each of those takes around 200 ms. Some requests went up to 30 seconds or even more when something went out of control on the third party's side.
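To make that arithmetic concrete, here is a rough sketch of what a single cache miss looked like (the endpoint names and the renderHtml helper are made up for illustration, in the same pseudocode style as above):

// Hypothetical example: on a cache miss, a single page render waits
// for three sequential CMS calls before it can reply to the browser.
async function renderHomePage() {
  const settings = await cms.get('/site-settings') // ~200 ms
  const navigation = await cms.get('/navigation')  // ~200 ms
  const page = await cms.get('/pages/home')        // ~200 ms
  // Total: ~600 ms of waiting per request, multiplied by thousands
  // of pending requests right after a cache clear.
  return renderHtml(settings, navigation, page)
}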

In addition, at some point our servers could not accept more requests and started to fail with 503 Service Unavailable responses, socket timeouts, or 502 Bad Gateway errors. As a result, throughput dropped dramatically, and you could even see the downtime from a browser.

BUT... WHY?

The CMS call is an API call, which means it goes over HTTP, with encryption on top, and points to a different infrastructure (the CMS provider's).

We can then assume that calling an API endpoint located in a different infrastructure is obviously slower than calling a Redis database on our own infrastructure.

We also checked that in a dashboard.

This dashboard showed us how much time a request spent on each function call. Redis read access is usually under 25 ms and, in some rare scenarios, up to 75 ms.

At the same time, third-party calls were around 200 ms and, from time to time, went above 10 seconds (this happened every two or three weeks in our particular case).

We had to do something to improve this.

SOLUTION 1: LOAD SIMULATION / SSG

So, what can we do about it?

Well, having the content in the cache in the first place sounded like an option to us.

So, what if we periodically simulate our users and put everything in cache?

This solution looks something like this

sequenceDiagram
  participant Cronjob
  participant Browser
  participant Server
  participant Redis
  participant CMS
  Cronjob->>Server: request all routes
  Server->>Redis: request content
  Redis-->>Server: here you go, loaded
  Browser->>Server: request '/' route
  Server->>Redis: request content
  Redis-->>Server: there you go
  Server-->>Browser: here you go

We saw some light there, and we started to move in this direction.

We called it Proactive load, but it is essentially the same as SSG.

The idea was to trigger a command via a cron job every hour or so, loading the most popular content, if not all of it, into the cache.

Because we pull the data from a CMS, it is very similar to how Static Site Generation (SSG) works.

It is just that instead of generating HTML files, we load a bunch of objects into the cache.
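A minimal sketch of what that proactive load could look like, reusing the cache.set interface from above (getAllRoutes, cacheKeyFor, and getContentFor are hypothetical helpers, not our real API):

// Hypothetical "proactive load" job: walk every route we can think of
// and force its content into the cache, whether or not a real user
// will ever request it.
async function warmCache() {
  const routes = await cms.getAllRoutes() // every publishable route
  for (const route of routes) {
    // Warms the key if it is missing; effectively a no-op if it is already cached.
    await cache.set(cacheKeyFor(route), () => cms.getContentFor(route))
  }
}

// Triggered every hour or so, e.g. via crontab:
// 0 * * * * node warm-cache.js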

When I started programming this, I had to add many new interface methods here and there to manipulate the cache keys and generate all the combinations of the data, which implied lots of new code and data sources.

It didn't feel right, to be honest. The solution started to smell too complex.

In the meantime, a few other things happened.

We use Next.js for the website, and because of that, I was following its releases. I saw that they had released a new feature in the framework, supported by their CDN, called Stale-While-Revalidate: a pattern that I knew in theory but not so much in practice. They advertised it as even better than SSG (although, shortly after, they also released SSG and ISR).

Also, I was reading parts of the book "Reactive Design Patterns".

In this book, I learned a few things about how reactive systems are often superior because they tend to be more efficient.

Now, with this in mind, let's think back to that complexity smell. It seemed to be related to these recent learnings.

To summarize, with this approach we would be running a job every hour and loading content into the cache that may never be consumed. And it still would not solve the initial issue in the worst-case scenario, unless we always generated everything.

Then I remembered the whole "Stale-While-Revalidate" thing and decided to take a look at it.

SOLUTION 2: FIRE AND FORGET / SWR

Our implementation of the Stale-While-Revalidate solution looks something like this

sequenceDiagram
  participant Browser
  participant Server
  participant Redis
  participant CMS
  Browser->>Server: request '/' route
  Server->>Redis: request content from layer 1
  Redis-->>Server: I don't have the content
  Server->>Redis: request content from layer 2
  Redis-->>Server: here you go
  Server->>CMS: request content, in a sync way if not found
  CMS-->>Server: here you go
  Server-->>Browser: here you go
  Server-->>CMS: request content, even when found, but async
  CMS-->>Server: here you go
  Server->>Redis: set content on both layers

There are many changes here.

First and foremost, it adds a new cache layer with higher TTL (much higher).

The second layer reduces the probability of having to go to the CMS while users wait, which is what was causing the micro-downtimes.

It looks like it still has the same problem, but the probability of having to go to the CMS synchronously becomes so tiny that it is not a problem anymore.

Now, implementing this is super easy. First, we change the cache algorithm, and that's almost it.
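Here is a minimal sketch of the new algorithm, assuming two Redis-backed layers (layer1 with a short TTL, layer2 with a much longer one); the names and TTL values are illustrative, not our exact ones:

// Stale-While-Revalidate read path (sketch).
async function getWithSwr(key, fetchFromCms) {
  // Layer 1: fresh content. If present, we are done.
  const fresh = await layer1.get(key)
  if (fresh) return fresh

  // Layer 2: stale content. Serve it right away and refresh in the
  // background, fire and forget.
  const stale = await layer2.get(key)
  if (stale) {
    revalidate(key, fetchFromCms) // intentionally not awaited
    return stale
  }

  // Nothing cached at all: we have no choice but to wait for the CMS.
  return revalidate(key, fetchFromCms)
}

async function revalidate(key, fetchFromCms) {
  const value = await fetchFromCms()
  // TTLs are examples; the exact option name depends on the Redis client.
  await layer1.set(key, value, { ttl: 60 * 10 })          // e.g. 10 minutes
  await layer2.set(key, value, { ttl: 60 * 60 * 24 * 7 }) // e.g. 7 days
  return value
}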

There is no need for simulated requests because it uses the actual user event, making it reactive instead of proactive.

Now there is a downside: requests that take the content from layer 2 are technically getting old content. But at least they are getting content instead of waiting and exhausting our server's capacity. And the content is not very old either (only as old as the last request that triggered the revalidation function).

I said "almost" above because we have a CDN in front of our server. And because the CDN is in charge of caching, when returning a stale response you need to tell it and/or the browser to cache that response only for a few seconds (more or less the delay of the request to the CMS) instead of the regular expiration, which may be 6 hours, for example. Otherwise, the CDN would keep replying with the stale content for those 6 hours. The Cache-Control header is used for this.
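For example, assuming an Express-style response object (the exact values are illustrative):

// Stale responses are only cacheable for a few seconds, so the CDN
// comes back once the background revalidation has had time to finish.
// Fresh responses keep the regular expiration.
function setCacheControl(res, { isStale }) {
  if (isStale) {
    res.set('Cache-Control', 'public, max-age=10')    // roughly one CMS round trip
  } else {
    res.set('Cache-Control', 'public, max-age=21600') // the regular 6 hours
  }
}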

CURRENT SITUATION

The current situation is that we don't have any more micro-downtimes when clearing the cache.

Having no more micro-downtimes is great, because an unstable system has performance and business implications, and it makes developers afraid to deploy new changes or clear the cache, which is bad for any engineering organization.

CONCLUSIONS

So, how is the reactive approach simpler than the proactive?

First of all, the implementation is more straightforward. It involves fewer new pieces and reuses most of the existing ones.

Also, it uses the original user event instead of assuming that an automated user with a clunky visit pattern (cron + script) can generate something useful.

In this particular case, for me, the complex implementation was the first sign.

My favorite takeaways are:

Thanks to Dan, Dimi, and Vlad for their collaboration during this process.
