Cache miss-storm: Dealing with concurrency when caching invalidates for high-traffic sites

VietMX Staff asked 3 months ago

For a high-traffic website, there is a method (say getItems()) that gets called frequently. To avoid going to the DB each time, the result is cached. However, thousands of users may be reading the cache at the same time, so locking the resource would not be a good idea: if the cache has expired, the call goes to the DB, and all the users would have to wait for the DB to respond. What would be a good strategy to deal with this situation so that users don’t have to wait?

The problem is the so-called cache miss-storm (also known as a cache stampede or dogpile): a scenario in which many users trigger regeneration of the cache at the same time, so all of those requests hit the DB.

To prevent this, first set both a soft and a hard expiration time. Let’s say the hard expiration is 1 day and the soft expiration is 1 hour. The hard expiration is the one actually set in the cache server; the soft expiration is stored in the cached value itself (or in another key in the cache server). The application reads from the cache, sees that the soft time has expired, sets the soft time 1 hour ahead, and only then hits the database. This way the next request sees the already updated time and won’t trigger another cache update. It may read stale data, but that data is already being regenerated.
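The soft/hard expiration idea can be sketched roughly as follows. This is only an illustration under assumptions: `FakeCache` is a hypothetical in-memory stand-in for a real cache server, `load_from_db` is whatever function queries your database, and `now` is passed in only to make the behaviour easy to follow. With a real cache server the read-then-update of the soft time is not atomic, so a few requests may still slip through together.

```python
import time

SOFT_TTL = 60 * 60        # 1 hour, stored inside the cached value
HARD_TTL = 24 * 60 * 60   # 1 day, enforced by the cache itself


class FakeCache:
    """Hypothetical stand-in for a cache server with a hard TTL per key."""

    def __init__(self):
        self._data = {}

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or entry["hard_expires"] <= now:
            self._data.pop(key, None)  # hard-expired entries disappear
            return None
        return entry["payload"]

    def set(self, key, payload, now, ttl=HARD_TTL):
        self._data[key] = {"payload": payload, "hard_expires": now + ttl}


def _store(cache, value, now):
    # The soft expiry travels with the value; the hard expiry is the cache TTL.
    cache.set("items", {"value": value, "soft_expires": now + SOFT_TTL}, now)


def get_items(cache, load_from_db, now=None):
    now = time.time() if now is None else now
    cached = cache.get("items", now)
    if cached is None:
        # Cold miss: nothing to serve, so this request has to load the data.
        value = load_from_db()
        _store(cache, value, now)
        return value
    if cached["soft_expires"] <= now:
        # Push the soft expiry forward BEFORE hitting the database, so that
        # requests arriving meanwhile are served the stale value instead.
        _store(cache, cached["value"], now)
        value = load_from_db()
        _store(cache, value, now)
        return value
    return cached["value"]
```

Only the request that notices the expired soft time pays for the database call; everyone else keeps reading the old value until the new one is stored.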

The next point: you should have a procedure for cache warm-up, i.e. instead of a user request triggering the cache update, a process in your application pre-populates the new data.
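A warm-up procedure could look something like the sketch below. The names `warm_up`, `start_periodic_warm_up` and the `loaders` mapping are made up for illustration, and a plain dict stands in for the cache; in practice this would more likely be a cron job or a worker process writing to your real cache server.

```python
import threading


def warm_up(cache, loaders):
    """Populate every key from its loader function; return the keys written."""
    for key, load in loaders.items():
        cache[key] = load()
    return list(loaders)


def start_periodic_warm_up(cache, loaders, interval_seconds, stop_event):
    """Refresh the cache in a background thread until stop_event is set."""

    def run():
        while True:
            warm_up(cache, loaders)
            # wait() returns True as soon as stop_event is set.
            if stop_event.wait(interval_seconds):
                break

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

If the interval is shorter than the soft expiration time, user requests should normally never be the ones that trigger a regeneration.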

The worst-case scenario is, for example, a restart of the cache server, when you don’t have any data at all. In this case you should fill the cache as fast as possible, and that’s where a warm-up procedure may play a vital role. Even if you don’t have a value in the cache, it is a good strategy to “lock” the cache key (mark it as being updated), allow only one query to the database, and have the other requests ask for the resource again after a given timeout.
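The "lock" for the cold-cache case might be sketched like this. It assumes the cache offers an atomic "store only if absent" operation (called `add` here, as in memcached); `LockingCache`, `get_items_single_flight` and the `items:lock` key are hypothetical names, and threads stand in for concurrent requests. A real lock key should also carry a short expiry so that a crashed request cannot block everyone else.

```python
import threading
import time


class LockingCache:
    """Hypothetical in-memory cache; the mutex models atomic server commands."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def get(self, key):
        with self._mutex:
            return self._data.get(key)

    def set(self, key, value):
        with self._mutex:
            self._data[key] = value

    def add(self, key, value):
        """Store only if the key is absent; return True on success."""
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)


def get_items_single_flight(cache, load_from_db, retry_delay=0.01, max_retries=200):
    for _ in range(max_retries):
        value = cache.get("items")
        if value is not None:
            return value
        if cache.add("items:lock", 1):  # only one request wins the lock
            try:
                # Re-check: another request may have filled the cache
                # between our read and our taking the lock.
                value = cache.get("items")
                if value is None:
                    value = load_from_db()
                    cache.set("items", value)
                return value
            finally:
                cache.delete("items:lock")
        # Someone else is regenerating: wait, then ask the cache again.
        time.sleep(retry_delay)
    raise TimeoutError("cache was not populated in time")
```

The waiting requests do pay a short delay, but the database sees a single query instead of thousands.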