Designing A Time-Based Queue for Serverless

Building a serverless-friendly queue system continues to be an adventure. Recently, we had to say goodbye to our Redis queues and replace them with something more durable. The journey there involved learning a lot about time storage, visibility windows, and how the most popular queue solutions solve these problems. So, if you were curious what makes Taskless reliable, the short answer is a database optimized for time-based queries. For a longer answer, we need to go on a journey.

First, It's Time for Events

The first rule of the JAMStack is that everything is an event. Because servers aren't standing by and waiting for your every request, events are the only way for a serverless provider to know your code needs to run. In practical terms, when you make a request to a Next.js app or AWS Lambda, that small chunk of code is invoked in response to the URL request, executes, sends the reply, and then shuts down. This is well and good for operations like "forward this blog post to a friend" or "leave a comment about how you definitely wouldn't forward this blog post to a friend," given both examples begin with an obvious interaction. But what about an operation like "send this to my inbox tomorrow morning"?

It turns out, time ruins everything.

Time sucks, and if you don't believe me, go read Zach Holman's excellent piece on the matter and return once you've come to a similar conclusion. Now that we agree the concept of time was a mistake, let's talk about sending something to your inbox "tomorrow". To avoid headaches, we're going to make some poor assumptions about what tomorrow means. In Luxon, we can express this arbitrary definition of "tomorrow" as DateTime.now().setZone("America/Los_Angeles").plus({days: 1}).startOf("day").set({hour: 8}). Bam. 8am "tomorrow" as of this exact moment. All we need to do is "event" at that time, and we're set!
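For readers without Luxon on hand, roughly the same arithmetic can be sketched with a plain Date. This is an illustration only: eightAmTomorrow and its utcOffsetMinutes parameter are hypothetical names, and a fixed offset ignores daylight saving time, which is exactly the kind of poor assumption a zone-aware library exists to fix.

```typescript
// Sketch: "8am tomorrow" for a zone described by a fixed UTC offset.
// Assumption: the offset never changes (no DST). Prefer a zone-aware library in real code.
function eightAmTomorrow(now: Date, utcOffsetMinutes: number): Date {
  const offsetMs = utcOffsetMinutes * 60000;
  // Shift "now" so that the UTC getters read out the local wall-clock date.
  const local = new Date(now.getTime() + offsetMs);
  // Date.UTC rolls a day overflow into the next month or year for us.
  const localEightAm = Date.UTC(
    local.getUTCFullYear(),
    local.getUTCMonth(),
    local.getUTCDate() + 1,
    8,
  );
  // Shift back from local wall-clock time to a real UTC instant.
  return new Date(localEightAm - offsetMs);
}
```

With an offset of -480 minutes (UTC-8), calling it at 2024-01-15T20:00:00Z gives 2024-01-16T16:00:00Z, which is 8am the next morning on a UTC-8 clock.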

Store the Time in the Queue Itself - aka The SQS Experiment

The purest of serverless ambitions would be to never persist anything under our ownership. To that end, Amazon SQS and lambdas are an excellent fit. The platform has limitations, but those could be worked around with a bit of duct tape.

First, the maximum delay on an SQS message is 15 minutes, so SQS would need to deliver the job to a lambda at the end of that maximum delay, and the lambda would only report success if it could requeue the job for another 15 minutes. We'd repeat this ad nauseam until we reached our magical "8am tomorrow" requirement and then run our job.
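The decision made on every hop can be sketched as a small pure function. This is a minimal sketch, not the code we ran: nextHop and the Hop type are hypothetical names, times are assumed to be epoch milliseconds, and actually sending or deleting the SQS message is left out.

```typescript
// Sketch of the requeue decision made each time the queue hands the job to a function.
// Assumption: times are epoch milliseconds; talking to SQS itself is out of scope here.
const MAX_DELAY_SECONDS = 900; // SQS caps a per-message delay at 15 minutes

type Hop =
  | { action: "run" }
  | { action: "requeue"; delaySeconds: number };

function nextHop(nowMs: number, runAtMs: number): Hop {
  const remainingSeconds = Math.ceil((runAtMs - nowMs) / 1000);
  if (remainingSeconds <= 0) {
    return { action: "run" }; // the target time has arrived
  }
  // Otherwise wait as long as the queue allows, but never past the target time.
  return {
    action: "requeue",
    delaySeconds: Math.min(remainingSeconds, MAX_DELAY_SECONDS),
  };
}
```

A job due in a day is requeued with the full 900 seconds, a job due in one minute is requeued with 60, and a job whose time has come is run.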

This worked really well (and was insanely cheap). Each job would require a maximum of 96 SQS calls per day in a pending state, and it's only $0.50 per million requests. That's roughly 10,000 jobs per month for every million requests, delaying every job up to a maximum of one day, with just AWS tools. In practice, the system has a lot of opportunities to fail. Specifically, 96 opportunities for every day (672 for every week) the job sits in SQS and must be constantly requeued.
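The arithmetic behind those numbers is short enough to write down. The sketch below uses only the figures quoted in the text and assumes one billable request per hop, which is a simplification; the function names are made up for illustration.

```typescript
// Back-of-the-envelope numbers for the requeue loop, using the figures from the text.
// Assumption: each 15-minute hop costs exactly one billable request.
const HOP_SECONDS = 15 * 60; // one maximum-length delay
const PRICE_PER_MILLION_REQUESTS = 0.5; // USD, as quoted in the text

function hopsPerDay(): number {
  return (24 * 60 * 60) / HOP_SECONDS;
}

// How many jobs can each wait one full day on one million requests.
function jobDaysPerMillionRequests(): number {
  return Math.floor(1000000 / hopsPerDay());
}

// Cost in USD of keeping `jobs` jobs pending for `days` days.
function costOfWaiting(jobs: number, days: number): number {
  return (jobs * days * hopsPerDay() * PRICE_PER_MILLION_REQUESTS) / 1000000;
}
```

Under these assumptions a day of waiting is 96 hops, a million requests covers 10,416 job-days, and holding 10,000 jobs for a day costs about $0.48.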

Honestly, if SQS didn't have an arbitrary 15-minute cap on delays, we'd be done. But it does, so we're not.

Tables, Indexed by Time

It turns out that databases are really good at time stuff, and not just because ISO-8601 is both readable and easy to sort by, but also because really smart people made time (and its storage) one of the primitives in their db systems.

Databases are also really good at indexing non-time things. So imagine you have a table:

column    type
id        any
failed    boolean = false
visible   time, nullable
deleted   time, nullable
ack       any, nullable, unique (on non-null)
payload   text, nullable

Now it's easy to describe the operations in order to take, process, and report on jobs.

  • Your next jobs are at visible <= now() AND deleted IS NULL
  • Your failed jobs are at deleted IS NOT NULL AND failed = TRUE
  • Your completed jobs are at deleted IS NOT NULL AND failed = FALSE
  • You can retry a job by changing its visible time
  • You can complete/fail a job by setting deleted and failed in the same transaction
  • A unique ack value lets you find a single job in the table so that you complete, fail, or retry only one specific job
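To make the operations above concrete, here is an in-memory sketch of the same rules. The field names follow the table; the helper names (isNext, isFailed, isCompleted, retry, finish) are made up for illustration and are not DocMQ's API, and a plain array stands in for the database.

```typescript
// In-memory sketch of the job table and its operations.
// Assumption: an array stands in for the table; helper names are illustrative only.
interface Job {
  id: string;
  failed: boolean;
  visible: Date | null;
  deleted: Date | null;
  ack: string | null;
  payload: string | null;
}

// visible <= now() AND deleted IS NULL
const isNext = (j: Job, now: Date): boolean =>
  j.visible !== null && j.visible.getTime() <= now.getTime() && j.deleted === null;

// deleted IS NOT NULL AND failed = TRUE
const isFailed = (j: Job): boolean => j.deleted !== null && j.failed;

// deleted IS NOT NULL AND failed = FALSE
const isCompleted = (j: Job): boolean => j.deleted !== null && !j.failed;

// Retry: find the one job with this ack and change its visible time.
function retry(jobs: Job[], ack: string, visibleAt: Date): void {
  const job = jobs.find((j) => j.ack === ack);
  if (job) job.visible = visibleAt;
}

// Complete or fail: set deleted and failed together, as a single change.
function finish(jobs: Job[], ack: string, failed: boolean, now: Date): void {
  const job = jobs.find((j) => j.ack === ack);
  if (job) {
    job.deleted = now;
    job.failed = failed;
  }
}
```

In a real database the same predicates become WHERE clauses, and finish becomes a single UPDATE inside a transaction.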

As an added bonus, using just-a-database means that we have separated our queue (what do we need to run) from our worker (the actual running). The pattern was simple enough that I bundled it up as DocMQ. It has a persistent worker if that's something you want, or you can just run DocMQ for adding jobs to your queue database and process them on your own time.

Time to Return to Events

The beautiful part about using a database is that once it's in the database, it's there. We don't need a persistent worker unless we want one, and polling is good enough for most use cases. All we need to do is poll in response to an event; that event is time itself changing. 🤯

Stick with me here. Imagine we have a cron job via Netlify Scheduled Functions or a Lambda cron that runs once per minute (or per-second if you'd like to YOLO). A single query tells you what capacity you need:

SELECT count(*) FROM your_table WHERE visible <= now() AND deleted IS NULL

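The count comes back as 20, so you spin up 20 serverless functions, each to claim 1 row. Postgres has no UPDATE ... LIMIT, so the claim picks its row in a subquery, and SKIP LOCKED keeps concurrent workers from grabbing the same one. The new visible time is measured from now() so that an overdue job is actually hidden while it is being worked on:

UPDATE your_table SET ack = gen_random_uuid(), visible = now() + interval '30 seconds'
WHERE id = (
  SELECT id FROM your_table WHERE visible <= now() AND deleted IS NULL
  ORDER BY visible LIMIT 1 FOR UPDATE SKIP LOCKED
)
RETURNING id, ack, payload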

💥 Instant scalability. If you need more granularity, tighten the cron. If you need sub-second control, then you probably do want a server. Real talk though, if you're sending emails, the latency isn't coming from your job system.
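The whole tick can be simulated in memory to see the shape of it. This is a sketch under loud assumptions: an array stands in for the table, a counter stands in for gen_random_uuid(), single-threaded JavaScript stands in for the database's row locking, and every name here (QueueRow, countDue, claimOne, tick) is hypothetical.

```typescript
// Sketch of one cron tick: count what is due, then let that many workers claim a row each.
// Assumptions: in-memory table, counter-based acks, no real concurrency.
interface QueueRow {
  id: number;
  visible: number; // epoch ms
  deleted: number | null; // epoch ms, or null while pending
  ack: string | null;
}

let ackCounter = 0;

const isDue = (r: QueueRow, now: number): boolean =>
  r.visible <= now && r.deleted === null;

// The capacity query: how many rows are ready right now?
function countDue(table: QueueRow[], now: number): number {
  return table.filter((r) => isDue(r, now)).length;
}

// Claim one due row: tag it with a fresh ack and hide it for 30 seconds.
function claimOne(table: QueueRow[], now: number): QueueRow | undefined {
  const row = table.find((r) => isDue(r, now));
  if (!row) return undefined;
  row.ack = `ack-${++ackCounter}`;
  row.visible = now + 30000;
  return row;
}

// One tick: run as many "functions" as there are due rows.
function tick(table: QueueRow[], now: number): QueueRow[] {
  const claimed: QueueRow[] = [];
  const capacity = countDue(table, now);
  for (let i = 0; i < capacity; i++) {
    const row = claimOne(table, now);
    if (row) claimed.push(row);
  }
  return claimed;
}
```

Because a claim only moves the visible time forward, a worker that dies without completing or failing its job simply lets the row become due again 30 seconds later, and the next tick picks it up.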

How This Relates to Taskless

At Taskless we ran on the cron job / polling design until we were invoking an unreasonable number of serverless functions on the minute and customers wanted sub-minute responses. Since our scheduled function infrastructure didn't let us go to per-second, we switched to persistence. In short, we ran the servers so you could go 100% serverless. 🎉

Because DocMQ is on GitHub, you can replicate it yourself. There's even a Postgres prototype that will get merged down soon in case you're a fan of Neon, Supabase, or other serverless-friendly Postgres instances.

Some Quick Answers

Added because these came up in casual conversation:

  • Does the DB matter? - No, as long as your DB understands time
  • Biggest drawback? - Performance. A relational db will be slower than an in-memory solution in almost every use case. But unless you're building a real-time application such as chat, the latency of a DB touching disk is not going to be why things are slow.
  • What DB does Taskless use? - We use MongoDB Atlas currently. We like their time series support