Simpler serverless queues using AWS Step Functions

The Problem: Knowing When All Jobs Are Done

Imagine you have a large-scale job as part of a workflow. One of my customers, for example, runs a highly isolated SaaS application that, during each deployment, runs database jobs for thousands of tenants. Once everything is finished, the system needs to send a callback to the SaaS app's control plane.

To efficiently process thousands of jobs, a queue-worker pattern is ideal. Amazon SQS is a natural choice for queuing the jobs, while Lambda or an ECS container acts as the worker. This setup allows for automatic scaling, retries, and dead-letter queues in case of failures.
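
As a minimal sketch of the enqueue side of this pattern, assuming a hypothetical tenant-jobs queue, the deployment process could push one message per tenant in batches of ten (the SQS batch maximum):

```python
import json
import boto3

sqs = boto3.client("sqs")

# Placeholder queue URL; a Lambda or ECS worker consumes this queue.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/tenant-jobs"

def enqueue_jobs(tenant_ids: list[str]) -> None:
    # send_message_batch accepts at most 10 entries per call
    for i in range(0, len(tenant_ids), 10):
        batch = tenant_ids[i : i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps({"tenant_id": t})}
                for n, t in enumerate(batch)
            ],
        )
```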

However, one big challenge arises: how do you know when the queue is truly empty so that you can proceed to the next step? Surprisingly, this is trickier than it sounds. Here are a few options:

Possible Solutions (and Their Problems)

  1. Using SQS Metrics

    SQS does not provide a definitive metric for when a queue is completely empty. Metrics such as ApproximateNumberOfMessagesVisible are, as their names suggest, only approximations: SQS is a distributed system and cannot guarantee that all messages have been fully processed. This makes it unreliable for workflow coordination.

  2. Tracking Processed Jobs

    You could track the number of completed jobs versus the total expected count (e.g., in DynamoDB or the application’s control plane). However, this requires extra logic, which can be error-prone; handling duplicate message deliveries, for instance, adds complexity. A sketch of this counting approach follows this list.

  3. Using Step Functions as a Queue (Yes, Really!)

    At first, this might sound odd: why replace SQS with Step Functions? But Step Functions excel at workflow coordination, and they offer a built-in mechanism to track when all jobs have completed.
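
To illustrate option 2, here is a minimal sketch of that counting approach in Python with boto3. The job-runs table, its pk key, and the remaining attribute are hypothetical; the table would be seeded with the total job count when the run starts.

```python
import boto3

dynamodb = boto3.client("dynamodb")

def mark_job_done(run_id: str) -> bool:
    """Atomically decrement the remaining-jobs counter for one deployment run.

    Returns True when this was the last job. Caveat: SQS delivers messages
    at least once, so a real implementation needs a per-job conditional
    write first, otherwise a duplicate delivery decrements the counter twice.
    """
    resp = dynamodb.update_item(
        TableName="job-runs",  # hypothetical table, one item per run
        Key={"pk": {"S": run_id}},
        UpdateExpression="SET remaining = remaining - :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    return int(resp["Attributes"]["remaining"]["N"]) == 0
```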

Step Functions 'Map' State: A Smarter Queue

The real issue is that the queue is part of a workflow—you need to act after all jobs have finished. This is exactly what AWS Step Functions are designed for.

Step Functions include a Map state, which lets you pass in a list of jobs and process each one in its own workflow branch. Once all of the jobs complete (or a defined failure tolerance is reached), the state machine automatically proceeds to the next step.
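
As a minimal sketch, here is roughly what such a state machine definition could look like, written as a Python dict for readability. The Lambda ARNs and the $.jobs input path are placeholders:

```python
import json

# ASL definition of a Map-based "queue". NotifyControlPlane runs exactly
# once, only after every Map iteration has finished, which is precisely
# the completion signal SQS cannot provide.
definition = {
    "StartAt": "ProcessAllJobs",
    "States": {
        "ProcessAllJobs": {
            "Type": "Map",
            "ItemsPath": "$.jobs",  # list of job payloads in the input
            "MaxConcurrency": 10,   # throttle load on downstream systems
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "ProcessJob",
                "States": {
                    "ProcessJob": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:process-tenant-job",
                        "End": True,
                    }
                },
            },
            "Next": "NotifyControlPlane",
        },
        "NotifyControlPlane": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:notify-control-plane",
            "End": True,
        },
    },
}

# json.dumps(definition) is what you would pass to create_state_machine.
print(json.dumps(definition, indent=2))
```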

Things to Consider:

  1. Payload Size Limits

    • If you're dealing with millions of jobs, you can't pass all the data directly into the Map state, because a Step Functions payload is capped at 256 KB.

    • Solution: Store the job data in S3 and reference it from the state machine. A Distributed Map can read its items directly from an S3 object (see the sketch after this list).

  2. Cost

    • Step Functions (especially Standard Workflows) charge per state transition, and for millions of jobs this can add up to thousands of dollars. At roughly $0.025 per 1,000 state transitions (us-east-1 pricing), 10 million jobs with four transitions each already cost about $1,000. That is much more than SQS, but for some use cases the better fit can justify the higher price.

    • Cost-saving approaches:

      • Batch jobs within the processor step (though this makes handling partial failures harder); a Distributed Map's ItemBatcher does this with a single setting (see the sketch after this list).

      • Use Express Workflows, which are cheaper but come with limitations: executions must complete within five minutes, and parallelism is capped at around 30 workers.
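
As a sketch combining both cost-saving ideas with the S3 approach from the payload-size section, a Distributed Map state could look roughly like this. The bucket, key, batch size, and Lambda ARN are illustrative assumptions:

```python
# Distributed Map: items are read straight from a JSON array in S3, child
# workflows run as cheaper Express executions (each must finish within the
# Express time limit), and ItemBatcher hands each worker 100 jobs at once
# to cut per-job state transitions.
large_scale_map_state = {
    "Type": "Map",
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:getObject",
        "ReaderConfig": {"InputType": "JSON"},
        "Parameters": {
            "Bucket": "deployment-artifacts",  # hypothetical bucket
            "Key": "runs/run-42/jobs.json",
        },
    },
    "ItemBatcher": {"MaxItemsPerBatch": 100},
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessBatch",
        "States": {
            "ProcessBatch": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:process-job-batch",
                "End": True,
            }
        },
    },
    "Next": "NotifyControlPlane",
}
```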

Added Benefits of Step Functions

  • Controlled Concurrency: Prevent high downstream load (e.g., on a database) by setting a MaxConcurrency limit with one simple setting.

  • Built-in Debugging & Visualization: Unlike SQS, Step Functions provide a clear execution history.

  • Queue-like Features: You can still configure retries, concurrency scaling, and dead-letter-style error handling via Retry and Catch (see the sketch below).
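
As a minimal sketch of those queue-like features, a task state can retry with backoff and route terminal failures to an SQS queue through the native sendMessage integration, mimicking a redrive policy. The ARNs and queue URL are placeholders:

```python
# Task with exponential-backoff retries; unrecoverable failures are caught
# and forwarded to a "dead-letter" queue instead of failing the whole run.
process_job_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:process-tenant-job",
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 5,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [
        {"ErrorEquals": ["States.ALL"], "Next": "SendToDeadLetter", "ResultPath": "$.error"}
    ],
    "End": True,
}

send_to_dead_letter_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage",  # native SQS integration
    "Parameters": {
        "QueueUrl": "https://sqs.eu-west-1.amazonaws.com/123456789012/failed-jobs",
        "MessageBody.$": "$",  # forward the failed payload plus error info
    },
    "End": True,
}
```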

Conclusion

If your queue is part of a workflow and you need a clear indicator of when all jobs are complete, Step Functions can be a better alternative to SQS. They simplify orchestration, provide built-in visibility, and offer useful features like controlled concurrency and error handling.

For workloads requiring strict cost control or ultra-high scale, you might still need SQS. But if workflow clarity and job tracking matter, Step Functions are a powerful alternative.