How to Build a Concurrency Limiting Job Scheduler Using AWS Step Functions and Lambda — Part 1
A serverless solution to build a job scheduler that executes several jobs by maintaining max concurrency (Part 1/2).
When working with a large number of long-running jobs, it sometimes becomes important to control concurrency so that we don't overwhelm downstream systems. In this article I'll describe a serverless solution for building a job scheduler that executes many jobs while maintaining a max concurrency, so that no more than `N` jobs are running at any given time.
Recently, I came across a very interesting problem at work while assisting one of our teams in building a data lake. They had 100+ AppFlow jobs pulling data from Salesforce. These had to run each day, but we could not run them all at once because it would overwhelm the downstream system: it turned out that if we ran more than eight AppFlow jobs concurrently, we started getting errors.
The most straightforward approach would have been to break the jobs into batches of eight and then schedule the batches to run at specific time intervals using AWS EventBridge. But here are the challenges:
- How do we come up with the time interval between two subsequent batches? If all the jobs took roughly the same amount of time to finish, this number would be easy to come up with. In reality, that's not the case. We must assume that each job could take a variable amount of time, in which case the interval has to be at least `max(job1, job2, ..., job8)`, the duration of the longest job in the batch.
- Scalability is another issue. Right now there are 100+ jobs; what if this number grows to 200+ or 500+? You can easily see how complex the scheduling would become. Add to that the overhead of updating EventBridge rules to keep up with new additions.
- Finally, this approach is not optimal. If one job in a batch completes in 1 minute while the last one takes about an hour, we would still wait the full hour before launching the next batch. As a result, we rarely run at the optimal max concurrency level.
Can we do better? To hit max concurrency, we need a pool of jobs where the size of the pool represents max concurrency. Once the pool is maxed out, we stop releasing new jobs unless an existing job finishes.
Note: One important assumption that I am making is that all jobs have equal priority and can be executed in any order.
So, how do we build such a job scheduler? Our goal is to make it flexible, reusable, and cost-efficient. Here is how I designed it using AWS Step Functions and Lambda.
The first step is to generalize the concept of a Job. A Job can be any long-running activity: an AppFlow flow as in our case, an ECS Task, an AWS Glue job, an AWS Batch job, or a custom application. The two main things we are concerned with are:
- How do we execute the job? This is the API that launches the Job. I expect the API to return some sort of context, such as an `executionId`, that can be used to track the status of the Job.
- How do we track the status of the job? This is the API that lets us check whether the job has succeeded, failed, or is still in progress. I expect the API to accept the job context and return the status.
```java
interface JobContext {}
interface JobParams {}

enum JobStatus { SUCCESS, FAILED, IN_PROGRESS }

interface Job {
    JobContext start(JobParams jobParams);
    JobStatus getStatus(JobContext jobContext);
}
```
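To make the abstraction concrete, here is a minimal sketch of how the interface might be implemented. The `ExecutionContext` and `FakeJob` classes are hypothetical stand-ins of my own; a real implementation would call the AppFlow (or ECS, Glue, Batch) API inside `start` and `getStatus`:

```java
import java.util.UUID;

// The article's interfaces, repeated so this sketch is self-contained.
interface JobContext {}
interface JobParams {}
enum JobStatus { SUCCESS, FAILED, IN_PROGRESS }

interface Job {
    JobContext start(JobParams jobParams);
    JobStatus getStatus(JobContext jobContext);
}

// Hypothetical context carrying the executionId returned by the launch API.
class ExecutionContext implements JobContext {
    final String executionId;
    ExecutionContext(String executionId) { this.executionId = executionId; }
}

// Hypothetical Job that reports IN_PROGRESS for a fixed number of status
// checks and then SUCCESS. A real Job would query the service's API instead.
class FakeJob implements Job {
    private final java.util.Map<String, Integer> polls = new java.util.HashMap<>();
    private final int pollsUntilDone;

    FakeJob(int pollsUntilDone) { this.pollsUntilDone = pollsUntilDone; }

    @Override
    public JobContext start(JobParams jobParams) {
        String executionId = UUID.randomUUID().toString();
        polls.put(executionId, 0);
        // The returned context is what the scheduler persists to track the job.
        return new ExecutionContext(executionId);
    }

    @Override
    public JobStatus getStatus(JobContext jobContext) {
        String id = ((ExecutionContext) jobContext).executionId;
        int seen = polls.merge(id, 1, Integer::sum);
        return seen >= pollsUntilDone ? JobStatus.SUCCESS : JobStatus.IN_PROGRESS;
    }
}
```

The scheduler only ever sees `Job`, `JobContext`, and `JobStatus`, which is what keeps it agnostic of the underlying job type.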
Now, let’s look at how our Scheduler would work:
- The input to our scheduler is a list of Jobs that we want to execute. The input contains the `JobParams` associated with each Job.
- From the list of available Jobs, we take the first `N` jobs and execute them concurrently. We start a job by invoking `job.start(params)`, which returns a `jobContext`. We save the `jobContext` along with the other metadata associated with the job.
- Then we wait for a reasonable duration of time, say 5 minutes. This is our polling interval.
- After the polling interval is over, we check whether any of the running jobs have finished by calling `job.getStatus(jobContext)` on each. If some have, we launch more jobs from the queue to maintain the max concurrency `N`.
- We continue polling until all running jobs have finished and there are no more outstanding jobs.
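The loop described above can be sketched in plain Java. This is a simplified model of my own: `TickJob` is a hypothetical job that succeeds after a given number of polls, the `JobParams`/`JobContext` plumbing is folded into the objects for brevity, and each loop iteration stands in for one polling interval (the real scheduler would wait, e.g., 5 minutes between iterations):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

enum JobStatus { SUCCESS, FAILED, IN_PROGRESS }

interface Job {
    void start();
    JobStatus getStatus();
}

// Hypothetical job that reports IN_PROGRESS until it has been polled
// pollsUntilDone times, then reports SUCCESS.
class TickJob implements Job {
    private int polls;
    private final int pollsUntilDone;

    TickJob(int pollsUntilDone) { this.pollsUntilDone = pollsUntilDone; }

    public void start() {}

    public JobStatus getStatus() {
        return ++polls >= pollsUntilDone ? JobStatus.SUCCESS : JobStatus.IN_PROGRESS;
    }
}

class Scheduler {
    // Runs all queued jobs, never exceeding maxConcurrency, and returns
    // the number of polling rounds it took.
    static int run(Deque<Job> pending, int maxConcurrency) {
        List<Job> running = new ArrayList<>();
        int rounds = 0;
        while (!pending.isEmpty() || !running.isEmpty()) {
            // Top up the pool from the queue until we hit max concurrency.
            while (running.size() < maxConcurrency && !pending.isEmpty()) {
                Job job = pending.poll();
                job.start();
                running.add(job);
            }
            rounds++; // one polling interval elapses here
            // Drop every job that has finished (success or failure),
            // freeing slots for the next top-up.
            running.removeIf(j -> j.getStatus() != JobStatus.IN_PROGRESS);
        }
        return rounds;
    }
}
```

Because a slot is refilled as soon as any job finishes, a fast job no longer holds up the rest of its "batch", which is exactly the property the fixed-interval EventBridge approach lacked.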
That’s pretty much it. The implementation of Scheduler is agnostic of the type of Job we are running and thus can be extended to handle a variety of Job types.
We have reached the end of the first part of this blog, where I wanted to cover the use case and the high-level design of the concurrency limiting Scheduler. In the second part, I'll describe the implementation (with code samples) that leverages AWS Step Functions and Lambda to execute AWS AppFlow jobs.
Until then, stay tuned!