Making Sense of Azure Durable Functions

Stateful Workflows on top of Stateless Serverless Cloud Functions—this is the essence of the Azure Durable Functions library. That's a lot of fancy words in one sentence, and they might be hard to make sense of at first.

Please join me on the journey where I'll try to explain how those buzzwords fit together. I will do this in 3 steps:

  • Describe the context of modern cloud applications relying on serverless architecture;
  • Identify the limitations of basic approaches to composing applications out of the simple building blocks;
  • Explain the solutions that Durable Functions offer for those problems.

Microservices

Traditionally, server-side applications were built in a style which is now referred to as Monolith. If multiple people and teams were developing parts of the same application, they mostly contributed to the same code base. If the code base were structured well, it would have some distinct modules or components, and a single team would typically own each module:

Monolith

Multiple components of a monolithic application

Usually, the modules would be packaged together at build time and then deployed as a single unit, so a lot of communication between modules would stay inside the OS process.

Although the modules could stay loosely coupled over time, the coupling almost always occurred on the level of the data store because all teams would use a single centralized database.

This model works great for small- to medium-size applications, but it turns out that teams start getting in each other's way as the application grows since synchronization of contributions takes more and more effort.

As a complex but viable alternative, the industry came up with a revised service-oriented approach commonly called Microservices. The teams split the big application into "vertical slices" structured around the distinct business capabilities:

Microservices

Multiple components of a microservice-based application

Each team then owns a whole vertical—from public communication contracts, or even UIs, down to the data storage. Explicitly shared databases are strongly discouraged. Services talk to each other via documented and versioned public contracts.

If the borders for the split were selected well—and that's the most tricky part—the contracts stay stable over time, and thin enough to avoid too much chattiness. This gives each team enough autonomy to innovate at their best pace and to make independent technical decisions.

One of the drawbacks of microservices is the change in deployment model. The services are now deployed to separate servers connected via a network:

Distributed Systems

Challenges of communication between distributed components

Networks are fundamentally unreliable: they work just fine most of the time, but when they fail, they fail in all kinds of unpredictable and undesirable ways. There are books written on the topic of distributed systems architecture. TL;DR: it's hard.

Many new adopters of microservices tend to ignore such complications. REST over HTTP(S) is the dominant style of connecting microservices. Like any other synchronous communication protocol, it makes the system brittle.

Consider what happens when one service becomes temporarily unhealthy: maybe its database goes offline, or it's struggling to keep up with the request load, or a new version of the service is being deployed. All the requests to the problematic service start failing—or worse—become very slow. The dependent service waits for the response and thus blocks all incoming requests of its own. The error propagates upstream very quickly, causing cascading failures all over the place:

Cascading Failures

Error in one component causes cascading failures

The application is down. Everybody screams and starts the blame war.

Event-Driven Applications

While cascading failures of HTTP communication can be mitigated with patterns like a circuit breaker and graceful degradation, a better solution is to switch to the asynchronous style of communication as the default. Some kind of persistent queueing service is used as an intermediary.

The style of application architecture which is based on sending events between services is known as Event-Driven. When a service does something useful, it publishes an event—a record about the fact which happened to its business domain. Another service listens to the published events and executes its own duty in response to those facts:

Event-Driven Application

Communication in event-driven applications

The service that produces events might not know about the consumers. New event subscribers can be introduced over time. This works better in theory than in practice, but the services still tend to end up less coupled.

More importantly, if one service is down, other services don't catch fire immediately. The upstream services keep publishing events, which build up in the queue but can be stored safely for hours or days. The downstream services might not be doing anything useful for this particular flow, but they can stay healthy otherwise.

However, another potential issue comes hand-in-hand with loose coupling: low cohesion. As Martin Fowler notes in his essay What do you mean by "Event-Driven":

It's very easy to make nicely decoupled systems with event notification, without realizing that you're losing sight of the larger-scale flow.

Given many components that publish and subscribe to a large number of event types, it's easy to stop seeing the forest for the trees. Combinations of events usually constitute workflows that unfold gradually over time. A workflow is more than the sum of its parts, and understanding the high-level flow is paramount to controlling the system's behavior.

Hold this thought for a minute; we'll get back to it later. Now it's time to talk cloud.

Cloud

The birth of public cloud changed the way we architect applications. It made many things much more straightforward: provisioning of new resources in minutes instead of months, scaling elastically based on demand, and resiliency and disaster recovery at the global scale.

It made other things more complicated. Here is the picture of the global Azure network:

Azure Network

Azure locations with network connections

There are good reasons to deploy applications to more than one geographical location: among others, to reduce network latency by staying close to the customer, and to achieve resilience through geographical redundancy. Public Cloud is the ultimate distributed system. As you remember, distributed systems are hard.

There's more to it. Each cloud provider has dozens and dozens of managed services, which is both a curse and a blessing. Specialized services are great at providing off-the-shelf solutions to common complex problems. On the flip side, each service has distinct properties regarding consistency, resiliency, and fault tolerance.

In my opinion, at this point developers have to embrace the public cloud and apply distributed system design on top of it. If you agree, there is an excellent way to approach it.

Serverless

The slightly provocative term serverless is used to describe cloud services that do not require provisioning of VMs, instances, workers, or any other fixed capacity to run custom applications on top of them. Resources are allocated dynamically and transparently, and the cost is based on their actual consumption, rather than on pre-purchased capacity.

Serverless is more about operational and economical properties of the system than about the technology per se. Servers do exist, but they are someone else's concern. You don't manage the uptime of serverless applications: the cloud provider does.

On top of that, you pay for what you use, similar to the consumption of other commodity resources like electricity. Instead of buying a generator to power up your house, you just purchase energy from the power company. You lose some control (e.g., no way to select the voltage), but this is fine in most cases. The great benefit is no need to buy and maintain the hardware.

Serverless compute does the same: it supplies standard services on a pay-per-use basis.

If we talk more specifically about Function-as-a-Service offerings like Azure Functions, they provide a standard model to run small pieces of code in the cloud. You zip up the code or binaries and send it to Azure; Microsoft takes care of all the hardware and software required to run it. The infrastructure automatically scales up or down based on demand, and you pay per request, CPU time and memory that the application consumed. No usage—no bill.

However, there's always a "but". FaaS services come with an opinionated development model that applications have to follow:

  • Event-Driven: for each serverless function you have to define a specific trigger—the event type which causes it to run, be it an HTTP endpoint or a queue message;

  • Short-Lived: functions can only run up to several minutes, and preferably for a few seconds or less;

  • Stateless: as you don't control where and when function instances are provisioned or deprovisioned, there is no way to store data within the process between requests reliably; external storage has to be utilized.

Frankly speaking, the majority of existing applications don't really fit into this model. If you are lucky enough to work on a new application (or a new module of one), you are in better shape.

Many serverless applications are designed to look somewhat like this example from the Serverless360 blog:

Serviceful Serverless Application

Sample application utilizing "serviceful" serverless architecture

There are 9 managed Azure services working together in this app. Most of them have a unique purpose, but the services are all glued together with Azure Functions. An image is uploaded to Blob Storage, an Azure Function calls the Vision API to recognize the license plate and sends the result to Event Grid, another Azure Function writes that event to Cosmos DB, and so on.

This style of cloud applications is sometimes referred to as Serviceful to emphasize the heavy usage of managed services "glued" together by serverless functions.

Creating a comparable application without any managed services would be a much harder task, even more so if the application has to run at scale. Moreover, there's no way to keep the pay-as-you-go pricing model in a self-managed world.

The application pictured above is still pretty straightforward. The processes in enterprise applications are often much more sophisticated.

Remember the quote from Martin Fowler about losing sight of the larger-scale flow. That was true for microservices, but it's even more true for the "nanoservices" of cloud functions.

I want to dive deeper and give you several examples of related problems.

Challenges of Serverless Composition

For the rest of the article, I'll define an imaginary business application for booking trips to software conferences. In order to go to a conference, I need to buy tickets to the conference itself, purchase the flights, and book a room at a hotel.

In this scenario, it makes sense to create three Azure Functions, each one responsible for one step of the booking process. As we prefer message passing, each Function emits an event which the next function can listen for:

Conference Booking Application

Conference booking application

This approach works; however, some problems do exist.

Flexible Sequencing

As we need to execute the whole booking process in sequence, the Azure Functions are wired one after another by configuring the output of one function to match with the event source of the downstream function.

In the picture above, the functions' sequence is hard-coded. If we were to swap the order of booking the flights and reserving the hotel, that would require a code change—at least of the input/output wiring definitions, but probably also of the functions' parameter types.
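To make the wiring concrete, here is a minimal sketch of what such a hard-coded link between two steps might look like. The queue names and the ConferenceRequest/FlightRequest types are assumptions made for this example; the point is that the output binding of one function targets exactly the queue that triggers the next one:

[FunctionName("BookConference")]
[return: Queue("book-flight-requests")]  // output lands on the queue that triggers the BookFlight function
public static FlightRequest BookConference(
    [QueueTrigger("book-conference-requests")] ConferenceRequest request)
{
    // ... call the booking service ...
    return new FlightRequest { /* dates, traveller info, ... */ };
}

Swapping the order of the steps means changing the queue names and the input/output types on both sides.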

In this case, are the functions really decoupled?

Error Handling

What happens if the Book Flight function becomes unhealthy, perhaps due to the outage of the third-party flight-booking service? Well, that's why we use asynchronous messaging: after the function execution fails, the message returns to the queue and is picked up again by another execution.

However, such retries happen almost immediately for most event sources. This might not be what we want: an exponential back-off policy could be a smarter idea. At this point, the retry logic becomes stateful: the next attempt should "know" the history of previous attempts to make a decision about retry timing.

There are more advanced error-handling patterns too. If execution failures are not intermittent, we may decide to cancel the whole process and run compensating actions against the already completed steps.

An example of this is a fallback action: if the flight is not possible (e.g., no routes for this origin-destination combination), the flow could choose to book a train instead:

Fallback On Error

Fallback after 3 consecutive failures

This scenario is not trivial to implement with stateless functions. We could wait until a message goes to the dead-letter queue and then route it from there, but this is brittle and not expressive enough.

Parallel Actions

Sometimes the business process doesn't have to be sequential. In our reservation scenario, there might be no difference whether we book a flight before a hotel or vice versa. It could be desirable to run those actions in parallel.

Parallel execution of actions is easy with the pub-sub capabilities of an event bus: both functions should subscribe to the same event and act on it independently.

The problem comes when we need to reconcile the outcomes of parallel actions, e.g., calculate the final price for expense reporting purposes:

Fan-out / Fan-in

Fan-out / fan-in pattern

There is no way to implement the Report Expenses block as a single Azure Function: functions can't be triggered by two events, let alone correlate two related events.

The solution would probably include two functions, one per event, and shared storage between them to pass information about the first completed booking to the one that completes last. All this wiring has to be implemented in custom code. The complexity grows if more than two functions need to run in parallel.

Also, don't forget the edge cases. What if one of the functions fails? How do you make sure there is no race condition when writing to and reading from the shared storage?

Missing Orchestrator

All these examples give us a hint that we need an additional tool to organize low-level single-purpose independent functions into high-level workflows.

Such a tool can be called an Orchestrator because its sole mission is to delegate work to stateless actions while maintaining the big picture and history of the flow.

Azure Durable Functions aims to provide such a tool.

Introducing Azure Durable Functions

Azure Functions

Azure Functions is the serverless compute service from Microsoft. Functions are event-driven: each function defines a trigger—the exact definition of the event source, for instance, the name of a storage queue.

Azure Functions can be programmed in several languages. A basic Function with a Storage Queue trigger implemented in C# would look like this:

[FunctionName("MyFirstFunction")]
public static void QueueTrigger(
    [QueueTrigger("myqueue-items")] string myQueueItem, 
    ILogger log)
{
    log.LogInformation($"C# function processed: {myQueueItem}");
}

The FunctionName attribute exposes the C# static method as an Azure Function named MyFirstFunction. The QueueTrigger attribute defines the name of the storage queue to listen to. The function body logs the information about the incoming message.

Durable Functions

Durable Functions is a library that brings workflow orchestration abstractions to Azure Functions. It introduces a number of idioms and tools to define stateful, potentially long-running operations, and manages a lot of mechanics of reliable communication and state management behind the scenes.

The library records the history of all actions in Azure Storage services, enabling durability and resilience to failures.

Durable Functions is open source, Microsoft accepts external contributions, and the community is quite active.

Currently, you can write Durable Functions in 3 programming languages: C#, F#, and Javascript (Node.js). All my examples are going to be in C#. For Javascript, check this quickstart and these samples. For F# see the samples, my walkthrough and stay tuned for another article soon.

Workflow-building functionality is achieved by introducing two additional types of functions, each with its own trigger: Activity Functions and Orchestrator Functions.

Activity Functions

Activity Functions are simple stateless single-purpose building blocks that do just one task and have no awareness of the bigger workflow. A new trigger type, ActivityTrigger, was introduced to expose functions as workflow steps, as I explain below.

Here is a simple Activity Function implemented in C#:

[FunctionName("BookConference")]
public static ConfTicket BookConference([ActivityTrigger] string conference)
{
    var ticket = BookingService.Book(conference);
    return new ConfTicket { Code = ticket };
}

It has a common FunctionName attribute to expose the C# static method as an Azure Function named BookConference. The name is important because it is used to invoke the activity from orchestrators.

The ActivityTrigger attribute defines the trigger type and points to the input parameter conference which the activity expects to get for each invocation.

The function can return a result of any serializable type; my sample function returns a simple property bag called ConfTicket.
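For reference, such a property bag could be as simple as the following sketch. The exact properties are an assumption for this example; the Dates property is the one the orchestrator later passes to the BookFlight activity:

public class ConfTicket
{
    public string Code { get; set; }      // confirmation code returned by the booking service
    public DateTime[] Dates { get; set; } // conference dates, consumed by the downstream BookFlight step
}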

Activity Functions can do pretty much anything: call other services, load and save data from/to databases, and use any .NET libraries.

Orchestrator Functions

The Orchestrator Function is a unique concept introduced by Durable Functions. Its sole purpose is to manage the flow of execution and data among several activity functions.

Its most basic form chains multiple independent activities into a single sequential workflow.

Let's start with an example which books a conference ticket, a flight itinerary, and a hotel room one-by-one:

Sequential Workflow

3 steps of a workflow executed in sequence

The implementation of this workflow is defined by another C# Azure Function, this time with OrchestrationTrigger:

[FunctionName("SequentialWorkflow")]
public static async Task Sequential([OrchestrationTrigger] DurableOrchestrationContext context)
{
    var conference = await context.CallActivityAsync<ConfTicket>("BookConference", "ServerlessDays");
    var flight = await context.CallActivityAsync<FlightTickets>("BookFlight", conference.Dates);
    await context.CallActivityAsync("BookHotel", flight.Dates);
}

Again, attributes are used to describe the function for the Azure runtime.

The only input parameter has type DurableOrchestrationContext. This context is the tool that enables the orchestration operations.

In particular, the CallActivityAsync method is used three times to invoke three activities one after the other. The method body looks very typical for any C# code working with a Task-based API. However, the behavior is entirely different. Let's have a look at the implementation details.

Behind the Scenes

Let's walk through the lifecycle of one execution of the sequential workflow above.

When the orchestrator starts running, the first CallActivityAsync invocation is made to book the conference ticket. What actually happens here is that a queue message is sent from the orchestrator to the activity function.

The corresponding activity function gets triggered by the queue message. It does its job (books the ticket) and returns the result. The activity function serializes the result and sends it as a queue message back to the orchestrator:

Durable Functions: Message Passing

Messaging between the orchestrator and the activity

When the message arrives, the orchestrator gets triggered again and can proceed to the second activity. The cycle repeats—a message gets sent to Book Flight activity, it gets triggered, does its job, and sends a message back to the orchestrator. The same message flow happens for the third call.

Stop-resume behavior

As discussed earlier, message passing is intended to decouple the sender and receiver in time. For every message in the scenario above, no immediate response is expected.

On the C# level, when the await operator is executed, the code doesn't block the execution of the whole orchestrator. Instead, it just quits: the orchestrator stops being active and its current step completes.

Whenever a return message arrives from an activity, the orchestrator code restarts. It always starts with the first line. Yes, this means that the same line is executed multiple times: up to the number of messages to the orchestrator.

However, the orchestrator stores the history of its past executions in Azure Storage, so the effect of the second pass over the first line is different: instead of sending a message to the activity, the orchestrator already knows the result of that activity, so await returns this result and assigns it to the conference variable.

Because of these "replays", the orchestrator's implementation has to be deterministic: don't use DateTime.Now, random numbers, or multi-threaded operations; more details here.
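For instance, if the workflow needs the current time, it should come from the orchestration context rather than from the system clock. A minimal sketch, assuming the v1 DurableOrchestrationContext API:

// Inside an orchestrator: CurrentUtcDateTime is recorded in the history and replayed
// consistently, while DateTime.UtcNow would return a different value on every replay.
var bookingTime = context.CurrentUtcDateTime;  // instead of DateTime.UtcNow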

Event Sourcing

Azure Functions are stateless, while workflows require state to keep track of their progress. Every time a new action towards the workflow's execution happens, the framework automatically records an event in table storage.

Whenever an orchestrator restarts the execution because a new message arrives from its activity, it loads the complete history of this particular execution from storage. The Durable context uses this history to decide whether to call the activity or to return the previously stored result.

The pattern of storing the complete history of state changes as an append-only event store is known as Event Sourcing. An event store provides several benefits:

  • Durability—if a host running an orchestration fails, the history is retained in persistent storage and is loaded by the new host where the orchestration restarts;
  • Scalability—append-only writes are fast and easy to spread over multiple storage servers;
  • Observability—no history is ever lost, so it's straightforward to inspect and analyze even after the workflow is complete.

Here is an illustration of the notable events that get recorded during our sequential workflow:

Durable Functions: Event Sourcing

Log of events in the course of orchestrator progression

Billing

Azure Functions on the serverless consumption-based plan are billed per execution + per duration of execution.

The stop-replay behavior of durable orchestrators causes the single workflow "instance" to execute the same orchestrator function multiple times. This also means paying for several short executions.

However, the total bill usually ends up being much lower compared to the potential cost of blocking synchronous calls to activities. The price of 5 executions of 100 ms each is significantly lower than the cost of 1 execution of 30 seconds.
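To make the comparison concrete, here is a rough calculation, assuming the consumption-plan rates at the time of writing (about $0.000016 per GB-second plus $0.20 per million executions) and the minimum 128 MB memory allocation:

5 executions * 0.1 s * 0.125 GB = 0.0625 GB-s
1 execution * 30 s * 0.125 GB = 3.75 GB-s

The per-execution charge is negligible at this scale, and the duration charge of the five short runs is roughly 60 times lower than that of the single long-running execution.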

By the way, the first million executions per month are at no charge, so many scenarios incur no cost at all from Azure Functions service.

Another cost component to keep in mind is Azure Storage. Queues and Tables that are used behind the scenes are charged to the end customer. In my experience, this charge remains close to zero for low- to medium-load applications.

Beware of unintentional eternal loops or indefinite recursive fan-outs in your orchestrators. Those can get expensive if left unchecked.

Error-handling and retries

What happens when an error occurs somewhere in the middle of the workflow? For instance, a third-party flight booking service might not be able to process the request:

Error Handling

One activity is unhealthy

This situation is expected by Durable Functions. Instead of silently failing, the activity function sends a message containing the information about the error back to the orchestrator.

The orchestrator deserializes the error details and, at the time of replay, throws a .NET exception from the corresponding call. The developer is free to put a try .. catch block around the call and handle the exception:

[FunctionName("SequentialWorkflow")]
public static async Task Sequential([OrchestrationTrigger] DurableOrchestrationContext context)
{
    var conf = await context.CallActivityAsync<ConfTicket>("BookConference", "ServerlessDays");
    FlightTickets flight;
    try
    {
        var itinerary = MakeItinerary(/* ... */);
        flight = await context.CallActivityAsync<FlightTickets>("BookFlight", itinerary);
    }
    catch (FunctionFailedException)
    {
        // The first attempt failed, so fall back to an alternative itinerary
        var alternativeItinerary = MakeAnotherItinerary(/* ... */);
        flight = await context.CallActivityAsync<FlightTickets>("BookFlight", alternativeItinerary);
    }
    await context.CallActivityAsync("BookHotel", flight.Dates);
}

The code above falls back to a "backup plan" of booking another itinerary. Another typical pattern would be to run a compensating activity to cancel the effects of any previous actions (un-book the conference in our case) and leave the system in a clean state.
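Continuing the example above, the catch block could instead run a compensating activity. The CancelConference activity name is hypothetical and stands for whatever undo logic the booking service offers:

    catch (FunctionFailedException)
    {
        // Compensate: undo the already completed step and give up on this booking
        await context.CallActivityAsync("CancelConference", conf);
        throw;
    }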

Quite often, the error might be transient, so it might make sense to retry the failed operation after a pause. It's such a common scenario that Durable Functions provides a dedicated API:

var options = new RetryOptions(
    firstRetryInterval: TimeSpan.FromMinutes(1),                    
    maxNumberOfAttempts: 5);
options.BackoffCoefficient = 2.0;

await context.CallActivityWithRetryAsync("BookFlight", options, itinerary);

The above code instructs the library to

  • Retry up to 5 times
  • Wait for 1 minute before the first retry
  • Increase delays before every subsequent retry by the factor of 2 (1 min, 2 min, 4 min, etc.)

The significant point is that, once again, the orchestrator does not block while awaiting retries. After a failed call, a message is scheduled for the moment in the future to re-run the orchestrator and retry the call.

Sub-orchestrators

Business processes may consist of numerous steps. To keep the code of orchestrators manageable, Durable Functions allows nested orchestrators. A "parent" orchestrator can call out to child orchestrators via the context.CallSubOrchestratorAsync method:

[FunctionName("CombinedOrchestrator")]
public static async Task CombinedOrchestrator([OrchestrationTrigger] DurableOrchestrationContext context)
{
    await context.CallSubOrchestratorAsync("BookTrip", serverlessDaysAmsterdam);
    await context.CallSubOrchestratorAsync("BookTrip", serverlessDaysHamburg);
}

The code above books two conferences, one after the other.

Fan-out / Fan-in

What if we want to run multiple activities in parallel?

For instance, in the example above, we could wish to book two conferences, but the booking order might not matter. Still, when both bookings are completed, we want to combine the results to produce an expense report for the finance department:

Parallel Calls

Parallel calls followed by a final step

In this scenario, the BookTrip orchestrator accepts an input parameter with the name of the conference and returns the expense information. ReportExpenses needs to receive both expenses combined.

This goal can be easily achieved by scheduling two tasks (i.e., sending two messages) without awaiting them separately. We use the familiar Task.WhenAll method to await both and combine the results:

[FunctionName("ParallelWorkflow")]
public static async Task Parallel([OrchestrationTrigger] DurableOrchestrationContext context)
{
    // Expense is assumed to be the serializable type that the BookTrip sub-orchestrator returns
    var amsterdam = context.CallSubOrchestratorAsync<Expense>("BookTrip", serverlessDaysAmsterdam);
    var hamburg   = context.CallSubOrchestratorAsync<Expense>("BookTrip", serverlessDaysHamburg);

    var expenses = await Task.WhenAll(amsterdam, hamburg);

    await context.CallActivityAsync("ReportExpenses", expenses);
}

Remember that awaiting the WhenAll method doesn't synchronously block the orchestrator. It quits the first time and then restarts twice, as the completion messages arrive from the sub-orchestrators. The first restart quits again, and only the second restart makes it past the await.

Task.WhenAll returns an array of results (one result per each input task), which is then passed to the reporting activity.

Another example of parallelization could be a workflow sending e-mails to hundreds of recipients. Such fan-out wouldn't be hard with normal queue-triggered functions: simply send hundreds of messages. However, combining the results, if required for the next step of the workflow, is quite challenging.

It's straightforward with a durable orchestrator:

var emailSendingTasks =
    recipients
    .Select(to => context.CallActivityAsync<bool>("SendEmail", to))
    .ToArray();

var results = await Task.WhenAll(emailSendingTasks);

if (results.All(r => r)) { /* ... */ }

Making hundreds of round trips to activities could cause numerous replays of the orchestrator. As an optimization, if multiple activity functions complete around the same time, the orchestrator may internally process several messages as a batch and restart the orchestrator function only once per batch.

Other Concepts

There are many more patterns enabled by Durable Functions. Here is a quick list to give you some perspective:

  • Waiting for the first completed task in a collection (rather than all of them) using the Task.WhenAny method. Useful for scenarios like timeouts or competing actions.
  • Pausing the workflow for a given period or until a deadline.
  • Waiting for external events, e.g., bringing human interaction into the workflow.
  • Running recurring workflows, when the flow repeats until a certain condition is met.

Further explanation and code samples are in the docs.
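To give a flavor of the first three items, here is a minimal sketch, assuming the v1 DurableOrchestrationContext API, of an orchestrator that waits for a hypothetical "Approved" external event but gives up after a 24-hour timer:

[FunctionName("ApprovalWorkflow")]
public static async Task<bool> Approval([OrchestrationTrigger] DurableOrchestrationContext context)
{
    using (var cts = new CancellationTokenSource())
    {
        // A durable timer and an external event race against each other
        var deadline = context.CurrentUtcDateTime.AddHours(24);
        var timeoutTask = context.CreateTimer(deadline, cts.Token);
        var approvalTask = context.WaitForExternalEvent<bool>("Approved");

        var winner = await Task.WhenAny(approvalTask, timeoutTask);
        if (winner == approvalTask)
        {
            cts.Cancel(); // cancel the timer so the orchestration can complete
            return approvalTask.Result;
        }
        return false; // the deadline passed without an approval event
    }
}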

Conclusion

I firmly believe that serverless applications utilizing a broad range of managed cloud services are highly beneficial to many companies, due to both the rapid development process and the properly aligned billing model.

Serverless tech is still young; more high-level architectural patterns need to emerge to enable expressive and composable implementations of large business systems.

Azure Durable Functions suggests some of the possible answers. It combines the clarity and readability of sequential RPC-style code with the power and resilience of event-driven architecture.

The documentation for Durable Functions is excellent, with plenty of examples and how-to guides. Learn it, try it for your real-life scenarios, and let me know your opinion—I'm excited about the serverless future!

Acknowledgments

Many thanks to Katy Shimizu, Chris Gillum, Eric Fleming, KJ Jones, William Liebenberg, Andrea Tosato for reviewing the draft of this article and their valuable contributions and suggestions. The community around Azure Functions and Durable Functions is superb!

From 0 to 1000 Instances: How Serverless Providers Scale Queue Processing

Originally published at Binaris Blog

Whenever I see a "Getting Started with Function-as-a-Service" tutorial, it usually shows off a synchronous HTTP-triggered scenario. In my projects, though, I use a lot of asynchronous functions triggered by a queue or an event stream.

Quite often, the number of messages passing through a queue isn't uniform over time. Batches of work might arrive all at once now and then. My app may get piles of queue items from upstream systems that were down or under maintenance for an extended period. The system might see rush-hour peaks every day or only a few busy days per month.

This is where serverless tech shines: you pay per execution, and the promise is that the provider takes care of scaling up or down for you. Today, I want to put this scalability to the test.

The goal of this article is to explore queue-triggered serverless functions and hopefully distill some practical advice regarding asynchronous functions for real projects. I will be evaluating the problem:

  • Across Big-3 cloud providers (Amazon, Microsoft, Google)
  • For different types of workloads
  • For different performance tiers

Let's see how I did that and what the outcome was.

DISCLAIMER. Performance testing is hard. I might be missing some crucial factors and parameters that influence the outcome. My interpretation might be wrong. The results might change over time. If you happen to know a way to improve my tests, please let me know, and I will re-run them and re-publish the results.

Methodology

In this article I analyze the execution results of the following cloud services:

  • AWS Lambda triggered via SQS queues
  • Azure Function triggered via Storage queues
  • Google Cloud Function triggered via Cloud Pub/Sub

All functions are implemented in Javascript and run on the GA runtime.

At the beginning of each test, I threw 100,000 messages into a queue that was previously idle. Enqueuing never took longer than one minute (I sent the messages from multiple clients in parallel).

I disabled any batch processing, so each message was consumed by a separate function invocation.

I then analyzed the logs (AWS CloudWatch, Azure Application Insights, and GCP Stackdriver Logging) to generate charts of execution distribution over time.

How Scaling Actually Works

To understand the experiment better, let's look at a very simplistic but still useful model of how cloud providers scale serverless applications.

All providers handle the increased load by scaling out, i.e., by creating multiple instances of the same application that execute the chunks of work in parallel.

In theory, a cloud provider could spin up an instance for each message in the queue as soon as the messages arrive. The backlog processing time would then stay very close to zero.

In practice, allocating instances is not cheap. The cloud provider has to boot up the function runtime, pay the cold-start penalty, and potentially waste expensive resources on a job that may take just a few milliseconds.

So the providers are trying to find a sweet spot between handling the work as soon as possible and using resources efficiently. The outcomes differ, which is the point of my article.

AWS

AWS Lambda defines scale-out in terms of Concurrent Executions. Each instance of your AWS Lambda handles a single execution at any given time. In our case, it's processing a single SQS message.

It's helpful to think of a function instance as a container working on a single task. If execution pauses or waits for an external I/O operation, the instance is on hold.

The model of concurrent executions is universal to all trigger types supported by Lambdas. An instance doesn't work with event sources directly; it just receives an event to work on.

There is a central element in the system, let's call it "Orchestrator". The Orchestrator is the component talking to an SQS queue and getting the messages from it. It's then the job of the Orchestrator and related infrastructure to provision the required number of instances for working on concurrent executions:

Model of AWS Lambda Scale-Out

As to scaling behavior, here is what the official AWS docs say:

AWS Lambda automatically scales up ... until the number of concurrent function executions reaches 1000 ... Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute.

GCP

The model of Google Cloud Functions is very similar to what AWS does. It runs a single simultaneous execution per instance and routes the messages centrally.

I wasn't able to find any scaling specifics except the definition of Function Quotas.

Azure

Experiments with Azure Functions were run on the Consumption Plan—the dynamically scaled and billed-per-execution runtime. The concurrency model of Azure Functions is different from its AWS/GCP counterparts.

A Function App instance is closer to a VM than to a single-task container. It runs multiple concurrent executions in parallel. Equally importantly, it pulls messages from the queue on its own instead of getting them pushed from a central Orchestrator.

There is still a central coordinator called Scale Controller, but its role is a bit more subtle. It connects to the same data source (the queue) and needs to determine how many instances to provision based on the metrics from that queue:

Model of Azure Function Scale-Out

This model has pros and cons. If one execution is idle, waiting for some I/O operation such as an HTTP request to finish, the instance might become busy processing other messages, thus being more efficient. Running multiple executions is useful in terms of shared resource utilization, e.g., keeping database connection pools and reusing HTTP connections.

On the flip side, the Scale Controller now needs to be smarter: to know not only the queue backlog but also how instances are doing and at what pace they are processing the messages. It's probably achievable based on queue telemetry though.

Let's start applying this knowledge in practical experiments.

Pause-the-World Workload

My first serverless function aims to simulate I/O-bound workloads without using external dependencies, to keep the experiment clean. Therefore, the implementation is extremely straightforward: pause for 500 ms and return.

It could be loading data from a scalable third-party API. It could be running a database query. Instead, it just runs setTimeout.

I sent 100k messages to queues of all three cloud providers and observed the result.

AWS

AWS Lambda allows multiple instance sizes to be provisioned. Since the workload is neither CPU- nor memory-intensive, I was using the smallest memory allocation of 128 MB.

Here comes the first chart of many, so let's learn to read it. The horizontal axis shows time in minutes since all the messages were sent to the queue.

The line going from top-left to bottom-right shows the decreasing queue backlog. Accordingly, the left vertical axis denotes the number of items still to be handled.

The bars show the number of concurrent executions crunching the messages at a given time. Every execution logs the instance ID so that I could derive the instance count from the logs. The right vertical axis shows the instance count.

AWS Lambda processing 100k SQS messages with "Pause" handler

It took AWS Lambda 5.5 minutes to process the whole batch of 100k messages. For comparison, the same batch processed sequentially would take about 14 hours.

Notice how linear the growth of instance count is. If I apply the official scaling formula:

Instance Count = 5 + Minutes * 60 = 5 + 5.5 * 60  = 335

We get a very close result! Promises kept.

GCP

Same function, same chart, same instance size of 128 MB of RAM—but this time for Google Cloud Functions:

Google Cloud Function processing 100k Pub/Sub messages with "Pause" handler

Coincidentally, the total number of instances at the end was very close to AWS's. The scaling pattern looks entirely different though: within the very first minute, there was a burst of scaling to nearly 300 instances, and then the growth became very modest.

Thanks to this initial jump, GCP managed to finish processing almost one minute earlier than AWS.

Azure

Azure Functions doesn't offer a setting for allocated memory or any other instance-size parameters.

The shape of the chart for Azure Functions is very similar, but the instance number growth is significantly different:

Azure Function processing 100k queue messages with "Pause" handler

The total processing time was a bit faster than AWS and somewhat slower than GCP. Azure Function instances process several messages in parallel, so far fewer of them are needed to do the same amount of work.

The instance count seems to grow linearly rather than in bursts.

What we learned

Based on this simple test, it's hard to say if one cloud provider handles scale-out better than the others.

It looks like all serverless platforms under stress are making decisions at the resolution of 5-15 seconds, so the backlog processing delays are likely to be measured in minutes. It sounds quite far from the theoretical "close to zero" target but is most likely good enough for the majority of applications.

Crunching Numbers

That was an easy job though. Let's give cloud providers a hard time by executing CPU-heavy workloads and see if they survive!

This time, each message handler calculates a Bcrypt hash with a cost of 10. One such calculation takes about 200 ms on my laptop.

AWS

Once again, I sent 100k messages to an SQS queue and recorded the processing speed and instance count.

Since the workload is CPU-bound, and AWS allocates CPU shares proportionally to the allocated memory, the instance size might have a significant influence on the result.

I started with the smallest memory allocation of 128 MB:

AWS Lambda (128 MB) processing 100k SQS messages with "Bcrypt" handler

This time it took almost 10 minutes to complete the experiment.

The scaling shape is pretty much the same as last time, still correctly described by the formula 60 * Minutes + 5. However, because AWS allocates a small fraction of a full CPU to each 128 MB execution, one message takes around 1,700 ms to complete. Thus, the total work increased approximately by the factor of 3 (47 hours if done sequentially).

At the peak, 612 concurrent executions were running, nearly double the amount in our initial experiment. So, the total processing time increased only by the factor of 2—up to 10 minutes.

Let's see if larger Lambda instances would improve the outcome. Here is the chart for 512 MB of allocated memory:

AWS Lambda (512 MB) processing 100k SQS messages with "Bcrypt" handler

And yes, it does. The average execution duration is down to 400 ms: 4 times lower, as expected. The scaling shape still holds, so the entire batch was done in less than four minutes.

GCP

I executed the same experiment on Google Cloud Functions. I started with 128 MB, and it looks impressive:

Google Cloud Function (128 MB) processing 100k Pub/Sub messages with "Bcrypt" handler

The average execution duration is very close to Amazon's: 1,600 ms. However, GCP scaled more aggressively—to a staggering 1,169 parallel executions! Scaling also has a different shape: It's not linear but grows in steep jumps. As a result, it took less than six minutes on the lowest CPU profile—very close to AWS's time on a 4x more powerful CPU.

What will GCP achieve on a faster CPU? Let's provision 512 MB. It must absolutely crush the test. Umm, wait, look at that:

Google Cloud Function (512 MB) processing 100k Pub/Sub messages with "Bcrypt" handler

It actually... got slower. Yes, the average execution time is 4x lower: 400 ms, but the scaling got much less aggressive too, which canceled the speedup.

I confirmed it with the largest instance size of 2,048 MB:

Google Cloud Function (2 GB) processing 100k Pub/Sub messages with "Bcrypt" handler

The CPU is fast: 160 ms average execution time, but the total time to process 100k messages went up to eight minutes. Beyond the initial spike in the first minute, it failed to scale up any further and stayed at about 110 concurrent executions.

It seems that GCP is not that keen to scale out larger instances. It's probably easier to find many small instances available in the pool than a similar number of giant instances.

Azure

A single invocation takes about 400 ms to complete on Azure Functions. Here is the burndown chart:

Azure Function processing 100k queue messages with "Bcrypt" handler

Azure spent 21 minutes processing the whole backlog. The scaling was linear, similar to AWS, but the instance count grew at a much slower pace, about 2.5 * Minutes.

As a reminder, each instance could process multiple queue messages in parallel, but each such execution would be competing for the same CPU resource, which doesn't help for the purely CPU-bound workload.

Practical Considerations

Time for some conclusions and pieces of advice to apply in real serverless applications.

Serverless is great for async data processing

If you are already using cloud services, such as managed queues and topics, serverless functions are the easiest way to consume them.

Moreover, the scalability is there too. When was the last time you ran 1,200 copies of your application?

Serverless is not infinitely scalable

There are limits. Your functions won't scale perfectly to accommodate your spike—a provider-specific algorithm will determine the scaling pattern.

If you have large spikes in queue workloads, which is quite likely for medium- to high-load scenarios, you can and should expect delays of up to several minutes before the backlog is fully digested.

All cloud providers have quotas and limits that define an upper boundary of scalability.

Cloud providers have different implementations

AWS Lambda seems to have a very consistent and well-documented linear scale growth for SQS-triggered Lambda functions. It will happily scale to 1,000 instances, or whatever other limit you hit first.

Google Cloud Functions has the most aggressive scale-out strategy for the smallest instance sizes. It can be a cost-efficient and scalable way to run your queue-based workloads. Larger instances seem to scale in a more limited way, so a further investigation is required if you need those.

Azure Functions share instances for multiple concurrent executions, which works better for I/O-bound workloads than for CPU-bound ones. Depending on the exact scenario that you have, it might help to play with instance-level settings.
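For example, on the v1 runtime the per-instance concurrency for storage-queue triggers is controlled in host.json. The values below are only a sketch to show where the knob lives, not a recommendation:

{
  "queues": {
    "batchSize": 16,
    "newBatchThreshold": 8
  }
}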

Don't forget batching

For the tests, I was handling queue messages one by one. In practice, it helps if you can batch several messages together and execute a single action for all of them in one go.

If the destination for your data supports batched operations, the throughput will usually increase immensely. Processing 100,000 Events Per Second on Azure Functions is an excellent case to prove the point.
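As a sketch of what batching can look like on Azure Functions, an Event Hubs trigger can hand a whole array of events to a single invocation, which then performs one bulk operation against the destination. The hub name, connection setting, and the Destination.SaveBatchAsync call are placeholders for this example:

[FunctionName("BatchedHandler")]
public static async Task Run(
    [EventHubTrigger("telemetry", Connection = "EventHubConnection")] string[] events,
    ILogger log)
{
    // One bulk call to the destination instead of one call per event
    await Destination.SaveBatchAsync(events);

    log.LogInformation($"Processed {events.Length} events in one invocation");
}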

You might get too much scale

A month ago, Troy Hunt published a great post, Breaking Azure Functions with Too Many Connections. His scenario looks very familiar: he uses queue-triggered Azure Functions to notify subscribers about data breaches. One day, he dropped 126 million items into the queue, and Azure scaled out, which overloaded Mozilla's servers and caused them to start timing out.

Another consideration is that non-serverless dependencies limit the scalability of your serverless application. If you call a legacy HTTP endpoint, a SQL database, or a third-party web service—be sure to test how they react when your serverless function scales out to hundreds of concurrent executions.

Stay tuned for more serverless performance goodness!

Azure Functions V2 Is Released, How Performant Is It?

Azure Functions major version 2.0 was released into GA a few days back during Microsoft Ignite. The runtime is now based on .NET Core and thus is cross-platform and more interoperable. It has a nice extensibility story too.

In theory, the .NET Core runtime is leaner and more performant. But when I last checked, back in April, the preview version of Azure Functions V2 had some serious issues with cold start durations.

I decided to give the new and shiny version another try and ran several benchmarks. All tests were conducted on the Consumption plan.

TL;DR: it's not perfect just yet.

Cold Starts

Cold starts happen when a new instance handles its first request; see my other posts: one, two, three.

Hello World

The following chart gives a comparison of V1 vs V2 cold starts for the two most popular runtimes: .NET and Javascript. The dark bar shows the most probable range of values, while the light ones are possible but less frequent:

Cold Starts V1 vs V2: .NET and Javascript

Apparently, V2 is slower to start for both runtimes. V2 on .NET is slower by 10% on average and seems to have higher variation. V2 on Javascript is massively slower: 2 times on average, and the slowest startup time goes above 10 seconds.

Dependencies On Board

The values for the previous chart were calculated for Hello-World-type functions with no extra dependencies.

The chart below shows two more Javascript functions, this time with a decent number of dependencies:

  • Referencing 3 NPM packages - 5 MB zipped
  • Referencing 38 NPM packages - 35 MB zipped

Cold Starts V1 vs V2: Javascript with NPM dependencies

V2 clearly loses on both samples, but the V2-V1 difference seems to stay within 2.5-3 seconds regardless of the number of dependencies.

All the functions were deployed with the Run-from-Package method, which promises faster startup times.

Java

Functions V2 comes with a preview of a new runtime: Java/JVM. It utilizes the same extensibility model as Javascript, and thus it seems to be a first-class citizen now.

Cold starts are not first-class though:

Cold Starts Java

If you are a Java developer, be prepared for 20-25 seconds of initial startup time. That will probably be resolved when the Java runtime becomes generally available.

Queue Processor

Cold starts are most problematic for synchronous triggers like HTTP requests. They are less relevant for queue-based workloads, where scale-out is of higher importance.

Last year I ran some tests around the ability of Functions to keep up with variable queue load: one, two.

Today I ran two simple tests to compare the scalability of V1 vs. V2 runtimes.

Pause-and-Go

In my first test, a lightweight Javascript Function processed messages from an Azure Storage Queue. For each message, it just paused for 500 ms and then completed. This is supposed to simulate I/O-bound Functions.

I've sent 100,000 messages to the queue and measured how fast they went away. Batch size (degree of parallelism on each instance) was set to 16.

Processing Queue Messages with Lightweight I/O Workload

Two lines show the queue backlogs of two runtimes, while the bars indicate the number of instances working in parallel at a given minute.

We see that V2 was a bit faster to complete, probably due to more instances provisioned to it at any moment. The difference is not big though and might be statistically insignificant.

CPU at Work

The Functions in my second experiment are CPU-bound. Each message triggers the calculation of a Bcrypt hash with a work factor of 10. In a quiet moment, one such call takes about 300-400 ms to complete, consuming 100% CPU on a single core.

Both Functions are precompiled .NET and both are using Bcrypt.NET.

Batch size (degree of parallelism on each instance) was set to 2 to avoid too much fighting for the same CPU. Yet, the average call duration is about 1.5 seconds (3x slower than possible).
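For reference, the handler could look roughly like the following sketch, assuming the BCrypt.Net package and a storage-queue trigger; the queue name is made up for the example:

[FunctionName("BcryptWorker")]
public static void Run([QueueTrigger("bcrypt-items")] string message, ILogger log)
{
    // A Bcrypt hash with work factor 10 keeps a single core busy for a few hundred milliseconds
    var salt = BCrypt.Net.BCrypt.GenerateSalt(10);
    var hash = BCrypt.Net.BCrypt.HashPassword(message, salt);
    log.LogInformation($"Hashed the incoming message to {hash}");
}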

Processing Queue Messages with CPU-bound Workload

The first thing to notice: it's the same number of messages with comparable "sequential" execution time, but the total time to complete the job increased 3-fold. That's because the workload is much more demanding on the resources of the application instances, and they struggle to parallelize the work more aggressively.

V1 and V2 are again close to each other. Once more, V2 had more instances allocated to it most of the time. And yet, it seemed to be consistently slower and lost about 2.5 minutes over a 25-minute interval (~10%).

HTTP Scalability

I ran two similar Functions — I/O-bound "Pause" (~100 ms) and CPU-bound Bcrypt (work factor 9, ~150 ms) — under a stress test. But this time they were triggered by HTTP requests. Then I compared the results for V1 and V2.

Pause-and-Go

The grey bars on the following charts represent the rate of requests sent and processed within a given minute.

The lines are percentiles of response time: green lines for V2 and orange lines for V1.

Processing HTTP Requests with Lightweight I/O Workload

Yes, you read that right: my Azure Functions were processing 100,000 requests per minute at peak. That's a lot of requests.

Apart from the initial spike at minutes 2 and 3, both versions performed pretty close to each other.

The 50th percentile is flat, close to the theoretical minimum of 100 ms, while the 95th percentile fluctuates a bit but still mostly stays quite low.

Note that the response time is measured from the client perspective, not by looking at the statistics provided by Azure.

CPU Fanout

How did CPU-heavy workload perform?

To skip ahead, I must say that the response time increased much more significantly, so my sample clients were not able to generate request rates of 100k per minute. They "only" did about 48k per minute at peak, which still seems massive to me.

I ran the same test twice: once for Bcrypt implemented in .NET, and once for Javascript.

Processing HTTP Requests with .NET CPU-bound Workload

V2 had a real struggle during the first minute, when response times got terribly slow, up to 9 seconds.

Looking at the bold green 50th percentile, we can see that it's consistently higher than the orange one throughout the load growth period of the first 10 minutes. V2 seemed to have a harder time adjusting.

This might be explained by the slower growth of the instance count:

Instance Count Growth while Processing HTTP Requests with .NET CPU-bound Workload

This difference could be totally random, so let's look at a similar test with a Javascript worker. Here is the percentile chart:

Processing HTTP Requests with Javascript CPU-bound Workload

The original slowness of the first 3 minutes is still there, but after that V2 and V1 are on par.

On par doesn't sound that great, though, given the significant edge in the number of allocated instances, in favor of V2 this time:

Instance Count Growth while Processing HTTP Requests with Javascript CPU-bound Workload

A massive 147 instances were crunching Bcrypt hashes in Javascript V2, and that made it only a bit faster to respond than V1.

Conclusion

As always, be reluctant to draw definite conclusions based on simplistic benchmarks. But I see some trends which might be true as of today:

  • Performance of .NET Functions is comparable across two versions of Functions runtimes;
  • V1 still has a clear edge in the cold start time of Javascript Functions;
  • V2 is the only option for Java developers, but be prepared for very slow cold starts;
  • Scale-out characteristics seem to be independent of the runtime version, although there are blurry signs of V2 being a bit slower to ramp up or slightly more resource hungry.

I hope this helps in your serverless journey!

Serverless: Cold Start War

Serverless cloud services are hot. Except when they are not :)

AWS Lambda, Azure Functions, Google Cloud Functions are all similar in their attempt to enable rapid development of cloud-native serverless applications.

Auto-provisioning and auto-scalability are the killer features of those Function-as-a-Service cloud offerings. No management is required: the cloud provider delivers the infrastructure based on the actual incoming load.

One drawback of such dynamic provisioning is a phenomenon called "cold start". Basically, applications that haven't been used for a while take longer to start up and to handle the first request.

Cloud providers keep a bunch of generic, unspecialized workers in stock. Whenever a serverless application needs to scale up, be it from 0 to 1 instances or from N to N+1, the runtime picks one of the spare workers and configures it to serve the named application:

Cold Start

This procedure takes time, so the latency of the application's event handling increases. To avoid doing this for every event, the specialized worker is kept around for some period of time. When another event comes in, this worker is available to process it as soon as possible. This is a "warm start":

Warm Start

The problem of cold start latency has been described multiple times in other articles.

The goal of my article today is to explore cold starts in more detail:

  • How they compare across the Big-3 cloud providers (Amazon, Microsoft, Google)
  • How they differ between languages and runtimes
  • How they change for smaller vs. larger applications (including dependencies)
  • How often cold starts happen
  • What can be done to optimize cold starts

Let's see how I did that and what the outcome was.

DISCLAIMER. Performance testing is hard. I might be missing some important factors and parameters that influence the outcome. My interpretation might be wrong. The results might change over time. If you happen to know a way to improve my tests, please let me know and I will re-run them and re-publish the results.

Methodology

All tests were run against HTTP Functions because that's where cold start matters the most.

All the functions returned a simple JSON payload reporting their current instance ID, language, etc. Some functions also loaded extra dependencies; see below.

I did not rely on the execution time reported by the cloud provider. Instead, I measured the end-to-end duration from the client's perspective. This means that the duration of the HTTP gateway (e.g. API Gateway in the case of AWS) is included in the total. However, all calls were made from within the same region, so network latency should have minimal impact:

Test Setup

Important note: I ran all my tests on GA (generally available) versions of services/languages, so e.g. Azure tests were done with version 1 of Functions runtime (.NET Framework), and GCP tests were only made for Javascript runtime.
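
For illustration, the client-side measurement boils down to something like this. It's a minimal TypeScript sketch with a hypothetical endpoint URL; the real harness also recorded the instance ID returned in the response body:

import fetch from "node-fetch";

// Measure one end-to-end call as the client sees it: the HTTP gateway, any
// cold start, and the function execution are all included in the elapsed time.
async function measureOnce(url: string): Promise<number> {
    const started = Date.now();
    const response = await fetch(url);
    await response.json();                 // wait for the full response body
    return Date.now() - started;
}

measureOnce("https://example-function.azurewebsites.net/api/hello")
    .then(ms => console.log(`End-to-end duration: ${ms} ms`));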

When Does Cold Start Happen?

Obviously, a cold start happens when the very first request comes in. After that request is processed, the instance is kept alive in case subsequent requests arrive. But for how long?

The answer differs between cloud providers.

To help you read the charts in this section, I've marked cold starts with blue dots and warm starts with orange dots.
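
The probing loop behind these charts is conceptually simple. Here is a hedged sketch (not my exact scheduler): pause for a random number of minutes, fire a single request, and record the idle time together with the response duration:

import fetch from "node-fetch";

// Wait a random idle interval, probe the function once, and log how the response
// duration relates to the idle time. The 1..60 minute range is an assumption.
async function probeIdleExpiration(url: string, samples: number): Promise<void> {
    for (let i = 0; i < samples; i++) {
        const idleMinutes = Math.floor(Math.random() * 60) + 1;
        await new Promise(resolve => setTimeout(resolve, idleMinutes * 60 * 1000));
        const started = Date.now();
        await fetch(url);
        console.log(`${idleMinutes} min idle -> ${Date.now() - started} ms`);
    }
}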

Azure

Here is the chart for Azure. It shows the values of normalized request durations across different languages and runtime versions (Y-axis) depending on the time since the previous request in minutes (X-axis):

Azure Cold Start Threshold

Clearly, an idle instance lives for 20 minutes and then gets recycled. Any request arriving after the 20-minute threshold hits another cold start.

AWS

AWS is more tricky. Here is the same kind of chart, relative durations vs time since the last request, measured for AWS Lambda:

AWS Cold Start vs Warm Start

There's no clear threshold here... For this sample, no cold starts happened within 28 minutes after the previous invocation. Afterward, the frequency of cold starts slowly rises. But even after 1 hour of inactivity, there's still a good chance that your instance is alive and ready to take requests.

This doesn't match the official information that AWS Lambdas stay alive for just 5 minutes after the last invocation. I reached out to Chris Munns, and he confirmed that the behavior has recently changed.

A couple of learning points here:

  • AWS is working on improving the cold start experience (and Azure/GCP probably are too)
  • My results might not be reliably reproducible in your application, since the behavior is affected by recent adjustments

GCP

Google Cloud Functions left me completely puzzled. Here is the same chart for GCP cold starts (again, orange dots are warm and blue ones are cold):

GCP Cold Start vs Warm Start

This looks totally random to me. A cold start can happen just 3 minutes after the previous request, or an instance can be kept alive for a whole hour. The probability of a cold start doesn't seem to depend on the idle interval, at least judging by this chart.

Any ideas about what's going on are welcome!

Parallel Requests

Cold starts happen not only when the first instance of an application is provisioned. The same delay occurs whenever all the provisioned instances are busy handling incoming events and yet another event comes in, i.e. at scale-out.

As far as I'm aware, this behavior is common to all 3 providers, so I haven't prepared any comparison charts for N+1 cold starts. Yet, be aware of them!

Reading Candle Charts

In the following sections, you will see charts that represent the statistical distribution of cold start time as measured during my experiments. I repeated the experiments multiple times and then grouped the metric values, e.g. by cloud provider or by language.

Each group will be represented by a "candle" on the chart. This is how you should read each candle:

How to Read Cold Start Charts

Memory Allocation

AWS Lambda and Google Cloud Functions have a setting to define the memory size that gets allocated to a single instance of a function. A user can select a value from 128MB to 2GB and above at creation time.

More importantly, virtual CPU cycles get allocated proportionally to this provisioned memory size. This means that an instance with 512 MB will have twice the CPU speed of an instance with 256 MB.
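
Though not part of my measurements, for reference this knob is just a property on the function resource in infrastructure code. Here is a hedged Pulumi sketch (resource names, folder layout, and the runtime identifier are illustrative):

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// An execution role for the function (illustrative; only the trust policy is shown).
const role = new aws.iam.Role("sized-fn-role", {
    assumeRolePolicy: JSON.stringify({
        Version: "2012-10-17",
        Statement: [{
            Action: "sts:AssumeRole",
            Effect: "Allow",
            Principal: { Service: "lambda.amazonaws.com" },
        }],
    }),
});

const fn = new aws.lambda.Function("sized-fn", {
    runtime: "nodejs8.10",                        // runtime identifier as of the time of writing
    handler: "index.handler",
    code: new pulumi.asset.FileArchive("./app"),  // folder containing index.js
    role: role.arn,
    memorySize: 1024,                             // MB; the CPU share scales with this value
});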

Does this affect the cold start time?

I've run a series of tests to compare cold start latency across the board of memory/CPU sizes. The results are somewhat mixed.

AWS Lambda with Javascript doesn't show significant differences. This probably means that not much CPU is required to start a Node.js "Hello World" application:

AWS Javascript Cold Start by Memory

AWS Lambda .NET Core runtime does depend on memory size though. Cold start time drops dramatically with every increase in allocated memory and CPU:

AWS C# Cold Start by Memory

GCP Cloud Functions show a similar effect even for the Javascript runtime:

GCP Javascript Cold Start by Memory

In contrast to Amazon and Google, Microsoft doesn't ask you to select a memory limit. Azure charges Functions based on actual memory usage. More importantly, it always dedicates a full vCore to a given Function execution.

It's not exactly apples-to-apples, but I chose to fix the memory allocations of AWS Lambda and GCF to 1024 MB. This feels the closest to Azure's vCore capacity, although I haven't tried a formal CPU performance comparison.

Given that, let's see how the 3 cloud providers compare in cold start time.

Javascript Baseline

Node.js is the only runtime supported in production by Google Cloud Functions right now. Javascript is also probably by far the most popular language for serverless applications across the board.

Thus, it makes sense to compare how the 3 cloud providers perform with Javascript. The base test measures the cold starts of "Hello World"-style functions. They have no dependencies, so the deployment package is really small.
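
For reference, the function under test is about as small as functions get. Here is a hedged sketch of the AWS flavor (the Azure and GCP variants differ only in the handler signature), returning the instance ID and language as described in the methodology:

// Minimal "Hello World" HTTP handler with no dependencies; the log stream name
// doubles as a rough instance identifier (an assumption for this sketch).
export const handler = async (event: any) => ({
    statusCode: 200,
    body: JSON.stringify({
        instance: process.env.AWS_LAMBDA_LOG_STREAM_NAME,
        language: "javascript",
    }),
});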

Here are the numbers for cold starts:

Cold Start for Basic Javascript Functions

AWS is clearly doing the best job here. GCP takes second place, and Azure is the slowest. The rivals are reasonably close though, seemingly playing in the same league, so the exact ranking might change over time.

How Do Languages Compare?

I've written a Hello World HTTP function in every language supported by the cloud platforms:

  • AWS: Javascript, Python, Java, Go and C# (.NET Core)
  • Azure: Javascript and C# (precompiled .NET assembly)
  • GCP: Javascript

Azure technically supports many more languages, including Python and Java, but they are still considered experimental/preview, so their cold starts are not fully optimized. See my previous article for the exact numbers.

The same applies to Python on GCP.

The following chart gives some intuition about the cold start duration per language. The languages are ordered by mean response time, from lowest to highest:

Cold Start per Language per Cloud

AWS provides the richest selection of runtimes, and 4 out of 5 of them are faster than anything the other two cloud providers offer. C#/.NET seems to be the least optimized (Amazon, why is that?).

Does Size Matter?

OK, enough of Hello World. A real-life function is likely to be heavier, mainly because it depends on third-party libraries.

To simulate such a scenario, I've measured cold starts for functions with extra dependencies:

  • Javascript referencing 3 NPM packages - 5 MB zipped
  • Javascript referencing 38 NPM packages - 35 MB zipped
  • C# function referencing 5 NuGet packages - 2 MB zipped
  • Java function referencing 5 Maven packages - 15 MB zipped

Here are the results:

Cold Start Dependencies

As expected, the dependencies slow the loading down. You should keep your Functions lean; otherwise, you will pay in seconds for every cold start.

However, the increase in cold start time seems quite modest, especially for the precompiled languages.

A very cool feature of GCP Cloud Functions is that you don't have to include NPM packages in the deployment archive. You just add a package.json file, and the runtime restores the dependencies for you. This makes the deployment artifact ridiculously small, yet doesn't seem to slow down cold starts either. Obviously, Google restores the packages in advance, before the actual request comes in.
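
For reference, the deployed artifact can then be just the function code plus a manifest along these lines (package names and versions are illustrative, not my exact dependency list):

{
  "name": "coldstart-sample",
  "version": "1.0.0",
  "dependencies": {
    "lodash": "^4.17.10",
    "moment": "^2.22.2",
    "request": "^2.87.0"
  }
}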

Avoiding Cold Starts

The overall impression is that cold start delays aren't that high, so most applications can tolerate them just fine.

If that's not the case, some tricks can be used to keep function instances warm. The approach is universal across all 3 providers: once every X minutes, make an artificial call to the function to prevent it from expiring.

Implementation details will differ since the expiration policies are different, as we explored above.

For applications with a higher load profile, you might want to fire several parallel "warming" requests to make sure that enough instances are kept in warm stock.
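
Stripped to its essence, the trick looks like this. It's a provider-agnostic TypeScript sketch with a hypothetical URL; in practice you would fire the ping from a scheduled trigger (a CloudWatch rule, an Azure Timer trigger, or Cloud Scheduler) rather than from a long-lived process:

import fetch from "node-fetch";

// Ping the function periodically so the platform keeps an instance around.
// The 9-minute interval is an assumption: it stays well under the 20-minute
// idle limit observed for Azure above; tune it per provider.
const WARMUP_URL = "https://example-function.azurewebsites.net/api/hello?warmup=true";

setInterval(async () => {
    try {
        const response = await fetch(WARMUP_URL);
        console.log(`Warming ping at ${new Date().toISOString()}: HTTP ${response.status}`);
    } catch (err) {
        console.error("Warming ping failed", err);
    }
}, 9 * 60 * 1000);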

For further reading, have a look at my Cold Starts Beyond First Request in Azure Functions and AWS Lambda Warmer as Pulumi Component.

Conclusions

Here are some lessons learned from all the experiments above:

  • Be prepared for cold starts of 1-3 seconds even for the smallest Functions
  • Different languages and runtimes have roughly comparable cold start time within the same platform
  • Minimize the number of dependencies, only bring what's needed
  • AWS keeps cold starts below 1 second most of the time, which is pretty amazing
  • All cloud providers are aware of the problem and are actively optimizing the cold start experience
  • It's likely that in the medium term these optimizations will make cold starts a non-issue for the vast majority of applications

Do you see anything weird or unexpected in my results? Do you need me to dig deeper into other aspects? Please leave a comment below or ping me on twitter, and let's sort it all out.

Stay tuned for more serverless perf goodness!

AWS Lambda Warmer as Pulumi Component

Out of curiosity, I'm currently investigating cold starts of Function-as-a-Service platforms of major cloud providers. Basically, if a function is not called for several minutes, the cloud instance behind it might be recycled, and then the next request will take longer because a new instance will need to be provisioned.

Recently, Jeremy Daly posted a nice article about the proper way to keep AWS Lambda instances "warm" to (mostly) prevent cold starts with minimal overhead. Chris Munns endorsed the article, so we know it's the right way.

The number of actions to be taken is quite significant:

  • Define a CloudWatch event which would fire every 5 minutes
  • Bind this event as another trigger for your Lambda
  • Inside the Lambda, detect whether the current invocation was triggered by our CloudWatch event
  • If so, short-circuit the execution and return immediately; otherwise, run the normal workload
  • (Bonus point) If you want to keep multiple instances alive, do some extra dancing: have the function call itself N times in parallel, which requires an extra permission to do so.

Pursuing Reusability

To simplify this for his readers, Jeremy was kind enough to:

  • Create an NPM package which you can install and then call from a function-to-be-warmed
  • Provide SAM and Serverless Framework templates to automate the CloudWatch integration

Those are still two distinct steps: writing the code (JS + NPM) and provisioning the cloud resources (YAML + CLI). There are some drawbacks to that:

  • You need to change two parts which look nothing like each other
  • They have to work in sync, e.g. the CloudWatch event must provide the right payload for the handler
  • There's still some boilerplate for every new Lambda

Pulumi Components

Pulumi takes a different approach. You can blend the application code and infrastructure management code into one cohesive cloud application.

Related resources can be combined into reusable components, which hide the repetitive stuff behind code abstractions.

One way to define an AWS Lambda with Typescript in Pulumi is the following:

const handler = (event: any, context: any, callback: (error: any, result: any) => void) => {
    const response = {
        statusCode: 200,
        body: "Cheers, how are things?"
    };

    callback(null, response);
};

const lambda = new aws.serverless.Function("my-function", { /* options */ }, handler);

The processing code (handler) is simply passed to the infrastructure code as a parameter.

So, if I wanted to make a reusable API for an "always warm" function, what would it look like?

From the client code perspective, I just want to be able to do the same thing:

const lambda = new mylibrary.WarmLambda("my-warm-function", { /* options */ }, handler);

CloudWatch? Event subscription? Short-circuiting? They are implementation details!

Warm Lambda

Here is how to implement such a component. The declaration starts with a Typescript class:

export class WarmLambda extends pulumi.ComponentResource {
    public lambda: aws.lambda.Function;
    public subscription: serverless.cloudwatch.CloudwatchEventSubscription;

    // Implementation goes here...
}

We expose the raw Lambda Function object so that it can be used for further bindings and for retrieving outputs; the CloudWatch subscription created below is exposed too.

The constructor accepts the same parameters as aws.serverless.Function provided by Pulumi:

constructor(name: string,
        options: aws.serverless.FunctionOptions,
        handler: aws.serverless.Handler,
        opts?: pulumi.ResourceOptions) {

    // Subresources are created here...
}

We start resource provisioning by creating the CloudWatch rule to be triggered every 5 minutes:

const eventRule = new aws.cloudwatch.EventRule(`${name}-warming-rule`, 
    { scheduleExpression: "rate(5 minutes)" },
    { parent: this, ...opts }
);

Then comes the cool trick. We substitute the user-provided handler with our own "outer" handler. This handler closes over eventRule, so it can use the rule to identify the warm-up events coming from CloudWatch. If one is identified, the handler short-circuits to the callback. Otherwise, it passes the event over to the original handler:

const outerHandler = (event: any, context: aws.serverless.Context, callback: (error: any, result: any) => void) =>
{
    if (event.resources && event.resources[0] && event.resources[0].includes(eventRule.name.get())) {
        console.log('Warming...');
        callback(null, "warmed!");
    } else {
        console.log('Running the real handler...');
        handler(event, context, callback);
    }
};

That's a great example of synergy enabled by doing both application code and application infrastructure in a single program. I'm free to mix and match objects from both worlds.

It's time to bind both eventRule and outerHandler to a new serverless function:

const func = new aws.serverless.Function(
    `${name}-warmed`, 
    options, 
    outerHandler, 
    { parent: this, ...opts });
this.lambda = func.lambda;            

Finally, I create an event subscription linking the CloudWatch schedule to the Lambda:

this.subscription = new serverless.cloudwatch.CloudwatchEventSubscription(
    `${name}-warming-subscription`, 
    eventRule,
    this.lambda,
    { },
    { parent: this, ...opts });

And that's all we need for now! See the full code here.

Here is the output of the pulumi update command for my sample "warm" lambda application:

     Type                                                      Name                            Plan
 +   pulumi:pulumi:Stack                                       WarmLambda-WarmLambda-dev       create
 +    samples:WarmLambda                                       i-am-warm                       create
 +      aws-serverless:cloudwatch:CloudwatchEventSubscription  i-am-warm-warming-subscription  create
 +        aws:lambda:Permission                                i-am-warm-warming-subscription  create
 +        aws:cloudwatch:EventTarget                           i-am-warm-warming-subscription  create
 +      aws:cloudwatch:EventRule                               i-am-warm-warming-rule          create
 +      aws:serverless:Function                                i-am-warm-warmed                create
 +         aws:lambda:Function                                 i-am-warm-warmed                create

7 Pulumi components and 4 AWS cloud resources are provisioned by one new WarmLambda() line.

Multi-Instance Warming

Jeremy's library supports warming several instances of Lambda by issuing parallel self-calls.

Reproducing the same with a Pulumi component should be fairly straightforward:

  • Add an extra constructor option to accept the number of instances to keep warm
  • Add a permission to call Lambda from itself
  • Fire N parallel self-calls when the warming event is triggered
  • Short-circuit those calls in each instance

Note that only the first item would be visible to the client code. That's the power of componentization and code reuse.
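
For instance, the client-facing surface might grow just one extra knob, something like this (the option name is purely hypothetical, since I haven't built it):

// Hypothetical client code: the only visible change is one extra option;
// warmInstances is an illustrative name, not an implemented API.
const lambda = new mylibrary.WarmLambda(
    "my-warm-function",
    { warmInstances: 3 /* ...plus the usual function options... */ },
    handler);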

I didn't need multi-instance warming, so I'll leave the implementation as an exercise for the reader.

Conclusion

Obligatory note: most probably, you don't need to add warming to your AWS Lambdas.

But whatever advanced scenario you might have, it's likely easier to express it as a general-purpose reusable component rather than as a set of guidelines or templates.

Happy hacking!

I'm Mikhail Shilkov, a software developer and architect, a Microsoft Azure MVP, and a Russian expat living in the Netherlands. I am passionate about cloud technologies, functional programming, and the intersection of the two.
