Cold Starts Beyond First Request in Azure Functions

In my previous article I explored the topic of Cold Starts in Azure Functions. In particular, I measured the cold start delays per language and runtime version.

I received some follow-up questions that I'd like to explore in today's post:

  • Can we avoid all cold starts except the very first one by keeping the instance warm?
  • Given one warm instance, if two requests come in at the same time, will one of them hit a cold start because the existing instance is busy with the other?
  • In general, does a cold start happen at scale-out, when a new extra instance is provisioned?

Again, we are only talking about the Consumption Plan here.

Theory

Azure Functions run on instances provided by Azure App Service. Each instance is able to process several requests concurrently, which is different from AWS Lambda.

Thus, the following could be true:

  • If we issue at least 1 request every 20 minutes, the first instance should stay warm for a long time
  • Simultaneous requests don't cause cold starts unless the existing instance gets too busy
  • When the runtime decides to scale out and spin up a new instance, it could do so in the background, still forwarding incoming requests to the existing warm instance(s). Once the new instance is ready, it could be added to the pool without causing cold starts
  • If so, cold starts are mitigated beyond the very first execution

Let's put this theory under test!

Keeping Always Warm

I've tested a Function App which consists of two Functions:

  • HTTP Function under test
  • Timer Function which runs every 10 minutes and does nothing but log one line of text (a minimal sketch follows below)
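
For reference, here is what such a "warmer" function can look like (a minimal sketch; the function name and the exact schedule here are just illustrative):

[FunctionName("Warmer")]
public static void Warmer([TimerTrigger("0 */10 * * * *")] TimerInfo timer, TraceWriter log)
{
    // The body is irrelevant; the point is to keep the instance alive
    log.Info("Warmer timer fired");
}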

I then measured the cold start statistics, similarly to the tests from my previous article.

Over the course of 2 days I was issuing infrequent requests to the same app; most of them would normally lead to a cold start. Interestingly, even though the timer was firing regularly, Azure switched the instance serving my application twice during the test period:

Infrequent Requests to Azure Functions with "Keep It Warm" Timer

I can see that most responses are fast, so the timer "warmer" definitely helps.

The first request(s) to a new instance are slower than subsequent ones. Still, they are faster than a normal full cold start, so the delay could be related to loading the HTTP stack.

Anyway, keeping Functions warm seems a viable strategy.

Parallel Requests

What happens when there is a warm instance, but it's already busy with processing another request? Will the parallel request be delayed, or will it be processed by the same warm instance?

I tested with a very lightweight function, which nevertheless takes some time to complete:

public static async Task<HttpResponseMessage> Delay500([HttpTrigger] HttpRequestMessage req)
{
    await Task.Delay(500);
    return req.CreateResponse(HttpStatusCode.OK, "Done");
}

I believe it's an OK approximation for an IO-bound function.

The test client then issued 2 to 10 parallel requests to this function and measured the response time of each request.
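
The client logic was roughly along these lines (a simplified sketch, not the exact harness I used; the URL is a placeholder):

private static readonly HttpClient http = new HttpClient();

private static async Task RunBatch(string url, int parallelism)
{
    // Fire N requests at the same time and record how long each of them takes
    var tasks = Enumerable.Range(0, parallelism).Select(async _ =>
    {
        var stopwatch = Stopwatch.StartNew();
        await http.GetAsync(url);
        return stopwatch.ElapsedMilliseconds;
    });

    var durations = await Task.WhenAll(tasks);
    Console.WriteLine($"x{parallelism}: {string.Join(", ", durations)} ms");
}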

It's not the easiest chart to understand in full, but note the following:

  • Each group of bars is for requests sent at the same time; then there's a pause of about 20 seconds before the next group of requests is sent

  • The bars are colored by the instance which processed that request: same instance - same color

Azure Functions Response Time to Batches of Simultaneous Requests

Here are some observations from this experiment:

  • Out of 64 requests, there were 11 cold starts

  • The same instance can process multiple simultaneous requests, e.g. one instance processed 7 out of 10 requests in the last batch

  • Nonetheless, Azure is eager to spin up new instances for multiple requests. In total 12 instances were created, which is even more than the maximum number of requests in any single batch

  • Some of those instances were actually never reused (gray-ish bars in batches x2 and x3, brown bar in x10)

  • The first request to each new instance pays the full cold start price. The runtime doesn't provision new instances in the background while reusing existing ones for incoming requests

  • If an instance handles more than one request at a time, response time invariably suffers, even though the function is super lightweight (Task.Delay)

Conclusion

Getting back to the experiment goals, there are several things that we learned.

For low-traffic apps with sporadic requests it makes sense to set up a "warmer" timer function firing every 10 minutes or so to prevent the only instance from being recycled.

However, scale-out cold starts are real and I don't see any way to prevent them from happening.

When multiple requests come in at the same time, we might expect some of them to hit a new instance and get slowed down. The exact algorithm of instance reuse is not entirely clear.

The same instance is capable of processing multiple requests in parallel, so there is room for optimization in terms of routing to warm instances while cold ones are being provisioned.

If such optimizations happen, I'll be glad to re-run my tests and report any noticeable improvements.

Stay tuned for more serverless perf goodness!

Azure Functions: Cold Starts in Numbers

Auto-provisioning and auto-scalability are the killer features of Function-as-a-Service cloud offerings, and Azure Functions in particular.

One drawback of such dynamic provisioning is a phenomenon called "Cold Start". Basically, applications that haven't been used for a while take longer to start up and to handle the first request.

The problem is nicely described in Understanding Serverless Cold Start, so I won't repeat it here. I'll just copy a picture from that article:

Cold Start

Based on the 4 actions which happen during a cold start, we may guess that the following factors might affect the cold start duration:

  • Language / execution runtime
  • Azure Functions runtime version
  • Application size including dependencies

I ran several sample functions and tried to analyze the impact of these factors on cold start time.

Methodology

All tests were run against HTTP Functions, because that's where cold start matters the most.

All the functions were just returning "Hello, World", taking the "World" value from the query string; some functions were also loading extra dependencies (see below).
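
For the precompiled .NET flavour, the function was essentially the standard HTTP template, something like this (a sketch; the "name" query parameter is just an example):

[FunctionName("Hello")]
public static HttpResponseMessage Run([HttpTrigger] HttpRequestMessage req)
{
    // Take the value from the query string and greet it, defaulting to "World"
    var name = req.GetQueryNameValuePairs()
        .FirstOrDefault(q => string.Compare(q.Key, "name", true) == 0)
        .Value ?? "World";
    return req.CreateResponse(HttpStatusCode.OK, $"Hello, {name}");
}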

I did not rely on the execution time reported by Azure. Instead, I measured the end-to-end duration from the client's perspective. All calls were made from within the same Azure region, so network latency should have minimal impact:

Test Setup

When Does Cold Start Happen?

Obviously, cold start happens when the very first request comes in. After that request is processed, the instance is kept alive in case subsequent requests arrive. But for how long?
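
To find out, one can probe the same function with growing idle gaps, roughly like this (a sketch of the idea; http and functionUrl are placeholders, and the intervals are arbitrary examples):

// Probe the function with increasing idle gaps and record each response time
foreach (var idleMinutes in new[] { 5, 10, 15, 20, 25, 30 })
{
    await Task.Delay(TimeSpan.FromMinutes(idleMinutes));

    var stopwatch = Stopwatch.StartNew();
    await http.GetAsync(functionUrl);
    Console.WriteLine($"{idleMinutes} min idle -> {stopwatch.ElapsedMilliseconds} ms");
}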

The following chart gives the answer. It shows values of normalized request durations across different languages and runtime versions (Y axis) depending on the time since the previous request in minutes (X axis):

Cold Start Threshold

Clearly, an idle instance lives for 20 minutes and then gets recycled. All requests after the 20-minute threshold hit another cold start.

How Do Languages Compare?

I'll start with version 1 of Functions runtime, which is the production-ready GA version as of today.

I've written a Hello World HTTP function in all GA languages: C#, F# and Javascript, and I added Python for comparison. C# and F# were executed both as scripts and as precompiled .NET assemblies.

The following chart gives some intuition about the cold start duration per language. The languages are ordered by mean response time, from lowest to highest. 65% of request durations fall inside the vertical bar (the 1-sigma interval) and 95% fall inside the vertical line (2-sigma):

Cold Start V1 per Language

Somewhat surprisingly, precompiled .NET is exactly on par with Javascript. Javascript "Hello World" is really lightweight, so I expected it to win, but I was wrong.

C# Script is slower but somewhat comparable. F# Script presented a really negative surprise though: it's much slower. It's even slower than experimental Python support where no performance optimization would be expected at all!

Functions Runtime: V1 vs V2

Version 2 of Functions runtime is currently in preview and not suitable for production load. That probably means they haven't done too much performance optimization, especially from cold start standpoint.

Can we see this on the chart? We sure can:

Cold Start V1 vs V2

V2 is massively slower. The fastest cold starts are around 6 seconds, but the slowest can reach 40-50 seconds.

Javascript is again on par with precompiled .NET.

Java is noticeably slower, even though the deployment package is just 33 kB, so I assume I didn't bloat it.

Does Size Matter?

OK, enough of Hello World. A real-life function might be heavier, mainly because it would depend on other third-party libraries.

To simulate such a scenario, I've measured cold starts for a .NET function with references to Entity Framework, Automapper, Polly and Serilog.

For Javascript I did the same, but referenced Bluebird, lodash and AWS SDK.
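
The function bodies stayed trivial; they merely touched the referenced libraries, roughly in this spirit (a sketch, not my exact code; Entity Framework is omitted here because it needs a database to be meaningful):

[FunctionName("HelloWithDependencies")]
public static HttpResponseMessage Run([HttpTrigger] HttpRequestMessage req)
{
    // Touch the referenced libraries so their assemblies actually get loaded
    var logger = new LoggerConfiguration().CreateLogger();
    var mapper = new MapperConfiguration(cfg => { }).CreateMapper();
    var retry = Policy.Handle<Exception>().Retry(3);

    logger.Information("Hello with dependencies");
    return req.CreateResponse(HttpStatusCode.OK, "Hello, World");
}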

Here are the results:

Cold Start Dependencies

As expected, the dependencies slow the loading down. You should keep your Functions lean, otherwise you will pay in seconds for every cold start.

An important note for Javascript developers: the above numbers are for Functions deployed after running the Funcpack preprocessor. The package contained a single js file with the Webpack-ed dependency tree. Without that, the mean cold start time of the same function was 20 seconds!

Conclusions

Here are some lessons learned from all the experiments above:

  • Be prepared for 1-3 seconds cold starts even for the smallest Functions
  • Stay on V1 of runtime until V2 goes GA unless you don't care about perf
  • .NET precompiled and Javascript Functions have roughly same cold start time
  • Minimize the amount of dependencies, only bring what's needed

Do you see anything weird or unexpected in my results? Do you need me to dig deeper on other aspects? Please leave a comment below or ping me on twitter, and let's sort it all out.

There is a follow-up post available: Cold Starts Beyond First Request in Azure Functions

Awesome F# Exchange 2018

I'm writing this post on the train to London Stansted, on my way back from the F# Exchange 2018 conference.

F# Exchange is a yearly conference taking place in London, and the 2018 edition was the first one for me personally. I also had the honour to speak there about creating Azure Functions with F#.

Impression

F# is still a relatively niche language, so the conference is not overcrowded, but that gives it a special feeling of a family gathering. There were 162 participants this year, and I have the impression that every one of them is extremely friendly, enthusiastic and just plain awesome.

The conference itself had 2 tracks of 45-minute talks and 60-minute keynotes. Most talks were of high quality, and the topics ranged from compiler internals to fun applications like music generation, car racing and map drawing.

Both Don Syme, the creator of F#, and Philip Carter, the F# program manager, were there and gave keynotes, but they were careful enough not to draw too much attention to Microsoft and to let the community speak loud.

Corridor Track

But the talks were just a part of the story. For me, the conference started on the evening before the first day, at the speakers' drinks party, and only finished at 1 a.m. after the second day (the pubs in London are lovely).

I spoke to so many great people, I learnt a lot, and had fun too. I've never seen so many F# folks in the same place, and I guess there must be something about F# that attracts the right kind of people.

And of course it's so much fun to meet face-to-face all those twitter, slack, github and Channel 9 personas and to see that they are actually real people :)

My Talk

The talk I gave was called "Azure F#unctions". It was not a hard-core F# talk, but people seemed to be genuinely interested in the topic.

A decent number of attendees are already familiar with Azure Functions, and many either run them in production or plan to do so.

The reference version conflict problem is very well known and raises a lot of questions and concerns. It even leads to workarounds like transpiling F# Functions to Javascript with Fable. Yikes.

Durable Functions seem to be sparking a lot of initial interest. I'll definitely be spending more time playing with them, and maybe making the F# story smoother.

Functions were mentioned in Philip's keynote as one of the important application areas for F#, which is cool. We should spend some extra effort to make the documentation and onboarding story as smooth as possible.

Call to Action

Skills Matter is the company behind the conference. Carla, Nicole and others did a great job preparing the event; everything was smooth, informal and fun.

The videos are already online at Skillscasts (requires free signup).

F# Exchange 2019 super early bird tickets are on sale now until Monday, April 9. Go get one and join F# Exchange in London next year!

I'm already missing you all.

Azure Durable Functions in F#

Azure Functions are designed for stateless, fast-to-execute, simple actions. Typically, they are triggered by an HTTP call or a queue message, then they read something from the storage or database and return the result to the caller or send it to another queue. All within several seconds at most.

However, there exists a preview of Durable Functions, an extension that lets you write stateful functions for long-running workflows. Here is a picture of one possible workflow from the docs:

Fan-out Fan-in Workflow

Such workflows might take an arbitrarily long time to complete. Instead of blocking and waiting for that whole period, Durable Functions use a combination of Storage Queues and Tables to do all the work asynchronously.

The code still feels like one continuous thing because it's programmed as a single orchestrator function. So, it's easier for a human to reason about the functionality without the complexities of low-level communication.

I won't describe Durable Functions any further; just go read the documentation, it's nice and clean.

Language Support

As of February 2018, Durable Functions are still in preview. That also means that language support is limited:

Currently C# is the only supported language for Durable Functions. This includes orchestrator functions and activity functions. In the future, we will add support for all languages that Azure Functions supports.

I was a bit disappointed that F# is not an option. But actually, since Durable Functions support precompiled .NET assembly model, pretty much anything doable in C# can be done in F# too.

The goal of this post is to show that you can write Durable Functions in F#. I used precompiled .NET Standard 2.0 F# Function App running on 2.0 preview runtime.

Orchestration Functions

The stateful workflows are Azure Functions with a special OrchestrationTrigger. Since they are asynchronous, C# code is always based on Task and async-await. Here is a simple example of an orchestrator in C#:

public static async Task<List<string>> Run([OrchestrationTrigger] DurableOrchestrationContext context)
{
    var outputs = new List<string>();

    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Tokyo"));
    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "Seattle"));
    outputs.Add(await context.CallActivityAsync<string>("E1_SayHello", "London"));

    // returns ["Hello Tokyo!", "Hello Seattle!", "Hello London!"]
    return outputs;
}

F# has its own preferred way of writing asynchronous code, based on the async computation expression. A direct refactoring could look something like this:

let Run([<OrchestrationTrigger>] context: DurableOrchestrationContext) = async {
  let! hello1 = context.CallActivityAsync<string>("E1_SayHello", "Tokyo")   |> Async.AwaitTask
  let! hello2 = context.CallActivityAsync<string>("E1_SayHello", "Seattle") |> Async.AwaitTask
  let! hello3 = context.CallActivityAsync<string>("E1_SayHello", "London")  |> Async.AwaitTask
  return [hello1; hello2; hello3]
} |> Async.StartAsTask   

That would work for a normal HTTP trigger, but it blows up for the Orchestrator trigger because multi-threading operations are not allowed:

Orchestrator code must never initiate any async operation except by using the DurableOrchestrationContext API. The Durable Task Framework executes orchestrator code on a single thread and cannot interact with any other threads that could be scheduled by other async APIs.

To solve this issue, we need to keep working with Task directly. This is not very handy with the standard F# libraries, so I pulled in an extra NuGet package, TaskBuilder.fs, which provides a task computation expression.

The above function now looks very simple:

let Run([<OrchestrationTrigger>] context: DurableOrchestrationContext) = task {
  let! hello1 = context.CallActivityAsync<string>("E1_SayHello", "Tokyo")
  let! hello2 = context.CallActivityAsync<string>("E1_SayHello", "Seattle")
  let! hello3 = context.CallActivityAsync<string>("E1_SayHello", "London")
  return [hello1; hello2; hello3]
}       

And the best part is that it works just fine.

The SayHello function is based on the ActivityTrigger, and no special effort is required to implement it in F#:

[<FunctionName("E1_SayHello")>]
let SayHello([<ActivityTrigger>] name) =
  sprintf "Hello %s!" name

More Examples

The Durable Functions repository comes with a set of 4 samples implemented in C#. I took all of those samples and ported them over to F#.

You've already seen the first Hello Sequence sample above: the orchestrator calls the activity function 3 times and combines the results. As simple as it looks, the function will actually run 3 times for each execution, saving state before each subsequent call.

The second Backup Site Content sample is using this persistence mechanism to run a potentially slow workflow of copying all files from a given directory to a backup location. It shows how multiple activities can be executed in parallel:

let tasks = Array.map (fun f -> backupContext.CallActivityAsync<int64>("E2_CopyFileToBlob", f)) files
let! results = Task.WhenAll tasks

The third Counter example demos a potentially infinite actor-like workflow, where state can exist and evolve for an indefinite period of time. The key API calls are based on the OrchestrationContext:

let counterState = counterContext.GetInput<int>()
let! command = counterContext.WaitForExternalEvent<string>("operation")

The final, more elaborate Phone Verification workflow has several twists: an output binding for the activity (ICollector is required instead of C#'s out parameter), a third-party integration (Twilio to send SMS messages), a recursive sub-function to loop through several attempts, and context-based timers for a reliable timeout implementation.

So, if you happen to be an F# fan, you can still give Durable Functions a try. Be sure to leave your feedback, so that the library could get even better before going GA.

Load Testing Azure SQL Database by Copying Traffic from Production SQL Server

Azure SQL Database is a managed service that provides low-maintenance SQL Server instances in the cloud. You don't have to run and update VMs, or even take backups and set up failover clusters. Microsoft does the administration for you; you just pay an hourly fee.

So, let's say you decide this value proposition is a good reason to migrate away from your existing self-hosted SQL Server database running in production and replace it with Azure SQL Database.

You do the functional testing and eventually everything works like a charm. The next set of questions is going to be related to the database performance level:

  • Which tier / how many DTUs should I provision?
  • How much will it cost?
  • Will it be able to handle my current production load?

DTUs

Even if you collect all the specs of the hardware behind your existing SQL Server, you can't directly use that knowledge to choose the right Azure SQL Database size.

The sizes are measured in Database Transaction Units (DTUs). These are abstract units of measure which don't necessarily mean much on their own. Within a given tier (Standard / Premium), doubling the DTU amount will double the max throughput.

That doesn't really help to plan for workload migrations.

There are some ways to estimate the DTU requirements by measuring metrics like CPU and IOPS on your existing server. Have a look at the DTU Calculator: it consists of a data collector and an online converter from metric values to DTUs.

While useful as a first approximation, I'm reluctant to provision Azure SQL Database size solely based on such estimates.

My answer to the problem is: Measure It!

Synthetic Tests

Go get a backup of your existing production database and Export / Import it into Azure SQL Database. Pick the size based on your gut feel, run a load test, evaluate the results, adjust the size, repeat.

If you know your workload really well, you can create a synthetic test:

  • Create a script or scenario which resembles the real production load
  • Run it for a given period of time
  • Measure the DTUs consumed

Unfortunately, I'm yet to see a non-trivial database where I could manually create such a script and be reasonably sure that it reflects reality. Most of the time the load is consumer-driven, changes over time and heavily depends on exact query parameter values.

Which brings me to the need to replay the actual production workload on Azure SQL Database.

Trace and Replay

SQL Server comes with a marvelous suite of tools refined over years of its existence. It includes the tools to capture and replay the queries, so I started with those.

SQL Server Profiler has a trace template called TSQL_Replay:

This template records information required to replay the trace. Use this template to perform iterative tuning, such as benchmark testing.

This sounded like what I needed, so I ran the profiler with this template to save a short trace.

Afterwards, it is possible to use the same SQL Server Profiler to replay the trace against another target database. So the process looks like this:

Replaying Traffic with SQL Server Profiler

Unfortunately, this didn't go very well:

  • Azure SQL Database is not supported by the tooling. The replay kind of runs, but it throws lots of errors, like reading from non-existent system tables, trying to switch between databases and so on

  • Related or not to the previous item, the replay went terribly slow. It seemed to slow down exponentially over time

  • The trace file itself was huge. Because the template tries to record pretty much everything, tracing 5 minutes of production produced 10 GB of XML

  • The replay was not real-time: you first record, then you replay. This might not be a big issue for many databases, but some of our queries have a time parameter, and results would change if I replayed the trace 1 hour later

Just to give you a rough idea, our production database-under-study is handling about 1000 RPC calls per second (mostly stored procedures).

Custom Trace & Replay

Since the off-the-shelf solution didn't work for me, I decided to come up with my own custom tool chain. Here is the idea:

Replaying Traffic with SQL Server Profiler

There are two custom steps that I implemented:

  1. Run a console app which would host a custom trace server. The trace server receives SQL commands and sends them to Azure Event Hubs in batches

  2. Create an Azure Function application triggered by the Event Hub. Each function call gets one SQL command to execute and runs it against Azure SQL database that we are trying to load-test

This setup worked remarkably well for me: I got the real-time replay of SQL commands from production SQL Server to Azure SQL Database.

The rest of the article describes my setup so that you can reproduce it for your workload.

Azure SQL Database

Ideally, you want your copy of the database to be as fresh as possible, so that the query plans and results match.

Some ideas to accomplish this are given in SQL Server database migration to SQL Database in the cloud.

The Premium RS tier is great for testing, because it is much cheaper than the Premium tier while providing the same level of performance.

Event Hubs

I used Azure Event Hubs as the messaging middleware between the Trace Server and the Replay Function App.

I started with Azure Storage Queues, but the server wasn't able to send messages fast enough, mostly due to the lack of batching.

Event Hubs are a natural match for my choice of Azure Functions: Functions have a built-in trigger with dynamic scaling out of the box.

So, I just created a new Event Hub via the portal, with 32 partitions allocated.

Trace Definition File

In order to run a custom Trace Server, you still need a trace definition file. The built-in TSQL_Replay template mentioned above could work, but it's subscribed to way too many events and columns.

Instead, I produced my own trace template with a minimal selection. To do that, open SQL Server Profiler, navigate to File -> Templates -> New Template, give it a name and then, on the Events Selection tab, exclude everything except exactly the commands that you want to replay.

We use stored procedures for pretty much everything, so my selection looked just like this:

SQL Profiler Template

For the first few runs, I advise you to restrict the trace even further. Click the Column Filters button, select TextData there and set a Like filter to a single stored procedure, e.g. matching the pattern %spProductList%.

This way you can debug your whole replay chain without immediately overloading any part of it with a huge stream of commands.

Once done, save the .tdf file to disk. An example of such a trace definition file can be found in my github.

Trace Server

My trace server is a simple C# console application.

Create a new console app and reference the NuGet package Microsoft.SqlServer.SqlManagementObjects. Mine is of version 140.17218.0 (the latest as of today).

Unfortunately, this NuGet package is not fully self-contained. In order to run a profiling session, you have to install the SQL Server Profiler tool on the machine where you want to run the trace server.

Chances are that you already have it there, but be sure to update to the matching version: mine works with 17.4 / 14.0.17213.0 but refused to work with older versions.

Now we can implement our trace server as a console application. The main method looks like this:

static void Main(string[] args) // args: <db server name> <db name> <trace file>
{
    // 1. Run trace server
    var connectionInfo = new SqlConnectionInfo(args[0])
    {
        DatabaseName = args[1],
        UseIntegratedSecurity = true
    };
    var trace = new TraceServer();
    trace.InitializeAsReader(connectionInfo, args[2]);

    // 2. Continuously read traces and send them to event hubs
    var tokenSource = new CancellationTokenSource();
    var readerTask = Task.Factory.StartNew(() => ReadTrace(trace, tokenSource.Token), tokenSource.Token);
    var senderTask = Task.Factory.StartNew(() => SendToEventHubs(tokenSource.Token), tokenSource.Token);

    // 3. Stop the trace
    Console.WriteLine("Press any key to stop...");
    Console.ReadKey();
    tokenSource.Cancel();
    Task.WaitAll(readerTask, senderTask);
}

The first block initializes SQL connection using command line arguments and integrated security, and then starts the Trace Server.

Because of the large volume, I made the trace reader and the event sender work on separate threads. They talk to each other via a concurrent queue:

private static readonly ConcurrentQueue<string> eventQueue = new ConcurrentQueue<string>();

Finally, when the operator presses any key, cancellation is requested and the reader and sender shut down.

The Trace Reader task is a loop crunching through the trace data and sending the SQL statements (with some exclusions) to the concurrent in-memory queue:

private static void ReadTrace(TraceServer trace, CancellationToken token)
{ 
    while (trace.Read() && !token.IsCancellationRequested)
    {
        var eventClass = trace["EventClass"].ToString();
        if (string.Compare(eventClass, "RPC:Completed") == 0)
        {
            var textData = trace["TextData"].ToString();
            if (!textData.Contains("sp_reset_connection")
                && !textData.Contains("sp_trace")
                && !textData.Contains("sqlagent"))
            {
                eventQueue.Enqueue(textData);
            }
        }
    }

    trace.Stop();
    trace.Close();
}

The Event Sender dequeues SQL commands from the in-memory queue to collect batches of events. As soon as a batch fills up, it gets dispatched to the Event Hub:

private static void SendToEventHubs(CancellationToken token)
{
    var client = EventHubClient.CreateFromConnectionString(EventHubsConnectionString);
    var batch = client.CreateBatch();
    while (!token.IsCancellationRequested)
    {
        if (!eventQueue.TryDequeue(out string sql))
        {
            Thread.Sleep(10);
            continue;
        }

        var eventData = new EventData(Encoding.UTF8.GetBytes(sql));
        // If the current batch is full, send it off and start a new one, beginning with this event
        if (!batch.TryAdd(eventData) && batch.Count > 0)
        {
            client.SendAsync(batch.ToEnumerable())
                .ContinueWith(OnAsyncMethodFailed, token, TaskContinuationOptions.OnlyOnFaulted, TaskScheduler.Default);
            batch = client.CreateBatch();
            batch.TryAdd(eventData);
        }
    }
}

If your trace doesn't produce that many messages, you will probably want to periodically send out the batches even before they are full, just to keep the process closer to real time.
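
One way to do that is to track when the last batch was sent and flush a partially filled batch once some interval elapses, roughly like this (a sketch on top of the loop above; the 5-second interval is arbitrary):

var flushInterval = TimeSpan.FromSeconds(5);
var lastSent = DateTime.UtcNow;
while (!token.IsCancellationRequested)
{
    // Flush a partially filled batch if it has been sitting around for too long
    if (batch.Count > 0 && DateTime.UtcNow - lastSent > flushInterval)
    {
        client.SendAsync(batch.ToEnumerable())
            .ContinueWith(OnAsyncMethodFailed, token, TaskContinuationOptions.OnlyOnFaulted, TaskScheduler.Default);
        batch = client.CreateBatch();
        lastSent = DateTime.UtcNow;
    }

    // ... the TryDequeue / TryAdd logic from above stays the same ...
}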

Note that the sender does not await the SendAsync call. Instead, we only subscribe to failures via the OnAsyncMethodFailed callback to print them to the console:

private static void OnAsyncMethodFailed(Task task)
{
    Console.WriteLine(task.Exception?.ToString() ?? "null error");
}

And that concludes the implementation of the Trace Server. SQL commands now go to Event Hub, to be picked up by Trace Replay.

Trace Replay Function App

To replay those traces against the target Azure SQL Database, I could have made another console application hosting an EventProcessorHost to receive and process SQL commands.

However, under high load a single machine might not be able to keep up with executing all those commands in real time.

Instead, I decided to distribute the Replay App over multiple machines. To deploy a DDoS network, if you will :)

And I don't have to build, find, configure and synchronize all those servers myself, since we are living in the world of serverless.

Azure Functions are the perfect tool for this job. Once you start the trace server, the Function App will start scaling up based on the number of events in the Event Hub, and will expand until it catches up with the workload.

But as long as you don't run the trace server, it won't consume any servers and won't cost you a dime.

Here is the implementation of Trace Replay Azure Function:

public static class Replay
{
    [FunctionName("Replay")]
    public static void Run(
        [EventHubTrigger("sqltrace", Connection = "EventHubsConn")] string sql,
        TraceWriter log)
    {
        // Extract the stored procedure name from the SQL text, for logging purposes only
        var commandName = sql
            .Split(null)
            .SkipWhile(r => r != "exec" && r != "sp_executesql")
            .FirstOrDefault(r => !r.Contains("exec")) ?? "<empty>";

        var stopwatch = new Stopwatch();
        stopwatch.Start();

        try
        {
            using (var sqlConnection = new SqlConnection(AzureSqlConnectionString))
            using (var cmd = new SqlCommand())
            {
                sqlConnection.Open();

                cmd.CommandText = sql;
                cmd.CommandType = CommandType.Text;

                cmd.Connection = sqlConnection;

                int count = 0;
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        count++;
                    }
                }

                log.Info($"Processed {commandName} in {stopwatch.ElapsedMilliseconds} ms with {count} rows");
            }
        }
        catch (Exception ex)
        {
            log.Error($"Error in {commandName} in {stopwatch.ElapsedMilliseconds} {ex.Message}");
            throw;
        }
    }
}

It's super simple: the function gets a SQL statement, executes it with the SqlCommand class and logs the result with timing and returned row count. And that's everything required to start bombarding my Azure SQL Database.

Evaluating Results

The purpose of this whole exercise was to evaluate whether a provisioned DTU level is enough to withstand a load comparable to the existing production load.

So, after I ran the test, I could browse through the DTU usage chart in Azure portal to get overall usage statistics.

I've also spent quite some time analyzing the usage breakdown as reported by sp_BlitzCache from the First Responder Kit. Please note that it's not officially supported on Azure SQL Database, but it seems to work reasonably well.

Be sure to re-run your experiments multiple times, at different days and time intervals.

The full code sample can be found in my github.

I hope Azure SQL Database will perform to your expectations and within your budget. But hope is not a good strategy, so go ahead and try it out!

Happy DDoS-ing!
