Finding Lost Events in Azure Application Insights

One of the ways we use Azure Application Insights is tracking custom application-specific events. For instance, every time a data point from an IoT device comes in, we log an AppInsights event. Then we are able to aggregate the data and plot charts to derive trends and detect possible anomalies.
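
Roughly, tracking such an event from the application might look like this (a minimal sketch with the Application Insights TelemetryClient; the pointCount value is a placeholder for whatever measurement we actually send):

double pointCount = 42;                     // e.g. number of data points in the incoming message

var telemetry = new TelemetryClient();

// Log a custom event with a custom measurement attached to it
telemetry.TrackEvent(
    "EventXReceived",
    properties: null,
    metrics: new Dictionary<string, double>
    {
        ["EventXReceived_Count"] = pointCount
    });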

Recently we found such an anomaly, which looked like this:

Amount of Events on Dashboard Chart

This is a chart from our Azure dashboard, which shows the total number of events of a specific type received per day.

The first two "hills" are two weeks, so we can clearly see that we get more events on business days compared to weekends.

But then something happened on May 20: we started getting far fewer events, the hill pattern disappeared, and the days look much more alike.

We hadn't noticed any other problems in the system, but the trend looked quite worrying. Were we losing data?

I headed to the Analytics console of Application Insights to dig deeper. Here is the query that reproduces the problem:

customEvents
| where name == "EventXReceived"
| where timestamp >= ago(22d)
| project PointCount = todouble(customMeasurements["EventXReceived_Count"]), timestamp
| summarize EventXReceived = sum(PointCount) by bin(timestamp, 1d)
| render timechart

and I got the same chart as before:

Trend on Application Insights Analytics

I checked the history of our source code repository and deployments, and figured out that we had upgraded the Application Insights SDK from version 2.1 to version 2.3.

My guess at this point was that Application Insights had started sampling our data instead of sending all events to the server. After reading the Sampling in Application Insights article, I came up with the following query to see the sampling rate:

customEvents
| where name == "EventXReceived"
| where timestamp >= ago(22d)
| summarize 100/avg(itemCount) by bin(timestamp, 1d) 
| render areachart

and the result is self-explanatory:

Sampling Rate

Clearly, the sampling rate dropped from 100% down to about 30% right when the anomaly started. The sampling-adjusted query (note the itemCount multiplication)

customEvents
| where name == "EventXReceived"
| where timestamp >= ago(22d)
| project PointCount = todouble(customMeasurements["EventXReceived_Count"]) * itemCount, timestamp
| summarize EventXReceived = sum(PointCount) by bin(timestamp, 1d)
| render timechart

puts us back to the point where the results make sense:

Adjusted Trend on Application Insights Analytics

The Thursday of the third week was a bank holiday in several European countries, so we see a drop there.

Should Azure dashboard items take sampling into account - to avoid confusing people and to show more useful charts?

Mikhail.io Upgraded to HTTPS and HTTP/2

Starting today, this blog has switched to the secure HTTPS protocol:

HTTPS

While there's not that much to secure on my blog, HTTPS is still considered good practice for any site in 2017. One of the benefits that comes with it is the use of the HTTP/2 protocol:

HTTP/2

This should be beneficial to any reader who uses a modern browser!

Thanks to CloudFlare for providing me with free HTTPS and HTTP/2 support.

Reliable Consumer of Azure Event Hubs

Azure Event Hubs is a log-based messaging system-as-a-service in the Azure cloud. It's designed to handle huge amounts of data, and naturally supports multiple consumers.

Event Hubs and Service Bus

While Event Hubs is formally part of the Azure Service Bus family of products, its model is in fact quite different.

The "traditional" Service Bus service is organized around queues (subscriptions are just queues, with the topic being the source of messages). Each consumer can peek messages from the queue, do the required processing, and then either complete the message to remove it from the queue or abandon the processing. An abandoned message stays in the queue, or gets moved to the Dead Letter Queue after too many failed attempts. Completion and abandonment are granular per message, and the status of each message is managed by the Service Bus broker.

Service Bus Processors
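
For illustration, a typical queue consumer with the classic Service Bus client might look roughly like this (a minimal sketch using QueueClient from the WindowsAzure.ServiceBus package; ProcessMessage is a hypothetical helper, and the connection string and queue name are placeholders):

var connectionString = "<service bus connection string>";
var client = QueueClient.CreateFromConnectionString(connectionString, "my-queue");

client.OnMessage(message =>
{
    try
    {
        ProcessMessage(message);   // hypothetical processing logic
        message.Complete();        // removes the message from the queue
    }
    catch (Exception)
    {
        message.Abandon();         // the message stays in the queue for another attempt
    }
});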

The Event Hubs service is different. Each hub represents a log of messages. An event producer appends data to the end of the log, and consumers can read this log, but they can't remove events from it or change their status.

Each event has an offset associated with it, and the only operation supported for consumers is "give me some messages starting at offset X".

Event Hub Processors
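
For illustration, reading a partition from a given offset with the low-level client might look roughly like this (a rough sketch assuming the Microsoft.Azure.EventHubs client; the exact overloads differ between SDK versions, and the connection string, partition "0" and offset "12345" are placeholders):

var connectionString = "<event hub connection string with EntityPath>";
var client = EventHubClient.CreateFromConnectionString(connectionString);

// "Give me some messages starting at offset X" for partition "0"
var receiver = client.CreateReceiver(
    PartitionReceiver.DefaultConsumerGroupName,
    "0",        // partition ID
    "12345");   // starting offset

var events = await receiver.ReceiveAsync(100 /* max batch size */);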

While this approach might seem simplistic, it actually makes consumers more powerful:

  • The messages do not disappear from the Hub after being processed for the first time. So, if needed, the consumer can go back and re-process older events again;

  • Multiple consumers are always supported, out of the box. They just read the same log, after all;

  • Each consumer can go at its own pace, drop and resume processing whenever needed, with no effect on other consumers.

There are some disadvantages too:

  • Consumers have to manage their own processing progress, i.e. they have to save the offset of the last processed event;

  • There is no way to mark any specific event as failed to be able to reprocess it later. There's no notion of Dead Letter Queue either.

Event Processor Host

To overcome the first complication, Microsoft provides a consumer API called EventProcessorHost. This API gives you an implementation of consumers based on checkpointing. All you need to do is provide a callback to process a batch of events and then call the CheckpointAsync method, which saves the offset of the last processed message into Azure Blob Storage. If the consumer restarts at any point in time, it will read the last checkpoint to find the current offset, and will then continue processing from that point on.
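
For reference, wiring up the host with the default blob-based checkpointing might look roughly like this (a minimal sketch assuming the Microsoft.Azure.EventHubs.Processor package, whose method names match the ones used later in this post; MyEventProcessor stands for your own IEventProcessor implementation, and all names and connection strings are placeholders):

string eventHubConnectionString = "<event hub connection string>";
string storageConnectionString = "<blob storage connection string>";

var host = new EventProcessorHost(
    "my-event-hub",                              // event hub path
    PartitionReceiver.DefaultConsumerGroupName,  // consumer group
    eventHubConnectionString,
    storageConnectionString,                     // blob storage holding checkpoints and leases
    "my-lease-container");

await host.RegisterEventProcessorAsync<MyEventProcessor>();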

It works great for some scenarios, but the event delivery/processing guarantees are relatively low in this case:

  • Any failures are ignored: there's no retry or Dead Letter Queue

  • There are no transactions between Event Hub checkpoints and the data sinks that the processor works with (i.e. the data stores where processed messages end up)

In this post I want to focus on a way to process events with higher consistency requirements, in particular:

  • Event Hub processor modifies data in a SQL Database, and such processing is transactional per batch of messages

  • Each event should be (successfully) processed exactly once

  • If event processing fails, the event should be marked as failed and kept available to be reprocessed at a later point in time

While end-to-end exactly-once processing would require changes on the producer side too, we will only focus on the consumer side in this post.

Transactional Checkpoints in SQL

If checkpoint information is stored in Azure Blobs, there is no obvious way to implement distributed transactions between SQL Database and Azure Storage.

However, we can override the default checkpointing mechanism and implement our own checkpoints based on a SQL table. This way each checkpoint update can become part of a SQL transaction and be committed or rolled back with normal guarantees provided by SQL Server.

Here is a table that I created to hold my checkpoints:

CREATE TABLE EventHubCheckpoint (
  Topic varchar(100) NOT NULL,
  PartitionID varchar(100) NOT NULL,
  SequenceNumber bigint NOT NULL,
  Offset varchar(20) NOT NULL,
  CONSTRAINT PK_EventHubCheckpoint PRIMARY KEY CLUSTERED (Topic, PartitionID)
)

For each topic and partition of Event Hubs, we store two values: sequence number and offset, which together uniquely identify the consumer position.

Conveniently, Event Processor Host provides an extensibility point to override the default checkpoint manager with a custom one. For that we need to implement the ICheckpointManager interface to work with our SQL table.

The implementation mainly consists of 3 methods: CreateCheckpointIfNotExistsAsync, GetCheckpointAsync and UpdateCheckpointAsync. The names are pretty much self-explanatory, and my Dapper-based implementation is quite trivial. You can find the code here.
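
To give an idea of the shape of such an implementation, here is a rough sketch of UpdateCheckpointAsync (the signature follows the ICheckpointManager interface whose method names are listed above; the connectionString and topic fields and the SQL statement are illustrative, not the author's exact code):

public async Task UpdateCheckpointAsync(Lease lease, Checkpoint checkpoint)
{
    using (var connection = new SqlConnection(this.connectionString))
    {
        await connection.OpenAsync();

        // The row is created by CreateCheckpointIfNotExistsAsync, so a plain UPDATE is enough here
        await connection.ExecuteAsync(
            @"UPDATE EventHubCheckpoint
              SET SequenceNumber = @SequenceNumber, Offset = @Offset
              WHERE Topic = @Topic AND PartitionID = @PartitionID",
            new
            {
                Topic = this.topic,
                PartitionID = checkpoint.PartitionId,
                checkpoint.SequenceNumber,
                checkpoint.Offset
            });
    }
}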

For now, I'm ignoring the related topic of lease management and the corresponding interface ILeaseManager. It's quite a subject on its own; for the sake of simplicity I'll assume we have just one consumer process per partition, which makes a proper lease manager redundant.

Dead Letter Queue

Now, we want to be able to mark some messages as failed and to re-process them later. To make Dead Letters transactional, we need another SQL table to hold the failed events:

CREATE TABLE EventHubDeadLetter (
  Topic varchar(100) NOT NULL,
  PartitionID varchar(100) NOT NULL,
  SequenceNumber bigint NOT NULL,
  Offset varchar(20) NOT NULL,
  FailedAt datetime NOT NULL,
  Error nvarchar(max) NOT NULL,
  CONSTRAINT PK_EventHubDeadLetter PRIMARY KEY CLUSTERED (Topic, PartitionID, SequenceNumber)
)

This table looks very similar to the EventHubCheckpoint table defined above. That's because they both effectively store pointers to events in a hub. Dead letters have two additional columns to store the error timestamp and text.

There is no need to store the message content, because failed events still sit in the event hub anyway. You could still log it for diagnostic purposes - just add an extra varbinary column.

There's no notion of dead letters in the Event Hubs SDK, so I defined my own interface IDeadLetterManager with a single AddFailedEvents method:

public interface IDeadLetterManager
{
    Task AddFailedEvents(IEnumerable<DeadLetter<EventData>> deadLetters);
}

public class DeadLetter<T>
{
    public DeadLetter(T data, DateTime failureTime, Exception exception)
    {
        this.Data = data;
        this.FailureTime = failureTime;
        this.Exception = exception;
    }

    public T Data { get; set; }
    public DateTime FailureTime { get; set; }
    public Exception Exception { get; set; }
}

The Dapper-based implementation is trivial again; you can find the code here.
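
For illustration, a sketch of what such an implementation could look like (the SqlDeadLetterManager class, its constructor parameters, and the use of SystemProperties to read the offset and sequence number are assumptions, not the author's exact code):

public class SqlDeadLetterManager : IDeadLetterManager
{
    private readonly string connectionString;
    private readonly string topic;
    private readonly string partitionId;

    public SqlDeadLetterManager(string connectionString, string topic, string partitionId)
    {
        this.connectionString = connectionString;
        this.topic = topic;
        this.partitionId = partitionId;
    }

    public async Task AddFailedEvents(IEnumerable<DeadLetter<EventData>> deadLetters)
    {
        using (var connection = new SqlConnection(this.connectionString))
        {
            await connection.OpenAsync();

            // Dapper runs the INSERT once per item in the sequence
            await connection.ExecuteAsync(
                @"INSERT INTO EventHubDeadLetter (Topic, PartitionID, SequenceNumber, Offset, FailedAt, Error)
                  VALUES (@Topic, @PartitionID, @SequenceNumber, @Offset, @FailedAt, @Error)",
                deadLetters.Select(d => new
                {
                    Topic = this.topic,
                    PartitionID = this.partitionId,
                    d.Data.SystemProperties.SequenceNumber,
                    d.Data.SystemProperties.Offset,
                    FailedAt = d.FailureTime,
                    Error = d.Exception.ToString()
                }));
        }
    }
}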

Putting It Together: Event Host

My final solution still uses EventProcessorHost. I pass the SQL checkpoint manager into its constructor, and then implement IEventProcessor's ProcessEventsAsync method in the following way:

  1. Instantiate a list of items to store failed events
  2. Start a SQL transaction
  3. Loop through all the received events in the batch
  4. Process each item inside a try-catch block
  5. If exception happens, add the current event to the list of failed events
  6. After all items are processed, save failed events to Dead Letter table
  7. Update the checkpoint pointer
  8. Commit the transaction

The code block that illustrates this workflow:

public async Task ProcessEventsAsync(
    PartitionContext context, 
    IEnumerable<EventData> eventDatas)
{
    // 1. Instantiate a list of items to store failed events
    var failedItems = new List<DeadLetter<EventData>>();

    // 2. Start a SQL transaction
    using (var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled))
    {
        // 3. Loop through all the received events in the batch
        foreach (var eventData in eventDatas)
        {
            try
            {
                // 4. Process each item inside a try-catch block
                var item = this.Deserialize(eventData);
                await this.DoWork(item);
            }
            catch (Exception ex)
            {
                // 5. Add a failed event to the list
                failedItems.Add(new DeadLetter<EventData>(eventData, DateTime.UtcNow, ex));
            }
        }

        if (failedItems.Any())
        {
            // 6. Save failed items to Dead Letter table
            await this.dlq.AddFailedEvents(failedItems);
        }

        // 7. Update the checkpoint pointer
        await context.CheckpointAsync();

        // 8. Commit the transaction
        scope.Complete();
    }
}

Conclusion

My implementation of an Event Hubs consumer consists of 3 parts: a checkpoint manager that saves processing progress per partition into a SQL table; a dead letter manager that persists information about processing errors; and an event host that uses both to provide transactional processing of events.

The transaction scope is limited to SQL Server databases, but that might be sufficient for many real-world scenarios.

Why F# and Functional Programming Talk at .NET Development Nederland Meetup

On May 8th 2017 I gave a talk at the .NET Development Nederland group in Amsterdam.

Here are the slides for the people who were there and want to revisit the covered topics.

Why Learn F# and Functional Programming

Link to full-screen HTML slides: Why Learn F# and Functional Programming

Slides on SlideShare:

Useful links:

F# for Fun and Profit

F# Foundation

Real-World Functional Programming, With examples in F# and C# Book

Thanks for attending my talk! Feel free to post any feedback in the comments.

Visualizing Dependency Tree from DI Container

So you are a C# developer. And you need to read the code and understand its structure. Maybe you've just joined the project, or it's your own code you wrote 1 year ago. In any case, reading code is hard.

Luckily, some good thought was applied to this particular piece of code. It's all broken down into small classes (they might even be SOLID!), and all the dependencies are injected via constructors. It looks like it's your code indeed.

So, you figured out that the entry point for your current use case is the class called ThingService. It's probably doing something with Things, and that's what you need. The signature of the class constructor looks like this:

public ThingService(
    IGetThings readRepository,
    ISaveThing saveRepository,
    IParseAndValidateExcel<Thing, string> fileParser,
    IThingChangeDetector thingChangeDetector,
    IMap<Thing, ThingDTO> thingToDtoMapper,
    IMap<int, ThingDTO, Thing> dtoToThingMapper)

OK, so we clearly have 6 dependencies here, and they are all interfaces. We don't know where those interfaces are implemented, but hey - we've got the best tooling in the industry, so right click on IGetThings, then Go To Implementation.

public DapperThingRepository(
    ICRUDAdapter adapter,
    IDatabaseConnectionFactory connectionFactory,
    IMap<Thing, ThingRow> thingRowMapper,
    IMap<ThingRow, Thing> thingMapper)

Now we know that we get Thing from Dapper, so probably from a SQL database. Let's go one level deeper and check where those Mappers are implemented. Right click, Go To Implementation... But instead of navigating to another code file you see

Find Symbol Result - 28 matches found

Oh, right, looks like we use IMap<T, U> in more places. OK, we'll find the right one later, let's first check the connection factory... Right click, Go To Implementation. Nah:

The symbol has no implementation

What? But the application works! Ah, IDatabaseConnectionFactory comes from an internal library, so most probably the implementation is also inside that library.

Clearly, navigation doesn't go that well so far.

Dependency Graph

When code reading gets tricky, usually an image can boost the understanding. The picture below actually shows the graph of class dependencies from our example:

Class Dependency Graph

Each node is a class, each arrow is a dependency - an interface injected into the constructor.

Just by looking at the picture for a minute or two, you can start seeing some structure and get at least a high-level idea of the application's complexity and class relationships.

A picture is also a great way to communicate. Once you understand the structure, you can explain it to a colleague much more easily with boxes and lines on the screen in addition to a plain wall of code.

You can enrich such a picture with comments at the time of writing and leave it for your future self or anyone who will read the code in two years' time.

But now the question is - what's the easiest way to draw such a dependency graph?

DI Container

The assumption of this post is that a dependency injection (DI) container of some kind is used in the project. If so, chances are that you can get such dependency graph from the container registrations.

My example is based on the Simple Injector DI container, which is the one we use ourselves. So, further on, I will explain how to draw a dependency graph from a Simple Injector container.

My guess is that any mature DI library will give you a similar possibility, mostly because dependency graphs are built internally by any container during its normal operation.

Implementation

The implementation idea of dependency graph visualization is quite simple, as the biggest chunk of work is done by Simple Injector itself. Here are the steps:

  1. Run all your DI registrations as you do in the actual application. This will initialize Container to the desired state.

  2. Define which class should be the root of the dependency tree under study. You can refine later, but you need to start somewhere.

  3. Call the GetRegistration method of the DI container for the selected type. An instance of the InstanceProducer type is returned (steps 3 to 6 are sketched in code after this list).

  4. Call GetRelationships method of the instance producer to retrieve all interface/class pairs that the given type depends on. Save each relation into your output list.

  5. Navigate through each dependency recursively to load further layers of the graph. Basically, do the depth-first search and save all found relations.

  6. Convert the list of found relations into GraphViz textual graph description.

  7. Use a tool like WebGraphviz to do the actual visualization by converting the text to a picture.
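
To make steps 3 to 6 more concrete, here is a rough sketch of the traversal (Container, InstanceProducer and KnownRelationship are real Simple Injector types; the Collect and ToGraphViz helpers and the use of type names as node labels are just illustrative):

// Steps 3-5: collect (implementation -> dependency) edges with a depth-first search
static void Collect(Container container, Type root, HashSet<Tuple<Type, Type>> relations)
{
    InstanceProducer producer = container.GetRegistration(root, true /* throwOnFailure */);

    foreach (KnownRelationship rel in producer.GetRelationships())
    {
        var edge = Tuple.Create(rel.ImplementationType, rel.Dependency.ServiceType);
        if (relations.Add(edge))   // skip edges we've already seen, so cycles don't recurse forever
        {
            Collect(container, rel.Dependency.ServiceType, relations);
        }
    }
}

// Step 6: convert the edges to GraphViz (DOT) text, ready to paste into WebGraphviz
static string ToGraphViz(IEnumerable<Tuple<Type, Type>> relations) =>
    "digraph G {\n"
    + string.Join("\n", relations.Select(r => $"  \"{r.Item1.Name}\" -> \"{r.Item2.Name}\""))
    + "\n}";

Calling Collect with the fully configured container and a root such as typeof(ThingService), and then passing the collected relations through ToGraphViz, would produce the DOT text for a graph rooted at that class.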

There are several potential pitfalls on the way, like cyclic graphs, decorator registrations etc. To help you avoid those, I've created a small library to automate steps 3 to 6 from the list above. See my SimpleInjector.Visualization GitHub repo and let me know if you find it useful.

Conclusion

People are good at making sense of visual representations - use that skill to improve understanding and communication within your development team.

The practice of dependency injection requires a lot of ceremony to get up and running. Make the most of this work: check what kind of insights you can get from that setup. Dependency graph visualization is one example of such leverage, but there might be other gems in there.

Just keep searching!
