Thursday, September 14, 2017

Distributed Telemetry at Scale

image

In Designing Azure Metadata Service I elaborated on how we run Azure Instance Metadata Service (IMDS) at massive scale. Running at this scale in 36 regions (at the time of writing) of the world, on incredible number of machines is a hard problem to solve in terms of monitoring and collecting telemetry. Unlike other centralized services it is not as simple as connecting it to a single telemetry pipeline and get done with it.

We need to ensure that

  1. We do not collect too much data (cost/latency)
  2. We do not collect too less (hard to debug issues)
  3. Data collection is fast
  4. We are able to drill down into specific issues and areas of problem
  5. Do all of the above when running in 36 regions of the world
  6. Continue to do all of the above as Azure continues it’s phenomenal growth

To meet all the goals we take a three pronged approach. We break out telemetry to 3 paths

  1. Hot-path: Minimal numeric data that can be uploaded super fast (few second delayed) that we can use for monitoring our service and alert in case anomaly is detected
  2. Warm-path: More richer textual data that are few minute delayed and we can use this to drill down into issues remotely in case hot-path flagged an issue
  3. Cold-path: This gives us full fidelity data to monitor

image

Hot-Path

Even though we run on so many places we want to ensure that we have near real time alerting and monitoring and can quickly catch if something bad is happening. For that we use performance and functionality counters. These counters measure the type of response we are giving back, their latencies, data size etc. All of them are numeric and track each call in progress. We then have high speed uploaders in each machine with backends that can collect these. Then we attach these counters with alerts at per cluster level. We can catch latency issues, failures with few seconds delays. These counters only tell us if something is going bad and not why they are doing so. We have 10s of such numeric high speed telemetry coming from each IMDS instance.

Here’s a snapshot of one such counter in our dashboard showing latency at 90th percentile.

image

In addition we have external dial-tone services that keep pinging IMDS to ensure the services are up everywhere. If there is no response then likely there has been some crash or other deadlocks. We measure the dial-tone as part of our up-time and also have alerts tied to this.

image

Warm-Path

If hot-path counter driven alerts tell us something has gone wrong and an on-call engineer is awaken, the next steps of business is to quickly figure out what’s going on. For that we use our warm-path pipeline. This pipeline uploads informational and error level logging. Due to volume the data is delayed by few minutes. The query granularity can also slow down fetching them. So one of the focus of the hot-path counters is that it can narrow down the location of problem to cluster level/machine level.

The alert directly filters the logs being uploaded to a cluster/machine and brings up all logs. In most cases they are sufficient for us to detect issues. In case that doesn’t work we need to go into the detailed logs.

Cold-Path

Every line of logs (error/info/verbose) our service creates is stored locally on the machines with a certain retention policies. We have built tools so that given an alert an engineer can run a command from his dev box to fetch the log directly from that machine, wherever in the world the machine with the log exists. For hard to debug issues this is the last recourse.

However, more cooler is that we use our CosmosDB offering as a document store and store all error and info logs into that. This ensures the logs remain query-able for a long time (months) for reporting and analysis. We also run jobs that read the logs from these cosmos streams and then shove it into Kusto as structured data. Kusto is also available to users with the more fancier name of Azure Application Insights Analytics. I was floored with the insight we can get with this pipeline. We upload close to 8 terabytes of log data a day into cosmos and still able to query all data over months in a few seconds

Here’s a quick peek into seeing what kind of responses IMDS is handing out.

image

A look into the kinds of queries coming in.

image

Distribution of IMDS version being asked for.

image

We can extract patterns from the logs, run regex matching and all sorts of cool filters and at the same time be able to render data across our fleet in seconds.

No comments: