Around IT in 256 seconds

#48: Distributed tracing: find bottlenecks in complex systems

September 07, 2021 | 3 Minute Read

Life used to be simple. In a traditional monolithic application, when a failure occurred, you could easily find the problem. When an exception bubbles up, it carries the whole stack of frames with it. You can easily examine which methods or functions invoked each other. You can see which application layers were involved. Moreover, it’s fairly easy to profile performance bottlenecks. Answering these questions becomes much harder when there are multiple systems involved.

In a distributed system, a failure in one component may be caused by problems several services away. For example, your Python backend returned HTTP 503 to the browser. What happened? If all you have is this one Python application, looking through the logs should be enough. It’s all visible in the stack trace. Moreover, you can see which functions led to the invocation of the broken code. No wonder that in Python it’s often called a traceback. You can trace back what happened!

But what if this Python backend is just a gateway, orchestrating tens of other services? You go to your logs. You see that an error was caused by another error when calling, let’s say, user-service. OK… So you go to the logs of user-service, whatever that is. Indeed, it was calling token-service and timed out waiting for a response. Looking through the logs of token-service shows, sadly… nothing. I mean, there are thousands of logs per second and no errors. Seriously. You carefully correlate logs by time. You ask yourself why, on earth, token-service uses the Australian time zone. Never mind.

Just by pure luck you realize that token-service makes a call to token-verifier, an OAuth authentication endpoint and some SQL database. The SQL query is apparently slow. token-service itself returns successfully. However, too late: user-service no longer waits for its response and times out. That leads to an error in the Python backend and a broken frontend. As you can see, this process is insanely tedious and relies on luck and intuition.

Now, imagine every request, starting from the Python backend all the way down to the database, had some unique identifier. Let’s call that a span ID. All requests within one transaction also share a single trace ID. These two IDs are forwarded inside every HTTP request and response. In fact, this works with other technologies as well, like message brokers. So, what’s the point? First of all, assume you collect all application logs in one central place, like Kibana. Knowing the trace ID under which an error occurred, you can find all logs from all services that contributed to that error. It’s as if all services were invoked within one unique thread.
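
To make this concrete, here is a minimal sketch of the idea, not tied to any particular framework: the header names, the `handle_request` function and the `user-service` logger name are purely illustrative (real systems typically use the W3C `traceparent` header or Zipkin’s B3 headers).

```python
import logging
import uuid

# Illustrative header names; real systems use e.g. W3C traceparent or Zipkin's B3 headers.
TRACE_HEADER = "X-Trace-Id"
SPAN_HEADER = "X-Span-Id"

logging.basicConfig(format="%(message)s", level=logging.INFO)
log = logging.getLogger("user-service")


def handle_request(incoming_headers: dict) -> dict:
    # Reuse the trace ID if the caller sent one, otherwise start a new trace.
    trace_id = incoming_headers.get(TRACE_HEADER, uuid.uuid4().hex)
    # Every service creates its own span within that trace.
    span_id = uuid.uuid4().hex[:16]

    # Tag every log line with both IDs so logs from all services
    # can later be correlated in one central place (e.g. Kibana).
    log.info("[trace=%s span=%s] calling token-service", trace_id, span_id)

    # Forward both IDs to the next service in the chain.
    return {TRACE_HEADER: trace_id, SPAN_HEADER: span_id}


if __name__ == "__main__":
    print(handle_request({"X-Trace-Id": "4bf92f3577b34da6a3ce929d0e0e4736"}))
```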

But it gets better. Each trace consists of multiple spans. A span can have a parent span, as well as start and end timestamps. Therefore, you can reconstruct the hierarchy of callers easily. Even better, you can see which spans run concurrently, and which ones are exceptionally slow. In our example, the root cause would’ve been obvious simply by visualising the trace and its spans. Because spans are collected from multiple machines, it’s called distributed tracing.
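
As a rough illustration, parent IDs and timestamps are enough to rebuild the call tree and spot the bottleneck. The spans below are made up to mirror the example above (backend → user-service → token-service → SQL query); the timings are invented, not measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Made-up spans mirroring the example: backend -> user-service -> token-service -> SQL.
trace = [
    Span("a1", None, "python-backend", 0, 3200),
    Span("b2", "a1", "user-service", 10, 3010),
    Span("c3", "b2", "token-service", 20, 3000),
    Span("d4", "c3", "SQL query", 30, 2950),
]


def children(parent: Optional[str]) -> list:
    return [s for s in trace if s.parent_id == parent]


def print_tree(parent: Optional[str] = None, indent: int = 0) -> None:
    # Walk the parent/child relationships to reconstruct the hierarchy of callers.
    for span in children(parent):
        print(" " * indent + f"{span.name}: {span.duration_ms} ms")
        print_tree(span.span_id, indent + 2)


def self_time(span: Span) -> int:
    # Time spent in the span itself, excluding its children: this is what points at the bottleneck.
    return span.duration_ms - sum(c.duration_ms for c in children(span.span_id))


print_tree()
bottleneck = max(trace, key=self_time)
print("bottleneck:", bottleneck.name, self_time(bottleneck), "ms")  # -> the SQL query
```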

Keep in mind that collecting and aggregating traces in a busy system is no easy task. So, often only a fraction of traces is stored. Moreover, specialized software and databases like Zipkin and Jaeger are used to quickly search through the data.
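
One common way to keep only a fraction of traces is to derive the sampling decision from the trace ID itself, so every service keeps or drops the same traces without coordinating. This is just a hedged sketch of that idea, not any particular library’s API; `should_sample` and the 1% rate are assumptions for illustration.

```python
import hashlib

SAMPLE_RATE = 0.01  # keep roughly 1% of traces; tune to what your storage can handle


def should_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    # Hash the trace ID into [0, 1) so every service makes the same
    # keep/drop decision for a given trace without talking to each other.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


if __name__ == "__main__":
    traces = [f"trace-{i}" for i in range(10_000)]
    kept = sum(should_sample(t) for t in traces)
    print(f"kept {kept} out of {len(traces)} traces")
```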

That’s it, thanks for listening, bye!

More materials

Tags: jaeger, kibana, microservices, opentelemetry, opentracing, span, trace, zipkin
