Monday, February 20, 2017

Akka with Zipkin - Async Framework with Distributed Tracing for Microservices

The microservices architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating through lightweight mechanisms such as HTTP or AMQP. These services are built around business capabilities and are independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies.

A tracing infrastructure for distributed microservices needs to record information about all the work done in a system, on behalf of a given initiator.

This blog explains how to set up such tracing for an Akka-based application using Zipkin.

What is Akka Framework?

Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven microservices applications on the JVM.
It is asynchronous and distributed by design. It provides high-level abstractions like Actors, Streams and Futures.


Actors give you:

  • Simple and high-level abstractions for distribution, concurrency and parallelism.
  • Asynchronous, non-blocking and highly performant message-driven programming model.
  • Very lightweight event-driven processes (several million actors per GB of heap memory).

The asynchronous design of this framework makes logging and tracing challenging. A request coming to the server can be handled by multiple asynchronous actors, and these actors might not even reside in the same JVM. How, then, do we track such requests, and how do we calculate or visualize the latency across the actors? This is the problem of distributed tracing.

This problem is solved by Zipkin.

What is Zipkin?

Zipkin is a distributed tracing system. It helps gather timing data needed to troubleshoot latency problems in microservice architectures. It manages both the collection and lookup of this data. 

Applications are instrumented to report timing data to Zipkin. The Zipkin UI also presents a Dependency diagram showing how many traced requests went through each application. If you are troubleshooting latency problems or errors, you can filter or sort all traces based on the application, length of trace, annotation, or timestamp. 

Zipkin is based on the Google Dapper paper, which is linked in the reference list at the end of this post. The paper describes Google’s production distributed systems tracing infrastructure and explains how its design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met.


Using Zipkin system with Akka framework

An Akka-based actor system can be integrated with Zipkin for distributed tracing using an open-source project called akka-tracing, available on Github.

At the time of writing this blog, there were not many examples available of integrating Akka in Java with Zipkin using the akka-tracing library. Hence I wrote this blog and shared my code on Github for reference.


Refer to my Akka Java project with Zipkin implementation on Github here - https://github.com/tuhingupta/akka-sample-tracing-java.git

See the Readme file on Github for instructions on installing and running Zipkin and my project.

Zipkin Terminology


Span: The basic unit of work. Spans are identified by a unique 64-bit ID for the span and another 64-bit ID for the trace the span is a part of. Spans also have other data, such as descriptions, timestamped events, key-value annotations (tags), the ID of the span that caused them, and process IDs (normally IP addresses).

Trace: A set of spans forming a tree-like structure.
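
To make the span and trace definitions concrete, here is a minimal sketch in plain Java. The class SpanSketch and its methods are hypothetical names made up for this illustration; they are not part of Zipkin or akka-tracing. The sketch only shows how a shared trace ID and a parent span ID are enough to arrange spans into a tree. The nesting of the three actor names is an assumption for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration; not a Zipkin or akka-tracing class.
final class SpanSketch {
    final long traceId;   // shared by every span in the same trace
    final long spanId;    // unique to this span
    final Long parentId;  // ID of the span that caused this one; null for the root
    final String name;
    final List<SpanSketch> children = new ArrayList<>();

    private SpanSketch(long traceId, long spanId, Long parentId, String name) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
        this.name = name;
    }

    // The root span starts a new trace and has no parent.
    static SpanSketch root(long traceId, long spanId, String name) {
        return new SpanSketch(traceId, spanId, null, name);
    }

    // A child span inherits the trace ID and points back at its parent.
    SpanSketch child(long spanId, String name) {
        SpanSketch c = new SpanSketch(traceId, spanId, this.spanId, name);
        children.add(c);
        return c;
    }

    // Number of spans in the subtree rooted at this span.
    int size() {
        int n = 1;
        for (SpanSketch c : children) n += c.size();
        return n;
    }

    // Renders the subtree with two spaces of indentation per level.
    String render(String indent) {
        StringBuilder sb = new StringBuilder(indent).append(name).append('\n');
        for (SpanSketch c : children) sb.append(c.render(indent + "  "));
        return sb.toString();
    }

    public static void main(String[] args) {
        SpanSketch a = SpanSketch.root(1L, 1L, "ActorA");
        SpanSketch b = a.child(2L, "ActorB");
        b.child(3L, "ExternalCallActor");
        System.out.print(a.render(""));
    }
}
```

Running main prints the three spans as an indented tree, which is roughly the shape a trace viewer displays.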

Annotation: Used to record the existence of an event in time. Some of the core annotations used to define the start and stop of a request are:
  • cs - Client Sent - The client has made a request. This annotation marks the start of the span.
  • sr - Server Received - The server side got the request and will start processing it. Subtracting the cs timestamp from this timestamp gives the network latency.
  • ss - Server Sent - Annotated upon completion of request processing (when the response is sent back to the client). Subtracting the sr timestamp from this timestamp gives the time the server side needed to process the request.
  • cr - Client Received - Signifies the end of the span. The client has successfully received the response from the server side. Subtracting the cs timestamp from this timestamp gives the total time the client waited for the response from the server.
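
The subtractions described for the four core annotations can be written out as a small sketch. AnnotationTimings is a hypothetical helper invented for this example, not a Zipkin class, and the timestamps are made-up values assumed to be in microseconds. Note that sr minus cs compares a server clock with a client clock, so it is only meaningful if the two clocks are reasonably synchronized.

```java
// Hypothetical helper for illustration; not part of Zipkin or akka-tracing.
final class AnnotationTimings {
    // Timestamps of the four core annotations (assumed to be microseconds).
    final long cs; // Client Sent
    final long sr; // Server Received
    final long ss; // Server Sent
    final long cr; // Client Received

    AnnotationTimings(long cs, long sr, long ss, long cr) {
        this.cs = cs;
        this.sr = sr;
        this.ss = ss;
        this.cr = cr;
    }

    // sr - cs: time the request spent on the network on its way to the server.
    long networkLatency() {
        return sr - cs;
    }

    // ss - sr: time the server needed to process the request.
    long serverProcessingTime() {
        return ss - sr;
    }

    // cr - cs: total time the client waited for the response.
    long totalClientTime() {
        return cr - cs;
    }

    public static void main(String[] args) {
        // Made-up sample values.
        AnnotationTimings t = new AnnotationTimings(1_000L, 1_150L, 1_900L, 2_050L);
        System.out.println("network latency:   " + t.networkLatency());       // 150
        System.out.println("server processing: " + t.serverProcessingTime()); // 750
        System.out.println("total client time: " + t.totalClientTime());      // 1050
    }
}
```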


Use case implemented in my project


The project I developed is available on Github. It replicates a typical microservice scenario using Akka, where a user request is handled by ActorA. This actor creates the request and sends it to a second actor, ActorB. ActorB does its processing, updates the request and forwards it to ExternalCallActor. ExternalCallActor is the actor that makes an HTTP API call to a (hypothetical) external system. Once the response is received, it is sent back to ActorA.

Whether the actors run on the same instance, on different instances or in separate JVMs, the request/response model is asynchronous (inherent to Akka), so it would be difficult to trace individual requests and their latencies.

This is where Zipkin comes into action. It aggregates timing data that can be used to track down latency issues. When a request comes in the front door, Zipkin, a Java-based application, traces it as it goes through the system. Each request gets a unique identifier, which is passed along with the request to each microservice. For Zipkin to work, each microservice is instrumented with a Zipkin library that the service then uses to identify the request’s entry and exit ports. Libraries are available for C#, Java, JavaScript, Python, Go, Scala and Ruby.
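
The idea of passing one identifier along with the request can be sketched without Akka or Zipkin at all. The example below is only an approximation under stated assumptions: it uses CompletableFuture stages as stand-ins for the three actors, a hypothetical TracedMessage envelope, and an in-memory list in place of a real Zipkin collector. It is not how akka-tracing is implemented. It only shows that every asynchronous stage can report against the same trace ID as long as the ID travels inside the message.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch; CompletableFuture stages stand in for Akka actors.
final class TracePropagationSketch {

    // Message envelope: the payload plus the trace ID assigned at the entry point.
    static final class TracedMessage {
        final String traceId;
        final String payload;

        TracedMessage(String traceId, String payload) {
            this.traceId = traceId;
            this.payload = payload;
        }

        // The payload may change from stage to stage; the trace ID never does.
        TracedMessage withPayload(String newPayload) {
            return new TracedMessage(traceId, newPayload);
        }
    }

    // Stand-in for a trace collector: each entry is "<traceId> <stage name>".
    static final List<String> reported = new CopyOnWriteArrayList<>();

    // One processing stage: report against the trace ID, then update the payload.
    static TracedMessage stage(String name, TracedMessage in) {
        reported.add(in.traceId + " " + name);
        return in.withPayload(in.payload + ">" + name);
    }

    // Entry point: assign a random 64-bit trace ID and run the stages asynchronously.
    static CompletableFuture<TracedMessage> handle(String payload) {
        String traceId = Long.toHexString(ThreadLocalRandom.current().nextLong());
        TracedMessage first = new TracedMessage(traceId, payload);
        return CompletableFuture.supplyAsync(() -> stage("ActorA", first))
                .thenApplyAsync(m -> stage("ActorB", m))
                .thenApplyAsync(m -> stage("ExternalCallActor", m));
    }

    public static void main(String[] args) {
        TracedMessage out = handle("req").join();
        System.out.println(out.payload);
        reported.forEach(System.out::println);
    }
}
```

Because the stages may run on different threads, a thread-local variable would not be a reliable place to keep the trace ID here. Carrying it in the message itself is what lets the collector group the three reports into one trace.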





Zipkin UI with log tracing from my project

Zipkin comes with a Web interface that shows the amount of traffic each microservice instance is getting. The log data can be filtered by application, length of trace, annotation, or timestamp.





You can drill down to an individual request and see its span and other data:




Reference Sites:
Github project repo - https://github.com/tuhingupta/akka-sample-tracing-java.git
Akka - http://doc.akka.io/docs/akka/2.4/intro/what-is-akka.html
Zipkin - http://zipkin.io/
Zipkin wiki - https://github.com/openzipkin/zipkin/wiki
Dapper - https://research.google.com/pubs/pub36356.html
akka-tracing - https://github.com/levkhomich/akka-tracing