Observability

Breaking Observability Boundaries And Eliminating Data Silos

Observability represents the ability to measure how well the internal state of a system can be inferred solely from its external outputs. Observability and the challenges associated with it are by no means new; they have existed for decades.

Uber’s microservice architecture circa mid-2018 from Jaeger [https://www.uber.com/en-CH/blog/microservice-architecture/]

Observability is a necessary precursor to ensuring the reliability of cloud-native microservice architectures. In the cloud-native space, a single request traverses multiple components, including gateways, sidecars, microservices, and infrastructure components. Service dependencies and request paths have become highly complex. While traditional distributed tracing systems have attempted to address this issue, they still face several limitations, including invasive instrumentation, incomplete tracing, blind spots, and the need to instrument a variety of frameworks, languages, and libraries.

Coping with the challenges posed by the heterogeneity and diversity of microservices is not straightforward. Additionally, business growth means more teams and more people involved in delivering software solutions. If you are a software engineer, you probably know the saying, "Yeah, go ask the infra team if the issue is on their side... Oh, still no answer; they are busy, and now it has been two days, I will ask again...". Yes, no one loves debugging in a complex cloud infrastructure.

Today, telemetry data is scattered across different tools and systems, requiring teams to frequently jump between different solutions (Prometheus, Grafana, ElasticSearch, Zabbix...). Having application telemetry (e.g., HTTP, Postgres, or Redis calls) and infrastructure observability (e.g., sidecar proxies, Kafka brokers) in one unified platform is more essential than ever.

Furthermore, monitoring with topological relationship awareness is pivotal for both observability and security. In fact, having a clear view of how components are related and interact is critical to maintaining the system's health, understanding API usage patterns, and detecting drift.

The overarching goal of observability is to give a comprehensive view of the system, allowing one to see and understand how different parts of the system interact and behave, both between separate components (inter) and within individual components (intra). 

In this post, we first introduce observability and how APM tools have evolved from simple monitoring tools to tracing-capable solutions. Then, we will delineate the current state-of-the-art observability tools, the problems they aim to solve, and the future challenges in this field. Specifically, we discuss OpenTelemetry, the standard for representing and communicating telemetry data and cutting-edge non-intrusive observability platforms like DeepFlow.

Defining Observability

Observability describes the ability to understand the current state of a system using telemetry data, including logs, metrics, and traces. By collecting and associating observability signals, you can answer questions like where, what, when, and why something happened.

According to Wikipedia, a system is said to be observable if, for every possible evolution of state and control vectors, the current state can be estimated using only the information from outputs (physically, this generally corresponds to information obtained by sensors).

Four Golden Signals From Google’s SRE:

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

Four Golden Signals For Container Security From Isovalent:

Isovalent defines four golden observability signals for container security: process execution, network sockets, file access, and Layer 7 network identity.

Breaking Data Silos

Heterogeneous data types, infrastructures, and microservice architectures pose significant challenges. In complex multi-cloud environments, the amount of broadcast data increases exponentially, as does the number of monitored objects (services, containers, VMs, hardware...). Observability becomes increasingly challenging as systems become more complex.

Today, observability tools are scattered, requiring manual navigation and switching between different tools to troubleshoot an abnormal application, a daunting task that can take hours, if not days. Several open-source and commercial tools can be used to generate and manage telemetry data.

Furthermore, various data models like traces, metrics and logs lack association standards. This inconsistency exacerbates data silos.

Conducting investigations with logs and even metrics is usually insufficient in complex environments.

Logs are not meant for tracing but only for signaling certain activities, like publishing to blob storage. They are often unbalanced (some applications log certain activities while others do not, especially when different teams work on different components).

Metrics indicate what is not working well or lagging, such as Kafka consumer lag, and are very specific.

Distributed tracing reveals why something is not working. It provides a comprehensive view of the request journey, records the start and end times of operations, and can be used by tracing engines and associated with logs and metrics.

Handling each of these signals in isolation is too much for a limited number of personnel. Data sprawl and the lack of interoperability lead to data silos, exacerbate the problem of vendor lock-in, and reduce efficiency.

What we need is a unified system, all in one, that can handle traces, metrics, logs, and all other types of telemetry data from infrastructure to application monitoring.

With high-quality observability data, AI and ML algorithms can be used for correlation, trend prediction, and modeling program behaviors, turning data silos into critical insights [e.g. https://arxiv.org/pdf/2301.13415].

Elastic Observability is one example of such an all-in-one observability platform.

https://www.elastic.co/guide/en/observability/current/observability-introduction.html

Difference Between Monitoring And Observability

Monitoring keeps track of a system’s health and alerts on operational failures. Specifically, monitoring tells you what happened with your workload and what is not working. It involves tracking predefined metrics and alerts to identify issues. Implementing monitoring is relatively easy, as it focuses on specific metrics and thresholds.

Observability, on the other hand, tells you why something happened. It involves a deeper analysis. Implementing observability is more challenging and costly because it requires comprehensive data collection, correlation, and analysis across multiple layers of the stack.

Horizontal Observability

Horizontal observability refers to tracing a call and recording every service involved during the processing of a request, including microservices, infrastructure services like databases or streaming services (e.g., Kafka), and cloud services (e.g., Azure Blob Storage). It provides a high-level narrative of the request journey and can be used to construct a service topology.

Vertical Observability

Vertical observability, on the other hand, maps every application to the underlying infrastructure, such as containers, hosts or nodes, clusters, and AI infrastructure (e.g., GPUs). This approach provides significant context to the system's performance and behavior. For example, you could answer questions like which underlying host or node the service is running on and how much time a system call takes (typically done in profiling).

Comprehensive Observability

For complete observability, it’s essential to form data associations both horizontally and vertically.

Comprehensive visibility is achieved by associating data from the entire stack, including networks, applications, infrastructure, and other entities. By collecting telemetry data from heterogeneous sources and integrating it into a unified view, you can stay one step ahead of stability issues and security attacks.

Logging, Metrics and Traces

source: https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html

Logging, tracing, and metrics are three key pillars for achieving observability.

Traces and Spans

A trace or distributed trace represents the whole journey of a request as it flows through complex, distributed systems (e.g. microservice and serverless applications).

A trace can be represented using a Directed Acyclic Graph (DAG) of spans (units of work or operations along with serving duration). Traces are typically visualised as waterfall diagrams.

https://www.jaegertracing.io/docs/1.17/

Spans are what constitute a trace. Each span is identified by a unique ID within the trace, called a Span ID, which corresponds to a specific step (depending on the data model you are using, the whole trajectory within one service could be a single span, or it can be more granular, with specific function calls starting new spans).

All spans belonging to a request share a single, unique Trace ID. The Span ID essentially gives you the sequence of operations and helps visualize the dependencies between different spans.
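
To make this concrete, here is a minimal, hypothetical sketch of the fields a span typically carries; it is illustrative only, and real data models such as OpenTelemetry's define many more attributes.

// Illustrative only: a stripped-down span model, not any backend's actual schema.
record Span(
    String traceId,       // shared by every span in the same request journey
    String spanId,        // unique within the trace
    String parentSpanId,  // empty for the root span
    String name,          // e.g. "GET /orders" or "SELECT orders"
    long startEpochNanos,
    long endEpochNanos) {
}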

With context propagation, you can correlate different observation signals and create a cohesive story of the end-to-end request journey.

Trace Context Propagation

Distributed Tracing is not possible without context propagation. Context propagation helps in correlating telemetry data (e.g., spans) across distributed systems. There are different ways to propagate the trace identifier between services, typically using HTTP headers (though it depends on the protocol used in inter-microservice communication).
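
As a rough sketch of explicit propagation with OpenTelemetry's W3C propagator (assuming a plain Map<String, String> as the header carrier; names here are illustrative), injection on the caller side and extraction on the callee side could look like this:

import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapPropagator;

import java.util.HashMap;
import java.util.Map;

public class PropagationSketch {
  private static final TextMapPropagator PROPAGATOR = W3CTraceContextPropagator.getInstance();

  // Caller side: copy the current trace context into outgoing headers (adds "traceparent").
  static Map<String, String> injectHeaders() {
    Map<String, String> headers = new HashMap<>();
    PROPAGATOR.inject(Context.current(), headers, Map::put);
    return headers;
  }

  // Callee side: rebuild the caller's context from incoming headers.
  static Context extractContext(Map<String, String> headers) {
    return PROPAGATOR.extract(Context.current(), headers, new TextMapGetter<Map<String, String>>() {
      @Override public Iterable<String> keys(Map<String, String> carrier) { return carrier.keySet(); }
      @Override public String get(Map<String, String> carrier, String key) { return carrier == null ? null : carrier.get(key); }
    });
  }
}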

credit: https://blog.sentry.io/distributed-tracing-101-for-full-stack-developers/

The trace ID, along with other metadata (context) such as the span ID, can be propagated within the same application using code instrumentation (for invasive tracing solutions) and a thread-local variable or similar strategies. In a thread-per-request scenario, it is relatively straightforward to achieve in-app tracing, as every request will be processed in one thread. If a request needs to be sent to an upstream service, the thread will be blocked. Thus, the thread ID can be used for trace context, even at the kernel level (Hello eBPF).
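
A minimal, hypothetical illustration of the thread-local idea (not OTel's actual implementation, which is discussed later) could look like this:

// Hypothetical sketch: per-thread trace context in a thread-per-request server.
public final class TraceContextHolder {
  private static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

  public static void onRequestStart(String traceId) { CURRENT_TRACE_ID.set(traceId); }

  // Any code running on the request thread can read the trace ID to tag spans and logs.
  public static String currentTraceId() { return CURRENT_TRACE_ID.get(); }

  public static void onRequestEnd() { CURRENT_TRACE_ID.remove(); } // avoid leaks on pooled threads
}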

Automated Instrumentation

Automatic instrumentation (or out-of-the-box instrumentation) uses runtime discovery to inject calls to the API and SDK into your application, generating telemetry data and sending it to a suitable tracing backend.

Different methods can be used to achieve auto-instrumentation, including bytecode instrumentation, monkey patching, and eBPF (uprobes).

Auto-instrumentation usually covers popular frameworks and libraries like Apache Kafka clients (e.g. https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md).

While most popular libraries are covered, it is not always enough. Sometimes, you need to manually add metadata or code to generate specific spans within the application itself.

Manual Tracing Instrumentation

Manual tracing instrumentation involves adding code to the application to generate spans and export them to an observability backend. This approach is sometimes necessary when you need to trace spans in specific parts of your codebase and add business-specific metrics.

Usually, adding instrumentation from scratch is costly and time-consuming, and it is generally not recommended to deviate from established standards. Instead, one should use the available API and SDK libraries.

For example, if a library does not include observability capabilities like OpenTelemetry out of the box, you can use OpenTelemetry’s APIs to add the necessary instrumentation. Refer to the OpenTelemetry documentation for more details.
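
As a hedged sketch (the tracer name, span name, and attributes are made up for illustration), manual instrumentation with the OpenTelemetry Java API generally follows this pattern:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class BillingService {
  // Tracer name/version identify the instrumentation scope; values here are illustrative.
  private static final Tracer TRACER = GlobalOpenTelemetry.getTracer("com.example.billing", "1.0.0");

  public void chargeCustomer(String customerId) {
    Span span = TRACER.spanBuilder("charge-customer").startSpan();
    try (Scope ignored = span.makeCurrent()) {       // spans created in this scope become children
      span.setAttribute("customer.id", customerId);  // business-specific attribute
      // ... business logic ...
    } catch (RuntimeException e) {
      span.recordException(e);
      span.setStatus(StatusCode.ERROR);
      throw e;
    } finally {
      span.end();
    }
  }
}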

A good rule of thumb is not to start with manual instrumentation initially, as low-level calls are often instrumented out of the box. You can always add manual instrumentation later if it improves observability.

Observability Backend

Now that we have covered observability signals, at some point you need to send this telemetry data somewhere to make it available for search and analysis. A backend system is essential to store, process, analyze, and visualize the observability signals collected from applications and infrastructure.

Examples of tracing backends include SigNoz, Jaeger, Prometheus, and other commercial vendors. The span-creating service can send data directly (e.g., via an API), or it can use an intermediate layer like the OpenTelemetry Collector.

Now we have our services producing spans, which are sent to the collector and then forwarded to the trace store. The trace store indexes these spans by the trace ID and various other attributes, allowing you to run queries to find interesting behavior in your systems, and even train machine learning models for tasks such as novelty detection and the development of LLM agents.

Finally, the volume of traces or spans sent to a telemetry backend can be controlled to reduce ingestion costs, a process referred to as sampling. It should be stressed that sampling leads to a loss of information.
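
For example, with the OpenTelemetry Java SDK, head-based sampling can be configured on the tracer provider; the 10% ratio below is just an illustrative value:

import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class SamplingConfig {
  // Keep roughly 10% of new traces; follow the caller's sampling decision when one exists.
  static SdkTracerProvider tracerProvider() {
    return SdkTracerProvider.builder()
        .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)))
        .build();
  }
}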

Full Path Tracing

Building a call relationship between microservices along with the response time and their call path can significantly enhance your observability. At any point in time, you can see how requests propagate, making root cause detection much easier.

From a security standpoint, full path tracing can help you detect breaches. With high-quality observability signals, you can build machine learning models to create a distributed profile and detect lateral movements. For example, if Service A suddenly starts calling Service B but hasn’t done so in the last six days or weeks, this might indicate malicious activity or a new feature. Either way, you can provide feedback, and the model can be retrained to incorporate this information, keeping you aware of any unusual activity. Finally, it can reveal the structure of your organization, as systems are often influenced by Conway's Law.

Now, how can we trace requests in a distributed environment? What kind of telemetry do we have? We have logs. One could index all logs, apply statistical analysis, and infer the call chain. This approach is neither accurate nor efficient, even though it is not code intrusive, as the application is not aware of being inspected. Another type of telemetry data is metrics, which are indicators of the health of the system, so they are also not suitable for tracing.

What we can do is empower applications with the capability to create metadata along with signals to trace what happens from the moment a request lands in our environment. Usually, this is achieved using instrumentation, like OpenTelemetry, where specific methods such as database queries and RPC calls are instrumented to generate signals and send them to a collector or a database for the tracing engine to operate on. Each microservice consumes a request, generates spans (representing a segment of the computing task with start/end time, parent span ID, and trace ID), and adds metadata to help track the operations.

Finally, once you have a tracing engine, you can also correlate traces with logs or metrics. To reduce overhead, sampling is used, and only a subset of traces is generated and sent to the telemetry backend. For security use cases, you need consistent sampling.

In-depth Observability: Profiling

Tracing provides a system-wide view of a request's journey. Most observability platforms today offer a service map view based on the collected tracing data. Profiles, on the other hand, are correlated with traces or spans and offer a more granular view. Profiling is very specific and goes deeper into components and resources (CPU, memory, I/O...), covering things like stack traces, function calls, threads, and network and kernel layer calls. Profiling is typically done periodically or as a last resort to identify performance impediments, sources of latency, and so on.

https://www.deepflow.io/docs/about/features/

Traditional profiling solutions typically cover only the user-space stack, while modern eBPF-based solutions like DeepFlow go deeper and offer insight into any function at both the application and kernel level. As shown below, the sampled stack traces generated by DeepFlow cover not just the application stack but also the time spent in libraries and kernel syscalls.

https://www.deepflow.io/docs/guide/ee-tenant/profiling/continue-profile/#continuous-profiling

A trip to OpenTelemetry (OTel)

To observe what happens within your application, you need to deploy sensors. For JVM applications, one typically uses a Java agent to instrument libraries and frameworks to gather telemetry data and send it to the observability backend, such as Jaeger and Zipkin.

Telemetry data includes logs, metrics, and traces. There is broad consensus today on exposing a Prometheus endpoint to export metrics and integrate them into different dashboards. I will not cover logs here and will instead focus on traces.

One problem with traditional distributed tracing is that each tool has its own protocol for receiving data, or in other words, its own SDK that must be used. This means that each time you need to change your observability backend, you also have to change the instrumentation side and start from scratch again, increasing costs.

This lack of standardization hurts data portability and quality and adds a burden on developers.

OpenTelemetry (Otel) addresses this issue by providing a set of standardized, vendor-independent SDKs, APIs, and tools for generating and exporting telemetry data to any observability backend that supports the OpenTelemetry Protocol (OTLP). This resolves the issues of data portability and standardization.

Many companies have capitalized on OTel to build novel solutions, specifically for application runtime monitoring, like Miggo Security (https://www.miggo.io).

OpenTelemetry (OTel)


OpenTelemetry or OTel is an open-source collection of APIs, SDKs, and tools designed to make a system observable through instrumentation or code observability. It provides SDKs and tools for generating and exporting telemetry data to various telemetry backends such as Jaeger, Zipkin, Prometheus, or commercial solutions.

OpenTelemetry API

OpenTelemetry provides vendor-agnostic APIs for generating telemetry data from different sources. SDKs implement the OpenTelemetry API.

OpenTelemetry offers a set of language-specific libraries to aid in observing code and shipping application telemetry to a suitable telemetry backend.

For Java, OpenTelemetry provides a Java agent JAR that can be attached to any Java 8+ application to dynamically inject and modify selected bytecode and capture telemetry from various popular libraries and frameworks.

https://opentelemetry.io/docs/collector/

OpenTelemetry Collector

OpenTelemetry includes a Collector that can receive, process, and export telemetry data to observability backends, thus remaining vendor agnostic.

This collector service helps offload data quickly and can handle additional tasks like retries, batching, encryption, and enrichment with metadata such as Kubernetes metadata. Each service connects to the collector and sends spans over the network as they're generated. OpenTelemetry defines the OTLP protocol for sending telemetry data, which operates on top of gRPC or HTTP.
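
As a hedged sketch, wiring a Java service to export spans over OTLP/gRPC might look like the following (the endpoint is the conventional local collector port; adjust it for your deployment):

import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class OtlpExportConfig {
  static SdkTracerProvider tracerProvider() {
    // Spans are batched in-process and shipped to the collector over gRPC (default port 4317).
    OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://localhost:4317")
        .build();
    return SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .build();
  }
}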

OpenTelemetry K8S Operator

In Kubernetes environments, the OpenTelemetry Operator for Kubernetes can be used to inject auto-instrumentation libraries for .NET, Java, Node.js, Python, and Go into your applications.

Reminder: OpenTelemetry Is Not A Telemetry Backend

OpenTelemetry is not an observability backend like Jaeger, Prometheus, or other commercial vendors. OpenTelemetry is focused on the generation, collection, management, and export of telemetry. A major goal of OpenTelemetry is that you can easily instrument your applications or systems, no matter their language, infrastructure, or runtime environment. Crucially, the storage and visualization of telemetry is intentionally left to other tools (https://opentelemetry.io/docs/what-is-opentelemetry/).

OTel Auto instrumentation

OTel provides what is known as zero-code instrumentation, where most RPC libraries are observed automatically; you don't need to add manual code instrumentation unless you need observability inside your own code or libraries.

The way zero-code instrumentation is done depends on the underlying programming language. It includes bytecode manipulation (Java), monkey patching (Python), and eBPF uprobes (Go).

OTel will automatically do the work for you to inject calls to the OpenTelemetry API and SDK into your application.

Obviously, your application's code is not typically instrumented automatically, as OTel has no knowledge of it. To instrument your code, you'll need to use code-based instrumentation.

The OTel registry can be found at https://opentelemetry.io/ecosystem/registry/.

OTel Java Auto instrumentation

In Java, OpenTelemetry zero-code instrumentation is achieved using a Java agent JAR (not the same as a JVMTI agent) or the Spring Boot starter. Probes are positioned at the edges of the underlying application, such as inbound/outbound HTTP requests, database calls, and so on. Selected libraries and methods (see the OpenTelemetry Java instrumentation libraries) are instrumented. The instrumented code locations are usually high level, and well-known open-source frameworks and libraries are typically covered. The OTel Java agent is implemented using Byte Buddy (see OpenTelemetry Java Instrumentation with Byte Buddy).

OTel auto-instrumentation is achieved through different mechanisms, depending on the dynamics of the underlying library or framework (and on whether you are using the Java agent, the Spring Boot starter, etc.).

For example, in Tomcat, jakarta.servlet.http.HttpServletRequest handler methods are observed on entry and exit to generate spans (Tomcat10ServerHandlerAdvice.java). In Spring Data, method interceptors can be used (SpringDataInstrumentationModule.java). In Reactor Kafka, this can be done by adding a reactive subscriber on the stream (InstrumentedKafkaReceiver.java) to handle messages received from the broker before they are delivered to the application. Finally, in Spring Web, HTTP filters are used (WebApplicationContextInstrumentation.java). This is to stress that once you are embedded within the application runtime, you can integrate deeply with libraries and frameworks to accurately generate an in-app trace consisting of spans, in contrast to non-invasive kernel-based tracing like eBPF, which can fail for various reasons we will discuss later.

In multi-threaded applications, specifically in the thread-per-request model, the OpenTelemetry context is passed through thread-local variables (ThreadLocalContextStorage.java). This way, each piece of instrumented code can fetch it to generate and link other spans, forming an in-app trace.

For reactive frameworks using an event-loop threading model, like Spring WebFlux/Netty, things can be more complicated. This is because one platform thread processes several non-blocking tasks (i.e., requests) simultaneously, and achieving in-app tracing is not straightforward. In fact, one of the big technical challenges encountered when switching from an imperative programming perspective to a reactive programming mindset lies in how you deal with threading. The Reactor context is similar to an immutable map or a key/value store. It is bound to each sequence and propagated upward through the subscription. This feature is unique to Reactor and is not part of other reactive stream implementations. OTel stores the trace context inside the Reactor context, with otel-trace-context as the key. For more details, you can see how it works with Reactor Kafka here.
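
A rough, simplified illustration of the Reactor side of this idea is shown below; the string key and the explicit contextWrite call are for illustration only, as OTel's instrumentation manages its own internal key and wiring:

import io.opentelemetry.context.Context;
import reactor.core.publisher.Mono;

public class ReactorContextSketch {
  static Mono<String> handle() {
    return Mono.deferContextual(ctx -> {
          // Downstream operators read the trace context from the Reactor Context, not from a ThreadLocal.
          Context otelContext = ctx.getOrDefault("otel-trace-context", Context.root());
          return Mono.just("handled for " + otelContext);
        })
        // At subscription time, the current OTel context is written into the Reactor Context.
        .contextWrite(reactorCtx -> reactorCtx.put("otel-trace-context", Context.current()));
  }
}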

Regarding coroutines, OTel is planning to support Java Virtual Threads (issue #9534). While virtual threads still support thread-local variables, an application can spawn millions of virtual threads, and creating a thread-local variable for each one of them can be heap intensive.

OTel supports different propagators to surface distributed OpenTelemetry context across process boundaries to downstream services. The default propagator uses the headers specified by the W3C TraceContext specification. The headers are injected on the fly by OTel when using auto-instrumentation.

  @SuppressWarnings("unused")
  public static class HeadersAdvice {

    @Advice.OnMethodExit(onThrowable = Throwable.class, suppress = Throwable.class)
    public static void methodExit(@Advice.Return(readOnly = false) HttpHeaders headers) {
      headers = setter().inject(headers);
    }
  }

Example: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/java-http-client/javaagent/src/main/java/io/opentelemetry/javaagent/instrumentation/httpclient/HttpHeadersInstrumentation.java

Sometimes, the code observability provided by OTel may not be enough, and you may need to add manual instrumentation to capture other telemetry data of business interest. For more details on how to achieve this with OTel, you can refer to the OpenTelemetry documentation on manual instrumentation setup.

Infrastructure Monitoring with OTel

OTel allows you to collect, visualize, and analyze telemetry data from your applications. You can also use it to cover infrastructure layers (metrics, traces, logs, events), including Kubernetes (Envoy, Istio), cloud (serverless), web servers/proxies (e.g., NGINX), and the health of the hosts, thereby unifying application and infrastructure monitoring in one platform. Examples of such solutions include Datadog and Elastic.

Distributed Tracing with OTel + Jaeger

In this section, we deploy a simple workload consisting of a front proxy, a Java Spring Boot application, a Quarkus application, and a Python FastAPI application. Additionally, we include two infrastructure services: Kafka and PostgreSQL.

All the experiments can be replicated from GitHub[https://github.com/mouadk/distributed-tracing-otel/tree/main]. The repository contains all necessary configuration files, deployment scripts, and code to set up the infrastructure services and deploy the applications. You can follow the instructions provided in the README file.

Deployment

We deploy an OpenTelemetry Collector and use Jaeger as the telemetry backend. OpenTelemetry SDKs and agents collect traces, metrics, and logs from the applications and gateways.

Once the necessary components are deployed, you can hit the Envoy proxy using:

curl -X POST  http://0.0.0.0:10000/publish -H "Content-Type: application/json" -d '{"content": "magic content"}'

It will route the request to the Java application (Spring), which in turn will publish a Kafka message. The Python application registers a consumer that listens for messages published on a shared topic and stores these messages in a PostgreSQL database. Here is what the first request yields in terms of tracing:

I am using a Reactor Kafka publisher, which is not covered by OTel auto-instrumentation (https://opentelemetry.io/docs/zero-code/java/spring-boot-starter/out-of-the-box-instrumentation/), which is why we don't see it above.

Now, let's hit the Python app through the proxy:

curl http://0.0.0.0:10000/messages

and see what it generates in the UI:

We definitely have more spans generated. However, as shown below, we don't see our Kafka consumer span. This is because there is no tracing for AIO Kafka: https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation#readme. As discussed earlier, this is one of the limitations of OTel auto instrumentation.

Finally, using the ingested spans, Jaeger can generate a DAG representing your distributed system, which in our case looks as follows:

Envoy generates request IDs (x-request-id) and trace headers (here traceparent), and this is how we are capable of assembling the spans (OTel will continue from the surfaced trace ID and create parent/child relationships).

https://github.com/envoyproxy/envoy/blob/121a541dd3fadef7131963f23e42a41e0c93e102/source/extensions/tracers/opentelemetry/tracer.cc#L28

You can confirm this yourself.

Logs from Python app

Above, following the W3C Trace Context specification, the Trace ID is b1247e5f1a23dc1f29cd6522a22fd963 while the Span ID is fd07a6648d59e0aa. This is confirmed below.

http://localhost:16686/trace/b1247e5f1a23dc1f29cd6522a22fd963
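
For reference, here is a hypothetical helper showing how a W3C traceparent value like the one above decomposes into its fields:

// Hypothetical helper, for illustration only: decomposing a W3C traceparent header value.
public class TraceparentSketch {
  public static void main(String[] args) {
    // Format: <version>-<trace-id>-<parent-span-id>-<trace-flags>
    String traceparent = "00-b1247e5f1a23dc1f29cd6522a22fd963-fd07a6648d59e0aa-01";
    String[] parts = traceparent.split("-");
    System.out.println("version   = " + parts[0]); // "00"
    System.out.println("trace ID  = " + parts[1]); // 32 hex chars, shared by every span of the request
    System.out.println("parent ID = " + parts[2]); // 16 hex chars, the caller's span ID
    System.out.println("flags     = " + parts[3]); // "01" means the trace was sampled
  }
}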

However, if you recall our workload, it does include Kafka and PostgreSQL, and we don't see them in the DAG provided by Jaeger. This is because infrastructure components are not instrumented to generate spans; we only operate at the library or framework level. If you look at the service map generated for our workload by DeepFlow (discussed later), it looks as follows:

Deepflow - Application - K8s Pod Map

The map is more comprehensive and Kubernetes-aware.

Non-Invasive Tracing Challenges (Kernel/eBPF Based)

Zero Code Instrumentation refers to the capability of achieving observability and distributed tracing without requiring application library instrumentation, either manual or automatic.

One drawback of traditional distributed tracing frameworks is their limited coverage. Spans are restricted to the instrumented open-source frameworks and application-level components, often leaving blind spots such as network calls uncovered (at most, they may include sidecars like Envoy, as discussed earlier).

The aforementioned limitation makes it harder to locate problems precisely and in a timely manner. Even though there has been some work to inject spans from other infrastructure layers like Envoy, it is not enough. Usually, a laborious amount of effort is needed to maintain the SDK for different programming languages and ensure there are no conflicts (and yes, as shown below, conflicts happen):

Zero code instrumentation is typically achieved by placing probes on both user and kernel land using eBPF. The solution is said to be non-intrusive, meaning no modification, recompilation, or redeployment of the monitored application is required.

However, operating outside the application runtime comes with a set of challenges that every non-intrusive observability tool, like DeepFlow or Nahida, needs to cope with:

https://blog.px.dev/ebpf-openssl-tracing/

Existing non-intrusive observability tools use an agent-server-based model. The agent collects spans and then delegates to a server responsible for storing the spans in a database and assembling them into traces.

In the following sections, we will discuss several non-intrusive, eBPF-based observability solutions.

DeepFlow

DeepFlow is an open-source observability tool that attaches eBPF programs to selected Linux system calls and user-defined functions, achieving zero-code instrumentation for complex cloud infrastructures and cloud-native applications. It provides full-stack observability, including distributed tracing and continuous profiling (the latter requires the enterprise edition).

In DeepFlow, spans are created and auto-correlated from code to infrastructure.

https://github.com/deepflowio/deepflow

More specifically, DeepFlow positions hooks over a set of ten system call application binary interfaces (ABIs). DeepFlow opts for system call ABIs over library functions (e.g., GNU C Library) to achieve a high degree of development generality and to access critical information such as the TCP sequence.

Selected system calls are further classified into ingress or egress.

DeepFlow monitors each ingress or egress call as it enters or exits, storing related information in an eBPF map, including process/thread ID, program name, socket ID, and TCP sequence. Collected spans, along with metadata, are sent to the database for assembly.

To propagate context, traditional distributed tracing frameworks like OTel insert context information into the headers or payloads of messages, enabling straightforward data association during assembly. This method is known as explicit context propagation.

DeepFlow, however, uses implicit context propagation, where messages do not travel with context information. The key insight of DeepFlow is that the information required for context propagation is already contained in network-related data (e.g., TCP sequence).

To associate spans within a single process, DeepFlow uses process IDs and thread IDs. Spans begin with a request (root span) and end with a response. Once a span is created, child spans (vertical/intra-component causal relationships) are associated using the process/thread ID. For languages like Golang, DeepFlow observes the creation of coroutines to establish parent-child relationships.

Intra-component causal relationships are achieved using a tag called systrace_id. Data does not travel with the injected tag; instead, it is recorded as a globally unique identifier in the generated span's metadata, which is used during span assembly (non-intrusive).

DeepFlow assumes a 1:1 relationship between the kernel thread and user thread, simplifying dramatically the process of vertical tracing. However, modern frameworks use coroutines to map a large number of virtual/lightweight threads to a small set of kernel threads.

To handle thread reuse, DeepFlow uses time sequences for trace partitioning.

To achieve inter-component causal relationships, DeepFlow utilizes the fact that network transmissions (Layer 2/3/4 forwarding) do not change the TCP sequence, using this information to associate spans generated from different components.

Additionally, DeepFlow supports ingesting traces from third-party distributed tracing frameworks like OpenTelemetry by parsing header fields.

The algorithm for trace assembly involves associating intra-component spans using systrace_ids and pseudo-thread IDs, cross-threaded intra-component spans using X-Request-IDs, and TCP sequences for inter-component span association. Finally, trace IDs are used for third-party integration, resulting in a comprehensive trace combining horizontal and vertical tracing. The iterative algorithm used for trace assembly in DeepFlow is self-explanatory and is shown below:
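
The rules above can be paraphrased as a highly simplified, hypothetical matching sketch; the field names are illustrative and do not reflect DeepFlow's actual schema or implementation:

// Hypothetical span metadata as recorded by the agent (illustrative only).
record EbpfSpan(long syscallTraceId, long pseudoThreadId, String xRequestId, long tcpSeq, String traceId) {}

class AssemblySketch {
  // Same thread of execution inside one component.
  static boolean intraComponent(EbpfSpan a, EbpfSpan b) {
    return a.syscallTraceId() == b.syscallTraceId() || a.pseudoThreadId() == b.pseudoThreadId();
  }
  // Work handed off across threads inside one component.
  static boolean crossThreadIntraComponent(EbpfSpan a, EbpfSpan b) {
    return a.xRequestId() != null && a.xRequestId().equals(b.xRequestId());
  }
  // Same message observed on both sides of a network hop (TCP sequence is unchanged by L2-L4 forwarding).
  static boolean interComponent(EbpfSpan a, EbpfSpan b) {
    return a.tcpSeq() == b.tcpSeq();
  }
  // Spans from OTel or other third-party frameworks are linked by their trace ID.
  static boolean thirdParty(EbpfSpan a, EbpfSpan b) {
    return a.traceId() != null && a.traceId().equals(b.traceId());
  }
}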

DeepFlow: https://dl.acm.org/doi/pdf/10.1145/3603269.3604823

DeepFlow also supports tag-based correlation, such as Kubernetes resource tags, which are not injected directly into the messages (intrusive) but into the generated data/spans by DeepFlow (non-intrusive). To reduce the overhead from a large set of tags attached to a trace, DeepFlow employs smart encoding. You can refer to their published paper for more details: https://dl.acm.org/doi/pdf/10.1145/3603269.3604823.

To deal with SSL/TLS-encrypted messages, DeepFlow adds uprobes to user-space TLS library functions to capture messages before encryption (for Go applications).

DeepFlow has several limitations related to non-blocking I/O, streaming, and coroutines (e.g., Java Virtual Threads), which prevent it from guaranteeing the accuracy of reconstructed request causality in high-concurrency applications (please refer to the study at https://arxiv.org/pdf/2311.09032 for more details).

Nahida: https://arxiv.org/pdf/2311.09032

Additionally, DeepFlow utilizes eBPF to capture function call stacks of applications, a feature available in the enterprise edition. Unlike other application profiling tools, DeepFlow covers not just application functions but also the time spent in libraries and kernel syscalls.

https://www.deepflow.io/docs/features/continuous-profiling/auto-profiling/

To install DeepFlow, use the following commands:

helm repo add deepflow https://deepflowio.github.io/deepflow
helm repo update deepflow # use `helm repo update` when helm < 3.7.0
cat << EOF > values-custom.yaml
global:
  allInOneLocalStorage: true
EOF
helm install deepflow -n deepflow deepflow/deepflow --create-namespace \
  -f values-custom.yaml

https://www.deepflow.io/docs/ce-install/all-in-one/

You should see all the pods running in the deepflow namespace.

Several Grafana templates are created out of the box. To access the Grafana UI, you can port-forward the deepflow-grafana deployment or access it as follows:

NODE_PORT=$(kubectl get --namespace deepflow -o jsonpath="{.spec.ports[0].nodePort}" services deepflow-grafana)
NODE_IP=$(kubectl get nodes -o jsonpath="{.items[0].status.addresses[0].address}")
echo -e "Grafana URL: http://$NODE_IP:$NODE_PORT  \nGrafana auth: admin:deepflow"

https://www.deepflow.io/docs/ce-install/all-in-one/

If you open the service map dashboard (Application - K8s Pod Map), you should see a view similar to the one shown below:

Compared to Jaeger, the map is more comprehensive and Kubernetes-aware. There is also a Distributed Tracing dashboard. However, even though requests are properly ingested:

I can't see spans or traces, likely because I am using asynchronous APIs, which DeepFlow does not currently support. Due to time constraints, I can't investigate this further, especially since it doesn't handle application-level concurrency (event loops, coroutines, Java threads, etc.), as discussed in the previous post. If you know what I missed, please correct me.

Feel free to test it and let me know if you discover why I am only seeing traces related to Kafka (with empty trace IDs).

Nahida

In contrast to DeepFlow, Nahida injects custom trace contexts directly into the messages traveling across the network from the downstream to the upstream service.

https://arxiv.org/pdf/2311.09032

Trace context propagation happens in the kernel, more specifically during message-transfer kernel functions such as sendfile, sendmsg, and recvmsg, where the context is injected into (or extracted from) the header of HTTP messages with the help of the eBPF helper function bpf_msg_push_data.

Additionally, Nahida watches the user thread creation in the kernel to construct parent-child relationships, implementing intra-component causal relationships in multi-threaded applications.

Furthermore, Nahida has some limitations including the inability to cover messages in protocols other than HTTP, such as HTTPS and gRPC.

Finally, while Nahida manages multithreaded applications by maintaining parent-child relationships, it cannot effectively handle coroutines (or non-blocking syscalls). Nahida attaches an eBPF program on the sending side, queries the trace context using the thread ID (or the parent of that child thread), and injects it into the message to be sent, which is not enough to deal with coroutines, as user thread scheduling is entirely in the hands of the application runtime, such as the JDK scheduler. To address this, you either need to probe the cooperative user-space scheduler or use OpenTelemetry to help with the injection of the trace context. Deepak Nadig proposes a similar approach in his work, which can be found here: eBPF-Enhanced Complete Observability Solution for Cloud-native Microservices.

Grafana Beyla

Grafana Beyla is another non-intrusive (zero-code) eBPF-based tracing tool. It is ideal for users who prefer Grafana for application observability but lack the time to adapt their workload by adding probes or instrumenting code.

https://grafana.com/blog/2023/09/13/grafana-beyla-open-source-ebpf-auto-instrumentation/

Grafana Beyla generates spans using OpenTelemetry for common transactions and provides RED metrics (Rate, Errors, Duration) for Linux HTTP/S and gRPC services. Full distributed tracing capabilities were added only recently. More details can be found in their blog post.

In short, Grafana Beyla uses eBPF probes for explicit context propagation. Currently, automatic header injection is only supported for Go programs. Similar to DeepFlow, it tracks goroutine parent-child relationships and their lifecycle to effectively handle asynchronous programming. The Grafana Beyla team plans to expand support to more languages in the future.

For other languages, Grafana Beyla uses implicit context propagation. TCP connection tuples are used to identify each individual connection request. Similar to DeepFlow, the context is not injected but fetched from the other end via an eBPF map.

https://grafana.com/blog/2024/03/21/opentelemetry-distributed-tracing-with-ebpf-whats-new-in-grafana-beyla-1.3/

As discussed earlier, asynchronous requests require sophisticated tracking. The Grafana Beyla team has partially addressed this by tracking the parent-child relationship of request threads using the sys_clone syscall. However, this approach is insufficient for handling thread reuse and N:M thread scheduling (e.g., Netty, Java Virtual Threads...). Library- and language-specific tracking is necessary for these cases. While Go is already supported through goroutine scheduling tracking, they plan to extend support to other libraries and languages in the future.

Odigos

Odigos is an open-source distributed tracing tool that uses both OpenTelemetry zero-code instrumentation (for runtime languages, e.g., Java) and eBPF (for compiled languages, e.g., Go) to automatically instrument application libraries and frameworks, covering Python, Golang, .NET, Java, and Node.js.

https://odigos.io/blog/open-source-stack

Odigos generates traces in OTel format, promoting data interoperability and preventing vendor lock-in.

Pixie

Pixie is an eBPF-based, open-source, zero-code observability platform specially crafted for Kubernetes applications. It supports a wide range of protocols (including DNS, PostgreSQL, and MySQL) and collects data at different levels.

https://github.com/pixie-io/pixie

Pixie employs eBPF probes in both user and kernel space (e.g., Linux syscalls) to collect data. Its coverage includes network and infrastructure monitoring, service performance, database query profiling, request tracing, and continuous application profiling.

Similar to DeepFlow, Pixie positions uprobes at the TLS library's API in order to trace TLS/SSL connections.

https://docs.px.dev/about-pixie/pixie-ebpf/

Additionally, Pixie provides an OpenTelemetry exporter that can be used to export your traces and telemetry using OpenTelemetry spec (https://docs.px.dev/reference/pxl/otel-export/) to other backends like Jaeger or Prometheus.

Pixie does not support profiling for some languages, such as Java, Ruby, and Python; it only covers Go, C/C++, and Rust. Support for more languages is planned (https://docs.px.dev/about-pixie/roadmap/).

LLM Observability with OTel

https://docs.datadoghq.com/fr/llm_observability/

So far, we have discussed how observability tools like OTel instrument libraries and frameworks to generate telemetry data, including traces, and send it to an observability backend.

Companies have already started using generative AI services, including OpenAI, Azure OpenAI, and Anthropic, to drive business innovation, including RAG (Retrieval-Augmented Generation) systems.

Implementing observability at the level of these services is crucial. In fact, governing LLM behaviour and usage is of utmost importance. You should be able to answer questions such as how much the LLMs are being used, the average tokens per request, the average cost per request, latency, and GPU performance, to ensure optimal performance and reliability. Additionally, you can detect attacks such as, but not limited to, prompt injection (Datadog provides a prompt injection scanner, which can detect attempts to make an LLM deviate from its intended behavior).

Datadog: https://docs.datadoghq.com/fr/llm_observability/

Finally, knowing which model you are using in your workload can be very important, for example in case an open-source model has been backdoored.

OpenLIT uses automatic instrumentation to generate traces and metrics. It covers a wide range of LLM providers, vector DBs, frameworks, and GPU performance stats for self-hosted LLMs. Integrating it into GenAI projects is effortless and requires two lines of code:

import openlit

openlit.init()
https://github.com/openlit/openlit

OpenLLMetry is another LLM observability SDK, built on top of OpenTelemetry, giving you complete observability over your LLM application. Similar to OpenLIT, including it in your project requires two lines of code:

from traceloop.sdk import Traceloop

Traceloop.init()

Infrastructure Tracing

Infrastructure components usually come with several exposed metrics out of the box, but sometimes these metrics are insufficient to identify the cause of alerted issues.

DeepFlow: https://deepflow.io

The root cause can often exist vertically, such as system call lagging or locking issues, which tools like OpenTelemetry cannot report. However, eBPF can address these issues, albeit with some limitations. For a detailed example, refer to this article by the Allegro team, where they achieved deep root cause analysis by instrumenting Kafka with eBPF. They discovered that file system writes (journal commits) were the root cause of slow produce requests, something traditional tracing can't automatically reveal, requiring daunting manual effort.

https://blog.allegro.tech/2024/03/kafka-performance-analysis.html

A comprehensive observability tool would correlate incoming requests with their respective system calls and function calls (perhaps on demand or periodically, as this can hurt performance), providing deeper insights. While this is not straightforward, it highlights the types of tasks that are costly to debug and how we can improve observability to cover these blind spots. For a detailed example, I highly recommend reading this article from the Allegro team. It provides a good understanding of these challenges and how eBPF can aid in resolving them.

I don't think any tool we discussed is a silver bullet right now, but let's see what advancements come in the next few months.

We need a hybrid approach at some point!

If I have overlooked something, please let me know.

Conclusion

Achieving observability is an extremely challenging problem, and the dynamic nature of microservices makes it even more complex. Before OpenTelemetry (OTel), many distributed tracing solutions, including Jaeger and Zipkin, were available. However, the way observation data was generated, collected, and mapped was very specific to each solution, creating data silos and exacerbating the problem of data interoperability. OTel emerged as a standard for generating and communicating observation signals, including traces, logs, and metrics, improving data portability and lowering the costs of observability.

OTel is not just an open standard; it provides auto-instrumentation capabilities. In addition to SDKs that can be used to generate spans, it also covers several popular frameworks out of the box and can generate spans automatically for you. The probes are typically placed at the edges of your application, covering things like database calls or remote procedure calls (RPCs). Additionally, context propagation is handled out of the box, making distributed tracing easier than ever. Furthermore, OTel provides a collector that can be used to route all application telemetry to the observability backend of your choice. Other layers, like cloud/serverless, gateways (e.g., NGINX), and service meshes (e.g., Envoy/Istio), can also be observed with OTel tracing.

OTel can be used to achieve distributed tracing, providing horizontal observability at the language and framework level. However, it does not cover other layers like networking, Linux kernel, hardware drivers, etc. Sometimes, application telemetry is not enough, and in this case, vertical observability may be needed. A platform where you can see the loaded libraries, opened/accessed files, processes, system calls, inter-component causality of your application... is pivotal for analyzing the runtime behaviors of applications, and some Cloud Application Detection Response (CADR) solutions already use this context to detect drifts at runtime (e.g., RAD Security).

Collecting, ingesting, and correlating high-quality observability data is crucial in today's data-driven world. High-quality telemetry data plays a vital role in creating detections and predicting potential breaches (e.g., Miggo Security). One of Miggo's features is baselining, which first models normal behavior and then detects exploits and intrusions in real time, providing a robust defense mechanism against sophisticated threats like authentication bypasses that Runtime Application Self-Protection (RASP) alone cannot address. It should be stressed that, in contrast to RASP, solutions relying on security telemetry data exported via OTel can only detect and respond to threats. They do not provide prevention capabilities, as the harm would already be done by the time a threat is detected.

As we discussed today, instead of having each tool ship observation signals from each layer separately, a unified platform is ideal to avoid data silos and prevent people from frequently jumping between several solutions (e.g., Zabbix, Prometheus, Grafana, ElasticSearch, Jaeger).

We discussed several platforms that probe the kernel with eBPF programs, referred to as zero-intrusion distributed tracing solutions, which are less invasive to deploy. Most of these solutions implement implicit or black-box context propagation in the Linux kernel. While eBPF offers promising potential for executing observability code at runtime inside the kernel without impeding system performance, stepping outside the application runtime comes with a set of challenges. The agent still needs to understand how the runtime serves requests in order to deal with NIO and coroutines, which can lead to a lack of precision and accuracy in achieving comprehensive observability.

Both existing invasive and non-invasive systems have shortcomings; there is no silver bullet. Non-invasive systems suffer from inaccurate request causality, while invasive systems like OTel are highly dependent on the underlying programming language and framework. Additionally, they are invasive by nature, leave blind spots, and struggle to cope with complex modern cloud infrastructure.

Finally, capturing SSL traffic presents several challenges. While common SSL/TLS libraries can be intercepted before encryption using eBPF uprobes at the shared library level, this method is ineffective with statically linked libraries, such as Envoy, or runtime-specific implementations, like those in the JVM, where the exact locations to place the probes are unknown. Additionally, sharing private keys with an eBPF agent is not scalable. To address this, a hybrid approach is necessary. For example, in Java, using a Java agent or JVM TI agent to trigger the system probe can be effective, similar to how Runtime Application Self-Protection (RASP) is injected into the JVM process. Datadog employs a similar technique, as discussed in their presentation here.

Having a unified observability platform covering the entire stack, from infrastructure to applications, collecting, ingesting, and correlating all-round, high-quality telemetry data without the burden of instrumentation, is of paramount importance.

Manual trace analysis is time-consuming and not always scalable. Many approaches based on deep learning have been proposed for automating the detection of novelties among traces [DeepTraLog, MultimodalTrace, LogAnomaly, ServiceAnomaly, https://dl.acm.org/doi/pdf/10.1145/3543507.3583215]. Sequence and unified graph representations of telemetry data (logs, spans, metrics...) can be used to learn latent representations and detect anomalous traces. However, the benchmarks in those studies are not always complex microservice systems; their effectiveness is usually validated against simple microservice programs, which fails to reflect reality. Some solutions also fall short in modeling, particularly when dealing with highly multi-threaded programs. Anyway, this is a separate topic that I will cover in a future post.
