Senior Machine Learning Engineer, Numerator
18 March 2023
Distributed tracing (hereafter tracing) is an observability tool
Tracing follows a request through a distributed system
Knowing all commutes could help you understand:
Traces follow a request through an entire system
Spans make up traces
Tags or attributes record information about spans
Tags don't have set schemas
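These concepts can be sketched with a toy data structure (hypothetical names, not the opentelemetry API): every span in a trace shares a trace_id, references a parent span, and carries free-form tags.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str             # shared by every span in one trace
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    tags: dict = field(default_factory=dict)  # free-form: no set schema

# Two spans belonging to the same trace
root = Span("abc123", "s1", None, "hub POST /")
child = Span("abc123", "s2", "s1", "fizzer POST /",
             tags={"http.status_code": 200})

# A trace is just the set of spans sharing a trace_id
trace = [s for s in (root, child) if s.trace_id == root.trace_id]
```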
Tracing is an open standard
Client: opentelemetry (formerly opentracing)
Server: jaeger
FizzBuzz: given a number x:
If x is divisible by 3, return "Fizz"
If x is divisible by 5, return "Buzz"
If x is divisible by 15, return "FizzBuzz"
Otherwise, return x
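The rules above in a few lines of Python (checking divisibility by 15 first, since it subsumes the other two):

```python
def fizzbuzz(x: int) -> str:
    # 15 must be checked first: multiples of 15 also match 3 and 5
    if x % 15 == 0:
        return "FizzBuzz"
    if x % 3 == 0:
        return "Fizz"
    if x % 5 == 0:
        return "Buzz"
    return str(x)

print(" ".join(fizzbuzz(n) for n in range(1, 16)))
# 1 2 Fizz 4 Buzz Fizz 7 8 Fizz Buzz 11 Fizz 13 14 FizzBuzz
```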
Introducing FBaaS - FizzBuzz as a Service
We have the hub running at localhost:6000
It accepts JSON payloads {"number": x}
We’ll use a bash for loop to fizzbuzz numbers 1-15
Seems kind of slow… let’s trace it
(Available at branch tracing-example)
From the tracing-example branch, run docker-compose --profile tracing up to run the traced service
The hub service is still exposed at localhost:6000
The jaeger UI is now exposed at localhost:16686
After running some requests through the hub, we can view their traces at localhost:16686
This is what’s called the traceview
How does tracing work? Two ways to think about that question:
Programs need context to associate traces together
Tracing works by propagating HTTP headers through the system
{"traceparent": f"00-{trace_id}-{span_id}-00"}
Check out the W3C Trace Context standard for more information
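A traceparent header has four dash-separated fields: version, a 32-hex-char trace ID, a 16-hex-char parent span ID, and trace flags. A quick sketch (the IDs below are made-up placeholders):

```python
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # 32 hex chars
span_id = "00f067aa0ba902b7"                   # 16 hex chars

# version "00"; trace flags "01" means sampled
headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}

# The receiving service splits the header back apart
version, incoming_trace_id, parent_span_id, flags = (
    headers["traceparent"].split("-")
)
assert incoming_trace_id == trace_id
assert parent_span_id == span_id
```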
The client (opentelemetry) sends the span to the tracing backend (jaeger)
opentelemetry has a rich collection of open source packages
It's even possible to autoinstrument popular servers:
FastAPI
Flask
We'll look at diffs to emphasize what implementing tracing does
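As a toy illustration of what autoinstrumentation does conceptually (this is not the real opentelemetry API): wrap each handler so a timed "span" is recorded around every call, with no changes to the handler body.

```python
import functools
import time

SPANS = []  # stand-in for an exporter

def instrument(handler):
    """Record a timed 'span' around every call to handler."""
    @functools.wraps(handler)
    def wrapped(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            SPANS.append({
                "name": handler.__name__,
                "duration_s": time.perf_counter() - start,
            })
    return wrapped

@instrument
def fizz(x: int) -> bool:
    return x % 3 == 0

fizz(9)  # the span is recorded as a side effect
```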
# Configure opentelemetry to export spans to jaeger over OTLP/gRPC
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# SERVICE_NAME identifies this service in the jaeger UI
resource = Resource(attributes={"service.name": SERVICE_NAME})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)

# Batch spans and ship them to the jaeger collector's OTLP port
otlp_exporter = OTLPSpanExporter(endpoint="http://jaeger:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)
hub/main.py

def call_remote_service(
    number: int, service: Literal["fizzer", "buzzer"]
) -> bool:
    url = get_service_address(service)
    response = requests.post(url, json={"number": number})
    response_payload = response.json()
    return response_payload["result"]


app = FastAPI()


@app.post("/")
def fizzbuzz(nc: NumberContainer):
    number = nc.number
    fizz = call_remote_service(number, "fizzer")
    buzz = call_remote_service(number, "buzzer")
    ...
hub/main.py

+from opentelemetry.propagate import inject
+from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

 def call_remote_service(
     number: int, service: Literal["fizzer", "buzzer"]
 ) -> bool:
+    headers = {"Content-Type": "application/json"}
+    inject(headers)
-    response = requests.post(url, json={"number": number})
+    response = requests.post(url, json={"number": number}, headers=headers)

 app = FastAPI()
+FastAPIInstrumentor.instrument_app(app)
buzzer/main.py

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/", methods=["POST"])
def buzz():
    x = request.json["number"]
    buzz = bool(x % 5 == 0)
    return jsonify({"result": buzz})
buzzer/main.py

-from flask import Flask, jsonify, request
+import json
+
+from flask import Flask, make_response, request
+from opentelemetry.propagate import inject
+from opentelemetry.instrumentation.flask import FlaskInstrumentor

 app = Flask(__name__)
+FlaskInstrumentor().instrument_app(app)

 @app.route("/", methods=["POST"])
 def buzz():
+    headers = {"Content-Type": "application/json"}
+    inject(headers)
     x = request.json["number"]
     buzz = bool(x % 5 == 0)
-    return jsonify({"result": buzz})
+    return make_response(json.dumps({"result": buzz}), 200, headers)
fizzer/main.py

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/", methods=["POST"])
def fizz():
    x = request.json["number"]
    fizz = bool(x % 3 == 0)
    return jsonify({"result": fizz})
fizzer/main.py

+import json
+
+from flask import Flask, make_response, request
+from opentelemetry import trace
+from opentelemetry.propagate import inject
+from opentelemetry.trace.propagation import tracecontext
+
+FORMAT = tracecontext.TraceContextTextMapPropagator()
+tracer = trace.get_tracer(__name__)

 @app.route("/", methods=["POST"])
 def fizz():
-    x = request.json["number"]
-    fizz = bool(x % 3 == 0)
-    return jsonify({"result": fizz})
+    traceparent = request.headers.get("traceparent")
+    with tracer.start_as_current_span(
+        "/", context=FORMAT.extract({"traceparent": traceparent})
+    ) as fizzspan:
+        headers = {"Content-Type": "application/json"}
+        inject(headers)
+        x = request.json["number"]
+        fizz = bool(x % 3 == 0)
+        return make_response(json.dumps({"result": fizz}), 200, headers)
...
with tracer.start_as_current_span(
    "/", context=FORMAT.extract({"traceparent": traceparent})
) as fizzspan:
    user_agent = request.headers.get("user-agent")
    fizzspan.set_attribute("http.user_agent", user_agent)
    ...
Let’s talk about results!
Tracing is clearly a complicated solution
This is a complicated problem
Positives:
Negatives:
Service meshes can identify latency
It’s possible to approximate tracing without header propagation, see Sachin Ashok and Vipul Harsh
Go beyond the traceview
Services can operate on traces (e.g. demarcating types of traffic)
Teams can use traces to directly analyze traffic across service paths
If traces are backed by SQL storage (or a SQL-like query tool), engineers can easily build custom analyses and tools
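For example, with spans in a SQL table (the schema here is hypothetical), per-service latency is one GROUP BY away:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE spans (
    trace_id TEXT, span_id TEXT, service TEXT, duration_ms REAL)""")
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?, ?)",
    [("t1", "s1", "hub", 12.0),
     ("t1", "s2", "fizzer", 4.0),
     ("t2", "s3", "hub", 30.0),
     ("t2", "s4", "buzzer", 25.0)],
)

# Average span duration per service
rows = conn.execute(
    "SELECT service, AVG(duration_ms) FROM spans "
    "GROUP BY service ORDER BY service"
).fetchall()
print(rows)  # [('buzzer', 25.0), ('fizzer', 4.0), ('hub', 21.0)]
```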
Please read Cindy Sridharan’s terrific post on tracing