
OTEL

The datalayer-otel service provides long-term observability storage and querying for the Datalayer Platform. It collects traces, metrics and logs via Apache Pulsar, stores them as Parquet files using Apache DataFusion, and exposes a REST + WebSocket API for querying.

Architecture

The datalayer-otel Helm chart deploys two components into the datalayer-otel namespace:

  1. OTEL Collector – An OpenTelemetry Collector Contrib instance that receives OTLP data (gRPC :4317, HTTP :4318) from instrumented services and exports it to Apache Pulsar topics.
  2. FastAPI Query Service – A Python FastAPI application that consumes OTLP data from Pulsar, stores it in Parquet files (via DataFusion), and serves a REST API for querying traces, metrics and logs.

┌──────────────┐   OTLP gRPC/HTTP    ┌────────────────────┐
│   Services   │ ──────────────────► │   OTEL Collector   │
│  (13 svcs)   │                     │  (datalayer-otel)  │
└──────────────┘                     └─────────┬──────────┘
                                               │ Pulsar
                                     ┌─────────▼──────────┐
                                     │   Apache Pulsar    │
                                     │   otel-traces      │
                                     │   otel-metrics     │
                                     │   otel-logs        │
                                     └─────────┬──────────┘
                                               │ consume
                                     ┌─────────▼──────────┐
                                     │  FastAPI Service   │
                                     │  DataFusion Store  │
                                     │  (Parquet files)   │
                                     └─────────┬──────────┘
                                               │ REST / WS
                                     ┌─────────▼──────────┐
                                     │  UI / CLI clients  │
                                     └────────────────────┘

Info: The datalayer-otel service is separate from the datalayer-observer stack (deployed in the datalayer-observer namespace), which runs Tempo, Prometheus and Loki for Grafana-based dashboards. The datalayer-otel service provides a dedicated Datalayer-native API for observability data, with SQL query capabilities via DataFusion.

Deployment

plane up datalayer-otel

Check the availability of the OTEL Pods.

kubectl get pods -n datalayer-otel

API Endpoints

The FastAPI query service exposes the following endpoints under /api/otel/v1:

| Endpoint | Method | Auth | Description |
| --- | --- | --- | --- |
| /ping | GET | No | Health check |
| /traces/ | GET | JWT | List / query traces |
| /metrics/ | GET | JWT | Query metrics |
| /logs/ | GET | JWT | Query logs |
| /query/sql | POST | JWT | Run ad-hoc SQL via DataFusion |
| /system/stats | GET | Admin | Storage statistics |
| /ws | WS | Token | Real-time WebSocket stream |

API docs are available at https://<RUN_HOST>/api/otel/v1/docs (Swagger) and /api/otel/v1/redoc (ReDoc).
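
As a rough illustration of how a client might address these endpoints from Python, the sketch below only builds the HTTP requests and does not send them. The `limit` query parameter, the Bearer header and the `{"sql": ...}` body mirror the curl examples later on this page; the function names `build_list_request` and `build_sql_request` are hypothetical helpers, not part of the datalayer-otel package, and the response format is not covered here.

```python
import json
import urllib.parse
import urllib.request

API_PREFIX = "/api/otel/v1"


def build_list_request(run_url, api_key, signal, limit=10):
    """Build a GET request for /api/otel/v1/<signal>/?limit=N (hypothetical helper)."""
    if signal not in ("traces", "metrics", "logs"):
        raise ValueError(f"unknown signal: {signal!r}")
    query = urllib.parse.urlencode({"limit": limit})
    return urllib.request.Request(
        f"{run_url.rstrip('/')}{API_PREFIX}/{signal}/?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def build_sql_request(run_url, api_key, sql):
    """Build a POST request for /api/otel/v1/query/sql with a JSON body (hypothetical helper)."""
    return urllib.request.Request(
        f"{run_url.rstrip('/')}{API_PREFIX}/query/sql",
        data=json.dumps({"sql": sql}).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing either request to `urllib.request.urlopen(req, timeout=10)` would perform the call against a reachable deployment.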

CLI

The datalayer-otel package also ships a CLI for operating and querying the service:

datalayer-otel serve        # Start the FastAPI server
datalayer-otel traces       # List / get traces
datalayer-otel metrics      # Query metrics
datalayer-otel logs         # Query logs
datalayer-otel query        # Run ad-hoc SQL via DataFusion
datalayer-otel stats        # Show storage statistics
datalayer-otel smoke-test   # Send traces/metrics/logs and query them back
datalayer-otel logfire      # Send test spans/logs to Logfire
datalayer-otel flush        # Force-flush buffered data
datalayer-otel services     # List observed service names

All query commands authenticate via DATALAYER_API_KEY (Bearer token).

Environment Variables

FastAPI Query Service

Authentication (via datalayer_common)

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DATALAYER_JWT_SECRET | Yes | | Shared secret for JWT token validation |
| DATALAYER_JWT_ISSUER | Yes | | Expected JWT token issuer (e.g. https://id.datalayer.run) |
| DATALAYER_JWT_ALGORITHM | No | HS256 | JWT signing algorithm |
| DATALAYER_JWT_CACHE_VALIDATE | No | false | Cache JWT validation results |
| DATALAYER_IAM_API_KEY | Yes | | Internal service-to-service API key |

Service Configuration

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DATALAYER_OTEL_PORT | No | 7800 | Port the FastAPI server listens on |
| DATALAYER_CORS_ORIGIN | No | * | Allowed CORS origin (used by datalayer_common) |

DataFusion / Storage

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DATALAYER_OTEL_DATAFUSION_DATA_DIR | No | /var/lib/datalayer-otel/data | Path to Parquet data directory |
| DATALAYER_OTEL_DATAFUSION_MAX_ROWS_PER_FILE | No | 100000 | Max rows per Parquet file before rotation |
| DATALAYER_OTEL_RETENTION_DAYS | No | 30 | Number of days to retain Parquet files |
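
To make the role of these three variables concrete, here is a minimal sketch that reads them with the defaults listed above and applies them to two simple decisions. The `StorageSettings` class and the `should_rotate` and `is_expired` rules are assumptions made for illustration; the service's actual rotation and retention logic may differ.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class StorageSettings:
    data_dir: str
    max_rows_per_file: int
    retention_days: int


def load_storage_settings(env=None):
    """Read the documented storage variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return StorageSettings(
        data_dir=env.get("DATALAYER_OTEL_DATAFUSION_DATA_DIR", "/var/lib/datalayer-otel/data"),
        max_rows_per_file=int(env.get("DATALAYER_OTEL_DATAFUSION_MAX_ROWS_PER_FILE", "100000")),
        retention_days=int(env.get("DATALAYER_OTEL_RETENTION_DAYS", "30")),
    )


def should_rotate(rows_in_file, settings):
    """Assumed rotation rule: start a new Parquet file once the row limit is reached."""
    return rows_in_file >= settings.max_rows_per_file


def is_expired(written_at, now, settings):
    """Assumed retention rule: a file older than the retention window may be deleted."""
    return now - written_at > timedelta(days=settings.retention_days)
```

With the defaults, a file holding 100000 rows would be rotated, and a file written 31 days ago would be past the retention window.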

Apache Pulsar

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DATALAYER_OTEL_PULSAR_URL | No | pulsar://pulsar-broker:6650 | Pulsar broker URL |
| DATALAYER_OTEL_PULSAR_TRACES_TOPIC | No | persistent://public/default/otel-traces | Pulsar topic for traces |
| DATALAYER_OTEL_PULSAR_METRICS_TOPIC | No | persistent://public/default/otel-metrics | Pulsar topic for metrics |
| DATALAYER_OTEL_PULSAR_LOGS_TOPIC | No | persistent://public/default/otel-logs | Pulsar topic for logs |
| DATALAYER_OTEL_PULSAR_SUBSCRIPTION | No | datalayer-otel-consumer | Pulsar subscription name |
| DATALAYER_OTEL_PULSAR_BATCH_SIZE | No | 1000 | Number of messages per batch |
| DATALAYER_OTEL_PULSAR_BATCH_TIMEOUT_SECONDS | No | 3 | Max seconds before a partial batch is flushed |
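
The batch size and batch timeout settings describe a common "flush on size or age, whichever comes first" pattern. The sketch below illustrates that pattern in isolation, without a Pulsar client. `MessageBatcher` and its methods are hypothetical names, and this is not the service's consumer code.

```python
import time


class MessageBatcher:
    """Collect messages and flush on size or age, whichever comes first."""

    def __init__(self, flush, batch_size=1000, timeout_seconds=3.0, clock=time.monotonic):
        self._flush = flush              # callback receiving a list of messages
        self._batch_size = batch_size    # cf. DATALAYER_OTEL_PULSAR_BATCH_SIZE
        self._timeout = timeout_seconds  # cf. DATALAYER_OTEL_PULSAR_BATCH_TIMEOUT_SECONDS
        self._clock = clock
        self._items = []
        self._started_at = None          # arrival time of the first message in the batch

    def add(self, message):
        if not self._items:
            self._started_at = self._clock()
        self._items.append(message)
        if len(self._items) >= self._batch_size:
            self.flush()

    def tick(self):
        """Call periodically: flushes a partial batch once it reaches the timeout."""
        if self._items and self._clock() - self._started_at >= self._timeout:
            self.flush()

    def flush(self):
        if self._items:
            batch, self._items = self._items, []
            self._started_at = None
            self._flush(batch)
```

A larger batch size reduces the number of writes at the cost of latency; the timeout bounds how long a quiet topic can hold data back before it becomes queryable.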

Logfire (optional)

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DATALAYER_LOGFIRE_API_KEY | No | "" | Logfire write token |
| DATALAYER_LOGFIRE_PROJECT | No | starter-project | Logfire project name |
| DATALAYER_LOGFIRE_URL | No | https://logfire-us.pydantic.dev | Logfire base URL |
| DATALAYER_LOGFIRE_SEND_TO_LOGFIRE | No | true | Whether to send data to Logfire cloud |

OTEL Collector

The collector is configured via the Helm chart ConfigMap. The following Helm values control its behavior:

| Helm Value | Default | Description |
| --- | --- | --- |
| collector.image | otel/opentelemetry-collector-contrib:0.117.0 | Collector container image |
| collector.otlp.grpcPort | 4317 | OTLP gRPC listen port |
| collector.otlp.httpPort | 4318 | OTLP HTTP listen port |
| collector.pulsar.endpoint | pulsar://pulsar-broker:6650 | Pulsar broker endpoint |
| collector.pulsar.topics.traces | persistent://public/default/otel-traces | Pulsar traces topic |
| collector.pulsar.topics.metrics | persistent://public/default/otel-metrics | Pulsar metrics topic |
| collector.pulsar.topics.logs | persistent://public/default/otel-logs | Pulsar logs topic |

Instrumented Services

The 13 platform services that send telemetry to the OTEL Collector use these environment variables (set via up.sh and the datalayer_common.instrumentation module):

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | Yes | http://datalayer-otel-otel-collector-svc.datalayer-otel.svc.cluster.local:4317 | Collector gRPC endpoint for traces |
| OTEL_EXPORTER_OTLP_METRICS_ENDPOINT | Yes | http://datalayer-otel-otel-collector-svc.datalayer-otel.svc.cluster.local:4317 | Collector gRPC endpoint for metrics |
| OTEL_EXPORTER_OTLP_LOGS_ENDPOINT | Yes | http://datalayer-otel-otel-collector-svc.datalayer-otel.svc.cluster.local:4317 | Collector gRPC endpoint for logs |
| DATALAYER_OTEL_API_KEY | No | "" | Bearer token attached to OTLP export requests |
| OTEL_PYTHON_LOG_LEVEL | No | info | Python OTEL SDK log level |
| OTEL_SDK_DISABLED | No | false | Set to true to disable the OTEL SDK entirely |

Endpoint Values

For a standard deployment, all three OTEL_EXPORTER_OTLP_*_ENDPOINT variables point to the same collector:

# The datalayer-otel collector in the datalayer-otel namespace
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://datalayer-otel-otel-collector-svc.datalayer-otel.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT="http://datalayer-otel-otel-collector-svc.datalayer-otel.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="http://datalayer-otel-otel-collector-svc.datalayer-otel.svc.cluster.local:4317"

If the datalayer-observer stack is also deployed, services can point to the observer collector instead (or in addition) — that collector forwards to Tempo, Prometheus and Loki for Grafana dashboards:

# The datalayer-observer collector in the datalayer-observer namespace
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT="http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317"
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="http://datalayer-collector-collector.datalayer-observer.svc.cluster.local:4317"

Connecting from the Internet (Public Endpoints)

When the datalayer-otel service is deployed behind a public ingress (e.g. https://prod1.datalayer.run), external clients can send OTLP data and query the REST API over the internet.

Environment Variables (Public Internet)

When sending telemetry or querying the OTEL service from outside the cluster, use these environment variables:

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| DATALAYER_RUN_URL | Yes | https://prod1.datalayer.run | Base URL of the Datalayer platform |
| DATALAYER_API_KEY | Yes | | JWT or API key for authentication (used as Bearer token) |
| DATALAYER_OTLP_URL | No | ${DATALAYER_RUN_URL}/api/otel/v1/otlp | OTLP/HTTP collector endpoint for sending signals |
| DATALAYER_OTEL_URL | No | ${DATALAYER_RUN_URL} | REST API base URL for querying traces/metrics/logs |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | No | | Standard OTEL SDK env var (set to ${DATALAYER_OTLP_URL} for external use) |
| OTEL_EXPORTER_OTLP_METRICS_ENDPOINT | No | | Standard OTEL SDK env var (set to ${DATALAYER_OTLP_URL} for external use) |
| OTEL_EXPORTER_OTLP_LOGS_ENDPOINT | No | | Standard OTEL SDK env var (set to ${DATALAYER_OTLP_URL} for external use) |

Internal vs. Public Endpoints

Internal services (inside the cluster) use gRPC on port 4317 via the cluster-local service name. External clients use OTLP/HTTP via the public ingress — gRPC is not available over the public endpoint.

Sending OTLP Signals (OTLP/HTTP)

The OTEL Collector is exposed at https://<RUN_HOST>/api/otel/v1/otlp.
External clients use OTLP/HTTP (not gRPC) and authenticate with a Bearer token:

# Public OTLP endpoint for external / internet clients
export DATALAYER_RUN_URL="https://prod1.datalayer.run"
export DATALAYER_API_KEY="<your-jwt-or-api-key>"

# OTLP/HTTP endpoints (used by the core otel example generator)
export DATALAYER_OTLP_URL="${DATALAYER_RUN_URL}/api/otel/v1/otlp"
# Individual signal endpoints:
# POST ${DATALAYER_OTLP_URL}/v1/traces
# POST ${DATALAYER_OTLP_URL}/v1/logs
# POST ${DATALAYER_OTLP_URL}/v1/metrics

Querying the REST API

export DATALAYER_RUN_URL="https://prod1.datalayer.run"
export DATALAYER_API_KEY="<your-jwt-or-api-key>"

# Query traces
curl -H "Authorization: Bearer ${DATALAYER_API_KEY}" \
  "${DATALAYER_RUN_URL}/api/otel/v1/traces/?limit=10"

# Query logs
curl -H "Authorization: Bearer ${DATALAYER_API_KEY}" \
  "${DATALAYER_RUN_URL}/api/otel/v1/logs/?limit=10"

# Query metrics
curl -H "Authorization: Bearer ${DATALAYER_API_KEY}" \
  "${DATALAYER_RUN_URL}/api/otel/v1/metrics/?limit=10"

# Run SQL on DataFusion
curl -X POST -H "Authorization: Bearer ${DATALAYER_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT service_name, COUNT(*) as cnt FROM spans GROUP BY service_name ORDER BY cnt DESC LIMIT 10"}' \
  "${DATALAYER_RUN_URL}/api/otel/v1/query/sql"

WebSocket (Real-Time Stream)

# Connect to the WebSocket stream (authenticates via query param)
wscat -c "wss://prod1.datalayer.run/api/otel/v1/ws?token=${DATALAYER_API_KEY}"
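
When the WebSocket URL is derived in code rather than typed by hand, the token should be URL-encoded, since JWTs and API keys can contain characters that are not safe in a query string. A minimal sketch, assuming the stream lives at the same host as `DATALAYER_RUN_URL`; `build_ws_url` is a hypothetical helper:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_ws_url(run_url, token):
    """Map an http(s) base URL to the ws(s) stream URL with an encoded token."""
    parts = urlsplit(run_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    query = urlencode({"token": token})
    return urlunsplit((scheme, parts.netloc, "/api/otel/v1/ws", query, ""))
```

The resulting string can be passed to `wscat -c` or to any WebSocket client library.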

Core OTEL Example

The core otel example uses these environment variables to connect to a public deployment:

# Point the example at the public Datalayer platform
export DATALAYER_RUN_URL="https://prod1.datalayer.run"
export DATALAYER_API_KEY="<your-jwt-or-api-key>"

# Optional: override the OTLP target (defaults to ${DATALAYER_RUN_URL}/api/otel/v1/otlp)
# export DATALAYER_OTLP_URL="https://prod1.datalayer.run/api/otel/v1/otlp"

# Optional: override the OTEL REST query URL (defaults to DATALAYER_RUN_URL)
# export DATALAYER_OTEL_URL="https://prod1.datalayer.run"

# Start the example
cd examples/otel
uvicorn app.main:app --reload --port 8600

The generator (generator.py) resolves the OTLP endpoint in this order:

  1. DATALAYER_OTLP_URL — explicit OTLP collector URL
  2. DATALAYER_OTEL_RUN_URL or DATALAYER_RUN_URL + /api/otel/v1/otlp
  3. https://prod1.datalayer.run/api/otel/v1/otlp — production fallback
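
The resolution order above can be sketched as a small function. This is an illustration of the documented precedence, not the contents of generator.py; it assumes that the `/api/otel/v1/otlp` suffix is appended to whichever of `DATALAYER_OTEL_RUN_URL` or `DATALAYER_RUN_URL` is set, and `resolve_otlp_url` is a hypothetical name.

```python
import os

OTLP_PATH = "/api/otel/v1/otlp"
PRODUCTION_FALLBACK = "https://prod1.datalayer.run" + OTLP_PATH


def resolve_otlp_url(env=None):
    """Return the OTLP endpoint following the documented precedence."""
    env = os.environ if env is None else env
    # 1. An explicit OTLP collector URL wins.
    explicit = env.get("DATALAYER_OTLP_URL")
    if explicit:
        return explicit
    # 2. Otherwise derive it from a platform base URL (assumed: suffix applies to both).
    base = env.get("DATALAYER_OTEL_RUN_URL") or env.get("DATALAYER_RUN_URL")
    if base:
        return base.rstrip("/") + OTLP_PATH
    # 3. Fall back to the production endpoint.
    return PRODUCTION_FALLBACK
```

Taking the environment as a parameter keeps the function easy to test without touching the real process environment.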

The UI (vite.config.ts) resolves the REST + WebSocket URLs from DATALAYER_RUN_URL (default https://prod1.datalayer.run).

Smoke Test

You can verify the full pipeline end-to-end (send → Pulsar → DataFusion → query) using:

# Via the datalayer CLI (from datalayer-core)
datalayer otel smoke-test --url https://prod1.datalayer.run --token "$DATALAYER_API_KEY"

# Via the datalayer-otel CLI (from the otel service itself)
datalayer-otel smoke-test --url https://prod1.datalayer.run --token "$DATALAYER_API_KEY"

This sends test traces, metrics and logs, waits for ingestion, then queries them back and runs SQL queries on the DataFusion tables.