Observability

OJS defines OpenTelemetry-native observability conventions covering trace context propagation, span semantics, metrics, structured logging, and health endpoints.

OJS uses W3C Trace Context for distributed tracing. The traceparent value is stored in the meta object of the job envelope:

{
  "type": "email.send",
  "args": ["user@example.com"],
  "meta": {
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
  }
}

When a job is enqueued, the SDK injects the current trace context into meta.traceparent. When a worker processes the job, it extracts the trace context and creates a linked span, enabling end-to-end tracing from producer through queue to consumer.
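
The inject and extract steps described above can be sketched with the standard library alone. This is a minimal illustration, not the OJS SDK: the function names (format_traceparent, parse_traceparent, inject, extract, start_worker_span) are hypothetical, and a real SDK would more likely delegate to an OpenTelemetry propagator. Only the traceparent layout (version, trace ID, parent ID, flags) comes from the W3C Trace Context format.

```python
import re
import secrets
from typing import Optional

# W3C traceparent layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a version-00 traceparent string."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(value: str) -> Optional[dict]:
    """Return the parsed fields, or None if the value is malformed."""
    match = _TRACEPARENT_RE.fullmatch(value)
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero trace or parent IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def inject(job: dict, trace_id: str, span_id: str) -> dict:
    """Producer side (hypothetical helper): copy the job and set meta.traceparent."""
    meta = {**job.get("meta", {}), "traceparent": format_traceparent(trace_id, span_id)}
    return {**job, "meta": meta}


def extract(job: dict) -> Optional[dict]:
    """Worker side (hypothetical helper): read meta.traceparent if present."""
    value = job.get("meta", {}).get("traceparent")
    return parse_traceparent(value) if isinstance(value, str) else None


def start_worker_span(parent: dict) -> dict:
    """Describe a new worker span that stays in the producer's trace."""
    return {
        "trace_id": parent["trace_id"],
        "span_id": secrets.token_hex(8),  # 8 random bytes = 16 hex characters
        "linked_span_id": parent["parent_span_id"],
    }
```

A job enqueued without meta.traceparent makes extract return None, so a worker can fall back to starting a fresh trace rather than failing the job.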

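Producer span (created when a job is enqueued):

| Attribute | Value |
| --- | --- |
| messaging.system | ojs |
| messaging.operation | publish |
| ojs.job.id | Job ID |
| ojs.job.type | Job type |
| ojs.queue | Target queue |
| Span kind | PRODUCER |

Consumer span (created when a worker processes a job):

| Attribute | Value |
| --- | --- |
| messaging.system | ojs |
| messaging.operation | process |
| ojs.job.id | Job ID |
| ojs.job.type | Job type |
| ojs.job.attempt | Current attempt number |
| ojs.worker.id | Worker identifier |
| Span kind | CONSUMER |
| Links | Link to the PRODUCER span via traceparent |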
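
To make the two attribute sets concrete, the sketch below builds them as plain data. The SpanDescription class and both helper functions are hypothetical names introduced for illustration. Only the attribute keys, the fixed values (ojs, publish, process), and the span kinds come from the tables above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanDescription:
    """Hypothetical container for what a tracer would need to start a span."""
    kind: str
    attributes: dict
    link_traceparent: Optional[str] = None  # set only on consumer spans


def producer_span(job_id: str, job_type: str, queue: str) -> SpanDescription:
    """Attributes for the span created when a job is enqueued."""
    return SpanDescription(
        kind="PRODUCER",
        attributes={
            "messaging.system": "ojs",
            "messaging.operation": "publish",
            "ojs.job.id": job_id,
            "ojs.job.type": job_type,
            "ojs.queue": queue,
        },
    )


def consumer_span(job_id: str, job_type: str, attempt: int,
                  worker_id: str, traceparent: Optional[str]) -> SpanDescription:
    """Attributes for the span created when a worker processes a job."""
    return SpanDescription(
        kind="CONSUMER",
        attributes={
            "messaging.system": "ojs",
            "messaging.operation": "process",
            "ojs.job.id": job_id,
            "ojs.job.type": job_type,
            "ojs.job.attempt": attempt,
            "ojs.worker.id": worker_id,
        },
        # The link back to the PRODUCER span is derived from meta.traceparent.
        link_traceparent=traceparent,
    )
```

With the OpenTelemetry Python API, a dictionary like this could be passed as the attributes argument of Tracer.start_as_current_span together with the matching SpanKind value. That wiring is left out here so the sketch stays dependency-free.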

Backend internal operations create spans for: scheduler promotion, cron evaluation, stalled job reaping, dead letter processing, and workflow orchestration.

Job lifecycle counts:

| Metric | Description |
| --- | --- |
| ojs.jobs.enqueued | Total jobs enqueued |
| ojs.jobs.completed | Total jobs completed successfully |
| ojs.jobs.failed | Total jobs that failed (including retries) |
| ojs.jobs.retried | Total retry attempts |
| ojs.jobs.discarded | Total jobs discarded after exhaustion |
| ojs.jobs.expired | Total jobs expired (TTL) |

Timing metrics:

| Metric | Unit | Description |
| --- | --- | --- |
| ojs.jobs.duration | ms | Job execution duration |
| ojs.jobs.queue_time | ms | Time spent waiting in queue |
| ojs.jobs.enqueue_duration | ms | Time to enqueue a job |

Queue and worker gauges:

| Metric | Description |
| --- | --- |
| ojs.queue.depth | Current number of jobs in queue |
| ojs.workers.active | Number of connected workers |
| ojs.workers.busy | Number of workers currently processing |

All metrics SHOULD be labeled with queue and job_type dimensions for filtering.
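
As an illustration of labeling by queue and job_type, the following in-memory sketch records a completed job. The JobMetrics class and its method names are assumptions made for this example, and a real backend would use an OpenTelemetry meter or a similar metrics library instead. Timestamps are assumed to be integer milliseconds.

```python
from collections import Counter, defaultdict


class JobMetrics:
    """Hypothetical in-memory recorder keyed by (metric, queue, job_type)."""

    def __init__(self) -> None:
        self.counters = Counter()
        self.timings_ms = defaultdict(list)

    def incr(self, name: str, queue: str, job_type: str, amount: int = 1) -> None:
        """Increase a count metric for one queue and job type."""
        self.counters[(name, queue, job_type)] += amount

    def observe(self, name: str, queue: str, job_type: str, value_ms: int) -> None:
        """Record one timing sample in milliseconds."""
        self.timings_ms[(name, queue, job_type)].append(value_ms)

    def record_completed(self, queue: str, job_type: str, enqueued_at_ms: int,
                         started_at_ms: int, finished_at_ms: int) -> None:
        """Update the completion count and both job timings in one call."""
        self.incr("ojs.jobs.completed", queue, job_type)
        # Queue time runs from enqueue until a worker picks the job up.
        self.observe("ojs.jobs.queue_time", queue, job_type,
                     started_at_ms - enqueued_at_ms)
        # Duration covers only the execution itself.
        self.observe("ojs.jobs.duration", queue, job_type,
                     finished_at_ms - started_at_ms)
```

Keeping the labels in the key means a dashboard can filter or aggregate by queue or job type without the producer or worker knowing in advance which views are needed.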

OJS backends SHOULD emit structured JSON logs with correlation fields:

{
  "timestamp": "2026-02-15T10:30:00.123Z",
  "severity": "INFO",
  "message": "job completed",
  "component": "worker",
  "job_id": "01961234-5678-7abc-def0-123456789abc",
  "job_type": "email.send",
  "queue": "default",
  "duration_ms": 245,
  "trace_id": "0af7651916cd43dd8448eb211c80319c",
  "span_id": "b7ad6b7169203331"
}

Including trace_id and span_id in log entries enables correlation between logs and traces.
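
One way to produce log entries in this shape is a custom formatter for Python's standard logging module, sketched below. The OjsJsonFormatter class name and the choice to pass correlation fields through the extra argument are assumptions for this example. The field names are taken from the sample entry above.

```python
import json
import logging
from datetime import datetime, timezone


class OjsJsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    # Correlation fields copied from the record when the caller supplied them.
    FIELDS = ("component", "job_id", "job_type", "queue",
              "duration_ms", "trace_id", "span_id")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO 8601 in UTC with millisecond precision and a Z suffix.
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "severity": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def make_logger(name: str = "ojs") -> logging.Logger:
    """Attach the formatter to a stream handler (writes to stderr by default)."""
    handler = logging.StreamHandler()
    handler.setFormatter(OjsJsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A worker could then call logger.info("job completed", extra={"job_id": ..., "trace_id": ..., "span_id": ...}), and any field omitted from extra is simply left out of the entry.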

Backends MUST expose a health endpoint at GET /ojs/v1/health:

{
  "status": "healthy",
  "backend": {
    "type": "redis",
    "connected": true,
    "latency_ms": 1
  },
  "queues": {
    "total": 5,
    "paused": 0
  },
  "workers": {
    "active": 3,
    "busy": 2
  }
}

| Status | Meaning |
| --- | --- |
| healthy | All systems operational |
| degraded | Functioning with reduced capability |
| unhealthy | Unable to process jobs |
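
The health response body can be assembled by a small function like the one below. The build_health name and its parameters are hypothetical. The rule used to pick a status (unhealthy when the backend is disconnected, degraded when no workers are connected, healthy otherwise) is an assumption made for this sketch, since the text above defines what each status means but not how a backend should decide between them.

```python
def build_health(backend_type: str, connected: bool, latency_ms: int,
                 queues_total: int, queues_paused: int,
                 workers_active: int, workers_busy: int) -> dict:
    """Assemble the body returned by GET /ojs/v1/health."""
    # ASSUMPTION: this status rule is illustrative and not defined by the source.
    if not connected:
        status = "unhealthy"   # the backend cannot be reached, so no jobs can run
    elif workers_active == 0:
        status = "degraded"    # jobs can be enqueued but nothing will process them
    else:
        status = "healthy"
    return {
        "status": status,
        "backend": {
            "type": backend_type,
            "connected": connected,
            "latency_ms": latency_ms,
        },
        "queues": {"total": queues_total, "paused": queues_paused},
        "workers": {"active": workers_active, "busy": workers_busy},
    }
```

An HTTP handler would serialize this dictionary with json.dumps and return it from the health route.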

| Requirement | Level |
| --- | --- |
| Health endpoint | Required (Level 0) |
| Trace context propagation | Required (Level 1) |
| Job lifecycle metrics | Required (Level 1) |
| Structured logging with job context | Recommended |
| System span instrumentation | Recommended |
| Queue and worker gauges | Recommended |