# Observability & Debugging
This tutorial covers monitoring, tracing, and debugging your OJS deployment. By the end, you’ll have dashboards, alerts, distributed traces, and CLI debugging skills.
## Prerequisites

- A running OJS backend (any backend works; we’ll use Redis)
- Docker and Docker Compose
- The OJS CLI installed (`go install github.com/openjobspec/ojs-cli@latest`)
## 1. Metrics with Prometheus

Every OJS backend exposes Prometheus metrics at `/metrics`. Key metrics:

```text
ojs_jobs_enqueued_total{queue, type}   # Jobs enqueued
ojs_jobs_completed_total{queue, type}  # Jobs completed
ojs_jobs_failed_total{queue, type}     # Jobs failed
ojs_queue_depth{queue}                 # Current queue depth
ojs_job_duration_seconds{queue, type}  # Processing time histogram
ojs_worker_active_jobs{worker_id}      # Active jobs per worker
ojs_worker_heartbeat_age_seconds       # Time since last heartbeat
```

### Start with Docker Compose
Create a `docker-compose.observability.yml`:
```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  ojs:
    image: ghcr.io/openjobspec/ojs-backend-redis:0.2.0
    ports: ["8080:8080"]
    environment:
      REDIS_URL: redis://redis:6379
      OJS_ALLOW_INSECURE_NO_AUTH: "true"

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_AUTH_ANONYMOUS_ENABLED: "true"
```

And a `prometheus.yml`:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: ojs
    static_configs:
      - targets: ["ojs:8080"]
```

Then start the stack:

```sh
docker compose -f docker-compose.observability.yml up -d
```

### Verify metrics

```sh
curl -s http://localhost:8080/metrics | grep ojs_
```

You should see counters, histograms, and gauges for all job operations.
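If you want to script this check (for CI or a readiness probe, say), the Prometheus exposition format is plain text and easy to parse with the standard library alone. The sketch below sums `ojs_queue_depth` across queues; the helper names here are our own illustration, not part of any OJS SDK.

```python
import re
import urllib.request

def parse_gauge(metrics_text: str, name: str) -> dict:
    """Extract {label-set: value} pairs for one metric from Prometheus text format."""
    values = {}
    # Matches e.g.: ojs_queue_depth{queue="default"} 12
    pattern = re.compile(rf"^{re.escape(name)}(\{{[^}}]*\}})?\s+([0-9.eE+-]+)$")
    for line in metrics_text.splitlines():
        m = pattern.match(line.strip())
        if m:
            values[m.group(1) or ""] = float(m.group(2))
    return values

def total_queue_depth(url: str = "http://localhost:8080/metrics") -> float:
    """Fetch /metrics and sum queue depth over all queues (needs the server running)."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode()
    return sum(parse_gauge(text, "ojs_queue_depth").values())
```

A probe like this can fail a deploy when `total_queue_depth()` stays above a threshold, without pulling in a Prometheus client library.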
### Import Grafana dashboards

- Open Grafana at http://localhost:3000 (admin/admin)
- Add a Prometheus data source: http://prometheus:9090
- Import the OJS dashboards from `deploy/grafana/`:
  - Overview — system-wide throughput, latency, error rate
  - Queues — per-queue depth, age, throughput
  - Workers — count, utilization, heartbeat status
  - Jobs — lifecycle timing and state distribution
  - Errors — error rate by type, retry patterns
  - Performance — p50/p95/p99 latency
## 2. Distributed Tracing with OpenTelemetry

OJS SDKs include built-in OpenTelemetry middleware that traces jobs across producers and workers.
### Go SDK

```go
import "go.opentelemetry.io/otel"

// Producer: traces propagate automatically
client := ojs.NewClient("http://localhost:8080",
    ojs.WithOTel(ojs.OTelConfig{
        ServiceName: "order-api",
    }),
)

// Worker: traces link to producer spans
worker := ojs.NewWorker("http://localhost:8080",
    ojs.WithOTel(ojs.OTelConfig{
        ServiceName: "email-worker",
    }),
)
```

### TypeScript SDK
```ts
import { OJSWorker, openTelemetryMiddleware } from '@openjobspec/sdk';

const worker = new OJSWorker({ url: 'http://localhost:8080' });

worker.use(openTelemetryMiddleware({
  serviceName: 'email-worker',
  endpoint: 'http://otel-collector:4317',
}));
```

### Python SDK
```python
from opentelemetry import trace

tracer = trace.get_tracer("email-worker")

worker = ojs.Worker("http://localhost:8080")

@worker.middleware
async def otel_middleware(ctx, next):
    # Wrap each job in a span named after its type
    with tracer.start_as_current_span(f"ojs.{ctx.job.type}"):
        return await next(ctx)
```

### Viewing traces
Add Jaeger to your Docker Compose:

```yaml
jaeger:
  image: jaegertracing/all-in-one:latest
  ports:
    - "16686:16686"  # Jaeger UI
    - "4317:4317"    # OTLP gRPC
```

Open http://localhost:16686 to see traces spanning from enqueue → fetch → process → ack.
## 3. CLI Debugging

The OJS CLI provides powerful debugging commands.

### Live monitoring dashboard

```sh
ojs monitor --url http://localhost:8080
```

Shows a real-time TUI with queue depths, throughput, worker status, and error rates.
### Job inspection

```sh
# Get job details
ojs status <job-id> --detail

# View job history (state transitions with timestamps)
ojs debug history <job-id>

# Trace a job's full lifecycle
ojs debug trace <job-id>
```

### Queue diagnostics
```sh
# Queue stats
ojs queues --url http://localhost:8080

# Check for bottlenecks
ojs debug bottleneck --queue default

# View the dead letter queue
ojs dead-letter list
```

### Health checks
```sh
# Run the diagnostic suite
ojs doctor --url http://localhost:8080
```

Output includes:

```text
✓ Server reachable
✓ Backend connected (Redis latency: 1.2ms)
✓ Conformance level: L4
✓ No stale workers
✓ Dead letter queue: 0 jobs
```

## 4. Auto-Tuning
OJS includes an auto-tuning engine that analyzes your metrics and recommends optimal settings.

### Enable auto-tuning

```sh
# Via environment variable
OJS_AUTOTUNE=true

# Or via the API
curl http://localhost:8080/ojs/v1/admin/autotune/analyze | jq .
```

### What it recommends
Section titled “What it recommends”| Parameter | How it’s tuned |
|---|---|
| Worker concurrency | Little’s Law: throughput × latency |
| Poll interval | Queue depth + throughput analysis |
| Retry backoff | Error rate pattern classification |
| Connection pool | Peak concurrency × 1.5 |
| Visibility timeout | p99 latency × 2 |
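As a concrete illustration of the table's first rule, Little's Law says the average number of jobs in flight equals arrival rate times average time in system, so a rough concurrency target follows directly from two metrics you already have. The sketch below also applies the pool and visibility-timeout rules of thumb from the table; the function and parameter names are our own, not part of OJS.

```python
import math

def recommend_settings(throughput_jobs_per_s: float,
                       avg_latency_s: float,
                       p99_latency_s: float) -> dict:
    """Back-of-the-envelope tuning targets derived from observed metrics."""
    # Little's Law: L = lambda * W (jobs in flight = arrival rate * time in system)
    concurrency = math.ceil(throughput_jobs_per_s * avg_latency_s)
    return {
        "worker_concurrency": concurrency,
        "connection_pool": math.ceil(concurrency * 1.5),  # peak concurrency * 1.5
        "visibility_timeout_s": p99_latency_s * 2,         # p99 latency * 2
    }

# 50 jobs/s at 0.5 s average latency -> 25 jobs in flight on average
print(recommend_settings(50, 0.5, 2.0))
# -> {'worker_concurrency': 25, 'connection_pool': 38, 'visibility_timeout_s': 4.0}
```

The same arithmetic is what the engine automates continuously against live metrics instead of a one-off snapshot.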
### View recommendations

```sh
curl http://localhost:8080/ojs/v1/admin/autotune/analyze | jq '.recommendations'
```

### Anomaly detection

The auto-tuning engine includes anomaly detection that alerts on:
- Failure spikes — failure rate exceeds baseline by 2σ
- Latency drift — p50 latency trending upward
- Queue backlog — depth growing faster than processing
- Throughput drops — sudden decrease vs learned baseline
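The failure-spike rule is a classic z-score check: flag the current failure rate when it sits more than two standard deviations above a learned baseline. A minimal sketch of the idea (our own illustration, not the engine's actual code):

```python
from statistics import mean, stdev

def is_failure_spike(baseline_rates: list, current_rate: float,
                     sigma: float = 2.0) -> bool:
    """True if current_rate exceeds the baseline mean by more than `sigma` std devs."""
    if len(baseline_rates) < 2:
        return False  # not enough history to estimate a deviation
    mu, sd = mean(baseline_rates), stdev(baseline_rates)
    return current_rate > mu + sigma * sd

baseline = [0.01, 0.02, 0.01, 0.02, 0.01]  # historical failure rates per window
print(is_failure_spike(baseline, 0.15))  # True: well above baseline
print(is_failure_spike(baseline, 0.02))  # False: within normal range
```

A real detector would use a rolling window and decay old samples, but the threshold logic is the same.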
```sh
curl "http://localhost:8080/ojs/v1/admin/autotune/anomalies?learn=true" | jq .
```

## 5. Alerting Rules
Create Prometheus alerting rules for production:

```yaml
groups:
  - name: ojs
    rules:
      - alert: OJSQueueBacklog
        expr: ojs_queue_depth > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} has {{ $value }} pending jobs"

      - alert: OJSHighFailureRate
        expr: rate(ojs_jobs_failed_total[5m]) / rate(ojs_jobs_enqueued_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Failure rate above 10% for queue {{ $labels.queue }}"

      - alert: OJSWorkerStall
        expr: ojs_worker_heartbeat_age_seconds > 60
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.worker_id }} has not sent a heartbeat in 60s"
```

## Summary
| Tool | What it shows | URL |
|---|---|---|
| Prometheus | Raw metrics | http://localhost:9090 |
| Grafana | Visual dashboards | http://localhost:3000 |
| Jaeger | Distributed traces | http://localhost:16686 |
| OJS Admin UI | Job management | http://localhost:8080/ojs/admin/ |
| `ojs monitor` | Real-time TUI | CLI |
| `ojs doctor` | Health diagnostics | CLI |
| Auto-tuning API | Performance recommendations | /ojs/v1/admin/autotune/ |
Next: Production Deployment →