# Observability & Debugging
This tutorial covers monitoring, tracing, and debugging your OJS deployment. By the end, you’ll have dashboards, alerts, distributed traces, and CLI debugging skills.
## Prerequisites

- A running OJS backend (any backend works; we’ll use Redis)
- Docker and Docker Compose
- The OJS CLI installed (`go install github.com/openjobspec/ojs-cli@latest`)
## 1. Metrics with Prometheus

Every OJS backend exposes Prometheus metrics at `/metrics`. Key metrics:

```text
ojs_jobs_enqueued_total{queue, type}   # Jobs enqueued
ojs_jobs_completed_total{queue, type}  # Jobs completed
ojs_jobs_failed_total{queue, type}     # Jobs failed
ojs_queue_depth{queue}                 # Current queue depth
ojs_job_duration_seconds{queue, type}  # Processing time histogram
ojs_worker_active_jobs{worker_id}      # Active jobs per worker
ojs_worker_heartbeat_age_seconds       # Time since last heartbeat
```

### Start with Docker Compose
Create a `docker-compose.observability.yml`:
```yaml
services:
  redis:
    image: redis:7-alpine
    ports: ["6379:6379"]

  ojs:
    image: ghcr.io/openjobspec/ojs-backend-redis:0.2.0
    ports: ["8080:8080"]
    environment:
      REDIS_URL: redis://redis:6379
      OJS_ALLOW_INSECURE_NO_AUTH: "true"

  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_AUTH_ANONYMOUS_ENABLED: "true"
```

And a `prometheus.yml`:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: ojs
    static_configs:
      - targets: ["ojs:8080"]
```

Then start the stack:

```sh
docker compose -f docker-compose.observability.yml up -d
```

### Verify metrics

```sh
curl -s http://localhost:8080/metrics | grep ojs_
```

You should see counters, histograms, and gauges for all job operations.
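If you want to script this check (for CI or a readiness probe, say), the Prometheus exposition format is plain text and easy to parse with the standard library alone. The sketch below sums `ojs_queue_depth` across queues; the helper names here are our own illustration, not part of any OJS SDK.

```python
import re
import urllib.request

def parse_gauge(metrics_text: str, name: str) -> dict:
    """Extract {label-set: value} pairs for one metric from Prometheus text format."""
    values = {}
    # Matches e.g.: ojs_queue_depth{queue="default"} 12
    pattern = re.compile(rf"^{re.escape(name)}(\{{[^}}]*\}})?\s+([0-9.eE+-]+)$")
    for line in metrics_text.splitlines():
        m = pattern.match(line.strip())
        if m:
            values[m.group(1) or ""] = float(m.group(2))
    return values

def total_queue_depth(url: str = "http://localhost:8080/metrics") -> float:
    """Fetch /metrics and sum queue depth over all queues (needs the server running)."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode()
    return sum(parse_gauge(text, "ojs_queue_depth").values())
```

A probe like this can fail a deploy when `total_queue_depth()` stays above a threshold, without pulling in a Prometheus client library.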
### Import Grafana dashboards

- Open Grafana at http://localhost:3000 (admin/admin)
- Add a Prometheus data source: http://prometheus:9090
- Import the OJS dashboards from `deploy/grafana/`:
  - Overview — system-wide throughput, latency, error rate
  - Queues — per-queue depth, age, throughput
  - Workers — count, utilization, heartbeat status
  - Jobs — lifecycle timing and state distribution
  - Errors — error rate by type, retry patterns
  - Performance — p50/p95/p99 latency
## 2. Distributed Tracing with OpenTelemetry

OJS SDKs include built-in OpenTelemetry middleware that traces jobs across producers and workers.
### Go SDK

```go
import "go.opentelemetry.io/otel"

// Producer: traces propagate automatically
client := ojs.NewClient("http://localhost:8080",
    ojs.WithOTel(ojs.OTelConfig{
        ServiceName: "order-api",
    }),
)

// Worker: traces link to producer spans
worker := ojs.NewWorker("http://localhost:8080",
    ojs.WithOTel(ojs.OTelConfig{
        ServiceName: "email-worker",
    }),
)
```

### TypeScript SDK
```ts
import { OJSWorker, openTelemetryMiddleware } from '@openjobspec/sdk';

const worker = new OJSWorker({ url: 'http://localhost:8080' });

worker.use(openTelemetryMiddleware({
  serviceName: 'email-worker',
  endpoint: 'http://otel-collector:4317',
}));
```

### Python SDK
```python
from opentelemetry import trace

tracer = trace.get_tracer("email-worker")

worker = ojs.Worker("http://localhost:8080")

@worker.middleware
async def otel_middleware(ctx, next):
    # Wrap each job in a span named after its type
    with tracer.start_as_current_span(f"ojs.{ctx.job.type}"):
        return await next(ctx)
```

### Viewing traces
Add Jaeger to your Docker Compose:

```yaml
jaeger:
  image: jaegertracing/all-in-one:latest
  ports:
    - "16686:16686"  # Jaeger UI
    - "4317:4317"    # OTLP gRPC
```

Open http://localhost:16686 to see traces spanning from enqueue → fetch → process → ack.
## 3. CLI Debugging

The OJS CLI provides powerful debugging commands.

### Live monitoring dashboard

```sh
ojs monitor --url http://localhost:8080
```

Shows a real-time TUI with queue depths, throughput, worker status, and error rates.
### Job inspection

```sh
# Get job details
ojs status <job-id> --detail

# View job history (state transitions with timestamps)
ojs debug history <job-id>

# Trace a job's full lifecycle
ojs debug trace <job-id>
```

### Queue diagnostics
```sh
# Queue stats
ojs queues --url http://localhost:8080

# Check for bottlenecks
ojs debug bottleneck --queue default

# View the dead letter queue
ojs dead-letter list
```

### Health checks
```sh
# Run the diagnostic suite
ojs doctor --url http://localhost:8080
```

Output includes:

```text
✓ Server reachable
✓ Backend connected (Redis latency: 1.2ms)
✓ Conformance level: L4
✓ No stale workers
✓ Dead letter queue: 0 jobs
```

## 4. Auto-Tuning
OJS includes an auto-tuning engine that analyzes your metrics and recommends optimal settings.

### Enable auto-tuning

```sh
# Via environment variable
OJS_AUTOTUNE=true

# Or via the API
curl http://localhost:8080/ojs/v1/admin/autotune/analyze | jq .
```

### What it recommends
Section titled “What it recommends”| Parameter | How it’s tuned |
|---|---|
| Worker concurrency | Little’s Law: throughput × latency |
| Poll interval | Queue depth + throughput analysis |
| Retry backoff | Error rate pattern classification |
| Connection pool | Peak concurrency × 1.5 |
| Visibility timeout | p99 latency × 2 |
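As a concrete illustration of the table's first rule, Little's Law says the average number of jobs in flight equals arrival rate times average time in system, so a rough concurrency target follows directly from two metrics you already have. The sketch below also applies the pool and visibility-timeout rules of thumb from the table; the function and parameter names are our own, not part of OJS.

```python
import math

def recommend_settings(throughput_jobs_per_s: float,
                       avg_latency_s: float,
                       p99_latency_s: float) -> dict:
    """Back-of-the-envelope tuning targets derived from observed metrics."""
    # Little's Law: L = lambda * W (jobs in flight = arrival rate * time in system)
    concurrency = math.ceil(throughput_jobs_per_s * avg_latency_s)
    return {
        "worker_concurrency": concurrency,
        "connection_pool": math.ceil(concurrency * 1.5),  # peak concurrency * 1.5
        "visibility_timeout_s": p99_latency_s * 2,         # p99 latency * 2
    }

# 50 jobs/s at 0.5 s average latency -> 25 jobs in flight on average
print(recommend_settings(50, 0.5, 2.0))
# -> {'worker_concurrency': 25, 'connection_pool': 38, 'visibility_timeout_s': 4.0}
```

The same arithmetic is what the engine automates continuously against live metrics instead of a one-off snapshot.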
### View recommendations

```sh
curl http://localhost:8080/ojs/v1/admin/autotune/analyze | jq '.recommendations'
```

### Anomaly detection

The auto-tuning engine includes anomaly detection that alerts on:
- Failure spikes — failure rate exceeds baseline by 2σ
- Latency drift — p50 latency trending upward
- Queue backlog — depth growing faster than processing
- Throughput drops — sudden decrease vs learned baseline
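The failure-spike rule is a classic z-score check: flag the current failure rate when it sits more than two standard deviations above a learned baseline. A minimal sketch of the idea (our own illustration, not the engine's actual code):

```python
from statistics import mean, stdev

def is_failure_spike(baseline_rates: list, current_rate: float,
                     sigma: float = 2.0) -> bool:
    """True if current_rate exceeds the baseline mean by more than `sigma` std devs."""
    if len(baseline_rates) < 2:
        return False  # not enough history to estimate a deviation
    mu, sd = mean(baseline_rates), stdev(baseline_rates)
    return current_rate > mu + sigma * sd

baseline = [0.01, 0.02, 0.01, 0.02, 0.01]  # historical failure rates per window
print(is_failure_spike(baseline, 0.15))  # True: well above baseline
print(is_failure_spike(baseline, 0.02))  # False: within normal range
```

A real detector would use a rolling window and decay old samples, but the threshold logic is the same.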
```sh
curl "http://localhost:8080/ojs/v1/admin/autotune/anomalies?learn=true" | jq .
```

## 5. Alerting Rules
Create Prometheus alerting rules for production:

```yaml
groups:
  - name: ojs
    rules:
      - alert: OJSQueueBacklog
        expr: ojs_queue_depth > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} has {{ $value }} pending jobs"

      - alert: OJSHighFailureRate
        expr: rate(ojs_jobs_failed_total[5m]) / rate(ojs_jobs_enqueued_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Failure rate above 10% for queue {{ $labels.queue }}"

      - alert: OJSWorkerStall
        expr: ojs_worker_heartbeat_age_seconds > 60
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.worker_id }} has not sent a heartbeat in 60s"
```

## Summary
| Tool | What it shows | URL |
|---|---|---|
| Prometheus | Raw metrics | http://localhost:9090 |
| Grafana | Visual dashboards | http://localhost:3000 |
| Jaeger | Distributed traces | http://localhost:16686 |
| OJS Admin UI | Job management | http://localhost:8080/ojs/admin/ |
| `ojs monitor` | Real-time TUI | CLI |
| `ojs doctor` | Health diagnostics | CLI |
| Auto-tuning API | Performance recommendations | /ojs/v1/admin/autotune/ |
Next: Production Deployment →