Retry Policies
The retry policy governs what happens when a job handler fails: how many times the job is re-attempted, how long the system waits between attempts, and what happens when all attempts are exhausted.
OJS adopts Temporal’s structured retry policy format because it is explicit, language-agnostic, and avoids the field-overloading anti-patterns found in other systems (like Sidekiq’s retry: true vs retry: 5 vs retry: false).
RetryPolicy Structure
    {
      "max_attempts": 3,
      "initial_interval": "PT1S",
      "backoff_coefficient": 2.0,
      "max_interval": "PT5M",
      "jitter": true,
      "non_retryable_errors": [],
      "on_exhaustion": "discard"
    }

| Field | Type | Default | Description |
|---|---|---|---|
| max_attempts | integer | 3 | Total attempts including the first execution. 1 means no retries. 0 means never retry. |
| initial_interval | string | "PT1S" | ISO 8601 duration before the first retry |
| backoff_coefficient | number | 2.0 | Multiplier for successive retry delays. Must be >= 1.0 |
| max_interval | string | "PT5M" | ISO 8601 duration cap on computed delay |
| jitter | boolean | true | Whether to add randomness to prevent thundering herd |
| non_retryable_errors | string[] | [] | Error types that skip retry entirely |
| on_exhaustion | string | "discard" | Action when exhausted: "discard" or "dead_letter" |
The max_attempts field counts total attempts, not retries. If max_attempts is 3, the job runs at most 3 times (1 initial execution plus 2 retries). This avoids the off-by-one confusion of Sidekiq’s retry: 25, which means 26 total executions.
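As an illustration of the two conventions above (ISO 8601 durations and total-attempt counting), here is a minimal Python sketch. The helper names `parse_duration` and `retries_allowed` are hypothetical and not part of OJS, and the parser assumes only the time-based subset of ISO 8601 durations (hours, minutes, seconds) that the examples on this page use.

```python
import re
from datetime import timedelta

# Hypothetical helper: accepts only "PT..H..M..S" durations, not the full ISO 8601 grammar.
_DURATION = re.compile(
    r"^PT(?:(?P<h>\d+(?:\.\d+)?)H)?"
    r"(?:(?P<m>\d+(?:\.\d+)?)M)?"
    r"(?:(?P<s>\d+(?:\.\d+)?)S)?$"
)


def parse_duration(value: str) -> timedelta:
    """Convert a duration such as "PT1S" or "PT5M" into a timedelta."""
    match = _DURATION.match(value)
    # Reject strings that do not match, and a bare "PT" with no components.
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"unsupported ISO 8601 duration: {value!r}")
    parts = {k: float(v) for k, v in match.groupdict().items() if v is not None}
    return timedelta(
        hours=parts.get("h", 0.0),
        minutes=parts.get("m", 0.0),
        seconds=parts.get("s", 0.0),
    )


def retries_allowed(max_attempts: int) -> int:
    """max_attempts counts the first execution, so retries are one fewer."""
    return max(max_attempts - 1, 0)
```

With the default policy, `parse_duration("PT5M")` gives a five-minute `timedelta` and `retries_allowed(3)` gives 2, matching the "1 initial execution plus 2 retries" reading.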
Backoff Strategies
OJS supports four backoff strategies. The default is exponential.
Constant (None)
The delay is the same for every retry.

    delay(n) = initial_interval

Use case: polling an external service with a known, fixed recovery time.
Linear
The delay increases by initial_interval for each successive retry.

    delay(n) = initial_interval * n

Use case: moderate backpressure with gradually increasing delays.
Exponential (Default)
The delay grows exponentially using the backoff coefficient.

    delay(n) = initial_interval * backoff_coefficient^(n-1)

Here n is the retry number, starting at 1 for the first retry. With initial_interval = "PT1S" and backoff_coefficient = 2.0:
| Attempt | Retry # | Delay | Capped (max 5m) |
|---|---|---|---|
| 2 | 1 | 1s | 1s |
| 3 | 2 | 2s | 2s |
| 4 | 3 | 4s | 4s |
| 5 | 4 | 8s | 8s |
| 6 | 5 | 16s | 16s |
| 7 | 6 | 32s | 32s |
| 8 | 7 | 64s | 64s |
| 9 | 8 | 128s | 128s |
| 10 | 9 | 256s | 256s |
| 11 | 10 | 512s | 300s (capped) |
Polynomial (Sidekiq-style)
The delay grows polynomially, using the backoff coefficient as the exponent.

    delay(n) = initial_interval * n^backoff_coefficient

With backoff_coefficient = 4.0, this produces Sidekiq-equivalent growth. As the retry count increases, polynomial backoff eventually grows more slowly than exponential backoff, making it suitable for high-retry-count scenarios (25+ retries over hours or days).
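The four strategy formulas above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function name `raw_delay` and the strategy names passed as strings are assumptions, since the policy structure shown on this page does not name a strategy field. Delays are in seconds, `n` is the retry number starting at 1, and the max_interval cap is deliberately left out.

```python
def raw_delay(strategy: str, n: int, initial: float, coefficient: float = 2.0) -> float:
    """Uncapped delay in seconds before retry number n (n = 1 is the first retry)."""
    if n < 1:
        raise ValueError("retry number n must be >= 1")
    if strategy == "constant":
        return initial
    if strategy == "linear":
        return initial * n
    if strategy == "exponential":
        return initial * coefficient ** (n - 1)
    if strategy == "polynomial":
        return initial * n ** coefficient
    # The strategy names themselves are hypothetical labels for this sketch.
    raise ValueError(f"unknown strategy: {strategy!r}")


# Reproduce the uncapped "Delay" column of the exponential table (retries 1 to 10).
exponential_schedule = [raw_delay("exponential", n, 1.0) for n in range(1, 11)]
```

Keeping the raw formula separate from capping and jitter means each strategy can be checked against its table in isolation.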
Regardless of strategy, the max_interval cap is always applied:
    effective_delay(n) = min(raw_delay(n), max_interval)

Jitter
When jitter is true, the implementation applies a random multiplier to decorrelate retries:

    jittered_delay = effective_delay * random(0.5, 1.5)
    final_delay = min(jittered_delay, max_interval)

The multiplier is uniformly distributed in the half-open interval [0.5, 1.5). This means a 10-second computed delay becomes something between 5 and 15 seconds. The max_interval cap is applied again after jitter.
Jitter is enabled by default because the thundering herd problem is common in production: when many jobs fail simultaneously due to a downstream outage, without jitter they would all retry at the same instant, recreating the same load spike.
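The cap-then-jitter-then-cap sequence described above might look like the following Python sketch. The function name `final_delay` and the injectable `rng` parameter are assumptions made for illustration and testability; delays are in seconds.

```python
import random
from typing import Optional


def final_delay(
    raw: float,
    max_interval: float,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> float:
    """Apply the max_interval cap, optional jitter, then the cap again."""
    if rng is None:
        rng = random.Random()
    effective = min(raw, max_interval)
    if not jitter:
        return effective
    # random() returns a value in [0.0, 1.0), so the multiplier falls in
    # [0.5, 1.5), up to floating-point rounding at the upper edge.
    multiplier = 0.5 + rng.random()
    jittered = effective * multiplier
    # Jitter can push the delay above the cap, so the cap is applied a second time.
    return min(jittered, max_interval)
```

Passing a seeded `random.Random` makes the jittered schedule reproducible in tests, while production code would use the default unseeded generator.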
Non-Retryable Errors
The non_retryable_errors array lists error type strings that should never trigger a retry. Error types use dot-namespaced strings:

    {
      "non_retryable_errors": [
        "validation.payload_invalid",
        "auth.*"
      ]
    }

Matching supports exact match and prefix match (with .* suffix):
| Error Type | Matches "auth.*"? |
|---|---|
| auth.token_expired | Yes (prefix match) |
| auth.forbidden | Yes (prefix match) |
| auth | No (auth does not start with auth.) |
| external.auth.failure | No (the type starts with external., not auth.) |
When a handler error matches a non-retryable entry, the job immediately transitions to the terminal state determined by on_exhaustion, regardless of remaining attempts.
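The matching rules in the table can be expressed as a short Python sketch. The function names `matches` and `is_non_retryable` are hypothetical, and the sketch assumes that `.*` is only meaningful as a trailing suffix, as described above.

```python
from typing import Iterable


def matches(pattern: str, error_type: str) -> bool:
    """Exact match, or prefix match when the pattern ends with ".*"."""
    if pattern.endswith(".*"):
        # Drop only the "*" so the trailing dot stays part of the prefix:
        # "auth.*" matches "auth.forbidden" but not "auth" or "authz.denied".
        return error_type.startswith(pattern[:-1])
    return error_type == pattern


def is_non_retryable(error_type: str, patterns: Iterable[str]) -> bool:
    """True if any entry in non_retryable_errors matches the error type."""
    return any(matches(pattern, error_type) for pattern in patterns)
```

Keeping the dot in the prefix is what makes the bare type `auth` fall outside `auth.*`.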
Handler Response Codes
Handlers can override the default retry behavior at runtime by returning structured error codes:
| Code | Meaning |
|---|---|
| RETRY | Retry with normal backoff (the default when no code is specified) |
| DISCARD | Discard immediately. No retry, no dead letter. |
| DEAD_LETTER | Move to dead letter queue immediately. Skip all remaining retries. |
| FAIL | Permanent failure. Semantically identical to DISCARD but signals a recognized failure condition. |
Handler response codes take precedence over the retry policy. A payment handler that receives a “card stolen” response knows definitively that retrying is futile and can return DEAD_LETTER to escalate for investigation.
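One way to picture how a response code combines with the policy is the Python sketch below. It is an assumption-laden illustration: the `ResponseCode` enum, the `next_state` function, and the outcome strings "retry", "discarded", and "dead_letter" are hypothetical names, not identifiers defined by OJS.

```python
from enum import Enum
from typing import Optional


class ResponseCode(Enum):
    RETRY = "RETRY"
    DISCARD = "DISCARD"
    DEAD_LETTER = "DEAD_LETTER"
    FAIL = "FAIL"


def next_state(
    code: Optional[ResponseCode],
    attempt: int,
    max_attempts: int,
    on_exhaustion: str = "discard",
) -> str:
    """Decide what happens after a failed attempt (attempt is 1-based)."""
    # Explicit handler codes take precedence over the retry policy.
    if code is ResponseCode.DEAD_LETTER:
        return "dead_letter"
    if code in (ResponseCode.DISCARD, ResponseCode.FAIL):
        return "discarded"
    # RETRY, or no code at all: fall back to the policy.
    if attempt < max_attempts:
        return "retry"
    return "dead_letter" if on_exhaustion == "dead_letter" else "discarded"
```

In the payment example, a handler returning `ResponseCode.DEAD_LETTER` on the first attempt would yield "dead_letter" even though attempts remain.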
Default Retry Policy
Jobs enqueued without an explicit retry policy receive these defaults:

    {
      "max_attempts": 3,
      "initial_interval": "PT1S",
      "backoff_coefficient": 2.0,
      "max_interval": "PT5M",
      "jitter": true,
      "non_retryable_errors": [],
      "on_exhaustion": "discard"
    }

When a job specifies a partial retry policy, provided fields override the defaults and omitted fields inherit from them. A developer who only wants max_attempts: 10 does not need to re-specify every other field.
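The override-and-inherit behavior amounts to a field-by-field merge, sketched here in Python. The names `DEFAULT_RETRY_POLICY` and `resolve_policy` are hypothetical; the default values are the ones listed above.

```python
import copy
from typing import Any, Dict, Optional

DEFAULT_RETRY_POLICY: Dict[str, Any] = {
    "max_attempts": 3,
    "initial_interval": "PT1S",
    "backoff_coefficient": 2.0,
    "max_interval": "PT5M",
    "jitter": True,
    "non_retryable_errors": [],
    "on_exhaustion": "discard",
}


def resolve_policy(partial: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Fill omitted fields from the defaults; provided fields win."""
    # Deep-copy so callers cannot mutate the shared default list by accident.
    resolved = copy.deepcopy(DEFAULT_RETRY_POLICY)
    resolved.update(partial or {})
    return resolved
```

For example, `resolve_policy({"max_attempts": 10})` keeps `initial_interval` at "PT1S" and every other default while raising the attempt limit.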
Practical Examples
No retry at all (analytics event that is acceptable to lose):

    { "retry": { "max_attempts": 1 } }

Aggressive retry with dead letter (payment processing):

    {
      "retry": {
        "max_attempts": 25,
        "initial_interval": "PT15S",
        "backoff_coefficient": 4.0,
        "max_interval": "PT1H",
        "non_retryable_errors": ["payment.card_stolen", "payment.card_expired", "validation.*"],
        "on_exhaustion": "dead_letter"
      }
    }

Constant backoff for polling (check export status every 10 seconds):

    {
      "retry": {
        "max_attempts": 60,
        "initial_interval": "PT10S",
        "backoff_coefficient": 1.0,
        "max_interval": "PT10S",
        "jitter": false
      }
    }

Interaction with Visibility Timeout
The retry delay and visibility timeout are independent mechanisms. The retry delay applies when a worker explicitly reports failure via FAIL. The visibility timeout applies when a worker crashes or disappears without acknowledging the job. When a visibility timeout expires, it counts as a failed attempt and follows the retry policy.
Workers can extend the visibility timeout by sending heartbeats during long-running execution. This prevents jobs from being prematurely requeued while still actively processing.
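To make the heartbeat mechanism concrete, here is a small in-memory Python sketch. It is purely illustrative: the `Lease` class, its fields, and the idea of passing the current time in explicitly are assumptions for this example and do not describe how any OJS backend stores or extends visibility timeouts.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical record of a job that a worker currently holds."""

    job_id: str
    deadline: float  # time, in seconds, at which the visibility timeout expires

    def heartbeat(self, now: float, timeout: float) -> None:
        # Each heartbeat pushes the deadline out to a full timeout from now.
        self.deadline = now + timeout

    def expired(self, now: float) -> bool:
        # Once expired, the job would count as a failed attempt and be
        # handed to the retry policy.
        return now >= self.deadline


lease = Lease(job_id="job-1", deadline=30.0)
lease.heartbeat(now=25.0, timeout=30.0)  # worker is still processing at t=25
still_held = not lease.expired(now=50.0)  # would have expired at t=30 without the heartbeat
```

A worker that stops sending heartbeats simply lets the deadline pass, which is how a crash is detected without any explicit failure report.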