
Retry Policies

The retry policy governs what happens when a job handler fails: how many times the job is re-attempted, how long the system waits between attempts, and what happens when all attempts are exhausted.

OJS adopts Temporal’s structured retry policy format because it is explicit, language-agnostic, and avoids the field-overloading anti-patterns found in other systems (like Sidekiq’s retry: true vs retry: 5 vs retry: false).

```json
{
  "max_attempts": 3,
  "initial_interval": "PT1S",
  "backoff_coefficient": 2.0,
  "max_interval": "PT5M",
  "jitter": true,
  "non_retryable_errors": [],
  "on_exhaustion": "discard"
}
```
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_attempts | integer | 3 | Total attempts including the first execution. 1 means no retries; 0 means unlimited attempts. |
| initial_interval | string | "PT1S" | ISO 8601 duration before the first retry |
| backoff_coefficient | number | 2.0 | Multiplier for successive retry delays. Must be >= 1.0 |
| max_interval | string | "PT5M" | ISO 8601 duration cap on the computed delay |
| jitter | boolean | true | Whether to add randomness to prevent a thundering herd |
| non_retryable_errors | string[] | [] | Error types that skip retry entirely |
| on_exhaustion | string | "discard" | Action when attempts are exhausted: "discard" or "dead_letter" |

The max_attempts field counts total attempts, not retries. If max_attempts is 3, the job runs at most 3 times (1 initial execution plus 2 retries). This avoids the off-by-one confusion of Sidekiq’s retry: 25, which means 26 total executions.
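A minimal sketch of the attempt-counting semantics (the function and its signature are illustrative, not part of the spec):

```python
def run_job(handler, payload, max_attempts=3):
    """Run handler up to max_attempts total times: 1 initial execution
    plus at most max_attempts - 1 retries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler(payload)
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted; on_exhaustion policy applies
```

With `max_attempts=3`, the handler is invoked exactly three times before the failure propagates.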

OJS supports four backoff strategies. The default is exponential.

Constant

The delay is the same for every retry.

delay(n) = initial_interval

Use case: polling an external service with a known, fixed recovery time.

Linear

The delay increases by initial_interval for each successive retry.

delay(n) = initial_interval * n

Use case: moderate backpressure with gradually increasing delays.

Exponential

The delay grows exponentially using the backoff coefficient.

delay(n) = initial_interval * backoff_coefficient^(n-1)

With initial_interval = "PT1S" and backoff_coefficient = 2.0:

| Attempt | Retry # | Delay | Capped (max 5m) |
| --- | --- | --- | --- |
| 2 | 1 | 1s | 1s |
| 3 | 2 | 2s | 2s |
| 4 | 3 | 4s | 4s |
| 5 | 4 | 8s | 8s |
| 6 | 5 | 16s | 16s |
| 7 | 6 | 32s | 32s |
| 8 | 7 | 64s | 64s |
| 9 | 8 | 128s | 128s |
| 10 | 9 | 256s | 256s |
| 11 | 10 | 512s | 300s (capped) |

Polynomial

The delay grows polynomially, using the backoff coefficient as the exponent.

delay(n) = initial_interval * n^backoff_coefficient

With backoff_coefficient = 4.0, this produces Sidekiq-equivalent growth. Polynomial backoff grows slower than exponential, making it suitable for high-retry-count scenarios (25+ retries over hours or days).

Regardless of strategy, the max_interval cap is always applied:

effective_delay(n) = min(raw_delay(n), max_interval)
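The four formulas and the cap can be sketched together. This is an illustrative implementation, with intervals expressed in plain seconds rather than the ISO 8601 durations the spec uses:

```python
def raw_delay(n, strategy, initial, coefficient=2.0):
    """Delay before retry n (n = 1 is the first retry), in seconds."""
    if strategy == "constant":
        return initial
    if strategy == "linear":
        return initial * n
    if strategy == "exponential":
        return initial * coefficient ** (n - 1)
    if strategy == "polynomial":
        return initial * n ** coefficient
    raise ValueError(f"unknown strategy: {strategy}")

def effective_delay(n, strategy, initial, coefficient=2.0, max_interval=300.0):
    # The max_interval cap applies regardless of strategy.
    return min(raw_delay(n, strategy, initial, coefficient), max_interval)
```

With the defaults (`initial = 1`, `coefficient = 2.0`, `max_interval = 300`), retry 10 computes 512s and is capped to 300s, matching the table above.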

When jitter is true, the implementation applies a random multiplier to decorrelate retries:

jittered_delay = effective_delay * random(0.5, 1.5)
final_delay = min(jittered_delay, max_interval)

The multiplier is uniformly distributed in the half-open interval [0.5, 1.5). This means a 10-second computed delay becomes something between 5 and 15 seconds. The max_interval cap is applied again after jitter.
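The jitter step can be sketched as follows (illustrative; `random.uniform` draws approximately the uniform multiplier described above):

```python
import random

def apply_jitter(effective_delay, max_interval=300.0):
    # Random multiplier in roughly [0.5, 1.5) to decorrelate retries;
    # the max_interval cap is re-applied after jitter.
    jittered = effective_delay * random.uniform(0.5, 1.5)
    return min(jittered, max_interval)
```

A 10-second computed delay therefore lands somewhere between 5 and 15 seconds, and a jittered delay can never exceed max_interval.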

Jitter is enabled by default because the thundering herd problem is common in production: when many jobs fail simultaneously due to a downstream outage, without jitter they would all retry at the same instant, recreating the same load spike.

The non_retryable_errors array lists error type strings that should never trigger a retry. Error types use dot-namespaced strings:

```json
{
  "non_retryable_errors": [
    "validation.payload_invalid",
    "auth.*"
  ]
}
```

Matching supports exact match and prefix match (with .* suffix):

| Error Type | Matches "auth.*"? |
| --- | --- |
| auth.token_expired | Yes (prefix match) |
| auth.forbidden | Yes (prefix match) |
| auth | No (auth does not start with auth.) |
| external.auth.failure | No (does not start with auth.) |
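A sketch of the matching rules (function name is illustrative):

```python
def is_non_retryable(error_type, patterns):
    """Exact match, or prefix match for patterns ending in '.*'."""
    for pattern in patterns:
        if pattern.endswith(".*"):
            # "auth.*" matches "auth.token_expired" but not bare "auth"
            # or "external.auth.failure".
            if error_type.startswith(pattern[:-2] + "."):
                return True
        elif error_type == pattern:
            return True
    return False
```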

When a handler error matches a non-retryable entry, the job immediately transitions to the terminal state determined by on_exhaustion, regardless of remaining attempts.

Handlers can override the default retry behavior at runtime by returning structured error codes:

| Code | Meaning |
| --- | --- |
| RETRY | Retry with normal backoff (the default when no code is specified) |
| DISCARD | Discard immediately. No retry, no dead letter. |
| DEAD_LETTER | Move to the dead letter queue immediately. Skip all remaining retries. |
| FAIL | Permanent failure. Semantically identical to DISCARD but signals a recognized failure condition. |

Handler response codes take precedence over the retry policy. A payment handler that receives a “card stolen” response knows definitively that retrying is futile and can return DEAD_LETTER to escalate for investigation.
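A sketch of that payment handler. The return convention (an error-code string, or None on success) and the `gateway` callback are illustrative; the spec does not prescribe an SDK interface:

```python
def charge_card(job, gateway):
    """Return an OJS handler response code, or None on success."""
    result = gateway(job["payload"])  # hypothetical payment gateway call
    if result == "ok":
        return None
    if result == "card_stolen":
        # Retrying is futile; escalate to the dead letter queue for review.
        return "DEAD_LETTER"
    if result == "rate_limited":
        return "RETRY"  # transient; normal backoff applies
    return "FAIL"
```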

Jobs enqueued without an explicit retry policy receive these defaults:

```json
{
  "max_attempts": 3,
  "initial_interval": "PT1S",
  "backoff_coefficient": 2.0,
  "max_interval": "PT5M",
  "jitter": true,
  "non_retryable_errors": [],
  "on_exhaustion": "discard"
}
```

When a job specifies a partial retry policy, provided fields override the defaults and omitted fields inherit from them. A developer who only wants max_attempts: 10 does not need to re-specify every other field.
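The merge semantics amount to a shallow field-by-field override, sketched here:

```python
DEFAULT_RETRY_POLICY = {
    "max_attempts": 3,
    "initial_interval": "PT1S",
    "backoff_coefficient": 2.0,
    "max_interval": "PT5M",
    "jitter": True,
    "non_retryable_errors": [],
    "on_exhaustion": "discard",
}

def resolve_policy(partial):
    # Provided fields override the defaults; omitted fields inherit.
    return {**DEFAULT_RETRY_POLICY, **(partial or {})}
```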

No retry at all (analytics event that is acceptable to lose):

```json
{ "retry": { "max_attempts": 1 } }
```

Aggressive retry with dead letter (payment processing):

```json
{
  "retry": {
    "max_attempts": 25,
    "initial_interval": "PT15S",
    "backoff_coefficient": 4.0,
    "max_interval": "PT1H",
    "non_retryable_errors": ["payment.card_stolen", "payment.card_expired", "validation.*"],
    "on_exhaustion": "dead_letter"
  }
}
```

Constant backoff for polling (check export status every 10 seconds):

```json
{
  "retry": {
    "max_attempts": 60,
    "initial_interval": "PT10S",
    "backoff_coefficient": 1.0,
    "max_interval": "PT10S",
    "jitter": false
  }
}
```

The retry delay and visibility timeout are independent mechanisms. The retry delay applies when a worker explicitly reports failure via FAIL. The visibility timeout applies when a worker crashes or disappears without acknowledging the job. When a visibility timeout expires, it counts as a failure attempt and follows the retry policy.

Workers can extend the visibility timeout by sending heartbeats during long-running execution. This prevents jobs from being prematurely requeued while still actively processing.
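A worker-side sketch of that pattern, assuming a `heartbeat(job_id)` callback (the spec does not name a concrete heartbeat API):

```python
import threading

def run_with_heartbeats(handler, job, heartbeat, interval=5.0):
    """Run a handler while a background thread periodically extends the
    job's visibility timeout via the hypothetical heartbeat callback."""
    stop = threading.Event()

    def beat():
        # wait() returns False on timeout, True once stop is set
        while not stop.wait(interval):
            heartbeat(job["id"])

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        return handler(job)
    finally:
        stop.set()  # stop heartbeating as soon as the handler finishes
        t.join()
```

The interval should be comfortably shorter than the visibility timeout, so a missed beat or two does not cause a premature requeue.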