3. Retries

Beginner - 7 minutes. Retry counts, delays, backoff, and the dead-letter queue.

Soniq retries failed jobs automatically. Every job gets 3 retries by default (4 total attempts). You control the retry count, delay strategy, and backoff at the decorator level.

A note on terms. Soniq guarantees at-least-once delivery: a job will run at least once, and may run more than once if a worker crashes mid-handler. Idempotent means "safe to run more than once with the same end result." The two ideas come together in this chapter - retries are the most common reason a job runs twice, so handlers need to be idempotent. The Idempotency section at the bottom shows how.

Configuration

@app.job(
    max_retries=5,        # retry up to 5 times (6 total attempts)
    retry_delay=10,       # wait 10 seconds between retries
    retry_backoff=True,   # exponential backoff
    retry_max_delay=300,  # never wait more than 5 minutes
)
async def call_payment_api(invoice_id: str):
    ...
Parameter         Type             Default  Description
max_retries       int              3        Number of retry attempts after the first failure
retry_delay       int | list[int]  0        Seconds between retries
retry_backoff     bool             False    Multiply the delay exponentially on each attempt
retry_max_delay   int | None       None     Upper bound on the computed delay

Fixed delay

Same wait between every retry:

@app.job(max_retries=5, retry_delay=10)
async def sync_inventory(product_id: str):
    ...

Retries at: 10s, 10s, 10s, 10s, 10s.

Per-attempt delays

Provide a list to set exact delays for each retry:

@app.job(max_retries=3, retry_delay=[1, 5, 30])
async def send_verification_email(user_id: int):
    ...

Retries at: 1s, 5s, 30s. If there are more retries than list entries, the last value is reused.
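The reuse rule can be sketched as a small helper - an illustration of the documented behavior, not Soniq's actual implementation:

```python
def retry_delays(max_retries: int, retry_delay: list[int]) -> list[int]:
    """Delay before each retry; past the end of the list, the last value repeats."""
    return [
        retry_delay[min(attempt, len(retry_delay) - 1)]
        for attempt in range(max_retries)
    ]

print(retry_delays(3, [1, 5, 30]))  # [1, 5, 30]
print(retry_delays(5, [1, 5, 30]))  # [1, 5, 30, 30, 30]
```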

Exponential backoff

Set retry_backoff=True to double the delay on each attempt:

@app.job(max_retries=5, retry_delay=10, retry_backoff=True)
async def call_payment_api(invoice_id: str):
    ...

Retries at: 10s, 20s, 40s, 80s, 160s.

If retry_delay is 0, a base of 1 second is used: 1s, 2s, 4s, 8s, 16s.

Max delay cap

Prevent delays from growing unbounded:

@app.job(max_retries=8, retry_delay=5, retry_backoff=True, retry_max_delay=300)
async def fetch_external_report(report_id: str):
    ...

Retries at: 5s, 10s, 20s, 40s, 80s, 160s, 300s, 300s. The delay never exceeds retry_max_delay.

After max retries: the dead-letter queue

When a job exhausts all retries, it moves to the dead-letter queue: the row is inserted into the soniq_dead_letter_jobs table, a holding area for jobs that failed every retry, and the original soniq_jobs row is deleted in the same transaction.

Dead-letter jobs do not run on their own. You inspect them, fix the underlying problem, and replay the ones you want re-run. See Dead-letter queue for the full inspect / replay workflow.

Examples

API call with backoff

External APIs may rate-limit or have intermittent outages. Exponential backoff gives them time to recover.

@app.job(max_retries=5, retry_delay=2, retry_backoff=True, retry_max_delay=120)
async def sync_user_to_crm(user_id: int):
    user = await get_user(user_id)
    async with httpx.AsyncClient() as client:
        resp = await client.post("https://crm.example.com/api/users", json=user.dict())
        resp.raise_for_status()

Email sending with fixed delay

Email servers sometimes reject connections temporarily. A short fixed delay usually works.

@app.job(max_retries=3, retry_delay=5)
async def send_order_confirmation(order_id: str):
    order = await get_order(order_id)
    await smtp_send(to=order.email, subject=f"Order {order_id} confirmed", body=...)

File processing with escalating delays

File imports may fail due to locks or temporary storage issues. Escalate the delay to avoid hammering the system.

@app.job(max_retries=3, retry_delay=[5, 30, 120])
async def import_csv(file_path: str):
    async with aiofiles.open(file_path) as f:
        rows = await parse_csv(f)
    await bulk_insert(rows)

Idempotency

Soniq provides at-least-once delivery: a job will run at least once, and may run more than once. Retries are the obvious source of repeats, but a worker that crashes mid-handler will also cause the job to be picked up by another worker and re-run. Every job queue has this property - it is not a Soniq bug, it is a consequence of "do not lose work when machines fail".

The fix is to make your handler idempotent - safe to run more than once with the same end result:

  • Use INSERT ... ON CONFLICT DO UPDATE instead of plain inserts. The second run updates a row that already exists, instead of creating a duplicate.
  • Store an idempotency key (often the job arguments hashed, or a unique field on the request) and check it before performing the side effect. If the key already shows the work as done, return.
  • Check current state before acting. Before sending an email, look at whether you have already recorded sending it. Before charging a card, look at whether the charge already exists in your payments table.

The pattern is "make the second run a no-op", not "guarantee the second run never happens".
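A minimal sketch of the first pattern, using the standard-library sqlite3 module so it runs anywhere (the same ON CONFLICT syntax works in Postgres; the charges table and record_charge handler are hypothetical names for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE charges (invoice_id TEXT PRIMARY KEY, amount_cents INTEGER)")

def record_charge(invoice_id: str, amount_cents: int) -> None:
    # A retry with the same invoice_id updates the existing row instead of
    # inserting a duplicate, so the second run is effectively a no-op.
    conn.execute(
        """
        INSERT INTO charges (invoice_id, amount_cents) VALUES (?, ?)
        ON CONFLICT (invoice_id) DO UPDATE SET amount_cents = excluded.amount_cents
        """,
        (invoice_id, amount_cents),
    )

record_charge("inv_42", 1999)
record_charge("inv_42", 1999)  # retry: same end state, no duplicate row
rows = conn.execute("SELECT * FROM charges").fetchall()
print(rows)  # [('inv_42', 1999)]
```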

Job timeouts

Every job has a default execution timeout of 300 seconds (5 minutes). If a job exceeds its timeout, it is treated as a failure and follows normal retry logic.

@app.job(max_retries=3, timeout=60)
async def quick_health_check(service_url: str):
    ...

@app.job(timeout=None)  # no timeout
async def long_running_migration():
    ...

Change the global default with SONIQ_JOB_TIMEOUT. Set to 0 to disable timeouts globally.