Going to production¶
The 80/20 of running Soniq in production. If you do the four things in Required below and the four things in Strongly recommended, you have a healthy deploy. Everything else in this section is detail.
At a glance¶
A copy-into-the-ticket checklist:
- SONIQ_DATABASE_URL and SONIQ_JOBS_MODULES set in the worker environment
- soniq setup runs once per deploy (CI step, init container, or migration job - never from app startup on every replica)
- Process manager sends SIGTERM for shutdown, with terminationGracePeriodSeconds / TimeoutStopSec >= longest job timeout
- Handlers are idempotent (safe to run more than once - upserts, dedup keys, idempotency tokens)
- A metrics sink wired up (Prometheus or your own MetricsSink)
- If you use @app.periodic, a separate soniq scheduler process is running
- SONIQ_POOL_MAX_SIZE >= concurrency + SONIQ_POOL_HEADROOM, and max_connections on the database has headroom for num_workers * pool_size
- Per-job timeouts set for slow jobs; the global default is 300s
The rest of this page explains each item. The common production mistakes section at the bottom is the most useful part of this page when something is misbehaving in production - read it first if you are debugging.
Required¶
Set SONIQ_DATABASE_URL and SONIQ_JOBS_MODULES¶
| Variable | Example | Why |
|---|---|---|
| SONIQ_DATABASE_URL | postgresql://user:pass@host/db | Postgres connection string. |
| SONIQ_JOBS_MODULES | myapp.jobs,myapp.billing | Modules the worker imports on startup so @app.job decorators run. Workers cannot process jobs they cannot import. |
Run soniq setup only once per deploy¶
soniq setup applies versioned migrations and is idempotent, but multiple replicas racing to apply the same migrations cause confusing errors. Run it from a CI step, a Kubernetes init container, a dedicated migration job, or an entrypoint that fires only on the first replica. Never from application startup on every replica.
Stop workers with SIGTERM, not SIGKILL¶
Soniq handles SIGTERM by finishing in-flight jobs before exiting. SIGKILL (or OOM) leaves jobs stuck in processing until the heartbeat sweep recovers them, which takes up to SONIQ_HEARTBEAT_TIMEOUT (default 300s). Match the grace window in your process manager:
- systemd: TimeoutStopSec=<longest job timeout + buffer>
- Kubernetes: terminationGracePeriodSeconds: <same>
- Supervisor: stopwaitsecs=<same>
Design jobs to be idempotent¶
Soniq guarantees at-least-once delivery: a job will run at least once, and may run more than once. If a worker crashes after running your handler but before marking the row done, the heartbeat sweep will requeue the job and another worker will run it. Idempotent means "safe to run more than once with the same end result" - use upserts (INSERT ... ON CONFLICT DO UPDATE), dedup checks against current state, or idempotency keys for any side effect you do not want to repeat. The fix is "make the second run a no-op", not "guarantee the second run never happens" - on Postgres alone the latter is not possible.
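As a concrete sketch, here is an upsert-based handler that is safe to re-run. The handler, table, and connection handling are illustrative, and the parenthesised @app.job() form is assumed from the timeout example further down this page:

import asyncpg

from myapp.app import app  # illustrative: wherever your Soniq() instance lives

@app.job()
async def record_payment(payment_id: str, amount_cents: int) -> None:
    # ON CONFLICT makes a second delivery a no-op: the row keyed by
    # payment_id ends up in the same state however many times this runs.
    conn = await asyncpg.connect()  # reads PG* env vars; illustrative
    try:
        await conn.execute(
            """
            INSERT INTO payments (payment_id, amount_cents)
            VALUES ($1, $2)
            ON CONFLICT (payment_id) DO UPDATE
                SET amount_cents = EXCLUDED.amount_cents
            """,
            payment_id,
            amount_cents,
        )
    finally:
        await conn.close()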
Strongly recommended¶
Logging and observability¶
Wire a metrics sink:
from soniq import Soniq
from soniq.metrics import PrometheusMetricsSink
app = Soniq(metrics_sink=PrometheusMetricsSink())
Mount /metrics from prometheus_client.make_asgi_app() on whatever HTTP surface you scrape, or run soniq dashboard and point Grafana at its /api/metrics.
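A minimal sketch of the first option, assuming the sink publishes to prometheus_client's default registry; the Starlette host app is illustrative, while make_asgi_app() is real prometheus_client API:

from prometheus_client import make_asgi_app
from starlette.applications import Starlette
from starlette.routing import Mount

# Serve the default prometheus_client registry at GET /metrics.
asgi_app = Starlette(routes=[Mount("/metrics", app=make_asgi_app())])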
Run soniq scheduler if you use @app.periodic¶
The CLI worker (soniq worker) does not enqueue recurring jobs when they come due. Run a separate soniq scheduler process. Multiple scheduler instances coordinate via a Postgres advisory lock; runners-up wait for the leader.
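For reference, a recurring job might look like the sketch below - the cron= parameter is a hypothetical illustration, not a signature confirmed by this page:

@app.periodic(cron="*/15 * * * *")  # hypothetical parameter - check the actual signature
async def reconcile_invoices() -> None:
    ...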
Pool sizing¶
Each worker maintains its own asyncpg pool. Soniq reserves SONIQ_POOL_HEADROOM connections (default 2) for the LISTEN/NOTIFY listener and heartbeat writer:
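SONIQ_POOL_MAX_SIZE >= concurrency + SONIQ_POOL_HEADROOM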
Total Postgres load is num_worker_processes * SONIQ_POOL_MAX_SIZE. Confirm max_connections on the database can handle that plus your application's pools and admin sessions.
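A worked example with illustrative numbers:

concurrency         = 10   # jobs in flight per worker process
SONIQ_POOL_HEADROOM = 2    # default: listener + heartbeat writer
SONIQ_POOL_MAX_SIZE = 12   # >= 10 + 2
4 workers * 12      = 48   # connections to budget inside max_connections, alongside app pools and admin sessions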
Per-job timeouts¶
The global default is SONIQ_JOB_TIMEOUT=300 seconds. Override it for slow jobs with @app.job(timeout=600); set timeout=0 to disable the limit for that job. Without a timeout, a hung handler ties up a concurrency slot until the worker restarts.
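For example (handler names are illustrative):

@app.job(timeout=600)  # 10 minutes for a known-slow job
async def rebuild_search_index() -> None:
    ...

@app.job(timeout=0)  # no per-job limit: a hung run holds its slot until restart
async def nightly_export() -> None:
    ...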
Common production mistakes¶
- Running soniq setup from application startup on every replica. Replicas race, migrations error out. Run it in one place per deploy.
- concurrency higher than the pool can serve. Raises "Connection pool too small". Either raise SONIQ_POOL_MAX_SIZE or lower concurrency.
- Sync handlers monopolising the bounded thread pool. def (non-async) handlers run on a pool of SONIQ_SYNC_HANDLER_POOL_SIZE (default 8). If those are slow, async handlers cannot claim slots. Convert to async def or raise the pool size.
- Worker host does not have your job code installed. SONIQ_JOBS_MODULES=myapp.jobs is an instruction to import myapp.jobs, not a way to ship code. Bake the code into the worker image.
- PgBouncer in transaction-pooling mode. Breaks LISTEN/NOTIFY (workers fall back to polling) and breaks the scheduler advisory lock. Use session pooling for worker and scheduler connections.
- SONIQ_DATABASE_URL divergence between setup and worker. setup writes to URL A; the worker reads URL B; the worker logs "table does not exist."
- Forgetting await after switching from Celery. Celery's .delay() is synchronous; Soniq's enqueue(...) is a coroutine. A bare app.enqueue(send_email, ...) schedules nothing. Treat the RuntimeWarning: coroutine was never awaited warning as an error in CI (see the sketch after this list).
- Resurrecting a full DLQ without fixing the cause. soniq dead-letter replay --all re-runs every dead-letter job. Use --dry-run first, fix the underlying bug, then replay.
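The before/after for the Celery-habit item above - keyword arguments are illustrative, and the awaited call must sit inside an async function:

# Wrong: creates a coroutine object and drops it; nothing is enqueued.
app.enqueue(send_email, to="user@example.com")

# Right: enqueue(...) is a coroutine - await it.
await app.enqueue(send_email, to="user@example.com")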
Next steps¶
- Deployment recipes - systemd, Docker Compose, Kubernetes, Supervisor
- PostgreSQL tuning - pool sizing, connection ceilings, fsync notes
- Reliability - what at-least-once means, recovering from crashes
- Troubleshooting - symptom -> cause -> fix