operations

Background Job Dead Letter Queue (DLQ) Table

Unbounded retries turn into noise. For failure classes that a retry cannot fix, record the payload and error in a DLQ table, then alert and triage from there. This keeps queues healthy and makes failures actionable.
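
A minimal sketch of the idea in plain Ruby. The PermanentFailure class, the DeadLetterStore (an in-memory stand-in for the DLQ table), and perform_with_dlq are hypothetical names for illustration, not part of any job library.

```ruby
require "json"
require "time"

# Hypothetical error class marking failures that retrying will not fix.
class PermanentFailure < StandardError; end

# In-memory stand-in for a DLQ table (an assumption for this sketch).
class DeadLetterStore
  attr_reader :rows

  def initialize
    @rows = []
  end

  def record(job_name:, payload:, error:)
    @rows << {
      job_name: job_name,
      payload: JSON.generate(payload),
      error_class: error.class.name,
      error_message: error.message,
      failed_at: Time.now.utc.iso8601
    }
  end
end

# Runs the block. Permanent failures are recorded instead of retried;
# any other error is re-raised so the queue's normal retry logic applies.
def perform_with_dlq(job_name, payload, dlq:)
  yield payload
  :ok
rescue PermanentFailure => e
  dlq.record(job_name: job_name, payload: payload, error: e)
  :dead_lettered
end
```

Storing the payload as JSON next to the error class keeps each row replayable once the underlying problem is fixed.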

Health Check Endpoint with Dependency Probes

A real health check tests the dependencies you care about (DB, Redis). Keep it fast: one cheap probe per dependency (for example SELECT 1 or PING), never an expensive query. Use it for load balancers and alerts.
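
One way to structure the probes, sketched in plain Ruby. health_report and the probe lambdas are hypothetical; in a Rails app the callables might wrap something like ActiveRecord::Base.connection.execute("SELECT 1") and a Redis ping, and the returned status would become the HTTP response code.

```ruby
require "timeout"

# Runs each probe under a short timeout and reports per-dependency status.
# `probes` maps a dependency name to a callable that raises on failure.
def health_report(probes, timeout_seconds: 0.5)
  checks = probes.transform_values do |probe|
    begin
      Timeout.timeout(timeout_seconds) { probe.call }
      "ok"
    rescue StandardError => e
      # Timeout::Error is a StandardError, so slow probes land here too.
      "error: #{e.class}"
    end
  end
  healthy = checks.values.all? { |status| status == "ok" }
  { status: healthy ? 200 : 503, checks: checks }
end
```

Returning 503 when any probe fails lets a load balancer pull the node, while the per-dependency detail tells the on-call person which dependency is the problem.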

Safer Background Reindex: Slice Batches + Checkpoints

Full reindexes can be long and fragile. Add checkpoints (last processed id), process in batches, and make it resumable. That turns a scary operation into a routine one.
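
A rough sketch of the checkpoint loop in plain Ruby. Checkpoint and reindex_in_batches are hypothetical names, and the checkpoint lives in memory here; in production it would presumably be a DB row or a Redis key. The checkpoint only advances after a whole batch succeeds, so a crash replays part of a batch and the per-record work needs to be idempotent.

```ruby
# In-memory checkpoint (assumption: a real one would be persisted).
class Checkpoint
  attr_reader :last_id

  def initialize(last_id = 0)
    @last_id = last_id
  end

  def save(id)
    @last_id = id
  end
end

# fetch_batch is a callable: (after_id, limit) -> records sorted by id.
# The block receives each record and does the actual reindex work.
def reindex_in_batches(fetch_batch:, checkpoint:, batch_size: 1000)
  loop do
    batch = fetch_batch.call(checkpoint.last_id, batch_size)
    break if batch.empty?

    batch.each { |record| yield record }
    # Advance only after the whole batch succeeded.
    checkpoint.save(batch.last.fetch(:id))
  end
end
```

Rerunning with the same checkpoint after a failure picks up from the last completed batch instead of starting over.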

Database-Backed “Run Once” Migrations for Maintenance Tasks

Sometimes you need a one-time maintenance operation outside normal schema changes. Use a small table to track “run once” tasks so reruns are safe and the operation is visible.
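
A small sketch of the tracking logic. RunOnceRegistry is a hypothetical in-memory stand-in for the tracking table; with a real table, a unique index on the task name would be the natural guard against two runners racing.

```ruby
# In-memory stand-in for a table of completed maintenance tasks.
class RunOnceRegistry
  def initialize
    @completed_at = {}
  end

  def completed?(name)
    @completed_at.key?(name)
  end

  # Runs the block only if the task has not completed before.
  # A task that raises is not marked complete, so it can be retried.
  def run_once(name)
    return :skipped if completed?(name)

    yield
    @completed_at[name] = Time.now.utc
    :ran
  end
end
```

The registry doubles as a record of what ran and when, which covers the visibility half of the goal.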

Avoid Memory Blowups: find_each + select Columns

Backfills often fail because we accidentally load full records and associations. Use select to fetch only needed columns and find_each to keep memory flat. This is basic, but it’s where outages come from.
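
A sketch of the pattern as a plain method. backfill_normalized_email and the email / normalized_email columns are hypothetical; scope is expected to be an ActiveRecord relation such as User.where(normalized_email: nil).

```ruby
# Backfills a hypothetical normalized_email column.
def backfill_normalized_email(scope, batch_size: 1000)
  updated = 0
  # select limits the columns loaded per row; :id must be included
  # because find_each paginates by primary key.
  scope.select(:id, :email).find_each(batch_size: batch_size) do |user|
    # update_columns writes directly, skipping validations and callbacks.
    user.update_columns(normalized_email: user.email.to_s.strip.downcase)
    updated += 1
  end
  updated
end
```

Because find_each loads one batch at a time, memory use depends on batch_size rather than on the size of the table.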

Counter Cache Repair Job (Consistency Tooling)

Counter caches drift (deleted records, backfills, manual SQL). A repair job that recomputes counts safely is invaluable. It’s the kind of operational code you’re glad you wrote the first time a dashboard is wrong.
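
A minimal sketch of such a repair pass. repair_comment_counts and the Post / comments / comments_count names are hypothetical; posts is expected to be an ActiveRecord relation. Rails also ships reset_counters for single records, which may be enough for small tables.

```ruby
# Recomputes a hypothetical comments_count counter cache.
def repair_comment_counts(posts, batch_size: 500)
  repaired = 0
  posts.find_each(batch_size: batch_size) do |post|
    # count runs a SQL COUNT; size would read the cached column instead.
    actual = post.comments.count
    next if post.comments_count == actual

    # Direct column write: no callbacks, no validations.
    post.update_columns(comments_count: actual)
    repaired += 1
  end
  repaired
end
```

Skipping rows that are already correct keeps the job cheap to rerun, and the return value is a handy drift metric to log.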

Cache Key Versioning with a Single “namespace”

When cache structures change, you want to invalidate safely without flushing the world. Use a per-feature namespace version key and incorporate it into cache keys; bumping the version makes that feature's old entries unreachable while everything else stays warm.
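
A sketch of the key scheme over a plain Hash. NamespacedCache and its key layout are hypothetical; a real store would be Redis or Memcached, where the orphaned old-version entries age out through TTL or eviction.

```ruby
# Cache wrapper that embeds a per-feature version in every key.
class NamespacedCache
  def initialize(store = {})
    @store = store
  end

  # Current namespace version for a feature (defaults to 1).
  def version(feature)
    @store.fetch("ns:#{feature}:version") { 1 }
  end

  def key(feature, *parts)
    "#{feature}:v#{version(feature)}:#{parts.join(':')}"
  end

  # Invalidates every key of one feature without touching the others.
  def bump!(feature)
    @store["ns:#{feature}:version"] = version(feature) + 1
  end

  def fetch(feature, *parts)
    k = key(feature, *parts)
    return @store[k] if @store.key?(k)

    @store[k] = yield
  end
end
```

For example, after bump!("profile") the key for ("profile", 42) changes from profile:v1:42 to profile:v2:42, so the next fetch recomputes, while keys for other features are unaffected.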

Background Job Backpressure with Queue Depth Guard

When downstream systems degrade, jobs pile up and amplify outages. Add a simple “queue depth guard” so non-critical jobs skip or reschedule instead of making the backlog worse.
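
A minimal sketch of the guard. QueueDepthGuard and the depth_probe callable are hypothetical; with Sidekiq the probe might be something like -> { Sidekiq::Queue.new("default").size }, and a deferred job would typically be rescheduled with a delay.

```ruby
# Skips non-critical work when the backlog is already too deep.
class QueueDepthGuard
  def initialize(max_depth:, depth_probe:)
    @max_depth = max_depth
    @depth_probe = depth_probe
  end

  # Critical jobs always run. Non-critical jobs are deferred once the
  # probed depth reaches the limit; the caller decides how to reschedule.
  def call(critical: false)
    return :deferred if !critical && @depth_probe.call >= @max_depth

    yield
    :ran
  end
end
```

Returning :deferred rather than raising keeps a backed-up queue from also filling the error tracker.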

Guard Rails for Dangerous Admin Actions

Admin actions are production sharp edges. Require a typed confirmation string, log actor + request_id, and run the dangerous work in a background job. This reduces accidental incidents.
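
A sketch of the three guard rails together. request_dangerous_action, ConfirmationMismatch, and the enqueue callable are hypothetical; in a Rails app the callable would likely wrap a job's perform_later, and the confirmation phrase format is only one possible choice.

```ruby
require "logger"

class ConfirmationMismatch < StandardError; end

# Validates the typed confirmation, writes an audit line, and hands the
# dangerous work to a background job through the `enqueue` callable.
def request_dangerous_action(action:, target:, typed_confirmation:,
                             actor_id:, request_id:, logger:, enqueue:)
  expected = "#{action} #{target}"
  unless typed_confirmation == expected
    raise ConfirmationMismatch, "type \"#{expected}\" to confirm"
  end

  logger.info(
    "admin_action action=#{action} target=#{target} " \
    "actor_id=#{actor_id} request_id=#{request_id}"
  )
  enqueue.call(action, target, actor_id)
  :enqueued
end
```

Including the target in the confirmation phrase means the admin has to type the exact thing they are about to affect, which catches wrong-tab mistakes.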

Safer Time-Based Deletes with “mark then sweep”

Direct deletes can be risky and slow. Mark records for deletion, then sweep in batches in a maintenance job. This gives you observability and a rollback window.
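
A sketch of the two phases over in-memory rows. MarkThenSweep and the marked_at field are hypothetical; against a real table, mark would be an UPDATE setting a timestamp column and sweep a batched DELETE.

```ruby
# Rows are hashes like { id:, created_at:, marked_at: nil }.
class MarkThenSweep
  attr_reader :rows

  def initialize(rows)
    @rows = rows
  end

  # Phase 1: flag old rows. Nothing is deleted yet.
  def mark(older_than:, now: Time.now)
    targets = @rows.select { |r| r[:marked_at].nil? && r[:created_at] < older_than }
    targets.each { |r| r[:marked_at] = now }
    targets.size
  end

  # Rollback: clear the flag before the sweep gets to the row.
  def unmark(ids)
    @rows.each { |r| r[:marked_at] = nil if ids.include?(r[:id]) }
  end

  # Phase 2: delete rows marked at least grace_seconds ago, in batches.
  def sweep(grace_seconds:, batch_size: 1000, now: Time.now)
    deleted = 0
    loop do
      due = @rows.select { |r| r[:marked_at] && r[:marked_at] <= now - grace_seconds }
      ids = due.first(batch_size).map { |r| r[:id] }
      break if ids.empty?

      @rows.reject! { |r| ids.include?(r[:id]) }
      deleted += ids.size
    end
    deleted
  end
end
```

The grace period is the rollback window: anything unmarked before it elapses survives the sweep, and the count of marked rows is easy to chart.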

Keep DB Connections Healthy in Long Jobs

Long-running jobs can hit stale connections. Wrap work in with_connection and consider verify! before heavy DB usage. This reduces “PG::ConnectionBad” noise during long maintenance tasks.
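
A sketch of the wrapping pattern. each_batch_with_fresh_connection is a hypothetical helper; pool is expected to behave like ActiveRecord::Base.connection_pool, whose with_connection yields a connection that responds to verify!.

```ruby
# Checks out a connection per batch and verifies it before use, so a
# connection that went stale between batches is re-established first.
def each_batch_with_fresh_connection(pool, batches)
  batches.each do |batch|
    pool.with_connection do |conn|
      conn.verify! # reconnects if the connection is no longer active
      yield conn, batch
    end
  end
end
```

Checking out per batch instead of once for the whole job also returns the connection to the pool during any slow non-DB work between batches.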

Redis-Based Distributed Mutex (with TTL)

Sometimes you need “only one runner globally” (backfills, refresh jobs). A Redis mutex with TTL avoids deadlocks if the process dies. It’s not perfect, but it’s a solid pragmatic tool.
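
A sketch of the lock, assuming a redis-rb style client passed in as redis. with_redis_mutex and the key name are hypothetical. The random token makes release safe: a runner only deletes the lock if it still owns it. The "not perfect" part is visible here too: if the work outlives the TTL, a second runner can acquire the lock while the first is still going.

```ruby
require "securerandom"

# Compare-and-delete so one runner never releases another runner's lock.
RELEASE_SCRIPT = <<~LUA
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
  else
    return 0
  end
LUA

# Returns the block's value, or :locked if another runner holds the lock.
def with_redis_mutex(redis, key, ttl_seconds: 300)
  token = SecureRandom.hex(16)
  # SET with NX and EX: acquire and set the expiry in one atomic command.
  acquired = redis.set(key, token, nx: true, ex: ttl_seconds)
  return :locked unless acquired

  begin
    yield
  ensure
    redis.eval(RELEASE_SCRIPT, keys: [key], argv: [token])
  end
end
```

Pick a TTL comfortably longer than the expected run time, and have the job tolerate the rare overlap rather than rely on the lock for correctness.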