reliability

Exponential backoff with jitter for retries

Immediate retries are a great way to stampede a struggling dependency. I use exponential backoff with jitter so retries spread out naturally and don’t synchronize across instances, and I cap the maximum delay so worst-case latency doesn’t become unbou

Retry Postgres serialization failures with bounded attempts

For workloads that run at SERIALIZABLE isolation (or that hit serialization conflicts under load), retries are part of the contract. The important part is to retry only the safe errors (typically SQLSTATE 40001) and to keep the loop bounded so you don

React Error Boundary + error reporting hook

Frontend errors are inevitable, so I’d rather control the blast radius. I wrap unstable sections with an ErrorBoundary that shows a friendly fallback and captures context for reporting. I also avoid spamming the error tracker by including a fingerprin

Background Job Dead Letter Queue (DLQ) Table

Retries can become infinite noise. For certain failure classes, record the payload + error in a DLQ table and alert/triage. This keeps queues healthy and makes failures actionable.

Cron scheduling with node-cron (with guard)

Periodic jobs are common (cleanup, digests, rollups), and it’s easy to accidentally run them on every instance in production. In local dev, in-process node-cron is fine. In production, I either run a dedicated worker or add a guard so only one instanc

Health Check Endpoint with Dependency Probes

A real health check tests the dependencies you care about (DB, Redis). Keep it fast and don’t make it do expensive queries. Use it for load balancers and alerts.

Safer Feature Flagging: Cache + DB Fallback

A robust feature flag read path should be fast, but also resilient to cache outages. Cache the computed result briefly and fall back to DB if needed; keep the interface dead simple.

Transactional outbox in Node (DB write + event)

The moment you split ‘write to DB’ and ‘publish to a queue’ into two independent operations, you create a place to lose data. Publish first and a DB failure means consumers act on something that never happened. Write first and a publish failure means

Idempotent Job with Advisory Lock

I reached for idempotency the moment retries started duplicating side effects. In Advisory lock helper, I generate a deterministic lock ID and use pg_try_advisory_lock to ensure only one worker owns the critical section; the ensure block always calls

Safer Background Reindex: slice batches + checkpoints

Full reindexes can be long and fragile. Add checkpoints (last processed id), process in batches, and make it resumable. That turns a scary operation into a routine one.

“Read Your Writes” Consistency: Pin to Primary After POST

With replicas, a user can create a record then immediately not see it due to replication lag. A simple mitigation is to pin subsequent reads to primary for a short window after writes.

Config parsing with env defaults and strict validation

Config bugs are some of the most expensive production incidents because they vary by environment and can be hard to reproduce. I keep configuration in a typed struct, load it from environment variables, and validate it before the server starts. The va