reliability

Robfig cron jobs with context cancellation and jitter

Scheduled jobs get risky when deploys overlap or when jobs take longer than their interval. I use robfig/cron with a wrapper that creates a per-run context.WithTimeout, adds a bit of jitter to avoid synchronized load across instances, and logs start/f

Panic recovery middleware that fails closed and logs context

Even well-tested services panic occasionally: a nil pointer from an unexpected edge case, a slice bounds bug, or a library doing something surprising. If you don’t recover, one request can crash the entire process and turn a small bug into an outage.

Lazy init of expensive clients with sync.Once

Some dependencies are expensive to initialize (TLS-heavy clients, large config loads, SDKs that fetch metadata). I use sync.Once to ensure initialization happens exactly once, even under concurrent access. The important detail is capturing the initial

Safer Time-Based Deletes with “mark then sweep”

Direct deletes can be risky and slow. Mark records for deletion, then sweep in batches in a maintenance job. This gives you observability and a rollback window.

Idempotency keys for “create” endpoints

Retries are inevitable: mobile clients, flaky networks, and load balancers will resend POST requests. Without idempotency you end up double-charging or double-creating records. I store an Idempotency-Key with a sha256 hash of the request body and the

Defensive Deserialization for ActiveJob

Deploys happen while jobs are in the queue. Be defensive: accept both old and new payload shapes, and keep migrations forward-compatible. This prevents “deploy broke jobs” incidents.

Feature flag snapshot with periodic refresh and atomic reads

I like feature flags that are boring at runtime: reads should be lock-free and refresh should happen in the background. The pattern here stores a JSON flag snapshot in an atomic.Value, which makes reads cheap and race-free. A ticker refreshes the snap

Inbound rate limiting middleware with token bucket

If you expose an API to the public internet, you need a basic guardrail against bursts. I like a token bucket with rate.NewLimiter because it’s easy to reason about: steady-state rate plus a burst capacity. The middleware checks Allow() and returns 42

Postgres advisory lock for one-at-a-time work

Sometimes you just need ‘only one worker does this thing at a time’, and building distributed locks from scratch is risky. Postgres advisory locks are a pragmatic option when your DB is already the source of truth. I derive a deterministic lock key (l

Readiness and liveness probes with dependency checks

I separate liveness from readiness because they answer different questions. Liveness is “is the process alive enough to respond?” and should be cheap; readiness is “can this instance take traffic?” and can include dependency checks like DB connectivit

SQL migration safety: add column nullable, backfill, then constrain

The fastest way to surprise yourself in production is ADD COLUMN ... NOT NULL DEFAULT ... on a large table. My safe pattern is: add the column nullable with no default, backfill in batches, then add the NOT NULL constraint. If I need a default for new

Graceful shutdown for Node HTTP servers

Deploys got a lot calmer once I treated SIGTERM as a first-class signal instead of an afterthought. In Kubernetes (and most PaaS platforms), you’re expected to stop accepting new requests quickly while finishing in-flight work. The worst failure mode