Production AWS platform on Fargate, queues, and Pulumi

Context

By the time I took over the platform side, the product had several Node services running on a mix of legacy infrastructure. Deploys were partly manual, observability was reactive, and cron jobs ran on a single EC2 box where they sometimes died silently.

Problem

Standing up a new service was a multi-day ceremony. Every team copied the previous service’s infra and made small mistakes doing it. Failed background jobs were noticed only when a customer complained. Rollbacks were slow enough that the safe move was usually to roll forward.

Approach

I standardized on ECS Fargate with Pulumi templates: a new service is a few hundred lines of TypeScript that anyone on the team can review. Background work moved onto SQS-driven Fargate workers with retries, dead-letter queues, and CloudWatch alarms tied to real SLOs. Container-based local dev mirrored prod, so the gap between “works on my machine” and “works in production” closed.
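
To make the worker pattern concrete, here is a minimal sketch of how a template might derive a queue, its dead-letter queue, and SLO-style alarms from a few inputs. It is not the original template. Every name in it (WorkerQueueOptions, planWorkerQueue, the "-jobs" naming scheme, the example thresholds) is hypothetical. Only the SQS redrive policy fields (deadLetterTargetArn, maxReceiveCount) and the AWS/SQS metric names are real AWS identifiers. It is written as plain TypeScript with no Pulumi import so it runs on its own; in a Pulumi program the result would presumably feed resources such as aws.sqs.Queue and aws.cloudwatch.MetricAlarm.

```typescript
// Sketch only: all type and function names here are hypothetical,
// not taken from the original platform.

interface WorkerQueueOptions {
  serviceName: string;
  maxReceiveCount: number;            // deliveries allowed before SQS moves a message to the DLQ
  visibilityTimeoutSeconds: number;   // should exceed the worker's longest expected job
  maxOldestMessageAgeSeconds: number; // assumed SLO: alarm when the backlog gets this stale
}

interface AlarmSpec {
  name: string;
  namespace: "AWS/SQS";
  metricName: "ApproximateNumberOfMessagesVisible" | "ApproximateAgeOfOldestMessage";
  queueName: string;
  comparisonOperator: "GreaterThanThreshold";
  threshold: number;
}

interface WorkerQueuePlan {
  queueName: string;
  deadLetterQueueName: string;
  visibilityTimeoutSeconds: number;
  redrivePolicy: (dlqArn: string) => string;
  alarms: AlarmSpec[];
}

function planWorkerQueue(opts: WorkerQueueOptions): WorkerQueuePlan {
  // SQS accepts maxReceiveCount from 1 to 1000 and a visibility timeout of 0 to 43200 seconds.
  if (!Number.isInteger(opts.maxReceiveCount) || opts.maxReceiveCount < 1 || opts.maxReceiveCount > 1000) {
    throw new Error("maxReceiveCount must be an integer between 1 and 1000");
  }
  if (opts.visibilityTimeoutSeconds < 0 || opts.visibilityTimeoutSeconds > 43200) {
    throw new Error("visibilityTimeoutSeconds must be between 0 and 43200");
  }

  const queueName = `${opts.serviceName}-jobs`;
  const deadLetterQueueName = `${opts.serviceName}-jobs-dlq`;

  return {
    queueName,
    deadLetterQueueName,
    visibilityTimeoutSeconds: opts.visibilityTimeoutSeconds,
    // The DLQ ARN is only known after the DLQ exists, so the policy is built lazily.
    redrivePolicy: (dlqArn: string) =>
      JSON.stringify({ deadLetterTargetArn: dlqArn, maxReceiveCount: opts.maxReceiveCount }),
    alarms: [
      {
        // Any message in the DLQ means a job exhausted its retries.
        name: `${deadLetterQueueName}-not-empty`,
        namespace: "AWS/SQS",
        metricName: "ApproximateNumberOfMessagesVisible",
        queueName: deadLetterQueueName,
        comparisonOperator: "GreaterThanThreshold",
        threshold: 0,
      },
      {
        // A stale oldest message means workers are not keeping up.
        name: `${queueName}-backlog-too-old`,
        namespace: "AWS/SQS",
        metricName: "ApproximateAgeOfOldestMessage",
        queueName,
        comparisonOperator: "GreaterThanThreshold",
        threshold: opts.maxOldestMessageAgeSeconds,
      },
    ],
  };
}

// Example usage with made-up values.
const examplePlan = planWorkerQueue({
  serviceName: "email-worker",
  maxReceiveCount: 5,
  visibilityTimeoutSeconds: 120,
  maxOldestMessageAgeSeconds: 600,
});
console.log(examplePlan.alarms.map((a) => a.name));
```

Keeping the plan a pure function is one way to make the retry limit and alarm thresholds visible in a PR diff and testable without touching AWS; the actual templates may have been structured differently.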

Outcome

New services ship in hours instead of days. Worker reliability stopped being a guessing game. Pulumi templates made infra reviewable in PRs the same way application code is. The same blueprint was reused for later services without re-litigating the basics each time.
