10x Throughput for Digital Banking
We migrated Meridian's 8-year-old Rails monolith to a Go microservices architecture with full Kubernetes orchestration. Result: deployment time dropped from 40 minutes to 6, and outages essentially disappeared.
99.97%
Uptime
Up from 98.2% before migration (3 outages/month → 0 in last 6 months)
47ms
p95 Latency
Down from 890ms under the Rails monolith during peak hours
42.3%
Infrastructure Cost Reduction
Right-sized Kubernetes pods replaced over-provisioned EC2 instances
6 min
Deploy Time
Down from 40-minute rolling restarts, now shipped as zero-downtime canary deploys
The Challenge
What We Were Up Against
Meridian Financial had been running a Ruby on Rails monolith for eight years. What started as a fast MVP had grown into a 340K-line codebase handling 2.3 million daily transactions. The single PostgreSQL instance was hitting connection pool limits during market hours, Sidekiq background workers were single-threaded and backing up queues by 40+ minutes during peak load, and the deployment process required a 40-minute rolling restart that the team could only risk running once a week — usually on Sunday nights. The breaking point came when three production outages in a single month triggered an SLA review from their largest institutional client.
Database Bottleneck
Single PostgreSQL instance handling both OLTP and reporting queries. Connection pool maxing out at 2.3M daily transactions, causing request queuing during US market hours (9:30 AM – 4:00 PM ET).
Deployment Risk
40-minute rolling restarts with no canary deployment capability. Every deploy was all-or-nothing, limiting releases to weekly Sunday windows and making hotfixes terrifying.
Queue Backlog
Sidekiq workers processing transaction reconciliation were single-threaded. During peak hours, the queue would back up by 40+ minutes, delaying settlement confirmations to partner banks.
Observability Gaps
No distributed tracing, no structured logging. When outages hit, the team was SSH-ing into production boxes and tailing log files to diagnose issues.
Constraints & Requirements
Zero downtime during migration — $12M+ daily transaction volume couldn't be interrupted
SOC 2 Type II compliance must be maintained throughout
Existing API contracts with 14 partner bank integrations couldn't change
Team of 6 Rails developers needed to be productive during the transition
Our Approach
How We Built It
We chose a strangler fig pattern over a big-bang rewrite. Each service was extracted incrementally behind a facade, letting the legacy system continue operating while new services came online. Go was selected over Rust — the team needed a language with a gentler learning curve and faster hiring pipeline, and Go's goroutine model was a natural fit for the concurrent transaction processing workload.
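At its core, the strangler fig facade is just a routing decision at the gateway: extracted paths go to the new services, everything else falls through to the monolith. A minimal sketch — the prefixes and upstream names below are illustrative, not Meridian's actual routes:

```go
package main

import (
	"fmt"
	"strings"
)

// migratedPrefixes lists the route prefixes already extracted from the
// monolith; the list grows one service at a time. Prefixes are illustrative.
var migratedPrefixes = []string{"/auth", "/transactions"}

// routeTarget is the strangler-fig facade decision: paths matching an
// extracted prefix go to the new Go services, everything else falls
// through to the legacy Rails monolith.
func routeTarget(path string) string {
	for _, p := range migratedPrefixes {
		if strings.HasPrefix(path, p) {
			return "go-services"
		}
	}
	return "rails-monolith"
}

func main() {
	fmt.Println(routeTarget("/auth/login"))    // go-services
	fmt.Println(routeTarget("/reports/daily")) // rails-monolith
}
```

Because the decision lives in one place, cutting a new service over is a one-line change to the prefix list rather than a client-facing migration.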
Audit & Foundation
Weeks 1–6: Mapped all domain boundaries in the monolith. Set up Kubernetes cluster, CI/CD pipelines, observability stack (Datadog APM + structured logging), and the API gateway that would route traffic between legacy and new services.
Auth & Identity Service
Weeks 7–12: Extracted the authentication service first — it had the clearest boundaries and lowest risk. Implemented JWT-based auth with refresh token rotation, replacing the legacy cookie-based session system.
Transaction Processing Engine
Weeks 13–22: The core extraction. Built the transaction processing service in Go with Apache Kafka for event streaming. Chose Kafka over RabbitMQ because we needed message replay capability for transaction audit trails — a hard SOC 2 requirement.
Data Migration & Read Replicas
Weeks 23–28: Migrated from single PostgreSQL to a primary + 2 read replica setup. Separated reporting queries onto replicas, freeing the primary for transactional writes. Used pglogical for zero-downtime replication setup.
Legacy Decommission & Optimization
Weeks 29–34: Decommissioned remaining Rails routes, optimized Kubernetes resource allocation based on 6 weeks of production metrics, and implemented auto-scaling policies.
Key Features
What We Built
Event-Driven Transaction Pipeline
Replaced synchronous transaction processing with a Kafka-based event pipeline that handles 2.3M+ daily transactions with sub-50ms latency.
Technical Detail
Each transaction flows through a 3-stage pipeline: validation → processing → settlement. Kafka partitioning by account ID ensures ordering guarantees per account while allowing parallel processing across accounts. Dead letter queues with automatic retry handle transient failures.
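The per-account ordering guarantee comes down to a deterministic key-to-partition mapping. A simplified sketch, using FNV-1a as a stand-in for the murmur2 hash that Kafka's default partitioner actually applies:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps an account ID to a Kafka partition. Every event for a
// given account lands on the same partition, preserving per-account
// ordering, while events for different accounts process in parallel across
// partitions. FNV-1a stands in here for Kafka's default murmur2 hash.
func partitionFor(accountID string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	p := partitionFor("acct-123", 12)
	fmt.Println(p == partitionFor("acct-123", 12)) // true: deterministic
	fmt.Println(p >= 0 && p < 12)                  // true: valid partition
}
```

Because the mapping is a pure function of the account ID, a consumer rebalance never interleaves two events for the same account.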
Zero-Downtime Deployment Pipeline
Canary deployments with automated rollback reduced deploy time from 40 minutes to 6 minutes and eliminated deployment-related outages entirely.
Technical Detail
ArgoCD manages GitOps-based deployments to EKS. Each deploy starts with 5% canary traffic, auto-promotes after 3 minutes if error rate stays below 0.1%, and auto-rolls back if p99 latency exceeds 200ms. The full rollout completes in 6 minutes.
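The promotion gates can be made explicit as a small decision function. In production these checks run inside ArgoCD's analysis, not in application code, and the "hold" branch (elevated errors but acceptable latency) is our assumption about behavior between the two documented thresholds:

```go
package main

import "fmt"

// canaryDecision encodes the gates described above: roll back when the
// canary's p99 latency exceeds 200ms, promote when its error rate stays
// below 0.1%. The "hold" branch is an assumption about the in-between case;
// the real policy lives in ArgoCD analysis templates.
func canaryDecision(errorRate, p99Millis float64) string {
	switch {
	case p99Millis > 200:
		return "rollback"
	case errorRate >= 0.001: // 0.1%
		return "hold"
	default:
		return "promote"
	}
}

func main() {
	fmt.Println(canaryDecision(0.0005, 150)) // promote
	fmt.Println(canaryDecision(0.0005, 250)) // rollback
}
```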
CQRS Read/Write Separation
Separated read and write paths to eliminate database contention during peak trading hours.
Technical Detail
Write operations hit the primary PostgreSQL instance through the Go transaction service. Read operations (dashboards, reports, partner queries) are served from eventually-consistent read replicas with a max lag of 150ms. The API gateway routes based on HTTP method and path prefix.
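The gateway's read/write split is essentially a function of HTTP method and path prefix. A hedged sketch — the prefixes are illustrative, not Meridian's actual routes:

```go
package main

import (
	"fmt"
	"strings"
)

// readPrefixes are the reporting paths served from replicas; the names
// here are placeholders for illustration.
var readPrefixes = []string{"/dashboards", "/reports", "/partners"}

// dbTarget mirrors the gateway rule: non-read requests always hit the
// primary; reads on reporting paths go to eventually-consistent replicas
// (max lag ~150ms); all other reads stay on the primary for consistency.
func dbTarget(method, path string) string {
	if method != "GET" && method != "HEAD" {
		return "primary"
	}
	for _, p := range readPrefixes {
		if strings.HasPrefix(path, p) {
			return "replica"
		}
	}
	return "primary"
}

func main() {
	fmt.Println(dbTarget("POST", "/transactions")) // primary
	fmt.Println(dbTarget("GET", "/reports/daily")) // replica
}
```

Keeping transactional reads on the primary sidesteps read-your-writes anomalies; only reporting traffic tolerates the 150ms replica lag.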
Distributed Tracing & Alerting
Full request tracing from API gateway through all microservices, with intelligent alerting that reduced mean time to detection from 23 minutes to under 90 seconds.
Technical Detail
Datadog APM with OpenTelemetry instrumentation. Every request gets a trace ID propagated through Kafka headers. PagerDuty integration with escalation policies based on severity. Custom Datadog monitors for transaction settlement SLAs.
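Propagating the trace through Kafka amounts to carrying the W3C `traceparent` header on each record. A stdlib-only sketch — real code would use the Kafka client's header type and OpenTelemetry's propagator API rather than raw strings:

```go
package main

import "fmt"

// kafkaHeader mirrors the key/value record headers Kafka supports; a real
// producer would use its client library's header type (sarama, kafka-go,
// etc.) together with OpenTelemetry's TextMapPropagator.
type kafkaHeader struct {
	Key   string
	Value []byte
}

// injectTrace attaches the W3C traceparent so consumers continue the same
// distributed trace that began at the API gateway.
func injectTrace(headers []kafkaHeader, traceparent string) []kafkaHeader {
	return append(headers, kafkaHeader{Key: "traceparent", Value: []byte(traceparent)})
}

// extractTrace recovers the traceparent on the consumer side; an empty
// string means the producer didn't propagate one.
func extractTrace(headers []kafkaHeader) string {
	for _, h := range headers {
		if h.Key == "traceparent" {
			return string(h.Value)
		}
	}
	return ""
}

func main() {
	// Example traceparent from the W3C Trace Context specification.
	h := injectTrace(nil, "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
	fmt.Println(extractTrace(h) != "") // true: trace survives the broker hop
}
```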
Tech Stack
Why We Chose What We Chose
Backend
Go 1.22
Goroutine concurrency model ideal for transaction processing. Faster hiring than Rust, with comparable performance for this workload.
Apache Kafka
Needed message replay for SOC 2 audit trails. RabbitMQ doesn't support replay without additional tooling.
PostgreSQL 16
Existing data model was relational. Migrating to a different database type would have doubled the project timeline.
gRPC
Inter-service communication needed strong typing and was latency-sensitive. REST was reserved for external-facing APIs only.
Infrastructure
AWS EKS
Client was already on AWS. EKS gave us managed Kubernetes without the overhead of self-hosting the control plane.
Terraform
Infrastructure as code for reproducible environments. Chose over Pulumi because the ops team already knew HCL.
ArgoCD
GitOps-based deployments with built-in canary analysis. Preferred over Flux for its UI and rollback capabilities.
PgBouncer
Connection pooling to handle 2.3M daily transactions without exhausting PostgreSQL's connection limit.
Observability
Datadog APM
End-to-end distributed tracing with Kafka span propagation. Client already had a Datadog contract.
PagerDuty
On-call rotation management with escalation policies. Integrated with Datadog for automated incident creation.
OpenTelemetry
Vendor-agnostic instrumentation. If the client switches from Datadog, the instrumentation stays.
CI/CD
GitHub Actions
Client's code was already on GitHub. Actions eliminated the need for a separate CI server.
Trivy
Container image scanning for CVE detection, required for SOC 2 compliance.
SonarQube
Static analysis for code quality gates. Blocks merges if coverage drops below 85%.
Impact
Before & After
Metric: Before → After
Deploy Frequency: 1x/week (Sunday nights) → 14x/week average
Deploy Duration: 40 minutes → 6 minutes
Incident Response (MTTD): 23 minutes → 87 seconds
Transaction Latency (p95): 890ms → 47ms
Monthly Outages: 3 average → 0 in last 6 months
Infrastructure Cost: $38K/month → $21.9K/month
Engineering Quality
How We Ship
Test Coverage
87% unit, 94% integration on critical transaction paths
CI/CD Pipeline
GitHub Actions — 11-minute full suite with parallelized test shards
Monitoring
Datadog APM + PagerDuty with 90-second MTTD
Deploy Frequency
14 deploys/week average via ArgoCD canary
“The team didn't just rewrite our stack — they understood why our old architecture failed under load and designed something that actually scales with our transaction volume. The strangler fig approach meant we never had to do a scary big-bang cutover. Our institutional clients noticed the latency improvement before we even told them about the migration.”
David Park
CTO, Meridian Financial
Ongoing
What's Next
Extracting the notification service (currently still in the Rails monolith)
Implementing multi-region failover for disaster recovery
Building a real-time fraud detection pipeline on the Kafka event stream