The situation
A growth-stage e-commerce company was rebuilding its product catalogue service ahead of an anticipated traffic ramp. The team had operated on a monolithic PostgreSQL database for three years. The new service — a read-heavy catalogue API serving product listings, inventory status, and pricing to both the storefront and third-party integrations — needed to support a baseline of 8,000 RPS, weekly wave peaks of 40,000 RPS, and campaign-day spikes to 120,000 RPS.
The engineering lead had a working assumption: DynamoDB was the natural choice for a high-read, horizontally scalable catalogue service. The data model was largely key-value — product ID to product record — with limited relational complexity. The operational simplicity argument was compelling. There was no appetite for managing an RDS fleet at this stage of growth.
The question wasn't whether to use DynamoDB. It was which DynamoDB capacity mode — on-demand or provisioned — and whether Aurora Serverless v2 deserved a serious look given its ACU-based autoscaling and SQL compatibility for the more complex catalogue queries downstream analytics teams were already asking for.
We thought we knew the answer before we started. The simulation changed three things we were confident about. That's the whole point of running it.
The team ran the evaluation entirely in pinpole before provisioning a single resource. Two canvases. Three traffic patterns. Eight simulation runs. The decision was made, documented, and presented to the CTO in a single afternoon.
Canvas setup and architecture
The team built two canvases representing the same application topology — a serverless API stack — with only the database layer differing. Both canvases share the same critical path:
Canvas A: → DynamoDB (product-catalogue-table)
Canvas B: → Aurora Serverless v2 (catalogue-cluster) + RDS Proxy
Supporting: WAF · ElastiCache (read cache, both canvases) · CloudWatch · SQS (write path)
Canvas A was configured with DynamoDB in both on-demand and provisioned modes — toggled between simulation runs — to produce direct capacity mode comparisons under identical traffic. Canvas B used Aurora Serverless v2 with a minimum of 0.5 ACU and a maximum of 64 ACU, behind an RDS Proxy to manage Lambda connection pooling.
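Canvas B's database settings map directly onto the RDS API. A minimal sketch of the equivalent boto3 parameters, where the identifiers and engine choice are illustrative assumptions rather than the team's actual values:

```python
# Aurora Serverless v2 is created as a provisioned-mode cluster carrying a
# serverless scaling configuration, plus a db.serverless instance.
# Identifiers and engine below are illustrative assumptions.
cluster_params = {
    "DBClusterIdentifier": "catalogue-cluster",
    "Engine": "aurora-postgresql",
    "ServerlessV2ScalingConfiguration": {
        "MinCapacity": 0.5,   # the initial canvas setting
        "MaxCapacity": 64.0,
    },
}  # -> boto3.client("rds").create_db_cluster(**cluster_params)

instance_params = {
    "DBInstanceIdentifier": "catalogue-instance-1",
    "DBClusterIdentifier": "catalogue-cluster",
    "DBInstanceClass": "db.serverless",  # scales within the cluster's ACU range
    "Engine": "aurora-postgresql",
}  # -> boto3.client("rds").create_db_instance(**instance_params)
```

The RDS Proxy sits in front of this cluster; it is configured separately (via `create_db_proxy`) and is omitted here for brevity.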
Figure 1 — Canvas A: DynamoDB architecture. All service connections validated. DynamoDB configured in on-demand capacity mode for initial simulation run.
Figure 2 — Canvas B: Aurora Serverless v2 architecture. RDS Proxy node present to manage Lambda → Aurora connection pooling. All connections validated.
Before running any simulation, pinpole's connection validation surfaced one immediate finding on Canvas B: the initial RDS Proxy configuration had only a single availability zone selected. The canvas flagged this as a WARNING — a single-AZ proxy creates a single point of failure for the connection pool under Lambda burst conditions. The team added a second proxy endpoint in a second AZ before the first simulation ran. This was a five-minute canvas change that would have been an undiscovered production gap under the old workflow.
Canvas validation flagged single-AZ RDS Proxy before simulation ran. Under Lambda burst at 120K RPS, a single-AZ proxy creates connection routing contention that manifests as intermittent p99 latency spikes. A second proxy endpoint was added to a second AZ at canvas design time — not discovered in production.
Simulation methodology
Three traffic patterns were run against both canvases to reflect the real operational profile of the catalogue service. All simulations used the same RPS parameters to ensure the comparison was direct.
Constant — 8,000 RPS (baseline)
Simulates the steady-state weekday traffic profile. Used to establish baseline cost and validate all nodes are healthy at operating load before wave and spike patterns are applied.
Wave — 8,000 → 40,000 RPS (weekly traffic cycle)
Simulates the weekly traffic pattern: trough at 8K RPS Sunday night, peak at 40K RPS Friday afternoon. Exposes ACU scale-up latency in Aurora Serverless v2 and DynamoDB auto-scaling behaviour at sustained peak.
Spike — 8,000 → 120,000 RPS (campaign day burst)
Simulates a near-instantaneous 15× traffic spike during a promotional event. The most demanding pattern for both architectures. Surfaces throttling, connection pool exhaustion, and burst capacity limits.
Eight simulation runs were executed in total: Constant × 3 (DynamoDB on-demand, DynamoDB provisioned, Aurora Serverless v2), Wave × 2 (DynamoDB provisioned, Aurora Serverless v2), and Spike × 3 (DynamoDB on-demand, DynamoDB provisioned, Aurora Serverless v2). DynamoDB on-demand was excluded from Wave and Spike after the Constant run surfaced a cost finding that made further comparisons academic.
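The three patterns reduce to simple RPS-over-time profiles. A minimal Python sketch of those profiles follows; the ramp shapes, durations, and function names are assumptions for illustration, not pinpole's internal pattern definitions:

```python
import math

def constant(t_min: float, base: float = 8_000) -> float:
    """Steady-state baseline: flat RPS regardless of time."""
    return base

def wave(t_min: float, trough: float = 8_000, peak: float = 40_000,
         period_min: float = 7 * 24 * 60) -> float:
    """Weekly cycle: sinusoid from Sunday-night trough to Friday peak."""
    phase = 2 * math.pi * (t_min % period_min) / period_min
    return trough + (peak - trough) * (1 - math.cos(phase)) / 2

def spike(t_min: float, base: float = 8_000, peak: float = 120_000,
          spike_start: float = 60, spike_len: float = 30) -> float:
    """Campaign burst: a near-instantaneous 15x step, then back to base."""
    return peak if spike_start <= t_min < spike_start + spike_len else base
```

The near-vertical step in `spike` (rather than a ramp) is the property that matters: it is what exposes auto-scaler reaction time in the findings below.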
Finding 1 — DynamoDB on-demand is prohibitively expensive at sustained peak
The first simulation surprise came within minutes of running the Constant pattern at 8,000 RPS. DynamoDB on-demand showed healthy latency metrics — 2ms p50, 7ms p99 — but the live cost estimate in the pinpole panel was immediately alarming.
Figure 3 — Simulation output: DynamoDB on-demand at constant 8,000 RPS. Latency is excellent. Monthly cost estimate is not.
The $76,400/month estimate for DynamoDB on-demand at 8,000 RPS baseline was the first assumption-breaker. The team had budgeted approximately $12,000/month based on rough AWS Pricing Calculator estimates using static read/write unit counts — not simulated traffic load. The gap was a function of actual read and write request unit consumption under real traffic patterns, including the write path through SQS, which on-demand pricing captures in full.
Simulated estimate: $76,400/month. Team's Pricing Calculator estimate: $12,000/month. The delta is attributable to on-demand pricing capturing full write request consumption on the SQS-driven write path, which the static estimate had modelled at 10% of actual write volume. DynamoDB on-demand was removed from further consideration before the Wave or Spike simulations ran.
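The shape of that gap can be seen with a back-of-envelope request-unit model. The unit prices and per-request factors below are illustrative assumptions (circa us-east-1 list pricing), and the function does not attempt to reproduce pinpole's internal estimate:

```python
SECONDS_PER_MONTH = 730 * 3600  # ~2.63M seconds in an average month

def on_demand_monthly_cost(read_rps: float, write_rps: float,
                           rru_per_read: float = 1.0,   # items <= 4 KB, strongly consistent
                           wru_per_write: float = 1.0,  # items <= 1 KB
                           price_per_m_rru: float = 0.25,
                           price_per_m_wru: float = 1.25) -> float:
    """Monthly on-demand cost from sustained read/write request rates."""
    rrus = read_rps * rru_per_read * SECONDS_PER_MONTH
    wrus = write_rps * wru_per_write * SECONDS_PER_MONTH
    return (rrus * price_per_m_rru + wrus * price_per_m_wru) / 1e6

# Reads alone at the 8K RPS baseline are comparatively cheap:
print(round(on_demand_monthly_cost(8_000, 0)))      # 5256
# Writes cost 5x per request; sustained write volume dominates quickly,
# which is why under-modelling the SQS write path by 10x is ruinous:
print(round(on_demand_monthly_cost(8_000, 5_000)))  # 21681
```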
The recommendation surfaced at this point was direct: switch to DynamoDB provisioned capacity with auto-scaling. The recommendation included suggested starting RCU/WCU values based on the 8K RPS simulation profile and noted the hot partition key risk given the product ID access pattern.
Figure 4 — Recommendations panel: pinpole recommends switching to DynamoDB provisioned capacity with suggested RCU/WCU values and auto-scaling policy. Hot partition key warning included.
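Provisioned capacity with target-tracking auto-scaling is configured through Application Auto Scaling rather than the DynamoDB API itself. A sketch of the equivalent boto3 parameters, with the table name and capacity bounds as illustrative placeholders rather than pinpole's exact recommended values:

```python
# Parameters for DynamoDB provisioned-mode auto-scaling, to be passed to
# boto3's "application-autoscaling" client. Capacity bounds are illustrative.
TABLE = "product-catalogue-table"

scalable_target = {
    "ServiceNamespace": "dynamodb",
    "ResourceId": f"table/{TABLE}",
    "ScalableDimension": "dynamodb:table:ReadCapacityUnits",
    "MinCapacity": 4_000,   # floor sized to the 8K RPS baseline
    "MaxCapacity": 60_000,  # ceiling for wave peaks
}  # -> client.register_scalable_target(**scalable_target)

scaling_policy = {
    "PolicyName": f"{TABLE}-read-scaling",
    "ServiceNamespace": "dynamodb",
    "ResourceId": f"table/{TABLE}",
    "ScalableDimension": "dynamodb:table:ReadCapacityUnits",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,  # scale out above 70% consumed/provisioned
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
}  # -> client.put_scaling_policy(**scaling_policy)
```

A matching pair of calls with `ScalableDimension="dynamodb:table:WriteCapacityUnits"` covers the write path.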
Finding 2 — Aurora Serverless v2 ACU cold-start latency under spike load
With DynamoDB on-demand eliminated, the comparison narrowed to DynamoDB provisioned vs Aurora Serverless v2. Both performed well under the Constant pattern. The Wave pattern at 8K → 40K RPS is where Aurora Serverless v2 revealed a behaviour the team had not accounted for.
Aurora Serverless v2's ACU scale-up is not instantaneous. When the Wave simulation ramped from 8K to 40K RPS over a ten-minute period, the p99 latency for Aurora rose from a baseline of 12ms to 47ms during the ACU scale-out window — a 3–5 minute period during which the cluster was operating beyond its provisioned capacity while new ACUs came online.
Figure 5 — Aurora Serverless v2 under Wave pattern: ACU scale-out latency spike to 47ms p99 visible during the ramp phase. Two warnings flagged by simulation.
The recommendation for this finding was to raise the minimum ACU to 4 (from 0.5) to ensure the cluster is always warm enough to absorb wave ramp load without a latency spike. This change increases the cost floor by approximately $280/month but eliminates the ACU cold-start window. A second recommendation advised enabling cluster read replicas to distribute read load during peak, reducing per-primary-instance load pressure during scale-out.
With minimum ACU set to 0.5, Aurora Serverless v2 exhibits a 3–5 minute latency degradation window during rapid traffic ramps. Setting minimum ACU to 4 eliminates this at a cost of approximately $280/month. For any traffic profile that includes wave or ramp patterns, minimum ACU should be sized to the trough-to-peak ramp rate, not the trough volume.
The same ACU minimum fix was applied to Canvas B, and the Wave simulation was re-run. With minimum ACU set to 4, the p99 latency during the ramp window dropped from 47ms to 14ms — a canvas change that took under two minutes and was validated by re-simulation before any infrastructure was provisioned.
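Outside the canvas, the same fix is a single cluster modification. Expressed as boto3 parameters for the RDS `modify_db_cluster` call (the cluster identifier is illustrative):

```python
# Raise the Aurora Serverless v2 capacity floor so the cluster stays warm
# through the wave ramp. Cluster identifier is an illustrative placeholder.
acu_fix = {
    "DBClusterIdentifier": "catalogue-cluster",
    "ServerlessV2ScalingConfiguration": {
        "MinCapacity": 4.0,   # was 0.5; eliminates the cold scale-out window
        "MaxCapacity": 64.0,
    },
    "ApplyImmediately": True,
}  # -> boto3.client("rds").modify_db_cluster(**acu_fix)
```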
Finding 3 — The campaign spike reveals the economic crossover
The Spike simulation at 120,000 RPS was where the decision became clear. Both architectures were now properly configured based on the Constant and Wave learnings. The Spike pattern revealed the economic crossover that the team had expected to find — but at a different threshold than anticipated.
| Configuration | p50 Latency | p99 (Spike Peak) | Throttle Events | Est. Monthly Cost | Simulation Result |
|---|---|---|---|---|---|
| DynamoDB on-demand | 2ms | 8ms | 0 | $136,000+ | ELIMINATED |
| DynamoDB provisioned (auto-scale) | 2ms | 31ms | ~1,200 | $9,800 | WARNING |
| DynamoDB provisioned (headroom config) | 2ms | 9ms | 0 | $12,400 | HEALTHY |
| Aurora Serverless v2 (min 4 ACU + replica) | 11ms | 28ms | 0 | $8,200 | HEALTHY |
DynamoDB provisioned with auto-scaling configured at 70% utilisation showed throttle events under the instantaneous 15× spike — the auto-scaler cannot respond fast enough to a near-vertical traffic ramp. The fix was to provision with explicit headroom: capacity configured at 150% of expected peak rather than relying on auto-scale to keep up with burst. This is a well-known DynamoDB operational pattern, but it is not discoverable without a spike simulation. The cost delta between the auto-scale config and the headroom config was $2,600/month — an acceptable trade for zero campaign-day throttling.
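The headroom pattern is simple arithmetic once the expected peak is known. The RCU-per-read factor below assumes eventually consistent reads of items up to 4 KB (0.5 RCU per read); adjust it for item size and consistency model:

```python
# Headroom provisioning for the campaign spike: capacity is held at 150% of
# expected peak so no auto-scaler reaction time is in the critical path.
EXPECTED_PEAK_RPS = 120_000
HEADROOM = 1.5             # provision at 150% of expected peak
RCU_PER_READ = 0.5         # eventually consistent, items <= 4 KB

provisioned_rcu = EXPECTED_PEAK_RPS * HEADROOM * RCU_PER_READ
print(provisioned_rcu)  # 90000.0 RCUs, billed whether consumed or not
```

Billing for that unused headroom at trough is exactly the cost the $2,600/month delta captures.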
Figure 6 — pinpole Execution History: version comparison between DynamoDB auto-scale and headroom configurations. Cost delta of $2,600/month visible alongside the throttle event and latency improvements.
Aurora Serverless v2 — with the minimum ACU correction and a read replica added — handled the 120K RPS spike cleanly. p99 under spike was 28ms, no throttle events, and the monthly cost estimate of $8,200 reflects ACU-based pricing that automatically scales down during trough periods. The provisioned DynamoDB headroom configuration costs $12,400/month because over-provisioned capacity units are always billed regardless of actual consumption.
For this traffic profile — 8K baseline, 40K weekly peak, 120K spike — Aurora Serverless v2 is $4,200/month cheaper than DynamoDB provisioned with spike headroom. The crossover is driven by Aurora's ACU-based pricing model, which scales to actual demand during trough periods rather than billing for provisioned capacity that isn't being used. Over 12 months, the Aurora Serverless v2 option saves approximately $50,400 at the traffic projections modelled.
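As a worked check of the crossover figures:

```python
aurora_monthly = 8_200    # Aurora Serverless v2, min 4 ACU + replica
dynamo_monthly = 12_400   # DynamoDB provisioned with 150% spike headroom

monthly_delta = dynamo_monthly - aurora_monthly
annual_delta = 12 * monthly_delta
print(monthly_delta, annual_delta)  # 4200 50400
```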
Cost comparison
| Architecture | Est. Monthly Cost |
|---|---|
| DynamoDB on-demand (eliminated) | $76,400 |
| Aurora Serverless v2 — wave/spike optimised | $8,200 |
The comparison above reflects the team's initial working assumption (DynamoDB on-demand) versus the selected architecture (Aurora Serverless v2, wave/spike optimised). The monthly cost difference of $68,200/month was identified entirely through pre-deployment simulation — no infrastructure provisioned, no AWS bill incurred during the evaluation.
The decision and how it was made
The final architecture selection was Aurora Serverless v2 — but not primarily on cost. The cost finding was significant, but the engineering lead's reasoning was more nuanced:
Why Aurora Serverless v2 was selected
- ACU-based pricing matches the traffic profile — trough periods are genuinely cheap; on-demand DynamoDB is not
- SQL compatibility preserves the downstream analytics query interface that the data team already uses
- Simulation validated that the ACU cold-start issue is solvable with minimum ACU configuration — not an architectural ceiling
- p99 of 28ms under 120K RPS spike is acceptable for a catalogue API; sub-10ms is not a hard requirement
- RDS Proxy provides the Lambda connection pooling solution; no additional operational complexity vs DynamoDB at this scale
When DynamoDB provisioned would have been selected
- If p99 latency under 10ms at all load tiers were a hard requirement
- If the traffic profile were constant rather than wave/spike — provisioned capacity pricing is more competitive at steady, sustained load
- If the data model had no relational query requirements and the analytics team had no existing SQL dependency
- If the team had higher confidence in traffic forecasting, making over-provisioning headroom a known and bounded cost
The decision was documented as a pinpole canvas version and presented to the CTO with the simulation history as the evidence base: two canvases, eight simulation runs, and an execution history showing every configuration iteration from the initial assumption to the final selected architecture.
Figure 7 — Final selected canvas: Aurora Serverless v2 with minimum 4 ACU, dual-AZ RDS Proxy, and read replica. All nodes healthy. Ready to deploy.
Outcome and post-deployment validation
The Aurora Serverless v2 architecture was deployed to the ST and UAT environments using pinpole's deployment workflow and promoted to production two weeks later. The first real AWS bill came in at $8,640/month — within 5.4% of the pinpole simulation estimate of $8,200.
On the first promotional campaign following go-live, the catalogue API hit 92,000 RPS — below the 120K spike ceiling modelled in simulation, but the first real data point validating that the architecture held under load. Aurora Serverless v2 scaled to 38 ACUs at peak. p99 latency was 24ms — slightly better than the 28ms simulation estimate, attributable to the cache-hit rate being higher in production than modelled. Zero throttle events. Zero incidents.
The simulation didn't just find a cost saving. It changed the conversation. Instead of debating architecture options in a design review, we were reviewing simulation evidence. That's a different kind of meeting.
A note on simulation methodology
Simulation results are design-time projections, not production measurements. The 5.4% cost variance in this case study is favourable, but teams should treat simulation estimates as directionally reliable guidance rather than operationally authoritative figures. The value of pre-deployment simulation is not precision — it is the ability to surface order-of-magnitude differences, catch architectural gaps like the single-AZ RDS Proxy issue, and eliminate options that are clearly unviable (DynamoDB on-demand at 8K+ RPS sustained) before a dollar is spent on infrastructure.
All architectures selected in simulation should still be promoted through ST and UAT environments and validated with post-deployment load testing before production. Pinpole narrows the decision space and eliminates the surprises. It does not replace the deployment pipeline.
Make your next database selection with simulation data, not assumptions.
Two canvases, three traffic patterns, eight simulation runs. One afternoon. The free tier includes 5 simulations per month — no credit card required. The DynamoDB on-demand cost finding alone is worth the session.
Start your free trial →

All cost estimates are simulation projections generated by pinpole pre-deployment traffic simulation. Actual AWS costs depend on region, reserved instance coverage, support plans, and other factors. Post-deployment variance of 5.4% is reported for this specific case study and is not a guaranteed accuracy level.
Tags: AWS · DynamoDB · Aurora · Aurora Serverless v2 · Database Selection · Pre-Deploy Simulation · Cost Optimisation · Lambda · RDS Proxy · pinpole