Performance & Optimization

Database Replication — Scale Reads, Survive Failures, and Sleep at Night

A single database instance is a single point of failure. If it goes down, your application goes down. If it gets overloaded with read queries, your API slows to a crawl. Database replication solves both problems: a replica takes over when the primary fails, and read traffic distributes across replicas to reduce primary load. We set up replication for PostgreSQL and MySQL with proper failover, monitoring, and application-level read/write routing.

Need this done for your project?

We implement, you ship. Async, documented, done in days.

Start a Brief

Replication Strategies

Streaming Replication (Physical): The primary database sends WAL (Write-Ahead Log) records to replicas, which replay them to maintain an exact copy. This is the standard for PostgreSQL high availability. The replica is a byte-for-byte copy of the primary, which means it can handle any read query but cannot have different indexes or schema. Failover promotes the replica to primary, and the application reconnects. With RDS Multi-AZ, this is managed automatically.
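As a rough sketch, a minimal streaming-replication setup on self-managed PostgreSQL 13+ looks like this (hostnames and user are illustrative; on RDS these settings are managed for you):

```conf
# Primary: postgresql.conf
wal_level = replica        # emit enough WAL detail for physical replicas
max_wal_senders = 10       # concurrent replication connections allowed
wal_keep_size = '1GB'      # retain WAL so a briefly lagging replica can catch up

# Replica: postgresql.auto.conf (written by pg_basebackup -R)
primary_conninfo = 'host=primary.example.internal port=5432 user=replicator'
# An empty standby.signal file in the data directory puts the
# instance into standby mode and starts WAL replay.
```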

Logical Replication: Instead of copying WAL records, logical replication sends decoded row changes (INSERT, UPDATE, DELETE) to the replica. This allows selective replication (specific tables only), different indexes on the replica, and cross-version replication. It is useful for analytical replicas that need additional indexes without affecting primary write performance, or for migrating between database versions.
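A minimal logical-replication sketch (table and object names are hypothetical): the primary publishes only the tables an analytics replica needs, and the replica, a separate writable PostgreSQL instance, subscribes and adds its own indexes.

```sql
-- On the primary: publish only the tables the analytics replica needs
CREATE PUBLICATION analytics_pub FOR TABLE orders, customers;

-- On the replica:
CREATE SUBSCRIPTION analytics_sub
  CONNECTION 'host=primary.example.internal dbname=app user=replicator'
  PUBLICATION analytics_pub;

-- The replica can carry extra indexes the primary does not pay for
CREATE INDEX orders_created_brin ON orders USING brin (created_at);
```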

Read Replicas: Whether using streaming or logical replication, read replicas offload read queries from the primary. The trade-off is replication lag — replicas are slightly behind the primary (typically milliseconds to seconds). Queries that require absolute consistency must go to the primary. Queries that tolerate slight staleness (dashboards, search, reports) go to replicas. Your application needs to route queries accordingly.
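The routing decision above can be sketched in a few lines of Python (a hypothetical router, not tied to any ORM): writes and consistency-critical reads go to the primary, everything else round-robins across replicas.

```python
import itertools

class QueryRouter:
    """Route statements to the primary or a rotating pool of replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql, *, require_fresh=False):
        """Return the connection name a statement should run against."""
        # Naive keyword heuristic: anything that is not a plain SELECT/CTE
        # is treated as a write. Real routers inspect the parsed statement.
        is_write = sql.lstrip().split(None, 1)[0].upper() not in ("SELECT", "WITH")
        if is_write or require_fresh or self._replicas is None:
            return self.primary
        return next(self._replicas)

router = QueryRouter("primary", ["replica-1", "replica-2"])
router.route("SELECT * FROM orders")                               # -> 'replica-1'
router.route("UPDATE orders SET status = 'paid'")                  # -> 'primary'
router.route("SELECT balance FROM accounts", require_fresh=True)   # -> 'primary'
```

The `require_fresh` flag is the escape hatch for queries that cannot tolerate replication lag.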

Multi-Region Replication: Cross-region replicas serve two purposes: disaster recovery (promote the replica if the primary region fails) and latency reduction (read queries served from the nearest region). The trade-off is higher replication lag (50-200ms for cross-region vs. sub-millisecond for same-region). We configure monitoring for cross-region lag and alerting when it exceeds your tolerance.

Our Replication Implementation

Infrastructure: We provision replicas via Terraform with appropriate instance sizing (replicas can be smaller than the primary if they handle only read traffic), storage configuration, and parameter groups. For RDS, we use the aws_db_instance resource with replicate_source_db. For Aurora, we add reader instances to the cluster. Cross-region replicas use an aws_rds_global_cluster with secondary aws_rds_cluster resources attached to it.
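A minimal Terraform sketch of the same-region read-replica case (resource names and instance classes are illustrative):

```hcl
# Read replica sized one class below the primary, read-only traffic only.
resource "aws_db_instance" "read_replica" {
  identifier           = "app-replica-1"
  replicate_source_db  = aws_db_instance.primary.identifier
  instance_class       = "db.r6g.large" # primary runs db.r6g.xlarge
  parameter_group_name = aws_db_parameter_group.replica.name
  skip_final_snapshot  = true
}
```

Engine, credentials, and storage are inherited from the source instance, which is why the replica resource stays this small.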

Failover Configuration: For RDS Multi-AZ, failover is automatic — the standby is promoted within 60-120 seconds. We configure the application's connection string to use the RDS endpoint, which automatically resolves to the current primary. For Aurora, failover typically completes in under 30 seconds, with the reader endpoint providing automatic read routing. We validate this by triggering a manual failover and verifying the application recovers within the expected RTO.

Application Routing: We configure your application to send write queries to the primary and read queries to replicas. For Node.js with Prisma, this uses the $extends API or a custom middleware. For Django, it uses DATABASE_ROUTERS. For Rails, it uses connects_to with reading and writing roles. We also handle the edge case where a user writes data and immediately reads it — the read must go to the primary to avoid reading stale data.
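For the Django case, the read-your-own-writes handling can be sketched as a DATABASE_ROUTERS class (the `user_id` hint and the sticky window are assumptions for illustration; Django itself passes model and instance hints):

```python
import time

STICKY_SECONDS = 5   # assumed upper bound on replication lag
_last_write = {}     # user/session id -> timestamp of last write

class PrimaryReplicaRouter:
    """Send reads to the replica, except right after that user wrote."""

    def db_for_read(self, model, **hints):
        user_id = hints.get("user_id")
        if user_id and time.time() - _last_write.get(user_id, 0) < STICKY_SECONDS:
            return "primary"  # read-your-own-writes: stay on the primary
        return "replica"

    def db_for_write(self, model, **hints):
        user_id = hints.get("user_id")
        if user_id:
            _last_write[user_id] = time.time()
        return "primary"
```

The sticky window trades a little primary load for never showing a user a stale version of data they just wrote.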

Monitoring: We set up alerts for replication lag (warning at 5 seconds, critical at 30 seconds), replica status (is it still replicating?), and replica resource utilization. CloudWatch metrics for RDS provide this natively. For self-managed PostgreSQL, we query pg_stat_replication and export the metrics to Prometheus. A dashboard shows lag over time, which helps identify patterns (lag spikes during batch jobs, for example).
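The alert thresholds above reduce to a small classifier; a sketch in Python, with the lag measurement sourced from pg_stat_replication (self-managed) or the CloudWatch ReplicaLag metric (RDS):

```python
# On self-managed PostgreSQL 10+, per-replica lag can be read with:
#   SELECT application_name, EXTRACT(EPOCH FROM replay_lag) AS lag_s
#   FROM pg_stat_replication;
WARNING_SECONDS = 5
CRITICAL_SECONDS = 30

def lag_alert_level(lag_seconds):
    """Map a replication lag measurement to an alert level."""
    if lag_seconds >= CRITICAL_SECONDS:
        return "critical"
    if lag_seconds >= WARNING_SECONDS:
        return "warning"
    return "ok"
```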

What You Get

A production database replication setup:

  • Replica provisioning — read replicas provisioned and configured via Terraform
  • Automatic failover — RDS Multi-AZ or Aurora with tested failover procedures
  • Application routing — read/write query routing configured in your ORM or connection layer
  • Replication monitoring — lag, status, and utilization dashboards with alerts
  • Cross-region replication — optional disaster recovery replica in a secondary region
  • Failover testing — documented failover test with measured RTO
  • Connection pooling — PgBouncer or ProxySQL configured for both primary and replica endpoints
  • Runbook — procedures for replica promotion, lag investigation, and replication troubleshooting
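For the connection pooling item, a minimal PgBouncer sketch (hostnames assumed): one pool per endpoint, so the application picks read or write behavior by database name.

```ini
; pgbouncer.ini — separate pools for primary and replica endpoints
[databases]
app_rw = host=primary.example.internal dbname=app
app_ro = host=replica.example.internal dbname=app

[pgbouncer]
listen_port = 6432
pool_mode = transaction     ; return connections after each transaction
max_client_conn = 500
default_pool_size = 20
```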

Why Anubiz Engineering

100% async — no calls, no meetings
Delivered in days, not weeks
Full documentation included
Production-grade from day one
Security-first approach
Post-delivery support included

Ready to get started?

Skip the research. Tell us what you need, and we'll scope it, implement it, and hand it back — fully documented and production-ready.