Scaling to 1 Million Users: The Architecture I Wish I Knew Sooner

When we launched, we were happy just having 100 daily users. But within months, we hit 10,000, then 100,000. And scaling problems piled up faster than users.

We aimed for 1 million users, but the architecture that worked for 1,000 couldn’t keep up. Looking back, here’s the architecture I wish I’d built from day one — and what we learned scaling under pressure.

Phase 1: The Monolith That Worked (Until It Didn’t)

Our first stack was simple:

  • Spring Boot app
  • MySQL database
  • NGINX load balancer
  • Everything deployed on one VM

[ Client ] → [ NGINX ] → [ Spring Boot App ] → [ MySQL ]

This setup handled 500 concurrent users easily. But at 5,000 concurrent users:

  • CPU maxed out
  • Queries slowed down
  • Uptime dropped below 99%

Monitoring showed DB locks, GC pauses, and thread contention.

Phase 2: Throwing More Servers (But Missing the Real Bottleneck)

We added more app servers behind NGINX:

[ Client ] → [ NGINX ] → [ App1 | App2 | App3 ] → [ MySQL ]

It scaled reads fine. But writes still funneled into a single MySQL instance.

Under load tests:

| Users  | Avg Response Time |
| ------ | ----------------- |
| 1,000  | 120ms             |
| 5,000  | 480ms             |
| 10,000 | 3.2s              |

The bottleneck wasn’t CPU — it was the database.

Phase 3: Introducing a Cache

We added Redis as a caching layer for read-heavy queries:

public User getUser(String id) {
    // Assumes a RedisTemplate<String, User> with value serialization configured
    User cached = redisTemplate.opsForValue().get(id);
    if (cached != null) {
        return cached; // cache hit: skip the database entirely
    }
    User user = userRepository.findById(id).orElseThrow();
    // Cache-aside: populate Redis with a 10-minute TTL so entries expire on their own
    redisTemplate.opsForValue().set(id, user, 10, TimeUnit.MINUTES);
    return user;
}

This reduced DB load by 60% and cut response times to under 200ms for cached reads.
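One caveat with cache-aside: a profile updated inside the TTL window can be served stale for up to 10 minutes. A minimal sketch of how the write path can evict the entry (illustrative, not our exact code; it assumes the same RedisTemplate<String, User> and repository as above, and that User exposes getId()):

public User updateUser(User user) {
    User saved = userRepository.save(user);
    // Evict the cached entry so the next read repopulates it with fresh data
    redisTemplate.delete(saved.getId());
    return saved;
}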

Benchmark for 1,000 concurrent user profile requests:

| Approach   | Avg Latency | DB Queries |
| ---------- | ----------- | ---------- |
| No Cache | 150ms | 1000 |
| With Cache | 20ms | 50 |

Phase 4: Breaking the Monolith

We broke out core features into microservices:

  • User Service
  • Post Service
  • Feed Service

Each with its own database schema (same DB instance initially).

Inter-service communication used REST APIs:

@RestController
public class FeedController {

    private final UserService userService;   // client for the User Service
    private final PostService postService;   // client for the Post Service

    public FeedController(UserService userService, PostService postService) {
        this.userService = userService;
        this.postService = postService;
    }

    @GetMapping("/feed/{userId}")
    public Feed getFeed(@PathVariable String userId) {
        User user = userService.getUser(userId);                // internal REST call #1
        List<Post> posts = postService.getPostsForUser(userId); // internal REST call #2
        return new Feed(user, posts);
    }
}

But chaining REST calls caused latency inflation. One request fanned out into 3–4 internal requests.

At scale, this killed performance.

Phase 5: Messaging and Asynchronous Processing

We added Kafka for async workflows:

  • User signup triggers Kafka event
  • Downstream services consume events instead of synchronous REST

// Publish: the signup flow emits an event and returns immediately
kafkaTemplate.send("user-signed-up", newUserId);

// Consume: downstream services react asynchronously, off the request path
@KafkaListener(topics = "user-signed-up")
public void handleSignup(String userId) {
    recommendationService.prepareWelcomeRecommendations(userId);
}

With Kafka, signup latency dropped from 1.2s to 300ms, since expensive downstream tasks ran out of band.

Phase 6: Scaling the Database

At 500,000 users, our MySQL instance couldn’t keep up — even with caching.

We added:

  • Read replicas → Split reads/writes (see the routing sketch below)
  • Sharding → User-based partitions (users 0–999k, 1M–2M, etc.)
  • Archive tables → Move cold data out of hot paths

Example query router:

// Range-based router: numeric user IDs below 1M go to shard 1, the rest to shard 2
if (userId < 1_000_000) {
    return jdbcTemplate1.query(...);   // shard 1
} else {
    return jdbcTemplate2.query(...);   // shard 2
}

This reduced write contention and query times across shards.
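For the read/write split, one common Spring approach is an AbstractRoutingDataSource that picks a target data source per call. A sketch under assumptions, not necessarily what we shipped (the class name and the "primary"/"replica" keys are illustrative):

import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

// Routes reads to a replica and writes to the primary. The keys must match the
// target data sources registered via setTargetDataSources() at configuration time.
public class ReplicaRoutingDataSource extends AbstractRoutingDataSource {

    private static final ThreadLocal<Boolean> READ_ONLY = ThreadLocal.withInitial(() -> false);

    // Called by the service layer (or an AOP aspect) around read-only operations
    public static void markReadOnly(boolean readOnly) {
        READ_ONLY.set(readOnly);
    }

    @Override
    protected Object determineCurrentLookupKey() {
        return READ_ONLY.get() ? "replica" : "primary";
    }
}

Wiring this up means registering the primary and replica DataSources under those keys and setting the flag before each read-only call, so replicas absorb read traffic while the primary handles writes.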

Phase 7: Observability

At 100,000+ users, debugging was a nightmare without visibility.

We added:

  • Distributed tracing (Jaeger + OpenTelemetry)
  • Centralized logs (ELK stack)
  • Prometheus + Grafana dashboards (a minimal metrics sketch follows)
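For the Prometheus + Grafana piece, Spring Boot's Actuator plus Micrometer does most of the heavy lifting. A minimal sketch, assuming micrometer-registry-prometheus is on the classpath (the FeedMetrics class and the feed.assembly.time metric name are hypothetical):

import java.util.function.Supplier;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class FeedMetrics {

    private final Timer feedTimer;

    public FeedMetrics(MeterRegistry registry) {
        // Exposed at /actuator/prometheus and scraped into Grafana dashboards
        this.feedTimer = registry.timer("feed.assembly.time");
    }

    // Wraps feed assembly so its latency is recorded alongside the built-in HTTP metrics
    public Feed timeFeedAssembly(Supplier<Feed> assembly) {
        return feedTimer.record(assembly);
    }
}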

Sample Grafana metrics:

| Metric         | Value   |
| -------------- | ------- |
| P95 latency | 280ms |
| DB connections | 120/200 |
| Kafka lag | 0 |

Before observability, diagnosing latency spikes took hours. After, minutes.

Phase 8: CDN and Edge Caching

At 1 million users, 40% of traffic hit static files (images, avatars, JS bundles).

We moved them to Cloudflare CDN with aggressive caching:

| Asset              | Origin Latency | CDN Latency |
| ------------------ | -------------- | ----------- |
| /static/app.js | 400ms | 40ms |
| /images/avatar.png | 300ms | 35ms |

This offloaded 70% of traffic from origin servers.
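On the origin side, long-lived Cache-Control headers are what let the CDN cache this aggressively. A minimal sketch, assuming the static assets are still served by Spring MVC (paths and max-age values are illustrative):

import java.util.concurrent.TimeUnit;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.CacheControl;
import org.springframework.web.servlet.config.annotation.ResourceHandlerRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class StaticCachingConfig implements WebMvcConfigurer {

    @Override
    public void addResourceHandlers(ResourceHandlerRegistry registry) {
        // Long max-age + public lets Cloudflare and browsers serve these without hitting origin
        registry.addResourceHandler("/static/**", "/images/**")
                .addResourceLocations("classpath:/static/", "classpath:/images/")
                .setCacheControl(CacheControl.maxAge(30, TimeUnit.DAYS).cachePublic());
    }
}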

Final Architecture I’d Build Sooner

If I could start over, I’d skip phases and build this earlier:

[ Client ]
    ↓
[ CDN + Edge Caching ]
    ↓
[ API Gateway → Service Mesh ]
    ↓
[ Microservices + Kafka + Redis Cache ]
    ↓
[ Sharded Database + Read Replicas ]

Key lessons:

  • Caching isn’t optional
  • DB scaling needs to be designed early
  • Async processing is critical
  • Observability pays off early

Scaling isn’t about “adding more servers” — it’s about removing bottlenecks at every layer.

Final Benchmark (1 Million Users, 1,000 RPS):

| Metric             | Value  |
| ------------------ | ------ |
| P95 API Latency | 210ms |
| Error Rate | <0.1% |
| Cache Hit Ratio | 85% |
| DB Query Rate | 50 qps |
| Kafka Consumer Lag | 0 |

Closing Thoughts

Scaling to a million users isn’t about fancy tech — it’s about solving the right problems in the right order.

The architecture that served your first 1,000 users won’t serve the next million.

Plan for failure modes before you hit them.

What architectural mistake cost you the most at scale? I’d love to hear.



