Lessons from Scaling a Real-Time Collaboration System
What I learned building a real-time collaborative editor — from WebSocket architecture and conflict resolution to the operational realities of keeping persistent connections alive at scale.
Last year I worked on a real-time collaborative editing system — think Google Docs, but for a domain-specific structured document format. The project taught me more about distributed systems than any textbook, mostly through painful lessons about what happens when theory meets production traffic.
Here’s what I’d tell someone starting a similar project.
Choose Your Consistency Model Early
The first and most consequential decision is how you handle concurrent edits. There are two main approaches:
Operational Transformation (OT) — the algorithm Google Docs uses. Each edit is an operation that can be transformed against concurrent operations to preserve intent. Proven, but the transformation functions are notoriously complex and hard to get right.
Conflict-free Replicated Data Types (CRDTs) — data structures that are mathematically guaranteed to converge. Each client maintains a local replica and merges changes without a central coordinator. Simpler to reason about, but can have higher memory overhead.
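To make the convergence guarantee concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is only an illustration of the merge idea: Yjs uses a far more involved sequence CRDT for text, and the `GCounter` type and function names below are made up for this example.

```typescript
// A grow-only counter (G-Counter): each replica increments only its own slot,
// and merging takes the per-replica maximum. Merge order does not matter.
type GCounter = Record<string, number>;

function increment(counter: GCounter, replicaId: string, by = 1): GCounter {
  return { ...counter, [replicaId]: (counter[replicaId] ?? 0) + by };
}

function merge(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const replicaId of Object.keys(b)) {
    merged[replicaId] = Math.max(merged[replicaId] ?? 0, b[replicaId] ?? 0);
  }
  return merged;
}

function value(counter: GCounter): number {
  let sum = 0;
  for (const replicaId of Object.keys(counter)) {
    sum += counter[replicaId] ?? 0;
  }
  return sum;
}

// Two replicas edit concurrently, then exchange state in either order.
const alice = increment(increment({}, "alice"), "alice"); // { alice: 2 }
const bob = increment({}, "bob", 3); // { bob: 3 }

const mergedAtAlice = merge(alice, bob);
const mergedAtBob = merge(bob, alice);
// Both replicas now report the same total, with no coordinator involved.
```

The same properties (merging is commutative, associative, and idempotent) are what let a CRDT-based editor accept updates in any order and still converge.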
We went with a CRDT-based approach using Yjs, and I’d make the same choice again. Here’s the core setup:
import * as Y from "yjs";
import { WebsocketProvider } from "y-websocket";

// Each document is a Yjs document with shared types.
// currentUser is passed in explicitly so the function has no hidden globals.
function createCollaborativeDoc(
  docId: string,
  currentUser: { name: string; color: string }
) {
  const ydoc = new Y.Doc();

  // Shared data structures: changes to these sync automatically
  const content = ydoc.getXmlFragment("content");
  const metadata = ydoc.getMap("metadata");
  const comments = ydoc.getArray("comments");

  // Connect to the sync server
  const provider = new WebsocketProvider(
    "wss://collab.example.com",
    docId,
    ydoc,
    { connect: true }
  );

  // Awareness protocol: cursor positions, user presence
  const awareness = provider.awareness;
  awareness.setLocalStateField("user", {
    name: currentUser.name,
    color: currentUser.color,
  });

  return { ydoc, content, metadata, comments, awareness };
}
The key advantage of Yjs is that the CRDT logic is embedded in the data structures themselves. You don’t write merge logic — you manipulate shared types and synchronisation happens automatically.
WebSocket Architecture: The Unglamorous Reality
The textbook WebSocket setup is simple: client connects to server, messages flow both ways. The production reality is considerably messier.
Connection Management
Clients disconnect constantly — network switches, laptop lids closing, mobile connections dropping. You need reconnection logic that handles all of these gracefully:
// Minimal message shape for this example; the real protocol will differ
type Message = { type: string; [key: string]: unknown };

class ResilientConnection {
  private ws: WebSocket | null = null;
  private reconnectAttempts = 0;
  private maxReconnectDelay = 30_000;
  private pendingMessages: Message[] = [];

  connect(url: string): void {
    this.ws = new WebSocket(url);

    this.ws.onopen = () => {
      this.reconnectAttempts = 0;
      this.flushPendingMessages();
    };

    this.ws.onclose = (event) => {
      if (event.code !== 1000) {
        // Abnormal close: schedule reconnect
        this.scheduleReconnect(url);
      }
    };

    this.ws.onerror = () => {
      // A connection error is followed by a close event, so we handle
      // reconnection in onclose to avoid double-reconnecting
    };
  }

  private scheduleReconnect(url: string): void {
    // Cap the exponential part first, then add jitter. Adding jitter before
    // the cap would make every client that reaches the cap wait exactly
    // maxReconnectDelay and retry in lockstep.
    const backoff = Math.min(
      1000 * Math.pow(2, this.reconnectAttempts),
      this.maxReconnectDelay
    );
    const delay = backoff + Math.random() * 1000;
    this.reconnectAttempts++;
    setTimeout(() => this.connect(url), delay);
  }

  send(message: Message): void {
    if (this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify(message));
    } else {
      // Buffer messages while disconnected
      this.pendingMessages.push(message);
    }
  }

  private flushPendingMessages(): void {
    // Stop flushing if the socket drops again mid-flush
    while (
      this.pendingMessages.length > 0 &&
      this.ws?.readyState === WebSocket.OPEN
    ) {
      const msg = this.pendingMessages.shift()!;
      this.ws.send(JSON.stringify(msg));
    }
  }
}
The jittered exponential backoff is critical. Without the random component, a server restart causes a “thundering herd” — every client reconnects at the exact same intervals, creating load spikes that can bring the server right back down.
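One detail worth watching: if jitter is added before the cap is applied, every client that reaches the cap ends up waiting exactly the same time. A common alternative is "full jitter", where the whole delay is drawn at random from zero up to the capped ceiling. The sketch below is one way to write that as a pure function; the function name, defaults, and injectable `random` parameter are assumptions made for illustration.

```typescript
// Exponential backoff with full jitter: pick a uniformly random delay
// between 0 and the capped exponential ceiling.
function backoffDelay(
  attempt: number,
  baseMs = 1_000,
  capMs = 30_000,
  random: () => number = Math.random // injectable so tests can be deterministic
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// With a fixed "random" source of 0.5 the delays are half the ceiling:
// attempt 0 -> 500 ms, attempt 3 -> 4000 ms, attempt 10 -> 15000 ms (capped).
const exampleDelays = [0, 3, 10].map((n) => backoffDelay(n, 1_000, 30_000, () => 0.5));
```

Full jitter trades a predictable minimum wait for a wider spread, which is usually what you want after a mass disconnect.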
Server-Side Fan-Out
On the server, each document is a “room” that multiple clients subscribe to. The challenge is efficiently broadcasting changes to all participants without blocking:
import WebSocket from "ws";

class DocumentRoom {
  private clients: Map<string, WebSocket> = new Map();

  // Start from the persisted state so docState is never undefined
  constructor(private docState: Uint8Array) {}

  addClient(clientId: string, ws: WebSocket): void {
    this.clients.set(clientId, ws);

    // Send current state to the new client.
    // encodeStateMessage and mergeUpdate are application helpers not shown here.
    ws.send(encodeStateMessage(this.docState));

    // Let the room know someone joined
    this.broadcastPresence();

    ws.on("message", (data: Buffer) => {
      // Apply update to server state
      this.docState = mergeUpdate(this.docState, data);

      // Broadcast to all other clients in the room
      for (const [id, client] of this.clients) {
        if (id !== clientId && client.readyState === WebSocket.OPEN) {
          client.send(data);
        }
      }
    });

    ws.on("close", () => {
      this.clients.delete(clientId);
      this.broadcastPresence();
    });
  }

  private broadcastPresence(): void {
    const activeUsers = Array.from(this.clients.keys());
    const message = JSON.stringify({
      type: "presence",
      users: activeUsers,
    });

    for (const client of this.clients.values()) {
      if (client.readyState === WebSocket.OPEN) {
        client.send(message);
      }
    }
  }
}
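The room above sends to every open socket regardless of how far behind that socket is. One way to keep a single slow client from piling up memory is to check `bufferedAmount` (available on both browser sockets and the `ws` package) before sending. The sketch below is an assumption-laden illustration, not the system described here: `SocketLike`, `broadcast`, and the one-megabyte threshold are all hypothetical.

```typescript
// Minimal shape shared by browser WebSocket and the `ws` package's WebSocket.
interface SocketLike {
  readyState: number;
  bufferedAmount: number;
  send(data: Uint8Array | string): void;
}

const OPEN = 1; // numeric value of WebSocket.OPEN

// Hypothetical threshold: above this many queued bytes a client counts as slow.
const MAX_BUFFERED_BYTES = 1_000_000;

// Sends `data` to every open client except the sender. Returns the IDs of
// clients that were skipped because their outgoing buffer is backed up.
// A skipped client has missed an update, so the caller must either disconnect
// it or send it a full state sync later.
function broadcast(
  clients: Map<string, SocketLike>,
  senderId: string,
  data: Uint8Array | string,
  maxBufferedBytes = MAX_BUFFERED_BYTES
): string[] {
  const skipped: string[] = [];
  clients.forEach((client, id) => {
    if (id === senderId || client.readyState !== OPEN) return;
    if (client.bufferedAmount > maxBufferedBytes) {
      skipped.push(id);
      return;
    }
    client.send(data);
  });
  return skipped;
}
```

Because CRDT updates can be re-requested as a full state, dropping a slow client and letting it re-sync is often cheaper than buffering indefinitely on its behalf.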
The Problems Nobody Warns You About
Memory Pressure from Idle Connections
Each WebSocket connection consumes memory on the server, even when idle. With 10,000 concurrent documents and an average of 3 users per document, you have 30,000 persistent connections. Each connection holds buffers, state objects, and awareness data.
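The arithmetic is simple enough to keep as a small helper for capacity planning. In the sketch below the per-connection footprint is deliberately an input: it has to be measured on your own servers, and the 50 KiB used in the example call is a placeholder, not a figure from this project.

```typescript
// Back-of-the-envelope sizing for persistent connections.
// `bytesPerConnection` must come from your own measurements.
function estimateConnectionMemory(
  documents: number,
  avgUsersPerDocument: number,
  bytesPerConnection: number
): { connections: number; totalBytes: number } {
  const connections = documents * avgUsersPerDocument;
  return { connections, totalBytes: connections * bytesPerConnection };
}

// The figures from the text: 10,000 documents with 3 users each.
// 50 KiB per connection is a placeholder value for illustration only.
const estimate = estimateConnectionMemory(10_000, 3, 50 * 1024);
// estimate.connections is 30,000
// estimate.totalBytes is 1,536,000,000 (roughly 1.43 GiB)
```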
We implemented idle connection pruning — if a client sends no meaningful updates for 30 minutes, we close the connection and let the client reconnect on demand:
const IDLE_TIMEOUT = 30 * 60 * 1000; // 30 minutes

function startIdleMonitor(ws: WebSocket, clientId: string): void {
  let lastActivity = Date.now();

  // Every inbound message counts as activity. Filter here if heartbeats or
  // awareness updates should not keep a connection alive.
  ws.on("message", () => {
    lastActivity = Date.now();
  });

  const interval = setInterval(() => {
    if (Date.now() - lastActivity > IDLE_TIMEOUT) {
      // 4000 is an application-defined close code
      ws.close(4000, "Idle timeout");
      clearInterval(interval);
    }
  }, 60_000); // Check every minute

  ws.on("close", () => clearInterval(interval));
}
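Note how this interacts with the client: reconnect logic that retries on every close code other than 1000 would reconnect an idle-closed client straight away, which defeats the pruning. One way to handle that, sketched here with hypothetical names, is to classify close codes on the client and treat 4000 as "reconnect only when the user does something".

```typescript
const CLOSE_NORMAL = 1000; // standard "normal closure" code
const CLOSE_IDLE_TIMEOUT = 4000; // application-defined code used by the idle monitor

type ReconnectPolicy = "none" | "on-demand" | "automatic";

// Decide what the client should do after the socket closes.
function reconnectPolicyFor(closeCode: number): ReconnectPolicy {
  if (closeCode === CLOSE_NORMAL) return "none";
  if (closeCode === CLOSE_IDLE_TIMEOUT) return "on-demand";
  return "automatic"; // network drops, server crashes, and so on
}
```

In an `onclose` handler this would mean scheduling a backoff reconnect only for `"automatic"`, and for `"on-demand"` waiting until the next local edit or focus event before connecting again.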
Document Size Growth
CRDTs have a dirty secret: their state tends to grow over time. Metadata about insertions and deletions is retained so that replicas can keep merging, and it does not shrink on its own. A document that’s been actively edited for months can have a CRDT state that’s 10-50x larger than the visible content.
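We implemented periodic compaction that loads the stored state into a fresh document and re-encodes it as a single snapshot:
async function compactDocument(docId: string): Promise<void> {
  // loadDocumentState is assumed to return the stored Yjs update as a Uint8Array
  const currentState = await loadDocumentState(docId);

  // Rebuild a fresh Yjs doc from the stored state.
  // (Y.encodeStateAsUpdate expects a Y.Doc, so the stored bytes are applied directly.)
  const freshDoc = new Y.Doc();
  Y.applyUpdate(freshDoc, currentState);

  // Re-encoding yields one merged update. With Yjs's default gc setting the
  // content of deleted items is dropped, but the metadata needed for merging
  // is kept, so this shrinks the state rather than erasing all history.
  const compactedState = Y.encodeStateAsUpdate(freshDoc);
  await saveDocumentState(docId, compactedState);
  freshDoc.destroy();

  // Notify connected clients to re-sync
  broadcastResync(docId);
}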
We run compaction as a background job during off-peak hours. It’s not seamless — connected clients need to handle the re-sync — but it keeps storage costs manageable.
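Rather than compacting every document on every run, the background job can pick candidates by comparing the encoded state size with the size of the visible content. The helper below is a hypothetical sketch; the default threshold of 10 is an assumption taken from the low end of the 10-50x range mentioned above, not a tuned value.

```typescript
// Decide whether a document is worth compacting by comparing the size of its
// encoded CRDT state with the size of the content users actually see.
function shouldCompact(
  encodedStateBytes: number,
  visibleContentBytes: number,
  ratioThreshold = 10 // assumed default, tune against real documents
): boolean {
  if (visibleContentBytes <= 0) {
    // An empty document that still carries state is pure overhead.
    return encodedStateBytes > 0;
  }
  return encodedStateBytes / visibleContentBytes >= ratioThreshold;
}
```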
Cursor Teleportation
Cursor positions are transmitted via the “awareness” protocol, separate from document changes. When a user types quickly, their cursor position updates can arrive out of order relative to document changes, causing other users’ cursors to briefly appear in wrong positions.
The fix is interpolation — don’t render cursor jumps immediately, smooth them with a short animation:
.remote-cursor {
  transition: transform 100ms ease-out;
  will-change: transform;
}
It’s a small thing, but it eliminates the visual jitter that makes collaborative editing feel broken.
Operational Lessons
Health Checks for WebSocket Servers
HTTP health checks don’t tell you if your WebSocket server is healthy. A server can return 200 OK on its health endpoint while having a full connection pool that rejects new WebSocket upgrades. We added a dedicated WebSocket health probe:
app.get("/health/ws", async (req, res) => {
  // Actually attempt a WebSocket connection to ourselves
  const probe = new WebSocket(`ws://localhost:${PORT}/probe`);
  let timer: ReturnType<typeof setTimeout> | undefined;

  try {
    await new Promise<void>((resolve, reject) => {
      probe.onopen = () => resolve();
      probe.onerror = () => reject(new Error("Probe failed"));
      timer = setTimeout(() => reject(new Error("Probe timeout")), 5000);
    });

    res.json({
      status: "healthy",
      connections: server.getActiveConnections(),
      rooms: roomManager.getActiveRooms(),
    });
  } catch {
    res.status(503).json({ status: "unhealthy" });
  } finally {
    // Always clear the timer and release the probe socket, including on
    // failure, so repeated health checks do not leak timers or connections
    clearTimeout(timer);
    probe.close();
  }
});
Graceful Shutdown
When deploying a new version, you can’t just kill the process — you’ll drop thousands of connections mid-edit. We implemented a drain sequence:
- Stop accepting new connections
- Send a “server restarting” message to all clients
- Wait for clients to save and disconnect (with a timeout)
- Force-close remaining connections
- Exit
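const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

process.on("SIGTERM", async () => {
  // 1. Stop accepting new connections
  server.stopAcceptingConnections();

  // 2. Tell clients a restart is coming
  broadcastToAll({
    type: "server_restart",
    reconnectAfter: 5000,
  });

  // 3. Give clients 10 seconds to disconnect gracefully
  await Promise.race([
    waitForAllDisconnected(),
    sleep(10_000),
  ]);

  // 4. Force-close whatever is left. forceCloseAll() is a hypothetical
  //    helper, named in the style of the other server methods used here.
  server.forceCloseAll();

  // 5. Exit
  process.exit(0);
});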
The client-side reconnection logic handles the rest — clients wait the specified duration, then reconnect to a new server instance.
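For the client to honour `reconnectAfter`, it has to recognise the restart notice among its other messages. A minimal sketch of that parsing step follows; the function names and the extra jitter (2 seconds by default) are assumptions added for illustration, the jitter being there so that clients do not all come back in the same instant.

```typescript
interface ServerRestartMessage {
  type: "server_restart";
  reconnectAfter: number; // milliseconds, as sent by the SIGTERM handler
}

function isServerRestart(msg: unknown): msg is ServerRestartMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const candidate = msg as { type?: unknown; reconnectAfter?: unknown };
  return (
    candidate.type === "server_restart" &&
    typeof candidate.reconnectAfter === "number"
  );
}

// Returns how long the client should wait before reconnecting, or null if the
// frame is not a restart notice.
function restartDelay(
  raw: string,
  maxJitterMs = 2_000, // hypothetical spread on top of the server's hint
  random: () => number = Math.random
): number | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // non-JSON frames are not restart notices
  }
  if (!isServerRestart(parsed)) return null;
  return parsed.reconnectAfter + Math.floor(random() * maxJitterMs);
}
```

A client would call `restartDelay` on each text frame and, when it returns a number, close its socket cleanly and schedule a reconnect after that many milliseconds.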
Would I Do It Again?
Yes, but I’d do a few things differently:
- Start with a managed service — Building your own WebSocket infrastructure is educational but expensive in engineering time. Services like PartyKit, Liveblocks, or Cloudflare Durable Objects handle the hard operational bits.
- Invest more in observability early — We added detailed metrics (connection counts, message throughput, sync latency) too late. These should be there from day one.
- Load test with realistic patterns — Our initial load tests used uniform message sizes and timing. Real users are bursty — they type in bursts, paste large blocks, and idle for long periods. Your load tests should simulate this.
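As a starting point for the last item, a load-test script can generate each simulated user's traffic from a seeded schedule of typing bursts, occasional pastes, and long idle gaps. Everything numeric in this sketch (the 60/10/30 split, burst lengths, paste sizes, idle durations) is an assumption to be replaced with figures from your own traffic; the names are hypothetical too.

```typescript
type LoadEvent = { atMs: number; kind: "keystroke" | "paste"; bytes: number };

// Small linear congruential generator so a load-test run can be replayed.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Builds one simulated user's event schedule for a session of `durationMs`.
function generateSession(durationMs: number, seed: number): LoadEvent[] {
  const random = lcg(seed);
  const events: LoadEvent[] = [];
  let now = 0;
  while (now < durationMs) {
    const roll = random();
    if (roll < 0.6) {
      // Typing burst: 5-40 keystrokes, 50-250 ms apart.
      const keystrokes = 5 + Math.floor(random() * 36);
      for (let i = 0; i < keystrokes && now < durationMs; i++) {
        events.push({ atMs: now, kind: "keystroke", bytes: 1 });
        now += 50 + Math.floor(random() * 201);
      }
    } else if (roll < 0.7) {
      // Occasional large paste: 1-50 KB in a single message.
      const bytes = 1_000 + Math.floor(random() * 49_001);
      events.push({ atMs: now, kind: "paste", bytes });
      now += 500;
    } else {
      // Idle gap: 5-120 seconds of silence.
      now += 5_000 + Math.floor(random() * 115_001);
    }
  }
  return events;
}
```

Running `generateSession` with a different seed per simulated user gives uncorrelated but reproducible traffic, which makes it easier to compare runs before and after a change.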
Real-time collaboration is one of those problems that’s easy to prototype and hard to ship. The demo works in an afternoon; the production system takes months. But when it works well, it feels like magic — and that’s worth building.