Keeping the auth path alive: a DDoS lane and failover for a single-process control plane
One process is simple to reason about and a single point of failure. The job is to keep the auth decision available: drop unauthenticated floods at the edge before they reach it, and have a standby ready when the process dies. The trade-offs are not the ones you expect.
A control plane for an OpenVPN deployment makes one decision over and over: should this client connect, and where does it route. That decision is the expensive part. It touches a cache, sometimes an upstream API, sometimes a database. Everything else exists to keep that decision fast, correct, and available. This is a Node + TypeScript service whose own description calls it an "OpenVPN Management Interface bridge + upstream API cache" — a single long-lived process bridging the VPN's management socket to the rest of the system.
Single-process is a deliberate choice, and it has two obvious failure modes. A flood of unauthenticated connect attempts can starve the auth path before any real client gets served. And if the one process dies, every session goes with it. The repo answers both in two small modules: src/ddos for the flood, src/ha for the death. Neither tries to be clever. Both are honest about what they do not do, and that honesty is the interesting part.
The flood arrives before the auth decision
Here is the threat. Every CONNECT attempt that hits the OpenVPN management socket eventually triggers an auth decision — cache lookups, possibly an upstream call. If an attacker opens connections faster than you can decide, the auth path becomes the bottleneck and legitimate clients queue behind garbage. You do not want the cost of authentication to be paid for traffic that was never going to authenticate.
The module's own comment states the design rule plainly: connect attempts are recorded regardless of allow or deny.
// src/ddos/CLAUDE.md
// Connect attempts are recorded regardless of allow/deny — that's the whole
// point: we want to spot floods before the auth layer absorbs them.
That sentence is the whole architecture. The detector sits in front of the decision, not behind it. It counts everything, including the attempts that auth would have rejected, because the rejected ones are precisely the flood you are trying to survive. A rate limiter that only sees authenticated traffic is measuring the wrong population.
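A minimal sketch of that ordering, assuming a hypothetical `handleConnect` wrapper and `AttemptCounter` class (neither name comes from the repo). The only point it makes is that the counter is incremented before the auth callback runs, so denied attempts are counted too.

```typescript
// Hypothetical names throughout; this illustrates ordering, not the repo's code.
type AuthDecision = 'allow' | 'deny';

class AttemptCounter {
  private readonly counts = new Map<string, number>();

  record(ip: string): number {
    const next = (this.counts.get(ip) ?? 0) + 1;
    this.counts.set(ip, next);
    return next;
  }

  get(ip: string): number {
    return this.counts.get(ip) ?? 0;
  }
}

function handleConnect(
  ip: string,
  counter: AttemptCounter,
  authenticate: (ip: string) => AuthDecision,
): AuthDecision {
  // Count first: the attempt is recorded whether or not auth later rejects it.
  counter.record(ip);
  return authenticate(ip);
}

// Five attempts that auth rejects still show up as five attempts.
const attemptCounter = new AttemptCounter();
const denyAll = (_ip: string): AuthDecision => 'deny';
for (let i = 0; i < 5; i++) handleConnect('203.0.113.7', attemptCounter, denyAll);
```

If the `record` call moved below `authenticate` and ran only on `allow`, the same five attempts would count as zero.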
Three signals, computed before the expensive path
The DDoS lane watches three things, defined as a Zod-validated config in src/ddos/types.ts:
// src/ddos/types.ts
import { z } from 'zod';

export const DdosConfigSchema = z.object({
  /** Connect attempts from a single public IP within `connectFloodWindowMs`. */
  connectFloodThreshold: z.number().int(),
  connectFloodWindowMs: z.number().int().positive(),
  /** Per-CN sustained bytes/sec (combined rx+tx) above which we alert. */
  bpsThreshold: z.number(),
  /** Distinct CNs from a single trusted IP within the window. */
  distinctCnThreshold: z.number().int(),
});
Three signals, three intents. connect-flood catches one public IP hammering the socket. distinct-cn-flood catches one IP cycling through many client identities — the comment names it directly: "cert mill / brute force." high-bps catches a single authenticated client saturating its tunnel. The first two are pre-auth IP-level signals; the third is post-auth and per-client. They are deliberately different shapes because the attacks are different shapes.
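To make the three shapes concrete, here is a rough sketch of how the thresholds might be compared against observed values. The field names follow the schema, but `Observation` and `evaluateSignals` are hypothetical helpers, and the strict greater-than comparison is an assumption taken from how the tests are described.

```typescript
// Sketch only: `Observation` and `evaluateSignals` are not part of the repo.
type AlertKind = 'connect-flood' | 'distinct-cn-flood' | 'high-bps';

interface Thresholds {
  connectFloodThreshold: number;
  distinctCnThreshold: number;
  bpsThreshold: number;
}

interface Observation {
  connectsFromIp: number;    // pre-auth, per source IP, within the window
  distinctCnsFromIp: number; // pre-auth, per source IP, within the window
  cnBytesPerSec: number;     // post-auth, per client (rx + tx)
}

function evaluateSignals(cfg: Thresholds, obs: Observation): AlertKind[] {
  const alerts: AlertKind[] = [];
  // Assumed strict greater-than: a value equal to the threshold does not alert.
  if (obs.connectsFromIp > cfg.connectFloodThreshold) alerts.push('connect-flood');
  if (obs.distinctCnsFromIp > cfg.distinctCnThreshold) alerts.push('distinct-cn-flood');
  if (obs.cnBytesPerSec > cfg.bpsThreshold) alerts.push('high-bps');
  return alerts;
}

// Illustrative numbers, not defaults from the repo.
const exampleCfg: Thresholds = {
  connectFloodThreshold: 3,
  distinctCnThreshold: 5,
  bpsThreshold: 1_000_000,
};
const atThreshold = evaluateSignals(exampleCfg, {
  connectsFromIp: 3,
  distinctCnsFromIp: 5,
  cnBytesPerSec: 1_000_000,
});
const overThreshold = evaluateSignals(exampleCfg, {
  connectsFromIp: 4,
  distinctCnsFromIp: 2,
  cnBytesPerSec: 2_000_000,
});
```

Sitting exactly on every threshold raises nothing; going one over on connects and doubling the byte rate raises two independent alerts.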
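The detection primitives live in src/ddos/detection.ts and carry no I/O and no clock: they take now as an argument. The SlidingWindow is an array of timestamps with a head pointer that advances past expired entries (not a ring buffer, despite how it reads at first glance):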
// src/ddos/detection.ts
count(now: number): number {
  const cutoff = now - this.opts.windowMs;
  while (this.head < this.events.length) {
    const t = this.events[this.head];
    if (t === undefined || t >= cutoff) break;
    this.head++;
  }
  return this.events.length - this.head;
}
Advancing a head pointer past expired timestamps is amortised O(1) in steady state, which matters because this runs on every connect attempt. The pure, time-injected design is also why the tests can assert exact counts: baseCfg uses connectFloodThreshold: 3 and a 60_000ms window, and the test feeds attempts from 1.1.1.1 to show that the first three stay silent and the alert fires on the fourth, because every threshold is a strict greater-than.
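A self-contained sketch of such a window, for readers who want to run the threshold behaviour themselves. Only `count` mirrors the excerpt; the `record` method, the compaction step, and the class name are assumptions added so the example runs on its own.

```typescript
// Sketch: `record` and the compaction step are assumptions, not repo code.
class SlidingWindowSketch {
  private events: number[] = [];
  private head = 0;
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  record(now: number): void {
    this.events.push(now);
  }

  count(now: number): number {
    const cutoff = now - this.windowMs;
    while (this.head < this.events.length) {
      const t = this.events[this.head];
      if (t === undefined || t >= cutoff) break;
      this.head++;
    }
    // Assumed housekeeping: drop the expired prefix once it dominates the array.
    if (this.head > 1024 && this.head * 2 > this.events.length) {
      this.events = this.events.slice(this.head);
      this.head = 0;
    }
    return this.events.length - this.head;
  }
}

const floodThreshold = 3;
const demoWindow = new SlidingWindowSketch(60_000);
const fired: boolean[] = [];
for (let i = 0; i < 4; i++) {
  const now = 1_000 + i; // four attempts, 1 ms apart
  demoWindow.record(now);
  fired.push(demoWindow.count(now) > floodThreshold);
}
// 60_002 ms later the cutoff is 1_002, so only the last two attempts remain.
const countAfterExpiry = demoWindow.count(1_000 + 60_002);
```

Because the clock is an argument, the fourth-attempt behaviour and the expiry boundary can both be checked with plain integers and no timers.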
The detector observes, the caller actuates
The single most important property of this module is what it refuses to do. From the class comment:
// src/ddos/DdosMonitor.ts
/**
 * The monitor never *acts* — actuation (kill, ban, firewall update) is wired
 * by the caller via the `alert` event.
 */
export class DdosMonitor extends EventEmitter {
DdosMonitor extends EventEmitter. It emits alert; it does not kill sessions, push firewall rules, or page anyone. The decision engine wires a default reaction (kill the offending CN's tunnel), but that policy lives in the caller, not the detector. This separation is why the module is testable without a network and why you can change the response without touching detection logic. The cost is that a misconfigured caller can leave alerts firing into the void — detection without actuation is just expensive logging.
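The split between sensing and acting can be sketched with Node's built-in `EventEmitter`. Here `MonitorSketch`, `DdosAlert`, and `killTunnel` are hypothetical names, and the reaction is only a stand-in for whatever policy the caller wires.

```typescript
import { EventEmitter } from 'node:events';

// Sketch: `MonitorSketch`, `DdosAlert`, and `killTunnel` are hypothetical names.
interface DdosAlert {
  kind: 'connect-flood' | 'distinct-cn-flood' | 'high-bps';
  subject: string; // an IP or a CN, depending on the signal
}

class MonitorSketch extends EventEmitter {
  // The monitor only reports; it holds no reference to anything it could act on.
  report(alert: DdosAlert): boolean {
    // emit() returns false when no listener is registered for the event.
    return this.emit('alert', alert);
  }
}

const killed: string[] = [];
const killTunnel = (cn: string): void => {
  killed.push(cn);
};

const monitor = new MonitorSketch();

// No listener yet: the alert goes nowhere.
const unheard = monitor.report({ kind: 'high-bps', subject: 'client-a' });

// Policy lives in the caller: wire the reaction here.
monitor.on('alert', (alert: DdosAlert) => {
  if (alert.kind === 'high-bps') killTunnel(alert.subject);
});
const heard = monitor.report({ kind: 'high-bps', subject: 'client-a' });
```

The boolean returned by `emit` is a cheap way to notice the "alerts into the void" case at startup, for instance by logging when it comes back `false`.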
The IP-keyed counters reach across processes via Redis, because a flood can spread across instances:
// src/ddos/DdosMonitor.ts — onConnectAttempt
const member = `${now}-${this.zaddSeq++}`;
await r.zadd(connKey, now, member);
await r.zremrangebyscore(connKey, 0, `(${cutoff}`);
ipCount = await r.zcard(connKey);
await r.expire(connKey, ttlSec);
A sorted set keyed by IP holds connect timestamps as scores; ZREMRANGEBYSCORE trims the window, ZCARD counts what remains, and EXPIRE keeps the key from outliving the window. The zaddSeq monotonic counter exists for a real bug: two connects in the same millisecond would collide as identical sorted-set members and one would be lost, so the member is <now>-<seq>, not just <now>. When Redis is unhealthy, every counter op falls back to in-memory Map<string, SlidingWindow> structures — isRedisHealthy() gates each call, so a Redis outage degrades flood detection to per-process instead of breaking it.
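The member-collision bug is easy to reproduce without a Redis server. In this sketch a `Map` stands in for the sorted set (member to score), relying on the fact that ZADD on an existing member updates its score instead of adding a second entry. `FakeSortedSet` and its method names are illustrative only.

```typescript
// Sketch: an in-memory stand-in for one Redis sorted set, for illustration only.
class FakeSortedSet {
  private readonly members = new Map<string, number>();

  // Like ZADD: re-adding an existing member overwrites it.
  zadd(score: number, member: string): void {
    this.members.set(member, score);
  }

  // Like ZREMRANGEBYSCORE key 0 (max: removes scores strictly below the bound.
  zremBelow(maxExclusive: number): void {
    this.members.forEach((score, member) => {
      if (score < maxExclusive) this.members.delete(member);
    });
  }

  // Like ZCARD.
  zcard(): number {
    return this.members.size;
  }
}

const sameMs = 1_700_000_000_000;

// Member is just the timestamp: the second connect overwrites the first.
const naive = new FakeSortedSet();
naive.zadd(sameMs, `${sameMs}`);
naive.zadd(sameMs, `${sameMs}`);

// Member is <now>-<seq>: both connects survive.
let seq = 0;
const sequenced = new FakeSortedSet();
sequenced.zadd(sameMs, `${sameMs}-${seq++}`);
sequenced.zadd(sameMs, `${sameMs}-${seq++}`);

const naiveCount = naive.zcard();
const sequencedCount = sequenced.zcard();

// Trimming works on scores, so the suffix does not affect windowing.
sequenced.zremBelow(sameMs + 1);
const afterTrim = sequenced.zcard();
```

Under-counting by one per collision sounds small, but collisions are most likely exactly when many connects arrive in the same millisecond, which is the flood case.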
The per-CN bps sampler is the deliberate exception. It stays in-memory, and the comment explains why rather than apologising for it:
// src/ddos/DdosMonitor.ts (class comment, trimmed)
// merging two independent EWMA streams across instances has no meaningful
// semantics (each instance only sees its own client's bytecount feed anyway,
// since OpenVPN sessions are pinned to one process). Cross-instance bps
// detection would require a real metrics pipeline; it isn't worth a Redis
// round-trip on the bytecount hot path.
Sessions are pinned to one process, so a per-CN rate is already complete on the instance that owns the session. Pushing it to Redis would add a round-trip on a hot path to compute a number that is meaningless to merge. That is the right call, and it is the kind of decision that only shows up when you have actually run the thing under load.
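For reference, a per-CN EWMA sampler of the kind described can be very small. This is a sketch under assumptions: the smoothing factor, the class and function names, and the cumulative-bytecount input are all illustrative, not the repo's implementation.

```typescript
// Sketch: alpha, the names, and the sampling API are assumptions.
class BpsEwma {
  private readonly alpha: number;
  private lastBytes: number | undefined = undefined;
  private lastTs: number | undefined = undefined;
  private ewma = 0;

  constructor(alpha: number) {
    this.alpha = alpha;
  }

  // `totalBytes` is a cumulative rx+tx counter; returns smoothed bytes/sec.
  sample(totalBytes: number, nowMs: number): number {
    if (this.lastBytes !== undefined && this.lastTs !== undefined && nowMs > this.lastTs) {
      const instant = ((totalBytes - this.lastBytes) * 1000) / (nowMs - this.lastTs);
      this.ewma = this.alpha * instant + (1 - this.alpha) * this.ewma;
    }
    this.lastBytes = totalBytes;
    this.lastTs = nowMs;
    return this.ewma;
  }
}

// Process-local state keyed by CN: no Redis round-trip on the bytecount path.
const samplers = new Map<string, BpsEwma>();

function sampleCn(cn: string, totalBytes: number, nowMs: number): number {
  let sampler = samplers.get(cn);
  if (sampler === undefined) {
    sampler = new BpsEwma(0.5);
    samplers.set(cn, sampler);
  }
  return sampler.sample(totalBytes, nowMs);
}

const r1 = sampleCn('client-a', 0, 0);         // first sample, no rate yet
const r2 = sampleCn('client-a', 1_000, 1_000); // instant 1000 B/s
const r3 = sampleCn('client-a', 3_000, 2_000); // instant 2000 B/s
```

The smoothed value depends on the order and spacing of one client's samples, which is why averaging two instances' EWMAs for the same CN would not yield a meaningful rate.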
Where the DDoS lane fails
Be honest about the edges. The pre-auth signals are IP-keyed, so anything that obscures the source IP weakens them — carrier-grade NAT puts thousands of legitimate clients behind one address, where a per-IP connect threshold either fires constantly or gets raised so high it stops protecting anything. A distributed flood from many IPs, each staying just under connectFloodThreshold, slips through entirely; this is a per-source detector, not a global one. And because the monitor only observes, the actual protection is no better than the actuation the caller wired. The module is a sensor. Treat it as one.
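The distributed case is worth seeing as arithmetic. The sketch below uses made-up numbers (a per-IP threshold of 20 and 500 sources) and assumes a strict greater-than comparison; none of these values come from the repo.

```typescript
// Sketch with illustrative numbers: every source sits exactly at the threshold.
const perIpThreshold = 20;
const perIpCounts = new Map<string, number>();

for (let host = 0; host < 500; host++) {
  // 500 distinct synthetic addresses: 10.0.0.1..10.0.0.250, 10.0.1.1..10.0.1.250
  const ip = `10.0.${Math.floor(host / 250)}.${(host % 250) + 1}`;
  perIpCounts.set(ip, perIpThreshold);
}

let alertingSources = 0;
let totalAttempts = 0;
perIpCounts.forEach((count) => {
  if (count > perIpThreshold) alertingSources++;
  totalAttempts += count;
});
```

Ten thousand attempts reach the auth path in one window and the per-source detector raises nothing. Catching that requires a global counter or an upstream control, which is outside what this module claims to do.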
When the one process dies
Now the second failure mode. The DDoS lane keeps the auth path from being overwhelmed; HA keeps it from disappearing. With a single process, the whole control plane is one node dist/index.js. The src/ha module is where that single point of failure gets a partner — and the module is careful to tell you exactly how weak a partner it is.
// src/ha/HaCoordinator.ts (class comment, trimmed)
/**
 * NOTE: this is a *coordination* layer, not a quorum manager. The expected
 * deployment is active/standby with shared upstream + shared SQLite (e.g.,
 * over NFS / litefs) or pure session-state replication. Split-brain is
 * bounded by the dead-after window.
 */
export class HaCoordinator extends EventEmitter {
This is not Raft. There is no quorum, no consensus, no fencing token. It is a soft coordination layer for active/standby, and split-brain is bounded, not prevented. Calling it a coordinator instead of a cluster manager is the most useful sentence in the module, because it sets the right expectation: you get fast, cheap failover with a small window of ambiguity, not a strongly consistent cluster.
Failover by sorting strings
Leader election here is a sort. Every node heartbeats its role and session count to a Redis channel on a timer; the lexically smallest live nodeId is primary.
// src/ha/HaCoordinator.ts — heartbeat()
for (const [id, p] of this.peers) {
  if (now - p.ts > this.opts.deadAfterMs) this.peers.delete(id);
}
const liveIds = [this.opts.nodeId, ...this.peers.keys()].sort();
const newRole: Role = liveIds[0] === this.opts.nodeId ? 'primary' : 'standby';
if (newRole !== this.role) {
  this.role = newRole;
  this.emit('role', this.role);
}
Prune any peer whose last heartbeat is older than deadAfterMs, sort the surviving node IDs, and the smallest one is primary. No election protocol, no negotiation — every node runs the same deterministic computation against the same membership view and arrives at the same answer. When a node's role flips it emits role, and the rest of the system reacts.
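Pulled out as a pure function, the election is small enough to test directly. `electRole` and `PeerInfo` are hypothetical names for this sketch; the logic follows the excerpt, with the clock and membership passed in as arguments.

```typescript
// Sketch: a standalone version of the prune-then-sort step, names assumed.
type Role = 'primary' | 'standby';

interface PeerInfo {
  ts: number; // when this peer was last heard from, in ms
}

function electRole(
  selfId: string,
  peers: Map<string, PeerInfo>,
  now: number,
  deadAfterMs: number,
): Role {
  // Prune peers not heard from within the dead-after window.
  peers.forEach((peer, id) => {
    if (now - peer.ts > deadAfterMs) peers.delete(id);
  });
  // Default sort() compares strings by UTF-16 code units.
  const liveIds = [selfId, ...Array.from(peers.keys())].sort();
  return liveIds[0] === selfId ? 'primary' : 'standby';
}

const peersOfB = new Map<string, PeerInfo>([['node-a', { ts: 0 }]]);
const roleWhileAlive = electRole('node-b', peersOfB, 5_000, 10_000);
const roleAfterSilence = electRole('node-b', peersOfB, 10_001, 10_000);

// Lexical, not numeric: 'node-10' sorts ahead of 'node-2'.
const lexicalOrder = ['node-2', 'node-10'].sort();
```

One practical consequence of sorting strings: if node IDs carry numbers, zero-pad them, or the node you think of as "second" may outrank the one you think of as "first".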
What does failover actually cost here? At least deadAfterMs. The standby cannot promote itself until it has stopped hearing from the dead primary for that full window, and because the prune shown above runs inside heartbeat(), it can take up to one further heartbeat interval to notice. Set deadAfterMs short and a slow heartbeat or a GC pause looks like death: you get a spurious promotion and, briefly, two primaries. Set it long and real downtime stretches out. That number is the main tuning knob, and there is no value that makes both problems go away. The test pins deadAfterMs: 10_000 against a heartbeatMs: 60_000 to exercise the logic, but in production the relationship between those two numbers is the failover behaviour.
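A back-of-the-envelope way to size that window, assuming the standby prunes dead peers only on its own heartbeat tick and that `lastSeenMs` is when the primary's final heartbeat arrived. The function name and the example values are illustrative, not taken from the repo.

```typescript
// Sketch under assumptions: pruning happens only on the standby's timer tick.
function promotionTime(
  lastSeenMs: number,
  firstTickMs: number,
  heartbeatMs: number,
  deadAfterMs: number,
): number {
  if (heartbeatMs <= 0) throw new Error('heartbeatMs must be positive');
  // Find the first tick at which `now - lastSeen > deadAfterMs` holds.
  let tick = firstTickMs;
  while (tick - lastSeenMs <= deadAfterMs) tick += heartbeatMs;
  return tick;
}

// Illustrative values: 2 s heartbeat, 10 s dead-after, last heartbeat at t = 0.
// Best case: a tick lands just after the window closes.
const bestCase = promotionTime(0, 1, 2_000, 10_000);
// Worst case: a tick lands exactly on the boundary, so the next one promotes.
const worstCase = promotionTime(0, 2_000, 2_000, 10_000);
```

Under these assumptions the promotion lands somewhere between just over deadAfterMs and deadAfterMs plus one heartbeat interval after the last heartbeat, so a long heartbeat quietly stretches the effective failover time.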
Split-brain is not an edge case to wave away; it is a designed-in window. During those deadAfterMs milliseconds, a primary that was merely slow (not dead) and its newly-promoted standby can both believe they are primary. With shared SQLite and shared upstream this is survivable. Without a shared backing store, it is a correctness risk you are accepting on purpose.
What survives a failover, and what does not
Sessions are replicated so a standby can take over IPAM hints. On every accepted connect or disconnect, the decision engine calls publishSession, which writes a TTL'd snapshot and announces it:
// src/ha/HaCoordinator.ts — publishSession()
const key = `${this.opts.redisKeyPrefix}ha:session:${snap.cn}`;
if (op === 'delete') {
  await this.pub.del(key);
} else {
  await this.pub.set(key, JSON.stringify(snap), 'EX', this.opts.sessionTtlSec);
}
await this.pub.publish(this.opts.sessionChannel, JSON.stringify({ op, snap, from: this.opts.nodeId }));
The snapshot is small and bounded — cn, vpnIp, instanceId, an optional fingerprint, and a timestamp, validated by HaSessionSnapshotSchema in src/ha/types.ts. On takeover a standby calls recallSession(cn) to read the last known good session and re-establish routing hints. The TTL matters: a snapshot that outlives its session would hand a standby a stale IP, so sessionTtlSec caps how wrong recalled state can be.
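The effect of the TTL on recall can be sketched with an in-memory store and an injected clock. `TtlSessionStore` is a stand-in for the Redis key with an expiry, the `ts` field name is an assumption for the timestamp, and the 300-second TTL is an arbitrary example.

```typescript
// Sketch: an in-memory stand-in for SET ... EX / GET / DEL, with names assumed.
interface SessionSnapshot {
  cn: string;
  vpnIp: string;
  instanceId: string;
  fingerprint?: string;
  ts: number; // assumed field name for the snapshot timestamp
}

class TtlSessionStore {
  private readonly entries = new Map<string, { json: string; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlSec: number) {
    this.ttlMs = ttlSec * 1000;
  }

  publish(snap: SessionSnapshot, now: number): void {
    this.entries.set(snap.cn, { json: JSON.stringify(snap), expiresAt: now + this.ttlMs });
  }

  remove(cn: string): void {
    this.entries.delete(cn);
  }

  // Returns null once the snapshot has outlived its TTL.
  recall(cn: string, now: number): SessionSnapshot | null {
    const entry = this.entries.get(cn);
    if (entry === undefined || now >= entry.expiresAt) return null;
    return JSON.parse(entry.json) as SessionSnapshot;
  }
}

const sessionStore = new TtlSessionStore(300);
sessionStore.publish({ cn: 'client-a', vpnIp: '10.8.0.6', instanceId: 'node-a', ts: 0 }, 0);

const freshRecall = sessionStore.recall('client-a', 299_999);
const staleRecall = sessionStore.recall('client-a', 300_000);
```

A standby that gets `null` back has to allocate fresh rather than reuse a hint, which is the safer failure: no hint is better than a wrong one.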
What survives is the routing hint. What does not survive is the live tunnel. OpenVPN sessions are pinned to a process — the same fact that kept the bps sampler in-memory — so when the primary dies, the encrypted tunnels die with it and clients reconnect. HA here means the standby knows where clients were and can route them correctly on reconnect; it does not mean the TCP/UDP tunnels migrate. If a stakeholder hears "high availability" and pictures zero dropped packets, that is the gap to close in conversation before it becomes a support ticket.
The coordinator is also degradation-tolerant in the same way the DDoS lane is. The HA test points redisUrl at redis://127.0.0.1:1 — a deliberately unreachable port — and asserts that start() does not throw. When Redis is down, pub stays null, publishSession returns early, recallSession returns null, and the node simply runs as a lone primary. Losing the coordination layer degrades you to single-node, exactly the posture you started from. Failover is the bonus, not the dependency.
The deciding factor: cheap insurance versus expensive guarantees
The recommendation this repo makes, and the one I would defend: for a single-process control plane fronting a VPN, soft active/standby coordination plus a pre-auth DDoS lane is the right amount of resilience — not a full consensus cluster. The deciding artifact is the HaCoordinator class comment itself, which names the deployment model (active/standby, shared upstream, shared SQLite) and the bound (split-brain limited by deadAfterMs). The module knows what it is. A quorum manager would buy you split-brain prevention at the cost of a third node, a consensus library, and a class of operational failures that are harder to debug than the one process you were protecting.
The trade-off you accept is real and should be stated out loud. You are trading strong consistency for operational simplicity. You get string-sort election instead of Raft, a bounded split-brain window instead of fencing, and reconnect-with-preserved-routing instead of tunnel migration. In return you get a system one engineer can hold in their head, that degrades to single-node when Redis is down instead of failing, and whose entire failover behaviour is governed by two numbers — heartbeatMs and deadAfterMs — you can reason about on a whiteboard. For a VPN control plane where clients already reconnect on network blips, that is a good trade. For a payments ledger it would be malpractice.
The business consequence is straightforward. The DDoS lane means an unauthenticated flood costs an IP-keyed counter increment, not an auth round-trip, so the bill for an attack stays close to flat and real clients keep connecting through it. The HA layer means a crashed or redeployed primary turns into a reconnect window on the order of deadAfterMs with preserved routing hints, instead of a manual incident where someone SSHes in to restart a process while clients sit disconnected. Both are insurance you can price in a handful of config values, which is exactly what a small operator can afford to own.
What to do next
Open the failover question with one number: what is your deadAfterMs, and have you written down what happens during that window? If the answer is "we never picked one" or "we assumed sessions migrate," you have found the gap before your users did. Then check the cheaper side — does your rate limiter count connection attempts before or after the auth decision? If it only sees authenticated traffic, it is measuring the wrong population, and a flood will find that out for you. Copy the connect attempts are recorded regardless of allow/deny rule into your own edge and see what breaks.