From OpenVPN event to upstream call to pushed config: the decision module as glue

2/19/2026 · Backend Development · OpenVPN Control Plane · 11 min read

The socket delivers the event, the upstream answers the question, the daemon needs a push. The decision module is the thin, testable layer that turns three subsystems into one client-auth verdict.

The decision this framework is for

When an OpenVPN client tries to connect, something has to answer three questions in order: Is this client allowed? What IP and routes does it get? What config do we push back so the tunnel actually works? Each answer lives in a different subsystem — a management socket parser, an HTTP client to the upstream API, an IP pool. The decision you make early is whether to let those subsystems call each other directly, or to keep a single coordinator that owns the sequence and nothing else.

This framework is for the second choice. In vpn-control-plane, the src/decision module is that coordinator. DecisionEngine holds the shared state and dispatches each event; the actual work lives in free-function handlers (handlers/connect.ts, handlers/disconnect.ts) and stateless policies (policies/cache.ts, policies/pushBuilder.ts). The engine never parses a management line and never builds an HTTP request itself. It sits in the seam between src/ovpn (the daemon bridge) and src/upstream (the API client) and turns one event into one verdict. The reason to do it this way is testability: when the orchestration is thin and the side-effecting collaborators are interfaces, you can drive the whole connect flow with in-memory fakes and assert on outcomes, not mocks-of-mocks.

The framework

Call it event in, verdict out, side effects at the edges. Four moves:

  1. One entrypoint, typed event dispatch. Everything enters through DecisionEngine.handle(ev). The engine switches on ev.kind and forwards to a handler. No business logic in the switch.
  2. Split the handler into phases that return data, not side effects. The connect flow is an auth phase (runAuthPhase) that returns an AuthDecision | PendingSignal | null, and an apply phase (runApplyPhase) that consumes that decision. The phase boundary is a value you can inspect.
  3. Push the I/O behind injected collaborators. ManagementClient, UpstreamClient, Cache, IpPool are constructor arguments. The pure transforms (buildPushConfig) take plain options and return plain values.
  4. Make the verdict the only output that matters. The flow ends in exactly one of mgmt.allowClient(...) or mgmt.denyClient(...). Tests assert on which one fired and with what payload.
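
The four moves can be compressed into a miniature. This is a sketch under stated assumptions, not the module's code: Ev, Mgmt, Upstream, Pool, MiniEngine and the reduced method signatures are hypothetical stand-ins chosen to keep the example short.

```typescript
// Hypothetical, reduced shapes; the real vpn-control-plane types are richer.
type Ev =
  | { kind: 'connect'; cid: number; cn: string }
  | { kind: 'disconnect'; cid: number };

interface Mgmt {
  allowClient(cid: number, ip: string): Promise<void>;
  denyClient(cid: number, reason: string): Promise<void>;
}
interface Upstream {
  connect(cn: string): Promise<'ALLOW' | 'DENY'>;
}
interface Pool {
  allocate(cn: string): string;
  release(cid: number): void;
}

// Move 2: the auth phase returns data and performs no management I/O.
async function authPhase(upstream: Upstream, cn: string): Promise<boolean> {
  return (await upstream.connect(cn)) === 'ALLOW';
}

// Free-function handler: reads the engine's collaborators, ends in one verdict.
async function onConnect(
  engine: MiniEngine,
  ev: Extract<Ev, { kind: 'connect' }>,
): Promise<void> {
  const allowed = await authPhase(engine.upstream, ev.cn);
  // Move 4: exactly one of allowClient / denyClient fires per connect.
  if (allowed) await engine.mgmt.allowClient(ev.cid, engine.pool.allocate(ev.cn));
  else await engine.mgmt.denyClient(ev.cid, 'denied by upstream');
}

class MiniEngine {
  readonly mgmt: Mgmt;
  readonly upstream: Upstream;
  readonly pool: Pool;

  // Move 3: collaborators are constructor arguments, so tests can pass fakes.
  constructor(mgmt: Mgmt, upstream: Upstream, pool: Pool) {
    this.mgmt = mgmt;
    this.upstream = upstream;
    this.pool = pool;
  }

  // Move 1: one entrypoint that only dispatches on ev.kind.
  async handle(ev: Ev): Promise<void> {
    switch (ev.kind) {
      case 'connect':
        return onConnect(this, ev);
      case 'disconnect':
        this.pool.release(ev.cid);
        return;
    }
  }
}
```

A test against this shape only needs three object literals for the collaborators and an array that records which management call fired.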

Each step with one paragraph of explanation

Step 1 — one entrypoint. handle() is the entire public surface for events. It builds a request-scoped logger, switches on the event kind, and delegates. The notable detail is the catch: an unhandled throw during connect does not leak — it sends a deny so the OpenVPN daemon never hangs waiting for a reply.

async handle(ev: ClientEvent): Promise<void> {
  const lg: Logger = log.child({ rid: newRequestId(), cid: ev.cid, instance: this.instance });
  try {
    // Await inside the try so a rejected handler promise reaches the catch below.
    if (ev.kind === 'connect' || ev.kind === 'reauth') return await onConnect(this, ev, lg);
    if (ev.kind === 'disconnect') return await onDisconnect(this, ev, lg);
    if (ev.kind === 'address') { lg.debug({ addr: ev.addr }, 'decision: address learnt'); return; }
  } catch (err) {
    lg.error({ err, ev }, 'decision: unhandled error');
    // Only connect/reauth are waiting on a verdict, so only they get a deny.
    if (ev.kind === 'connect' || ev.kind === 'reauth') {
      metrics.decisionOutcomes.inc({ outcome: 'error' });
      await this.mgmt.denyClient(ev.cid, ev.kid, 'internal error', 'Server error');
    }
  }
}

The engine knows four event kinds and the names of two handler functions. That is the whole dispatch. Notice that it imports onConnect and onDisconnect as free functions and passes this in, so the handlers read the engine's public-readonly fields directly rather than going through getters. That is a deliberate trade for keeping the class file small, and the module's CLAUDE.md documents the rule: external callers still go through handle() only.
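
One detail worth checking in any dispatcher of this shape: a catch only sees a handler's rejection if the handler's promise is awaited inside the try. The sketch below isolates that behavior. failingHandler and both dispatch functions are hypothetical names, and the denies array stands in for a denyClient call.

```typescript
// Hypothetical handler that always rejects, standing in for a failing onConnect.
async function failingHandler(): Promise<void> {
  throw new Error('boom');
}

// Without await, the promise is returned before it settles, so the catch
// never runs and the rejection reaches the caller instead.
async function dispatchWithoutAwait(denies: string[]): Promise<void> {
  try {
    return failingHandler();
  } catch {
    denies.push('deny');
  }
}

// With await, the rejection surfaces inside the try and the catch can deny.
async function dispatchWithAwait(denies: string[]): Promise<void> {
  try {
    return await failingHandler();
  } catch {
    denies.push('deny');
  }
}
```

For a daemon that blocks until it receives a verdict, the second form is the one that guarantees a deny is sent when a handler fails.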

Step 2 — phases return data. The connect flow does not auth-then-immediately-allocate in one straight line. runAuthPhase returns a discriminated union, and the coordinator decides what to do with it:

export async function runAuthFlow(
  engine: DecisionEngine,
  ev: ClientEvent & { kind: 'connect' | 'reauth' },
  cn: string, fp: string,
  authType: 'CERT_ONLY' | 'USER_PASS_ONLY',
  trustedIp: string | undefined,
  pendingAttempt: number,
  lg: Logger,
): Promise<void> {
  const result = await runAuthPhase(engine, ev, cn, fp, authType, trustedIp, pendingAttempt, lg);
  if (!result) return; // deny already sent

  if (result.kind === 'pending') {
    const key = `${ev.cid}:${ev.kid}`;
    const timer = setTimeout(() => {
      engine.pendingRetries.delete(key);
      void runAuthFlow(engine, ev, cn, fp, authType, trustedIp, result.nextAttempt, lg);
    }, config.upstream.pendingRetryMs);
    timer.unref();
    engine.pendingRetries.set(key, timer);
    return;
  }

  await runApplyPhase(engine, result, ev, lg);
}

runAuthPhase can return null (a deny was already sent, the coordinator just stops), a PendingSignal (the upstream answered PENDING, so schedule a retry keyed by cid:kid), or an AuthDecision (proceed to apply). The retry recursion lives here, not buried inside the auth phase, because retry scheduling is coordination — it touches engine.pendingRetries, which onDisconnect must be able to cancel. Keeping it in the coordinator keeps the auth phase pure enough to reason about.
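
The cancellation side of that contract might look like the following sketch. cancelPendingRetries is a hypothetical helper, and it assumes a disconnect carries only the client id, so every `${cid}:${kid}` key for that client is cleared.

```typescript
type Timer = ReturnType<typeof setTimeout>;

// Hypothetical helper: cancel every scheduled retry for one client id.
// Keys follow the `${cid}:${kid}` convention used by runAuthFlow.
function cancelPendingRetries(pending: Map<string, Timer>, cid: number): number {
  const prefix = `${cid}:`;
  let cancelled = 0;
  // Deleting entries while iterating a Map with forEach is safe in JavaScript.
  pending.forEach((timer, key) => {
    if (key.startsWith(prefix)) {
      clearTimeout(timer);
      pending.delete(key);
      cancelled++;
    }
  });
  return cancelled;
}
```

Matching on the `cid:` prefix rather than on a bare substring keeps client 7 from cancelling retries that belong to client 70.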

Step 3 — I/O behind collaborators. The cache-or-upstream lookup is a stateless function that takes exactly the collaborators it needs. No engine reference, no hidden globals:

export async function getOrFetchAuth(
  cache: Cache, upstream: UpstreamClient,
  cn: string, fp: string,
  authType: 'CERT_ONLY' | 'USER_PASS_ONLY',
  lg: Logger,
): Promise<ConnectResponse> {
  const key = `auth:${authType}:${cn}:${fp}`;
  const cached = await cache.get<ConnectResponse>(key);
  if (cached) { lg.debug({ decision: cached.decision }, 'cache: auth hit'); return cached; }
  const fresh = await upstream.connect({
    authType, certificate: { commonName: cn, fingerprintSha256: fp },
  });
  const ttl = fresh.decision === 'ALLOW' ? config.cache.authTtl
            : fresh.decision === 'DENY'  ? config.cache.authDenyTtl
            : 0;
  if (ttl > 0) await cache.set(key, fresh, ttl);
  return fresh;
}

The TTL policy is right here and it is data-driven: ALLOW caches for authTtl, DENY caches for the shorter authDenyTtl, and PENDING gets a TTL of 0, which skips cache.set entirely, because you never want to cache "ask me again." Because cache and upstream are parameters, the test suite swaps in a FakeCache backed by a Map and a FakeUpstream backed by a queue of canned responses. No network, no Redis.
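
A minimal sketch of what such fakes could look like, together with the same TTL rule. The class bodies, the TTL values and the lookup helper are assumptions for illustration, not the fixtures in the repo, and this FakeCache ignores expiry.

```typescript
type Decision = 'ALLOW' | 'DENY' | 'PENDING';
interface ConnectResponse { decision: Decision }

// Assumed TTLs in seconds; the real values come from config.cache.
const ttls = { authTtl: 300, authDenyTtl: 30 };

function ttlFor(decision: Decision): number {
  return decision === 'ALLOW' ? ttls.authTtl
       : decision === 'DENY' ? ttls.authDenyTtl
       : 0; // PENDING is never cached
}

// Map-backed cache fake; it stores values and ignores expiry.
class FakeCache {
  private store = new Map<string, unknown>();
  async get<T>(key: string): Promise<T | undefined> {
    return this.store.get(key) as T | undefined;
  }
  async set<T>(key: string, value: T, _ttlSeconds: number): Promise<void> {
    this.store.set(key, value);
  }
}

// Queue-backed upstream fake: shifts canned responses and counts calls.
class FakeUpstream {
  connectResponses: ConnectResponse[] = [];
  connectCalls = 0;
  async connect(): Promise<ConnectResponse> {
    this.connectCalls++;
    const next = this.connectResponses.shift();
    if (!next) throw new Error('FakeUpstream: no canned response queued');
    return next;
  }
}

// Reduced cache-or-upstream lookup using the fakes above.
async function lookup(cache: FakeCache, upstream: FakeUpstream, key: string): Promise<ConnectResponse> {
  const cached = await cache.get<ConnectResponse>(key);
  if (cached) return cached;
  const fresh = await upstream.connect();
  const ttl = ttlFor(fresh.decision);
  if (ttl > 0) await cache.set(key, fresh, ttl);
  return fresh;
}
```

Queueing a PENDING followed by an ALLOW and calling lookup three times should hit the fake upstream twice: the PENDING answer is not stored, the ALLOW answer is.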

Step 4 — one verdict. Every branch of the connect flow terminates in exactly one management call. The apply phase, after allocation and route install and shaping, ends with the push:

const push = buildPushConfig({
  role: engine.role,
  mode: engine.deps.mode ?? 'routed-tun',
  allocation,
  network: resolvedNetwork,
  defaultNetmask: engine.ipPool.netmask,
  defaultDns: [...config.push.dns],
  crossConnectorRoutes,
});
// ... lease persisted, sessions tracked ...
await engine.mgmt.allowClient(ev.cid, ev.kid, push);

buildPushConfig is pure: allocation plus optional upstream network metadata plus defaults in, a PushConfig out. It has no idea a socket exists. That separation is what lets the push logic — ifconfig mask, dedup of cross-connector routes, iroute lines only for s2s role — be unit-tested as a function, while the side effect of writing to the management socket stays a single line at the end of the apply phase.
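
To make "pure" concrete, here is a reduced sketch of a push builder covering the three behaviors just listed. PushOptions, MiniPush, buildMiniPush and the 'client' role name are hypothetical; the real buildPushConfig takes more options and returns a richer PushConfig.

```typescript
// Hypothetical, reduced option and result shapes.
interface PushOptions {
  role: 's2s' | 'client';   // 'client' is an assumed name for the non-s2s role
  ip: string;
  netmask?: string;         // optional upstream-provided mask
  defaultNetmask: string;   // pool default used when upstream gives none
  routes: string[];         // cross-connector routes, possibly with duplicates
  iroutes: string[];
}
interface MiniPush {
  ifconfigPush: { ip: string; netmask: string };
  routes: string[];
  iroutes: string[];
}

// Pure transform: plain options in, plain value out, no socket and no I/O.
function buildMiniPush(o: PushOptions): MiniPush {
  return {
    ifconfigPush: { ip: o.ip, netmask: o.netmask ?? o.defaultNetmask },
    routes: Array.from(new Set(o.routes)),          // dedup, first occurrence wins
    iroutes: o.role === 's2s' ? [...o.iroutes] : [], // iroute lines only for s2s
  };
}
```

Because the function touches nothing outside its arguments, each rule can be pinned with a single call and a comparison on the returned object.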

Walk the framework through a real artifact in the target repo

The clearest proof that the seam holds is DecisionEngine.test.ts. It builds an engine with fakes and drives whole connect flows, asserting only on what came out the other side. Here is the happy path:

test('ALLOW: cache miss → upstream → ipam allocate → allowClient called', async () => {
  const { engine, mgmt, upstream } = await buildEngine();
  upstream.connectResponses = [{ decision: 'ALLOW' }];
  upstream.networkResponses = [{ assignedIp: '10.8.0.7', netmask: '255.255.0.0' } as NetworkResponse];

  await engine.handle(connectEv('alice'));

  assert.equal(mgmt.allows.length, 1, 'one allowClient call');
  assert.equal(mgmt.denies.length, 0);
  assert.equal(mgmt.allows[0]!.push!.ifconfigPush?.ip, '10.8.0.7');
  assert.equal(upstream.connectCalls, 1);
  assert.equal(upstream.networkCalls, 1);
});

The test never touches HTTP. buildEngine() (in src/test/fixtures.ts) wires a FakeMgmt that records every allow/deny into an array, a FakeUpstream that shifts responses off a queue and counts calls, and a FakeCache backed by a Map. The assertion is the verdict — one allow, the right pushed IP, exactly one call to each upstream endpoint. Because the engine's only outputs are management calls and its only inputs are collaborator methods, the test can describe the entire behavior in five assertions.
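
A recording fake of that kind can be very small. The sketch below infers the method signatures from the calls shown earlier (allowClient(cid, kid, push) and denyClient(cid, kid, reason, clientReason)); the field shapes and the reduced PushConfig are assumptions.

```typescript
// Reduced, assumed PushConfig shape; only the field the test reads.
interface PushConfig {
  ifconfigPush?: { ip: string; netmask: string };
}

// Recording fake: every verdict is appended to an array the test can inspect.
class FakeMgmt {
  allows: Array<{ cid: number; kid: number; push?: PushConfig }> = [];
  denies: Array<{ cid: number; kid: number; reason: string; clientReason?: string }> = [];

  async allowClient(cid: number, kid: number, push?: PushConfig): Promise<void> {
    this.allows.push({ cid, kid, push });
  }

  async denyClient(cid: number, kid: number, reason: string, clientReason?: string): Promise<void> {
    this.denies.push({ cid, kid, reason, clientReason });
  }
}
```

The fake makes no decisions of its own, which is what lets the assertions read as statements about the verdict rather than about call sequences.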

The same shape covers the other paths. The cache-hit test connects alice twice and asserts upstream.connectCalls === 1 and upstream.networkCalls === 1: the second connect was served entirely from cache, and both allocations returned the same sticky IP. The PENDING test queues three PENDING responses with UPSTREAM_PENDING_MAX_RETRIES=2, waits for the retry timers, and asserts a single final deny with reason pending-timeout. And the resilience case is the one that matters most operationally:

test('Network upstream failure → falls back to local pool, still ALLOW', async () => {
  const { engine, mgmt, upstream } = await buildEngine();
  upstream.connectResponses = [{ decision: 'ALLOW' }];
  upstream.networkResponses = [new UpstreamError('network', 500, 'boom')];
  await engine.handle(connectEv('eve'));
  assert.equal(mgmt.allows.length, 1);
  const ip = mgmt.allows[0]!.push!.ifconfigPush!.ip;
  assert.ok(ip.startsWith('10.8.0.'));
});

Auth succeeded, but the network-metadata call to the upstream API threw a 500. The apply phase catches UpstreamError specifically, logs a fallback warning, and allocates from the local pool anyway — so the client still connects with a valid IP. That behavior is a deliberate availability decision (auth is hard-fail, network metadata is soft-fail), and because the seam is clean, it is one short test instead of a staging environment and a chaos script.
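
The soft-fail rule can be sketched in isolation. resolveIp and its callback parameters are hypothetical, and this UpstreamError is a stand-in whose constructor arguments mirror the call in the test above (endpoint, status, message).

```typescript
// Stand-in error class; argument order mirrors new UpstreamError('network', 500, 'boom').
class UpstreamError extends Error {
  endpoint: string;
  status: number;
  constructor(endpoint: string, status: number, message: string) {
    super(message);
    this.name = 'UpstreamError';
    this.endpoint = endpoint;
    this.status = status;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, UpstreamError.prototype);
  }
}

// Hypothetical helper: prefer the upstream-assigned IP, fall back to the
// local pool only when the failure is an UpstreamError.
async function resolveIp(
  fetchNetwork: () => Promise<{ assignedIp: string }>,
  allocateLocal: () => string,
  warn: (msg: string) => void,
): Promise<string> {
  try {
    return (await fetchNetwork()).assignedIp;
  } catch (err) {
    if (!(err instanceof UpstreamError)) throw err; // unknown failures still propagate
    warn(`network metadata unavailable (${err.status}), using local pool`);
    return allocateLocal();
  }
}
```

Catching only UpstreamError keeps the fallback narrow: a programming error in the apply phase still surfaces instead of being silently converted into a local allocation.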

Where the framework fails

This shape is honest about its limits, and vpn-control-plane shows two of them.

First, the "engine fields are public-readonly, handlers read them directly" decision trades encapsulation for a small class file. runApplyPhase reaches into engine.iroutes, engine.sessions, engine.throttled, and a dozen optional engine.deps.* collaborators. That is fine while one team owns the module and the CLAUDE.md invariant ("external callers go through handle() only") is respected, but it is not a boundary the compiler enforces. A readonly field stops reassignment, not mutation of the Map it points at. The seam between subsystems is clean; the seam between the engine and its own handlers is a documented convention, not a wall.
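
The gap between reassignment and mutation is easy to see in a few lines. The Engine class below is a hypothetical stand-in, and the ReadonlyMap accessor is one possible tightening, not something the module does today.

```typescript
class Engine {
  // Public-readonly field: the reference is fixed, the Map behind it is not.
  readonly sessions = new Map<number, string>();

  // One possible tightening (an assumption): keep the Map private and
  // expose a ReadonlyMap view plus an explicit mutator.
  private readonly _leases = new Map<number, string>();
  get leases(): ReadonlyMap<number, string> {
    return this._leases;
  }
  addLease(cid: number, ip: string): void {
    this._leases.set(cid, ip);
  }
}

const engine = new Engine();
// engine.sessions = new Map();        // compile error: sessions is readonly
engine.sessions.set(1, 'alice');       // compiles: readonly does not freeze the Map
// engine.leases.set(1, '10.8.0.2');   // compile error: ReadonlyMap has no set()
engine.addLease(1, '10.8.0.2');        // mutation goes through a named method
```

ReadonlyMap is a compile-time view only, so a cast can still bypass it. It narrows the convention, it does not replace it.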

Second, the apply phase has quietly become long. It does network fetch, IPAM, iroute virtualization, kernel route install, NAT mapping, shaping, lease persistence, group-firewall pairing, compliance logging, HA replication, and anomaly fan-out before it ever calls allowClient. Most of that is guarded by if (engine.deps.x) so the optional collaborators stay optional, but the function is now the place where a dozen subsystems meet. The framework kept the coordinator thin by moving logic into phases and policies — it did not stop the apply phase from accumulating responsibilities. The next refactor is to split runApplyPhase the way runAuthPhase was already split out, and the disconnect handler tells you why: onDisconnect has to unwind every one of those side effects in the right order, and an apply phase that grows unchecked makes its mirror harder to keep correct.

Trade-off

You pay for this with indirection. A connect is no longer one function you read top to bottom: it is handle → onConnect → runAuthFlow → runAuthPhase / runApplyPhase, with policies pulled in from policies/. For someone tracing a single bug, that is more files open than a monolithic handler would need. The bet is that you read the flow once and test it constantly, so optimizing for testability over first-read locality pays back. In a system where a wrong verdict either locks out a paying client or lets an unauthorized one in, fast confident tests are worth the extra hop. If this were a low-stakes internal tool with two event types and no upstream, the indirection would be overhead.

Business impact

The thin-coordinator shape is why this control plane can change its auth rules, its caching TTLs, or its fallback behavior without a fear-driven manual test pass. Each change lands with a unit test that drives the real flow through fakes in milliseconds, so a release that touches the connect path ships with evidence instead of hope. For a service whose whole job is deciding who gets on the network, that means fewer "we changed the policy and locked everyone out at 9am" incidents — and that is the failure mode that turns a routine deploy into an emergency call.

What to do next

If you have a coordination layer that sits between subsystems, run one check: can you test its main outcome without a network, a database, or a real socket? Open the test file and count the assertions against the verdict versus the assertions against internal calls. If you are asserting "this mock was called with these args" more than "this is the answer that came out," the side effects are leaking into the logic. Pull the I/O behind interfaces, make the phases return values, and see whether the flow gets shorter to describe. The connect path here is five assertions; yours can be too.
