Driving OpenVPN from its management socket: client-auth decisions over a unix socket

7/9/2025 | Backend Development | OpenVPN Control Plane | 13 min read

The daemon asks, the control plane answers. client-auth, client-deny, and kill: three verbs on a unix socket that replace a compiled C plugin, with ifconfig-push riding inside the client-auth block.

The decision this framework is for

When a VPN client connects, something has to decide whether it gets in, what IP it receives, and which routes it can reach. The textbook answer is a C plugin compiled against openvpn-plugin.h, loaded into the daemon, making that call inside OpenVPN's own address space. It works, and it is also the wrong place to put business logic that talks to an upstream API, checks a quota, and consults an anomaly detector.

This framework is for the moment you decide to move that decision out of the daemon and into a Node process that speaks OpenVPN's management protocol over a unix socket. In this control plane, a TypeScript ManagementClient connects to each instance's socket, parses the daemon's notifications, and answers with three verbs: client-auth to allow, client-deny to reject, and kill to evict an existing session. No plugin, no recompile, no daemon restart to change policy. The cost is that you now own a stateful line protocol and a child process you have to keep alive — which is the second half of this post.

The framework

Treat the management socket as a request/response control plane with four obligations, in order:

  1. Connect and survive. The socket only exists while OpenVPN runs, and it disappears on every crash-restart. The client must reconnect with backoff and treat ECONNREFUSED/ENOENT as normal, not as errors.
  2. Parse before you trust. The daemon streams async notifications (>CLIENT:CONNECT, >BYTECOUNT_CLI, >HOLD) interleaved with replies to your own commands. A stateful parser turns lines into typed events; nothing downstream touches raw strings.
  3. Answer with intent, serialized. client-auth is a multi-line block terminated by END. Two commands must never interleave on the wire, so every write goes through a queue.
  4. Own the lifecycle. Because the socket's existence is tied to the process, the same component (or its sibling supervisor) decides hold, release, and eviction — hold release on connect, kill on policy violation.

The four steps map onto real files: src/ovpn/ManagementClient.ts (1, 3), src/ovpn/protocol/index.ts (2), and the supervisor in src/ovpn/Supervisor.ts (4).

Each step with one paragraph of explanation

Connect and survive means the happy path is the exception. The client races OpenVPN's boot and its restart cycle, so the socket file is routinely missing or refusing connections. The fix is to demote those two error codes to debug logging and let a backoff timer drive reconnection; anything noisier and the logs become useless during a restart storm.

Parse before you trust means OpenVPN's protocol is a flat line stream where a connect event spans many lines (CONNECT, then ENV key/value pairs, then ENV,END), so the parser buffers per-client environment until the terminator before emitting one typed event.

Answer with intent, serialized is the rule that keeps a client-auth ... END block from being torn apart by a concurrent kill; a promise chain guarantees one write finishes before the next starts.

Own the lifecycle acknowledges that none of this matters if the daemon isn't running, and that the management socket is the visible symptom of an invisible lifecycle problem.
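The "connect and survive" rule can be sketched as two small pure functions: one that classifies a connect error as routine, and one that computes the next retry delay. The names (isExpectedConnectError, backoffDelay, logLevelFor) and the option shape are assumptions made for illustration, not the repo's actual API, and the sketch leaves out jitter to stay deterministic.

```typescript
// Hypothetical helpers; the real ManagementClient may structure this differently.

interface BackoffOptions {
  baseMs: number; // delay before the first retry
  maxMs: number;  // ceiling for any single delay
  factor: number; // multiplier applied per failed attempt
}

/** ECONNREFUSED and ENOENT are routine while OpenVPN boots or restarts. */
function isExpectedConnectError(code: string | undefined): boolean {
  return code === 'ECONNREFUSED' || code === 'ENOENT';
}

/** Delay before retry number `attempt` (0-based), capped at maxMs. */
function backoffDelay(attempt: number, opts: BackoffOptions): number {
  const raw = opts.baseMs * Math.pow(opts.factor, Math.max(0, attempt));
  return Math.min(opts.maxMs, Math.round(raw));
}

/** Routine codes log at debug; anything else deserves a warning. */
function logLevelFor(code: string | undefined): 'debug' | 'warn' {
  return isExpectedConnectError(code) ? 'debug' : 'warn';
}
```

A reconnect loop would call backoffDelay with a counter that resets to zero after a successful connect, so a long-lived session does not inherit the delay of an earlier restart storm.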

Walk the framework through a real artifact in the target repo

The allow path is the clearest place to start. allowClient builds the multi-line block and pushes per-client network config:

// src/ovpn/ManagementClient.ts
async allowClient(cid: number, kid: number, push?: PushConfig): Promise<void> {
  const lines: string[] = [`client-auth ${cid} ${kid}`];
  if (push) lines.push(...renderPush(push));
  lines.push('END');
  await this.send(lines.join('\n') + '\n');
}

The cid and kid are the client and key IDs OpenVPN assigned to this connection attempt; they are how you address one pending client among many. The push argument is where the control plane earns its keep — instead of a static ccd/ directory, it computes the client's IP and routes at decision time and renders them into the block:

// src/ovpn/ManagementClient.ts
function renderPush(p: PushConfig): string[] {
  const out: string[] = [];
  if (p.ifconfigPush) out.push(`ifconfig-push ${p.ifconfigPush.ip} ${p.ifconfigPush.mask}`);
  if (p.pushRoutes) for (const r of p.pushRoutes) out.push(`push "route ${r.network} ${r.mask}"`);
  if (p.pushDns) for (const dns of p.pushDns) out.push(`push "dhcp-option DNS ${dns}"`);
  if (p.rawPush) for (const r of p.rawPush) out.push(r.startsWith('push ') ? r : `push "${r}"`);
  if (p.configLines) out.push(...p.configLines);
  return out;
}
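To make the wire format concrete, here is a self-contained sketch that renders one client-auth block for a sample client. The trimmed PushConfig type, the helper names renderPushLines and buildClientAuthBlock, and every address in the sample are made up for illustration; only the field names come from the snippet above.

```typescript
// Trimmed-down, hypothetical version of PushConfig covering three of its fields.
interface PushConfig {
  ifconfigPush?: { ip: string; mask: string };
  pushRoutes?: { network: string; mask: string }[];
  pushDns?: string[];
}

function renderPushLines(p: PushConfig): string[] {
  const out: string[] = [];
  if (p.ifconfigPush) out.push(`ifconfig-push ${p.ifconfigPush.ip} ${p.ifconfigPush.mask}`);
  for (const r of p.pushRoutes ?? []) out.push(`push "route ${r.network} ${r.mask}"`);
  for (const dns of p.pushDns ?? []) out.push(`push "dhcp-option DNS ${dns}"`);
  return out;
}

/** Build the full multi-line block, terminated by END and a trailing newline. */
function buildClientAuthBlock(cid: number, kid: number, push?: PushConfig): string {
  const lines = [`client-auth ${cid} ${kid}`, ...(push ? renderPushLines(push) : []), 'END'];
  return lines.join('\n') + '\n';
}

const block = buildClientAuthBlock(7, 1, {
  ifconfigPush: { ip: '10.8.0.42', mask: '255.255.255.0' },
  pushRoutes: [{ network: '10.20.0.0', mask: '255.255.0.0' }],
  pushDns: ['10.8.0.1'],
});
// block contains these five lines:
//   client-auth 7 1
//   ifconfig-push 10.8.0.42 255.255.255.0
//   push "route 10.20.0.0 255.255.0.0"
//   push "dhcp-option DNS 10.8.0.1"
//   END
```

With no push argument the block collapses to the header line and END, which allows the client with whatever the server config already pushes.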

This is the part a C plugin makes painful and a socket makes trivial: ifconfig-push assigns the tunnel IP, push "route ..." injects routes, and the whole thing is a string the daemon applies to exactly this session. Deny and evict are the other two verbs, and they have a sharp detail in the quoting:

// src/ovpn/ManagementClient.ts
async denyClient(cid: number, kid: number, reason: string, clientReason?: string): Promise<void> {
  const r = quote(reason);
  const cr = quote(clientReason ?? reason);
  await this.send(`client-deny ${cid} ${kid} ${r} ${cr}\n`);
}

async kill(cidOrCn: number | string): Promise<void> {
  log.info({ target: cidOrCn, transport: this.transport }, 'mgmt: → kill');
  await this.send(`kill ${cidOrCn}\n`);
}

client-deny carries two strings: an operator-facing reason for the log and a client-facing reason shown to the user, both run through quote() so an embedded quote or backslash can't break out of the argument and inject a second command. kill is the eviction verb — it accepts either a numeric CID or a common name, which is what lets an anomaly detector kick a user by identity without knowing their current connection ID.
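The post does not show quote() itself, so the following is only a guess at its shape. It assumes the management interface's command lexer treats a double-quoted argument as one parameter and honors backslash escapes for backslash and double quote inside it; check that against the OpenVPN management notes for your version before relying on it. Collapsing newlines to spaces is an extra precaution added here, since a raw newline would end the command line.

```typescript
/**
 * Hypothetical quote(): wrap a value in double quotes so it travels as a
 * single argument on a management-interface command line.
 */
function quote(value: string): string {
  // A raw newline would terminate the command, so collapse line breaks first.
  const singleLine = value.replace(/[\r\n]+/g, ' ');
  const escaped = singleLine
    .replace(/\\/g, '\\\\') // escape backslashes first so later escapes are not doubled
    .replace(/"/g, '\\"');  // then escape embedded double quotes
  return `"${escaped}"`;
}
```

Under those assumptions, a hostile reason string such as `x"` followed by a newline and `kill 5` stays inside one quoted argument instead of becoming a second command.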

The reason these verbs can be issued safely from an event-driven Node process is the parser. OpenVPN's notifications and your command replies arrive on the same socket, so the client splits the byte stream on newlines and feeds each line to a stateful parser:

// src/ovpn/protocol/index.ts
if (sub === 'CONNECT' || sub === 'REAUTH') {
  const [cidS, kidS] = tail.split(',');
  this.pendingHeader = { kind: sub === 'CONNECT' ? 'connect' : 'reauth', cid: Number(cidS), kid: Number(kidS) };
  this.pendingEnv = {};
  return;
}
if (sub === 'ENV') {
  if (tail === 'END') {
    const h = this.pendingHeader;
    const env = this.pendingEnv;
    this.pendingHeader = null;
    if (h?.kind === 'connect' || h?.kind === 'reauth') {
      out.push({ kind: 'client', ev: { kind: h.kind, cid: h.cid, kid: h.kid, env } });
    }
    return;
  }
  // accumulate key=value into pendingEnv until END
}

A >CLIENT:CONNECT line opens a header, every following >CLIENT:ENV,key=value line accumulates into pendingEnv, and only >CLIENT:ENV,END emits a complete, typed connect event with the full environment attached. Downstream code never sees a partial connect. Lines that don't start with > are command replies, surfaced separately so a kill that fails with ERROR (CN mismatch, a socket race) gets logged instead of vanishing — otherwise the operator sees a kick fire with no effect and no explanation.
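The buffering described above, including the elided ENV accumulation and the newline splitting, can be sketched as one small class. MiniParser, its event types, and the choice to recognize only SUCCESS: and ERROR: replies are assumptions for illustration; the repo's ProtocolParser handles more notification kinds than this.

```typescript
type ClientEvent = { kind: 'connect' | 'reauth'; cid: number; kid: number; env: Record<string, string> };
type ParsedEvent =
  | { kind: 'client'; ev: ClientEvent }
  | { kind: 'reply'; ok: boolean; text: string };

class MiniParser {
  private buf = '';
  private header: { kind: 'connect' | 'reauth'; cid: number; kid: number } | null = null;
  private env: Record<string, string> = {};

  /** Feed a raw chunk; returns only the events completed by this chunk. */
  feed(chunk: string): ParsedEvent[] {
    const out: ParsedEvent[] = [];
    this.buf += chunk;
    let nl = this.buf.indexOf('\n');
    while (nl >= 0) {
      const line = this.buf.slice(0, nl).replace(/\r$/, '');
      this.buf = this.buf.slice(nl + 1); // keep any partial line for the next chunk
      this.parseLine(line, out);
      nl = this.buf.indexOf('\n');
    }
    return out;
  }

  private parseLine(line: string, out: ParsedEvent[]): void {
    if (!line.startsWith('>')) {
      // Not a notification: surface command replies so failures are visible.
      if (line.startsWith('SUCCESS:') || line.startsWith('ERROR:')) {
        out.push({ kind: 'reply', ok: line.startsWith('SUCCESS:'), text: line });
      }
      return;
    }
    if (!line.startsWith('>CLIENT:')) return; // unknown tags are dropped on purpose
    const body = line.slice('>CLIENT:'.length);
    const comma = body.indexOf(',');
    const sub = comma < 0 ? body : body.slice(0, comma);
    const tail = comma < 0 ? '' : body.slice(comma + 1);
    if (sub === 'CONNECT' || sub === 'REAUTH') {
      const [cidS, kidS] = tail.split(',');
      this.header = { kind: sub === 'CONNECT' ? 'connect' : 'reauth', cid: Number(cidS), kid: Number(kidS) };
      this.env = {};
      return;
    }
    if (sub === 'ENV') {
      if (tail === 'END') {
        if (this.header) out.push({ kind: 'client', ev: { ...this.header, env: this.env } });
        this.header = null;
        return;
      }
      const eq = tail.indexOf('=');
      if (eq > 0) this.env[tail.slice(0, eq)] = tail.slice(eq + 1); // split on the first '=' only
    }
  }
}
```

Feeding the stream in arbitrary chunk sizes yields nothing until the line carrying ENV,END has arrived in full, at which point exactly one client event comes out with the whole environment attached.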

The write side is the other half of correctness. Because allowClient sends multiple lines and kill sends one, two callers firing at once could interleave bytes on the wire. The client serializes every write through a promise chain:

// src/ovpn/ManagementClient.ts
private send(payload: string): Promise<void> {
  const task = (): Promise<void> => {
    if (!this.sock) return Promise.reject(new Error('mgmt: not connected'));
    const sock = this.sock;
    if (sock.write(payload)) return Promise.resolve();
    return new Promise<void>((res) => sock.once('drain', () => res()));
  };
  const next = this.writeQueue.then(task, task);
  this.writeQueue = next.then(() => undefined, () => undefined);
  return next;
}

Each send chains onto the previous one, so a client-auth ... END block always lands as a contiguous unit. The queue swallows errors internally (() => undefined) so one failed write doesn't poison every future command, while still rejecting the specific caller whose write failed. It also respects backpressure: if sock.write returns false, the next write waits for drain.
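The same pattern can be lifted out of the client and exercised against a fake socket. WriteQueue and WritableLike below are hypothetical names, and the interface is just the minimal surface the queue touches; it is intended to be satisfiable by a net.Socket, though that is not checked here.

```typescript
/** Minimal surface the queue needs from a socket (an assumption for this sketch). */
interface WritableLike {
  write(data: string): boolean;
  once(event: 'drain', cb: () => void): unknown;
}

class WriteQueue {
  private tail: Promise<void> = Promise.resolve();

  // getSock is looked up per write so a reconnect can swap the socket underneath.
  constructor(private readonly getSock: () => WritableLike | null) {}

  send(payload: string): Promise<void> {
    const task = (): Promise<void> => {
      const sock = this.getSock();
      if (!sock) return Promise.reject(new Error('mgmt: not connected'));
      if (sock.write(payload)) return Promise.resolve();
      // write() returned false: hold the queue until the socket drains.
      return new Promise<void>((res) => sock.once('drain', () => res()));
    };
    // Run after the previous write settles, whether it succeeded or failed.
    const next = this.tail.then(task, task);
    // The shared tail never rejects, so one failure cannot poison later sends.
    this.tail = next.then(() => undefined, () => undefined);
    return next;
  }
}
```

One gap worth checking in any version of this pattern: if the socket closes while a write is waiting for drain, that promise never settles and every later send queues behind it. Settling the wait on close as well as on drain is one way to handle that.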

Where the framework fails

The socket-as-control-plane model breaks down exactly where you stop owning the process. The management socket exists only while OpenVPN runs, so every claim above — reconnect on close, kill by CID, push routes at auth time — assumes something keeps the daemon alive and recreates the socket after a crash. If that something is a separate systemd unit, you've reintroduced the race the design was trying to remove: Node connecting to a socket that systemd hasn't recreated yet, or worse, a stale socket file left behind by a process systemd killed without cleanup. The framework also doesn't help with the first auth at all if you forget management-hold — without it, OpenVPN accepts clients before Node has connected, and your control plane silently isn't in the loop. And the parser is only as good as the protocol's stability; a new OpenVPN notification tag falls through the switch and is dropped on purpose, which is the right default but means you find out about protocol additions by their absence.

The first of those failure modes, the lifecycle race with a separate supervisor, is why this control plane doesn't run OpenVPN as a systemd unit at all. It spawns the daemon as a child of the Node process. The supervisor's docstring states the reason directly: "the management socket only exists while OpenVPN runs. Having Node own the child means start order, restart policy and graceful shutdown live in one place — no race with a separate systemd unit, no orphaned sockets."

// src/ovpn/Supervisor.ts
private spawnChild(): void {
  if (this.child || this.stopping) return;
  const args = ['--config', this.opts.config, ...(this.opts.extraArgs ?? [])];
  const child = spawn(this.opts.bin, args, {
    cwd: this.opts.cwd ?? process.cwd(),
    stdio: ['ignore', 'pipe', 'pipe'],
    detached: false,
  });
  this.child = child;
  this.startedAt = Date.now();
  child.on('exit', (code, signal) => {
    this.child = null;
    this.lastExit = { code, signal };
    this.emit('exit', { code, signal });
    if (!this.stopping && this.opts.restartOnCrash) {
      this.restartCount += 1;
      const delay = this.opts.restartDelayMs ?? 2000;
      this.restartTimer = setTimeout(() => this.start(), delay);
      this.restartTimer.unref();
    }
  });
}

detached: false keeps the OpenVPN child in the supervisor's process group, and piped stdio routes its log lines into the same structured logger as everything else. The exit handler is the restart policy that systemd would otherwise own: restart on unexpected exit, but never while a deliberate stop() is in flight, and never when restartOnCrash is disabled (the fail-fast mode the SupervisorOptions Zod schema documents for when you do want an external supervisor). Because Node owns the child, it can also run a preflight before each spawn, the step that closes the stale-socket hole:

// src/ovpn/preflight.ts
export async function preflightCleanup(opts: {
  configPath: string; mgmtSocketPath?: string; /* ... */
}): Promise<{ killed: number[]; unlinked: boolean }> {
  const exclude = new Set<number>([process.pid, ...(opts.excludePids ?? [])]);
  const foreign = findOpenVpnPidsByConfig(opts.configPath, exclude);
  for (const pid of foreign) await terminateProcess(pid, opts.killTimeoutMs ?? 5_000);
  let unlinked = false;
  if (opts.mgmtSocketPath) {
    const state = await probeUnixSocket(opts.mgmtSocketPath);
    if (state === 'alive') {
      throw new Error(`preflight: mgmt socket ${opts.mgmtSocketPath} is held by another process after cleanup`);
    }
    if (state === 'stale') { unlinkSync(opts.mgmtSocketPath); unlinked = true; }
  }
  return { killed: foreign, unlinked };
}

Before spawning, the supervisor scans /proc for any OpenVPN whose argv references its config file and terminates it, then probes the management socket: alive means a foreign listener it refuses to fight (it throws rather than spawn a competitor), stale means a leftover file it unlinks so the fresh daemon can bind. This is the single most important reason to own the child — a systemd-launched daemon plus a sidecar Node process has no clean owner for "kill the zombie and clean up its socket," and you end up debugging ECONNREFUSED against a socket file that exists but listens to nobody.
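The /proc scan is not shown in the post, so here is a rough sketch of just the matching step, kept pure so it can be tested without a live /proc. It relies on /proc/<pid>/cmdline holding argv as NUL-separated strings; the function names, the exact-match rule on the config path, and the requirement that the executable be named openvpn are all assumptions, and the repo's findOpenVpnPidsByConfig may match differently.

```typescript
/** Split the NUL-separated contents of /proc/<pid>/cmdline into argv. */
function parseCmdline(raw: string): string[] {
  return raw.split('\0').filter((a) => a.length > 0);
}

/** True when argv looks like an openvpn process launched with this config path. */
function isOpenVpnForConfig(argv: string[], configPath: string): boolean {
  const exe = (argv[0] ?? '').split('/').pop() ?? '';
  if (exe !== 'openvpn') return false;
  // Exact match only: a relative path in argv will not match an absolute one.
  return argv.slice(1).some((arg) => arg === configPath);
}

/** Pick foreign PIDs from a pid -> raw cmdline map, skipping excluded PIDs. */
function selectForeignPids(
  cmdlines: Map<number, string>,
  configPath: string,
  exclude: Set<number>,
): number[] {
  const hits: number[] = [];
  cmdlines.forEach((raw, pid) => {
    if (exclude.has(pid)) return;
    if (isOpenVpnForConfig(parseCmdline(raw), configPath)) hits.push(pid);
  });
  return hits.sort((a, b) => a - b);
}
```

A caller would build the map by reading each numeric directory under /proc and pass the supervisor's own PID in the exclude set, mirroring the exclude set in preflightCleanup above.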

The wiring ties both halves together per instance:

// src/boot/wiring.ts
const mgmt = new ManagementClient({ kind: 'unix', path: w.mgmtSocket });
const supervisor = new OpenVpnSupervisor({
  bin: config.ovpn.bin,
  config: w.runtimeConfPath,
  restartOnCrash: config.ovpn.restartOnCrash,
  restartDelayMs: config.ovpn.restartDelayMs,
  mgmtSocketPath: w.mgmtSocket,
});

Each enabled instance gets one socket path shared between the client (which connects to it) and the supervisor (which preflights and recreates it). attachAll starts both on boot or on SIGUSR1; detachAll stops the mgmt client and SIGTERMs the child on SIGUSR2, leaving the admin HTTP server up. One process, one place to look.

CTA

If you're staring at an openvpn-server@.service unit and a Node sidecar that talks to it, the trigger question is: who recreates the management socket after a crash, and who guarantees Node reconnects before a client gets in? If the answer is "two different supervisors that don't know about each other," the socket-as-control-plane model is leaking through the gap between them.

Trade-off

Owning the daemon as a child buys a single lifecycle owner at the cost of giving up systemd's free conveniences: journalctl integration, Restart= policy, socket activation, and the operational muscle memory every Linux admin already has. You re-implement restart-with-backoff, you pipe logs yourself, and a bug in your supervisor takes the daemon down with it. That trade is worth it when the control plane must be in the auth loop and a torn socket means clients connect unpoliced — and it is not worth it for a plain VPN endpoint with static ccd/ files, where systemd is simpler and there's no decision to make per connection.

Business impact

For the audience signing off on this, the consequence is operational: policy changes (a new quota rule, an anomaly kick, a route change) ship as data through a socket, not as a recompiled plugin and a maintenance-window restart that drops every connected user. Authorization decisions can consult an upstream API and a live session map at connect time, which is the difference between a VPN that enforces business rules and one that merely terminates tunnels. The risk you accept in return is concentration — one Node process now owns auth decisions and the daemon's lifecycle, so its uptime and your supervisor's correctness are load-bearing in a way a stock systemd unit never was.

What to do next

Pull up your own management-interface client and check one thing: does every write go through a single serializer, or can two callers interleave a multi-line client-auth block with a kill? If writes aren't serialized, that's a latent corruption bug that only shows up under concurrent connect-and-evict — the kind that's invisible in tests and ugly in production. The send() queue and the ProtocolParser ENV-buffering pattern above are both small enough to copy and adapt; start there.
