Counting bytes per client: flow observation, bandwidth history, and quota enforcement

9/23/2025Backend Development•OpenVPN Control Plane•289 views•11 min read•by Kuray Karaaslan

Counting bytes per client: flow observation, bandwidth history, and quota enforcement

A tunnel that does not measure itself cannot be rate-limited or billed. Three tables turn raw byte counters into a quota you can enforce and a history you can chart.

The decision this framework is for

You are running a VPN control plane and someone asks the obvious operational question: how much has this client used this month, and what happens when they cross the line? If you cannot answer that from a query, you cannot throttle, you cannot bill, and you cannot tell a noisy client apart from a compromised one.

In a Node + TypeScript control plane that bridges the OpenVPN management interface, the raw material arrives as cumulative BYTECOUNT_CLI samples — one per online client, every few seconds. The framework here is the path those samples travel: from a fire-and-forget handler, through delta extraction, into two different stores tuned for two different questions, and finally into a verdict that drives Linux traffic control. The relevant code lives in src/decision/handlers/bytecount.ts, src/quota/QuotaTracker.ts, src/predict/baselineStore.ts, and src/shaper/TcManager.ts.

The framework

Per-client byte accounting in this codebase is five steps, and each one exists because the previous step produces data in the wrong shape for the next consumer:

Capture the cumulative counter from the management interface, fan it out fire-and-forget.
Differentiate the cumulative counter into a per-sample delta, guarding against counter resets.
Accumulate the delta into a period total (quota_usage) — the billing/enforcement view.
Aggregate the rate into an hour-of-week histogram (bandwidth_history) — the baseline/anomaly view.
Enforce the verdict by either throttling via tc or killing the session.

The split between steps 3 and 4 is the whole point. A quota wants a running total per period; a baseline wants a distribution per time-of-day. The same byte counter feeds both, but they are stored, indexed, and pruned differently.

Each step with one paragraph of explanation

Capture. The management-interface parser calls handleBytecount on every sample as fire-and-forget. It looks up the session, then fans the same (rx, tx) pair out to the DDoS feed, the predictor, the anomaly evaluator, and finally the quota tracker. None of these can block the parser, so each is wrapped in its own try/catch and logged rather than thrown. That discipline is load-bearing: a slow Redis write or a Prisma hiccup must not back up the event loop that is reading from a live socket.

Differentiate. OpenVPN reports a cumulative counter, not a per-interval delta. QuotaTracker.record keeps the last sample per connection id in memory and subtracts. The subtraction is clamped at zero, because a client reconnect (new cid) or a counter reset would otherwise produce a negative delta or a spurious spike.

Accumulate. The delta goes into quota_usage, keyed on (common_name, period_start). Hot path writes land in Redis as HINCRBY; a periodic flush mirrors the running total into SQLite. SQLite is the durable record; Redis is the canonical total while the process is healthy.

Aggregate. Separately, the rate (bytes per second, not bytes) is folded into bandwidth_history — one row per (cn, bucket_ts) hour bucket, holding running sums, sums of squares, and maxima. That is enough to reconstruct mean and standard deviation per hour-of-day without storing raw samples.

Enforce. record returns a verdict. If the period limit is crossed, the handler either applies a throttle class through TcManager or kills the session outright, depending on the operator's declared action.

Walk the framework through a real artifact in the target repo

Start at capture. The handler fans the sample out and only reaches quota at the very end, after every other consumer has had its turn:

// src/decision/handlers/bytecount.ts
export async function handleBytecount(
  engine: DecisionEngine,
  cid: number,
  rx: number,
  tx: number,
): Promise<void> {
  engine.leases.updateBytes(cid, rx, tx);

  const sess = engine.sessions.get(cid);
  if (!sess) return;
  const { cn, ip } = sess;

  // ...ddos, predictor, anomaly feeds, each in its own try/catch...

  if (!engine.deps.quota || !engine.deps.policies) return;
  let verdict;
  try { verdict = await engine.deps.quota.record(cid, cn, rx, tx); }
  catch (err) { log.error({ err, cn }, 'quota: record failed'); return; }
  if (!verdict.state.exhausted) return;

Note the ordering. Quota is last because it is the only consumer that can terminate the session, and you want the predictor and anomaly detector to have already seen the sample before that happens. Every feed is independently guarded, so a quota failure logs and returns rather than starving the others.

Inside record, the cumulative-to-delta conversion is the first thing that happens:

// src/quota/QuotaTracker.ts
async record(cid: number, cn: string, rx: number, tx: number, now = Date.now()): Promise<QuotaVerdict> {
  const prev = this.lastSamples.get(cid);
  const deltaRx = prev ? Math.max(0, rx - prev.rx) : rx;
  const deltaTx = prev ? Math.max(0, tx - prev.tx) : tx;
  this.lastSamples.set(cid, { rx, tx });

  const policy = await this.policies.get(cn);
  const window = windowFor(policy.quota.period, now);

  if (deltaRx > 0 || deltaTx > 0) {
    await this.applyDelta(cn, window, deltaRx, deltaTx);
  }
  // ...read usage, compute verdict...
}

The Math.max(0, ...) is the guard against counter resets. The cid is the per-connection id, so when a client reconnects it gets a fresh cid and prev is undefined — the first sample of a new connection is treated as a full delta, not subtracted against a stale total. The period window comes from windowFor, which maps a period kind to a half-open [start, end) interval in UTC:

// src/quota/periods.ts
export function windowFor(period: QuotaPeriod, now: number): PeriodWindow {
  if (period === 'none') return { start: 0, end: Number.MAX_SAFE_INTEGER };
  const d = new Date(now);
  if (period === 'daily') {
    const start = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate());
    return { start, end: start + 86_400_000 };
  }
  // weekly (ISO Monday) and monthly handled below...
}

Period rollover is implicit: when now crosses into a new day or month, windowFor returns a new period_start, the key changes, and a fresh quota_usage row begins accumulating. There is no cron job zeroing counters at midnight — the key derivation does it for free. UTC keeps the boundary math auditable; a local billing day would be an offset, not a redesign.

Now the accumulate step. The hot path is Redis, with SQLite as the durable mirror:

// src/quota/QuotaTracker.ts — applyDelta (Redis path)
const r = getRedis();
const key = quotaKey(cn, w.start);
// HINCRBY gives us atomic "+= delta" without read-modify-write.
const p = r.pipeline();
if (dRx > 0) p.hincrby(key, 'rx', dRx);
if (dTx > 0) p.hincrby(key, 'tx', dTx);
p.hset(key, 'periodEnd', String(w.end));
const ttl = Math.max(60_000, w.end - Date.now() + REDIS_TTL_MARGIN_MS);
p.pexpire(key, ttl);
await p.exec();

HINCRBY is the reason this path exists. The earlier design did one SQLite UPSERT per sample per online client, which made the bytecount path the database's hottest writer. Moving the running total into a Redis hash turns each sample into an atomic in-memory increment with no read-modify-write, and a periodic flush mirrors it back. When Redis is unhealthy the code falls through to the same atomic += on the SQLite row, so a Redis outage degrades to the old behavior rather than losing the delta. The flush itself uses max(bytes_rx, excluded.bytes_rx) on conflict so a stale Redis read racing a fresher one can never roll the durable mirror backwards across a Redis flap.

The aggregate step writes a completely different shape. BaselineStore.recordSample folds the rate into an hour bucket, keeping enough statistics to reconstruct a distribution:

// src/predict/baselineStore.ts
await db.$executeRaw`
  INSERT INTO bandwidth_history (
    cn, bucket_ts, hour_of_day, day_of_week,
    samples, sum_rx_bps, sum_tx_bps,
    sum_rx_bps_sq, sum_tx_bps_sq,
    max_rx_bps, max_tx_bps
  ) VALUES (${cn}, ${bucket}, ${hod}, ${dow}, 1, ${rx}, ${tx}, ${rxSq}, ${txSq}, ${rx}, ${tx})
  ON CONFLICT(cn, bucket_ts) DO UPDATE SET
    samples       = samples + 1,
    sum_rx_bps    = sum_rx_bps    + excluded.sum_rx_bps,
    sum_tx_bps    = sum_tx_bps    + excluded.sum_tx_bps,
    sum_rx_bps_sq = sum_rx_bps_sq + excluded.sum_rx_bps_sq,
    sum_tx_bps_sq = sum_tx_bps_sq + excluded.sum_tx_bps_sq,
    max_rx_bps    = MAX(max_rx_bps, excluded.max_rx_bps),
    max_tx_bps    = MAX(max_tx_bps, excluded.max_tx_bps)
`;

Storing sum, sum_of_squares, and samples lets getBaseline compute mean and standard deviation per hour-of-day with a single grouped SELECT, without ever retaining raw samples. That is the difference from quota: quota_usage answers "how much in total this period," bandwidth_history answers "what is normal for this client at 2pm on a Tuesday." A separate FlowObservation table (src/, flow_observations) holds per-flow five-tuple records — src/dst IP, ports, proto, bytes each direction — for the forensic and classification view; it is the third lens on the same traffic, not part of the enforcement total.

Finally, enforcement. The verdict comes back from record, and the handler turns it into a tc action:

// src/decision/handlers/bytecount.ts
if (verdict.action === 'throttle') {
  if (engine.throttled.has(cn)) return;
  engine.throttled.add(cn);
  metrics.quotaExhausted.inc({ action: 'throttle' });
  const policy = await engine.deps.policies.get(cn);
  const throttle = effectiveThrottle(policy);
  void engine.deps.shaper?.apply(ip, throttle)
    .catch((err) => log.error({ err, cn, ip }, 'shaper: throttle apply failed'));
} else {
  metrics.quotaExhausted.inc({ action: verdict.action });
  void engine.mgmt.kill(cn).catch((err) => log.error({ err, cn }, 'quota: kill failed'));
}

The engine.throttled.has(cn) guard matters: bytecount samples keep arriving after exhaustion, and without it every subsequent sample would re-issue the same tc commands. TcManager.apply pins the client's IP to a per-client HTB class and attaches a filter matching the IP — download on the tun device, upload on a paired IFB device fed by tun ingress. The class-id is allocated per IP from a 16-bit space, with 0xFFFE reserved as the default class for unmatched flows. Throttling is just replacing that client's class rate with a much smaller effectiveThrottle value; the quota total and the shaper meet exactly here, at one apply(ip, throttle) call.

Where the framework fails

The in-memory lastSamples map is per-process and not persisted. On a control-plane restart, the first sample after boot is treated as a full delta against an undefined prev, so a client that has already moved gigabytes this connection contributes its whole cumulative counter as a single delta. The warmup step rehydrates quota_usage totals from SQLite into Redis, but it does not rehydrate the per-cid baseline — cids are not stable across a restart anyway. In practice the period total is protected by the clamp and the warmup; the edge is a brief over-count window right after boot if a client was mid-transfer.

The second soft edge is the Redis/SQLite split. While Redis is healthy it is canonical and SQLite lags by up to one flush interval. An admin reading quota_usage directly during that window sees a slightly stale total. The flush's max(...) conflict resolution keeps the mirror monotonic, but it also means a legitimate downward correction (an admin reset that races a stale flush) has to delete the Redis key first — which reset() does deliberately, wiping Redis before touching SQLite so a racing record() cannot re-seed the row from a stale total.

The third is bandwidth_history cardinality. One row per client per hour bucket grows without bound until prune deletes rows past the retention cutoff. If pruning is misconfigured or never scheduled, the baseline table is the table that quietly fills the disk.

Trade-off

Putting the hot total in Redis and treating SQLite as a mirror buys throughput at the cost of a consistency window. You accept that the durable store can be up to one flush interval behind, and that a Redis outage degrades you to the slower SQLite-only path rather than failing closed. For a control plane where the bytecount path was the database's hottest writer, that trade is worth it — but it is a trade, not a free win, and the max()-on-conflict and explicit reset ordering are the cost of making it safe.

Business impact

For the operator, this is the difference between a metered service and an honor system. Because every byte lands in a period-keyed total that rolls over by key derivation rather than a fragile cron, you can quote a client's monthly usage from a single indexed query, enforce a cap the instant it is crossed, and choose per-policy whether crossing it throttles the client or disconnects them. The bandwidth_history side turns the same data into a per-client normal, which is what lets you tell "this customer is busy today" apart from "this credential is exfiltrating data" — a distinction that protects both the bill and the network.

What to do next

If you maintain a metering path, trace one counter end to end and check the two questions are actually separated: does your enforcement store key on the billing period, and does your baseline store key on time-of-day? Conflating them is the common mistake — a single table that tries to answer both ends up good at neither. The cheap audit: open your accounting code and confirm there is a Math.max(0, current - previous) somewhere before the increment. If a cumulative counter is being summed without a reset guard, every client reconnect is silently inflating your totals.

Counting bytes per client: flow observation, bandwidth history, and quota enforcement

Counting bytes per client: flow observation, bandwidth history, and quota enforcement

The decision this framework is for

The framework

Each step with one paragraph of explanation

Walk the framework through a real artifact in the target repo

Where the framework fails

Trade-off

Business impact

What to do next

Related Articles

Comments (0)

Newsletter