Packet captures that outlive the box: forensic evidence streamed to object storage
Local disk fills, instances get recycled, evidence evaporates. Streaming captures and compliance archives straight to S3 turns "we think it happened" into "here is the file".
The decision this framework is for
You run a network control plane that sometimes sees something it does not like — an anomalous client, a geo-velocity jump, a policy violation worth investigating later. So you capture packets. The naive version writes a .pcap to local disk and moves on. That works right up until the host reboots, the autoscaler recycles the instance, or the disk fills and your log rotation quietly eats the one file the auditor wanted six weeks later.
This framework is for the moment you decide that forensic evidence has a different lifecycle than the box that produced it. In vpn-control-plane, that decision is split across two modules: src/forensic/ForensicCapture.ts, which produces short rolling packet captures on demand, and src/storage/BackupService.ts, which pushes sealed compliance and audit artifacts to an S3-compatible store via @aws-sdk/client-s3@^3.1054.0. The interesting design question is not "how do I call S3." It is: which artifacts are ephemeral, which must be durable, and where exactly is the seam between the two? Get that boundary wrong and you either pay to store noise forever or you lose the one capture that mattered.
The framework
Call it the Durable Evidence Lifecycle. Five steps, in order:
- Capture under a budget — produce evidence with hard bounds on duration, size, and frequency, so a flood of triggers cannot fill the disk.
- Record the pointer, not the payload — persist a small database row that locates the artifact, so the system can reason about evidence it is not holding in memory.
- Cross the durability seam explicitly — move artifacts you intend to keep across a named boundary (local provider → object storage) instead of hoping the disk survives.
- Make offload best-effort, never blocking — an S3 outage must not block the capture, the seal, or the shutdown that produced the evidence.
- Rotate from both ends — evict locally by a size/count budget; delete remotely by a retention key. One key, two lifecycles.
The point of naming it is that each step maps to a real seam in the code, and each seam is a place where a careless implementation silently loses evidence.
Each step with one paragraph of explanation
Capture under a budget. Evidence collection is itself a denial-of-service vector. If every anomaly spawns an unbounded tcpdump, the first noisy client fills your disk and you stop capturing the quiet attacker. So the producer must enforce ceilings before it spawns anything.
Record the pointer, not the payload. The capture file lives on disk; the database holds a row that says where it is, how big it is, and why it exists. That row is what the rest of the system queries: the admin UI lists captures without reading a single packet.
Cross the durability seam explicitly. Local disk and object storage are different reliability domains. The code makes that a real interface boundary, not an if statement buried in a handler.
Make offload best-effort. The whole reason you capture evidence is that something is already going wrong; the offload path cannot add a new failure mode to the path that seals it.
Rotate from both ends. Local storage is bounded by operational limits (you only have so much disk); remote storage is bounded by retention policy (legal/compliance windows). They expire on different clocks, so deletion has to be addressable independently.
Walk the framework through a real artifact in the target repo
Start with capture under a budget. ForensicCapture.maybeCapture is gated before it ever touches the network:
async maybeCapture(req: CaptureRequest): Promise<void> {
  const sev = req.severity ?? 'medium';
  // Guard 1: severity floor.
  if (SEV_ORDER[sev] < SEV_ORDER[this.opts.minSeverity]) return;
  const now = Date.now();
  const last = this.lastByCn.get(req.cn) ?? 0;
  // Guard 2: per-CN cooldown.
  if (now - last < this.opts.perCnCooldownMs) return;
  // Guard 3: in-flight de-dupe.
  if (this.active.has(req.cn)) return;
  this.lastByCn.set(req.cn, now);
  this.active.set(req.cn, now);
  // ...
}
Three guards, all before any work: a severity floor (minSeverity), a per-common-name cooldown (perCnCooldownMs), and an in-flight de-dupe (active). A client that trips an anomaly detector every two seconds gets exactly one capture per cooldown window, not a fork bomb of tcpdump processes. The budget is declared in ForensicCaptureOptions, validated by Zod in src/forensic/types.ts, so durationSec, maxFiles, maxBytes, and snaplen are all positive integers by construction rather than by hope.
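To see the three guards in isolation, here is a minimal sketch of the same gating logic with an injectable clock so it can be exercised without sleeping. The names CaptureGate, GateOptions, tryAcquire, and release are hypothetical and are not the repo's API; the three-level severity ordering is also an assumption made for illustration.

```typescript
// Hypothetical sketch: CaptureGate and GateOptions are illustrative names,
// not the actual ForensicCapture implementation.
type Severity = 'low' | 'medium' | 'high';

const SEV_ORDER: Record<Severity, number> = { low: 0, medium: 1, high: 2 };

interface GateOptions {
  minSeverity: Severity;
  perCnCooldownMs: number;
}

class CaptureGate {
  private readonly lastByCn = new Map<string, number>();
  private readonly active = new Set<string>();
  private readonly opts: GateOptions;
  private readonly now: () => number;

  constructor(opts: GateOptions, now: () => number = () => Date.now()) {
    // Stand-in for schema validation: the budget must be a positive integer.
    if (!Number.isInteger(opts.perCnCooldownMs) || opts.perCnCooldownMs <= 0) {
      throw new RangeError('perCnCooldownMs must be a positive integer');
    }
    this.opts = opts;
    this.now = now;
  }

  /** Returns true and marks the CN as in flight if a capture may start. */
  tryAcquire(cn: string, severity: Severity = 'medium'): boolean {
    // Guard 1: severity floor.
    if (SEV_ORDER[severity] < SEV_ORDER[this.opts.minSeverity]) return false;
    // Guard 2: per-CN cooldown, measured from the last accepted capture.
    const t = this.now();
    const last = this.lastByCn.get(cn);
    if (last !== undefined && t - last < this.opts.perCnCooldownMs) return false;
    // Guard 3: never run two captures for the same CN at once.
    if (this.active.has(cn)) return false;
    this.lastByCn.set(cn, t);
    this.active.add(cn);
    return true;
  }

  /** Call when the capture finishes, whether it succeeded or not. */
  release(cn: string): void {
    this.active.delete(cn);
  }
}
```

A caller would wrap the capture in try/finally and call release in the finally branch, so a failed capture cannot leave a CN marked as in flight forever.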
The capture itself is a bounded tcpdump invocation, not an open-ended stream:
const args = [
  '-i', 'any',
  '-n', '-q',
  '-s', String(this.opts.snaplen),
  '-G', String(this.opts.durationSec),
  '-W', '1',
  '-w', path,
  'host', ip,
];
const started = Date.now();
const proc = spawn(this.opts.tcpdumpBin, args, { stdio: ['ignore', 'ignore', 'pipe'] });
-G durationSec -W 1 tells tcpdump to rotate exactly once after N seconds and stop — a self-terminating capture. -s snaplen caps per-packet bytes so you collect headers and a slice of payload, not full jumbo frames. There is still a setTimeout that sends SIGTERM two seconds past the deadline as a backstop, because trusting an external binary to always exit on time is how you leak processes. Note the exit-code handling: code === 0 || code === 124 || code === null all count as success, because a timeout-killed tcpdump is a normal outcome here, not an error.
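The argument list and the exit-code rule are both pure logic, so they can be pulled out and tested without spawning anything. The sketch below assumes that shape; buildTcpdumpArgs, isCaptureSuccess, and CaptureBudget are hypothetical names, and the validation is a hand-written stand-in for the Zod schema mentioned earlier.

```typescript
// Hypothetical helpers; the flags mirror the excerpt above, the names do not
// come from the repo.
interface CaptureBudget {
  snaplen: number;     // max bytes kept per packet (-s)
  durationSec: number; // rotation interval in seconds (-G)
}

function assertPositiveInt(name: string, value: number): void {
  if (!Number.isInteger(value) || value <= 0) {
    throw new RangeError(`${name} must be a positive integer, got ${value}`);
  }
}

/** Builds argv for one bounded capture of a single host. */
function buildTcpdumpArgs(budget: CaptureBudget, outPath: string, ip: string): string[] {
  assertPositiveInt('snaplen', budget.snaplen);
  assertPositiveInt('durationSec', budget.durationSec);
  return [
    '-i', 'any',
    '-n', '-q',
    '-s', String(budget.snaplen),
    '-G', String(budget.durationSec),
    '-W', '1', // with -G, stop after the first rotated file
    '-w', outPath,
    'host', ip,
  ];
}

/**
 * Treats a clean exit, a timeout-style exit (124), and a signal kill
 * (exit code null in Node's 'close' event) as a successful bounded capture.
 */
function isCaptureSuccess(code: number | null): boolean {
  return code === 0 || code === 124 || code === null;
}
```

Passing the arguments as an array to spawn, rather than building a shell string, also keeps the client IP from ever being interpreted by a shell.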
Now record the pointer, not the payload. After the process closes, the code stats the file for its real size and writes a row:
let bytes = 0;
try { bytes = statSync(path).size; } catch { /* file may have failed before writing */ }
const db = await getDb();
await db.forensicCapture.create({
  data: {
    cn: req.cn,
    ip: req.ip,
    reason: req.reason,
    anomalyId: req.anomalyId ?? null,
    path,
    bytes,
    startedAt: now,
    durationMs: dur,
    createdAt: Date.now(),
  },
});
metrics.forensicCaptures.inc({ outcome: 'ok' });
The ForensicCapture row stores path and bytes, not the packets. That single decision is what lets list(limit) page through thousands of captures cheaply and lets the garbage collector reason about total disk usage without opening files. The reason and anomalyId columns are the chain of custody: every capture is traceable back to the event that triggered it. And metrics.forensicCaptures.inc({ outcome: 'ok' }) makes the capture pipeline observable: you can alert on the error rate instead of discovering a silent capture failure during an incident.
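As a rough illustration of why the pointer row is enough, the sketch below lists captures and totals disk usage from rows alone, never touching a file. CapturePointer, listNewest, and totalBytes are hypothetical names operating on an in-memory array; the real code queries the database instead.

```typescript
// Hypothetical in-memory stand-in for the ForensicCapture table.
interface CapturePointer {
  id: number;
  cn: string;
  path: string;          // where the .pcap lives on local disk
  bytes: number;         // size recorded at capture time
  reason: string;        // why the capture was taken
  anomalyId: string | null;
  createdAt: number;     // epoch milliseconds
}

/** Newest-first listing, as an admin UI might page through captures. */
function listNewest(rows: readonly CapturePointer[], limit: number): CapturePointer[] {
  return [...rows]
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, Math.max(0, limit));
}

/** Disk usage according to the index; no file is opened or stat'ed. */
function totalBytes(rows: readonly CapturePointer[]): number {
  return rows.reduce((sum, r) => sum + r.bytes, 0);
}
```

The cost of this approach is that the index can drift from the disk if a file is removed out of band, which is one reason the eviction loop tolerates a missing file when it unlinks.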
Cross the durability seam explicitly. The storage layer does not bake S3 into the call site; it routes through a provider interface. StorageService constructs a BaseStorageProvider by name:
function defaultCreateProvider(name: StorageProviderType, cfg: S3Config): BaseStorageProvider {
  switch (name) {
    case 'aws-s3':
    case 's3':
      return new AWSS3Provider(cfg)
    case 'cloudflare-r2':
      return new CloudflareR2Provider(cfg)
    case 'digitalocean-spaces':
      return new DigitalOceanSpacesProvider(cfg)
    case 'minio':
      return new MinIOProvider(cfg)
    default: {
      const exhaustive: never = name
      throw new Error(`${STORAGE_MESSAGES.PROVIDER_NOT_FOUND}: ${exhaustive}`)
    }
  }
}
The const exhaustive: never = name line is the seam doing real work: add a provider to the StorageProviderTypeSchema enum in src/storage/storage.enums.ts and forget to wire it here, and the build fails on a type error instead of at runtime in front of an auditor. The abstract BaseStorageProvider in src/storage/providers/base.provider.ts declares four methods — uploadFile, uploadFromUrl, deleteFile, getFileUrl — and every concrete provider, including AWSS3Provider, implements exactly those. MinIO is in the list on purpose: it is an S3-compatible store you can run locally, which means the durability seam is testable without a cloud account. Local versus S3 stops being a code change and becomes a config value (storageProvider in storage.setting.keys.ts).
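The exhaustiveness trick is easy to try on a smaller union. The sketch below is not the repo's factory: StoreKind, StoreDescriptor, describeStore, and parseStoreKind are hypothetical, and the forcePathStyle values reflect a common MinIO setup rather than anything the source states.

```typescript
// Hypothetical two-member union; adding a third member without a matching
// case makes the `never` assignment below a compile-time error.
type StoreKind = 'aws-s3' | 'minio';

const STORE_KINDS: readonly StoreKind[] = ['aws-s3', 'minio'];

interface StoreDescriptor {
  kind: StoreKind;
  forcePathStyle: boolean; // assumption: MinIO is usually addressed path-style
}

function describeStore(kind: StoreKind): StoreDescriptor {
  switch (kind) {
    case 'aws-s3':
      return { kind, forcePathStyle: false };
    case 'minio':
      return { kind, forcePathStyle: true };
    default: {
      // Compile-time check: `kind` must have been narrowed to never here.
      const exhaustive: never = kind;
      throw new Error(`unknown store kind: ${String(exhaustive)}`);
    }
  }
}

/** Runtime counterpart: config values are strings, so validate before narrowing. */
function parseStoreKind(raw: string): StoreKind {
  const found = STORE_KINDS.find((k) => k === raw);
  if (found === undefined) throw new Error(`unsupported store kind: ${raw}`);
  return found;
}
```

The compile-time check only protects code paths that are type-checked, so the runtime parse at the config boundary is still needed for values that arrive as plain strings.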
For the compliance and audit path, the durable side of the seam is BackupService, which talks to the AWS SDK directly because its artifacts are server-internal, not user uploads:
private async uploadBuffer(s3Key: string, body: Buffer, contentType: string): Promise<void> {
  await this.s3.send(new PutObjectCommand({
    Bucket: this.bucket,
    Key: s3Key,
    Body: body,
    ContentType: contentType,
  }))
}
The key layout is deterministic and namespaced: <namespace>/compliance/sessions-YYYY-MM-DD.jsonl, with .sha256 and .tsr sidecars for integrity and timestamping. namespace defaults to the server ID, so several instances can share one bucket without colliding — which matters precisely because the instances are the thing that gets recycled. The evidence outlives any single one of them.
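A deterministic layout means the keys for any given day can be rebuilt from two inputs. The sketch below shows one way to do that. complianceKeys is a hypothetical helper, and it assumes the sidecars append their suffix to the full .jsonl key; the source does not say whether the real layout does that or replaces the extension.

```typescript
// Hypothetical key builder for one day's compliance archive.
interface ComplianceKeys {
  jsonl: string;  // the sealed session log
  sha256: string; // integrity sidecar
  tsr: string;    // timestamp sidecar
}

function complianceKeys(namespace: string, dateYmd: string): ComplianceKeys {
  if (namespace.length === 0) {
    throw new RangeError('namespace must not be empty');
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(dateYmd)) {
    throw new RangeError(`dateYmd must look like YYYY-MM-DD, got ${dateYmd}`);
  }
  const jsonl = `${namespace}/compliance/sessions-${dateYmd}.jsonl`;
  // Assumption: sidecar keys are the archive key plus a suffix.
  return { jsonl, sha256: `${jsonl}.sha256`, tsr: `${jsonl}.tsr` };
}
```

Because the function is pure, upload and deletion can both call it and are guaranteed to agree on the keys, which is what makes keyed deletion possible later.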
Make offload best-effort, never blocking. This is the step most teams get backwards. Look at how BackupService is constructed:
static fromConfig(): BackupService | null {
  if (!config.s3.backupEnabled) return null
  if (!config.s3.bucket) {
    log.warn('backup: S3_BACKUP_ENABLED=true but S3_BUCKET is empty — backup disabled')
    return null
  }
  // ...build S3Client and return new BackupService(...)
}
fromConfig() returns null when backups are off or misconfigured, and per the module contract every caller treats backup as optional and does .catch(log.warn) on the upload. An S3 outage degrades you to local-only evidence; it does not block the seal or the shutdown that produced the artifact. maybeCapture follows the same rule from the other direction — it wraps its whole body and bumps metrics.forensicCaptures.inc({ outcome: 'error' }) on failure rather than throwing, so a capture failure never takes down the anomaly actuator that called it.
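The caller side of that contract can be sketched as a small wrapper that never rejects. Everything here is hypothetical: Uploader stands in for whatever BackupService exposes, and offloadBestEffort is an illustrative helper, not a function from the repo.

```typescript
// Hypothetical caller-side wrapper for an optional, best-effort backup.
interface Uploader {
  upload(key: string, body: Uint8Array): Promise<void>;
}

type Warn = (message: string) => void;

/**
 * Attempts the upload and reports whether it succeeded. Never rejects:
 * a null uploader (backup disabled) or a failed upload both resolve to false.
 */
async function offloadBestEffort(
  uploader: Uploader | null,
  key: string,
  body: Uint8Array,
  warn: Warn,
): Promise<boolean> {
  if (uploader === null) return false; // backup off or misconfigured
  try {
    await uploader.upload(key, body);
    return true;
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    warn(`backup: upload of ${key} failed: ${detail}`);
    return false;
  }
}
```

Returning a boolean instead of void is a design choice in this sketch: it lets the caller count offload failures in a metric without ever having to handle a rejection on the seal or shutdown path.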
Rotate from both ends. Locally, the garbage collector evicts by budget — both file count and total bytes:
let total = 0;
for (const r of rows) total += r.bytes;
let i = 0;
while (i < rows.length && (rows.length - i > this.opts.maxFiles || total > this.opts.maxBytes)) {
  const r = rows[i]!;
  try { unlinkSync(r.path); } catch { /* missing or denied */ }
  await db.forensicCapture.deleteMany({ where: { id: r.id } });
  total -= r.bytes;
  i += 1;
}
Oldest-first eviction stops when both the count and byte budgets are satisfied, and it deletes the row in the same loop so the pointer never outlives the file. Remotely, deletion is keyed, not swept — rotateComplianceDay(dateYmd, backupKeyPrefix) reconstructs the exact .jsonl, .sha256, and .tsr keys and batches them through DeleteObjectsCommand in chunks of 1000 (the S3 per-call limit). Two clocks, two mechanisms: disk pressure drives local eviction, retention policy drives remote deletion.
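Both halves of the rotation reduce to pure functions that are easy to check in isolation. The sketch below separates the eviction decision from the file and database side effects, and shows the batching needed for remote deletes. planEviction, chunk, and S3_DELETE_BATCH are hypothetical names; the input is assumed to be sorted oldest first, as the loop above implies.

```typescript
// Hypothetical pure helpers mirroring the two rotation mechanisms.
interface EvictableRow {
  id: number;
  bytes: number;
}

/** S3 DeleteObjects accepts at most 1000 keys per request. */
const S3_DELETE_BATCH = 1000;

/**
 * Returns the ids to evict, oldest first, until BOTH budgets are satisfied.
 * `rowsOldestFirst` must already be sorted from oldest to newest.
 */
function planEviction(
  rowsOldestFirst: readonly EvictableRow[],
  maxFiles: number,
  maxBytes: number,
): number[] {
  let total = 0;
  for (const r of rowsOldestFirst) total += r.bytes;
  let remaining = rowsOldestFirst.length;
  const evict: number[] = [];
  for (const r of rowsOldestFirst) {
    if (remaining <= maxFiles && total <= maxBytes) break;
    evict.push(r.id);
    total -= r.bytes;
    remaining -= 1;
  }
  return evict;
}

/** Splits a key list into batches small enough for one delete request each. */
function chunk<T>(items: readonly T[], size: number = S3_DELETE_BATCH): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
```

For example, four rows of 100, 200, 300, and 400 bytes with maxFiles 3 and maxBytes 600 evict the three oldest: the count budget is met after one eviction, but the byte budget is not met until the total drops to 400.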
Where the framework fails
This is not a packet-capture pipeline for high-throughput traffic forensics. tcpdump -i any with a single -W 1 rotation is built for short, targeted windows on one host's /32, triggered by an anomaly — not for line-rate mirroring of a busy uplink. If you need continuous full-packet capture, you want a dedicated tap and ring buffer, not this. The framework also leans on the per-CN cooldown to bound cost; an attacker who rotates common names faster than the cooldown can still trigger more captures than you would like, and the maxFiles/maxBytes budget is your only backstop there. Tune those down if your trigger surface is wide.
The best-effort offload rule has a real cost too: if S3 is down and the box is then recycled before it recovers, the evidence is genuinely gone — local-only is exactly as durable as the disk. Best-effort means you accept that gap rather than blocking shutdown to close it. And the provider seam buys portability, not equivalence: AWSS3Provider.uploadFile validates file extension and MIME type against the image-oriented StorageExtensionSchema/StorageMimeTypeSchema enums, so it is the wrong door for arbitrary forensic blobs — that is why BackupService bypasses the provider and calls PutObjectCommand with application/octet-stream directly. One seam for user uploads, one for server-internal evidence. Collapsing them into a single path is the mistake.
Trade-off
The recommendation accepts data loss at the edges in exchange for never blocking the hot path. Because both maybeCapture and the backup upload swallow their failures, you trade a guarantee ("every triggered capture reaches durable storage") for a property ("evidence collection never degrades the system it observes"). That is the right trade for a control plane, where the capture exists because something is already wrong — but it is a trade, not a free win, and it only holds if you actually watch metrics.forensicCaptures and the backup warning logs. A best-effort pipeline you do not monitor is just a pipeline that loses evidence quietly.
Business impact
For the people who answer to auditors and incident reviewers, the difference is concrete. When a regulator or a customer asks "what happened on the night of the 14th," durable evidence with a deterministic key layout means you produce the sealed archive and its .sha256 in minutes, from a bucket that survived every instance recycle since. The alternative — packets that lived on a since-recycled host's local disk — is a sentence you do not want to write in a post-incident report. The capture-under-budget and rotate-from-both-ends steps are what keep that durability from turning into an unbounded storage bill, so the cost of being able to answer the question stays predictable.
What to do next
Open your own capture or backup path and find the seam where ephemeral becomes durable. Then ask three questions of it: Is the producer bounded, so a flood of triggers cannot fill the disk? Is the offload best-effort, so the store being down cannot block the thing that produced the evidence? And can you delete remotely by a deterministic key, or are you sweeping a prefix and hoping? If any answer is "no," that is the seam to fix first. If you want a reference shape, the four-method BaseStorageProvider plus a fromConfig(): T | null that returns null rather than throwing when backup is disabled or misconfigured is a small, copyable pattern.