Redis hot, SQLite durable: a two-tier cache that survives a control-plane restart

7/28/2025Backend Development•OpenVPN Control Plane•410 views•12 min read•by Kuray Karaaslan

Redis hot, SQLite durable: a two-tier cache that survives a control-plane restart

Redis answers in microseconds but forgets on restart. An embedded SQLite layer underneath turns the cache into a fallback that still knows the answer when Redis is cold.

The decision and why it cannot be deferred

A VPN control plane sits in the request path for every connection decision. When a client tries to authenticate or asks for a network assignment, the control plane calls an upstream API, and that call is the slow part. Caching the response is not optional. The real question is what happens to that cache when the process restarts or Redis is unreachable — because in this kind of system, a cold cache is not a performance dip, it is a thundering herd of auth requests hitting the upstream API at the exact moment you can least afford it.

The repo vpn-control-plane settles this with a single class in src/cache/Cache.ts. Two storage layers, one interface: Redis as the hot tier, an embedded SQLite database as the durable tier. The class header states the contract plainly:

/**
 * Two-layer cache:
 *   - Redis: hot, shared across instances (primary)
 *   - SQLite: durable fallback, also useful when Redis is down
 *
 * Reads: Redis → SQLite → miss
 * Writes: both layers, best-effort on Redis
 *
 * Keys are namespaced via config.storage.redisKeyPrefix.
 */
export class Cache {
  private redis: Redis;
  private redisOk = true;

The redisOk flag is the whole design in one boolean. It is the thing that lets every method downgrade gracefully instead of throwing. The dependency line that makes this possible is in package.json: ioredis@^5.4.1 for the hot tier, better-sqlite3@^12.4.0 for the durable tier, accessed through Prisma (@prisma/client@^7.8.0). You are choosing between three plausible designs here, and the choice has consequences that show up at 3am, not at code review.

Option A: Redis only

The default reach is Redis alone. It is fast, it is shared across instances, and the operational story is well understood. The cost is amnesia. Redis is an in-memory store; a restart, an eviction under maxmemory, or a network partition empties the answer set. For a cache of blog fragments that does not matter. For a cache that fronts authentication and network-assignment decisions, an empty cache means every connecting client generates an upstream call simultaneously.

The connection setup in the constructor shows how seriously the repo treats Redis being absent:

this.redis = new Redis(config.storage.redisUrl, {
  lazyConnect: true,
  maxRetriesPerRequest: 1,
  enableOfflineQueue: false,
  retryStrategy: (times) => Math.min(1000 * 2 ** Math.min(times, 5), 30_000),
});
this.redis.on('error', (err) => {
  if (this.redisOk) log.warn({ err: err.message }, 'cache: redis error (falling back to sqlite)');
  this.redisOk = false;
});
this.redis.on('ready', () => {
  if (!this.redisOk) log.info('cache: redis reconnected');
  this.redisOk = true;
});

enableOfflineQueue: false and maxRetriesPerRequest: 1 are the tell. A naive ioredis client buffers commands while disconnected and replays them on reconnect, which means a get during an outage hangs instead of failing fast. Here, a command during an outage fails immediately, the error handler flips redisOk to false, and the code moves on. Redis-only would skip the second tier entirely and accept the amnesia. This repo does not.

Option B: SQLite only

The opposite extreme is to drop Redis and cache everything in the embedded SQLite file. It survives restarts by definition, it has no separate service to operate, and better-sqlite3 through the Prisma adapter is genuinely fast for point lookups. The schema is trivial — three columns in prisma/schema.prisma:

model CacheKv {
  k         String @id
  v         String
  expiresAt Int    @map("expires_at")

  @@index([expiresAt], map: "idx_cache_kv_expires")
  @@map("cache_kv")
}

A string primary key, a JSON-serialized value, a unix-seconds expiry, and an index on expires_at so the garbage collector can sweep efficiently. That is a complete cache table.

What SQLite-only costs you is sharing. The file lives on one box. The moment you run more than one control-plane instance — and PM2 ecosystem config showed up in a recent commit, so multiple workers are in play — each process has its own SQLite file and its own view of the cache. Two workers would each call the upstream API for the same key, and an invalidation on one would not reach the other. SQLite is durable but local. That is the wall you hit, and it is why neither extreme wins outright.

The combined design: Redis in front, SQLite underneath

The repo keeps both and orders them. The read path tries Redis first, falls through to SQLite, and only then reports a miss:

async get<T = unknown>(key: string): Promise<T | null> {
  const kind = key.split(':', 1)[0] ?? 'other';
  const k = this.k(key);
  if (this.redisOk) {
    try {
      const v = await this.redis.get(k);
      if (v != null) {
        metrics.cacheHits.inc({ layer: 'redis', kind });
        return JSON.parse(v) as T;
      }
    } catch (err) {
      log.debug({ err }, 'cache: redis get failed');
    }
  }
  const v = await this.sqliteGet<T>(key);
  if (v != null) {
    metrics.cacheHits.inc({ layer: 'sqlite', kind });
  } else {
    metrics.cacheMisses.inc({ kind });
  }
  return v;
}

Two details earn their place. First, the redisOk guard means a healthy SQLite read never waits on a dead Redis socket — when Redis is down the code does not even attempt the network call. Second, the hit metric carries a layer label, so the dashboard can tell a Redis hit from a SQLite hit. If your SQLite-hit rate climbs, Redis is flapping and you find out from a graph instead of from the upstream API's rate limiter.

The write path is where durability is actually bought. Every set writes through to SQLite unconditionally; Redis is best-effort:

async set<T>(key: string, value: T, ttlSeconds: number): Promise<void> {
  const kind = key.split(':', 1)[0] ?? 'other';
  const k = this.k(key);
  const json = JSON.stringify(value);
  const expiresAt = Math.floor(Date.now() / 1000) + ttlSeconds;

  if (this.redisOk) {
    try { await this.redis.set(k, json, 'EX', ttlSeconds); }
    catch (err) { log.debug({ err }, 'cache: redis set failed'); }
  }
  const db = await getDb();
  await db.cacheKv.upsert({
    where: { k: key },
    create: { k: key, v: json, expiresAt },
    update: { v: json, expiresAt },
  });
  metrics.cacheSets.inc({ kind });
}

Notice the asymmetry. Redis uses native EX expiry — it expires the key itself, no application code needed. SQLite stores expiresAt as an absolute unix-seconds timestamp computed once at write time, and the application enforces it on read. The Redis set is wrapped in a try/catch that only logs; a failed Redis write does not fail the operation, because SQLite already has the durable copy. That single design choice — SQLite is authoritative, Redis is an accelerator — is what makes a restart cheap.

TTL, invalidation, and the cold-start that does not stampede

TTL is enforced in two different places by two different mechanisms, and conflating them is the classic mistake. Redis self-expires via EX. SQLite cannot, so the read path checks expiry and prunes lazily on access:

private async sqliteGet<T>(key: string): Promise<T | null> {
  const now = Math.floor(Date.now() / 1000);
  const db = await getDb();
  const row = await db.cacheKv.findUnique({ where: { k: key } });
  if (!row) return null;
  if (row.expiresAt <= now) {
    await db.cacheKv.deleteMany({ where: { k: key } });
    return null;
  }
  try { return JSON.parse(row.v) as T; } catch { return null; }
}

An expired SQLite row is treated as a miss and deleted on the spot, so a stale answer never escapes. Lazy pruning alone would leak rows that are never read again, so start() schedules a sweep every 60 seconds against the expires_at index, and the timer is unref()'d so it never holds the process open during shutdown:

async start(): Promise<void> {
  try {
    await this.redis.connect();
  } catch (err) {
    log.warn({ err }, 'cache: redis initial connect failed; continuing with sqlite only');
  }
  // Periodic GC of expired sqlite rows
  setInterval(() => { void this.gc(); }, 60_000).unref();
}

Active invalidation is glob-based and crosses both tiers. The admin-triggered delMatching scans Redis with a cursor and translates the same glob into a SQL LIKE for SQLite:

const like = pattern.replaceAll('\\', '\\\\').replaceAll('%', '\\%').replaceAll('_', '\\_').replaceAll('*', '%').replaceAll('?', '_');
const db = await getDb();
const res = await db.$executeRaw`DELETE FROM cache_kv WHERE k LIKE ${like} ESCAPE '\\'`;
return res ?? 0;

The *-to-% and ?-to-_ translation, with % and _ in the user pattern escaped first, is the kind of detail that gets missed and then silently over-deletes half the cache. The test suite pins it: delMatching('auth:*') removes auth:alice and auth:bob but leaves net:alice untouched.

This is also where the cold-start payoff lands. When the control plane restarts with an empty Redis, the first reads fall through to SQLite, repopulate, and the upstream API never sees the herd. The test that proves the fallback works does so with redisOk forced false and the Redis socket disconnected — set then get still round-trips through SQLite alone. The same suite pins the key-isolation invariant that makes this safe: Redis keys are namespaced with config.storage.redisKeyPrefix, but SQLite stores the un-prefixed key, and a dedicated test asserts the row lands under auth:zoe, not test:auth:zoe. Mixing those up would mean the durable tier could never find what the hot tier wrote.

One precision matters here: "SQLite" in this design is not raw better-sqlite3. The repo drives the embedded database through Prisma's better-sqlite3 adapter in src/db/sqlite.ts, with the connection memoized so every caller shares one client, and the pragmas set explicitly on connect:

const adapter = new PrismaBetterSqlite3({ url: `file:${config.storage.sqlitePath}` });
const client = new PrismaClient({ adapter });
await client.$connect();
await client.$executeRawUnsafe('PRAGMA journal_mode = WAL');
await client.$executeRawUnsafe('PRAGMA synchronous = NORMAL');
await client.$executeRawUnsafe('PRAGMA busy_timeout = 30000');

WAL mode lets readers and the GC sweep run concurrently with writes instead of serializing behind a global lock, and the 30-second busy timeout means a set that collides with the sweep waits rather than throwing SQLITE_BUSY. Those two pragmas are what keep the write-through tier from becoming the bottleneck the whole design is trying to avoid. The cache's two named consumers — the decision engine that caches auth and network responses, and the DHCP handler that caches MAC lookups — both go through this one Cache instance, so the durability guarantee is system-wide, not per-feature.

Trade-off

The recommendation is the two-tier design, and it accepts a real cost: every write is a double write, and the two tiers can briefly disagree. A set always pays a SQLite upsert even when Redis is healthy and would have sufficed, so write latency is bounded by the slower of the two stores. The WAL pragma in src/db/sqlite.ts (journal_mode = WAL, synchronous = NORMAL, busy_timeout = 30000) keeps that write cheap and concurrent, but it is not free. There is also a consistency seam: if Redis takes a write and the SQLite write then fails, or vice versa, the tiers diverge until the next set or the TTL expires. The design tolerates this on purpose — caches are allowed to be slightly stale, and TTL is the backstop. What it refuses to tolerate is losing the answer entirely on restart. That is the trade it makes, and for an auth-path cache it is the right one. If your cached data were the system of record rather than a copy of an upstream answer, this asymmetry would be unacceptable and you would reach for replication instead.

Business impact

For the people paying for the system, this is uptime during the worst five minutes. A control-plane restart or a Redis hiccup is exactly when a single-tier cache turns into an outage: the cache empties, every connecting user funnels their auth request to the upstream API at once, and the API either rate-limits you or falls over, and now nobody can connect. The durable SQLite tier means a restart is a non-event — the answers are still on disk, reads keep being served, and the upstream API sees a trickle instead of a flood. That is the difference between a deploy you do during business hours and one you schedule for 2am with a rollback plan.

What to do next

If you run a cache in front of anything in the auth or connection path, ask one question of it: what is your traffic to the origin in the first sixty seconds after a restart? If the honest answer is "the same as a cold start," you have a single-tier cache wearing a durability costume. The fix is small — a write-through to any embedded store with an absolute-timestamp expiry column and a lazy prune on read. The whole pattern fits in one class. Copy the get and set above, point the second tier at SQLite or your durable store of choice, and measure the SQLite-hit metric after your next deploy. If it ever spikes, your hot tier is flapping and you will know before your users do.

Redis hot, SQLite durable: a two-tier cache that survives a control-plane restart

Redis hot, SQLite durable: a two-tier cache that survives a control-plane restart

The decision and why it cannot be deferred

Option A: Redis only

Option B: SQLite only

The combined design: Redis in front, SQLite underneath

TTL, invalidation, and the cold-start that does not stampede

Trade-off

Business impact

What to do next

Related Articles

Comments (0)

Newsletter