/proof — Ungated runbook

The runbook we wrote for ourselves.

VOK and VOM are our own virtual-office businesses in Kochi and Mumbai. This is how we run them — the architecture, the modules, the anomaly rules, and the lessons from 6 weeks of iteration on telephony alone.

Last reviewed: 2026-06-03 · ~3,000 words · No email gate · The PDF mirror of this page is on the roadmap.

The stack

Single Oracle Cloud A1.Flex VPS (ARM, Ubuntu 22.04, Asia-South). CloudPanel for nginx + per-site Linux users. MariaDB 10.11 — local-socket only, public port closed. PHP 8.2 + CodeIgniter 3 for the Perfex CRM and its three custom modules. Python 3.11 for the brain services + the WhatsApp Vault importer. Cloudflare in front (SSL Full Strict; this site runs Full for now).

The same box serves two WordPress marketing sites (virtualofficekochi.in, virtualofficemumbai.in), two Perfex CRMs (crm.virtualofficekochi.in, crm.virtualofficemumbai.in), one parish directory app (marymathachurchsakinaka.com, FastAPI + SQLite), and now this site (froshtek.com). All isolated by CloudPanel site users; the parish directory cannot read the CRM's files.

The Perfex CRM modules

Three modules, ~18,000 lines of PHP, ~2,000 lines of versioned SQL: whatsapp_archive, brain_telephony (CallBridge), brain_as_manager (Operations Brain). Each is a normal Perfex module: hooks for install/activate/deactivate/uninstall, capability gates via staff_can(), customer-profile tabs registered through the right hook, and AJAX endpoints that ride the CRM's own $.ajaxSetup() CSRF token (Perfex ships no REST API; we don't fork the core).

~220 automated tests cover the modules — Playwright E2E smoke + a 09-doctype-first.spec.ts invariant that catches the entire class of "module emits HTML before init_head()" bugs that cost us ~6 hours of latent header-overlap regression on 2026-05-08.
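The invariant behind the doctype-first spec fits in a few lines. What follows is a minimal Python sketch of the check, not the Playwright spec itself; the function name and the two sample pages are hypothetical, and tolerating leading whitespace is an assumption.

```python
def emits_doctype_first(body: str) -> bool:
    """Return True if the first non-whitespace content is the doctype declaration."""
    # Drop a byte-order mark and leading whitespace before looking at the markup.
    head = body.lstrip("\ufeff").lstrip()
    return head[:9].lower() == "<!doctype"


# Hypothetical page bodies, for illustration only.
good_page = "<!DOCTYPE html>\n<html><head></head><body>Lead Desk</body></html>"
# A partial rendered before the page head: markup precedes the doctype.
bad_page = "<div class='panel'>partial</div><!DOCTYPE html><html></html>"

good_ok = emits_doctype_first(good_page)
bad_ok = emits_doctype_first(bad_page)
```

Running the same check against every admin page turns one observed bug into a guard for the whole class.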

WhatsApp Vault architecture

The phone is the source of record we don't trust. The vault is the source of record we do. The importer pulls the encrypted msgstore.db + media tree off a business Android device, decrypts on the server, upserts into MariaDB in ordered passes (contacts → group participants → media → messages → reactions → edits → autolink → denormalise → conv-role). It is keyed on WhatsApp's internal message ID; re-running is safe, and an unchanged re-run short-circuits fast.
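The re-run safety described above can be sketched as follows. This is a simplified illustration under stated assumptions: SQLite stands in for MariaDB, the two-table schema and the fingerprint-based short-circuit are hypothetical, and insert-if-absent is used as a reduced form of the upsert.

```python
import hashlib
import sqlite3


def fingerprint(rows):
    """Hypothetical change detector: SHA-256 over the ordered (msg_id, body) rows."""
    h = hashlib.sha256()
    for msg_id, body in rows:
        h.update(f"{msg_id}\x00{body}\x01".encode("utf-8"))
    return h.hexdigest()


def import_messages(conn, rows, source_fingerprint):
    """Idempotently load rows keyed on msg_id; return the number of new rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS messages (msg_id TEXT PRIMARY KEY, body TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS import_state (k TEXT PRIMARY KEY, v TEXT)")

    # Short-circuit: skip all work if the source matches the last successful run.
    last = conn.execute("SELECT v FROM import_state WHERE k = 'fingerprint'").fetchone()
    if last and last[0] == source_fingerprint:
        return 0

    before = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    # Rows whose msg_id already exists are left untouched.
    conn.executemany("INSERT OR IGNORE INTO messages (msg_id, body) VALUES (?, ?)", rows)
    conn.execute(
        "INSERT OR REPLACE INTO import_state (k, v) VALUES ('fingerprint', ?)",
        (source_fingerprint,),
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
batch = [("A1", "hello"), ("A2", "invoice sent")]
first_run = import_messages(conn, batch, fingerprint(batch))     # both rows are new
second_run = import_messages(conn, batch, fingerprint(batch))    # unchanged source
bigger = batch + [("A3", "thanks")]
third_run = import_messages(conn, bigger, fingerprint(bigger))   # one new row
```

Keying on the message ID is what makes the second and third runs safe: existing rows are never duplicated or overwritten.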

Tamper-evidence isn't marketing. The audit log is append-only with each row's SHA-256 chaining the previous; the DB user holding that table is INSERT+SELECT only — no UPDATE, no DELETE, no admin override. The vault tables themselves carry BEFORE DELETE triggers that refuse deletion unless a three-flag session variable is set (the DPDP erasure tool, not a button). Original message text is in immutable columns that no UPDATE statement touches.
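A minimal sketch of the hash-chaining idea in Python. It is illustrative only: the row layout, the GENESIS anchor value and the function names are assumptions, not the module's actual schema.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed anchor for the first row; the real anchor is not documented here


def row_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous row's hash plus a canonical form of this row."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append-only write: each row records and extends the previous hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev, "hash": row_hash(prev, payload)})


def verify(log: list) -> bool:
    """Recompute the chain from the anchor; any edited row breaks it."""
    prev = GENESIS
    for row in log:
        if row["prev"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True


audit_log = []
append(audit_log, {"actor": "importer", "action": "import", "rows": 120})
append(audit_log, {"actor": "staff:7", "action": "view", "chat": "c-42"})
append(audit_log, {"actor": "staff:7", "action": "export", "chat": "c-42"})

# Simulate tampering on a copy: change one field in the middle row.
tampered = [dict(r, payload=dict(r["payload"])) for r in audit_log]
tampered[1]["payload"]["action"] = "delete"

intact_ok = verify(audit_log)
tampered_ok = verify(tampered)
```

The chain only proves that history was not rewritten; the INSERT+SELECT-only grant is what stops the rewrite from being attempted in the first place.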

Numbers from our own deployment: 334,246 messages, 85% of contacts auto-linked to CRM clients, 65% to leads, 13,866 entities extracted from message bodies (617 GSTINs, 1,477 PANs), full re-import in ~2 minutes. A real "missing customer" bug was traced to WhatsApp's @lid privacy-identifier scheme and fixed, recovering ~1,131 previously-dropped chats.

Cloud telephony layer

FastAPI receiver behind nginx. The CRM module places click-to-call requests against a local bridge, which routes through the cloud PBX (Bonvoice today; the CRM reads a generic call table, so the provider is swappable). On an inbound call, the PBX hits a Dynamic API mid-ring and the CRM replies in milliseconds with the best destination — last agent → owner → default-DID agent → fallback, with self-call-loop protection and leave-aware exclusion.
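The fallback chain above can be sketched as a pure function. This is a hedged illustration: the Agent shape, the field names and the way leave status is passed in are assumptions, and the real lookup runs against CRM tables rather than in-memory objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Agent:
    number: str            # hypothetical: the phone number the PBX would ring
    on_leave: bool = False


def pick_destination(
    caller: str,
    last_agent: Optional[Agent],
    owner: Optional[Agent],
    default_did_agent: Optional[Agent],
    fallback: str,
) -> str:
    """Walk last agent -> owner -> default-DID agent, else return the fallback."""
    for candidate in (last_agent, owner, default_did_agent):
        if candidate is None:
            continue
        if candidate.on_leave:            # leave-aware exclusion
            continue
        if candidate.number == caller:    # self-call-loop protection
            continue
        return candidate.number
    return fallback


asha = Agent("+910000000101")
ravi = Agent("+910000000102")
away = Agent("+910000000103", on_leave=True)

normal = pick_destination("+919999999999", asha, ravi, away, "+910000000100")
skip_leave = pick_destination("+919999999999", away, ravi, asha, "+910000000100")
skip_self = pick_destination(asha.number, asha, ravi, away, "+910000000100")
exhausted = pick_destination(asha.number, asha, away, None, "+910000000100")
```

Keeping the decision free of I/O makes every branch of the chain cheap to test, separate from the query that gathers the candidates.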

Routing latency: typically sub-100ms, comfortably inside a documented <500ms per-request budget (parse → one SELECT → one config read → one INSERT → async WebSocket publish). Auth: a shared JWT cache cut auth-token mints by ~75%. Recordings download → content-addressed sha256 cache → Gemini transcription with Agent/Customer speaker diarization. Malayalam-English code-switching transcription confidence ~0.95 on clear audio.
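A content-addressed cache of the kind mentioned above can be sketched in a few lines of Python. The directory layout, the .bin suffix and the function name are assumptions for illustration; only the idea of naming a file by its SHA-256 comes from the text.

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache_dir: Path, data: bytes) -> Path:
    """Write data under its SHA-256 digest; identical bytes map to one file."""
    digest = hashlib.sha256(data).hexdigest()
    path = cache_dir / digest[:2] / f"{digest}.bin"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)  # rename into place so readers never see a partial file
    return path


with tempfile.TemporaryDirectory() as d:
    cache = Path(d)
    first = store(cache, b"fake recording bytes")
    second = store(cache, b"fake recording bytes")   # same content, downloaded twice
    other = store(cache, b"a different recording")
    same_path = first == second
    file_count = sum(1 for _ in cache.rglob("*.bin"))
    name_matches = first.stem == hashlib.sha256(b"fake recording bytes").hexdigest()
```

Because the path is derived from the content, a downstream step such as transcription can be keyed on the same digest and skipped when the recording has been seen before.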

The Operations Brain

A 4-layer architecture. Layer 1: ingestion of WhatsApp + CRM events into raw event tables. Layer 2: distillation into facts, lifecycle events and pricing observations via Gemini 2.5 Flash on Vertex AI. Layer 3: a LanceDB knowledge graph with self-hosted embeddings (sentence-transformers, not an external embedding API — local-first by deliberate choice). Layer 4: the suggestion emitter, which writes cards onto the CRM's desk surfaces.

Three desks are surfaced in production today; Lead Desk is the one we use daily. The Customer Care and Channel Partners desks are in development. Autonomous AI auto-reply (the L1–L6 AI Worker design) is on the roadmap: scaffolded but explicitly paused, gated behind Meta WABA onboarding and Coexistence-mode constraints (groups are not exposed to the API).

Migration runbook (single planned cutover)

On 2026-04-26 we lifted two WordPress sites + two Perfex CRMs from shared cPanel hosting to the self-managed Oracle Cloud VPS. Single planned cutover, no rolling failover. Old /crm/* paths 301-redirect to the new crm.<domain> subdomain so existing bookmarks keep working.

Pre-cutover: full backup, schema dump, media tar; DNS TTLs lowered to 5 minutes; CloudPanel sites pre-provisioned with target docroots. Cutover: A records flipped, certs issued via Let's Encrypt as DNS propagated, application configs swapped to point at the new MariaDB socket. Post-cutover: 301 hosts left in place for the legacy subfolder, smoke spec ran across all admin pages, ~220 tests green.

We don't call it zero-downtime. We call it a single planned cutover with 301 redirects — which is what it was.

The Vertex AI flip

AI Studio was returning ~50% HTTP-503 on our extraction workload for ~2 weeks before we flipped. The same Gemini 2.5 Flash model is also reachable through Vertex AI's asia-south1 region, on a separate fleet. We added an env-var switch (BRAIN_USE_VERTEX_AI=true), created the service account vok-extraction@gen-lang-client-0119917682 and the JSON key at /etc/vok/gcp-vertex-sa.json, and flipped. The HTTP-503 rate dropped from ~50% to ~0.

We keep the AI Studio path wired as a fallback. The audio worker stayed on AI Studio (latency characteristics differ; the long-running streaming use case is happier there).
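The switch-plus-fallback pattern can be sketched like this. It is a hedged illustration, not the brain service's code: only the BRAIN_USE_VERTEX_AI variable comes from the text, the two backends are passed in as plain callables, and BackendUnavailable is a hypothetical exception standing in for a 503-style failure.

```python
import os


class BackendUnavailable(Exception):
    """Hypothetical: raised by a backend callable on a 503-style failure."""


def use_vertex(env=None) -> bool:
    """Read the env-var switch; unset or any other value means AI Studio."""
    env = os.environ if env is None else env
    return env.get("BRAIN_USE_VERTEX_AI", "").strip().lower() in {"1", "true", "yes"}


def extract(prompt, vertex_call, studio_call, env=None):
    """Route to Vertex AI when switched on, keeping AI Studio as the fallback."""
    if not use_vertex(env):
        return studio_call(prompt)
    try:
        return vertex_call(prompt)
    except BackendUnavailable:
        return studio_call(prompt)


# Stub backends for illustration; real calls would go through the provider SDKs.
def healthy_vertex(prompt):
    return f"vertex:{prompt}"


def failing_vertex(prompt):
    raise BackendUnavailable("simulated 503")


def studio(prompt):
    return f"studio:{prompt}"


flag_on = {"BRAIN_USE_VERTEX_AI": "true"}
primary = extract("extract GSTIN", healthy_vertex, studio, env=flag_on)
fell_back = extract("extract GSTIN", failing_vertex, studio, env=flag_on)
flag_off = extract("extract GSTIN", healthy_vertex, studio, env={})
```

Passing the environment in as a mapping keeps the switch testable without touching the process environment.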

The thing this is NOT: an accuracy win. The model is the same. The fix was availability. Frame it that way or you'll set the wrong expectation.

The anomaly catalogue

20+ rules run daily across the operations stack. Every rule is a SQL query in anomaly_check.py plus a one-line description in Brain_diagnostics_model.php. Rule examples (excerpted):

billing_health — Vertex AI canary call; surfaces GCP billing closures inside ~24h instead of the 7-day window that bit us once.
webhook_dedupe_skips_24h — count of CDR rows where the dedupe key matched but the call ID didn't (added after the 2026-05-08 Bonvoice incident).
stuck_initiated_calls — calls stuck in initiated state for > 5 minutes; catches lost CDR webhooks.
whatsapp_archive_stale_24h — last successful importer run > 24h ago.
capsule_fact_value_conflict — two distinct fact values for the same key within the same suite (knowledge-graph integrity check).

Each anomaly fires once per day → RED in the diagnostics tile → a daily narrative pulse emails the owner with the day's red rules + a one-line learning per rule. We added the billing_health rule the day after a billing outage went undetected for 7 days — not before.
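A rule of this shape can be sketched as a named SQL query that returns a count. This is an assumption-laden illustration rather than the contents of anomaly_check.py: SQLite stands in for MariaDB, the calls table and its columns are hypothetical, and the GREEN label is our name for "not red".

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    description: str
    sql: str  # must return one integer; a count above zero means RED


STUCK_INITIATED = Rule(
    name="stuck_initiated_calls",
    description="Calls stuck in initiated state for more than 5 minutes.",
    sql="""
        SELECT COUNT(*) FROM calls
        WHERE status = 'initiated' AND :now - started_at > 300
    """,
)


def run_rules(conn, rules, now):
    """Evaluate each rule once and map it to a tile colour."""
    results = {}
    for rule in rules:
        count = conn.execute(rule.sql, {"now": now}).fetchone()[0]
        results[rule.name] = "RED" if count > 0 else "GREEN"
    return results


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (id INTEGER PRIMARY KEY, status TEXT, started_at INTEGER)")
NOW = 1_000_000  # epoch seconds, fixed so the example is deterministic
conn.executemany(
    "INSERT INTO calls (status, started_at) VALUES (?, ?)",
    [("initiated", NOW - 600), ("initiated", NOW - 60), ("completed", NOW - 900)],
)
red_run = run_rules(conn, [STUCK_INITIATED], NOW)

# Once the stuck call is reconciled, the same rule goes quiet.
conn.execute("UPDATE calls SET status = 'completed' WHERE status = 'initiated'")
green_run = run_rules(conn, [STUCK_INITIATED], NOW)
```

Keeping each rule as data (name, description, SQL) is what makes adding one the day after an incident a small change.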

Backups + restore drill

Five layers designed. Two live today: nightly mariabackup (full DB + WP/Perfex docroots, age-encrypted) + 15-minute binlog shipping (RPO target). The other three layers — offsite replica, cold-storage archive, immutable air-gap — are scoped but not enabled.

Restore is exercised. The most recent drill: an 885 MB archive decrypted, replayed and verified across 8 checks — manifest match, mariabackup coherence, encryption-key readability, WP config validity, exact upload-count match, nginx vhosts present, replay step, and a final HTTP smoke against the restored stack. 8/8 passed.

6 weeks of telephony iteration

9+ versioned ships in ~6 weeks against a live vendor integration. v0.9.3 was a 1-line dedupe-skip fix on 2026-05-08 — caught the day after Bonvoice silently changed webhook semantics. v0.9.4 was the sibling-risk audit: every dedupe key in the receiver got an asymmetry check, and an anomaly rule was added so we'd see the next regression class within 24 hours.

v1.3 was the big one — JWT cache, sentinel-anchor migration so inbound CDRs create tblbrain_calls rows, DTMF column, callBackParentID chain wiring, and WebSocket accept-race protection. The 75% drop in auth-token mints came from v1.3.

What broke (and how)

2026-04-26 Lead Desk header overlap: a sub-agent placed a partial include before init_head(). Caught manually after ~6 hours of latent regression. Fix: the 09-doctype-first Playwright spec across 11 admin pages, plus a Stop hook nudge on partial-emit ships.
2026-05-08 Bonvoice dedupe-skip: the vendor changed webhook semantics; our dedupe key matched the new payload. 1-line fix; new anomaly rule; vendor-change runbook added (a WhatsApp message from the vendor counts as code-equivalent input).
2026-05-18 GCP billing closure: the old GCP billing account closed; Vertex AI returned auth errors for ~7 days before we noticed. New billing_health rule + a budget cap + tiered email alerts.
2026-04-15 callback claim stuck: a missed-call claim path set missed_resolved_at = NOW() at intent time. When the callback failed (no_answer / busy), no release path existed; the row was stuck "Resolved" forever. Fix: split intent state (claimed_by) from outcome state (resolved_at); a release endpoint covers all three failure branches in the 2×2 truth table.
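The intent/outcome split from the callback-claim fix can be sketched as a small state model. This is a hedged illustration: only claimed_by, resolved_at and the no_answer / busy outcomes come from the text; the "answered" outcome name and the function names are assumptions, and the third failure branch is left out because the text does not name it.

```python
from dataclasses import dataclass
from typing import Optional

FAILURES = {"no_answer", "busy"}  # the two failure outcomes named in the text


@dataclass
class MissedCall:
    claimed_by: Optional[str] = None     # intent: who has taken the callback
    resolved_at: Optional[float] = None  # outcome: set only when the callback connects


def claim(call: MissedCall, agent: str) -> bool:
    """Record intent only; never touch resolved_at here."""
    if call.resolved_at is not None or call.claimed_by is not None:
        return False
    call.claimed_by = agent
    return True


def record_outcome(call: MissedCall, outcome: str, now: float) -> None:
    """Resolve on success, release the claim on failure."""
    if outcome == "answered":           # hypothetical success label
        call.resolved_at = now
    elif outcome in FAILURES:
        call.claimed_by = None          # release path: the row is claimable again
    else:
        raise ValueError(f"unknown outcome: {outcome}")


call = MissedCall()
first_claim = claim(call, "asha")
record_outcome(call, "busy", now=100.0)
released = call.claimed_by is None and call.resolved_at is None
second_claim = claim(call, "ravi")
record_outcome(call, "answered", now=200.0)
late_claim = claim(call, "meera")       # already resolved, so refused
```

With the two fields separated, a failed callback can no longer leave a row looking resolved.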

What's next

Customer Care and Channel Partners desks rolling out (Operations Brain). DeadlineGuard shadow mode behind a strict accuracy gate. WorkLens pre-pilot on our own desks first. AI Worker autonomous reply gated until Meta WABA onboarding clears — not before. Three more backup layers (offsite replica, cold-storage archive, immutable air-gap) to bring 2-of-5 up to 5-of-5.

If you want the underlying memo files (the architecture state docs, the design briefs, the anomaly catalogue source), email us — they're on disk, internal, but we'll send the relevant piece for a real conversation.

Want this kind of engineering on your own ops?

If the runbook above resonates — the discipline, the honest failure-mode list, the no-magic anomaly rules — we should talk. The same two people who wrote this take the discovery call.

Book a discovery call · Read the build log

30 minutes · no obligation · we reply within 1 business day