The plumbing exists.
The guarantees don't.
Three audit passes against the codebase. Notifications, security, and the table-stakes UX flows that decide whether the first pilot becomes a reference account or a refund. Ordered by ROI — the work that prevents the most churn for the least engineering.
Notifications are not a feature. They are the contract.
Slack and WhatsApp win on delivery reliability — not on chat features. A user who misses one @mention loses trust in the inbox; miss a second and they pin the trusted app back on top. Today Zunou's notification path is fire-and-forget: no outbox, no retries, no DLQ, no idempotency, no priority. Robust queue + priority + retry + DLQ is non-negotiable for pilot #1, not a future concern. The good news: the primitives mostly exist — they just aren't wired together.
- The notification path is already lossy at zero scale. No outbox, no DLQ, no retries, no idempotency. Every transient failure is a permanent miss — by design.
- Failure trigger is customer team size, not user count. A 200-person team at one pilot customer breaks the same architecture that holds for 2,000 light individual users.
- Three audit passes surfaced 13 pilot blockers across reliability, security, UX, and observability — each with file:line evidence and ROI ranking.
- Most of it is wiring, not invention. An outbox table already exists (for live insights). Priority queues exist in code. Per-channel mute exists in the backend. ~60% of Tier 1 reuses primitives.
- Three weeks of focused work for two engineers gets us pilot-safe. Stages 2–4 (Knock, AI triage, scale infra) stage cleanly as user count and customer team size grow.
How the writeups fit together.
Three documents, one product story. GTM names the destination. Foundation lists the work that must ship before the destination is reachable. Quality keeps it standing once it does.
It's not "at 10k users." It's at the first busy customer.
Failure trigger is customer team size, not total users. A single pilot customer with a 200-person team breaks the same architecture that holds for 2,000 light individual users. Pusher channel throttling (10 msg/sec) hits well before any documented vendor limit.
| Stage | Total users | Largest customer | Peak fan-out / s | Architecture status | User-visible verdict |
|---|---|---|---|---|---|
| Pilot | <100 | ≤50 active | ~4/s | Holds — barely | Each miss is personal |
| Early traction | ~1k | 100 active | ~40/s | Cracks visible | 0.5–1% silent miss rate |
| First scaling event | ~5k | 300 active | ~200/s | Will fail without intervention | 2–5% miss rate during bursts |
| Mid-market | ~10k | 500 active | ~400/s | Catastrophic without rebuild | Multi-hour delivery degradation |
| Scale | ~50k | 1k active | ~2k/s | Unrunnable on current shape | Existential |
Estimates: ~20 user-affecting events per active user per day, avg fan-out 5 recipients per event, 4× peak factor over a JST workday.
The 3-week sprint, ranked by what kills the pilot fastest.
Two parallel tracks. UX/onboarding (Track A) leads because that's where users drop: in the first session, not the first week. Notification reliability (Track C) runs behind it, equally non-negotiable but lower-mortality if delayed by a few days.
- User never starts → Track A.
- User trades back to Slack → Track A.
- User pins Slack on top → Track C.
- Measurable churn → Track C.
Both tracks ship in parallel — engineering capacity supports it. The ordering matters only if we're forced to triage: a user who can't onboard never finds out the notifications are flaky. Security (Track B) sits between them — catastrophic if exploited, but cheap to close, so it ships alongside Track A.
Two filters. A blocker makes the list if (a) its absence leaves the product demoably broken or catastrophically exploitable in pilot, or (b) the engineering cost is so low that not shipping it is unjustifiable. Every item has file:line evidence in the source audit.
≈ 3 weeks for two engineers in parallel
Verify onboarding end-to-end with a fresh pilot account
Onboarding was rebuilt in the last 157 commits and the invite-only wall was removed — but production behavior hasn't been validated by a non-team user. A broken first session means the user never starts. Highest-mortality bug. See full journey →
SignInUserMutation.php · buildPhases.ts · recent 'remove invite-only org flow' commit
Restore web DM flow
DirectMessagesPage import + route both commented out in App.tsx. Web users have no DM path at all. Pilot buyers will demo on a laptop. See in Journey 4 →
dashboard/src/App.tsx:54,213 · VitalsPage.tsx:46
Surface per-channel mute in conversation header
Per-channel noise controls already exist end-to-end (pulse_settings · dm_settings · topic_settings). Just buried in the global notifications panel. #1 reason teams pin Slack back on top.
NotificationsPanelContent.tsx:641–658 · SettingsPerPulse.tsx:342–370
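The dispatch-side check the header toggle would flip is small. A minimal sketch, assuming simplified shapes for the existing pulse_settings / dm_settings / topic_settings rows (field names here are illustrative, not the actual schema):

```typescript
// Hypothetical shapes standing in for pulse_settings / dm_settings / topic_settings.
type ChannelKind = "pulse" | "dm" | "topic";

interface ChannelMuteSettings {
  // Keyed by channel id; true means the user muted that channel.
  muted: Record<string, boolean>;
}

type UserNotificationSettings = Record<ChannelKind, ChannelMuteSettings>;

// Returns false when the user has muted the specific channel the event came
// from, so the mute toggle in the conversation header short-circuits delivery.
function shouldDeliver(
  settings: UserNotificationSettings,
  kind: ChannelKind,
  channelId: string,
): boolean {
  return settings[kind]?.muted[channelId] !== true;
}
```

The point of surfacing this in the header is that the backend decision already exists; only the entry point moves.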
Calendar timezone labels (JST / PST)
EventDetailScreen.tsx has zero TZ display. JP↔US scheduling is the killer case — one screenshot of an off-by-hours invite ends the deal. See in Journey 3 →
EventDetailScreen.tsx · CreateEventSheet.tsx:592
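A minimal sketch of the dual-label rendering, assuming a hardcoded label map (real product copy may differ, e.g. showing PDT during daylight saving):

```typescript
// Illustrative label map — an assumption, not existing product copy.
const TZ_LABEL: Record<string, string> = {
  "Asia/Tokyo": "JST",
  "America/Los_Angeles": "PST",
};

// Formats one instant in a given IANA zone with an explicit label, so a
// JP↔US invite always shows both readings side by side.
function formatInZone(instant: Date, zone: string): string {
  const fmt = new Intl.DateTimeFormat("en-US", {
    timeZone: zone,
    hour: "numeric",
    minute: "2-digit",
    hour12: false,
  });
  return `${fmt.format(instant)} ${TZ_LABEL[zone] ?? zone}`;
}
```

Rendering both zones for every event time is the cheap insurance against the off-by-hours screenshot.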
Live recording indicator + transcript-ready notification
NotetakerTab shows 'Recording Completed' post-hoc but nothing live. Execs don't trust that the bot is running, even when it is.
NotetakerTab.tsx:731–746
Fail-closed JWT validation
auth.mjs:43–44 silently skips JWT validation if AUTH0_DOMAIN env is unset. A misconfigured deploy = totally open notification hub. Same pattern in ai-proxy. See in Journey 1 →
notification-hub/auth.mjs:43–44 · ai-proxy/index.mjs
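A minimal sketch of the fail-closed shape (the AUTH0_AUDIENCE variable name is an assumption for illustration; only AUTH0_DOMAIN appears in the audit):

```typescript
// Fail-closed config load: a missing AUTH0_DOMAIN aborts startup instead of
// silently disabling JWT validation, which is the current auth.mjs behavior.
interface AuthConfig {
  domain: string;
  audience: string;
}

function requireAuthConfig(env: Record<string, string | undefined>): AuthConfig {
  const domain = env.AUTH0_DOMAIN;
  const audience = env.AUTH0_AUDIENCE; // assumed variable name
  if (!domain || !audience) {
    // Better a crashed deploy than an open notification hub.
    throw new Error("Auth misconfigured: AUTH0_DOMAIN and AUTH0_AUDIENCE are required");
  }
  return { domain, audience };
}
```

Calling this once at cold start makes a misconfigured deploy loudly unreachable rather than quietly unauthenticated.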
Google Calendar webhook signature verification
X-Goog-Channel-Token treated as an identity lookup, not a crypto signature. Any caller knowing a user UUID can trigger sync. Auth bypass.
GoogleCalendarWebhookController.php:17–18
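One way to close this, sketched under the assumption that we control what goes into the channel token at watch-registration time: make the token itself an HMAC under a server secret, so knowing a user UUID is no longer enough to forge a call.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// At watch registration: set X-Goog-Channel-Token to an HMAC of the channel id
// under a server-side secret instead of a bare user UUID.
function channelToken(secret: string, channelId: string): string {
  return createHmac("sha256", secret).update(channelId).digest("hex");
}

// On webhook receipt: recompute and compare in constant time.
function verifyChannelToken(secret: string, channelId: string, token: string): boolean {
  const expected = Buffer.from(channelToken(secret, channelId), "hex");
  const got = Buffer.from(token, "hex");
  return got.length === expected.length && timingSafeEqual(got, expected);
}
```

The Laravel controller would do the equivalent with `hash_hmac` + `hash_equals`; the sketch above shows the shape in one place.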
HMAC for API → Lambda calls
Today the API authenticates to the notification Lambda using a user JWT. A compromised JWT can fan notifications to arbitrary recipients. Industry pattern: HMAC for service-to-service (forwarding user JWTs across service boundaries is an anti-pattern).
notification-hub/index.mjs
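A minimal sketch of the service-to-service pattern: sign method, path, timestamp, and body under a shared secret, and reject stale timestamps to block replay. Header names and the skew window are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// API side: shared-secret signature over the request — replaces forwarding a
// user JWT across the API → Lambda boundary.
function signRequest(secret: string, method: string, path: string, ts: number, body: string): string {
  return createHmac("sha256", secret)
    .update(`${method}\n${path}\n${ts}\n${body}`)
    .digest("hex");
}

// Lambda side: reject stale timestamps (replay) and bad signatures.
function verifyRequest(
  secret: string, method: string, path: string, ts: number, body: string,
  signature: string, nowMs: number, maxSkewMs = 5 * 60 * 1000,
): boolean {
  if (Math.abs(nowMs - ts) > maxSkewMs) return false;
  const expected = Buffer.from(signRequest(secret, method, path, ts, body), "hex");
  const got = Buffer.from(signature, "hex");
  return got.length === expected.length && timingSafeEqual(got, expected);
}
```

With this in place, a leaked user JWT buys an attacker nothing at the Lambda boundary.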
PII review on push payloads
expo-push.mjs:523–536 puts message preview text directly into the APNs/FCM payload. Apple + Google US infra sees content. APPI exposure.
notification-hub/expo-push.mjs:523–536
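The standard mitigation is a content-free payload: push carries only ids and a generic title, and the app fetches the preview from our API on open. A sketch with illustrative field names (not the current expo-push.mjs shape):

```typescript
// Illustrative input shape — not the current expo-push.mjs payload.
interface PushInput {
  messageId: string;
  channelId: string;
  previewText: string; // never leaves our infrastructure
}

function buildPushPayload(input: PushInput): Record<string, unknown> {
  return {
    title: "New message", // generic: no sender, no content
    data: { messageId: input.messageId, channelId: input.channelId },
    // deliberately no body field — APNs/FCM (Apple/Google US infra) never sees content
  };
}
```

The trade-off is one extra fetch on notification open, in exchange for removing the APPI exposure entirely.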
Notification outbox + idempotency + DLQ
Today every transient failure is a permanent miss. The biggest reliability win in the audit — but its consequences land on day-7+, not day-1. Extends the existing live_insight_outbox pattern: copy-paste, not invent. Pattern reference: microservices.io · Transactional Outbox.
live_insight_outbox migration · notification-hub Lambda
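A compressed in-memory sketch of the worker semantics the outbox needs: idempotency key dedup, bounded retries, and a DLQ for exhausted rows. Class and field names are illustrative; the real version persists rows by extending the existing live_insight_outbox table.

```typescript
interface OutboxRow {
  idempotencyKey: string; // e.g. `${eventId}:${recipientId}:${channel}`
  payload: unknown;
  attempts: number;
}

class OutboxWorker {
  readonly delivered = new Set<string>();
  readonly dlq: OutboxRow[] = [];
  constructor(
    private send: (payload: unknown) => boolean, // true = delivered
    private maxAttempts = 3,
  ) {}

  process(row: OutboxRow): void {
    if (this.delivered.has(row.idempotencyKey)) return; // duplicate: skip
    while (row.attempts < this.maxAttempts) {
      row.attempts++;
      if (this.send(row.payload)) {
        this.delivered.add(row.idempotencyKey);
        return; // a transient failure became a success, not a permanent miss
      }
    }
    this.dlq.push(row); // retries exhausted: park for inspection, never drop
  }
}
```

This is the whole contract: every event either reaches the recipient or lands somewhere a human can see it.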
Priority lanes for notifications (urgent / normal / digest)
SQS priority lanes already exist in code (onQueue('high') on 9 jobs). Notifications bypass them entirely. Wire them up at the same time as the outbox.
Laravel jobs onQueue('high'|'default') · 9 references
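The routing rule itself is a few lines. A sketch mirroring the existing `onQueue('high'|'default')` convention; the `digest` lane and the classification rules here are assumptions, not current behavior:

```typescript
type Priority = "urgent" | "normal" | "digest";

// Lane names mirror the Laravel onQueue convention; "digest" is hypothetical.
const LANE: Record<Priority, string> = {
  urgent: "high",    // @mentions, DMs — drain first
  normal: "default", // ordinary channel activity
  digest: "digest",  // batched, latency-insensitive
};

// Illustrative classification — the real rules would be richer.
function classify(event: { mentionsRecipient: boolean; isDm: boolean }): Priority {
  return event.mentionsRecipient || event.isDm ? "urgent" : "normal";
}

function laneFor(event: { mentionsRecipient: boolean; isDm: boolean }): string {
  return LANE[classify(event)];
}
```

Wiring this at the same point the outbox enqueues means an @mention never waits behind a digest batch.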
Structured logs + Sentry on every Lambda
console.log plain text today; Sentry only on Laravel. Without structured logs, the SLO dashboard can't be built and incidents can't be queried.
notification-hub/index.mjs · all Lambda services
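The change per Lambda is mechanical: replace plain-text `console.log` with one-line JSON so CloudWatch Logs Insights can filter on fields instead of grepping. A minimal sketch (field set is illustrative):

```typescript
// One JSON object per line — queryable by level, event name, or any field.
function logEvent(
  level: "info" | "warn" | "error",
  event: string,
  fields: Record<string, unknown> = {},
): string {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    level,
    event,
    ...fields,
  });
  console.log(line);
  return line;
}
```

Once every Lambda emits this shape, the SLO dashboard below it becomes a query rather than a project.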
Delivery SLO dashboard + CloudWatch alarms
% of @mentions delivered within 5s, last 24h. One number, checked daily. Today there are zero CloudWatch alarms in any CFN file. Becomes Dashboard #02 →
no AWS::CloudWatch::Alarm in CFN files
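The one number is a straightforward computation over delivery records. A sketch assuming a simplified record shape (in practice these rows would come from the outbox):

```typescript
// Assumed record shape — real data would come from outbox rows.
interface DeliveryRecord {
  sentAt: number;       // epoch ms
  deliveredAt?: number; // undefined = never delivered, counts against the SLO
}

// % of records delivered within thresholdMs (default 5s) over the window.
function mentionSlo(records: DeliveryRecord[], thresholdMs = 5000): number {
  if (records.length === 0) return 100;
  const onTime = records.filter(
    (r) => r.deliveredAt !== undefined && r.deliveredAt - r.sentAt <= thresholdMs,
  ).length;
  return (onTime / records.length) * 100;
}
```

Publishing this as a custom CloudWatch metric gives the alarm something to watch.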
Pilot-pretty (ship if there's slack, not blockers)
After pilot signal: stage by stage, no big-bang.
Each tier is independently shippable. Stop and reassess after every one. Lead time is the work; the trigger is when it must be done.
Stage 2 · Multi-channel infra
Trigger: ~800 users or one customer with 200 active
- Knock for fan-out + retries + digests + in-app feed: Pusher + Beams become Knock providers. Verify JP residency before committing.
- Web Feed parity (port from nova): Knock's React SDK ships the inbox UI. Closes the web/mobile gap.
- LINE channel for JP: procurement-relevant, and Knock includes the provider.
- Pusher cluster ap3 → ap1: Singapore → Tokyo. Closes ~115ms of inter-region RTT for JP users (AWS-measured) and brings notification transport into Japanese residency.
- Settings consolidation (/settings): today scattered across 4+ panels. Single tabbed hub.
- Expo receipt polling: know what was actually delivered, not just accepted.
- Auth0 region migration to JP tenant: or accelerate the WorkOS migration if the SSO ask comes first.
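On receipt polling: Expo acks a push immediately, but only the receipt, polled later from Expo's receipts endpoint, says whether APNs/FCM actually accepted it. A sketch of the triage step once receipts are in hand; the status strings follow Expo's documented receipt shape, while the action labels are ours:

```typescript
interface ExpoReceipt {
  status: "ok" | "error";
  details?: { error?: string }; // e.g. "DeviceNotRegistered"
}

type ReceiptAction = "delivered" | "deactivate-token" | "retry-later";

function triageReceipt(receipt: ExpoReceipt): ReceiptAction {
  if (receipt.status === "ok") return "delivered";
  if (receipt.details?.error === "DeviceNotRegistered") {
    return "deactivate-token"; // stop pushing to a dead install
  }
  return "retry-later"; // transient errors go back through the outbox
}
```

Without this loop, "sent" and "delivered" are indistinguishable, which is exactly the blindness the SLO dashboard is meant to end.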
Stage 3 · AI triage layer
Trigger: ~5k users — the differentiator
- AI triage classifier on every event: urgent / batchable / digestable / auto-resolvable. The Slack/WhatsApp can't-do-this differentiator.
- Voice triage surface ("what did I miss?"): the existing voice agent reads the prioritized Feed aloud.
- Intent-aware quiet hours (urgent breaks through): today push is suppressed flat; no urgent override.
- Inngest (or Trigger.dev) for durable agent workflows: meeting-bot pipeline first. Lambda + RabbitMQ is the wrong tool for stateful loops.
- Per-tenant token / cost accounting in ai-proxy: currently absent; can't rate-limit a noisy customer.
- Prompt-injection sanitization: user content is interpolated raw into system prompts today.
- AI fallback model + circuit breaker: a single fetch with no fallback today.
- Admin i18n (Nova + Dashboard already done): the last surface still in English.
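The quiet-hours override is the simplest piece of this stage to pin down. A sketch of "suppress unless urgent" replacing today's flat suppression; the quiet-window shape and priority type are assumptions:

```typescript
type Priority = "urgent" | "normal" | "digest";

// Assumed quiet-window shape; the window may wrap midnight (e.g. 22 → 7).
interface QuietHours {
  startHour: number; // local hour, inclusive
  endHour: number;   // local hour, exclusive
}

function inQuietWindow(localHour: number, q: QuietHours): boolean {
  return q.startHour <= q.endHour
    ? localHour >= q.startHour && localHour < q.endHour
    : localHour >= q.startHour || localHour < q.endHour; // wraps midnight
}

// Flat suppression becomes intent-aware: urgent always breaks through.
function shouldSuppressPush(priority: Priority, localHour: number, q: QuietHours): boolean {
  if (priority === "urgent") return false;
  return inQuietWindow(localHour, q);
}
```

The triage classifier supplies the priority; this function is just the last gate before push.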
Stage 4 · Reliability at scale
Trigger: ~10k users + first JP enterprise procurement
- Per-tenant queue isolation: noisy-neighbor protection.
- Customer-visible delivery SLAs per priority lane: sales tool + accountability mechanism.
- SSE endpoint for Claude / OpenAI streaming: moves streaming off Pusher. Cost + latency win.
- DR plan documented + tested (RPO / RTO): zero documented today.
- Distributed tracing (OpenTelemetry or X-Ray): cross-service request-flow visibility.
- AssemblyAI residency fix or consent flow: audio leaves JP today; APPI scrutiny grows with size.
Where Zunou gets to be better than Slack and WhatsApp.
Match them on reliability — table stakes. Beat them on triage and voice — newer tech they can't ship without rebuilding. AI as the inbox layer, not just a chat feature.
| Capability | Slack | WhatsApp | Zunou today | Move |
|---|---|---|---|---|
| Delivery reliability | ✓ | ✓✓ | Guarantees missing | Match — Tier 1 |
| Per-(user, event) delivery state | ✓ | ✓ | ✗ | Build via outbox |
| Unified inbox | Partial | ✗ | ✓ on nova | Web parity ↑ |
| AI triage (urgent / FYI / digest) | ✗ | ✗ | ✗ | Win — Stage 3 |
| Voice triage ("what did I miss?") | ✗ | ✗ | Voice agent exists | Plug Feed in — win |
| Intent-aware quiet hours | Partial | Partial | Push-suppress only | Win — Stage 3 |
| Multi-channel (push / email / LINE) | Partial | ✗ | Push + realtime only | Win for JP — Stage 2 |
What we are choosing not to do before pilot.
Each item is a real improvement. None is worth shipping before we know what pilot users actually do.
The bottom line
Three weeks of focused work gets us pilot-safe.
Everything else stages cleanly as scale demands it.
The notification path is already lossy at zero scale and will systematically fail at the first busy customer team. Robust queue + priority + retry + DLQ is non-negotiable for pilot #1, not a future concern. The good news: most of the primitives are already in the codebase. Tier 1 is wiring, not invention.
The plumbing exists. The guarantees don't. That's the entire job before pilot.