The plumbing exists.
The guarantees don't.
Three audit passes against the codebase. Notifications, security, and the table-stakes UX flows that decide whether the first pilot becomes a reference account or a refund. Ordered by ROI — the work that prevents the most churn for the least engineering.
Notifications are not a feature. They are the contract.
Slack and WhatsApp win on delivery reliability — not on chat features. A user who misses one @mention loses trust in the inbox; miss a second and they pin the trusted app back on top. Today Zunou's notification path is fire-and-forget: no outbox, no retries, no DLQ, no idempotency, no priority. Robust queue + priority + retry + DLQ is non-negotiable for pilot #1, not a future concern. The good news: the primitives mostly exist — they just aren't wired together.
- The notification path is already lossy at zero scale. No outbox, no DLQ, no retries, no idempotency. Every transient failure is a permanent miss — by design.
- Failure trigger is customer team size, not user count. A 200-person team at one pilot customer breaks the same architecture that holds for 2,000 light individual users.
- Three audit passes surfaced 13 pilot blockers across reliability, security, UX, and observability — each with file:line evidence and ROI ranking.
- Most of it is wiring, not invention. An outbox table already exists (for live insights). Priority queues exist in code. Per-channel mute exists in the backend. ~60% of Tier 1 reuses primitives.
- Three weeks of focused work for two engineers gets us pilot-safe. Stages 2–4 (Knock, AI triage, scale infra) stage cleanly as user count and customer team size grow.
How the writeups fit together.
Three documents, one product story. GTM names the destination. Foundation lists the work that must ship before the destination is reachable. Quality keeps it standing once it does.
It's not "at 10k users." It's at the first busy customer.
Failure trigger is customer team size, not total users. A single pilot customer with a 200-person team breaks the same architecture that holds for 2,000 light individual users. Pusher channel throttling (10 msg/sec) hits well before any documented vendor limit.
| Stage | Total users | Largest customer | Peak fan-out / s | Architecture status | User-visible verdict |
|---|---|---|---|---|---|
| Pilot | <100 | ≤50 active | ~4/s | Holds — barely | Each miss is personal |
| Early traction | ~1k | 100 active | ~40/s | Cracks visible | 0.5–1% silent miss rate |
| First scaling event | ~5k | 300 active | ~200/s | Will fail without intervention | 2–5% miss rate during bursts |
| Mid-market | ~10k | 500 active | ~400/s | Catastrophic without rebuild | Multi-hour delivery degradation |
| Scale | ~50k | 1k active | ~2k/s | Unrunnable on current shape | Existential |
Estimates: ~20 user-affecting events per active user per day, avg fan-out 5 recipients per event, 4× peak factor over a JST workday.
The 3-week sprint, ranked by what kills the pilot fastest.
Two parallel tracks. UX/onboarding (Track A) leads because that's where users drop: in the first session, not the first week. Notification reliability (Track C) runs behind it, equally non-negotiable but lower-mortality if delayed by a few days.
- User never starts → Track A.
- User trades back to Slack → Track A.
- User pins Slack on top → Track C.
- Measurable churn → Track C.
Both tracks ship in parallel — engineering capacity supports it. The ordering matters only if we're forced to triage: a user who can't onboard never finds out the notifications are flaky. Security (Track B) sits between them — catastrophic if exploited, but cheap to close, so it ships alongside Track A.
Two filters. A blocker makes the list if (a) its absence leaves the product demoably broken or catastrophically exploitable in pilot, or (b) the engineering cost is so low that not shipping it is unjustifiable. Every item has file:line evidence in the source audit.
≈ 3 weeks for two engineers in parallel
Verify onboarding end-to-end with a fresh pilot account
Onboarding was rebuilt in the last 157 commits and the invite-only wall was removed — but production behavior hasn't been validated by a non-team user. A broken first session means the user never starts. Highest-mortality bug. See full journey →
SignInUserMutation.php · buildPhases.ts · recent 'remove invite-only org flow' commit
Restore web DM flow
DirectMessagesPage import + route both commented out in App.tsx. Web users have no DM path at all. Pilot buyers will demo on a laptop. See in Journey 4 →
dashboard/src/App.tsx:54,213 · VitalsPage.tsx:46
Surface per-channel mute in conversation header
Per-channel noise controls already exist end-to-end (pulse_settings · dm_settings · topic_settings). Just buried in the global notifications panel. #1 reason teams pin Slack back on top.
NotificationsPanelContent.tsx:641–658 · SettingsPerPulse.tsx:342–370
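The dispatch-side check the header toggle would flip is small. A minimal sketch, assuming simplified shapes for the existing pulse_settings / dm_settings / topic_settings rows (field names here are illustrative, not the actual schema):

```typescript
// Hypothetical shapes standing in for pulse_settings / dm_settings / topic_settings.
type ChannelKind = "pulse" | "dm" | "topic";

interface ChannelMuteSettings {
  // Keyed by channel id; true means the user muted that channel.
  muted: Record<string, boolean>;
}

type UserNotificationSettings = Record<ChannelKind, ChannelMuteSettings>;

// Returns false when the user has muted the specific channel the event came
// from, so the mute toggle in the conversation header short-circuits delivery.
function shouldDeliver(
  settings: UserNotificationSettings,
  kind: ChannelKind,
  channelId: string,
): boolean {
  return settings[kind]?.muted[channelId] !== true;
}
```

The point of surfacing this in the header is that the backend decision already exists; only the entry point moves.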
Calendar timezone labels (JST / PST)
EventDetailScreen.tsx has zero TZ display. JP↔US scheduling is the killer case — one screenshot of an off-by-hours invite ends the deal. See in Journey 3 →
EventDetailScreen.tsx · CreateEventSheet.tsx:592
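A minimal sketch of the dual-label rendering, assuming a hardcoded label map (real product copy may differ, e.g. showing PDT during daylight saving):

```typescript
// Illustrative label map — an assumption, not existing product copy.
const TZ_LABEL: Record<string, string> = {
  "Asia/Tokyo": "JST",
  "America/Los_Angeles": "PST",
};

// Formats one instant in a given IANA zone with an explicit label, so a
// JP↔US invite always shows both readings side by side.
function formatInZone(instant: Date, zone: string): string {
  const fmt = new Intl.DateTimeFormat("en-US", {
    timeZone: zone,
    hour: "numeric",
    minute: "2-digit",
    hour12: false,
  });
  return `${fmt.format(instant)} ${TZ_LABEL[zone] ?? zone}`;
}
```

Rendering both zones for every event time is the cheap insurance against the off-by-hours screenshot.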
Live recording indicator + transcript-ready notification
NotetakerTab shows 'Recording Completed' post-hoc but nothing live. Execs don't trust that the bot is running, even when it is.
NotetakerTab.tsx:731–746
Fail-closed JWT validation
auth.mjs:43–44 silently skips JWT validation if AUTH0_DOMAIN env is unset. A misconfigured deploy = totally open notification hub. Same pattern in ai-proxy. See in Journey 1 →
notification-hub/auth.mjs:43–44 · ai-proxy/index.mjs
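A minimal sketch of the fail-closed shape (the AUTH0_AUDIENCE variable name is an assumption for illustration; only AUTH0_DOMAIN appears in the audit):

```typescript
// Fail-closed config load: a missing AUTH0_DOMAIN aborts startup instead of
// silently disabling JWT validation, which is the current auth.mjs behavior.
interface AuthConfig {
  domain: string;
  audience: string;
}

function requireAuthConfig(env: Record<string, string | undefined>): AuthConfig {
  const domain = env.AUTH0_DOMAIN;
  const audience = env.AUTH0_AUDIENCE; // assumed variable name
  if (!domain || !audience) {
    // Better a crashed deploy than an open notification hub.
    throw new Error("Auth misconfigured: AUTH0_DOMAIN and AUTH0_AUDIENCE are required");
  }
  return { domain, audience };
}
```

Calling this once at cold start makes a misconfigured deploy loudly unreachable rather than quietly unauthenticated.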
Google Calendar webhook signature verification
X-Goog-Channel-Token treated as an identity lookup, not a crypto signature. Any caller knowing a user UUID can trigger sync. Auth bypass.
GoogleCalendarWebhookController.php:17–18
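One way to close this, sketched under the assumption that we control what goes into the channel token at watch-registration time: make the token itself an HMAC under a server secret, so knowing a user UUID is no longer enough to forge a call.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// At watch registration: set X-Goog-Channel-Token to an HMAC of the channel id
// under a server-side secret instead of a bare user UUID.
function channelToken(secret: string, channelId: string): string {
  return createHmac("sha256", secret).update(channelId).digest("hex");
}

// On webhook receipt: recompute and compare in constant time.
function verifyChannelToken(secret: string, channelId: string, token: string): boolean {
  const expected = Buffer.from(channelToken(secret, channelId), "hex");
  const got = Buffer.from(token, "hex");
  return got.length === expected.length && timingSafeEqual(got, expected);
}
```

The Laravel controller would do the equivalent with `hash_hmac` + `hash_equals`; the sketch above shows the shape in one place.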
HMAC for API → Lambda calls
Today the API authenticates to the notification Lambda using a user JWT. A compromised JWT can fan notifications to arbitrary recipients. Industry pattern: HMAC for service-to-service (forwarding user JWTs across service boundaries is an anti-pattern).
notification-hub/index.mjs
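A minimal sketch of the service-to-service pattern: sign method, path, timestamp, and body under a shared secret, and reject stale timestamps to block replay. Header names and the skew window are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// API side: shared-secret signature over the request — replaces forwarding a
// user JWT across the API → Lambda boundary.
function signRequest(secret: string, method: string, path: string, ts: number, body: string): string {
  return createHmac("sha256", secret)
    .update(`${method}\n${path}\n${ts}\n${body}`)
    .digest("hex");
}

// Lambda side: reject stale timestamps (replay) and bad signatures.
function verifyRequest(
  secret: string, method: string, path: string, ts: number, body: string,
  signature: string, nowMs: number, maxSkewMs = 5 * 60 * 1000,
): boolean {
  if (Math.abs(nowMs - ts) > maxSkewMs) return false;
  const expected = Buffer.from(signRequest(secret, method, path, ts, body), "hex");
  const got = Buffer.from(signature, "hex");
  return got.length === expected.length && timingSafeEqual(got, expected);
}
```

With this in place, a leaked user JWT buys an attacker nothing at the Lambda boundary.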
PII review on push payloads
expo-push.mjs:523–536 puts message preview text directly into the APNs/FCM payload. Apple + Google US infra sees content. APPI exposure.
notification-hub/expo-push.mjs:523–536
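The standard mitigation is a content-free payload: push carries only ids and a generic title, and the app fetches the preview from our API on open. A sketch with illustrative field names (not the current expo-push.mjs shape):

```typescript
// Illustrative input shape — not the current expo-push.mjs payload.
interface PushInput {
  messageId: string;
  channelId: string;
  previewText: string; // never leaves our infrastructure
}

function buildPushPayload(input: PushInput): Record<string, unknown> {
  return {
    title: "New message", // generic: no sender, no content
    data: { messageId: input.messageId, channelId: input.channelId },
    // deliberately no body field — APNs/FCM (Apple/Google US infra) never sees content
  };
}
```

The trade-off is one extra fetch on notification open, in exchange for removing the APPI exposure entirely.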
Notification outbox + idempotency + DLQ
Today every transient failure is a permanent miss. The biggest reliability win in the audit — but its consequences land on day-7+, not day-1. Extends the existing live_insight_outbox pattern: copy-paste, not invent. Pattern reference: microservices.io · Transactional Outbox.
live_insight_outbox migration · notification-hub Lambda
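A compressed in-memory sketch of the worker semantics the outbox needs: idempotency key dedup, bounded retries, and a DLQ for exhausted rows. Class and field names are illustrative; the real version persists rows by extending the existing live_insight_outbox table.

```typescript
interface OutboxRow {
  idempotencyKey: string; // e.g. `${eventId}:${recipientId}:${channel}`
  payload: unknown;
  attempts: number;
}

class OutboxWorker {
  readonly delivered = new Set<string>();
  readonly dlq: OutboxRow[] = [];
  constructor(
    private send: (payload: unknown) => boolean, // true = delivered
    private maxAttempts = 3,
  ) {}

  process(row: OutboxRow): void {
    if (this.delivered.has(row.idempotencyKey)) return; // duplicate: skip
    while (row.attempts < this.maxAttempts) {
      row.attempts++;
      if (this.send(row.payload)) {
        this.delivered.add(row.idempotencyKey);
        return; // a transient failure became a success, not a permanent miss
      }
    }
    this.dlq.push(row); // retries exhausted: park for inspection, never drop
  }
}
```

This is the whole contract: every event either reaches the recipient or lands somewhere a human can see it.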
Priority lanes for notifications (urgent / normal / digest)
SQS priority lanes already exist in code (onQueue('high') on 9 jobs). Notifications bypass them entirely. Wire them up at the same time as the outbox.
Laravel jobs onQueue('high'|'default') · 9 references
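The routing rule itself is a few lines. A sketch mirroring the existing `onQueue('high'|'default')` convention; the `digest` lane and the classification rules here are assumptions, not current behavior:

```typescript
type Priority = "urgent" | "normal" | "digest";

// Lane names mirror the Laravel onQueue convention; "digest" is hypothetical.
const LANE: Record<Priority, string> = {
  urgent: "high",    // @mentions, DMs — drain first
  normal: "default", // ordinary channel activity
  digest: "digest",  // batched, latency-insensitive
};

// Illustrative classification — the real rules would be richer.
function classify(event: { mentionsRecipient: boolean; isDm: boolean }): Priority {
  return event.mentionsRecipient || event.isDm ? "urgent" : "normal";
}

function laneFor(event: { mentionsRecipient: boolean; isDm: boolean }): string {
  return LANE[classify(event)];
}
```

Wiring this at the same point the outbox enqueues means an @mention never waits behind a digest batch.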
Structured logs + Sentry on every Lambda
console.log plain text today; Sentry only on Laravel. Without structured logs, the SLO dashboard can't be built and incidents can't be queried.
notification-hub/index.mjs · all Lambda services
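The change per Lambda is mechanical: replace plain-text `console.log` with one-line JSON so CloudWatch Logs Insights can filter on fields instead of grepping. A minimal sketch (field set is illustrative):

```typescript
// One JSON object per line — queryable by level, event name, or any field.
function logEvent(
  level: "info" | "warn" | "error",
  event: string,
  fields: Record<string, unknown> = {},
): string {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    level,
    event,
    ...fields,
  });
  console.log(line);
  return line;
}
```

Once every Lambda emits this shape, the SLO dashboard below it becomes a query rather than a project.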
Delivery SLO dashboard + CloudWatch alarms
% of @mentions delivered within 5s, last 24h. One number, checked daily. Today there are zero CloudWatch alarms in any CFN file. Becomes Dashboard #02 →
no AWS::CloudWatch::Alarm in CFN files
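The one number is a straightforward computation over delivery records. A sketch assuming a simplified record shape (in practice these rows would come from the outbox):

```typescript
// Assumed record shape — real data would come from outbox rows.
interface DeliveryRecord {
  sentAt: number;       // epoch ms
  deliveredAt?: number; // undefined = never delivered, counts against the SLO
}

// % of records delivered within thresholdMs (default 5s) over the window.
function mentionSlo(records: DeliveryRecord[], thresholdMs = 5000): number {
  if (records.length === 0) return 100;
  const onTime = records.filter(
    (r) => r.deliveredAt !== undefined && r.deliveredAt - r.sentAt <= thresholdMs,
  ).length;
  return (onTime / records.length) * 100;
}
```

Publishing this as a custom CloudWatch metric gives the alarm something to watch.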
Pilot-pretty (ship if there's slack, not blockers)
After pilot signal: stage by stage, no big-bang.
Each tier is independently shippable. Stop and reassess after every one. Lead time is the work; the trigger is when it must be done.
Stage 2 · Multi-channel infra
Trigger: ~800 users or one customer with 200 active
- Knock for fan-out + retries + digests + in-app feed: Pusher + Beams become Knock providers. Verify JP residency before committing.
- Web Feed parity (port from nova): Knock's React SDK ships the inbox UI. Closes the web/mobile gap.
- LINE channel for JP: procurement-relevant, and Knock includes the provider.
- Pusher cluster ap3 → ap1: Singapore → Tokyo. Closes ~115ms of inter-region RTT for JP users (AWS-measured) and brings notification transport into Japanese residency.
- Settings consolidation (/settings): today scattered across 4+ panels. Single tabbed hub.
- Expo receipt polling: know what was actually delivered, not just accepted.
- Auth0 region migration to JP tenant: or accelerate the WorkOS migration if the SSO ask comes first.
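On receipt polling: Expo acks a push immediately, but only the receipt, polled later from Expo's receipts endpoint, says whether APNs/FCM actually accepted it. A sketch of the triage step once receipts are in hand; the status strings follow Expo's documented receipt shape, while the action labels are ours:

```typescript
interface ExpoReceipt {
  status: "ok" | "error";
  details?: { error?: string }; // e.g. "DeviceNotRegistered"
}

type ReceiptAction = "delivered" | "deactivate-token" | "retry-later";

function triageReceipt(receipt: ExpoReceipt): ReceiptAction {
  if (receipt.status === "ok") return "delivered";
  if (receipt.details?.error === "DeviceNotRegistered") {
    return "deactivate-token"; // stop pushing to a dead install
  }
  return "retry-later"; // transient errors go back through the outbox
}
```

Without this loop, "sent" and "delivered" are indistinguishable, which is exactly the blindness the SLO dashboard is meant to end.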
Stage 3 · AI triage layer
Trigger: ~5k users — the differentiator
- AI triage classifier on every event: urgent / batchable / digestable / auto-resolvable. The Slack/WhatsApp can't-do-this differentiator.
- Voice triage surface ("what did I miss?"): the existing voice agent reads the prioritized Feed aloud.
- Intent-aware quiet hours (urgent breaks through): today push is suppressed flat; no urgent override.
- Inngest (or Trigger.dev) for durable agent workflows: meeting-bot pipeline first. Lambda + RabbitMQ is the wrong tool for stateful loops.
- Per-tenant token / cost accounting in ai-proxy: currently absent; can't rate-limit a noisy customer.
- Prompt-injection sanitization: user content is interpolated raw into system prompts today.
- AI fallback model + circuit breaker: a single fetch with no fallback today.
- Admin i18n (Nova + Dashboard already done): the last surface still in English.
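The quiet-hours override is the simplest piece of this stage to pin down. A sketch of "suppress unless urgent" replacing today's flat suppression; the quiet-window shape and priority type are assumptions:

```typescript
type Priority = "urgent" | "normal" | "digest";

// Assumed quiet-window shape; the window may wrap midnight (e.g. 22 → 7).
interface QuietHours {
  startHour: number; // local hour, inclusive
  endHour: number;   // local hour, exclusive
}

function inQuietWindow(localHour: number, q: QuietHours): boolean {
  return q.startHour <= q.endHour
    ? localHour >= q.startHour && localHour < q.endHour
    : localHour >= q.startHour || localHour < q.endHour; // wraps midnight
}

// Flat suppression becomes intent-aware: urgent always breaks through.
function shouldSuppressPush(priority: Priority, localHour: number, q: QuietHours): boolean {
  if (priority === "urgent") return false;
  return inQuietWindow(localHour, q);
}
```

The triage classifier supplies the priority; this function is just the last gate before push.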
Stage 4 · Reliability at scale
Trigger: ~10k users + first JP enterprise procurement
- Per-tenant queue isolation: noisy-neighbor protection.
- Customer-visible delivery SLAs per priority lane: sales tool + accountability mechanism.
- SSE endpoint for Claude / OpenAI streaming: moves streaming off Pusher. Cost + latency win.
- DR plan documented + tested (RPO / RTO): zero documented today.
- Distributed tracing (OpenTelemetry or X-Ray): cross-service request-flow visibility.
- AssemblyAI residency fix or consent flow: audio leaves JP today; APPI scrutiny grows with size.
Where Zunou gets to be better than Slack and WhatsApp.
Match them on reliability — table stakes. Beat them on triage and voice — newer tech they can't ship without rebuilding. AI as the inbox layer, not just a chat feature.
| Capability | Slack | WhatsApp | Zunou today | Move |
|---|---|---|---|---|
| Delivery reliability | ✓ | ✓✓ | Guarantees missing | Match — Tier 1 |
| Per-(user, event) delivery state | ✓ | ✓ | ✗ | Build via outbox |
| Unified inbox | Partial | ✗ | ✓ on nova | Web parity ↑ |
| AI triage (urgent / FYI / digest) | ✗ | ✗ | ✗ | Win — Stage 3 |
| Voice triage ("what did I miss?") | ✗ | ✗ | Voice agent exists | Plug Feed in — win |
| Intent-aware quiet hours | Partial | Partial | Push-suppress only | Win — Stage 3 |
| Multi-channel (push / email / LINE) | Partial | ✗ | Push + realtime only | Win for JP — Stage 2 |
What we are choosing not to do before pilot.
Each item is a real improvement. None is worth shipping before we know what pilot users actually do.
The bottom line
Three weeks of focused work gets us pilot-safe.
Everything else stages cleanly as scale demands it.
The notification path is already lossy at zero scale and will systematically fail at the first busy customer team. Robust queue + priority + retry + DLQ is non-negotiable for pilot #1, not a future concern. The good news: most of the primitives are already in the codebase. Tier 1 is wiring, not invention.
The plumbing exists. The guarantees don't. That's the entire job before pilot.