
Stoka Bot — Phase 2 Plan: RAG API

Status: plan for execution
Branch: cut feature branch stoka/phase2-rag-api at execution
Upstream: Phase 0 (compose infra, pgvector schema) + Phase 1 (admin/stoka_indexer.py, 860 chunks indexed) — both shipped.

Mission

Build /api/stoka/chat (SSE stream), /api/stoka/discover, /api/stoka/context in the existing FastAPI admin service (admin/main.py). Full RAG pipeline: embed → pgvector → rerank (editorial) → generate. 5-layer personality prompt. Response router. Per-session state with confidence thresholds and semaphores.

Out of scope: UI/Astro island (Phase 3). Interest model (Phase 4). Ambient mode (Phase 5). Telemetry bridge (Phase 6).

Reality Check — Phase 0/1 Deltas From Spec

The spec in specs/stoka-bot.md names ports 8052/8053 and Qwen3 embed/rerank. Phase 0 overrode both. Do not re-litigate. Use what is running:

| Component | Spec says | Actually deployed | Source |
|---|---|---|---|
| Embedding model | Qwen/Qwen3-Embedding-0.6B | BAAI/bge-large-en-v1.5 (1024-dim) | docker-compose.yml:53-85 |
| Embedding port | 8052 | 8072 (TEI /embed) | docker-compose.yml:63 |
| Reranker model | Qwen/Qwen3-Reranker-0.6B | BAAI/bge-reranker-base (cross-encoder) | docker-compose.yml:87-119 |
| Reranker port | 8053 | 8073 (TEI /rerank) | docker-compose.yml:97 |
| Generator | Qwen3-VL-8B port 8050 | unchanged — id qwen3-vl-8b-instruct | verified GET /v1/models |
| Vector DB | pgvector in Supabase | Supabase Postgres 5445, db postgres, table public.stoka_chunks | admin/stoka_bot_schema.sql |

Verified live (all three backends responding):

  • POST http://localhost:8072/embed → [[f32 × 1024]] for one input
  • POST http://localhost:8073/rerank body {query, texts, truncate, raw_scores:false} → [{index, score}]
  • POST http://localhost:8050/v1/chat/completions with stream:true → standard OpenAI SSE deltas

File Layout

Primary edit: one new module + three new routes wired into admin/main.py. Do not grow main.py further with RAG internals — it is already 2,069 lines. Reuse _db_conn() (line 194), _allowed_origins() (110), _get_client_ip() (235), _check_rate_limit() (226).

admin/
  main.py                  — MOD: import + register stoka_bot router, reuse helpers
  stoka_bot.py             — NEW: RAG pipeline, router, prompt layers, session state
  stoka_prompt.py          — NEW: 5-layer prompt builder + golden examples (pure data)
  stoka_bot_schema.sql     — existing (Phase 0), no changes
  stoka_indexer.py         — existing (Phase 1), no changes

Keep stoka_prompt.py as pure-data only (no I/O, no DB). That keeps the eval suite in Phase 2.5 able to import it without touching the DB. Every string constant is test fixture material.

Endpoint Contracts

All routes mounted under /api/stoka/* in admin/main.py. JSON in, SSE or JSON out. CORS already handled by existing CORSMiddleware (line ~25 of main.py) using _allowed_origins().

POST /api/stoka/chat — streamed RAG response

Auth: anonymous OK. Optional Authorization: Bearer <access_token> if present (decoded via existing _decode_access_token, line 305) to pull supabase_uid into the visitor row. Never 401 this route on anon — Stoka is a public terminal.

Request body:

{
  "query": "how do you handle agent failures?",
  "session_id": "uuid-or-null",
  "visitor_id": "uuid-or-null",
  "page": "/blog/debono-boss-fight",
  "page_title": "Debono Boss Fight",
  "scroll_pct": 62,
  "seen_slugs": ["philosophy-bounded-execution"]
}
  • session_id / visitor_id nullable; server creates both on first turn and echoes them in the SSE ready event so the client can persist them in localStorage. Never trust client-sent shown_slugs as the only source of truth — merge with DB stoka_sessions.shown_slugs.
  • seen_slugs is an advisory from client localStorage (reading history); used to penalize retrieval even if the session didn't cite them.

Response: text/event-stream (SSE). MIME text/event-stream, Cache-Control: no-cache, X-Accel-Buffering: no (nginx/Caddy pass-through), Connection: keep-alive.

Event protocol (exact wire format, matches spec §Streaming Protocol):

event: ready
data: {"session_id":"<uuid>","visitor_id":"<uuid>","route":"rag"}

event: token
data: {"text":"Bounded "}

event: cite
data: {"slug":"philosophy-bounded-execution","title":"The Philosophy of Bounded Execution","quote":"…","type":"philosophy","reading_time":"8 min","date":"Apr 12","score":0.87}

event: token
data: {"text":"It's not."}

event: done
data: {"turns":3,"confidence":"direct","latency_ms":720}

On error mid-stream:

event: error
data: {"message":"<visible-to-UI string>","retryable":true}

Then close stream. Never 500 after the stream has started — the SSE is already 200 to the browser.
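The event protocol above reduces to a plain async generator. A minimal sketch (`sse_event`, `chat_stream`, and the token source are illustrative names; the real handler wraps this in FastAPI's `StreamingResponse(..., media_type="text/event-stream")` and fills real `done` stats):

```python
import json

def sse_event(event: str, data: dict) -> str:
    # One SSE frame: event line, data line, blank-line terminator.
    return f"event: {event}\ndata: {json.dumps(data, separators=(',', ':'))}\n\n"

async def chat_stream(session_id: str, visitor_id: str, tokens):
    # Yields the wire protocol above. Errors after the first frame become an
    # `error` event, never a 5xx: the 200 is already on the wire.
    yield sse_event("ready", {"session_id": session_id, "visitor_id": visitor_id, "route": "rag"})
    try:
        async for text in tokens:
            yield sse_event("token", {"text": text})
        yield sse_event("done", {"turns": 1, "confidence": "direct", "latency_ms": 0})
    except Exception as exc:
        yield sse_event("error", {"message": str(exc), "retryable": True})
```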

Route types (emitted in ready.route):

  • meta — answered from personality, no RAG
  • offtopic — deflection, no RAG
  • followup — RAG with exclude=shown_slugs, bias toward explored_topics[-1]
  • new — full RAG
  • void — no hits above nothing threshold; one-liner "nothing here on that. yet."

POST /api/stoka/discover — non-streaming recommendation

Purpose: Ambient hint backend + "what should I read next" lookups. No LLM in the hot path — pure retrieval + light framing by the generator. Synchronous JSON so it can be prefetched.

Request body:

{
  "visitor_id": "uuid-or-null",
  "page": "/blog/debono-boss-fight",
  "seen_slugs": [],
  "limit": 3
}

Response:

{
  "recommendations": [
    {
      "slug": "philosophy-bounded-execution",
      "title": "The Philosophy of Bounded Execution",
      "quote": "Autonomy without constraint is just chaos with better marketing.",
      "type": "philosophy",
      "reading_time": "8 min",
      "date": "Apr 12",
      "framing": "most people read this as an essay about limits. it isn't.",
      "score": 0.81
    }
  ],
  "confidence": "direct"
}

Framing string comes from a single batched vLLM call constrained to max_tokens=40 per recommendation (one request, N completions). If the framing call fails, degrade gracefully to framing=null and still return the retrieval hits. The frontend must render recommendations even without framing.

GET /api/stoka/context?page=<url> — single ambient one-liner

Auth: anonymous OK. Query only; no body.

Response:

{"hint": "you're on the landing page. most people scroll past the interesting part.", "cite": null}

Implementation: map page to a small lookup table of hand-written ambient hints for known routes (/, /blog, /blog/*, /products/*). For blog post pages, use the post slug to fetch a single top-reranked adjacent chunk (exclude-self) and wrap it in a {hint, cite} shape. This is cheap enough to hit on every page load — hard-cache in-process for 60s keyed by page.

Rate limit: reuse _check_rate_limit(f"stoka_ctx:{ip}", 120, 60).
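The 60s in-process cache can be a few lines; a sketch, assuming a `compute` callable that builds the {hint, cite} payload (names are illustrative, not existing helpers):

```python
import time

_CTX_CACHE: dict[str, tuple[float, dict]] = {}
CTX_TTL_S = 60.0  # hard-cache window per page

def cached_context(page: str, compute):
    # Keyed by page; monotonic clock so wall-clock jumps can't poison the TTL.
    now = time.monotonic()
    hit = _CTX_CACHE.get(page)
    if hit and now - hit[0] < CTX_TTL_S:
        return hit[1]
    value = compute(page)
    _CTX_CACHE[page] = (now, value)
    return value
```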

RAG Pipeline (pure functions in admin/stoka_bot.py)

chat(query, session, context)
  → route_query(query, session) → one of {meta, offtopic, followup, new, void}
  → if meta/offtopic: stream personality_response(route, query)
  → if followup/new:
      → embed_query(query)                               [EMBED_SEMAPHORE]
      → pgvector_search(vec, k=20, type_filter=None)
      → rerank_with_editorial_policy(query, candidates, session)  [RERANK_SEMAPHORE]
      → confidence = classify(top_score)  # direct | adjacent | nothing
      → if nothing: stream void_response()
      → build_prompt(layered, retrieved, session, context)
      → stream_generate(prompt, max_tokens=280)          [GENERATE_SEMAPHORE]
      → persist_session_turn(shown_slugs, topics, turns++)

Embedding

Reuse the exact request shape from stoka_indexer.py:390:

client.post(f"{EMBED_URL}/embed", json={"inputs": [q], "truncate": True, "truncation_direction": "Right"}, timeout=10.0)

Do NOT use the metadata-prefix template that the indexer uses for documents. The indexer prepends [type] title / Tags / Section / <body> before embedding content chunks (line 392 of indexer). Queries are embedded raw. Embedding symmetry between query and document is a BGE convention — BGE recommends a small query prefix ("Represent this sentence for searching relevant passages: ") but the Phase 1 index was built without it, so queries MUST be embedded without it too. Write a comment explaining this and put EMBED_QUERY_PREFIX = "" at module top so the decision is reversible when we re-index.
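A sketch of the query-side request body with the reversible prefix decision made explicit (`embed_request_body` is an illustrative name; the payload shape is the one quoted from the indexer above):

```python
# Phase 1 index was built WITHOUT the BGE query prefix
# ("Represent this sentence for searching relevant passages: ").
# Queries must match the index: keep "" until a re-index flips both sides.
EMBED_QUERY_PREFIX = ""

def embed_request_body(query: str) -> dict:
    # Same TEI request shape the indexer uses; no metadata-prefix template here.
    return {
        "inputs": [EMBED_QUERY_PREFIX + query],
        "truncate": True,
        "truncation_direction": "Right",
    }
```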

Retrieval

SELECT id, source_type, source_slug, chunk_index, content, title, section, metadata,
       1 - (embedding <=> %s::vector) AS similarity
FROM public.stoka_chunks
WHERE embedding IS NOT NULL
ORDER BY embedding <=> %s::vector
LIMIT 20

Use the same _vector_literal format helper the indexer uses (line 425) — do not depend on pgvector-python being installed in the admin container. Keep vectors as pgvector text literal "[0.01,0.02,...]". Single-query parameter binding; no SQL injection surface.

Confidence classification uses the reranker score post-editorial adjustments, not the raw cosine. The cosine is first-stage recall; the reranker is what the confidence thresholds in spec §Reranking apply to.

Reranking with editorial policy

Call TEI reranker with:

client.post(f"{RERANK_URL}/rerank", json={
  "query": query,
  "texts": [c.content for c in candidates],
  "raw_scores": False,        # sigmoid-normalized, easier to threshold
  "return_text": False,
  "truncate": True,
}, timeout=8.0)

Apply adjustments from spec §Reranking with Editorial Policy (lines 448-488):

| Rule | Multiplier | Source of truth |
|---|---|---|
| Same source slug already in results | ×0.6 once seen | iterate candidates in score order |
| Slug in session.shown_slugs (DB) OR client seen_slugs | ×0.1 | near-veto |
| metadata.tags contains "philosophy" AND session.turns >= 4 | ×1.3 | deep-reader boost |
| metadata.pubDate within last 30 days | ×1.15 | recency |
| source_type == "product" AND query looks navigational (regex `^(what is|tell me about|do you have) `) | | |

Cap to top 5 for prompt injection. Attach the adjusted score to each chunk so the cite event can pass it downstream.
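The adjustments amount to one pure pass over the reranked candidates. A sketch under assumed shapes (the `Candidate` dataclass is illustrative; real code carries full chunk rows and reads metadata.tags / metadata.pubDate):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    slug: str           # source_slug
    score: float        # TEI reranker score (sigmoid-normalized)
    tags: tuple = ()
    days_old: int = 9999

def apply_editorial_policy(ranked, shown_slugs, turns):
    # `ranked` is reranker output in descending score order.
    seen_sources: set = set()
    adjusted = []
    for c in ranked:
        s = c.score
        if c.slug in seen_sources:
            s *= 0.6                                # same source already in results
        if c.slug in shown_slugs:
            s *= 0.1                                # near-veto: already shown this session
        if "philosophy" in c.tags and turns >= 4:
            s *= 1.3                                # deep-reader boost
        if c.days_old <= 30:
            s *= 1.15                               # recency
        seen_sources.add(c.slug)
        adjusted.append((c, s))
    adjusted.sort(key=lambda t: t[1], reverse=True)
    return adjusted[:5]                             # top 5 for prompt injection
```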

Thresholds (spec line 453):

  • direct ≥ 0.82 — Stoka speaks with authority, top 2-3 cites
  • adjacent ≥ 0.65 — framed as a stretch ("closest thing i have is this")
  • nothing < 0.50 — void response, no LLM call at all

The 0.50-0.65 band calls the LLM but injects a [low_confidence=true] marker in the prompt layer 4 so the model frames answers as "closest match" rather than authoritative.
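One reading of the thresholds as a pure classifier, returning the confidence band plus the low_confidence marker for the 0.50-0.65 band (how the marker interacts with plain "adjacent" is a judgment call against the spec; adjust to taste):

```python
def classify(top_adjusted_score: float):
    # Operates on the post-editorial reranker score, never raw cosine.
    if top_adjusted_score >= 0.82:
        return "direct", False
    if top_adjusted_score >= 0.65:
        return "adjacent", False
    if top_adjusted_score >= 0.50:
        return "adjacent", True   # 0.50-0.65 band: LLM called, framed as closest match
    return "nothing", False        # void response, no LLM call at all
```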

Generation

vLLM /v1/chat/completions. System message = full 5-layer prompt. User message = query. stream=true, max_tokens=280, temperature=0.5, top_p=0.9, stop=["\nvisitor:","</cite"]. No tool calls.

Citation emission strategy: the model is instructed (layer 1) to emit citation markers inline as <cite slug="..." quote="..."/>. A small streaming parser in stoka_bot.py watches the emerging token buffer for <cite ... /> tags and:

  1. When a complete tag is detected, pause token emission, look up the slug in the retrieved-chunks table, emit an event: cite with the full hydrated citation payload (title, type, reading_time, date, score from reranker), and mark slug as shown_in_session.
  2. Resume token emission of whatever follows the closing />.

Hydration is done from the retrieved candidates list that we already have in memory — never trust the model to output accurate titles or dates. The model only supplies the slug and the pull quote; everything else comes from stoka_chunks.metadata and stoka_chunks.title loaded during retrieval.

If the model hallucinates a slug not in the top-5 retrieval, drop the cite entirely (log it) and keep streaming. Silence beats a broken citation.
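The tag-watching parser can be isolated as a pure incremental scanner. A sketch (real code must also cap buffer growth and tolerate malformed attribute ordering):

```python
import re

CITE_RE = re.compile(r'<cite\s+slug="([^"]+)"\s+quote="([^"]*)"\s*/>')

class CiteParser:
    """Incremental scanner for <cite slug="..." quote="..."/> in a token stream.

    feed(chunk) -> (text, cites): text safe to emit as `token` events, plus any
    (slug, quote) pairs completed by this chunk. A partial "<cite" at the end of
    the buffer is held back until it completes or proves to be plain text.
    """
    def __init__(self) -> None:
        self.buf = ""

    def feed(self, chunk: str):
        self.buf += chunk
        out, cites = [], []
        while True:
            m = CITE_RE.search(self.buf)
            if m:
                out.append(self.buf[:m.start()])
                cites.append((m.group(1), m.group(2)))
                self.buf = self.buf[m.end():]
                continue
            i = self.buf.rfind("<")
            tail = self.buf[i:] if i != -1 else ""
            if tail and "<cite".startswith(tail[:5]) and "/>" not in tail:
                out.append(self.buf[:i])   # hold back a possible tag prefix
                self.buf = tail
            else:
                out.append(self.buf)       # nothing tag-like pending: flush all
                self.buf = ""
            return "".join(out), cites
```

The caller pauses token emission whenever `feed` returns cites, hydrates each slug from the in-memory retrieved candidates, emits the `cite` event, then resumes.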

Session state

Session is a DB row in public.stoka_sessions (Phase 0 schema line 54). Anonymous visitors get a stoka_visitors row on first turn with supabase_uid = NULL. Authenticated visitors get the row upserted with their Supabase UID.

On each /chat turn:

-- upsert visitor (anon: by cookie UUID; auth: by supabase_uid)
INSERT INTO public.stoka_visitors (id, supabase_uid, last_seen)
VALUES (%s, %s, now())
ON CONFLICT (id) DO UPDATE SET last_seen = now(), supabase_uid = COALESCE(EXCLUDED.supabase_uid, public.stoka_visitors.supabase_uid);

-- upsert session
INSERT INTO public.stoka_sessions (id, visitor_id, shown_slugs, explored_topics, turns, last_active)
VALUES (%s, %s, %s, %s, 1, now())
ON CONFLICT (id) DO UPDATE SET
  shown_slugs     = (SELECT array_agg(DISTINCT x) FROM unnest(stoka_sessions.shown_slugs || EXCLUDED.shown_slugs) x),
  explored_topics = (SELECT array_agg(DISTINCT x) FROM unnest(stoka_sessions.explored_topics || EXCLUDED.explored_topics) x),
  turns           = stoka_sessions.turns + 1,
  last_active     = now();

Topic extraction: on route=new, extract the top-1 reranked result's primary tag as the current topic. On route=followup, reuse explored_topics[-1].

A session is "stale" after 2h of inactivity; new turns get a new session row. Old rows stay for analytics.
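The staleness rule as a pure helper (illustrative names; real code reads stoka_sessions.last_active and a None return means "create a fresh session row"):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=2)

def effective_session_id(session_id, last_active, now=None):
    # Stale sessions are never updated in place: return None so the caller
    # creates a new row; the old row stays for analytics.
    now = now or datetime.now(timezone.utc)
    if session_id is None or last_active is None:
        return None
    if now - last_active > STALE_AFTER:
        return None
    return session_id
```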

Semaphores (module-level in stoka_bot.py)

EMBED_SEMAPHORE    = asyncio.Semaphore(4)
RERANK_SEMAPHORE   = asyncio.Semaphore(2)
GENERATE_SEMAPHORE = asyncio.Semaphore(2)  # GPU bottleneck, shared with Debono/ST

GENERATE_SEMAPHORE=2 because Debono and StokaTerminal already hit 8050. vLLM continuous batching handles real concurrency; the semaphore just prevents us from queueing 50 Stoka requests ahead of a Debono deck transcription.

Rate limits (reuse _check_rate_limit, line 226):

  • /chat: 30 req / 5min / IP (stoka_chat:{ip})
  • /discover: 60 req / 5min / IP
  • /context: 120 req / 60s / IP

Response Router

Router is pure-Python heuristics, zero LLM calls. It runs on every turn before any retrieval work. Keep it in stoka_bot.py as route_query(query, session) -> Route.

META_PATTERNS = (
  r"^\s*(what|who) (is|are) (this|you|stoka)\b",
  r"^\s*(are you|r u) (a )?(bot|ai|human|chatbot|real)\b",
  r"^\s*(hi|hey|hello|yo|sup)\s*[!?.]*$",
)

OFFTOPIC_SIGNALS = (
  # things Stoka knows nothing about
  "react","vue","angular","nextjs","typescript framework","best javascript",
  "bitcoin","stock","weather","recipe","homework","write my","draw me",
)

Decision order:

  1. If query matches any META_PATTERN → meta
  2. If query contains any OFFTOPIC_SIGNAL and does not overlap with any indexed source_slug tokens → offtopic
  3. If session.turns >= 1 AND (query is ≤ 8 tokens OR starts with "what about"|"and"|"but"|"so"|"did it"|"tell me more") → followup
  4. Otherwise → new

void is not a router outcome — it is assigned after retrieval if scores fall below the nothing threshold.

The router short-circuits explicitly because spec-line 430 is a semi-cautionary example ("Not every query hits the full RAG pipeline"). Every hop to the LLM is ~500ms GPU time we don't need for "hi" or "what is this?".
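The decision order compiles to a few lines. A sketch (OFFTOPIC_SIGNALS abridged; `indexed_tokens` stands in for the indexed source_slug token set):

```python
import re

META_PATTERNS = (
    r"^\s*(what|who) (is|are) (this|you|stoka)\b",
    r"^\s*(are you|r u) (a )?(bot|ai|human|chatbot|real)\b",
    r"^\s*(hi|hey|hello|yo|sup)\s*[!?.]*$",
)
OFFTOPIC_SIGNALS = ("react", "vue", "bitcoin", "weather", "best javascript")  # abridged
FOLLOWUP_STARTS = ("what about", "and", "but", "so", "did it", "tell me more")

def route_query(query: str, turns: int, indexed_tokens: set) -> str:
    q = query.strip().lower()
    if any(re.search(p, q) for p in META_PATTERNS):
        return "meta"
    q_tokens = set(re.findall(r"[a-z0-9]+", q))
    if any(sig in q for sig in OFFTOPIC_SIGNALS) and not (q_tokens & indexed_tokens):
        return "offtopic"
    if turns >= 1 and (len(q.split()) <= 8 or q.startswith(FOLLOWUP_STARTS)):
        return "followup"
    return "new"   # `void` is assigned later, after retrieval scoring
```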

Prompt Architecture — admin/stoka_prompt.py

Pure-data module. Single function build_messages(query, retrieved_chunks, visitor_ctx, session_state) -> list[dict]. No I/O.

Layer 0 — Identity (frozen string)

You are Stoka. You ARE stokasoftware.com — aware of your own content.
Not a chatbot. Not an assistant. Not a help desk.
The site itself, with opinions about what's published on it.

Layer 1 — NEVER constraints (verbatim, no paraphrasing)

Source of truth: spec §Anti-Patterns (lines 108-122) + §Tone Rules (99-106). Every item lifted directly. Codify as a numbered list because numbered NEVER-lists outperform bullet NEVER-lists in follow-through — Claude Code leak pattern §Constraints.

You NEVER:
 1. greet the visitor ("hi", "hello", "welcome", "hey")
 2. say "great question" or any variant
 3. say "I'd be happy to help" or any sycophancy
 4. apologize ("sorry, I don't have…") — just say what you don't have
 5. summarize a blog post instead of citing it
 6. use exclamation marks
 7. use emoji
 8. use bullet-point lists of recommendations
 9. repeat a citation that was already shown in this session
 10. exceed 3 sentences before a cite
 11. match the visitor's energy level — you are unshakably yourself
 12. offer alternatives when you can't help ("but you might try…")
 13. repeat the visitor's question back
 14. claim to know anything not in the retrieved content below
 15. invent a citation slug that does not appear in [Retrieved Content]

Constraint 15 is what prevents citation hallucination. Pair it with the post-stream slug-hydration guard.

Layer 2 — Voice + golden examples

Lift spec §Voice (lines 87-97) and spec §Few-Shot Golden Examples (lines 533-568) verbatim. Keep the exact 7 examples the spec provides — Shay wrote them, they are the canon. Format as alternating visitor: / stoka: in a single system-message block; do NOT split them into synthetic conversation turns because BGE-indexed Qwen models follow literal demonstration better than role-played examples.

Layer 3 — Retrieved Content (injected per request)

[Retrieved Content — 5 chunks, cite these and ONLY these]
<chunk slug="philosophy-bounded-execution" score="0.87" type="philosophy" tags="philosophy" reading_time="8 min" date="Apr 12">
The Philosophy of Bounded Execution — §Why constraints produce better software

Autonomy without constraint is just chaos with better marketing…
</chunk>
<chunk slug="debono-boss-fight" …>…</chunk>

Include score, type, tags, reading_time, and date as XML attributes so the model can pick the citation quote from the chunk content; the metadata-sourced attributes flow into the emitted cite event via the hydration step (the model never types them).

If confidence == "adjacent", prepend [low_confidence — frame as "closest i have"] on its own line before the chunks.
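Layer 3 rendering as a pure function, consistent with stoka_prompt.py being data-only (dict key names are assumptions; real code reads the retrieved rows loaded during retrieval):

```python
def render_retrieved(chunks, confidence: str) -> str:
    lines = []
    if confidence == "adjacent":
        lines.append('[low_confidence — frame as "closest i have"]')
    lines.append(f"[Retrieved Content — {len(chunks)} chunks, cite these and ONLY these]")
    for c in chunks:
        attrs = (f'slug="{c["slug"]}" score="{c["score"]:.2f}" type="{c["type"]}" '
                 f'tags="{c["tags"]}" reading_time="{c["reading_time"]}" date="{c["date"]}"')
        lines.append(f"<chunk {attrs}>\n{c['title']} — {c['section']}\n\n{c['content']}\n</chunk>")
    return "\n".join(lines)
```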

Layer 4 — Visitor Context

[Visitor]
page:            /blog/debono-boss-fight
page_title:      Debono Boss Fight
scroll_pct:      62
returning:       false
declared_tags:   []
turns_so_far:    2
already_cited:   philosophy-bounded-execution, debono-llm-gpu
current_topic:   agent-failure
route:           followup

Only include fields that have values. An empty declared_tags line stays; the model uses the presence of the label as a signal that there is nothing to personalize on.

Layer 5 — Conversation (sliding window)

Not implemented in Phase 2. Hard-code to last-exchange only (previous visitor query + previous Stoka response, stored on stoka_sessions as a new JSONB column last_exchange). Add a migration for this column in the Phase 2 schema additions (see §Schema Additions below).

Rationale: full conversation windowing is Phase 3+ territory when the UI actually exists. Shipping Phase 2 with zero-turn history is fine — most Stoka interactions will be 1-3 turns regardless, and the Phase 3 UI will drive whatever windowing matters.

Final assembly

def build_messages(query, retrieved, visitor, session):
    system = "\n\n".join([
        LAYER_0_IDENTITY,
        LAYER_1_CONSTRAINTS,
        LAYER_2_VOICE + "\n\n" + LAYER_2_GOLDEN_EXAMPLES,
        render_retrieved(retrieved, confidence=session.confidence),
        render_visitor(visitor, session),
    ])
    msgs = [{"role": "system", "content": system}]
    if session.last_exchange:
        msgs.append({"role": "user", "content": session.last_exchange["user"]})
        msgs.append({"role": "assistant", "content": session.last_exchange["assistant"]})
    msgs.append({"role": "user", "content": query})
    return msgs

Total system prompt budget: ~2,500 tokens worst case (layers 0-2 are ~1,200 static + 1,300 for 5 chunks of ~250 tokens each). Qwen3-VL-8B max_model_len=16384 — we have headroom.

Schema Additions

One new column on stoka_sessions (otherwise Phase 0 schema is sufficient):

ALTER TABLE public.stoka_sessions
  ADD COLUMN IF NOT EXISTS last_exchange JSONB;

Apply via admin/stoka_bot_schema.sql append (keep the file idempotent — it already uses CREATE TABLE IF NOT EXISTS / CREATE INDEX IF NOT EXISTS). Execute on container startup through the existing init_db() path in main.py:204:

def init_db() -> None:
    _init_pool()
    schema_sql = SCHEMA_PATH.read_text()
    identity_schema_sql = IDENTITY_SCHEMA_PATH.read_text()
    stoka_bot_sql = (Path(__file__).with_name("stoka_bot_schema.sql")).read_text()  # NEW
    with _db_conn() as conn:
        ...

Currently stoka_bot_schema.sql is applied only against the separate Supabase-bot DB (host.docker.internal:5445). The admin FastAPI already talks to that same DB (SUPABASE_DB_URL points to it per docker-compose.yml:19), so wiring the schema into init_db() is correct. Verify before adding that SUPABASE_DB_URL in init_db's pool is in fact the same database as STOKA_DB_URL used by the indexer — if divergence exists, fix the env wiring first, do not silently create tables in the wrong DB. This is a Kisame hazard.

Environment Variables

Add to docker-compose.yml stokasoftware service env block (line 10-44):

- STOKA_EMBED_URL=${STOKA_EMBED_URL:-http://host.docker.internal:8072}
- STOKA_RERANK_URL=${STOKA_RERANK_URL:-http://host.docker.internal:8073}
- STOKA_GEN_URL=${STOKA_GEN_URL:-http://host.docker.internal:8050}
- STOKA_GEN_MODEL=${STOKA_GEN_MODEL:-qwen3-vl-8b-instruct}
- STOKA_BOT_ENABLED=${STOKA_BOT_ENABLED:-true}

Feature flag STOKA_BOT_ENABLED gates route registration; default on but we can kill-switch without a rebuild by flipping the env var. All three URL vars default to the correct Phase 0 ports on host.

Existing EMBED_URL=...:8055 at line 30 is unrelated (Debono's embedder) — leave it alone; do not overload. The scoped STOKA_EMBED_URL name is intentional so future-me doesn't conflate the two.

Failure Modes & Handling

Review-miss patterns from institutional memory:

| Failure | Detection | Response |
|---|---|---|
| Embedding service down | httpx timeout/5xx on /embed | 503 before stream starts; SSE never opens |
| pgvector query empty (stoka_chunks dry) | 0 rows | SSE error event "index is empty", route=void |
| All scores below nothing threshold | post-rerank classifier | route=void, stream "nothing here on that. yet." via fake-token chunks (keep terminal feel) |
| Reranker down | httpx timeout | fall back to raw cosine order, log warning, continue (not a 500 — the service degrades gracefully) |
| Generator down | httpx timeout | 503 before stream opens; if already streaming, emit error event and close |
| Model hallucinates slug | hydration lookup miss | drop the cite silently, keep tokens streaming, log count for dashboard |
| Client disconnects mid-stream | asyncio CancelledError | cancel the upstream request to vLLM (release GENERATE_SEMAPHORE); commit session turn to stoka_sessions anyway (partial turn still counts for rate limits) |
| DB pool exhausted | psycopg2 PoolError | 503 + Retry-After 2; do not hold the pool while streaming — see §Connection Management |
| Rate limit exceeded | _check_rate_limit raises 429 | let it bubble; caller shows "slow down" |

Every fetch in Phase 3 frontend must have a visible error state in .catch() — not just console.error. Each SSE event is a contract with the UI.

Connection Management (critical — do not diverge from helpers)

The _db_conn() context manager (main.py:194) holds a psycopg2 connection for the full with block. Do not hold a connection across an SSE stream — that pins one of the pool's 5 connections for the entire generation window and we will starve under concurrent visitors.

Pattern:

# At request entry: fetch session, release conn
with _db_conn() as conn:
    visitor, session = _load_or_create(conn, visitor_id, session_id, access_token)
    retrieved = _load_retrieval(conn, query_vec)  # includes pgvector search
# conn released — no DB held during reranker + stream

# ... rerank + stream ...

# At request exit: persist turn, release again
with _db_conn() as conn:
    _commit_turn(conn, visitor, session, shown_slugs, topic, last_exchange)

This matches the "connection management divergence from established helpers should always BLOCK" rule from the past-incident ledger. Single pattern: scoped with blocks, never leaked, never a raw connection passed into an async generator.

Testing — Ship With These Smoke Tests

Minimum acceptance gate. All of these are CURL-driven and runnable against the local compose stack.

  1. POST /api/stoka/chat {"query":"what is this?"} → SSE opens, ready.route=="meta", response ends with provocation (? or />), no cite events, total tokens ≤ 40.
  2. POST /api/stoka/chat {"query":"tell me about bounded execution"} → route=="new", at least one cite event with slug=="philosophy-bounded-execution", score ≥ 0.60.
  3. Fire same request twice with same session_id; second response MUST NOT cite the slug already in shown_slugs (novelty penalty verified).
  4. POST /api/stoka/chat {"query":"best javascript framework?"} → route=="offtopic", response equals "wrong terminal." exactly.
  5. POST /api/stoka/chat {"query":"quantum gravity research"} → route=="void" OR adjacent, never a hallucinated cite.
  6. POST /api/stoka/discover with page=/blog/philosophy-bounded-execution → returns 3 recommendations, none equal to current slug.
  7. GET /api/stoka/context?page=/ → returns hint, cache hit ≤ 50ms on second call.
  8. Kill stoka-reranker container, re-run test 2 → still streams, response still cites (degraded path), logs warning.
  9. Kill stoka-embedding container, re-run test 2 → 503 before SSE opens.
  10. Run test 1-5 in parallel × 10 → no 5xx, no pool exhaustion, no "connection already closed" log.
  11. Verify SELECT DISTINCT s FROM stoka_sessions CROSS JOIN LATERAL unnest(shown_slugs) AS s returns only slugs that actually exist in stoka_chunks (hallucination protection).

Tests 1-7 are the Sasori smoke-query step from the review-miss register ("hit each new endpoint once against real data and verify response makes sense"). Tests 8-11 are degradation verification.

Automate via a single shell script admin/scripts/stoka_bot_smoke.sh that exits non-zero on any failure. Do not add a pytest suite in Phase 2 — that's Phase 2.5 eval pipeline territory.

Anti-Patterns (explicit — Sasori reviews against this list)

  1. Do not add a pytest suite to admin/ in this phase — it's unnecessary until the eval dataset exists (Phase 2.5). Tests with nothing real to assert against waste review cycles.
  2. Do not introduce a new ORM, new DB driver, or new connection helper. Use _db_conn(). The cost_events incident and the psycopg2 leak were both "someone built their own conn helper."
  3. Do not rename ports 8072/8073 back to 8052/8053 in the plan, the code, or the env vars. Phase 0 docs explain why.
  4. Do not ship without the slug-hydration guard (constraint 15 + post-parse filter). Hallucinated cites are the fastest way to destroy the terminal's credibility on day 1.
  5. Do not commit generated artifacts — no .playwright-mcp/, no __pycache__, no admin/__pycache__. dist/ will churn during npm run build testing; exclude those diffs from the Phase 2 commit. Reference: the past scope-creep incident where generated artifacts flooded the mission diff.
  6. Do not touch backend/main.py — it is the stale copy moved to SoftwareTsukuyomi. Comment in admin/main.py:1-6 is accurate: all portal work lives in SoftwareTsukuyomi now; only the admin service lives here.
  7. Do not add SSO/auth enforcement to /api/stoka/chat. Anonymous is the default; auth is an optional enrichment.
  8. Do not introduce LoRA or eval infrastructure. That's Phase 2.5. Prompt engineering only in Phase 2.
  9. Do not add a frontend file. Phase 3 owns src/components/StokaTerminal.astro and friends.
  10. Do not block on the legacy EMBED_URL=...:8055 (Debono) or the scoped STOKA_EMBED_URL=...:8072 (ours) being the same. They are different embedders for different apps. Do not deduplicate.

Dependencies

admin/requirements.txt additions — verify first whether these are already present (the codebase already uses httpx, psycopg2, fastapi, pydantic). Likely zero new deps. If sse-starlette is not already a dep, prefer hand-rolled SSE over adding it — FastAPI's StreamingResponse with a correctly-formatted async generator is enough and avoids a new package.

# Hand-rolled SSE format (plain function — no awaits, so not async; needs `import json`):
def sse_format(event: str, data: dict) -> bytes:
    return f"event: {event}\ndata: {json.dumps(data, separators=(',', ':'))}\n\n".encode()

Execution Order (for the coder — Kisame primary)

  1. Read everything first. Skim admin/main.py fully once. Read admin/stoka_indexer.py fully. Read admin/stoka_bot_schema.sql. Don't start editing until the mental model is complete.
  2. Verify env wiring. Prove to yourself that SUPABASE_DB_URL and STOKA_DB_URL point at the same Postgres instance. If they don't, fix that before anything else.
  3. Write admin/stoka_prompt.py first. It's pure data, no I/O, fastest to review. Include the 7 golden examples verbatim from spec lines 533-568.
  4. Write admin/stoka_bot.py — router, retrieval, reranker, streaming parser, session persistence. No FastAPI routes yet; just pure async functions.
  5. Wire routes into admin/main.py — three decorators, reuse existing helpers. Add init_db() call to apply stoka_bot_schema.sql append (new column only).
  6. Append migration to admin/stoka_bot_schema.sql for the last_exchange column.
  7. Add env vars to docker-compose.yml service block.
  8. Rebuild docker compose up -d --build stokasoftware.
  9. Run smoke tests (admin/scripts/stoka_bot_smoke.sh, tests 1-11).
  10. Commit in one clean commit. Generated files stripped. Diff reviewable.

Estimated LOC: stoka_bot.py ~450, stoka_prompt.py ~200, main.py +60, stoka_bot_schema.sql +3, docker-compose.yml +5. Total ~720 lines.

Success Criteria

  • SSE streams char-by-char visibly in a raw curl -N session within 2s of request start
  • Every cite event carries slug, title, quote, type, reading_time, date, score
  • No cite event carries a slug absent from stoka_chunks
  • route=="meta" and route=="offtopic" return in ≤ 150ms end-to-end (no GPU call)
  • route=="new" end-to-end p50 ≤ 1500ms, p95 ≤ 2500ms on idle GPU
  • Concurrent Debono + Stoka traffic does not deadlock (manual test: upload deck while running smoke test 2)
  • Rebuilding the admin container with STOKA_BOT_ENABLED=false removes all stoka routes from /openapi.json (kill-switch works)
  • git diff on the Phase 2 commit touches only: admin/stoka_bot.py, admin/stoka_prompt.py, admin/stoka_bot_schema.sql, admin/main.py, admin/scripts/stoka_bot_smoke.sh, docker-compose.yml, specs/stoka-bot-phase2-plan.md. Nothing in dist/, node_modules/, or src/.

Konan, Phase 2 plan. Paper bomb set — unfold, execute, fold back flat.