Stoka Bot — Phase 2 Plan: RAG API
Status: plan for execution
Branch: cut feature branch stoka/phase2-rag-api at execution
Upstream: Phase 0 (compose infra, pgvector schema) + Phase 1 (admin/stoka_indexer.py, 860 chunks indexed) — both shipped.
Mission
Build /api/stoka/chat (SSE stream), /api/stoka/discover, /api/stoka/context in the existing FastAPI admin service (admin/main.py). Full RAG pipeline: embed → pgvector → rerank (editorial) → generate. 5-layer personality prompt. Response router. Per-session state with confidence thresholds and semaphores.
Out of scope: UI/Astro island (Phase 3). Interest model (Phase 4). Ambient mode (Phase 5). Telemetry bridge (Phase 6).
Reality Check — Phase 0/1 Deltas From Spec
The spec in specs/stoka-bot.md names ports 8052/8053 and Qwen3 embed/rerank. Phase 0 overrode both. Do not re-litigate. Use what is running:
| Component | Spec says | Actually deployed | Source |
|---|---|---|---|
| Embedding model | Qwen/Qwen3-Embedding-0.6B | BAAI/bge-large-en-v1.5 (1024-dim) | docker-compose.yml:53-85 |
| Embedding port | 8052 | 8072 (TEI `/embed`) | docker-compose.yml:63 |
| Reranker model | Qwen/Qwen3-Reranker-0.6B | BAAI/bge-reranker-base (cross-encoder) | docker-compose.yml:87-119 |
| Reranker port | 8053 | 8073 (TEI `/rerank`) | docker-compose.yml:97 |
| Generator | Qwen3-VL-8B port 8050 | unchanged — id `qwen3-vl-8b-instruct` | verified GET /v1/models |
| Vector DB | pgvector in Supabase | Supabase Postgres 5445, db `postgres`, table `public.stoka_chunks` | admin/stoka_bot_schema.sql |
Verified live (all three backends responding):
- `POST http://localhost:8072/embed` → `[[f32 × 1024]]` for one input
- `POST http://localhost:8073/rerank` body `{query, texts, truncate, raw_scores:false}` → `[{index, score}]`
- `POST http://localhost:8050/v1/chat/completions` with `stream:true` → standard OpenAI SSE deltas
File Layout
Primary edit: one new module + three new routes wired into admin/main.py. Do not grow main.py further with RAG internals — it is already 2,069 lines. Reuse _db_conn() (line 194), _allowed_origins() (110), _get_client_ip() (235), _check_rate_limit() (226).
admin/
main.py — MOD: import + register stoka_bot router, reuse helpers
stoka_bot.py — NEW: RAG pipeline, router, prompt layers, session state
stoka_prompt.py — NEW: 5-layer prompt builder + golden examples (pure data)
stoka_bot_schema.sql — existing (Phase 0), no changes
stoka_indexer.py — existing (Phase 1), no changes
Keep stoka_prompt.py as pure-data only (no I/O, no DB). That keeps the eval suite in Phase 2.5 able to import it without touching the DB. Every string constant is test fixture material.
Endpoint Contracts
All routes mounted under /api/stoka/* in admin/main.py. JSON in, SSE or JSON out. CORS already handled by existing CORSMiddleware (line ~25 of main.py) using _allowed_origins().
POST /api/stoka/chat — streamed RAG response
Auth: anonymous OK. Optional Authorization: Bearer <access_token> if present (decoded via existing _decode_access_token, line 305) to pull supabase_uid into the visitor row. Never 401 this route on anon — Stoka is a public terminal.
Request body:
{
"query": "how do you handle agent failures?",
"session_id": "uuid-or-null",
"visitor_id": "uuid-or-null",
"page": "/blog/debono-boss-fight",
"page_title": "Debono Boss Fight",
"scroll_pct": 62,
"seen_slugs": ["philosophy-bounded-execution"]
}
`session_id`/`visitor_id` are nullable; the server creates both on the first turn and echoes them in the SSE `ready` event so the client can persist them in `localStorage`. Never trust client-sent `shown_slugs` as the only source of truth — merge with DB `stoka_sessions.shown_slugs`. `seen_slugs` is an advisory from client `localStorage` (reading history), used to penalize retrieval even if the session didn't cite those posts.
Response: text/event-stream (SSE). MIME text/event-stream, Cache-Control: no-cache, X-Accel-Buffering: no (nginx/Caddy pass-through), Connection: keep-alive.
Event protocol (exact wire format, matches spec §Streaming Protocol):
event: ready
data: {"session_id":"<uuid>","visitor_id":"<uuid>","route":"rag"}
event: token
data: {"text":"Bounded "}
event: cite
data: {"slug":"philosophy-bounded-execution","title":"The Philosophy of Bounded Execution","quote":"…","type":"philosophy","reading_time":"8 min","date":"Apr 12","score":0.87}
event: token
data: {"text":"It's not."}
event: done
data: {"turns":3,"confidence":"direct","latency_ms":720}
On error mid-stream:
event: error
data: {"message":"<visible-to-UI string>","retryable":true}
Then close stream. Never 500 after the stream has started — the SSE is already 200 to the browser.
Route types (emitted in ready.route):
- `meta` — answered from personality, no RAG
- `offtopic` — deflection, no RAG
- `followup` — RAG with `exclude=shown_slugs`, bias toward `explored_topics[-1]`
- `new` — full RAG
- `void` — no hits above the `nothing` threshold; one-liner `"nothing here on that. yet."`
POST /api/stoka/discover — non-streaming recommendation
Purpose: Ambient hint backend + "what should I read next" lookups. No LLM in the hot path — pure retrieval + light framing by the generator. Synchronous JSON so it can be prefetched.
Request body:
{
"visitor_id": "uuid-or-null",
"page": "/blog/debono-boss-fight",
"seen_slugs": [],
"limit": 3
}
Response:
{
"recommendations": [
{
"slug": "philosophy-bounded-execution",
"title": "The Philosophy of Bounded Execution",
"quote": "Autonomy without constraint is just chaos with better marketing.",
"type": "philosophy",
"reading_time": "8 min",
"date": "Apr 12",
"framing": "most people read this as an essay about limits. it isn't.",
"score": 0.81
}
],
"confidence": "direct"
}
Framing string comes from a single batched vLLM call constrained to max_tokens=40 per recommendation (one request, N completions). If the framing call fails, degrade gracefully to framing=null and still return the retrieval hits. The frontend must render recommendations even without framing.
GET /api/stoka/context?page=<url> — single ambient one-liner
Auth: anonymous OK. Query only; no body.
Response:
{"hint": "you're on the landing page. most people scroll past the interesting part.", "cite": null}
Implementation: map page to a small lookup table of hand-written ambient hints for known routes (/, /blog, /blog/*, /products/*). For blog post pages, use the post slug to fetch a single top-reranked adjacent chunk (exclude-self) and wrap it in a {hint, cite} shape. This is cheap enough to hit on every page load — hard-cache in-process for 60s keyed by page.
Rate limit: reuse _check_rate_limit(f"stoka_ctx:{ip}", 120, 60).
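The 60s in-process cache can be as small as this sketch; `cached_context` and the injected `compute` callable are hypothetical names standing in for the lookup-table / rerank path described above:

```python
import time

# page path -> (monotonic timestamp, cached payload)
_CTX_CACHE: dict[str, tuple[float, dict]] = {}
CTX_TTL_S = 60.0

def cached_context(page: str, compute) -> dict:
    # compute(page) -> {"hint": ..., "cite": ...}; only called on a cold key
    # or after the 60s TTL expires.
    now = time.monotonic()
    hit = _CTX_CACHE.get(page)
    if hit and now - hit[0] < CTX_TTL_S:
        return hit[1]
    payload = compute(page)
    _CTX_CACHE[page] = (now, payload)
    return payload
```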
RAG Pipeline (pure functions in admin/stoka_bot.py)
chat(query, session, context)
→ route_query(query, session) → one of {meta, offtopic, followup, new, void}
→ if meta/offtopic: stream personality_response(route, query)
→ if followup/new:
→ embed_query(query) [EMBED_SEMAPHORE]
→ pgvector_search(vec, k=20, type_filter=None)
→ rerank_with_editorial_policy(query, candidates, session) [RERANK_SEMAPHORE]
→ confidence = classify(top_score) # direct | adjacent | nothing
→ if nothing: stream void_response()
→ build_prompt(layered, retrieved, session, context)
→ stream_generate(prompt, max_tokens=280) [GENERATE_SEMAPHORE]
→ persist_session_turn(shown_slugs, topics, turns++)
Embedding
Reuse the exact request shape from stoka_indexer.py:390:
client.post(f"{EMBED_URL}/embed", json={"inputs": [q], "truncate": True, "truncation_direction": "Right"}, timeout=10.0)
Do NOT use the metadata-prefix template that the indexer uses for documents. The indexer prepends [type] title / Tags / Section / <body> before embedding content chunks (line 392 of indexer). Queries are embedded raw. Embedding symmetry between query and document is a BGE convention — BGE recommends a small query prefix ("Represent this sentence for searching relevant passages: ") but the Phase 1 index was built without it, so queries MUST be embedded without it too. Write a comment explaining this and put EMBED_QUERY_PREFIX = "" at module top so the decision is reversible when we re-index.
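A sketch of the query-embedding path under those rules; `embed_payload` and `embed_query` are hypothetical names, and the local `httpx` import is just to keep the module importable in environments without it:

```python
import asyncio

EMBED_URL = "http://localhost:8072"      # STOKA_EMBED_URL in compose
# Phase 1 indexed documents WITHOUT the BGE query prefix, so queries are
# embedded raw too. Flip this only together with a full re-index.
EMBED_QUERY_PREFIX = ""
EMBED_SEMAPHORE = asyncio.Semaphore(4)

def embed_payload(q: str) -> dict:
    # Mirrors the request shape stoka_indexer.py uses — minus the
    # metadata-prefix template, which is documents-only.
    return {"inputs": [EMBED_QUERY_PREFIX + q], "truncate": True,
            "truncation_direction": "Right"}

async def embed_query(q: str) -> list[float]:
    import httpx  # local import keeps the module importable without httpx
    async with EMBED_SEMAPHORE:
        async with httpx.AsyncClient(timeout=10.0) as client:
            r = await client.post(f"{EMBED_URL}/embed", json=embed_payload(q))
            r.raise_for_status()
            return r.json()[0]           # TEI: [[f32 × 1024]] for one input
```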
Retrieval
SELECT id, source_type, source_slug, chunk_index, content, title, section, metadata,
1 - (embedding <=> %s::vector) AS similarity
FROM public.stoka_chunks
WHERE embedding IS NOT NULL
ORDER BY embedding <=> %s::vector
LIMIT 20
Use the same _vector_literal format helper the indexer uses (line 425) — do not depend on pgvector-python being installed in the admin container. Keep vectors as pgvector text literal "[0.01,0.02,...]". Single-query parameter binding; no SQL injection surface.
Confidence classification uses the reranker score post-editorial adjustments, not the raw cosine. The cosine is first-stage recall; the reranker is what the confidence thresholds in spec §Reranking apply to.
Reranking with editorial policy
Call TEI reranker with:
client.post(f"{RERANK_URL}/rerank", json={
"query": query,
"texts": [c.content for c in candidates],
"raw_scores": False, # sigmoid-normalized, easier to threshold
"return_text": False,
"truncate": True,
}, timeout=8.0)
Apply adjustments from spec §Reranking with Editorial Policy (lines 448-488):
| Rule | Multiplier | Source of truth |
|---|---|---|
| Same source slug already in results | ×0.6 once seen | iterate candidates in score order |
| Slug in `session.shown_slugs` (DB) OR client `seen_slugs` | ×0.1 | near-veto |
| `metadata.tags` contains "philosophy" AND `session.turns >= 4` | ×1.3 | deep-reader boost |
| `metadata.pubDate` within last 30 days | ×1.15 | recency |
| `source_type == "product"` AND query looks navigational (regex `^(what is \|tell me about \|do you have) `) | | |
Cap to top 5 for prompt injection. Attach the adjusted score to each chunk so the cite event can pass it downstream.
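A sketch of the multiplier pass, assuming candidates arrive as dicts with `slug`, `score`, `tags`, and `pub_date` keys (the real module keys off `stoka_chunks.metadata`). `apply_editorial_policy` is a hypothetical name, and the product/navigational rule is omitted because its multiplier is elided in the table above:

```python
from datetime import datetime, timedelta, timezone

def apply_editorial_policy(candidates, shown_slugs, seen_slugs, turns):
    seen = set(shown_slugs) | set(seen_slugs)
    now = datetime.now(timezone.utc)
    emitted, out = set(), []
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        s = c["score"]
        if c["slug"] in emitted:
            s *= 0.6                                 # duplicate source slug
        if c["slug"] in seen:
            s *= 0.1                                 # near-veto: already shown/read
        if "philosophy" in c.get("tags", ()) and turns >= 4:
            s *= 1.3                                 # deep-reader boost
        if c.get("pub_date") and now - c["pub_date"] < timedelta(days=30):
            s *= 1.15                                # recency
        emitted.add(c["slug"])
        out.append({**c, "score": s})
    out.sort(key=lambda c: c["score"], reverse=True)
    return out[:5]                                   # cap for prompt injection
```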
Thresholds (spec line 453):
- `direct` ≥ 0.82 — Stoka speaks with authority, top 2-3 cites
- `adjacent` ≥ 0.65 — framed as a stretch ("closest thing i have is this")
- `nothing` < 0.50 — void response, no LLM call at all
The 0.50-0.65 band calls the LLM but injects a [low_confidence=true] marker in the prompt layer 4 so the model frames answers as "closest match" rather than authoritative.
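The thresholds collapse to a four-way classifier; the `low_confidence` label for the 0.50-0.65 band is an assumed name for the `[low_confidence=true]` prompt marker described above:

```python
def classify(top_score: float) -> str:
    # Applied to the post-editorial reranker score, not the raw cosine.
    if top_score >= 0.82:
        return "direct"
    if top_score >= 0.65:
        return "adjacent"
    if top_score < 0.50:
        return "nothing"          # void response, no LLM call
    return "low_confidence"       # 0.50-0.65: LLM call with low-confidence framing
```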
Generation
vLLM /v1/chat/completions. System message = full 5-layer prompt. User message = query. stream=true, max_tokens=280, temperature=0.5, top_p=0.9, stop=["\nvisitor:","</cite"]. No tool calls.
Citation emission strategy: the model is instructed (layer 1) to emit citation markers inline as <cite slug="..." quote="..."/>. A small streaming parser in stoka_bot.py watches the emerging token buffer for <cite ... /> tags and:
- When a complete tag is detected: pause token emission, look up the slug in the retrieved-chunks table, emit an `event: cite` with the full hydrated citation payload (title, type, reading_time, date, score from reranker), and mark the slug as shown_in_session.
- Resume token emission of whatever follows the closing `/>`.
Hydration is done from the retrieved candidates list that we already have in memory — never trust the model to output accurate titles or dates. The model only supplies the slug and the pull quote; everything else comes from stoka_chunks.metadata and stoka_chunks.title loaded during retrieval.
If the model hallucinates a slug not in the top-5 retrieval, drop the cite entirely (log it) and keep streaming. Silence beats a broken citation.
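A sketch of the buffer-splitting core of that parser; `split_cites` is a hypothetical name, and a production version would also bound how long it withholds text after a bare `<`:

```python
import re

CITE_RE = re.compile(r'<cite\s+slug="([^"]+)"\s+quote="([^"]*)"\s*/>')

def split_cites(buffer: str):
    """Split a streamed token buffer into ("text", s) and ("cite", slug, quote)
    parts. Returns (parts, remainder); remainder may hold a tag still arriving."""
    parts, pos = [], 0
    for m in CITE_RE.finditer(buffer):
        if m.start() > pos:
            parts.append(("text", buffer[pos:m.start()]))
        parts.append(("cite", m.group(1), m.group(2)))
        pos = m.end()
    tail = buffer[pos:]
    cut = tail.rfind("<")          # "<" may be the start of a partial tag
    if cut == -1:
        if tail:
            parts.append(("text", tail))
        tail = ""
    else:
        if cut > 0:
            parts.append(("text", tail[:cut]))
        tail = tail[cut:]
    return parts, tail
```

The caller hydrates each `("cite", slug, quote)` part from the in-memory retrieval candidates and drops slugs that miss the lookup.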
Session state
Session is a DB row in public.stoka_sessions (Phase 0 schema line 54). Anonymous visitors get a stoka_visitors row on first turn with supabase_uid = NULL. Authenticated visitors get the row upserted with their Supabase UID.
On each /chat turn:
-- upsert visitor (anon: by cookie UUID; auth: by supabase_uid)
INSERT INTO public.stoka_visitors (id, supabase_uid, last_seen)
VALUES (%s, %s, now())
ON CONFLICT (id) DO UPDATE SET last_seen = now(), supabase_uid = COALESCE(EXCLUDED.supabase_uid, public.stoka_visitors.supabase_uid);
-- upsert session
INSERT INTO public.stoka_sessions (id, visitor_id, shown_slugs, explored_topics, turns, last_active)
VALUES (%s, %s, %s, %s, 1, now())
ON CONFLICT (id) DO UPDATE SET
shown_slugs = (SELECT array_agg(DISTINCT x) FROM unnest(stoka_sessions.shown_slugs || EXCLUDED.shown_slugs) x),
explored_topics = (SELECT array_agg(DISTINCT x) FROM unnest(stoka_sessions.explored_topics || EXCLUDED.explored_topics) x),
turns = stoka_sessions.turns + 1,
last_active = now();
Topic extraction: on route=new, extract the top-1 reranked result's primary tag as the current topic. On route=followup, reuse explored_topics[-1].
A session is "stale" after 2h of inactivity; new turns get a new session row. Old rows stay for analytics.
Semaphores (module-level in stoka_bot.py)
EMBED_SEMAPHORE = asyncio.Semaphore(4)
RERANK_SEMAPHORE = asyncio.Semaphore(2)
GENERATE_SEMAPHORE = asyncio.Semaphore(2) # GPU bottleneck, shared with Debono/ST
GENERATE_SEMAPHORE=2 because Debono and StokaTerminal already hit 8050. vLLM continuous batching handles real concurrency; the semaphore just prevents us from queueing 50 Stoka requests ahead of a Debono deck transcription.
Rate limits (reuse _check_rate_limit, line 226):
- `/chat`: 30 req / 5 min / IP (`stoka_chat:{ip}`)
- `/discover`: 60 req / 5 min / IP
- `/context`: 120 req / 60 s / IP
Response Router
Router is pure-Python heuristics, zero LLM calls. It runs on every turn before any retrieval work. Keep it in stoka_bot.py as route_query(query, session) -> Route.
META_PATTERNS = (
r"^\s*(what|who) (is|are) (this|you|stoka)\b",
r"^\s*(are you|r u) (a )?(bot|ai|human|chatbot|real)\b",
r"^\s*(hi|hey|hello|yo|sup)\s*[!?.]*$",
)
OFFTOPIC_SIGNALS = (
# things Stoka knows nothing about
"react","vue","angular","nextjs","typescript framework","best javascript",
"bitcoin","stock","weather","recipe","homework","write my","draw me",
)
Decision order:
1. If query matches any META_PATTERN → `meta`
2. If query contains any OFFTOPIC_SIGNAL and does not overlap with any indexed `source_slug` tokens → `offtopic`
3. If `session.turns >= 1` AND (query is ≤ 8 tokens OR starts with `"what about"` | `"and"` | `"but"` | `"so"` | `"did it"` | `"tell me more"`) → `followup`
4. Otherwise → `new`
void is not a router outcome — it is assigned after retrieval if scores fall below the nothing threshold.
The router short-circuits explicitly because spec-line 430 is a semi-cautionary example ("Not every query hits the full RAG pipeline"). Every hop to the LLM is ~500ms GPU time we don't need for "hi" or "what is this?".
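A sketch of `route_query` under the decision order above; the `indexed_tokens` parameter stands in for the set of `source_slug` tokens loaded from the DB, and a production version would want word-boundary matching for followup starts:

```python
import re

META_PATTERNS = [re.compile(p, re.I) for p in (
    r"^\s*(what|who) (is|are) (this|you|stoka)\b",
    r"^\s*(are you|r u) (a )?(bot|ai|human|chatbot|real)\b",
    r"^\s*(hi|hey|hello|yo|sup)\s*[!?.]*$",
)]
# Subset of the signal list for illustration.
OFFTOPIC_SIGNALS = ("best javascript", "bitcoin", "weather", "recipe", "homework")
FOLLOWUP_STARTS = ("what about", "and", "but", "so", "did it", "tell me more")

def route_query(query: str, turns: int, indexed_tokens: set[str]) -> str:
    q = query.strip().lower()
    if any(p.match(query) for p in META_PATTERNS):
        return "meta"
    if any(sig in q for sig in OFFTOPIC_SIGNALS) and not (set(q.split()) & indexed_tokens):
        return "offtopic"
    if turns >= 1 and (len(q.split()) <= 8 or q.startswith(FOLLOWUP_STARTS)):
        return "followup"
    return "new"
```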
Prompt Architecture — admin/stoka_prompt.py
Pure-data module. Single function build_messages(query, retrieved_chunks, visitor_ctx, session_state) -> list[dict]. No I/O.
Layer 0 — Identity (frozen string)
You are Stoka. You ARE stokasoftware.com — aware of your own content.
Not a chatbot. Not an assistant. Not a help desk.
The site itself, with opinions about what's published on it.
Layer 1 — NEVER constraints (verbatim, no paraphrasing)
Source of truth: spec §Anti-Patterns (lines 108-122) + §Tone Rules (99-106). Every item lifted directly. Codify as a numbered list because numbered NEVER-lists outperform bullet NEVER-lists in follow-through — Claude Code leak pattern §Constraints.
You NEVER:
1. greet the visitor ("hi", "hello", "welcome", "hey")
2. say "great question" or any variant
3. say "I'd be happy to help" or any sycophancy
4. apologize ("sorry, I don't have…") — just say what you don't have
5. summarize a blog post instead of citing it
6. use exclamation marks
7. use emoji
8. use bullet-point lists of recommendations
9. repeat a citation that was already shown in this session
10. exceed 3 sentences before a cite
11. match the visitor's energy level — you are unshakably yourself
12. offer alternatives when you can't help ("but you might try…")
13. repeat the visitor's question back
14. claim to know anything not in the retrieved content below
15. invent a citation slug that does not appear in [Retrieved Content]
Constraint 15 is what prevents citation hallucination. Pair it with the post-stream slug-hydration guard.
Layer 2 — Voice + golden examples
Lift spec §Voice (lines 87-97) and spec §Few-Shot Golden Examples (lines 533-568) verbatim. Keep the exact 7 examples the spec provides — Shay wrote them, they are the canon. Format as alternating visitor: / stoka: in a single system-message block; do NOT split them into synthetic conversation turns because BGE-indexed Qwen models follow literal demonstration better than role-played examples.
Layer 3 — Retrieved Content (injected per request)
[Retrieved Content — 5 chunks, cite these and ONLY these]
<chunk slug="philosophy-bounded-execution" score="0.87" type="philosophy" tags="philosophy" reading_time="8 min" date="Apr 12">
The Philosophy of Bounded Execution — §Why constraints produce better software
Autonomy without constraint is just chaos with better marketing…
</chunk>
<chunk slug="debono-boss-fight" …>…</chunk>
Include score, type, tags, reading_time, and date as XML attributes: the model picks the citation quote from the chunk content, while the hydration step copies the metadata-sourced attributes into the emitted cite event (the model never types them).
If confidence == "adjacent", prepend [low_confidence — frame as "closest i have"] on its own line before the chunks.
Layer 4 — Visitor Context
[Visitor]
page: /blog/debono-boss-fight
page_title: Debono Boss Fight
scroll_pct: 62
returning: false
declared_tags: []
turns_so_far: 2
already_cited: philosophy-bounded-execution, debono-llm-gpu
current_topic: agent-failure
route: followup
Only include fields that have values. An empty declared_tags line stays; the model uses the presence of the label as a signal that there is nothing to personalize on.
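A sketch of the Layer 4 renderer honoring that rule; `render_visitor` is a hypothetical name:

```python
def render_visitor(ctx: dict) -> str:
    # Only keys present in ctx are rendered. An empty declared_tags list still
    # renders its label line, per the rule above.
    order = ("page", "page_title", "scroll_pct", "returning", "declared_tags",
             "turns_so_far", "already_cited", "current_topic", "route")
    lines = ["[Visitor]"]
    for key in order:
        if key in ctx:
            value = ctx[key]
            if isinstance(value, bool):
                value = str(value).lower()           # "false", not "False"
            elif isinstance(value, (list, tuple)):
                value = ", ".join(value)
            lines.append(f"{key}: {value}")
    return "\n".join(lines)
```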
Layer 5 — Conversation (sliding window)
Not implemented in Phase 2. Hard-code to last-exchange only (previous visitor query + previous Stoka response, stored on stoka_sessions as a new JSONB column last_exchange). Add a migration for this column in the Phase 2 schema additions (see §Schema Additions below).
Rationale: full conversation windowing is Phase 3+ territory when the UI actually exists. Shipping Phase 2 with zero-turn history is fine — most Stoka interactions will be 1-3 turns regardless, and the Phase 3 UI will drive whatever windowing matters.
Final assembly
def build_messages(query, retrieved, visitor, session):
system = "\n\n".join([
LAYER_0_IDENTITY,
LAYER_1_CONSTRAINTS,
LAYER_2_VOICE + "\n\n" + LAYER_2_GOLDEN_EXAMPLES,
render_retrieved(retrieved, confidence=session.confidence),
render_visitor(visitor, session),
])
msgs = [{"role": "system", "content": system}]
if session.last_exchange:
msgs.append({"role": "user", "content": session.last_exchange["user"]})
msgs.append({"role": "assistant", "content": session.last_exchange["assistant"]})
msgs.append({"role": "user", "content": query})
return msgs
Total system prompt budget: ~2,500 tokens worst case (layers 0-2 are ~1,200 static + 1,300 for 5 chunks of ~250 tokens each). Qwen3-VL-8B max_model_len=16384 — we have headroom.
Schema Additions
One new column on stoka_sessions (otherwise Phase 0 schema is sufficient):
ALTER TABLE public.stoka_sessions
ADD COLUMN IF NOT EXISTS last_exchange JSONB;
Apply via admin/stoka_bot_schema.sql append (keep the file idempotent — it already uses CREATE TABLE IF NOT EXISTS / CREATE INDEX IF NOT EXISTS). Execute on container startup through the existing init_db() path in main.py:204:
def init_db() -> None:
_init_pool()
schema_sql = SCHEMA_PATH.read_text()
identity_schema_sql = IDENTITY_SCHEMA_PATH.read_text()
stoka_bot_sql = (Path(__file__).with_name("stoka_bot_schema.sql")).read_text() # NEW
with _db_conn() as conn:
...
Currently stoka_bot_schema.sql is applied only against the separate Supabase-bot DB (host.docker.internal:5445). The admin FastAPI already talks to that same DB (SUPABASE_DB_URL points to it per docker-compose.yml:19), so wiring the schema into init_db() is correct. Verify before adding that SUPABASE_DB_URL in init_db's pool is in fact the same database as STOKA_DB_URL used by the indexer — if divergence exists, fix the env wiring first, do not silently create tables in the wrong DB. This is a Kisame hazard.
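The startup guard can be a trivial URL comparison; `same_database` is a hypothetical helper, and note it compares host/port/dbname literally — it cannot see through aliases like `host.docker.internal` vs `localhost`, so a mismatch means "stop and check the env wiring", not proven divergence:

```python
from urllib.parse import urlparse

def same_database(url_a: str, url_b: str) -> bool:
    # Compare host, port, and database name; credentials may legitimately differ.
    a, b = urlparse(url_a), urlparse(url_b)
    return (a.hostname, a.port, a.path) == (b.hostname, b.port, b.path)
```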
Environment Variables
Add to docker-compose.yml stokasoftware service env block (line 10-44):
- STOKA_EMBED_URL=${STOKA_EMBED_URL:-http://host.docker.internal:8072}
- STOKA_RERANK_URL=${STOKA_RERANK_URL:-http://host.docker.internal:8073}
- STOKA_GEN_URL=${STOKA_GEN_URL:-http://host.docker.internal:8050}
- STOKA_GEN_MODEL=${STOKA_GEN_MODEL:-qwen3-vl-8b-instruct}
- STOKA_BOT_ENABLED=${STOKA_BOT_ENABLED:-true}
Feature flag STOKA_BOT_ENABLED gates route registration; default on but we can kill-switch without a rebuild by flipping the env var. All three URL vars default to the correct Phase 0 ports on host.
Existing EMBED_URL=...:8055 at line 30 is unrelated (Debono's embedder) — leave it alone; do not overload. The renaming to STOKA_EMBED_URL is intentional so future-me doesn't conflate them.
Failure Modes & Handling
Review-miss patterns from institutional memory:
| Failure | Detection | Response |
|---|---|---|
| Embedding service down | httpx timeout/5xx on `/embed` | 503 before stream starts; SSE never opens |
| pgvector query empty (`stoka_chunks` dry) | 0 rows | SSE error event "index is empty", route=void |
| All scores below `nothing` threshold | post-rerank classifier | route=void, stream "nothing here on that. yet." via fake-token chunks (keep terminal feel) |
| Reranker down | httpx timeout | fall back to raw cosine order, log warning, continue (not a 500 — the service degrades gracefully) |
| Generator down | httpx timeout | 503 before stream opens; if already streaming, emit error event and close |
| Model hallucinates slug | hydration lookup miss | drop the cite silently, keep tokens streaming, log count for dashboard |
| Client disconnects mid-stream | asyncio CancelledError | cancel the upstream request to vLLM (release GENERATE_SEMAPHORE); commit session turn to stoka_sessions anyway (partial turn still counts for rate limits) |
| DB pool exhausted | psycopg2 PoolError | 503 + Retry-After 2; do not hold the pool while streaming — see §Connection Management |
| Rate limit exceeded | `_check_rate_limit` raises 429 | let it bubble; caller shows "slow down" |
Every fetch in Phase 3 frontend must have a visible error state in .catch() — not just console.error. Each SSE event is a contract with the UI.
Connection Management (critical — do not diverge from established helpers)
The _db_conn() context manager (main.py:194) holds a psycopg2 connection for the full with block. Do not hold a connection across an SSE stream — that pins one of the pool's 5 connections for the entire generation window and we will starve under concurrent visitors.
Pattern:
# At request entry: fetch session, release conn
with _db_conn() as conn:
visitor, session = _load_or_create(conn, visitor_id, session_id, access_token)
retrieved = _load_retrieval(conn, query_vec) # includes pgvector search
# conn released — no DB held during reranker + stream
# ... rerank + stream ...
# At request exit: persist turn, release again
with _db_conn() as conn:
_commit_turn(conn, visitor, session, shown_slugs, topic, last_exchange)
This matches the "connection management divergence from established helpers should always BLOCK" rule from the past-incident ledger. Single pattern: scoped with blocks, never leaked, never a raw connection passed into an async generator.
Testing — Ship With These Smoke Tests
Minimum acceptance gate. All of these are CURL-driven and runnable against the local compose stack.
1. `POST /api/stoka/chat {"query":"what is this?"}` → SSE opens, `ready.route == "meta"`, response ends with a provocation (`?` or `/>`), no `cite` events, total tokens ≤ 40.
2. `POST /api/stoka/chat {"query":"tell me about bounded execution"}` → `route == "new"`, at least one `cite` event with `slug == "philosophy-bounded-execution"`, score ≥ 0.60.
3. Fire the same request twice with the same `session_id`; the second response MUST NOT cite the slug already in `shown_slugs` (novelty penalty verified).
4. `POST /api/stoka/chat {"query":"best javascript framework?"}` → `route == "offtopic"`, response equals `"wrong terminal."` exactly.
5. `POST /api/stoka/chat {"query":"quantum gravity research"}` → `route == "void"` OR `adjacent`, never a hallucinated cite.
6. `POST /api/stoka/discover` with `page=/blog/philosophy-bounded-execution` → returns 3 recommendations, none equal to the current slug.
7. `GET /api/stoka/context?page=/` → returns hint, cache hit ≤ 50 ms on second call.
8. Kill the `stoka-reranker` container, re-run test 2 → still streams, response still cites (degraded path), logs warning.
9. Kill the `stoka-embedding` container, re-run test 2 → 503 before SSE opens.
10. Run tests 1-5 in parallel × 10 → no 5xx, no pool exhaustion, no "connection already closed" log.
11. Verify `SELECT DISTINCT slug FROM stoka_sessions CROSS JOIN LATERAL unnest(shown_slugs) AS t(slug)` returns only slugs that actually exist in `stoka_chunks` (hallucination protection).
Tests 1-7 are the Sasori smoke-query step from the review-miss register ("hit each new endpoint once against real data and verify response makes sense"). Tests 8-11 are degradation verification.
Automate via a single shell script admin/scripts/stoka_bot_smoke.sh that exits non-zero on any failure. Do not add a pytest suite in Phase 2 — that's Phase 2.5 eval pipeline territory.
Anti-Patterns (explicit — Sasori reviews against this list)
- Do not add a pytest suite to `admin/` in this phase — it's unnecessary until the eval dataset exists (Phase 2.5). Adding untested tests wastes review cycles.
- Do not introduce a new ORM, new DB driver, or new connection helper. Use `_db_conn()`. The cost_events incident and the psycopg2 leak were both "someone built their own conn helper."
- Do not rename ports 8072/8073 back to 8052/8053 in the plan, the code, or the env vars. Phase 0 docs explain why.
- Do not ship without the slug-hydration guard (constraint 15 + post-parse filter). Hallucinated cites are the fastest way to destroy the terminal's credibility on day 1.
- Do not commit generated artifacts — no `.playwright-mcp/`, no `__pycache__`, no `admin/__pycache__`. `dist/` will churn during `npm run build` testing; exclude those diffs from the Phase 2 commit. Reference: the past scope-creep incident where generated artifacts flooded the mission diff.
- Do not touch `backend/main.py` — it is the stale copy moved to SoftwareTsukuyomi. The comment in `admin/main.py:1-6` is accurate: all portal work lives in SoftwareTsukuyomi now; only the admin service lives here.
- Do not add SSO/auth enforcement to `/api/stoka/chat`. Anonymous is the default; auth is an optional enrichment.
- Do not introduce LoRA or eval infrastructure. That's Phase 2.5. Prompt engineering only in Phase 2.
- Do not add a frontend file. Phase 3 owns `src/components/StokaTerminal.astro` and friends.
- Do not block on the legacy `EMBED_URL=...:8055` (Debono) or the scoped `STOKA_EMBED_URL=...:8072` (ours) being the same. They are different embedders for different apps. Do not deduplicate.
Dependencies
admin/requirements.txt additions — verify first whether these are already present (the codebase already uses httpx, psycopg2, fastapi, pydantic). Likely zero new deps. If sse-starlette is not already a dep, prefer hand-rolled SSE over adding it — FastAPI's StreamingResponse with a correctly-formatted async generator is enough and avoids a new package.
# Hand-rolled SSE format (plain function — nothing here needs await):
def sse_format(event: str, data: dict) -> bytes:
    return f"event: {event}\ndata: {json.dumps(data, separators=(',', ':'))}\n\n".encode()
Execution Order (for the coder — Kisame primary)
1. Read everything first. Skim `admin/main.py` fully once. Read `admin/stoka_indexer.py` fully. Read `admin/stoka_bot_schema.sql`. Don't start editing until the mental model is complete.
2. Verify env wiring. Prove to yourself that `SUPABASE_DB_URL` and `STOKA_DB_URL` point at the same Postgres instance. If they don't, fix that before anything else.
3. Write `admin/stoka_prompt.py` first. It's pure data, no I/O, fastest to review. Include the 7 golden examples verbatim from spec lines 533-568.
4. Write `admin/stoka_bot.py` — router, retrieval, reranker, streaming parser, session persistence. No FastAPI routes yet; just pure async functions.
5. Wire routes into `admin/main.py` — three decorators, reuse existing helpers. Add the `init_db()` change to apply the `stoka_bot_schema.sql` append (new column only).
6. Append the migration to `admin/stoka_bot_schema.sql` for the `last_exchange` column.
7. Add env vars to the `docker-compose.yml` service block.
8. Rebuild: `docker compose up -d --build stokasoftware`.
9. Run smoke tests (`admin/scripts/stoka_bot_smoke.sh`, tests 1-11).
10. Commit in one clean commit. Generated files stripped. Diff reviewable.
Estimated LOC: stoka_bot.py ~450, stoka_prompt.py ~200, main.py +60, stoka_bot_schema.sql +3, docker-compose.yml +5. Total ~720 lines.
Success Criteria
- SSE streams char-by-char visibly in a raw `curl -N` session within 2 s of request start
- Every `cite` event carries `slug`, `title`, `quote`, `type`, `reading_time`, `date`, `score`
- No cite event carries a slug absent from `stoka_chunks`
- `route == "meta"` and `route == "offtopic"` return in ≤ 150 ms end-to-end (no GPU call)
- `route == "new"` end-to-end p50 ≤ 1500 ms, p95 ≤ 2500 ms on idle GPU
- Concurrent Debono + Stoka traffic does not deadlock (manual test: upload deck while running smoke test 2)
- Rebuilding the admin container with `STOKA_BOT_ENABLED=false` removes all stoka routes from `/openapi.json` (kill-switch works)
- `git diff` on the Phase 2 commit touches only: `admin/stoka_bot.py`, `admin/stoka_prompt.py`, `admin/stoka_bot_schema.sql`, `admin/main.py`, `admin/scripts/stoka_bot_smoke.sh`, `docker-compose.yml`, `specs/stoka-bot-phase2-plan.md`. Nothing in `dist/`, `node_modules/`, or `src/`.
Konan, Phase 2 plan. Paper bomb set — unfold, execute, fold back flat.