When I started building Debono, the obvious choice was to call OpenAI’s API. Most startups do. But after running the numbers, I chose a different path: self-hosted open-source models on my own GPUs.

The Cost Math

A single pharmacy student generating flashcards from a 200-slide lecture deck might trigger 50-100 API calls. At GPT-4 pricing, that’s $2-5 per study session. Multiply by thousands of students and the margins evaporate.

Running Qwen3-VL-8B on my own RTX 3090s costs almost nothing per inference beyond electricity once the hardware is paid for. That’s how I offer a $15/month plan that would be unprofitable at $150/month on OpenAI’s API.
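Here is a rough sketch of that arithmetic, if you want to plug in your own numbers. The $2-5 per session range comes from the estimate above; the function names and the 20-sessions-a-month usage level are assumptions made up for illustration, not measured figures.

```python
# Figures taken from the post: $2-5 per study session on a hosted API.
SESSION_COST_LOW = 2.00
SESSION_COST_HIGH = 5.00


def break_even_sessions(plan_price: float, cost_per_session: float) -> float:
    """Sessions per month at which API spend equals the subscription price."""
    if cost_per_session <= 0:
        raise ValueError("cost_per_session must be positive")
    return plan_price / cost_per_session


def monthly_margin(plan_price: float, sessions: int, cost_per_session: float) -> float:
    """Subscription revenue minus API spend for one student in one month."""
    return plan_price - sessions * cost_per_session


if __name__ == "__main__":
    for price in (15.0, 150.0):
        fewest = break_even_sessions(price, SESSION_COST_HIGH)
        most = break_even_sessions(price, SESSION_COST_LOW)
        print(f"${price:.0f}/month plan breaks even at {fewest:g}-{most:g} sessions")

    # ASSUMPTION: 20 sessions a month is a hypothetical usage level.
    print(monthly_margin(15.0, 20, SESSION_COST_LOW))
```

On these inputs, a $15 plan stops covering its API bill somewhere between 3 and 7.5 sessions a month, and a $150 plan somewhere between 30 and 75. A student who studies daily lands in or past that second range, which is the scenario the $150 comparison depends on.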

Speed

Self-hosted models on local GPUs respond in 50-200ms. No extra round-trip to a third-party cloud API. No queue. No rate limits. When a student is grinding through flashcards, that latency gap separates flow state from frustration.
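If you want to check latency numbers like these on your own setup, a minimal timing harness might look like the sketch below. It is not the code Debono uses: `measure_latency_ms` and `fake_inference` are hypothetical names, and `fake_inference` is a sleep-based stand-in that you would swap for a real call to your model server.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(call: Callable[[], object], runs: int = 20) -> dict:
    """Time `call` repeatedly and summarize wall-clock latency in milliseconds."""
    if runs < 2:
        raise ValueError("runs must be at least 2")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index for the 95th percentile, clamped to the last sample.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "min": samples[0],
        "median": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }


def fake_inference() -> str:
    # ASSUMPTION: stand-in for a real model call; it only sleeps for about 10 ms.
    time.sleep(0.01)
    return "flashcard"


if __name__ == "__main__":
    stats = measure_latency_ms(fake_inference, runs=10)
    print({name: round(value, 1) for name, value in stats.items()})
```

Looking at the median and the 95th percentile rather than a single run matters here, because one slow response in twenty is what a student actually notices mid-session.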

Privacy

Pharmacy students study real drug interactions, dosing protocols, and clinical scenarios. That data shouldn’t leave the building. With self-hosted models, it doesn’t. No third-party terms of service to worry about. No data training opt-outs to configure.

The Tradeoff

Self-hosting means I maintain the infrastructure myself: GPU drivers, model updates, memory management, failover. It’s real work. But for this use case, the math is clear: better margins, faster responses, stronger privacy guarantees.

Not every product needs self-hosted AI. But if you’re building something where cost-per-inference matters and data privacy is non-negotiable, it’s worth doing the math yourself.

I apply this same self-hosted approach across the entire portfolio: Debono for pharmacy NAPLEX prep, Distilio for general-purpose study tools, and Stoka for AI-powered life coaching. Every product runs on my own GPUs with sub-200ms response times and zero data leaving the servers.