Most AI writing tools work the same way: you paste a sample into a system prompt, ask the model to “match this style,” and hope for the best. It works okay for tone. It falls apart for everything else: sentence rhythm, vocabulary grade, hedging patterns, contraction frequency, the specific way you transition between ideas.

GhostRite doesn’t prompt-engineer your voice. We train it into the model weights.

This post explains exactly how, at three levels of depth. Pick your lane.


The 30-Second Version

You upload writing samples. We extract 17 measurable voice features from your text. Then we fine-tune a language model on your actual writing using a technique called LoRA, so the model learns your patterns at the weight level. When you generate content, it’s not imitating you. It is you, computationally.


The 5-Minute Version

Stage 1: Voice Feature Extraction

When you upload writing samples, our voice analyzer doesn’t just read your text; it dissects it. We extract 17 distinct features across five categories:

Readability metrics: Flesch-Kincaid grade, Gunning Fog index, automated readability index. These tell us whether you write at a 6th-grade level or a graduate level, and how consistently.

Syntax patterns: Average sentence length, sentence length variance, clause depth, question frequency. Some people write in short punchy fragments. Others build complex nested clauses. Both are valid voices, but they’re measurably different.

Vocabulary profile: Type-token ratio (vocabulary diversity), hapax legomena ratio (how many words you use exactly once), average word length, contraction frequency. This is where “sounds like you” becomes quantifiable.

Discourse markers: Hedging frequency (“perhaps,” “it seems,” “arguably”), transition patterns, paragraph length distribution. These are the unconscious habits that make your writing yours.

Style signals: Passive voice ratio, first-person pronoun frequency, exclamation usage, parenthetical asides. The stuff you’d never think to specify in a prompt.

These 17 features become your voice fingerprint, a numerical profile that’s unique to you. We use it during training to weight which patterns matter most, and during inference as a quality gate.
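To make the idea of a fingerprint concrete, here is a minimal sketch that computes a handful of the features named above using only the standard library. It is not GhostRite’s analyzer: the function name `voice_features`, the naive sentence splitter, and the contraction regex are all assumptions made for illustration, and a production analyzer would need a proper tokenizer.

```python
import re
from statistics import mean, pvariance

# Hypothetical helpers for illustration only.
WORD = re.compile(r"[a-z']+")
# Rough contraction matcher; it also counts possessive 's, which a real
# analyzer would want to separate out.
CONTRACTION = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)

def voice_features(text: str) -> dict:
    """Compute a few syntax and vocabulary features for non-empty text."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = WORD.findall(text.lower())
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    sent_lengths = [len(WORD.findall(s.lower())) for s in sentences]
    return {
        "avg_sentence_length": mean(sent_lengths),
        "sentence_length_variance": pvariance(sent_lengths),
        "type_token_ratio": len(counts) / len(words),
        "hapax_ratio": sum(1 for c in counts.values() if c == 1) / len(words),
        "avg_word_length": mean([len(w) for w in words]),
        "contraction_frequency": len(CONTRACTION.findall(text)) / len(words),
        "question_frequency": sum(s.endswith("?") for s in sentences) / len(sentences),
    }

profile = voice_features("I don't know. Do you? I know it's fine.")
```

Each value is a plain number, so a full profile can be stored as a fixed-order vector and compared across texts.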

Stage 2: Embedding and Chunking

Your writing samples get chunked into passages and embedded using sentence-transformers (all-MiniLM-L6-v2). These embeddings serve two purposes:

  1. RAG retrieval during generation: when GhostRite writes new content, it retrieves your most relevant passages to use as context, grounding the output in your actual writing rather than a statistical average.

  2. Training data preparation: passages are formatted into conversation pairs where the model learns to produce text that matches your patterns, not generic assistant output.
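A rough sketch of both uses follows, assuming paragraph-based chunking, a ChatML-style pair format, and cosine-similarity retrieval. The function names, the `max_words` limit, and the idea of pairing each passage with a synthetic instruction are hypothetical. The toy vectors stand in for real embeddings, which in practice would come from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(passages)`.

```python
import math

def chunk_passages(text: str, max_words: int = 120) -> list[str]:
    """Group paragraphs into passages of at most roughly max_words words."""
    chunks, current, size = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and size + n > max_words:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def to_chatml(instruction: str, passage: str) -> str:
    """Wrap one passage as a user/assistant training pair."""
    return (
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{passage}<|im_end|>\n"
    )

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, passage_vecs, k: int = 3) -> list[int]:
    """Indices of the k passages most similar to the query vector."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

chunks = chunk_passages("a b c\n\nd e\n\nf g h i", max_words=5)
example = to_chatml("Write about remote work.", chunks[0])
best = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
```

Chunking on paragraph boundaries keeps each passage a coherent unit of your writing, which matters more for voice than hitting an exact token count.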

Stage 3: LoRA Fine-Tuning

This is where GhostRite diverges from every other tool.

LoRA (Low-Rank Adaptation) is a technique that adds small trainable matrices to a frozen base model. Instead of retraining billions of parameters, we train a lightweight “adapter”, typically 0.1-0.5% of the model’s total parameters, that shifts the model’s behavior toward your writing style.

Here’s what matters about our approach:

  • Base model: Qwen2.5-7B-Instruct (text-only, no vision overhead)
  • Quantization: 4-bit QLoRA (NF4) so the base model fits alongside our production inference server on a single RTX 3090
  • Target modules: We apply LoRA to the attention layers, specifically q_proj, k_proj, v_proj, and o_proj. These are the layers that control what the model pays attention to and how it combines information, which is exactly where writing style lives.
  • Loss masking: We only train on the assistant (output) tokens. The prompt/context tokens are masked with -100 labels. This means the model learns to produce text in your voice, not just to recognize it.

The result is an adapter file, typically 50-100MB. Loaded alongside the base model, it causes the model to generate text in your specific voice.

Stage 4: Inference

When you request content, the pipeline works like this:

  1. Your prompt + any reference documents go through the voice analyzer to extract context
  2. RAG retrieval pulls your most relevant writing samples
  3. The base model + your personal LoRA adapter generates the output
  4. Post-processing validates the output against your voice fingerprint, checking that readability, sentence patterns, and vocabulary stay within your measured range

If the output drifts too far from your profile, it gets flagged for revision.
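One plausible way to implement that gate is a per-feature tolerance check. The sketch below assumes the profile stores a mean and standard deviation per feature and flags anything more than two standard deviations out. The `drifted_features` name, the profile layout, and the tolerance value are illustrative assumptions, not the production rule.

```python
def drifted_features(generated: dict, profile: dict, tolerance: float = 2.0) -> list[str]:
    """Return the features whose generated value sits more than `tolerance`
    standard deviations from the user's mean for that feature.

    `profile` maps feature name -> (mean, std); this layout is hypothetical.
    """
    flagged = []
    for name, (mean, std) in profile.items():
        value = generated[name]
        if std == 0:
            # No measured spread: any deviation counts as drift.
            if value != mean:
                flagged.append(name)
        elif abs(value - mean) / std > tolerance:
            flagged.append(name)
    return flagged

profile = {
    "avg_sentence_length": (18.0, 4.0),
    "type_token_ratio": (0.55, 0.05),
}
generated = {"avg_sentence_length": 21.0, "type_token_ratio": 0.40}
needs_revision = drifted_features(generated, profile)
```

In this toy case the sentence length is within tolerance, while the vocabulary diversity is three standard deviations low, so only that feature is flagged.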


The Deep Dive

Why LoRA Instead of Full Fine-Tuning?

Full fine-tuning of a 7B model requires 56+ GB of VRAM in fp16, more than our entire GPU budget. More importantly, it’s overkill. Your voice isn’t a fundamentally different capability from what the base model already has. It’s a bias, a tendency to prefer certain word choices, structures, and rhythms over others.

LoRA captures this elegantly. By decomposing weight updates into low-rank matrices (rank 16 in our config), we’re saying: “the difference between generic writing and your writing can be expressed in a relatively low-dimensional space.” This turns out to be true. Writing style is surprisingly compressible.
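The compression is easy to check with arithmetic. A rank-r update to a d_out x d_in weight matrix needs r * (d_in + d_out) parameters instead of d_in * d_out. The sketch below counts adapter parameters for the four attention projections. The layer dimensions are assumed values for Qwen2.5-7B (hidden size 3584, 28 layers, grouped-query attention with a 512-wide key/value projection) and the 7.6 billion total is approximate, so verify both against the model’s config before relying on the numbers.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Assumed dimensions, not taken from the post.
HIDDEN = 3584
KV_DIM = 512           # 4 key/value heads x 128 head dim (assumed)
LAYERS = 28
TOTAL_PARAMS = 7.6e9   # approximate base model size (assumed)
R = 16

shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
adapter_params = per_layer * LAYERS
fraction = adapter_params / TOTAL_PARAMS

print(f"{adapter_params:,} trainable parameters ({fraction:.2%} of the base model)")
```

Under these assumptions the adapter comes to roughly ten million trainable parameters, a little over a tenth of a percent of the base model.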

Our configuration:

lora:
  r: 16           # Rank, expressiveness of the adapter
  lora_alpha: 32   # Scaling factor (adapter update is multiplied by alpha/r = 2)
  lora_dropout: 0.05
  target_modules:  # Only attention projection layers
    - q_proj
    - k_proj
    - v_proj
    - o_proj

Rank 16 gives us enough capacity to capture voice nuance without overfitting on small datasets. We tested r=8 (too compressed, lost vocabulary diversity) and r=32 (overfitting, started memorizing specific sentences). 16 is the sweet spot for voice.

Alpha 32 (2x rank) means the adapter’s influence is amplified. At alpha=16 (1x), the voice effect was too subtle. At alpha=64 (4x), it became a caricature. 2x produces writing that reads naturally while maintaining clear voice identity.
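The role of alpha is easiest to see in the LoRA forward pass itself: the adapted layer computes W x + (alpha / r) * B (A x), where W is frozen and only A and B are trained. Here is a toy, pure-Python sketch with 2-dimensional vectors and rank 1. The matrices are made up for illustration and are nothing like real adapter weights.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity, for clarity)
A = [[1.0, 0.0]]              # r x d_in, with r = 1
B = [[0.0], [1.0]]            # d_out x r
x = [3.0, 4.0]

subtle = lora_forward(x, W, A, B, alpha=1, r=1)  # adapter contributes 1x
strong = lora_forward(x, W, A, B, alpha=2, r=1)  # adapter contributes 2x
```

Doubling alpha doubles the adapter’s contribution to the output while the base model’s contribution is unchanged, which is why too small a ratio reads as subtle and too large a ratio reads as caricature.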

The Loss Masking Problem

A naive approach would train on the entire conversation: prompt + response. This teaches the model to predict your prompts as well as your responses, which is wasteful and can cause the model to “leak” prompt patterns into generated text.

Our custom DataCollatorForCompletionOnly solves this by finding the <|im_start|>assistant\n boundary in each training example and masking everything before it:

[MASKED -100] <|im_start|>user
What's your take on remote work?<|im_end|>
[TRAINED] <|im_start|>assistant
Honestly, I've gone back and forth on this...<|im_end|>

The model only learns from the tokens it needs to generate. This is critical for voice quality. Without it, the adapter tries to learn both “how to read prompts” and “how to sound like you,” diluting both.
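The masking step can be sketched on plain token-ID lists. This is not our collator’s source: `mask_prompt` and `find_subsequence` are hypothetical names, the token IDs are made up, and the sketch handles only the first assistant turn. Whether the marker tokens themselves are masked is exposed as a flag because it is a design choice. The value -100 is the default `ignore_index` of PyTorch’s cross-entropy loss, so positions carrying that label contribute nothing to the gradient.

```python
IGNORE_INDEX = -100  # default ignore_index for PyTorch cross-entropy loss

def find_subsequence(seq, pattern) -> int:
    """Index of the first occurrence of pattern in seq, or -1 if absent."""
    n = len(pattern)
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == pattern:
            return i
    return -1

def mask_prompt(input_ids, template_ids, mask_template: bool = True):
    """Build labels that ignore everything before the assistant response.

    template_ids is the tokenized assistant marker. If the marker is
    missing, the whole example is masked so it adds nothing to the loss.
    """
    start = find_subsequence(input_ids, template_ids)
    if start == -1:
        return [IGNORE_INDEX] * len(input_ids)
    boundary = start + len(template_ids) if mask_template else start
    return [IGNORE_INDEX] * boundary + list(input_ids[boundary:])

# Made-up IDs: [90, 91] stands in for the tokenized assistant marker.
ids = [1, 2, 3, 90, 91, 5, 6, 7]
labels = mask_prompt(ids, [90, 91])
```

Masking an example entirely when the marker is missing is the conservative choice: a malformed example is skipped by the loss rather than teaching the model to reproduce prompts.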

QLoRA: Making It Fit

Our production setup runs on 2x RTX 3090s (24GB each). The primary inference model (Qwen3-VL-8B) occupies both GPUs via tensor parallelism. Voice training needs to coexist.

4-bit NF4 quantization reduces the 7B base model from ~14GB (fp16) to ~4GB, leaving room for the LoRA adapter, optimizer states, and gradient buffers. Double quantization (bnb_4bit_use_double_quant=True) squeezes out another ~0.4GB by quantizing the quantization constants themselves.
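These figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes a nominal 7 billion parameters and the block sizes described in the QLoRA paper as we understand them: one 32-bit constant per 64 weights without double quantization, versus 8-bit constants plus one 32-bit constant per 256 blocks with it. It ignores any layers left unquantized, so treat the results as estimates.

```python
def gigabytes(n_params: float, bits_per_param: float) -> float:
    """Storage for n_params values at the given bit width, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7.0e9  # nominal size; an assumption for this estimate

fp16_gb = gigabytes(N_PARAMS, 16)
nf4_gb = gigabytes(N_PARAMS, 4)

# Overhead of quantization constants, in bits per weight (assumed block sizes).
plain_overhead = 32 / 64                    # one fp32 constant per 64 weights
double_overhead = 8 / 64 + 32 / (64 * 256)  # 8-bit constants, fp32 per 256 blocks
saved_bits = plain_overhead - double_overhead
saved_gb = gigabytes(N_PARAMS, saved_bits)

print(f"fp16 weights: {fp16_gb:.1f} GB")
print(f"NF4 weights plus constants: {nf4_gb + gigabytes(N_PARAMS, plain_overhead):.2f} GB")
print(f"Saved by double quantization: {saved_gb:.2f} GB")
```

The estimate lands near the numbers quoted above: about 14GB in fp16, about 4GB in NF4 once constants are included, and a few hundred megabytes saved by double quantization.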

The training config is tuned for memory efficiency:

training:
  per_device_train_batch_size: 2
  gradient_accumulation_steps: 8    # Effective batch size: 16
  gradient_checkpointing: true      # Trade compute for memory
  bf16: true                        # BFloat16 for stability
  max_seq_length: 1024              # Voice passages need more context

Gradient checkpointing is the big one: it reduces peak memory by ~40% by recomputing activations during the backward pass instead of storing them. The trade-off is ~30% slower training, which is fine for a job that runs in minutes, not hours.

Voice Evaluation: How We Know It Works

Training loss going down doesn’t mean the voice is right. We run blind evaluations:

  1. A/B test: Mix real writing samples with GhostRite output. Can evaluators tell which is which? Our target is <30% correct identification (chance = 50%).

  2. Feature distance: Compare the 17-feature voice fingerprint of generated text against the training data fingerprint. We measure L2 distance; anything above a set threshold triggers a training restart with adjusted hyperparameters.

  3. Vocabulary overlap: Generated text should use your vocabulary, not the base model’s default vocabulary. We measure Jaccard similarity between the set of words in the generated text and the set of words in your samples.

  4. Degeneration checks: Watch for repetition loops, vocabulary collapse, or sudden formality shifts that indicate overfitting or mode collapse.
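The two numeric checks in that list are simple to sketch. The version below assumes fingerprints are stored as dictionaries and that each feature is divided by a per-feature scale before taking the L2 distance, since raw features have very different units. The scaling step and both function names are illustrative assumptions.

```python
import math

def l2_distance(a: dict, b: dict, scale: dict) -> float:
    """Euclidean distance between two fingerprints after per-feature scaling."""
    return math.sqrt(sum(((a[k] - b[k]) / scale[k]) ** 2 for k in scale))

def jaccard(words_a, words_b) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

distance = l2_distance(
    {"avg_sentence_length": 3.0, "hedging_frequency": 0.0},
    {"avg_sentence_length": 0.0, "hedging_frequency": 4.0},
    scale={"avg_sentence_length": 1.0, "hedging_frequency": 1.0},
)
overlap = jaccard(["honestly", "back", "forth"], ["back", "forth", "leverage"])
```

Without scaling, a feature measured in words per sentence would swamp one measured as a ratio between 0 and 1, so some normalization is needed before a single distance threshold means anything.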

What’s Next

The current system trains one adapter per user on their writing samples. The roadmap:

  • V2: Reactive voice: Train on your reactions to external content (comments, replies, annotations) so the model can respond to new articles and discussions the way you would.
  • V3: Long-form generation: Multi-section documents where the voice needs to stay consistent across 2,000+ words. Requires curriculum training and longer sequence lengths.
  • V4: Multi-domain adapters: Your email voice isn’t your blog voice. Train separate LoRA adapters per context and switch at inference time. The base model stays frozen; only the adapter swaps.

The Bottom Line

Prompt engineering is like describing your voice to a stranger and hoping they can imitate it. LoRA fine-tuning is like handing them a recording and letting them practice until they get it right, except the “practice” is mathematically optimized gradient descent across millions of parameters.

The result: content that passes the “did I write this?” test. Not because it was carefully crafted to sound generic-but-acceptable, but because the model literally learned your patterns.

That’s what “baked into AI” means.


GhostRite is built by Stoka Software. Try it free at ghostrite.ai.