Since our rms. AI chatbot is built on a RAG architecture (ChromaDB) and can flexibly switch between models such as GPT, Gemini, Mistral, and Llama, we have four levers to stop overthinking: reranking, data quality, retrieval logic, and prompting.
1. The secret weapon: the reranker as a quality filter
A classic problem with vector databases like Chroma: they find documents that are semantically similar but do not necessarily contain the answer. If we hand the AI 10 mediocre chunks, it laboriously tries to connect them - and the overthinking begins.
The solution: Two-stage retrieval
- Stage 1 (Chroma): In a first step, we retrieve the top 20 most relevant text passages.
- Stage 2 (Reranker): A special model (Zeroentropy or Jina) re-evaluates these 20 passages and sorts them strictly according to relevance.
The result: the AI only receives the top hits. Less noise means less pondering. The AI no longer has to resolve contradictions between "almost correct" documents.
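The two stages above can be sketched in a few lines. This is a minimal illustration, not our production code: the `score_pair` function is a toy lexical-overlap score standing in for a real reranker model (e.g. a ZeroEntropy or Jina cross-encoder API call), and the Chroma query for stage 1 is shown as a comment.

```python
# Stage 2 of two-stage retrieval: re-score the candidates that Chroma
# returned in stage 1 and keep only the strongest hits.

def score_pair(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage.
    Swap this for a real reranker (ZeroEntropy, Jina) in production."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Sort the stage-1 candidates by reranker score and keep only top_k."""
    return sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)[:top_k]

# Stage 1 (Chroma) would look roughly like:
#   hits = collection.query(query_texts=[query], n_results=20)
#   candidates = hits["documents"][0]
candidates = [
    "Travel expenses are reimbursed within 30 days.",
    "The cafeteria opens at 8 am.",
    "You file travel expenses via the HR portal.",
]
top = rerank("file travel expenses", candidates, top_k=1)
```

The AI now sees only `top`, the single best passage, instead of 20 partly-relevant ones.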
2. Structure in memory: semantic chunking
A model often overthinks because it receives information in fragments. If a paragraph is cut mid-sentence by a hard character limit (e.g. after 1,000 characters), the AI is missing facts - and it tries to fill the gap with logical assumptions.
Activation of semantic chunking: We use algorithms that separate texts at thematic boundaries. A coherent thought remains a coherent block.
Advantage: The AI no longer has to "guess" what was in the missing part of the sentence. This immediately reduces the cognitive load.
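A minimal sketch of the idea: start a new chunk whenever consecutive sentences drift apart thematically. The Jaccard word overlap below is a toy stand-in for the sentence embeddings a real semantic chunker would use; the threshold value is illustrative.

```python
# Semantic chunking sketch: split at thematic boundaries instead of a
# hard character limit, so a coherent thought stays one coherent block.

def similarity(a: str, b: str) -> float:
    """Toy semantic similarity: Jaccard overlap of lowercase word sets.
    A production chunker would compare sentence embeddings instead."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; open a new chunk when the topic shifts."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if similarity(prev, cur) >= threshold:
            chunks[-1].append(cur)   # same topic: keep the thought together
        else:
            chunks.append([cur])     # topic shift: start a new chunk
    return chunks

sentences = [
    "The travel policy covers flights and hotels.",
    "The travel policy also covers rail tickets.",
    "Our cafeteria menu changes every week.",
]
```

Here `semantic_chunks(sentences)` keeps the two travel-policy sentences together and puts the cafeteria sentence in its own chunk.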
3. Data hygiene in Chroma: no duplicates, no legacy data
Reasoning models are extremely allergic to redundancy.
- No duplicates: if the database contains two identical project reports, the AI wastes time analyzing whether there are tiny differences between them.
- No contradictions: outdated information (e.g. the 2022 vs. the 2024 travel expenses guideline) must be cleaned up. If the AI finds both, it hits a logical impasse over which document takes priority.
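Both hygiene rules can be enforced before ingestion. The sketch below drops exact duplicates via a content hash and keeps only the newest revision per title; the `title`/`year`/`text` field names are assumptions about our metadata schema, not a fixed format.

```python
import hashlib

# Data hygiene before ingesting into Chroma: drop exact duplicates and
# keep only the latest revision of documents that share a title.

def clean_corpus(docs: list[dict]) -> list[dict]:
    """docs: [{"title": ..., "year": ..., "text": ...}, ...] (assumed schema)."""
    seen_hashes: set[str] = set()
    newest: dict[str, dict] = {}
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                      # exact duplicate: skip it
        seen_hashes.add(digest)
        title = doc["title"]
        if title not in newest or doc["year"] > newest[title]["year"]:
            newest[title] = doc           # keep only the latest revision
    return list(newest.values())

docs = [
    {"title": "Travel expenses guideline", "year": 2022, "text": "Old rules."},
    {"title": "Travel expenses guideline", "year": 2024, "text": "New rules."},
    {"title": "Travel expenses guideline", "year": 2024, "text": "New rules."},
]
```

Running `clean_corpus(docs)` here leaves exactly one document: the 2024 guideline.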
4. The "universal prompt" for all LLMs
Regardless of whether we are currently using Gemini, Mistral, Llama or ChatGPT - these strategies in the system prompt prevent the models from digressing:
The "Rereading" strategy (RE2)
We instruct the AI to fully absorb the context before its logic engine starts. In the system prompt, this looks something like this:
"Critically read the documents provided from the database twice. Identify the facts and ignore irrelevant filler sentences. Only then answer."
Channel the thought process
Instead of "Think", we say: "Work efficiently".
- Constraint prompting: "Only use chain-of-thought for complex calculations. For factual questions from the database, answer directly and without a rambling introduction."
- Output format: force the model into tables or bullet points where it suits the desired answer. This channels its energy into structuring the answer rather than philosophizing about it.
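Putting the three ideas together, a model-agnostic system prompt might be assembled like this. The exact wording is illustrative, not a fixed template from our codebase; the message format shown is the common chat-completions shape that GPT, Gemini, Mistral, and Llama backends all accept in some form.

```python
# Combine RE2 (rereading), constraint prompting, and a forced output
# format into one system prompt that works across LLM backends.

SYSTEM_PROMPT = """\
Critically read the documents provided from the database twice.
Identify the facts and ignore irrelevant filler sentences. Only then answer.

Rules:
- Use chain-of-thought only for complex calculations.
- For factual questions from the database, answer directly, with no preamble.
- When listing facts, answer as a bullet list or a table.
"""

def build_messages(context: str, question: str) -> list[dict]:
    """Assemble a chat payload usable with GPT, Gemini, Mistral, or Llama."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
    ]
```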
Model check: Who needs what?
| Model / Technology | Strategy against overthinking |
|---|---|
| Reranker | Reduces context noise and keeps the AI from mulling over irrelevant chunks. |
| OpenAI | Needs clear negative constraints (what should NOT be analyzed?). |
| Gemini | Benefits massively from semantic chunking to keep its huge context window clean. |
| Llama / Mistral | Respond best to few-shot examples (show the AI a perfect, short answer). |
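Since we switch between these backends, the table above can live in code as a small dispatch map. The keys and prompt snippets below are illustrative assumptions, not a tested production config.

```python
# Per-backend anti-overthinking knobs, mirroring the table above.
# All keys and values are illustrative placeholders.

MODEL_STRATEGIES = {
    "openai":  {"extra_prompt": "Do NOT analyze document formatting or metadata."},
    "gemini":  {"chunking": "semantic"},   # keep the large context window clean
    "llama":   {"few_shot": True},         # prepend one perfect, short example answer
    "mistral": {"few_shot": True},
}

def strategy_for(model: str) -> dict:
    """Return the backend's strategy; unknown backends get an empty one."""
    return MODEL_STRATEGIES.get(model, {})
```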
Conclusion: quality beats quantity
Overthinking in our company chatbot is usually a sign that the AI has been fed too much irrelevant or too little relevant information. With a reranker and semantic chunking, we give the AI exactly what it needs: the plain facts, without the noise.