Introduction
Humans routinely forget details but understand meaning; LLMs do the opposite.
They remember a statistical imprint of the world, yet lack access to the documents we rely on every day. In a world where information changes faster than any model can be retrained, the challenge is not to create a model that knows everything, but a system that knows where to look.
Retrieval-Augmented Generation exists to bridge this gap.
Retrieval-Augmented Generation (RAG) has become one of the most practical architectural patterns for building AI systems that need to answer questions using private, domain-specific or frequently updated information. Instead of relying solely on a large language model, a RAG system retrieves relevant context from an external knowledge base and provides it to the model at query time. This allows the system to produce answers that are factual, grounded in real documents, and far less prone to hallucination.
The solution described in this post is the RAG system developed at Biotz: a Clojure-based application that processes documents, indexes them in a vector database, and uses them as trusted context to generate precise, well-grounded answers. This post walks through how the system works internally, from ingestion to retrieval.
Technical Introduction (TL;DR):
At a high level, the RAG system operates in two phases:
Phase 1: Ingestion and Indexing:
OCR → Splitting → Embedding each segment → Upserting embeddings into Qdrant → Storing original text and metadata in Postgres
Phase 2: Retrieval and Generation:
Embed the user’s query → Perform vector search in Qdrant → Rerank results* → Build the final prompt (system prompt + context + query) → Call the configured LLM → Return the grounded answer. (Both phases are sketched in code below.)
Other important concepts:
Buckets: independent knowledge bases with isolated document sets.
System prompts: each bucket can define its own system prompt in Postgres, which sets the assistant’s behaviour.
Infrastructure: Clojure + Integrant + Reitit/http-kit, Qdrant, Postgres
Deployment: can run either in the cloud (AWS, GCP, Azure) or fully on-prem (Mac Mini, Mac Studio…)
* Reranking is optional and depends on configuration.
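To make the two phases more tangible, here is a compact sketch of both flows as plain Clojure functions. Every function name below is a hypothetical placeholder for the corresponding step in the TL;DR, not Genie's actual API:

;; Hypothetical placeholders for the steps listed above; none of these
;; names are Genie's real functions.
(defn ingest-document [doc]
  (->> (ocr-if-needed doc)          ; Phase 1: OCR when needed
       split-into-segments          ; splitting
       (map embed-segment)          ; embedding each segment
       (run! upsert-into-qdrant))   ; upserting into Qdrant
  (store-text-and-metadata! doc))   ; original text + metadata into Postgres

(defn answer-question [bucket-id query]
  (let [q-vec  (embed-query query)                    ; Phase 2: embed the query
        hits   (vector-search bucket-id q-vec)        ; vector search in Qdrant
        ranked (maybe-rerank query hits)              ; optional reranking
        prompt (build-prompt bucket-id ranked query)] ; system prompt + context + query
    (call-llm prompt)))                               ; grounded answer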
Architecture overview
Before diving into ingestion and retrieval, it is useful to understand how Genie is structured internally. At its core, Genie is a Clojure application wrapped around an LLM, with two responsibilities: turning the documents uploaded by the user into something an LLM can understand, by splitting, embedding, and storing them in Qdrant; and turning the user’s question into a meaningful answer, by embedding the question and letting LangChain4j assemble the final prompt with the retrieved context.
All the infrastructure (the HTTP server, the database connections, the configuration) exists simply to enable these two flows.
The backbone of Genie is Integrant, a Clojure library that behaves like a tiny dependency injection framework. The system is not built manually; instead, it declares what the components are and Integrant wires everything together at startup. The entrypoint looks like this:
(defn -main
"Invoke me with clojure -M -m smart-genie.core"
[& _args]
(config/init-system :migrate true))

This line alone handles: starting the HTTP server, setting up Qdrant and Postgres, running DB migrations, loading the RAG configuration (embedding model, splitter, reranker…), loading the system prompts (the behavioural instructions for the LLM), and building all the HTTP routes (including upload + chat). Everything is data-driven.
Basically, Integrant lets you declare the system as data. At startup, it reads the configuration map, constructs the components in the correct order, and connects them together.
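To make the “system as data” idea concrete, here is a minimal, hypothetical Integrant configuration. The component keys and options are illustrative and do not reflect Genie’s real config:

(ns example.system
  (:require [integrant.core :as ig]))

;; The whole system described as a plain map. ig/ref expresses dependencies.
(def config
  {:db/postgres {:jdbc-url "jdbc:postgresql://localhost/genie"}
   :db/qdrant   {:host "localhost" :port 6334}
   :http/server {:port       8080
                 :datasource (ig/ref :db/postgres)
                 :store      (ig/ref :db/qdrant)}})

;; Each component declares how it starts; Integrant calls these in
;; dependency order when (ig/init config) runs.
(defmethod ig/init-key :db/postgres [_ opts]
  ;; build and return a datasource from opts (elided here)
  opts)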
The three pillars: Postgres, Qdrant and the LLM. Even though Genie handles many things internally, the architecture revolves around only three external actors. Postgres stores the original document text and the per-bucket system prompt, among other things. Qdrant is where the knowledge lives: whenever a file is uploaded, Genie splits it into segments, embeds each one into a vector, and stores it in Qdrant, which thus becomes the private semantic search index. Finally, the chat model (OpenAI, Mistral, HF, etc.) is the actual LLM. Genie doesn’t train models; it simply prompts them intelligently. The LLM itself stays outside the system; Genie talks to it through LangChain4j via HTTP/glue code. The important part is how Genie constructs the prompt: with the question, the system instructions, and the retrieved chunks.
How do these different pieces connect? The whole system boils down to a very short path: Request → Handler → Assistant → Retriever → Qdrant → LLM → Answer
For example, when the HTTP server receives a request to /chat, Reitit calls this:
(make-chat-handler store embedding-model datasource ai)

Inside that handler everything is assembled: the retriever (the vector search engine), the bucket filter (so data is not mixed among users), the dynamic relevance threshold (implemented as a minimum similarity score in the retriever), the system prompt provider (the “behaviour contract” of the assistant), the LLM client, and the chat memory. Once that handler is built, each call to:
answer (-> state! deref :assistant deref (.answer bucket-id query))

triggers the full retrieval and generation pipeline.
Ingestion
Once the general architecture is understood, the ingestion phase becomes easier to follow. This is the part where the system takes raw documents from the user and turns them into vectors that can later be searched semantically. In practice, this entire process is triggered when a user uploads a file to the /upload endpoint.
The ingestion pipeline can be summarized in a few steps:
Upload handling and document preparation: When a file arrives through the upload route, Genie assigns it a fingerprint. This fingerprint makes documents uniquely identifiable regardless of the bucket they belong to. If the document is a PDF that requires OCR, text extraction is performed automatically. In most cases, the system simply loads the text directly via lazy-load-documents-from-path.
After loading the text, Genie constructs an internal Document object and attaches metadata such as the fingerprint and creation timestamp. Additional metadata (filename, bucket ID, job status) is stored in Postgres and is later used to filter search results during retrieval and to manage uploads.
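As an illustration, such a fingerprint could simply be a content hash of the uploaded file. The sketch below is an assumption about how it might be computed, not Genie’s actual implementation:

(ns example.fingerprint
  (:require [clojure.java.io :as io])
  (:import (java.security MessageDigest)))

(defn fingerprint
  "Hex-encoded SHA-256 digest of a file's bytes, usable as a stable document id."
  [path]
  (let [file-bytes (with-open [in (io/input-stream path)] (.readAllBytes in))
        digest     (.digest (MessageDigest/getInstance "SHA-256") file-bytes)]
    (apply str (map #(format "%02x" (bit-and % 0xff)) digest))))

;; (fingerprint "manual.pdf") returns a 64-character hex string.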
Splitting documents into meaningful segments (chunking strategy): Before embeddings can be computed, the document must be split into smaller segments, commonly called chunks. Chunking is one of the most important stages of a RAG system and directly affects retrieval quality. Genie uses a recursive splitting strategy (via the LangChain4j recursive text splitter). It first tries to split by larger logical boundaries and progressively falls back to smaller ones, only splitting by character length as a last resort. This helps preserve semantic meaning and avoids cutting sentences arbitrarily. Well-formed chunks preserve semantic meaning and therefore produce embeddings that lead to better retrieval performance.
A simplified version of the splitting process looks like:
(let [segment-list (map-indexed (fn [idx segment]
(text-clean/preprocess-for-embedding
(document/add-metadata segment {"segment-index" (int idx)})))
(.split document-splitter document))]

Here, the splitter returns a sequence of text segments, each of which is enriched with a segment-index. This index will later help retrieve neighbouring segments during the retrieval phase.
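For reference, this is roughly how such a recursive splitter can be constructed with LangChain4j; the segment size and overlap values here are illustrative, not Genie’s actual settings:

(import '(dev.langchain4j.data.document.splitter DocumentSplitters))

;; Splits by paragraphs first, then sentences, then words, and only falls
;; back to raw character length as a last resort.
(def document-splitter
  (DocumentSplitters/recursive 500  ; max segment size, in characters
                               50)) ; overlap between consecutive segments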
Embedding and storing in Qdrant: Once the text has been split into segments, Genie prepares them for embedding. Each segment receives an additional piece of metadata, its segment-index, which marks its position inside the document. This information will later be used to retrieve neighbouring segments as additional context.
The embedding process is handled by the embedding model configured in the system (OpenAI, sentence transformers, ONNX, etc.). Each segment is embedded into a fixed-length vector and stored in Qdrant along with metadata.
The core of this step looks essentially like:
(doseq [segment segment-list]
(try (.add store (.content (.embed model segment)) segment)
(catch Exception e (log/error (ex-message e)))))

This operation transforms the document into a searchable vector space. After ingestion completes, Qdrant becomes the private semantic search index for that document.
Alongside storing embeddings in Qdrant, the system also records key information in Postgres: the full original document text, the fingerprint and filename, the bucket assignment and the upload job status (processing, embedding, finished, failed). Postgres holds the “administrative memory” of the documents, while Qdrant holds the semantic representation.
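A hypothetical sketch of recording that administrative memory with next.jdbc; the table and column names are assumptions for illustration only:

(require '[next.jdbc.sql :as sql])

(defn record-upload-job!
  "Stores the original text, identity and job status of an uploaded document."
  [datasource {:keys [fingerprint filename bucket-id full-text]}]
  (sql/insert! datasource :upload_jobs
               {:fingerprint fingerprint
                :filename    filename
                :bucket_id   bucket-id
                :full_text   full-text
                :status      "processing"}))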
One important thing to highlight is that ingestion is a one-time operation: it is performed once per document. After a file is uploaded, split, embedded, and stored, it becomes part of the knowledge base permanently. Retrieval later reuses the stored vectors and metadata; no re-embedding is needed unless the document changes.
Retrieval and Generation:
Once the documents have been split, embedded, and stored, the second phase of the RAG system becomes active. This is the phase responsible for turning the user’s natural-language question into an answer grounded in the previously indexed material. It is triggered each time a request reaches the /chat endpoint and it is, in practice, the part where most of the “AI magic” happens.
From the /chat endpoint to the assistant: when an HTTP request reaches the /chat endpoint, Reitit dispatches it to the handler constructed by:
(make-chat-handler store embedding-model datasource ai)

Inside this function, several components are assembled: the retriever, responsible for semantic search over Qdrant; the bucket filter, ensuring the query only searches the relevant knowledge base; the dynamic relevance threshold, controlling how strict vector matching is; the system prompt provider, loading domain-specific rules; the LLM client, created through LangChain4j; and the chat memory.
Once built, the handler stores its state in an atom and calls the assistant:
answer (-> state! deref :assistant deref (.answer bucket-id query))

From this point on, retrieval and generation begin.
Embedding the question: The user’s query is embedded using the same embedding model that was used during ingestion. This guarantees that distances in the vector space remain meaningful.
(let [q-embedding (.embed embedding-model query)]
...)
Building the retriever (dynamic filters + min-score): Before performing vector search, Genie builds a retriever configured with several search constraints, such as bucket filtering, relevance threshold and the maximum number of results to return.
(emb/make-content-retriever
store embedding-model
{:dynamic-filter (make-bucket-id-filter datasource state!)
:dynamic-min-score (make-dynamic-min-score state!)
:max-results 10})

Bucket filter: Every document is tagged with a fingerprint and a bucket. The bucket filter ensures that a query only searches inside the user’s chosen bucket, preventing cross-bucket issues and providing data isolation.
Dynamic relevance threshold: the user can specify the minimum relevance through a relevance parameter from 0 to 10, which internally becomes a dynamic cosine-similarity threshold (see the sketch below).
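A minimal sketch of what such a mapping could look like; the exact formula Genie uses may differ:

(defn relevance->min-score
  "Maps the user-facing 0-10 relevance setting onto a 0.0-1.0
   cosine-similarity threshold."
  [relevance]
  (-> relevance (max 0) (min 10) (/ 10.0)))

;; (relevance->min-score 7) => 0.7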
Once the retriever is configured, Genie performs the vector search in Qdrant, obtaining an initial list of relevant segments.
These results contain: The text of each chunk, all metadata and the Qdrant point id. Each result represents a semantically similar fragment of the uploaded documents.
Retrieving neighbouring segments: The initial results contain segments that are semantically similar to the question. However, important context might be located just before or after those segments in the original document.
To avoid missing important context, Genie retrieves nearby segments using the segment-index stored during ingestion:
(let [neighbors (scored-points->contents
(scroll-neighbors client collection-name fingerprint index 3))]
(.addAll results neighbors))

This ensures that, for example, a section header or a follow-up sentence is included in the context even if the exact match was only found in one segment. Neighbour expansion is known to be an effective way to improve context integrity in a RAG system.
Optional reranking: If enabled, Genie uses an ONNX-based cross-encoder reranker to sort the retrieved segments more accurately:
(let [content-list (base-neighboring query)
segments (ArrayList. (map (memfn textSegment)
(.toArray content-list)))
scores (.scoreAll scoring-model segments (str query))
scored (map (fn [s c]
{:score s
:content (add-content-metadata
c {ContentMetadata/RERANKED_SCORE (double s)})})
(.content scores) content-list)
top (take 3 (sort-by :score > scored))]
(ArrayList. (map :content top)))

Unlike vector similarity, which only compares embeddings, a cross-encoder analyzes the query and the segment together and produces a more precise relevance score. Only the top 3 results (as shown in the code above) are kept.
Constructing the prompt: Once relevant context is selected, Genie provides the system prompt, the retrieved segments and the user question to the LangChain4j Assistant, which then assembles the final prompt internally.
The system prompt is stored per bucket in Postgres and injected through a system message provider:
(make-system-message-provider datasource state!)

This allows Genie to enforce domain restrictions such as: “Only answer using the content of the manuals”, “If the answer is not in the documents, tell the user you don’t know”, etc.
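As an illustration, a per-bucket system prompt built from those rules could look like this; the wording is an example, not Genie’s actual prompt:

(def example-system-prompt
  (str "You are an assistant for this bucket's document set. "
       "Only answer using the content of the manuals provided as context. "
       "If the answer is not in the documents, tell the user you don't know."))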
Calling the LLM: The final assembled prompt is passed to the configured chat model:
answer (-> state! deref :assistant deref (.answer bucket-id query))

Under the hood, LangChain4j sends the prompt through the selected provider.
The result contains the generated answer and the list of retrieved sources.
At this point, the retrieval and generation phase ends and the response is returned to the client.
Buckets, System Prompts & Data Isolation
Once the retrieval pipeline is understood, the next important concept in Genie’s architecture is the idea of buckets. Buckets are the mechanism used to organize documents, enforce data isolation, and give each knowledge base its own assistant behaviour, which makes the system practical for real-world usage when multiple teams or domains are involved.
A bucket is essentially a container for documents. Each uploaded file belongs to one bucket, and each bucket acts as an independent knowledge base.
This design provides several advantages: data isolation, where queries only search inside the selected bucket, and multi-tenancy, where multiple users or teams can share the same system without mixing data. During retrieval, Qdrant only considers embeddings that belong to the active bucket:
{:dynamic-filter (make-bucket-id-filter datasource state!)
:dynamic-min-score (make-dynamic-min-score state!)
:max-results 10}

As mentioned before, this prevents cross-bucket contamination and ensures that queries remain tightly grounded in the correct domain.
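Conceptually, the dynamic bucket filter restricts the search to points whose payload carries the active bucket id. Expressed as a Qdrant-style filter in Clojure data (the payload key and value here are hypothetical):

;; Only points tagged with the active bucket are considered.
{:must [{:key   "bucket-id"
         :match {:value "customer-support"}}]}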
System prompts: Every bucket also has an associated system prompt, stored in Postgres. This defines how the assistant should behave when responding to queries related to that bucket. The system prompt is injected automatically into the final prompt via a system message provider:
system-message-provider (make-system-message-provider datasource state!)

This mechanism allows each bucket to enforce its own set of rules: what tone the assistant should use, which domain it must remain inside, whether it should refuse to answer out-of-scope questions, how to respond when information is missing, or what is considered a “fact” for that bucket (as previously mentioned).
FAQ:
Some RAG users frequently raise the following doubts. Collecting them here helps clarify how the system behaves in practice:
One of the most common ones is whether it is possible to search across multiple buckets at the same time. The short answer is no. A RAG system is only as accurate as the context it retrieves, and if multiple unrelated buckets were searched in a single query, the model could easily mix contexts or return misleading answers. Genie enforces data isolation on purpose: when you select a bucket, only the documents associated with that bucket are considered during retrieval. This is what makes the system reliable for multi-department usage.
Another typical question is what happens when two buckets contain similar or even identical documents. Users often worry about “cross-contamination”, but the system already protects against that. Even if the content overlaps, retrieval always happens inside the active bucket only. Qdrant stores metadata for each segment, including the bucket, and the retriever filters the search accordingly, so results never leak between buckets.
A related doubt is whether users must manually select a bucket for every query. In most deployments, the answer is yes: the bucket determines both the knowledge base and the system prompt that the LLM will use. However, nothing prevents the UI from remembering the last used bucket or even hiding this selection entirely. Genie’s backend does not require the user to click anything; it only needs the bucket id. The application layer is free to abstract this detail away.
Sometimes users also ask whether the system can “hide” internal documents when showing sources. For example, if the assistant retrieves context that comes from proprietary manuals, should the UI display it? The system already includes all metadata, and the UI can decide what to show. Some clients prefer transparency, others prefer to show only public-facing URLs. The backend provides the data; the frontend chooses the policy.
There is also the recurring concern about whether the assistant “invents” answers. This is actually a system-prompt issue, not a retrieval problem. Because each bucket has its own system prompt stored in Postgres, the assistant can be instructed to refuse off-topic questions or to explicitly say “I don’t know” when the information is missing. This behaviour is not hardcoded; it depends entirely on the configuration of the bucket. A strict bucket will enforce strict answers. A permissive prompt will allow the model to improvise.
Genie’s responses:
Real-world example: Genie answering questions from a BMW owner’s manual.
To demonstrate how Genie behaves with real-world documentation, I uploaded the full 2016 BMW 3 Series owner’s manual (257 pages) into the system. I then queried Genie with a set of practical, multi-purpose questions a driver might ask. Each example includes both Genie’s answer and the specific context retrieved from the manual, making it easy to verify how faithfully the system grounds its responses.
1. Locating the speed limit camera
Query: Where is the speed limit detection camera located?
Answer: The speed limit detection camera is located near the interior rearview mirror. It is crucial to keep the windshield in the area behind the interior rearview mirror clean and clear for optimal operation. This camera detects traffic signs at the side of the road and variable overhead sign posts. It also uses the information stored in the navigation system to display speed limits even on routes without signs. The system's effectiveness can be impacted by factors such as heavy fog, rain, or snowfall, signs concealed by objects, driving closely behind another vehicle, bright lights, or if the camera viewing field or windshield is dirty or covered.
Document context (p. 85-86):
“The camera in the area of the interior rearview mirror detects traffic signs”
“The camera is found near the interior rearview mirror.”
“In heavy fog, rain or snowfall… When signs are concealed by objects… When driving toward bright lights…”
Highlights:
Faithful retrieval (exact phrase: ‘near the interior rearview mirror’).
Structured synthesis across Concept / Camera / System limits.
Clear explanation without adding anything not present in the document.
2. Switching on the front fog lights
Query: “How do I switch on the front fog lights in this car?”
Genie’s answer: "To switch on the front fog lights in this car, you need to follow these steps: Make sure the low beams are switched on, as the fog lights only operate in conjunction with them. Press the specified button to turn on the fog lights. The green indicator lamp should light up to confirm they're on. Please note, if the automatic headlight control is activated, the low beams will come on automatically when you switch on the front fog lights. Conversely, when the high beams or headlight flasher are activated, the front fog lights will not switch on."
Document context (p. 95):
“The low beams must be switched on.”
“Press the button. The green indicator lamp lights up.”
“If the automatic headlight control … is activated, the low beams will come on automatically.”
“When the high beams or headlight flasher are activated, the front fog lights are not switched on.”
Highlights:
Strictly grounded in the retrieved text; no external knowledge added.
The manual provides several independent sentences; Genie reorganizes them into a simple two-step procedure, making the instructions much easier to follow.
Genie accurately preserves the operational conditions: fog lights only work with low beams, automatic headlight control can activate them automatically, high beams disable fog lights.
3. Pedestrian detection and automatic braking
Query: “How does the pedestrian braking function work?”
Genie’s answer: "The pedestrian warning with city braking function in this car is designed to help prevent accidents with pedestrians. Here is how it works: The system is controlled by a camera located in the area of the rearview mirror. It operates at speeds from about 6 mph/10 km/h to about 35 mph/60 km/h. The system issues a warning when it detects possible collisions with pedestrians within the vehicle's detection range. The detection area in front of the vehicle is divided into two areas: central and expanded. The central area is directly in front of the vehicle, while the expanded area extends to the right and left. When pedestrians are located within the central area and a collision is imminent, the system issues a warning, and shortly before a collision, the system supports you with a braking intervention. Beyond warning, the system also includes a braking function that can bring the vehicle to a complete stop at low speeds if necessary. However, the pedestrian detection has its limitations such as partially covered pedestrians, those outside the detection range, or small-sized individuals (less than 32 in/80 cm). Limitations also include scenarios when the camera's field view is impaired or when there are constant blinding effects from oncoming lights, during heavy weather conditions, and when driving in the dark or in tight curves. The system doesn't replace driver vigilance, it is meant to assist the driver. Therefore, it is important to always stay alert and be aware of the surroundings while driving. The driver can adjust the sensitivity of the warning settings, although increasing sensitivity too much can lead to an excess of false warnings."
Document context (p. 112-114, mostly):
“The system can help prevent accidents with pedestrians.”
“In daylight the system warns … at speeds from about 6 mph / 10 km/h to about 35 mph / 60 km/h.”
“The detection area is divided into two areas: central and expanded.”
“If a collision … is imminent, a warning symbol appears and a signal sounds.”
“The system can assist with some braking intervention … vehicles may thus come to a complete stop.”
“Partially covered pedestrians… pedestrians <80 cm… heavy fog, rain, snowfall… tight curves… dirty windshield… bright light… when it is dark.”
Highlights:
Multi-page retrieval and synthesis: the pedestrian warning system spans several sections, headings and warnings scattered across pages. Genie successfully and precisely merges them.
Exact technical details preserved: the exact speed range (10–60 km/h), the 32 in / 80 cm size limitation, the two-zone detection model (central vs expanded), and the exact environmental impairments from the manual.
No hallucinations about ADAS features; the answer stays strictly constrained to what the manual says.
4. Listing ability: seating and safety recommendations
Query: "List all seating and safety recommendations"
Genie’s Answer:
"Based on the text from "Sitting Safely" section of the manual, following seating and safety recommendations are provided: The ideal seating position can make a vital contribution to relaxed, fatigue-free driving. The seating position plays a crucial role in an accident, in combination with safety belts, head restraints, and airbags. Avoid adjusting the driver's seat while driving as it could cause unexpected movements leading to loss of vehicle control. Do not incline the backrest too far to the rear. During an accident, there's a risk of slipping under the safety belt, eliminating the protection it provides. Make sure to keep the movement area unobstructed during changing the seat position. Otherwise, it could lead to people getting injured or objects getting damaged. The seat can be manually adjusted in terms of forward/backward movement, thigh support, seat tilt, backrest width, lumbar support, height, and backrest tilt. Safety belt reminder for driver's and passenger's seat will light up and signal. Correctly adjusting the headrest reduces the risk of injury to cervical vertebrae in the event of an accident. Adjust the head restraint so that its center is approximately at ear level. Also, adjust the distance so that the head restraint is as close as possible to the back of the head. Only one person per safety belt - never allow more than one person to wear a single safety belt. Lay the belt snugly across the lap and shoulders, without twisting, as close to the body as possible. Make sure that the belt lies low around the hips in the lap area and does not press on the abdomen. Avoid wearing bulky clothing and keep the safety belt from getting jammed as it reduces the restraining effect."
Document context (p. 49-55, mostly), extracts from the manual:
“Do not adjust the driver’s seat while driving”
“Do not incline the backrest too far to the rear”
“Keep the movement area unobstructed”
“Lay the belt, without twisting, snugly across the lap and shoulders”
“Adjust the head restraint so that its center is approximately at ear level…”
Highlights:
Consolidates dozens of safety guidelines into a single, coherent checklist.
Extracts only what appears in the manual; no invented recommendations.
Demonstrates Genie’s ability to reorganize fragmented content into a structured list.
5. Asking a completely unrelated query
Query: “How do I make homemade flan?”
Genie’s answer: “Sorry, it seems there is a misunderstanding. The text you provided includes maintenance instructions for a car, but not the step-by-step process for making homemade flan.”
Highlight:
Refuses to answer when the document does not contain relevant information.
Demonstrates grounding guarantees and non-hallucination behaviour.
Conclusion:
The BMW manual experiments make one thing clear: the intelligence does not come from the model alone; it comes from how the system embeds text, retrieves context, and constructs prompts that give the LLM exactly what it needs to reason correctly over a specific document set.
What makes this approach powerful is that the system remains fully under our control. The quality of the answers depends on the ingestion pipeline, the way we structure buckets, the clarity of the system prompt, and the constraints we impose on retrieval. Instead of asking users to “trust the model”, we design a system that forces the model to stay inside the boundaries of the documents the user uploaded. When the answer is in the manual, Genie finds it and explains it; when it isn’t (like the flan recipe), Genie simply says so.
The BMW case also makes embeddings and vector search feel much less abstract. A document becomes a set of segments; a question becomes a vector; Qdrant returns the closest pieces of meaning; and the LLM completes the last step by turning that context into a clear answer. Everything else – Integrant, Postgres, routes and handlers – is just infrastructure to support this flow.
From these experiments, RAG gives us three practical benefits:
Grounded answers instead of hallucinations.
Reusable infrastructure that can be pointed at any domain, not just cars.
A clear mental model of how data flows from raw documents to final answers.
Hopefully this walkthrough helps others understand not just how Genie works internally, but why Retrieval-Augmented Generation is such a powerful pattern for anyone who needs trustworthy answers from their own documents.






