Building a RAG-Powered Documentation Assistant (Chunking, Embeddings, and Search)
What you get
I built a RAG-powered documentation assistant and packaged it into a boilerplate you can use immediately.
- Comes with a ready-to-use application (Next.js + OpenAI API)
- Includes a working RAG pipeline (chunking, embeddings, cosine similarity)
- Provides a complete chatbot UI built in React
- All UI components are fully editable with Tailwind CSS
- Logs every user query to help identify missing docs, user pain points, and product opportunities
👉 Live demo 👉 Code boilerplate
Introduction
If you’ve ever been lost in documentation, scrolling endlessly for one answer, you know how painful it can be. Docs are useful, but they’re static and searching them often feels clunky.
That’s where RAG (Retrieval-Augmented Generation) comes in. Instead of forcing users to dig through text, we can combine retrieval (finding the right parts of the doc) with generation (letting an LLM explain it naturally).
In this post, I’ll walk you through how I built a RAG-powered documentation chatbot and how it doesn’t just help users find answers faster, but also gives product teams a new way to understand user pain points.
Why Use RAG for Documentation?
RAG has become a popular approach for a reason: it’s one of the most practical ways to make large language models actually useful.
For documentation, the benefits are clear:
- Instant answers: users ask in natural language, and get relevant replies.
- Better context: the model only sees the most relevant doc sections, reducing hallucinations.
- Search that feels human: more like Algolia + FAQ + chatbot, rolled into one.
- Feedback loop: by storing queries, you uncover what users really struggle with.
That last point is crucial. A RAG system doesn’t just answer questions, it tells you what people are asking. That means:
- You discover missing info in your docs.
- You see feature requests emerging.
- You spot patterns that can even guide product strategy.
So, RAG isn’t just a support tool. It’s also a product discovery engine.
How the RAG Pipeline Works
At a high level, here’s the recipe I used:
- Chunking the documentation: large Markdown files are split into chunks, so that only the relevant parts of the documentation are passed as context.
- Generating embeddings: each chunk is turned into a vector using OpenAI's embedding API (text-embedding-3-large); any modern embedding model would work.
- Indexing & storing: embeddings are stored in a simple JSON file (for my demo), but in production you'd likely use a vector database (Chroma, Qdrant, Pinecone).
- Retrieval (R in RAG): a user query is embedded, cosine similarity is computed against the stored vectors, and the top-matching chunks are retrieved.
- Augmentation + Generation (AG in RAG): those chunks are injected into the prompt for ChatGPT, so the model answers with actual doc context.
- Logging queries for feedback: every user query is stored. This is invaluable for understanding pain points, missing documentation, or new opportunities.
Step 1: Reading the Docs
The first step was straightforward: I needed a way to scan a docs/ folder for all .md files. Using Node.js and glob, I fetched the content of each Markdown file into memory.
This keeps the pipeline flexible: instead of Markdown, you could fetch docs from a database, a CMS, or even an API.
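Here is a minimal sketch of that step, assuming a local docs/ folder and the glob package; the file layout is illustrative:

```ts
// Minimal sketch of the doc-loading step. Assumes a local `docs/` folder
// and the `glob` package (v9+); the file layout is illustrative.
import { glob } from "glob";
import { readFile } from "node:fs/promises";

type DocFile = { path: string; content: string };

export const loadDocs = async (docsDir = "docs"): Promise<DocFile[]> => {
  // Find every Markdown file under the docs folder
  const paths = await glob(`${docsDir}/**/*.md`);

  // Read each file into memory so later steps can chunk and embed it
  return Promise.all(
    paths.map(async (path) => ({
      path,
      content: await readFile(path, "utf-8"),
    }))
  );
};
```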
Step 2: Chunking the Documentation
Why chunk? Because language models have context limits. Feeding them an entire book of docs won’t work.
So the idea is to break text into manageable chunks (e.g. 500 tokens each) with overlap (e.g. 100 tokens). Overlap ensures continuity so you don’t lose meaning at chunk boundaries.
Example:
- Chunk 1 → “…the old library that many had forgotten. Its towering shelves were filled with books…”
- Chunk 2 → “…shelves were filled with books from every imaginable genre, each whispering stories…”
The overlap ensures both chunks contain shared context, so retrieval remains coherent.
This trade-off (chunk size vs overlap) is key for RAG efficiency:
- Too small → you get noise.
- Too large → you blow up context size.
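Here is a minimal chunking sketch. It approximates tokens with words to keep the idea visible; a real implementation would count tokens with a tokenizer such as tiktoken.

```ts
// Split text into overlapping chunks. Words stand in for tokens here;
// swap in a real tokenizer for accurate sizing.
export const chunkText = (
  text: string,
  chunkSize = 500, // target chunk size (in "tokens")
  overlap = 100 // tokens shared between consecutive chunks
): string[] => {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];

  // Slide a window of `chunkSize` words, stepping forward by
  // `chunkSize - overlap` so adjacent chunks share boundary context.
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break;
  }

  return chunks;
};
```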
Step 3: Generating Embeddings
Once the docs are chunked, we generate embeddings — high-dimensional vectors representing each chunk.
I used OpenAI’s text-embedding-3-large model, but you could use any modern embedding model.
Example embedding:
```ts
[
  -0.0002630692, -0.029749284, 0.010225477, -0.009224428, -0.0065269712,
  -0.002665544, 0.003214777, 0.04235309, -0.033162255, -0.00080789323,
  // ...+1533 elements
];
```
Each vector is a mathematical fingerprint of the text, enabling similarity search.
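As a sketch, the call with the official openai Node SDK looks roughly like this (assumes OPENAI_API_KEY is set; batching and error handling omitted):

```ts
// Turn each chunk into an embedding vector with the OpenAI embeddings API.
import OpenAI from "openai";

const openai = new OpenAI();

export const embedChunks = async (chunks: string[]): Promise<number[][]> => {
  // The embeddings endpoint accepts an array of inputs in a single call
  const response = await openai.embeddings.create({
    model: "text-embedding-3-large",
    input: chunks,
  });

  // One vector per input chunk, in the same order
  return response.data.map((item) => item.embedding);
};
```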
Step 4: Indexing & Storing Embeddings
To avoid regenerating embeddings multiple times, I stored them in embeddings.json.
In production, you’d likely want a vector database such as:
- Chroma
- Qdrant
- Pinecone
- FAISS, Weaviate, Milvus, etc.
Vector DBs handle indexing, scalability, and fast search. But for my prototype, a local JSON worked fine.
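For the JSON variant, persistence can be as simple as writing the chunk/vector pairs to disk; the record shape below is an assumption, not the exact format used in the boilerplate:

```ts
// Persist the index so embeddings are not regenerated on every run.
import { readFile, writeFile } from "node:fs/promises";

export type EmbeddedChunk = {
  docPath: string; // source Markdown file
  chunk: string; // raw chunk text
  embedding: number[]; // vector returned by the embedding model
};

export const saveIndex = (index: EmbeddedChunk[], file = "embeddings.json") =>
  writeFile(file, JSON.stringify(index));

export const loadIndex = async (
  file = "embeddings.json"
): Promise<EmbeddedChunk[]> => JSON.parse(await readFile(file, "utf-8"));
```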
Step 5: Retrieval with Cosine Similarity
When a user asks a question:
- Generate an embedding for the query.
- Compare it to all doc embeddings using cosine similarity.
- Keep only the top N most similar chunks.
Cosine similarity measures the angle between two vectors. A perfect match scores 1.0.
This way, the system finds the closest doc passages to the query.
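In plain TypeScript, the retrieval step can be sketched like this (the record shape mirrors the storage sketch above):

```ts
// Cosine similarity: dot product of the vectors divided by the product
// of their magnitudes. 1.0 means the vectors point in the same direction.
export const cosineSimilarity = (a: number[], b: number[]): number => {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// Rank every stored chunk against the query embedding and keep the best N.
export const topChunks = (
  queryEmbedding: number[],
  index: { chunk: string; embedding: number[] }[],
  topN = 5
) =>
  index
    .map((item) => ({
      ...item,
      score: cosineSimilarity(queryEmbedding, item.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
```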
Step 6: Augmentation + Generation
Now comes the magic. We take the top chunks and inject them into the system prompt for ChatGPT.
That means the model answers as if those chunks were part of the conversation.
The result: accurate, doc-grounded responses.
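A sketch of that step: the retrieved chunks are concatenated into the system prompt before calling the chat completions API. The model name and prompt wording here are illustrative.

```ts
// Answer a question using only the retrieved doc chunks as context.
import OpenAI from "openai";

const openai = new OpenAI();

export const answerFromDocs = async (question: string, chunks: string[]) => {
  // Inject the retrieved chunks into the system prompt
  const context = chunks.map((chunk) => `-----\n${chunk}`).join("\n");

  const completion = await openai.chat.completions.create({
    model: "gpt-4o", // model name is illustrative
    messages: [
      {
        role: "system",
        content: `You are a helpful assistant that answers questions about the documentation.\nRelated chunks:\n${context}`,
      },
      { role: "user", content: question },
    ],
  });

  return completion.choices[0].message.content;
};
```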
Step 7: Logging User Queries
This is the hidden superpower.
Every question asked is stored. Over time, you build a dataset of:
- Most frequent questions (great for FAQs)
- Unanswered questions (docs are missing or unclear)
- Feature requests disguised as questions (“Does it integrate with X?”)
- Emerging use cases you hadn’t planned for
This turns your RAG assistant into a continuous user research tool.
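Logging can start as a one-liner that appends each question (and what it matched) to a file; a production setup would more likely write to a database or an analytics pipeline:

```ts
// Append each query and its matched docs to a local log file.
import { appendFile } from "node:fs/promises";

export const logQuery = (query: string, matchedDocs: string[]) =>
  appendFile(
    "queries.log",
    JSON.stringify({ query, matchedDocs, at: new Date().toISOString() }) + "\n"
  );
```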
What Does It Cost?
One common objection to RAG is cost. In practice, it’s surprisingly cheap:
- Generating embeddings for ~200 docs takes about 5 minutes and costs 1–2 euros.
- The doc search feature itself is 100% free.
- For queries, we use gpt-4o-latest without “thinking” mode. On Intlayer, we see around 300 chat queries per month, and the OpenAI API bill rarely exceeds $10.
On top of that, factor in your hosting costs.
Implementation Details
Stack:
- Monorepo: pnpm workspace
- Doc package: Node.js / TypeScript / OpenAI API
- Frontend: Next.js / React / Tailwind CSS
- Backend: Node.js API route / OpenAI API
The @smart-doc/docs package is a TypeScript package that handles documentation processing. It includes a build script that, whenever a Markdown file is added or modified, rebuilds the documentation list for each language, regenerates the embeddings, and stores them in an embeddings.json file.
For the frontend, we use a Next.js application that provides:
- Markdown to HTML rendering
- A search bar to find relevant documentation
- A chatbot interface for asking questions about the docs
To perform a documentation search, the Next.js application includes an API route that calls a function in the @smart-doc/docs package to retrieve doc chunks matching the query. Using these chunks, we can return a list of documentation pages relevant to the user's search.
For the chatbot functionality, we follow the same search process but additionally inject the retrieved doc chunks into the prompt sent to ChatGPT.
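As a hypothetical sketch of the search route (app/api/search/route.ts), where searchDocChunks stands in for the function exposed by the docs package; its name and signature are assumptions:

```ts
// Next.js App Router handler: embed the query, rank stored chunks,
// and return the matching documentation pages.
import { NextResponse } from "next/server";
import { searchDocChunks } from "@smart-doc/docs"; // hypothetical export

export async function POST(request: Request) {
  const { query } = await request.json();

  const chunks = await searchDocChunks(query);

  return NextResponse.json({ results: chunks });
}
```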
Here's an example of a prompt sent to ChatGPT:
System prompt:

```txt
You are a helpful assistant that can answer questions about the Intlayer documentation.

Related chunks:

-----
docName: "getting-started"
docChunk: "1/3"
docUrl: "https://example.com/docs/en/getting-started"
---
# How to get started
...

-----
docName: "another-doc"
docChunk: "1/5"
docUrl: "https://example.com/docs/en/another-doc"
---
# Another doc
...
```
User query:

```txt
How to get started?
```
We use SSE (Server-Sent Events) to stream the response from the API route.
As mentioned, we use gpt-4o-latest without "thinking" mode. Responses are relevant, and latency is low. We experimented with gpt-5, but latency was too high (sometimes up to 15 seconds for a reply). We will revisit it in the future.
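A sketch of that streaming route: the handler relays the model's streamed deltas as SSE data: events. The event payloads and model name are illustrative.

```ts
// Stream the chat completion to the browser as Server-Sent Events.
import OpenAI from "openai";

const openai = new OpenAI();

export async function POST(request: Request) {
  const { question, context } = await request.json();

  const completion = await openai.chat.completions.create({
    model: "gpt-4o", // model name is illustrative
    stream: true,
    messages: [
      { role: "system", content: context },
      { role: "user", content: question },
    ],
  });

  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(controller) {
      // Forward each streamed token as an SSE `data:` event
      for await (const part of completion) {
        const delta = part.choices[0]?.delta?.content ?? "";
        if (delta) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify(delta)}\n\n`));
        }
      }
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });

  return new Response(body, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
```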
👉 Try the demo here 👉 Check the code template on GitHub
Going Further
This project is a minimal implementation. But you can extend it in many ways:
- MCP server → expose the doc search function through an MCP server, so the documentation can be connected to any AI assistant
- Vector DBs → scale to millions of doc chunks
- LangChain / LlamaIndex → ready-made frameworks for RAG pipelines
- Analytics dashboards → visualise user queries and pain points
- Multi-source retrieval → pull not just docs, but database entries, blog posts, tickets, etc.
- Improved prompting → reranking, filtering, and hybrid search (keyword + semantic)
Limitations We Hit
- Chunking and overlap are empirical. The right balance (chunk size, overlap percentage, number of retrieved chunks) requires iteration and testing.
- Embeddings are not auto-regenerated when docs change. Our system resets embeddings for a file only if the number of chunks differs from what’s stored.
- In this prototype, embeddings are stored in JSON. This works for demos but pollutes Git. In production, a database or dedicated vector store is preferable.
Why This Matters Beyond Docs
The interesting part is not just the chatbot. It’s the feedback loop.
With RAG, you don’t just answer:
- You learn what confuses users.
- You discover which features they expect.
- You adapt your product strategy based on real queries.
Example:
Imagine launching a new feature and instantly seeing:
- 50% of questions are about the same unclear setup step
- Users repeatedly ask for an integration you don’t support yet
- People search for terms that reveal a new use case
That’s product intelligence straight from your users.
Conclusion
RAG is one of the simplest, most powerful ways to make LLMs practical. By combining retrieval + generation, you can turn static docs into a smart assistant and, at the same time, gain a continuous stream of product insights.
For me, this project showed that RAG isn’t just a technical trick. It’s a way to transform documentation into:
- an interactive support system
- a feedback channel
- a product strategy tool
👉 Try the demo here 👉 Check the code template on GitHub
And if you’re experimenting with RAG too, I’d love to hear how you’re using it.