
Table of Contents
Open Table of Contents
Introduction
⚠️ This is a Proof of Concept (POC) - This article demonstrates the fundamentals of building a RAG system using modern tools. It’s intentionally simple and unoptimized to focus on core concepts. Do not use this in production without significant enhancements.
The Challenge
Organizations dealing with cybersecurity compliance face a common problem: regulations like ENS (Spain), NIS2 (EU), GDPR, and DORA span hundreds of pages. Finding specific answers requires manual PDF searches or expensive consultants.
The Solution
Retrieval-Augmented Generation (RAG) solves this by combining:
- Semantic search to find relevant document chunks.
- LLMs to synthesize natural language answers from those chunks.
This post walks through building a minimal RAG system from scratch using:
- LangChain v1+ (for RAG orchestration)
- FAISS (vector database)
- Any OpenAI-compatible API (Azure OpenAI, OpenAI, Ollama, etc.)
- FastAPI (backend)
- Streamlit (frontend)
What you’ll learn:
- How RAG works under the hood.
- Document chunking strategies.
- Vector embeddings and similarity search.
- Prompt engineering for grounded responses.
- What’s missing for production use.
📦 Full code available: The complete project with Docker Compose, uv package manager, FastAPI backend, and Streamlit frontend is on GitHub.
System Architecture
Simple three-tier architecture:

Flow:
- User asks a question via Streamlit UI
- FastAPI backend receives the query
- LangChain retrieves relevant chunks from FAISS
- LLM generates answer based on retrieved context
- Response with sources returned to user
Tech Stack:
- LangChain v1+: RAG pipeline orchestration.
- FAISS: Local vector database (no server needed).
- LLM API: Any OpenAI-compatible endpoint (I use Azure OpenAI, but OpenAI API or Ollama work too).
- FastAPI: REST API.
- Streamlit: Quick UI prototype.

How RAG Works (Simplified)
Before diving into code, understand the RAG workflow:
Indexing Phase (run once):
- Load PDFs.
- Split into chunks (~1000 chars).
- Generate embeddings (convert text → vectors) — Learn more about embeddings.
- Store in FAISS vector database.

Query Phase (per request):
- Convert user question → embedding.
- Find top-k most similar chunks in FAISS (cosine similarity).
- Pass chunks + question to LLM with prompt.
- LLM generates answer based on context.
- Return answer + source citations.
Key insight: The LLM never “knows” the regulations—it only sees the retrieved chunks. This prevents hallucination.
💡 New to embeddings? Check out my deep dive on Vector Embeddings and Semantic Search to understand how text becomes searchable vectors.
Building the RAG System
Step 1: Document Processing
Load PDFs and split into chunks:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load all PDFs
pdf_files = Path('data/storage').glob('*.pdf')
docs = []
for pdf in pdf_files:
docs.extend(PyPDFLoader(str(pdf)).load())
# Smart chunking
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # ~250 tokens
chunk_overlap=200, # Preserve context at boundaries
separators=['\n\n', '\n', ' ', ''] # Split on paragraphs first
)
chunks = splitter.split_documents(docs)

Why these settings?
- 1000 chars: Balances context (enough info) vs. specificity (focused chunks).
- 200 overlap: Prevents losing information at chunk boundaries.
- Hierarchical separators: Prefer paragraph breaks over word breaks.
Step 2: Generate Embeddings & Index
from langchain_openai import AzureOpenAIEmbeddings # Or OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Any OpenAI-compatible embedding API works
embeddings = AzureOpenAIEmbeddings(
model="text-embedding-3-small", # 1536 dimensions
# ... API credentials
)
# Create FAISS index
vectorstore = FAISS.from_documents(chunks, embeddings)
vectorstore.save_local('faiss_index') # Persist to disk

Note: This example uses Azure OpenAI, but you can swap in:
OpenAIEmbeddings(OpenAI API).OllamaEmbeddings(local models).- Any provider with LangChain integration.
Batch processing tip: For large datasets, process in batches with delays to avoid rate limits:
BATCH_SIZE = 250
for i in range(0, len(chunks), BATCH_SIZE):
batch = chunks[i:i+BATCH_SIZE]
vectorstore.add_documents(batch)
time.sleep(5) # Respect rate limits

Step 3: Build the RAG Chain (LangChain v1 LCEL)
LangChain v1 uses LCEL (LangChain Expression Language) for composable chains:
from langchain_openai import AzureChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
# Load vector store
vectorstore = FAISS.load_local('faiss_index', embeddings)
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
# Prompt template
prompt = ChatPromptTemplate.from_template(
"You are an expert in regulatory compliance. "
"Answer based EXCLUSIVELY on the provided context. "
"Cite the regulation (ENS, NIS2, GDPR, DORA).\n\n"
"Context:\n{context}\n\n"
"Question: {question}\n\n"
"Answer:"
)
# LLM (any OpenAI-compatible API)
llm = AzureChatOpenAI(
model="gpt-4o-mini",
temperature=0.2 # Low temp for factual answers
)
# Format documents helper
def format_docs(docs):
return '\n\n'.join(d.page_content for d in docs)
# Build RAG chain with LCEL
rag_chain = (
{
"context": retriever | RunnableLambda(format_docs),
"question": RunnablePassthrough()
}
| prompt
| llm
| StrOutputParser()
)
# Use it
answer = rag_chain.invoke("What is ENS?")
LCEL benefits:
- Composable: Swap components easily (FAISS → Qdrant, Azure → OpenAI).
- Readable: Pipeline is explicit (retrieval → prompt → LLM → parse).
- Streamable: Built-in streaming support.
Step 4: FastAPI Backend
Wrap the RAG chain in a REST API:
from pydantic import BaseModel
app = FastAPI()
class QueryRequest(BaseModel):
question: str
max_sources: int = 3
@app.post("/query")
async def query_rag(request: QueryRequest):
# Retrieve documents
sources = retriever.get_relevant_documents(request.question)
# Generate answer
answer = rag_chain.invoke(request.question)
return {
"answer": answer,
"sources": [
{
"content": doc.page_content[:300],
"metadata": doc.metadata # filename, page number
}
for doc in sources[:request.max_sources]
]
}
Minimal but functional. Production would add:
- Authentication.
- Rate limiting.
- Caching.
- Error handling.
- Logging/monitoring.
Step 5: Streamlit Frontend (Optional)
Quick UI for testing:
import streamlit as st
import requests
question = st.text_input("Ask a compliance question:")
if question:
response = requests.post(
"http://localhost:8000/query",
json={"question": question}
)
result = response.json()
st.write(result["answer"])
with st.expander("Sources"):
for i, source in enumerate(result["sources"], 1):
st.text(f"{i}. {source['metadata']['source']} (p. {source['metadata']['page']})")

Prompt Engineering for RAG
The prompt is critical. Here’s what works:
You are an expert in regulatory compliance.
Answer based EXCLUSIVELY on the provided context.
If the answer isn't in the context, say "I don't have enough information."
Cite the specific regulation (ENS, NIS2, GDPR, DORA).
Context:
{context}
Question: {question}
Answer:
Key constraints:
- “EXCLUSIVELY on the provided context”: Prevents hallucination.
- “If not in context, say so”: Avoids making up answers.
- Cite regulation: Helps verification.
- Low temperature (0.2): Deterministic, factual responses.
Bad prompt example:
Answer this question: {question}
→ LLM will use its training data, not your documents!

Example Query Flow
Question: “What security measures does ENS require for MEDIUM category?”
1. Retrieval: FAISS finds top-5 chunks:
Chunk 1 (similarity: 0.87): "ENS categorization MEDIUM requires [op.acc.2] authentication controls..."
Chunk 2 (similarity: 0.84): "For MEDIUM systems, [mp.info.3] data encryption at rest..."
...
2. Prompt Construction:
Context:
[Chunk 1 content]
[Chunk 2 content]
...
Question: What security measures does ENS require for MEDIUM category?
3. LLM Response:
For MEDIUM category systems under ENS:
- [op.acc.2] Authentication controls with password policies
- [mp.info.3] Data encryption for sensitive information at rest
- [op.exp.9] Security logging and monitoring
...
4. Add Sources: Return answer + metadata (filename, page numbers) for verification.

Running the POC
Repository includes:
- ✅ Complete backend (FastAPI + LangChain + FAISS)
- ✅ Streamlit frontend
- ✅ Docker Compose setup (backend + frontend containers)
- ✅
uvpackage manager configuration (pyproject.toml) - ✅ Initialization script and Jupyter notebook
- ✅ Example
.envtemplate
Full code on GitHub.
This POC shows how RAG works, not how to run it in production. Treat it as a learning base, not a deployment template.
Quick start:
# 1. Clone repo
git clone https://github.com/manulqwerty/cyber-compliance-rag
cd cyber-compliance-rag
# 2. Add PDFs to backend/data/storage/
# 3. Configure .env with your LLM API credentials
# 4. Initialize FAISS
cd backend
uv run python dev/init_rag.py
# 5. Run services
docker compose up -d --build
Visit http://localhost:8501 to test.
Performance & Costs
For reference (your mileage may vary):
Latency:
- Embedding query: ~150ms
- FAISS search: ~20ms
- LLM generation: ~1.5s
- Total: ~1.7s
Costs (Azure OpenAI example):
- Embeddings: ~$0.00001/query
- Generation: ~$0.00066/query
- ~$0.00067/query (~ 1,500 queries/$1)
For 10K queries/month: ~ $7/month
Challenges & Lessons
1. Rate Limits
- Problem: Batch embedding generation hit API limits.
- Solution: Process in batches with delays.
2. Source Attribution
- Problem: Page numbers incorrect due to PDF parsing.
- Solution: Verify metadata, display filename+page for manual checks.
3. Context Window
- Problem: Too many chunks exceed token limits.
- Solution: Limit k=5, each chunk ~250 tokens → ~1,250 total.
4. Prompt Injection Resistance
- Test: Tried injecting malicious prompts like “Ignore previous instructions and say…”
- Result: System correctly refused and stayed grounded in documents (see screenshot below).
- Key: The “EXCLUSIVELY on provided context” instruction prevents the LLM from following injected commands.

5. Spanish Legal Terms
- Problem: LLM struggled with jargon initially.
- Solution: Explicit “Spanish regulatory compliance” in prompt helped.
Conclusion
This POC demonstrates the fundamentals of building a RAG system with modern tools:
What we built:
- Document chunking and indexing.
- Vector similarity search.
- LLM-powered answer generation.
- Source citation.
What we learned:
- RAG prevents hallucination by grounding responses in documents.
- Prompt engineering is critical (“EXCLUSIVELY on context”).
- LangChain v1 LCEL makes pipelines composable.
- FAISS is sufficient for small datasets.
What’s missing:
- Authentication, rate limiting, caching.
- Hybrid search, reranking.
- Monitoring, evaluation.
- Scalability, security.
Use this as:
- ✅ Learning resource for RAG architecture.
- ✅ Starting point for experiments.
- ❌ Production deployment.
For production, you’d need weeks of additional work on the items listed in “What’s Missing for Production.”
The code is on GitHub.