High-Level Architecture
At a glance, the system looks like this:
- FastAPI → API, auth, request lifecycle
- LangGraph → AI workflow orchestration
- Qdrant → Vector database for RAG
- Valkey (Redis-compatible) → Queues & analytics
- Hugging Face + Gemini → LLMs & embeddings
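To make the wiring concrete, here is a minimal configuration sketch. The environment-variable names and defaults are my assumptions for illustration, not the project's actual settings:

```python
# Hypothetical settings module (pydantic-settings); every name here is an
# illustrative assumption, loaded from environment variables at startup.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    qdrant_url: str = "http://localhost:6333"     # Qdrant: vector store for RAG
    valkey_url: str = "redis://localhost:6379/0"  # Valkey: queues & analytics
    hf_api_token: str = ""                        # Hugging Face Inference API
    gemini_api_key: str = ""                      # Gemini chat/embedding fallback
    jwt_secret: str = "change-me"                 # FastAPI JWT auth

settings = Settings()
```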
Chat Flow:
Here’s what happens when a user asks a question (sketched in code after the list):
- Client sends message (POST /api/v1/chat)
- Auth & context loaded (JWT, headers, page location)
- Optional vision step (images → text)
- LangGraph orchestrator runs:
  - Intent is classified
  - Correct agent is selected
  - RAG retrieves documents (if needed)
  - Confidence is evaluated
  - Response is finalized or clarified
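Here is a hedged sketch of that endpoint. The route path comes from the list above; the helpers are stubs standing in for the real auth, vision, and orchestration modules:

```python
from fastapi import APIRouter, Depends, Header
from pydantic import BaseModel

router = APIRouter(prefix="/api/v1")

class ChatRequest(BaseModel):
    message: str
    page: str | None = None        # page location sent by the client
    image_b64: str | None = None   # optional image for the vision step

async def current_user(authorization: str = Header(...)) -> str:
    # Stub: the real dependency validates the JWT and returns the user.
    return "user-id-from-jwt"

async def run_orchestrator(state: dict) -> dict:
    # Stub: the real call is graph.ainvoke(state) on the compiled LangGraph
    # workflow (sketched under "Why LangGraph" below).
    return {"response": "..."}

@router.post("/chat")
async def chat(req: ChatRequest, user: str = Depends(current_user)):
    text = req.message
    if req.image_b64:
        # Optional vision step: images become text before orchestration.
        text += "\n[image description appended here]"
    result = await run_orchestrator(
        {"user": user, "page": req.page, "message": text}
    )
    return {"answer": result["response"]}
```

The handler stays deliberately thin: everything interesting happens inside the LangGraph workflow.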
Intent Classification: The Brain Split
Every query is classified into one of four intents:
- How‑To → step-by-step guidance
- Policy → document-backed answers (RAG)
- Page Help → UI-aware assistance
- Unknown → fallback handling
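A minimal sketch of that split, assuming a prompt-based classifier (the prompt wording and the `llm_complete` callable are illustrative, not the project's exact code):

```python
from typing import Literal

Intent = Literal["how_to", "policy", "page_help", "unknown"]

INTENT_PROMPT = (
    "Classify the user message into exactly one of: "
    "how_to, policy, page_help, unknown.\n"
    "Message: {message}\nIntent:"
)

def classify_intent(message: str, llm_complete) -> Intent:
    # llm_complete is any text-completion callable, e.g. the Qwen model
    # behind the Hugging Face Inference API (see "Models Used" below).
    raw = llm_complete(INTENT_PROMPT.format(message=message)).strip().lower()
    # Anything unrecognized falls through to the "unknown" handler.
    return raw if raw in ("how_to", "policy", "page_help") else "unknown"
```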
Why LangGraph (Not Just Chains)
I chose LangGraph because:
- It supports branching logic
- Each agent is a node
- State is explicit and inspectable
- Confidence gating becomes trivial
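Here is roughly what that looks like as a LangGraph `StateGraph`. The node bodies are placeholders, and only the policy branch is wired in full to keep the sketch short:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ChatState(TypedDict):
    message: str
    intent: str
    response: str
    confidence: float

def classify(state: ChatState) -> dict:
    # Placeholder: the real node calls the intent classifier shown earlier.
    return {"intent": "policy"}

def policy_agent(state: ChatState) -> dict:
    # Placeholder: the real node runs RAG retrieval and answer generation.
    return {"response": "...", "confidence": 0.9}

def route(state: ChatState) -> str:
    return state["intent"]  # conditional edge: pick the agent by intent

builder = StateGraph(ChatState)
builder.add_node("classify", classify)
builder.add_node("policy", policy_agent)
# how_to, page_help, and unknown agents are added the same way
builder.set_entry_point("classify")
builder.add_conditional_edges("classify", route, {"policy": "policy"})
builder.add_edge("policy", END)
graph = builder.compile()
```

Because the state is a plain TypedDict, every node's input and output can be logged or asserted in tests, which is what makes the confidence gate cheap to bolt on.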
RAG Document Ingestion
When an admin uploads a document, the pipeline runs these steps (sketched in code after the list):
- Text is extracted (PDF/TXT/MD)
- Content is chunked with overlap
- Metadata is attached (type, title, status)
- Embeddings are generated
- Vectors are stored in Qdrant
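A sketch of those steps, assuming a character-based chunker with fixed overlap and a `documents` collection. The chunk size, overlap, and payload field names are my assumptions:

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Fixed-size character chunks; each chunk repeats the last `overlap`
    # characters of the previous one so context survives the split.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(client: QdrantClient, text: str, title: str, doc_type: str, embed):
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embed(chunk),  # 384-dim bge embedding (see below)
            payload={"doc_type": doc_type, "title": title,
                     "status": "active", "text": chunk},
        )
        for chunk in chunk_text(text)
    ]
    client.upsert(collection_name="documents", points=points)
```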
Embedding Fallback
- Primary: HuggingFace embeddings
- Fallback: Gemini embeddings
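Sketched as a try/except around the primary client. The token is a placeholder and error handling is simplified; the `output_dimensionality=384` argument keeps Gemini vectors compatible with the 384-dim collection:

```python
from huggingface_hub import InferenceClient
import google.generativeai as genai

hf = InferenceClient(model="BAAI/bge-small-en-v1.5", token="hf_...")
genai.configure(api_key="...")

def embed(text: str) -> list[float]:
    try:
        # Primary: Hugging Face feature extraction (384-dim bge vectors).
        return [float(x) for x in hf.feature_extraction(text)]
    except Exception:
        # Fallback: Gemini embeddings when Hugging Face is rate-limited or down.
        result = genai.embed_content(
            model="models/text-embedding-004",
            content=text,
            output_dimensionality=384,  # match the bge vector size
        )
        return result["embedding"]
```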
Confidence-Aware Responses
After an agent generates a response, the system asks:
“How confident are we?”
- If confidence ≥ threshold → answer
- If confidence < threshold → ask clarifying questions
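In code, the gate is just a threshold check; 0.7 is an assumed value, not the project's actual setting:

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed value

def gate(response: str, confidence: float) -> dict:
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"answer": response, "needs_clarification": False}
    # Below threshold: ask the user to narrow the question instead of guessing.
    return {
        "answer": "I want to make sure I get this right. Could you share "
                  "a bit more detail about what you're trying to do?",
        "needs_clarification": True,
    }
```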
Why Valkey
Every chat request is also:
- Logged asynchronously
- Analyzed for confidence drops
- Tracked for usage patterns
Using Valkey lets me:
- Avoid blocking requests
- Scale workers independently
- Improve the bot over time
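A sketch of that path: since Valkey speaks the Redis protocol, the standard `redis` client works unchanged (the queue name and event shape are assumptions):

```python
import json
import redis

# Valkey is Redis-protocol compatible, so the standard client works as-is.
valkey = redis.Redis.from_url("redis://localhost:6379/0")

def log_chat_event(user_id: str, intent: str, confidence: float) -> None:
    # O(1) push; the request handler never blocks on analytics work.
    valkey.lpush("chat_events", json.dumps({
        "user_id": user_id, "intent": intent, "confidence": confidence,
    }))

# A separate worker process drains the queue on its own schedule:
#   _, raw = valkey.brpop("chat_events")   # blocks until an event arrives
```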
Models Used (Free‑Tier Friendly)
Embedding Models:
- Primary: BAAI/bge-small-en-v1.5 (Hugging Face)
  - 384‑dimension embeddings
  - Excellent semantic retrieval quality for RAG
  - Fast and lightweight
- Fallback: text-embedding-004 (Gemini)
  - Automatically used if Hugging Face is rate‑limited or unavailable
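The 384-dimension figure is what sizes the vector store. A sketch of the matching Qdrant collection setup (the collection name is assumed):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents",  # assumed name, matching the ingestion sketch
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
```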
Chat & Reasoning
- Hugging Face Chat & Confidence Model: Qwen/Qwen2.5-1.5B-Instruct
  - Used for intent classification, lightweight responses, and confidence scoring
  - Fully open-source and free-tier compatible
- Gemini Chat & Assistant: gemini-2.5-flash
  - Handles complex reasoning and document-grounded responses
  - Large context window for RAG answers
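A sketch of how the two chat models might be invoked. The message shapes follow each client library's documented API, but the prompts and the division of labor shown here are illustrative:

```python
from huggingface_hub import InferenceClient
import google.generativeai as genai

qwen = InferenceClient(model="Qwen/Qwen2.5-1.5B-Instruct", token="hf_...")
genai.configure(api_key="...")
gemini = genai.GenerativeModel("gemini-2.5-flash")

def lightweight_reply(prompt: str) -> str:
    # Qwen over the HF Inference API: classification and short answers.
    out = qwen.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=256
    )
    return out.choices[0].message.content

def grounded_reply(question: str, context: str) -> str:
    # Gemini: document-grounded answers that need the larger context window.
    resp = gemini.generate_content(f"Context:\n{context}\n\nQuestion: {question}")
    return resp.text
```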
Vision / OCR
- Hugging Face Vision Model: Salesforce/blip-image-captioning-large
  - Generates image captions and analyzes visual content
- Hugging Face OCR Model: microsoft/trocr-base-printed
  - Extracts text from scanned documents and images
  - Integrated into the RAG ingestion pipeline
- Gemini OCR fallback: gemini-2.5-flash
  - Used if Hugging Face OCR hits limits
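A sketch of the vision step using the HF client's `image_to_text` task. The tokens are placeholders and the combined-output format is my assumption; the Gemini fallback would wrap this the same way as the embedding fallback above:

```python
from huggingface_hub import InferenceClient

captioner = InferenceClient(model="Salesforce/blip-image-captioning-large",
                            token="hf_...")
ocr = InferenceClient(model="microsoft/trocr-base-printed", token="hf_...")

def image_to_context(image_bytes: bytes) -> str:
    # Caption describes the scene; OCR recovers any printed text.
    caption = captioner.image_to_text(image_bytes).generated_text
    printed = ocr.image_to_text(image_bytes).generated_text
    return f"Image caption: {caption}\nExtracted text: {printed}"
```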
Confidence Evaluation Model
- Confidence Model: gemini-2.5-flash
  - Scores response reliability
  - Powers the confidence gate that decides whether to answer or clarify
Why This Stack Is “Free”
- FastAPI – open source
- Qdrant – open source
- Valkey – open source
- LangGraph – open source
- Hugging Face Inference API – generous free tier
- Gemini API – free tier with key rotation