Saturday, January 24, 2026

How I Created a Free-Stack AI Support System


Parashar

2 min read

High-Level Architecture

At a glance, the system looks like this:

  • FastAPI → API, auth, request lifecycle
  • LangGraph → AI workflow orchestration
  • Qdrant → Vector database for RAG
  • Valkey (Redis-compatible) → Queues & analytics
  • Hugging Face + Gemini → LLM & embeddings

Chat Flow

Here’s what happens when a user asks a question:

  • Client sends message (POST /api/v1/chat)
  • Auth & context loaded (JWT, headers, page location)
  • Optional vision step (images → text)
  • LangGraph orchestrator runs
  • Intent is classified
  • Correct agent is selected
  • RAG retrieves documents (if needed)
  • Confidence is evaluated
  • Response is finalized or clarified
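To make that lifecycle concrete, here's a minimal sketch of the endpoint. The helpers (`get_current_user`, `describe_image`, `run_orchestrator`) are hypothetical stand-ins for the real auth, vision, and LangGraph layers:

```python
# Hypothetical sketch of the chat endpoint; the three helpers below are
# stubs standing in for the real auth, vision, and LangGraph layers.
from fastapi import Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    page_url: str | None = None   # page context for UI-aware help
    image_b64: str | None = None  # optional image for the vision step

async def get_current_user() -> dict:
    return {"id": "demo"}  # real version validates the JWT

async def describe_image(image_b64: str) -> str:
    return "[image description]"  # real version calls the vision model

async def run_orchestrator(user: dict, message: str,
                           page_url: str | None) -> dict:
    return {"answer": "...", "intent": "unknown", "confidence": 0.0}

@app.post("/api/v1/chat")
async def chat(req: ChatRequest, user: dict = Depends(get_current_user)):
    # Optional vision step: turn any attached image into text first
    if req.image_b64:
        req.message += "\n" + await describe_image(req.image_b64)
    # Hand the enriched message to the LangGraph orchestrator
    return await run_orchestrator(user, req.message, req.page_url)
```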

Intent Classification: The Brain Split

Every query is classified into one of four intents:

  • How‑To → step-by-step guidance
  • Policy → document-backed answers (RAG)
  • Page Help → UI-aware assistance
  • Unknown → fallback handling
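A rough sketch of what that classification step can look like, assuming a generic `llm_complete` callable that wraps whichever chat model is active (Qwen in this setup):

```python
# Hypothetical intent classifier; llm_complete stands in for a call to
# the active chat model.
INTENTS = {"how_to", "policy", "page_help", "unknown"}

PROMPT = (
    "Classify the user message into exactly one intent: "
    "how_to, policy, page_help, or unknown.\n"
    "Message: {message}\nIntent:"
)

def classify_intent(message: str, llm_complete) -> str:
    raw = llm_complete(PROMPT.format(message=message)).strip().lower()
    # Anything outside the four labels collapses to the fallback intent
    return raw if raw in INTENTS else "unknown"
```

Anything unexpected from the model collapses to unknown, which keeps the fallback path honest.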

Why LangGraph (Not Just Chains)

I chose LangGraph because:

  • It supports branching logic
  • Each agent is a node
  • State is explicit and inspectable
  • Confidence gating becomes trivial
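For illustration, here's roughly what that looks like in LangGraph itself, with placeholder agent bodies and only two of the four intents wired up:

```python
# Sketch of the orchestration graph; agent bodies are placeholders, and
# only two of the four intents are shown.
from typing import TypedDict
from langgraph.graph import END, StateGraph

class BotState(TypedDict):
    message: str
    intent: str
    answer: str

def classify(state: BotState) -> dict:
    # Real version calls the LLM classifier shown above
    return {"intent": "how_to"}

def how_to_agent(state: BotState) -> dict:
    return {"answer": "step-by-step guidance..."}

def policy_agent(state: BotState) -> dict:
    return {"answer": "document-backed answer..."}

graph = StateGraph(BotState)
graph.add_node("classify", classify)
graph.add_node("how_to", how_to_agent)
graph.add_node("policy", policy_agent)
graph.set_entry_point("classify")
# Branching logic: route to whichever agent matches the intent
graph.add_conditional_edges("classify", lambda s: s["intent"],
                            {"how_to": "how_to", "policy": "policy"})
graph.add_edge("how_to", END)
graph.add_edge("policy", END)
orchestrator = graph.compile()
```

The conditional edge is the whole trick: routing reads the intent off explicit state, so adding an agent is just another node plus another mapping entry.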

RAG Document Ingestion

When an admin uploads a document:

  • Text is extracted (PDF/TXT/MD)
  • Content is chunked with overlap
  • Metadata is attached (type, title, status)
  • Embeddings are generated
  • Vectors are stored in Qdrant
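A condensed sketch of that pipeline, assuming a hypothetical `embed()` helper (shown in the next section), the standard qdrant-client API, and an existing "docs" collection:

```python
# Sketch of the ingestion path; embed() is the fallback-aware helper
# from the next section, and the "docs" collection already exists.
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Overlapping windows so context isn't cut at chunk boundaries
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def ingest(text: str, title: str, doc_type: str, embed) -> None:
    client = QdrantClient(url="http://localhost:6333")
    points = [
        PointStruct(
            id=i,
            vector=embed(c),
            payload={"title": title, "type": doc_type,
                     "status": "active", "text": c},
        )
        for i, c in enumerate(chunk(text))
    ]
    client.upsert(collection_name="docs", points=points)
```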

Embedding Fallback

  • Primary: HuggingFace embeddings
  • Fallback: Gemini embeddings
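A minimal version of that fallback, assuming the `huggingface_hub` and `google-genai` SDKs and the model IDs listed later in this post:

```python
# Sketch of the embedding fallback: try Hugging Face first, then Gemini.
from huggingface_hub import InferenceClient
from google import genai

hf = InferenceClient()
gemini = genai.Client()  # reads GEMINI_API_KEY from the environment

def embed(text: str) -> list[float]:
    try:
        vec = hf.feature_extraction(text, model="BAAI/bge-small-en-v1.5")
        return vec.tolist()  # 384-dimension vector
    except Exception:
        # Hugging Face rate-limited or unavailable: fall back to Gemini
        resp = gemini.models.embed_content(model="text-embedding-004",
                                           contents=text)
        return resp.embeddings[0].values
```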

Confidence-Aware Responses

After an agent generates a response, the system asks:

“How confident are we?”

  • If confidence ≥ threshold → answer
  • If confidence < threshold → ask clarifying questions
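In code, the gate is just a comparison; the threshold here is illustrative, and `score_confidence` stands in for the Gemini scoring call described below:

```python
# Minimal confidence gate; score_confidence is a stand-in for the
# Gemini-based scorer, and the threshold value is illustrative.
CONFIDENCE_THRESHOLD = 0.7

def gate(question: str, answer: str, score_confidence) -> dict:
    score = score_confidence(question, answer)  # 0.0 - 1.0
    if score >= CONFIDENCE_THRESHOLD:
        return {"type": "answer", "text": answer, "confidence": score}
    # Low confidence: narrow the question down instead of guessing
    return {"type": "clarify",
            "text": "Could you tell me a bit more about what "
                    "you're trying to do?",
            "confidence": score}
```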

Why Valkey

Every chat request is also:

  • Logged asynchronously
  • Analyzed for confidence drops
  • Tracked for usage patterns

Using Valkey lets me:

  • Avoid blocking requests
  • Scale workers independently
  • Improve the bot over time
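Because Valkey speaks the Redis protocol, the standard `redis-py` client works unchanged. A sketch of the fire-and-forget event queue plus a worker that consumes it:

```python
# Sketch of the analytics queue; Valkey is Redis-compatible, so the
# ordinary redis-py client works as-is.
import json
import redis

valkey = redis.Redis(host="localhost", port=6379)

def log_chat_event(user_id: str, intent: str, confidence: float) -> None:
    # Push and return immediately; the request is never blocked
    valkey.lpush("chat_events", json.dumps({
        "user": user_id, "intent": intent, "confidence": confidence,
    }))

def worker_loop() -> None:
    # A separate worker process pops events and updates counters,
    # so it can scale independently of the API
    while True:
        _, raw = valkey.brpop("chat_events")
        event = json.loads(raw)
        valkey.hincrby("intent_counts", event["intent"], 1)
```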

Models Used (Free‑Tier Friendly)

Embedding Models:

  • Primary: BAAI/bge-small-en-v1.5 (Hugging Face)
    • 384‑dimension embeddings
    • Excellent semantic retrieval quality for RAG
    • Fast and lightweight
  • Fallback: text-embedding-004 (Gemini)
    • Automatically used if Hugging Face is rate‑limited or unavailable

Chat & Reasoning

  • Hugging Face Chat & Confidence Models: Qwen/Qwen2.5-1.5B-Instruct
    • Used for intent classification, lightweight responses, and confidence scoring
    • Fully open-source and free-tier compatible
  • Gemini Chat & Assistant: gemini-2.5-flash
    • Handles complex reasoning and document-grounded responses
    • Large context window for RAG answers

Vision / OCR

  • Hugging Face Vision Model: Salesforce/blip-image-captioning-large
    • Generates image captions and analyzes visual content
  • Hugging Face OCR Model: microsoft/trocr-base-printed
    • Extracts text from scanned documents and images
    • Integrated into RAG ingestion pipeline
  • Gemini OCR fallback: gemini-2.5-flash
    • Used if Hugging Face OCR hits limits

Confidence Evaluation Model

  • Confidence Model: gemini-2.5-flash
    • Scores response reliability
    • Powers the confidence gate that decides whether to answer or clarify
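To keep all of these assignments in one place, a single registry map is handy; the structure here is illustrative, but the IDs are exactly the ones above:

```python
# Illustrative central model registry; IDs match the list above.
MODELS = {
    "embedding_primary":  "BAAI/bge-small-en-v1.5",      # Hugging Face
    "embedding_fallback": "text-embedding-004",          # Gemini
    "chat_light":         "Qwen/Qwen2.5-1.5B-Instruct",  # intents, scoring
    "chat_reasoning":     "gemini-2.5-flash",            # RAG answers
    "vision":             "Salesforce/blip-image-captioning-large",
    "ocr":                "microsoft/trocr-base-printed",
    "ocr_fallback":       "gemini-2.5-flash",
    "confidence":         "gemini-2.5-flash",
}
```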

Why This Stack Is “Free”

  • FastAPI – open source
  • Qdrant – open source
  • Valkey – open source
  • LangGraph – open source
  • Hugging Face Inference API – generous free tier
  • Gemini API – free tier with key rotation (see the sketch below)
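Key rotation just means cycling to the next free-tier key when the current one hits its quota. A minimal sketch, with quota detection simplified to a bare exception:

```python
# Minimal key-rotation sketch: cycle to the next free-tier key whenever
# the current one is exhausted. Real version checks for 429/quota errors.
from itertools import cycle

API_KEYS = cycle(["KEY_1", "KEY_2", "KEY_3"])  # your free-tier keys
current_key = next(API_KEYS)

def call_with_rotation(make_request):
    global current_key
    for _ in range(3):  # at most one full pass over the keys
        try:
            return make_request(current_key)
        except Exception:
            current_key = next(API_KEYS)
    raise RuntimeError("all keys exhausted")
```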