AI Chatbot with RAG

Personal portfolio chatbot using vector search and semantic similarity

June 7, 2025

AI Chatbot with Retrieval-Augmented Generation (RAG)

A sophisticated chatbot that can answer questions about my professional background, experience, and projects using semantic search and AI-powered responses.

Project Overview

This project implements a full-stack chatbot solution that:

  • Processes and indexes all content from my portfolio website
  • Uses semantic search to find relevant context for user questions
  • Generates personalized responses using OpenAI’s GPT models
  • Provides source citations and confidence scoring

Initial Development Plan

  1. Backend Setup: Create /backend directory for Python chatbot code
  2. Frontend Interface: Build React/TypeScript chat interface
  3. Core Dependencies: Install required libraries
pip install langchain openai faiss-cpu tiktoken python-dotenv
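As a rough sketch of how those dependencies fit together at index-build time (the content path, chunk sizes, and classic LangChain import style are my assumptions, not the actual values used):

# build_index.py — illustrative sketch, assuming the classic LangChain import paths
from pathlib import Path
from dotenv import load_dotenv
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

load_dotenv()  # expects OPENAI_API_KEY in .env

# Load portfolio content (hypothetical ./content directory of markdown files)
docs = [
    Document(page_content=p.read_text(), metadata={"source": str(p)})
    for p in Path("content").rglob("*.md")
]

# Split into overlapping chunks so each embedding stays focused on one topic
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=150
).split_documents(docs)

# Embed the chunks and persist a local FAISS index alongside the app
FAISS.from_documents(chunks, OpenAIEmbeddings()).save_local("vectorstore")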

Tech Stack

Frontend

  • Astro + React + TypeScript: Static site generation with interactive components
  • CSS: Custom styling for chat interface
  • Netlify: Hosting and deployment

Backend

  • FastAPI: High-performance API framework
  • RAG Pipeline: FAISS + OpenAI embeddings + LangChain
  • AWS Lambda: Serverless container deployment
  • Docker: Containerization for consistent deployments
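A minimal sketch of how those backend pieces can be wired together; the real main.py is far more involved, and Mangum as the Lambda/ASGI adapter is my assumption rather than something stated in the post:

# main.py (greatly simplified) — the general shape, not the actual implementation
from fastapi import FastAPI
from mangum import Mangum
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

app = FastAPI()

# Load the prebuilt FAISS index shipped inside the container image
vectorstore = FAISS.load_local("vectorstore", OpenAIEmbeddings())
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.3),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

@app.post("/chat")
def chat(payload: dict) -> dict:
    result = qa_chain({"query": payload["question"]})
    return {
        "answer": result["result"],
        "sources": [doc.metadata.get("source") for doc in result["source_documents"]],
    }

handler = Mangum(app)  # entry point that Lambda invokes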

Infrastructure

  • OpenTofu (open-source fork of Terraform): Infrastructure as Code
  • AWS API Gateway: REST API management
  • AWS ECR: Container registry
  • AWS SSM: Secure API key storage
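For the SSM piece, retrieving the key at cold start boils down to something like this (the parameter name is a placeholder):

# secrets.py (sketch) — pull the OpenAI key from SSM at cold start
import os
import boto3

def load_openai_key() -> str:
    # SecureString parameters are decrypted server-side with WithDecryption=True
    ssm = boto3.client("ssm")
    response = ssm.get_parameter(
        Name="/chatbot/openai-api-key",   # placeholder parameter name
        WithDecryption=True,
    )
    key = response["Parameter"]["Value"]
    os.environ["OPENAI_API_KEY"] = key    # LangChain/OpenAI clients read this env var
    return key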

Security Features

  • CORS protection for cross-origin requests
  • Input validation and content filtering
  • Rate limiting and API key management
  • Environment-specific configuration
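In FastAPI terms, the CORS and input-validation layers look roughly like this (the allowed origin and length limits are placeholders):

# Sketch of the CORS + request-validation layer — values are placeholders
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field

app = FastAPI()

# Browsers may only call the API from the portfolio site's origin
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://www.example.com"],  # placeholder origin
    allow_methods=["POST"],
    allow_headers=["*"],
)

class ChatRequest(BaseModel):
    # Empty or oversized questions are rejected with a 422 before any LLM call
    question: str = Field(min_length=1, max_length=1000)

@app.post("/chat")
def chat(req: ChatRequest) -> dict:
    # Content filtering and rate limiting sit behind this point (see filters.py)
    return {"answer": "..."}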

Future Roadmap

  • Automated Updates: Rebuild vector stores when new content is published
  • Enhanced Personality: Improve response quality with better context and fine-tuning
  • Multi-Source Data: Integrate GitHub repos, social media profiles, and other data sources
  • Authentication: User management and rate limiting
  • Conversation Memory: Graph database for conversation history and context compression

Development Journey

September 2, 2025 Update

The Reality Check: This project turned out to be significantly more complex than initially anticipated!

While the core functionality worked locally within a day, deploying it safely to production involved navigating multiple AWS services, infrastructure-as-code complexities, and containerization challenges.

Infrastructure Evolution

Initial Terraform Configuration:

  • ZIP-based Lambda deployment via S3
  • API Gateway REST API setup
  • AWS SSM parameter management
  • IAM roles and policies
  • CloudWatch logging configuration

Migration to OpenTofu: During development, Terraform’s licensing changes led to adopting OpenTofu, which provided better open-source compatibility for this project.

The AWS ECR Learning Experience

The Problem: Initially avoided AWS ECR to stay within free tier limits, using S3-based ZIP deployments instead.

The Challenge: Encountered persistent “Cannot find module” errors due to conflicts between:

  • Dependencies bundled in the Lambda function ZIP
  • Dependencies in carefully size-optimized Lambda Layers

This led to an endless debugging cycle involving ZIP file validation, Terraform cache clearing, and Lambda Layer hash verification.

The Solution: Migrated to AWS ECR with Docker containers, which resolved dependency conflicts immediately and provided more reliable deployments.

Performance and Quality Improvements

Content Filtering: Implemented sophisticated filtering to maintain:

  • Professional topic focus
  • Personality consistency
  • API key abuse prevention
  • Cultural and contextual appropriateness
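A stripped-down illustration of the filtering idea; the real filters.py rules are more nuanced, and the keyword lists here are invented:

# filters.py (simplified) — illustrative keyword/topic gate, not the real rule set
BLOCKED_PATTERNS = ["ignore previous instructions", "system prompt"]        # invented examples
ON_TOPIC_HINTS = ["experience", "project", "skill", "work", "background"]   # invented examples

def is_allowed(question: str) -> bool:
    q = question.lower()
    # Reject obvious prompt-injection / abuse attempts outright
    if any(pattern in q for pattern in BLOCKED_PATTERNS):
        return False
    # Keep the bot on professional topics; everything else gets a polite refusal
    return any(hint in q for hint in ON_TOPIC_HINTS)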

Response Enhancement Features:

  • Confidence Scoring: Assess and display response reliability
  • Source Citations: Clickable links to original content for verification
  • Response Summarization: Skills-focused, concise answers
  • Modular Architecture: Split functionality across multiple files for better maintainability
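The confidence score can be derived from the retrieval distances FAISS already returns, roughly like this (the thresholds are invented and would need empirical tuning):

# confidence.py (simplified) — map FAISS distances to a user-facing confidence label
from langchain.vectorstores import FAISS

def confidence_for(vectorstore: FAISS, question: str, k: int = 4) -> str:
    # similarity_search_with_score returns (document, distance); lower means closer
    results = vectorstore.similarity_search_with_score(question, k=k)
    if not results:
        return "low"
    best_distance = min(score for _, score in results)
    if best_distance < 0.3:      # illustrative thresholds
        return "high"
    if best_distance < 0.6:
        return "medium"
    return "low"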

Current Architecture

chatbot/
├── main.py          # FastAPI app + endpoints (259 lines)
├── filters.py       # Content filtering functions
├── sources.py       # URL conversion + clickable links
├── confidence.py    # Response confidence scoring
├── summarization.py # Skills-focused response summarization
├── services.py      # Vectorstore + QA chain management
├── models.py        # Pydantic schemas
└── routes.py        # Additional API routes
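For context, models.py holds Pydantic schemas along these lines; the field names are my guess at the shape, not the actual definitions:

# models.py (sketch) — request/response schemas; field names are hypothetical
from typing import List
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    question: str = Field(min_length=1, max_length=1000)

class Source(BaseModel):
    title: str
    url: str            # clickable link back to the original blog/portfolio page

class ChatResponse(BaseModel):
    answer: str
    confidence: str     # e.g. "high" / "medium" / "low"
    sources: List[Source] = []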

Infrastructure Considerations

Terraform/OpenTofu Caching: Implemented hash-based change detection for efficient deployments while managing the inherent caching complexities of infrastructure-as-code tools.
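The idea behind hash-based change detection, sketched in Python purely to show the concept (OpenTofu handles this natively with functions like filebase64sha256):

# Concept sketch: redeploy only when the source hash changes
import hashlib
from pathlib import Path

def source_hash(directory: str) -> str:
    digest = hashlib.sha256()
    for path in sorted(Path(directory).rglob("*.py")):
        digest.update(path.read_bytes())
    return digest.hexdigest()

# Compare against the hash recorded at the last deploy; skip the build if unchanged
print(source_hash("chatbot"))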

Next Evolution: Planning migration to agentic workflows for more sophisticated conversation handling and multi-step reasoning capabilities.


Production deployment bombed. Took a break and came back to it. CloudWatch logs are empty, and I decided I was too stingy to pay for more logging. Had a chat with Claude and teased out that the problem is probably the REST API URL: it lives in .env (not checked in), so I have to set a Netlify environment variable for the project. Duh.


Performance Optimization Analysis

🎯 Top Space Consumers:

  1. numpy.libs - 26MB (Linear algebra operations, used by FAISS)
  2. zstandard - 21MB (Compression library, used by LangSmith)
  3. numpy - 18MB (Numerical computing, required by FAISS)
  4. sqlalchemy - 14MB (Database ORM, used by LangChain)
  5. faiss - 13MB (Vector similarity search - the main index)
  6. langchain_community - 12MB (Community LangChain integrations)
  7. langchain - 7MB (Core LangChain framework)
  8. aiohttp - 6.5MB (Async HTTP client)
  9. openai - 5.9MB (OpenAI API client)
  10. pydantic_core - 4.5MB (Data validation)
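A breakdown like this can be reproduced by walking the installed packages in the bundle; something along these lines (the site-packages path is a placeholder):

# Size up each top-level package in the Lambda bundle — path is a placeholder
from pathlib import Path

site_packages = Path("build/python/lib/python3.11/site-packages")

def dir_size_mb(path: Path) -> float:
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1_048_576

sizes = {
    pkg.name: dir_size_mb(pkg)
    for pkg in site_packages.iterdir()
    if pkg.is_dir() and not pkg.name.endswith(".dist-info")
}

for name, mb in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name:25s} {mb:6.1f} MB")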

📊 Optimization Potential:

High Impact Removals (if moving to Pinecone):

  • faiss + numpy + numpy.libs = 57MB saved (35% of app code)
  • Keep everything else for LangChain functionality

Medium Impact:

  • zstandard (21MB) - Used by LangSmith logging, could disable
  • sqlalchemy (14MB) - Used by LangChain, hard to remove

Low Impact:

  • Most other dependencies are essential for FastAPI/LangChain

Industry Standard Approaches for Vector Search in Production:

🏭 Enterprise/Large Scale (Standard Practice):

Managed Vector Databases (90% of production deployments):

  • Pinecone - Most popular, fully managed
  • Weaviate - Open source + managed options
  • Qdrant - Growing in popularity
  • OpenSearch - AWS native option
  • pgvector - PostgreSQL extension (trending)

Why this is standard:

  • Separation of concerns - Vector search ≠ application logic
  • Independent scaling - Vector DB scales separately from app
  • Multi-client access - Multiple apps can query same vectors
  • Updates without redeployment - Add/update vectors without touching code

🏢 Cloud-Native Standard Architecture:

Frontend → API Gateway → Lambda/Container → Vector DB (external)
                                          ↓
                                   Document Store (S3)

Not: Frontend → API Gateway → Lambda Container (with embedded vectors)
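To make the contrast concrete, here is roughly what the external-store variant looks like with pgvector (one of the options listed above); the table, column names, and connection string are invented for illustration:

# Sketch: the Lambda queries pgvector over the network instead of a bundled FAISS index
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def top_chunks(query_embedding: list[float], k: int = 4) -> list[str]:
    with psycopg.connect("postgresql://user:pass@vector-db/portfolio") as conn:
        register_vector(conn)  # lets psycopg send numpy arrays as pgvector values
        rows = conn.execute(
            # <=> is pgvector's cosine-distance operator; smaller means more similar
            "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (np.array(query_embedding), k),
        ).fetchall()
    return [content for (content,) in rows]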

📊 Approach Analysis:

Use Case           | Embedded FAISS | External Vector DB
Prototypes/Demos   | ✅ Common       | ❌ Overkill
Personal Projects  | ✅ Fine         | ⚠️ May be expensive
Production Apps    | ❌ Rare         | ✅ Standard
Enterprise         | ❌ Never        | ✅ Always

🎯 My Current Approach vs Industry:

Current setup: Embedded FAISS (like SQLite - file-based, bundled)
Industry standard: External vector DB (like PostgreSQL - service-based)

Ultimately, it comes down to cost and cost liability in the event my blog is discovered and abused heavily.

🚀 Migration Path (Standard Practice):

  1. Start with embedded (where you are) - Fast to prototype
  2. Shave off about 100MB by switching to multi-stage Docker builds
  3. Move to managed (Pinecone/Weaviate) - When you need reliability but adds ~5-20ms latency
  4. Consider self-hosted (Qdrant/Weaviate) - When you need control/cost savings

I looked at offloading the LLM packages to AWS Bedrock and leveraging the LLMs within the AWS ecosystem, but the cost difference of $0.15/1M tokens vs $3/1M tokens really put me off.

I looked at AWS OpenSearch as a vector store, and the pricing also put me off.

💼 Real-World Examples:

  • Notion - Uses Pinecone for semantic search
  • GitHub Copilot - Uses custom vector infrastructure
  • ChatGPT plugins - Most use Pinecone/Weaviate
  • Startups - 80% start with Pinecone, some move to self-hosted

Okay, it is good enough for now.

Summary:

What I have is a single-modal RAG chatbot that can answer questions about me from my blog, runs stupidly cheap, and is secured and deployed in a serverless container in the AWS ecosystem. I tweaked the temps and prompts so it doesn't sound completely boring and mechanical, but it is still a bit robotic. I can live with that for now.

What makes a human-sounding, intelligent chatbot is not just the LLM but the personality, the context, and the countless facets of a person. It makes sense to extend this chatbot in a multi-modal way: add RAG for sample code (GitHub data mining), essays on leadership and management, blog rants, technical decisions/principles/beliefs, and the other things that make me, me.

The next iteration may be multi-modal vector stores that are queried and synthesized into a single response, plus handling long chat conversations by compacting them: summarizing older turns, inferring what the user is interested in, and steering future responses accordingly.
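A rough sketch of that conversation-compaction idea; the prompt wording and turn limits are placeholders, not a design decision:

# Sketch of compacting long conversations: keep recent turns verbatim,
# fold older turns into a running summary that captures the user's interests.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)
KEEP_RECENT = 6  # placeholder: number of turns kept verbatim

def compact(history: list[str], summary: str) -> tuple[list[str], str]:
    if len(history) <= KEEP_RECENT:
        return history, summary
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    prompt = (
        "Update this running summary of a conversation, noting what the user "
        f"seems interested in.\nSummary so far: {summary}\nNew turns:\n" + "\n".join(older)
    )
    new_summary = llm.predict(prompt)
    return recent, new_summary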