AI Chatbot with RAG

Personal portfolio chatbot using vector search and semantic similarity

June 7, 2025

AI Chatbot with Retrieval-Augmented Generation (RAG)

A sophisticated chatbot that can answer questions about my professional background, experience, and projects using semantic search and AI-powered responses.

Project Overview

This project implements a full-stack chatbot solution that:

  • Processes and indexes all content from my portfolio website
  • Uses semantic search to find relevant context for user questions
  • Generates personalized responses using OpenAI’s GPT models
  • Provides source citations and confidence scoring

Initial Development Plan

  1. Backend Setup: Create /backend directory for Python chatbot code
  2. Frontend Interface: Build React/TypeScript chat interface
  3. Core Dependencies: Install required libraries
pip install langchain openai faiss-cpu tiktoken python-dotenv
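As a rough sketch of how those dependencies fit together at index-build time (the content path, chunk sizes, and classic LangChain import style are my assumptions, not the actual values used):

# build_index.py — illustrative sketch, assuming the classic LangChain import paths
from pathlib import Path
from dotenv import load_dotenv
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

load_dotenv()  # expects OPENAI_API_KEY in .env

# Load portfolio content (hypothetical ./content directory of markdown files)
docs = [
    Document(page_content=p.read_text(), metadata={"source": str(p)})
    for p in Path("content").rglob("*.md")
]

# Split into overlapping chunks so each embedding stays focused on one topic
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=150
).split_documents(docs)

# Embed the chunks and persist a local FAISS index alongside the app
FAISS.from_documents(chunks, OpenAIEmbeddings()).save_local("vectorstore")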

Tech Stack

Frontend

  • Astro + React + TypeScript: Static site generation with interactive components
  • CSS: Custom styling for chat interface
  • Netlify: Hosting and deployment

Backend

  • FastAPI: High-performance API framework
  • RAG Pipeline: FAISS + OpenAI embeddings + LangChain
  • AWS Lambda: Serverless container deployment
  • Docker: Containerization for consistent deployments
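A minimal sketch of how those backend pieces can be wired together; the real main.py is far more involved, and Mangum as the Lambda/ASGI adapter is my assumption rather than something stated in the post:

# main.py (greatly simplified) — the general shape, not the actual implementation
from fastapi import FastAPI
from mangum import Mangum
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

app = FastAPI()

# Load the prebuilt FAISS index shipped inside the container image
vectorstore = FAISS.load_local("vectorstore", OpenAIEmbeddings())
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.3),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,
)

@app.post("/chat")
def chat(payload: dict) -> dict:
    result = qa_chain({"query": payload["question"]})
    return {
        "answer": result["result"],
        "sources": [doc.metadata.get("source") for doc in result["source_documents"]],
    }

handler = Mangum(app)  # entry point that Lambda invokes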

Infrastructure

  • OpenTofu (open-source fork of Terraform): Infrastructure as Code
  • AWS API Gateway: REST API management
  • AWS ECR: Container registry
  • AWS SSM: Secure API key storage
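For the SSM piece, retrieving the key at cold start boils down to something like this (the parameter name is a placeholder):

# secrets.py (sketch) — pull the OpenAI key from SSM at cold start
import os
import boto3

def load_openai_key() -> str:
    # SecureString parameters are decrypted server-side with WithDecryption=True
    ssm = boto3.client("ssm")
    response = ssm.get_parameter(
        Name="/chatbot/openai-api-key",   # placeholder parameter name
        WithDecryption=True,
    )
    key = response["Parameter"]["Value"]
    os.environ["OPENAI_API_KEY"] = key    # LangChain/OpenAI clients read this env var
    return key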

Security Features

  • CORS protection for cross-origin requests
  • Input validation and content filtering
  • Rate limiting and API key management
  • Environment-specific configuration
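In FastAPI terms, the CORS and input-validation layers look roughly like this (the allowed origin and length limits are placeholders):

# Sketch of the CORS + request-validation layer — values are placeholders
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field

app = FastAPI()

# Browsers may only call the API from the portfolio site's origin
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://www.example.com"],  # placeholder origin
    allow_methods=["POST"],
    allow_headers=["*"],
)

class ChatRequest(BaseModel):
    # Empty or oversized questions are rejected with a 422 before any LLM call
    question: str = Field(min_length=1, max_length=1000)

@app.post("/chat")
def chat(req: ChatRequest) -> dict:
    # Content filtering and rate limiting sit behind this point (see filters.py)
    return {"answer": "..."}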

Future Roadmap

  • Automated Updates: Rebuild vector stores when new content is published
  • Enhanced Personality: Improve response quality with better context and fine-tuning
  • Multi-Source Data: Integrate GitHub repos, social media profiles, and other data sources
  • Authentication: User management and rate limiting
  • Conversation Memory: Graph database for conversation history and context compression

Development Journey

September 2, 2025 Update

The Reality Check: This project turned out to be significantly more complex than initially anticipated!

While the core functionality worked locally within a day, deploying it safely to production involved navigating multiple AWS services, infrastructure-as-code complexities, and containerization challenges.

Infrastructure Evolution

Initial Terraform Configuration:

  • ZIP-based Lambda deployment via S3
  • API Gateway REST API setup
  • AWS SSM parameter management
  • IAM roles and policies
  • CloudWatch logging configuration

Migration to OpenTofu: During development, Terraform’s licensing changes led to adopting OpenTofu, which provided better open-source compatibility for this project.

The AWS ECR Learning Experience

The Problem: Initially avoided AWS ECR to stay within free tier limits, using S3-based ZIP deployments instead.

The Challenge: Encountered persistent “Cannot find module” errors due to conflicts between:

  • Dependencies bundled in the Lambda function ZIP
  • Dependencies in carefully size-optimized Lambda Layers

This led to an endless debugging cycle involving ZIP file validation, Terraform cache clearing, and Lambda Layer hash verification.

The Solution: Migrated to AWS ECR with Docker containers, which resolved dependency conflicts immediately and provided more reliable deployments.

Performance and Quality Improvements

Content Filtering: Implemented sophisticated filtering to maintain:

  • Professional topic focus
  • Personality consistency
  • API key abuse prevention
  • Cultural and contextual appropriateness
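A stripped-down illustration of the filtering idea; the real filters.py rules are more nuanced, and the keyword lists here are invented:

# filters.py (simplified) — illustrative keyword/topic gate, not the real rule set
BLOCKED_PATTERNS = ["ignore previous instructions", "system prompt"]        # invented examples
ON_TOPIC_HINTS = ["experience", "project", "skill", "work", "background"]   # invented examples

def is_allowed(question: str) -> bool:
    q = question.lower()
    # Reject obvious prompt-injection / abuse attempts outright
    if any(pattern in q for pattern in BLOCKED_PATTERNS):
        return False
    # Keep the bot on professional topics; everything else gets a polite refusal
    return any(hint in q for hint in ON_TOPIC_HINTS)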

Response Enhancement Features:

  • Confidence Scoring: Assess and display response reliability
  • Source Citations: Clickable links to original content for verification
  • Response Summarization: Skills-focused, concise answers
  • Modular Architecture: Split functionality across multiple files for better maintainability
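The confidence score can be derived from the retrieval distances FAISS already returns, roughly like this (the thresholds are invented and would need empirical tuning):

# confidence.py (simplified) — map FAISS distances to a user-facing confidence label
from langchain.vectorstores import FAISS

def confidence_for(vectorstore: FAISS, question: str, k: int = 4) -> str:
    # similarity_search_with_score returns (document, distance); lower means closer
    results = vectorstore.similarity_search_with_score(question, k=k)
    if not results:
        return "low"
    best_distance = min(score for _, score in results)
    if best_distance < 0.3:      # illustrative thresholds
        return "high"
    if best_distance < 0.6:
        return "medium"
    return "low"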

Current Architecture

chatbot/
├── main.py          # FastAPI app + endpoints (259 lines)
├── filters.py       # Content filtering functions
├── sources.py       # URL conversion + clickable links
├── confidence.py    # Response confidence scoring
├── summarization.py # Skills-focused response summarization
├── services.py      # Vectorstore + QA chain management
├── models.py        # Pydantic schemas
└── routes.py        # Additional API routes
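For context, models.py holds Pydantic schemas along these lines; the field names are my guess at the shape, not the actual definitions:

# models.py (sketch) — request/response schemas; field names are hypothetical
from typing import List
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    question: str = Field(min_length=1, max_length=1000)

class Source(BaseModel):
    title: str
    url: str            # clickable link back to the original blog/portfolio page

class ChatResponse(BaseModel):
    answer: str
    confidence: str     # e.g. "high" / "medium" / "low"
    sources: List[Source] = []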

Infrastructure Considerations

Terraform/OpenTofu Caching: Implemented hash-based change detection for efficient deployments while managing the inherent caching complexities of infrastructure-as-code tools.
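The idea behind hash-based change detection, sketched in Python purely to show the concept (OpenTofu handles this natively with functions like filebase64sha256):

# Concept sketch: redeploy only when the source hash changes
import hashlib
from pathlib import Path

def source_hash(directory: str) -> str:
    digest = hashlib.sha256()
    for path in sorted(Path(directory).rglob("*.py")):
        digest.update(path.read_bytes())
    return digest.hexdigest()

# Compare against the hash recorded at the last deploy; skip the build if unchanged
print(source_hash("chatbot"))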

Next Evolution: Planning migration to agentic workflows for more sophisticated conversation handling and multi-step reasoning capabilities.


Production deployment bombed. Took a break and came back to it. CloudWatch logs are empty, and I decided I was too stingy to pay for more logging. Had a chat with Claude and teased out that the problem is probably the REST API URL: it lives in .env (not checked in), so I have to set a Netlify environment variable for the project. Duh.


Performance Optimization Analysis

🎯 Top Space Consumers:

  1. numpy.libs - 26MB (Linear algebra operations, used by FAISS)
  2. zstandard - 21MB (Compression library, used by LangSmith)
  3. numpy - 18MB (Numerical computing, required by FAISS)
  4. sqlalchemy - 14MB (Database ORM, used by LangChain)
  5. faiss - 13MB (Vector similarity search - the main index)
  6. langchain_community - 12MB (Community LangChain integrations)
  7. langchain - 7MB (Core LangChain framework)
  8. aiohttp - 6.5MB (Async HTTP client)
  9. openai - 5.9MB (OpenAI API client)
  10. pydantic_core - 4.5MB (Data validation)
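A breakdown like this can be reproduced by walking the installed packages in the bundle; something along these lines (the site-packages path is a placeholder):

# Size up each top-level package in the Lambda bundle — path is a placeholder
from pathlib import Path

site_packages = Path("build/python/lib/python3.11/site-packages")

def dir_size_mb(path: Path) -> float:
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1_048_576

sizes = {
    pkg.name: dir_size_mb(pkg)
    for pkg in site_packages.iterdir()
    if pkg.is_dir() and not pkg.name.endswith(".dist-info")
}

for name, mb in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name:25s} {mb:6.1f} MB")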

📊 Optimization Potential:

High Impact Removals (if moving to Pinecone):

  • faiss + numpy + numpy.libs = 57MB saved (35% of app code)
  • Keep everything else for LangChain functionality

Medium Impact:

  • zstandard (21MB) - Used by LangSmith logging, could disable
  • sqlalchemy (14MB) - Used by LangChain, hard to remove

Low Impact:

  • Most other dependencies are essential for FastAPI/LangChain

Industry Standard Approaches for Vector Search in Production:

🏭 Enterprise/Large Scale (Standard Practice):

Managed Vector Databases (90% of production deployments):

  • Pinecone - Most popular, fully managed
  • Weaviate - Open source + managed options
  • Qdrant - Growing in popularity
  • OpenSearch - AWS native option
  • pgvector - PostgreSQL extension (trending)

Why this is standard:

  • Separation of concerns - Vector search ≠ application logic
  • Independent scaling - Vector DB scales separately from app
  • Multi-client access - Multiple apps can query same vectors
  • Updates without redeployment - Add/update vectors without touching code

🏢 Cloud-Native Standard Architecture:

Frontend → API Gateway → Lambda/Container → Vector DB (external)
                                          ↓
                                   Document Store (S3)

Not: Frontend → API Gateway → Lambda Container (with embedded vectors)
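To make the contrast concrete, here is roughly what the external-store variant looks like with pgvector (one of the options listed above); the table, column names, and connection string are invented for illustration:

# Sketch: the Lambda queries pgvector over the network instead of a bundled FAISS index
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def top_chunks(query_embedding: list[float], k: int = 4) -> list[str]:
    with psycopg.connect("postgresql://user:pass@vector-db/portfolio") as conn:
        register_vector(conn)  # lets psycopg send numpy arrays as pgvector values
        rows = conn.execute(
            # <=> is pgvector's cosine-distance operator; smaller means more similar
            "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (np.array(query_embedding), k),
        ).fetchall()
    return [content for (content,) in rows]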

📊 Approach Analysis:

Use Case           | Embedded FAISS | External Vector DB
Prototypes/Demos   | ✅ Common       | ❌ Overkill
Personal Projects  | ✅ Fine         | ⚠️ May be expensive
Production Apps    | ❌ Rare         | ✅ Standard
Enterprise         | ❌ Never        | ✅ Always

🎯 My Current Approach vs Industry:

Current setup: Embedded FAISS (like SQLite - file-based, bundled)
Industry standard: External vector DB (like PostgreSQL - service-based)

Ultimately, it comes down to cost and cost liability in the event my blog is discovered and abused heavily.

🚀 Migration Path (Standard Practice):

  1. Start with embedded (where you are) - Fast to prototype
  2. Shave off about 100MB by switching to multi-stage Docker builds
  3. Move to managed (Pinecone/Weaviate) - When you need reliability but adds ~5-20ms latency
  4. Consider self-hosted (Qdrant/Weaviate) - When you need control/cost savings

I looked at offloading the LLM packages to AWS Bedrock and leveraging the LLMs within the AWS ecosystem, but the cost difference of $0.15/1M tokens vs $3/1M tokens really put me off.

I looked at AWS OpenSearch as a vector store, and the pricing also put me off.

💼 Real-World Examples:

  • Notion - Uses Pinecone for semantic search
  • GitHub Copilot - Uses custom vector infrastructure
  • ChatGPT plugins - Most use Pinecone/Weaviate
  • Startups - 80% start with Pinecone, some move to self-hosted

Okay, it is good enough for now.

Summary:

What I have is a single-modal RAG chatbot that can answer questions about me from my blog, runs stupidly cheap, and is secured and deployed in a serverless container in the AWS ecosystem. I tweaked the temps and prompts so it doesn't sound completely boring and mechanical, but it is still a bit robotic. I can live with that for now.

What makes a human-sounding, intelligent chatbot is not just the LLM but the personality, the context, and the countless facets of a person. It makes sense to extend this chatbot in a multi-modal way: add RAG for sample code (GitHub data mining), essays on leadership and management, blog rants, technical decisions/principles/beliefs, and the other things that make me, me.

The next iteration may be multi-modal vector stores that are queried and synthesized into a single response, plus handling long chat conversations by compacting them: summarizing older turns, inferring what the user is interested in, and steering future responses accordingly.
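A rough sketch of that conversation-compaction idea; the prompt wording and turn limits are placeholders, not a design decision:

# Sketch of compacting long conversations: keep recent turns verbatim,
# fold older turns into a running summary that captures the user's interests.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)
KEEP_RECENT = 6  # placeholder: number of turns kept verbatim

def compact(history: list[str], summary: str) -> tuple[list[str], str]:
    if len(history) <= KEEP_RECENT:
        return history, summary
    older, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    prompt = (
        "Update this running summary of a conversation, noting what the user "
        f"seems interested in.\nSummary so far: {summary}\nNew turns:\n" + "\n".join(older)
    )
    new_summary = llm.predict(prompt)
    return recent, new_summary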