Trending keyword analysis on Reddit with generative AI, Python, Svelte, NLP, and N-gram analysis

A web crawler and analytics tool to analyze romantsy/BookTok content from social media platforms (Reddit, TikTok), identifying trending terms, themes, and market opportunities for brand development.

September 11, 2025

⏺ Project Summary: Social Media Market Analysis Tool

🎯 What We’re Building: A comprehensive web crawler and analytics platform that analyzes romance/BookTok content from Reddit to identify trending terms, market opportunities, and brand development insights.

🏗️ Architecture:

  • Frontend: SvelteKit TypeScript web dashboard
  • Backend: Python analysis pipeline with Reddit API integration
  • Data Flow: Frontend settings → Python analysis → PDF reports
  • Stack: SvelteKit + Python + Reddit API + NLP processing

🔧 Current Session Focus: We’ve been perfecting the data persistence and PDF generation system to ensure complete traceability:

  1. Settings Flow: Frontend settings page → API → /data/current_settings.json → Python script
  2. PDF Generation: Analysis results + settings → comprehensive PDF reports
  3. Data Traceability: Every PDF shows exactly what settings produced those results
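The settings handoff described above can be sketched as a small Python helper. This is a minimal illustration, not the project's actual code: the `current_settings.json` path comes from the flow described above, but the `load_settings` function, its defaults, and the field names are hypothetical.

```python
import json
from pathlib import Path

def load_settings(path="data/current_settings.json"):
    """Load the settings the frontend wrote; fall back to defaults
    so the analysis script can still run standalone."""
    # Hypothetical defaults mirroring the dashboard's configurable fields
    settings = {"subreddits": ["RomanceBooks", "BookTok"], "post_limit": 100}
    p = Path(path)
    if p.exists():
        settings.update(json.loads(p.read_text()))
    return settings
```

Because the loaded dictionary is embedded in every PDF report, each report records exactly which settings produced its results.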

📊 Key Features Implemented:

  • Reddit Data Collection: 7+ subreddits, configurable terms, intelligent caching
  • Advanced Text Analysis: NLP processing, sentiment analysis, semantic clustering
  • PDF Reports: Professional reports with settings audit trail
  • Analytics Dashboard: Performance tracking, term frequency analysis
  • Brand Insights: Market opportunities and naming suggestions

The Story

It always starts out simple: I wanted a web crawler to find trending keywords. I decided to start small with TikTok (no public API; they do not let you crawl), Reddit (they have an API, but do not like you to crawl), and Facebook/Instagram (TBD).

I usually start with a generic UI and springboard off that. I am language agnostic these days and will use whatever fits, leaning on AI as a language reference. Each language usually has special guidelines for production deploys that are strongly recommended, but not required for a tool or prototype, and there is also a common set of guidelines that pertains to production deploys in general.

It quickly became evident that the data that came back looked… dead: static, uninteresting, and bland. I realized that purely counting words pulled from a bunch of Reddit posts lacked lexical and semantic meaning, so I started reading into what was required to give the analysis a bit more context. Down the rabbit hole I go (it usually never ends, because there is always more to learn).

After 2 days, this is the report that it generated. After 4 days, this is the report that it generated.

Yes, this is beautiful to me. Console log: run part 1, run part 2.

It is not complete; the analysis configuration section looks terrible, but it is a work in progress.

Tech Stack

Python Stack

  • praw - Reddit API client
  • nltk/spacy - NLP processing
  • pandas - Data manipulation
  • matplotlib/plotly - Visualization

Frontend Stack

  • Framework: Svelte/SvelteKit (lightweight, fast)
  • Styling: TailwindCSS + DaisyUI components
  • Charts: Chart.js / D3.js for data visualization
  • HTTP Client: Fetch API to the Python backend
  • Build: Vite for fast development

Key Components

1. Data Collectors

  • Reddit API Client: Focus on r/RomanceBooks, r/BookTok, r/Fantasy
  • Rate Limiting: Respect API limits and ToS
  • Data Models: Posts, comments, metadata
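The rate-limiting idea above can be illustrated with a tiny call spacer. This is a hedged sketch, not the project's client: the `RateLimiter` class and its budget parameter are hypothetical, and a real client would also honor the rate-limit headers the Reddit API returns.

```python
import time

class RateLimiter:
    """Spaces out API calls to stay under a requests-per-minute budget."""

    def __init__(self, calls_per_minute=60):
        self.min_interval = 60.0 / calls_per_minute
        self.last_call = 0.0

    def wait(self):
        # Sleep just long enough that calls are at least min_interval apart
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()
```

Calling `limiter.wait()` before each API request keeps the collector comfortably inside the ToS limits even when iterating over many subreddits.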

2. Enhanced Text Processing Pipeline

  • Stop Words Filtering: Remove fluff words (“the”, “a”, social media noise)
  • Domain-Specific Terms: Configurable keywords for any industry
  • Advanced N-gram Analysis: Multi-word phrases, collocations, brand-worthy terms
  • Semantic Phrase Extraction: Pattern-based trending phrase detection
  • Sentiment Analysis: Emotional impact scoring
  • Brand Scoring Algorithm: Ranks terms by marketing/SEO potential
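The n-gram step of the pipeline can be sketched with the standard sliding-window trick. This is a minimal illustration under assumptions: the stop-word list is a toy subset, and `extract_ngrams` with its `min_count` threshold is a hypothetical stand-in for the project's analysis engine.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "i", "it", "this"}

def extract_ngrams(texts, n=2, min_count=2):
    """Count n-grams across texts, skipping grams made entirely of stop words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        # Slide an n-wide window over the token list
        for gram in zip(*(tokens[i:] for i in range(n))):
            if all(t in STOP_WORDS for t in gram):
                continue
            counts[" ".join(gram)] += 1
    return {g: c for g, c in counts.items() if c >= min_count}
```

Note that grams like "enemies to" survive the filter because only grams composed entirely of stop words are dropped; that is what lets multi-word tropes bubble up.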

3. Analytics Engine

  • Word Frequency Analysis: Most common terms by subreddit/platform
  • Relationship Graphs: Word associations and co-occurrence
  • Trend Detection: Rising/declining terms over time
  • Market Insights: Product gaps, naming opportunities
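The co-occurrence counting behind the relationship graphs can be sketched in a few lines. This is an illustrative simplification (substring matching against a known vocabulary); the `cooccurrence` function is hypothetical, not the project's implementation.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(posts, vocab):
    """Count how often pairs of vocabulary terms appear in the same post."""
    pairs = Counter()
    for post in posts:
        present = sorted({t for t in vocab if t in post.lower()})
        # Every unordered pair of terms found together counts once per post
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs
```

The resulting pair counts map directly onto edge weights in a word-association graph.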

4. Web UI Dashboard

  • Home: Project overview and quick stats
  • Data Collection: Configure Reddit/TikTok sources
  • Analytics: Interactive charts and word clouds
  • Insights: Market opportunities and brand suggestions
  • N-gram Tester: Interactive testing interface for any domain
  • Settings: API keys, filtering preferences

Data Flow Architecture

Reddit Data → Text Processing Integration

The system uses a coordinated pipeline where reddit_client.py and text_processor.py are integrated through analyze_romantsy.py:

reddit_client.py → pandas DataFrame → text_processor.py → Analysis Results
      ↓                    ↓                   ↓                  ↓
1. Collect Posts    2. Structured Data   3. Text Analysis   4. Market Insights
   - r/RomanceBooks    - title            - Extract keywords    - Word frequencies
   - r/Fantasy         - selftext         - Remove stop words   - Sentiment analysis
   - r/BookTok         - created_datetime - Preserve romantsy   - Co-occurrences
   - Filter BookTok    - score/comments   - Multi-word phrases  - Brand opportunities
   - Add timestamps    - Save to CSV      - Generate word cloud - Visualizations

Data Processing Flow

Input: Raw Reddit posts (JSON from API)

Stage 1: Structured data collection

  • Clean and normalize post data
  • Add datetime formatting (YYYY-MM-DD HH:MM:SS)
  • Flag BookTok-related content
  • Store in pandas DataFrame
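Stage 1 can be sketched as a small normalization step, assuming pandas and Reddit-style fields (`title`, `selftext`, `created_utc`). The `structure_posts` function and the `BOOKTOK_MARKERS` list are hypothetical illustrations of the flagging and datetime formatting described above.

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical markers used to flag BookTok-related content
BOOKTOK_MARKERS = ("booktok", "tiktok made me")

def structure_posts(raw_posts):
    """Normalize raw post dicts into the DataFrame used downstream."""
    df = pd.DataFrame(raw_posts)
    # Reddit timestamps are UTC epoch seconds; format as YYYY-MM-DD HH:MM:SS
    df["created_datetime"] = df["created_utc"].apply(
        lambda ts: datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    )
    text = (df["title"].fillna("") + " " + df["selftext"].fillna("")).str.lower()
    df["is_booktok"] = text.apply(lambda t: any(m in t for m in BOOKTOK_MARKERS))
    return df
```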

Stage 2: Text analysis processing

  • Combine title + selftext for analysis
  • Apply configurable stop word filtering
  • Preserve romantsy-specific keywords
  • Extract multi-word phrases (“enemies to lovers”)
  • Calculate sentiment scores
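The "preserve romantsy-specific keywords" step hinges on protecting multi-word phrases before tokenization, which can be sketched with an underscore-joining trick. The phrase list here is a sample, and `protect_phrases`/`tokenize` are hypothetical names.

```python
import re

# Sample phrase list; the real pipeline makes this configurable
DOMAIN_PHRASES = ["enemies to lovers", "grumpy sunshine", "book boyfriend", "morally gray"]

def protect_phrases(text):
    """Join known multi-word phrases with underscores so tokenization keeps them whole."""
    out = text.lower()
    for phrase in DOMAIN_PHRASES:
        out = out.replace(phrase, phrase.replace(" ", "_"))
    return out

def tokenize(text, stop_words=frozenset({"the", "a", "and", "was", "so"})):
    tokens = re.findall(r"[a-z_']+", protect_phrases(text))
    return [t for t in tokens if t not in stop_words]
```

Without the protection step, "enemies to lovers" would be shredded into three tokens and "to" would vanish with the stop words.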

Stage 3: Market insights generation

  • Word frequency analysis by subreddit
  • Co-occurrence mapping for relationship graphs
  • Trend detection and sentiment analysis
  • Brand naming opportunity identification

Output: Comprehensive analysis results

  • CSV files with raw data and timestamps
  • JSON files with analysis results and insights
  • PNG visualizations (charts and word clouds)
  • Market research recommendations

Target Analysis Areas

  • Genre Keywords: romantsy, BookTok, spicy reads
  • Character Archetypes: alpha, morally gray, book boyfriend
  • Tropes: enemies to lovers, grumpy sunshine, one bed
  • Emotional Descriptors: swoon, angst, steamy
  • Community Language: What resonates with readers

Enhanced Semantic Analysis Implementation Strategy

Phase 1: Enhanced N-gram Analysis

  • Enhanced Text Preprocessing Pipeline: Domain-agnostic preprocessing with configurable stop words
  • N-gram Analysis Engine: Bigram/trigram extraction with frequency filtering
  • Pattern-Based Phrase Detection: Regex patterns for trending phrase identification
  • Brand-Worthy Term Scoring: Algorithmic scoring for marketing potential
  • Frontend Testing Interface: Interactive domain testing with sample text loading
  • API Integration: Python backend processing with JSON API endpoints
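The pattern-based phrase detection bullet above can be illustrated with a few regex templates. The specific patterns and the `detect_phrases` function are hypothetical examples of the approach, not the project's actual rule set.

```python
import re
from collections import Counter

# Hypothetical trend templates: "X to Y", "X-coded", "my X era"
TREND_PATTERNS = [
    r"\b(\w+) to (\w+)\b",      # e.g. "enemies to lovers"
    r"\b(\w+)[- ]coded\b",      # e.g. "villain-coded"
    r"\bmy (\w+) era\b",        # e.g. "my romantsy era"
]

def detect_phrases(texts):
    """Count occurrences of phrase-shaped patterns across a batch of texts."""
    hits = Counter()
    for text in texts:
        for pattern in TREND_PATTERNS:
            for m in re.finditer(pattern, text.lower()):
                hits[m.group(0)] += 1
    return hits
```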

Phase 2: Advanced Semantic Analysis

1. Enhanced Text Preprocessing Pipeline

  • Preserve important multi-word phrases before tokenization
  • Context-aware stop word removal (keep domain-specific terms)
  • Lemmatization with domain-specific exceptions
  • Domain-specific abbreviation handling (MMC, FMC, HEA, etc.)
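Abbreviation handling is essentially a whole-word substitution table. The table below covers the abbreviations named above plus one extra; the `expand_abbreviations` function is a hypothetical sketch.

```python
import re

# Expansion table for common romance/BookTok abbreviations (sample entries)
ABBREVIATIONS = {
    "mmc": "male main character",
    "fmc": "female main character",
    "hea": "happily ever after",
    "tbr": "to be read",
}

def expand_abbreviations(text):
    """Replace whole-word abbreviations so downstream NLP sees full terms."""
    def sub(match):
        return ABBREVIATIONS.get(match.group(0).lower(), match.group(0))
    return re.sub(r"\b[A-Za-z]{2,4}\b", sub, text)
```

Matching whole words only (via `\b` anchors) avoids mangling words that merely contain an abbreviation, like "wheat" containing "hea".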

2. Named Entity Recognition

  • Custom NER with spaCy integration and fallback patterns
  • Extract: character archetypes, trope names, genres
  • Romance/BookTok specific entity classification
  • Pattern-based entity extraction for domain terms
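The fallback-pattern side of the NER bullet (used when a full spaCy model is not available or misses domain terms) can be sketched as a regex taxonomy. The labels and term lists below are illustrative samples, and `extract_entities` is a hypothetical name.

```python
import re

# Hypothetical fallback taxonomy for domain entity classification
ENTITY_PATTERNS = {
    "archetype": r"\b(alpha|morally gray|book boyfriend|cinnamon roll)\b",
    "trope": r"\b(enemies to lovers|grumpy sunshine|one bed|fake dating)\b",
    "genre": r"\b(romantsy|dark romance|cozy fantasy)\b",
}

def extract_entities(text):
    """Classify known domain terms found in text by entity label."""
    found = {}
    low = text.lower()
    for label, pattern in ENTITY_PATTERNS.items():
        matches = re.findall(pattern, low)
        if matches:
            found[label] = sorted(set(matches))
    return found
```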

3. Semantic Clustering

  • Group synonymous terms/phrases automatically using K-means
  • Identify emerging vs established trends through clustering
  • Map brand-worthy term families and relationships
  • Content theme identification for marketing strategy

4. Enhanced TF-IDF Analysis

  • Weight terms by domain-specificity scoring with bonuses
  • Identify unique-to-community language patterns
  • Enhanced “brandability” scoring algorithms
  • Separate domain-specific vs general high-value terms
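The domain-weighted TF-IDF idea above can be shown with a small pure-Python scorer. This is a minimal sketch: the smoothing (`+ 1.0` on the IDF), the `bonus` multiplier, and whitespace tokenization are simplifying assumptions, not the project's formula.

```python
import math
from collections import Counter

def tfidf_with_domain_bonus(docs, domain_terms, bonus=1.5):
    """Score terms per document by TF-IDF, boosting domain-specific vocabulary."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()  # document frequency: in how many docs a term appears
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        doc_scores = {}
        for term, count in tf.items():
            idf = math.log(n / df[term]) + 1.0
            score = (count / len(tokens)) * idf
            if term in domain_terms:
                score *= bonus  # boost unique-to-community language
            doc_scores[term] = score
        scores.append(doc_scores)
    return scores
```

Terms that are both frequent within a post and rare across posts score highest, and the bonus separates domain-specific candidates from generic high-TF-IDF noise.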

5. Integrated Market Insights Generation

  • Multi-source trending term aggregation
  • Enhanced brand naming suggestions from semantic analysis
  • Market entity analysis (tropes, archetypes, genres)
  • Content theme strategy recommendations
  • Emerging trend and opportunity identification

Conclusion

Every semantic analysis implementation probably (I did not measure it) adds about 5-10% accuracy to the analysis. N-gram analysis is significant (much more than 5-10%), but it assumes that each Reddit poster uses the same manner of speech the algorithm was trained and tested on (likely proper English, not Gen Z slang, etc.). It reminds me, as a programmer, how much we build on top of these foundational, commonly used algorithms and the choices their authors made.

I can overthink and over-analyze, and require more context than most, but I am also painfully aware that subtle changes in context are everything; they can mean the difference between entirely different logic branches.

Just as I tweaked the creativity levels of my chatbot so it sounds more natural and less robotic, I wonder, as we marvel at the new capabilities of each newly released LLM, how much 'tweaking' was done so the models perform convincingly: positioning the company for the next round of VC funding and keeping pressure at an acceptable level by appropriately showing growth in shareholder value.