Okay, I have given up waiting for the verification email from the U of Bern in Switzerland, which owns the OCR handwriting dataset.
Googling more (the internet is so noisy these days), I found it on Kaggle. https://www.kaggle.com/datasets/naderabdalghani/iam-handwritten-forms-dataset
pip install kaggle
Doh. So obvious in hindsight. Everything is.
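For my own notes, here is a rough sketch of how pulling the dataset down with the kaggle command-line tool could look from Python. It assumes the kaggle package is installed and that an API token is already set up (a kaggle.json under ~/.kaggle, or the KAGGLE_USERNAME and KAGGLE_KEY environment variables). The helper names dataset_slug, download_command and download, and the "data" folder, are my own inventions for illustration, not part of the kaggle package.

```python
import subprocess
from urllib.parse import urlparse

DATASET_URL = (
    "https://www.kaggle.com/datasets/"
    "naderabdalghani/iam-handwritten-forms-dataset"
)


def dataset_slug(url):
    """Extract the 'owner/dataset-name' slug from a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def download_command(url, dest="data"):
    """Build the kaggle CLI call as an argument list (hypothetical helper)."""
    return [
        "kaggle", "datasets", "download",
        "-d", dataset_slug(url),  # which dataset to fetch
        "-p", dest,               # folder to download into (assumed name)
        "--unzip",                # unpack the archive after downloading
    ]


def download(url, dest="data"):
    """Run the download; needs the kaggle CLI and valid credentials."""
    subprocess.run(download_command(url, dest), check=True)


# Not called automatically, since it needs credentials and network access:
# download(DATASET_URL)
```

Going through the CLI rather than importing the package keeps the sketch importable on a machine where the token is not set up yet.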
There are free tutorials on Kaggle. I had refrained from delving into them because learning works best when you have a use case as a goal. I have used pandas and roughly understand what scikit-learn is for, but this is the best time to delve into both of them.
Finished the ML Learning tutorial. Getting started on the pandas one.
Okay, the plan is this:
- check the kaggle OCR handwriting dataset
- select a pretrained model for vectorization; looking at a Llama, Mistral, or GPT model that is currently free
- select a platform: huggingface spaces, runpod (have used), together.ai
- set up a vector database: pinecone or weaviate or milvus
- implement the RAG pipeline
- regression test and train
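To make the middle of that plan concrete, here is a toy sketch of the retrieval half of a RAG pipeline: embed text, store the vectors, find the nearest ones for a question, and paste them into a prompt. Everything in it is a stand-in. The bag-of-words embed function takes the place of a real pretrained embedding model, InMemoryVectorStore takes the place of Pinecone, Weaviate or Milvus, the sample lines are made up rather than taken from the dataset, and the final call to a language model is left out entirely.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real pipeline would call a model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text):
        self._items.append((text, embed(text)))

    def search(self, query, k=2):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(
            self._items, key=lambda item: cosine(q, item[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


def build_prompt(question, store, k=2):
    """Assemble retrieved context and the question into one prompt string."""
    context = "\n".join(f"- {chunk}" for chunk in store.search(question, k))
    return (
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    )


# Made-up lines standing in for OCR output from handwritten forms.
store = InMemoryVectorStore()
for line in [
    "meeting moved to friday at noon",
    "invoice total: forty two dollars",
    "buy milk and eggs on the way home",
]:
    store.add(line)

print(build_prompt("meeting on friday", store, k=1))
```

Swapping embed for a real model and InMemoryVectorStore for a hosted database should leave build_prompt unchanged, which is the main reason to keep the three pieces separate.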
I should make life easy and take the shortcut route I found - train on OpenAI, train on Azure, click click done.