Blog Post

An artistic depiction of retrieval-augmented generation

Day 1

The week began with a day off due to a religious celebration, providing a brief respite before diving back into the project.

Day 2

On the second day, I explored embeddings and vector databases and created a notebook for vector database operations. I also researched how to integrate these databases with Retrieval-Augmented Generation (RAG) systems, setting the stage for more advanced functionality in our project.
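To make the idea behind a vector database concrete, here is a minimal sketch of similarity search over embeddings in plain Python. It is an illustration of the concept rather than the contents of my notebook: the three-dimensional vectors and the names `store`, `cosine_similarity`, and `nearest` are made up for the example, and a real vector database uses an index instead of scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def nearest(query, store, k=1):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
store = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 0.8, 0.6],
    "paper_c": [0.7, 0.3, 0.1],
}

print(nearest([1.0, 0.0, 0.0], store, k=2))  # ['paper_a', 'paper_c']
```

In a RAG system the query vector would come from embedding the user's question, and the returned ids would point to the text passed to the language model as context.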

Day 3

The third day focused on integrating the OCR extraction and the MySQL database with ChromaDB. I researched the available ChromaDB clients and chose the Persistent Client for development. My initial approach, uploading extracted text to MySQL and ChromaDB at the same time, caused memory issues on my local machine. To address this, I implemented a system inspired by real-life libraries: querying a ChromaDB collection of paper titles first, much like consulting a catalog, then using the results to query a new collection generated from text chunks stored in MySQL. This chunk collection was removed once the results were returned, which solved the memory issue.
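The catalog-then-chunks flow can be sketched roughly as follows. This is a simplified outline under assumptions, not the project's actual code: the function name `two_stage_query`, the catalog name `paper_titles`, the `title` metadata key, and the `fetch_chunks` callable (a stand-in for the MySQL lookup) are all hypothetical. The `client` argument is expected to be a ChromaDB client; only calls I know ChromaDB provides are used (`get_or_create_collection`, `create_collection`, `delete_collection`, `count`, `add`, `query`).

```python
import uuid


def two_stage_query(client, question, fetch_chunks, n_titles=3, n_chunks=3,
                    catalog_name="paper_titles"):
    """Query the title catalog, then a temporary collection of chunks.

    client       -- a ChromaDB client (assumed, e.g. a PersistentClient)
    fetch_chunks -- callable(title) -> list of str; stands in for MySQL
    Returns a list of (chunk_text, metadata) pairs.
    """
    catalog = client.get_or_create_collection(name=catalog_name)
    available = catalog.count()
    if available == 0:
        return []

    # Stage 1: find the most relevant paper titles.
    hits = catalog.query(query_texts=[question],
                         n_results=min(n_titles, available))
    titles = hits["documents"][0]  # results for the first (only) query text

    # Collect the chunks of the matched papers only.
    ids, docs, metas = [], [], []
    for t_idx, title in enumerate(titles):
        for c_idx, chunk in enumerate(fetch_chunks(title)):
            ids.append(f"{t_idx}-{c_idx}")
            docs.append(chunk)
            metas.append({"title": title})
    if not docs:
        return []

    # Stage 2: build a short-lived chunk collection and query it.
    temp_name = f"chunks-{uuid.uuid4().hex}"
    chunks = client.create_collection(name=temp_name)
    try:
        chunks.add(ids=ids, documents=docs, metadatas=metas)
        result = chunks.query(query_texts=[question],
                              n_results=min(n_chunks, len(docs)))
        return list(zip(result["documents"][0], result["metadatas"][0]))
    finally:
        # Always drop the temporary collection so it does not pile up.
        client.delete_collection(name=temp_name)
```

With the real library, `client` could be created with `chromadb.PersistentClient(path=...)` using whatever storage path the project has, and `fetch_chunks` would wrap a query against the chunk table in MySQL. Deleting the temporary collection in a `finally` block means it is removed even if the query fails, so only the small titles collection stays around between requests.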

Day 4

On the fourth day, I uploaded the ChromaDB collection of paper titles to the group's Google Drive and refactored the Tesseract-to-MySQL and ChromaDB integration code, splitting the pipeline into more modular components to improve extensibility. I also sanitized the extraction output file names to address an edge case and worked on the research report.
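File name sanitization of this kind might look like the sketch below. The specific edge case is not spelled out here, so this is a generic version under my own assumptions: the function name `sanitize_filename`, the allowed character set, the `_` replacement, and the `untitled` fallback are illustrative choices, not the project's exact rules.

```python
import re
import unicodedata


def sanitize_filename(name, replacement="_", max_length=100):
    """Return a file-system-friendly version of `name`."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse each run of disallowed characters into one replacement.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", replacement, ascii_name)
    # Cap the length, then trim separators and dots from both ends.
    cleaned = cleaned[:max_length].strip("._-" + replacement)
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


print(sanitize_filename("Deep Learning: A Review?.pdf"))
```

Stripping leading dots also keeps a paper title from producing a hidden file or a path such as `..`, which is worth having when titles come straight from OCR output.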

Day 5

The final day of the week was spent preparing for the RAG API implementation. I prepared the OCR-to-database pipelines and added a script for listing the documents in a ChromaDB collection. Exploring MySQL Workbench let me verify the chunk data manually. I set the number of ChromaDB catalog query results to three so that the chatbot has multiple sources for context. In anticipation of the RAG API, I installed Django and its REST framework and set up a basic RESTful API to serve the RAG system output, after researching how the two work together to ensure a smooth implementation.
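A script for listing the documents in a ChromaDB collection can be quite small. The sketch below is one way to do it, not necessarily how mine is written: the function name `list_documents` and the page size are assumptions, while `count()` and `get()` with `limit`, `offset`, and `include` are existing ChromaDB collection methods. Reading in pages keeps memory use flat for large collections.

```python
def list_documents(collection, page_size=100):
    """Yield (id, document, metadata) for every entry in a collection.

    `collection` is assumed to be a ChromaDB collection object.
    """
    total = collection.count()
    for offset in range(0, total, page_size):
        # Fetch one page at a time instead of loading everything at once.
        page = collection.get(
            limit=page_size,
            offset=offset,
            include=["documents", "metadatas"],
        )
        rows = zip(page["ids"], page["documents"], page["metadatas"])
        for doc_id, document, metadata in rows:
            yield doc_id, document, metadata
```

Given a collection obtained from a client, for example with `client.get_collection(name=...)`, the script can loop over `list_documents(collection)` and print each id next to the start of its document, which is handy for spot checks alongside MySQL Workbench.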