Day 1

The week began with configuring networking between the API and MySQL database containers. I researched Docker networking under Docker Compose and experimented with different approaches. The key step was assigning a hostname to the MySQL container in the Docker Compose YAML file and updating the API codebase to connect through it. After some trial and error, the connection between the API and MySQL containers tested successfully.
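A minimal sketch of the resulting API-side configuration, assuming the MySQL service is declared as `mysql` in the Compose file (the service name, credentials, and database name below are illustrative, not the actual values):

```python
# settings.py (excerpt) -- inside the Compose network, the API container
# resolves the MySQL container by its Compose service name / hostname.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "api_db",           # illustrative database name
        "USER": "api_user",         # illustrative credentials
        "PASSWORD": "api_password",
        "HOST": "mysql",            # must match the Compose service/hostname
        "PORT": "3306",
    }
}
```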

Day 2

On the second day, I ported the local setup to another workstation and reintegrated ChromaDB into the container network. I then adjusted the API's query interface, adding a maximum-results parameter to the request URL format and configuring the endpoint to return entire result dictionaries instead of only lists of documents. Additionally, I pushed the DB and API images to Docker Hub, learning the registry's tagging and pushing workflow in the process.
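A sketch of what that endpoint change might look like in Django REST Framework, assuming a recent chromadb client; the view name, query parameters, and collection name are hypothetical:

```python
import chromadb
from rest_framework.response import Response
from rest_framework.views import APIView

# Hypothetical client/collection setup; "chromadb" stands in for the
# hostname the ChromaDB container would have inside the Compose network.
client = chromadb.HttpClient(host="chromadb", port=8000)
collection = client.get_or_create_collection("documents")

class QueryView(APIView):
    def get(self, request):
        query = request.query_params.get("q", "")
        # Maximum-results parameter taken from the request URL,
        # e.g. /query/?q=...&max_results=5
        max_results = int(request.query_params.get("max_results", 10))
        results = collection.query(query_texts=[query], n_results=max_results)
        # Return the whole result dictionary (ids, documents, distances,
        # metadatas) instead of just results["documents"].
        return Response(results)
```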

Day 3

The third day involved helping pull the Docker Hub images onto another workstation and testing the Docker Compose setup there. Implementing file upload in the Django RESTful API required researching the Django REST Framework and coding and debugging the upload entrypoint. I also researched ways to make the entrypoint seamless to reintegrate with the Tesseract UI component.
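A minimal sketch of such an upload entrypoint with the Django REST Framework; the view and field names are illustrative:

```python
from rest_framework import status
from rest_framework.parsers import FormParser, MultiPartParser
from rest_framework.response import Response
from rest_framework.views import APIView

class UploadView(APIView):
    # Accept multipart form data so clients can POST files directly.
    parser_classes = [MultiPartParser, FormParser]

    def post(self, request):
        uploaded = request.FILES.get("file")
        if uploaded is None:
            return Response({"detail": "No file provided."},
                            status=status.HTTP_400_BAD_REQUEST)
        # Hand the file off to the processing pipeline here
        # (OCR, chunking, DB upload, and so on).
        return Response({"filename": uploaded.name},
                        status=status.HTTP_201_CREATED)
```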

Day 4

On the fourth day, I integrated chunking and uploading to the MySQL database into the API entrypoint. Referencing the old Tesseract-to-DB pipeline, I copied the pertinent components into the new API backend, and the integration tests passed. Addressing memory issues involved researching LangChain's wrapper for ChromaDB and reading several related Stack Overflow posts. As a temporary workaround, I capped the number of chunks saved to the temp_collection, with 95 chunks proving to be the practical limit. Additionally, I accounted for Django's automatic sanitization of file names, fixing errors caused by spaces in uploaded file names being converted to underscores.
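The workaround amounts to truncating the chunk list before insertion. The actual code went through LangChain's Chroma wrapper; this sketch shows the same idea against the raw chromadb collection API, with hypothetical names:

```python
MAX_CHUNKS = 95  # empirically, the largest batch that avoided the memory issue

def save_chunks(collection, chunks):
    """Add at most MAX_CHUNKS chunks to the temporary collection."""
    capped = chunks[:MAX_CHUNKS]
    collection.add(
        documents=capped,
        ids=[f"chunk-{i}" for i in range(len(capped))],
    )
    return len(capped)  # number of chunks actually saved
```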

Day 5

The final day of the week was dedicated to feasibility testing of Pinecone DB's Python integration, using the Universal Sentence Encoder from TensorFlow Hub as the embedding model. Testing the setup on my local machine yielded positive results. I also ran feasibility tests on Postgres' Python integration, pulling the latest Postgres Docker image, creating a container, and connecting with the psycopg2 module. Finally, I wrote unit tests for the database-related functions and rebuilt the utilities that load text from text files and chunk it by paragraph, writing the new components from scratch and testing their functionality.
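The Postgres feasibility check was essentially a connect-and-query smoke test; a sketch, with illustrative credentials for a locally running container from the official image:

```python
import psycopg2

# Connect to the local Postgres container; host, port, and credentials
# here are illustrative, not the actual values used.
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="example",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])  # a successful round trip prints the server version
conn.close()
```

And a sketch of the rebuilt loading/chunking utilities with a unit test, assuming paragraphs are separated by blank lines (the function names are hypothetical):

```python
import unittest

def load_text(path):
    """Read a UTF-8 text file and return its contents."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def chunk_by_paragraph(text):
    """Split text on blank lines, dropping empty chunks."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

class ChunkByParagraphTests(unittest.TestCase):
    def test_splits_on_blank_lines(self):
        text = "First paragraph.\n\nSecond paragraph.\n\n\nThird."
        self.assertEqual(
            chunk_by_paragraph(text),
            ["First paragraph.", "Second paragraph.", "Third."],
        )

if __name__ == "__main__":
    unittest.main()
```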