[Image: people discussing and working on laptops around an elephant, the Postgres mascot]
Day 1

The week began with coding and testing database management components: adding tables to the database and uploading documents with associated metadata into a local Postgres database. On the text management side, I wrote a component that generates hashes of strings to serve as unique primary keys. Combining these components into a working demo, I successfully uploaded a .txt file of ‘The War of the Worlds’ to the local Postgres database, then adjusted the unit tests to handle cases the demo revealed.
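
For context, a minimal sketch of what such a hash-based ID component might look like, assuming Python and SHA-256; the `generate_doc_id` name is illustrative, not the actual project code:

```python
import hashlib

def generate_doc_id(text: str) -> str:
    """Derive a deterministic primary key by hashing a string with SHA-256."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# The same input always maps to the same key, so a re-uploaded
# document keeps its identity in the database.
print(generate_doc_id("The War of the Worlds"))
```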

Day 2

The second day focused on cleaning up and improving existing components. I created a new component to generate random strings and added corresponding unit tests. I then improved hash ID generation for chunks by factoring a random string into the hash, adjusting the unit tests accordingly. While researching how to operate a Postgres database inside a Docker container, I debugged the connection and worked out the configuration details that allow access to the containerized database.
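
The motivation for the random string is that two chunks with identical text would otherwise hash to the same primary key. A rough sketch of the idea, again in Python; `random_string` and `generate_chunk_id` are illustrative names:

```python
import hashlib
import secrets
import string

def random_string(length: int = 8) -> str:
    """Generate a cryptographically secure random alphanumeric string."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

def generate_chunk_id(chunk_text: str) -> str:
    """Hash the chunk text together with a random salt so that
    identical chunks still receive distinct primary keys."""
    salted = chunk_text + random_string()
    return hashlib.sha256(salted.encode("utf-8")).hexdigest()
```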

Day 3

On the third day, I added the demo and maintenance SQL queries file to the repository and set up the Postgres container for the demo. After clarifying priorities with the CEO, it became clear that pgvector had caught their interest, and that direction guided the rest of the week’s tasks.
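
As a rough illustration of the setup, here is how the demo might connect to the containerized Postgres and run a maintenance-style query; the container name, credentials, and query are assumptions for the sketch, not the project’s actual configuration:

```python
# Assumes a Postgres container started roughly like:
#   docker run -d --name demo-db -p 5432:5432 -e POSTGRES_PASSWORD=demo postgres
import psycopg2

conn = psycopg2.connect(
    host="localhost",  # port 5432 is published by the container
    port=5432,
    dbname="postgres",
    user="postgres",
    password="demo",
)
with conn, conn.cursor() as cur:
    # Maintenance-style query: approximate live row counts per user table.
    cur.execute(
        """
        SELECT relname, n_live_tup
        FROM pg_stat_user_tables
        ORDER BY n_live_tup DESC;
        """
    )
    for table, rows in cur.fetchall():
        print(f"{table}: {rows} live rows")
conn.close()
```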

Day 4

The fourth day involved creating a new Docker image of the database that automatically loads its contents from a given Postgres dump file. I added a new filename column to the Postgres knowledge database and tested the integration of the database Docker container with the LLM side of the project. Research into integrating the Universal Sentence Encoder with pgvector led to pulling the latest pgvector image, running a container, and testing vector loading in a test database.
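
(The auto-loading image most likely relies on the official postgres image’s convention of executing .sql, .sql.gz, or .sh files placed in /docker-entrypoint-initdb.d on first startup, though the exact Dockerfile isn’t shown here.) The vector-loading test might look roughly like the following, assuming the pgvector Python package and illustrative credentials; the 512-dimension column matches the Universal Sentence Encoder’s output size:

```python
# Assumes the pgvector/pgvector container is running and the Python
# bindings are installed: pip install pgvector psycopg2-binary numpy
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(host="localhost", dbname="postgres",
                        user="postgres", password="demo")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(conn)  # adapt numpy arrays to/from the vector type

with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS vec_test "
        "(id serial PRIMARY KEY, embedding vector(512));"
    )
    cur.execute("INSERT INTO vec_test (embedding) VALUES (%s);",
                (np.random.rand(512).astype(np.float32),))
    # Nearest-neighbour lookup by L2 distance (the <-> operator).
    cur.execute("SELECT id FROM vec_test ORDER BY embedding <-> %s LIMIT 5;",
                (np.random.rand(512).astype(np.float32),))
    print(cur.fetchall())
conn.close()
```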

Day 5

The final day of the week was dedicated to further research on and integration of pgvector. I added a vector column to the chunk table schema and extended the unit tests to cover the new column. As a first attempt at using the Universal Sentence Encoder as an embedding model, I began integrating it with the OCR system, laying the groundwork for future development.
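
Generating embeddings with the Universal Sentence Encoder is straightforward via TensorFlow Hub; a minimal sketch, with the sample text and variable names purely illustrative:

```python
# Assumes tensorflow and tensorflow_hub are installed.
import tensorflow_hub as hub

# Public Universal Sentence Encoder model on TensorFlow Hub.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

chunks = ["An example chunk of OCR'd document text."]
vectors = embed(chunks).numpy()  # shape: (len(chunks), 512)

# Each row can then be written into the chunk table's vector(512)
# column using the psycopg2 + register_vector setup sketched earlier.
print(vectors.shape)
```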