[Video: Architecture of Multimodal Information Retrieval Tool](https://www.loom.com/embed/d86866bdfef24a91932369d438edf4de) (5:45)

In this video, I provide a quick walkthrough of the architecture of our multimodal information retrieval tool. I explain how we parse unstructured text documents and leverage a hierarchical document parsing utility. I also discuss our parsing utility that uses LayoutParser and Detectron2 for object detection. Additionally, I explain how we parse images and generate text descriptions using the OpenAI API. Lastly, I touch on our use of the DSPy framework and how we interface with the backend and frontend of our application.
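
As a rough illustration of the LayoutParser + Detectron2 step, the sketch below loads a Detectron2-backed layout model and detects blocks on a rasterized page. The PubLayNet-trained model, score threshold, and file name are assumptions for the example, not details confirmed in the video.

```python
# A minimal sketch of document layout detection with LayoutParser + Detectron2.
import layoutparser as lp
from PIL import Image

# Detectron2-backed model from LayoutParser's model zoo
# (requires both layoutparser and detectron2 to be installed).
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",   # assumed model choice
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

page = Image.open("page.png")   # one rasterized page of a source document
layout = model.detect(page)     # lp.Layout of detected bounding-box blocks

# Route figure regions to the image-description step and keep text blocks
# for the hierarchical text parser.
figures = [b for b in layout if b.type == "Figure"]
text_blocks = [b for b in layout if b.type in ("Text", "Title", "List")]
```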
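
The image-to-text step could look roughly like the following, assuming a vision-capable chat model called through the official `openai` Python SDK; the actual model and prompt used in the tool are not stated in the video description.

```python
# A minimal sketch of generating a text description for an extracted image
# via the OpenAI API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(path: str) -> str:
    """Return a natural-language description of the image for indexing."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this figure so it can be indexed for retrieval."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```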
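
Finally, a hypothetical sketch of how a DSPy module might sit over the retrieved multimodal context; the actual signatures and modules used in the project are not shown in the video description.

```python
# A hypothetical DSPy signature and module over retrieved context.
import dspy

class AnswerWithContext(dspy.Signature):
    """Answer a question using retrieved text and image descriptions."""
    context = dspy.InputField(desc="retrieved passages and image captions")
    question = dspy.InputField()
    answer = dspy.OutputField()

qa = dspy.ChainOfThought(AnswerWithContext)
# Example call the backend might expose to the frontend:
# prediction = qa(context=retrieved_context, question=user_question)
# print(prediction.answer)
```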