[Video: Architecture of Multimodal Information Retrieval Tool](https://www.loom.com/embed/d86866bdfef24a91932369d438edf4de) (5:45)

In this video, I provide a quick walkthrough of the architecture of our multimodal information retrieval tool. I explain how we parse unstructured text documents and leverage a hierarchical document parsing utility. I also discuss our parsing utility that uses LayoutParser and Detectron2 for object detection. Additionally, I explain how we parse images and generate text descriptions using the OpenAI API. Lastly, I touch on our use of the DSPy framework and how we interface with the backend and frontend of our application.
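
As a rough illustration of the LayoutParser + Detectron2 step, the sketch below loads a Detectron2-backed layout model and detects blocks on a rasterized page. The PubLayNet-trained model, score threshold, and file name are assumptions for the example, not details confirmed in the video.

```python
# A minimal sketch of document layout detection with LayoutParser + Detectron2.
import layoutparser as lp
from PIL import Image

# Detectron2-backed model from LayoutParser's model zoo
# (requires both layoutparser and detectron2 to be installed).
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",   # assumed model choice
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

page = Image.open("page.png")   # one rasterized page of a source document
layout = model.detect(page)     # lp.Layout of detected bounding-box blocks

# Route figure regions to the image-description step and keep text blocks
# for the hierarchical text parser.
figures = [b for b in layout if b.type == "Figure"]
text_blocks = [b for b in layout if b.type in ("Text", "Title", "List")]
```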
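
The image-to-text step could look roughly like the following, assuming a vision-capable chat model called through the official `openai` Python SDK; the actual model and prompt used in the tool are not stated in the video description.

```python
# A minimal sketch of generating a text description for an extracted image
# via the OpenAI API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(path: str) -> str:
    """Return a natural-language description of the image for indexing."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this figure so it can be indexed for retrieval."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```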
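
Finally, a hypothetical sketch of how a DSPy module might sit over the retrieved multimodal context; the actual signatures and modules used in the project are not shown in the video description.

```python
# A hypothetical DSPy signature and module over retrieved context.
import dspy

class AnswerWithContext(dspy.Signature):
    """Answer a question using retrieved text and image descriptions."""
    context = dspy.InputField(desc="retrieved passages and image captions")
    question = dspy.InputField()
    answer = dspy.OutputField()

qa = dspy.ChainOfThought(AnswerWithContext)
# Example call the backend might expose to the frontend:
# prediction = qa(context=retrieved_context, question=user_question)
# print(prediction.answer)
```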