Hi everyone! I recently open-sourced a relatively large project called "The Pipe," and I hope it can help anyone here trying to work with or learn about multimodal AI.
What it is:
The Pipe is a tool for feeding visually complex files (PDF, DOCX, PPTX, etc.) and web pages into vision-language models such as GPT-4. It is written entirely in Python, so hopefully this is the right place for anyone who wants to try it out or learn from the source code.
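To make "feeding files into a vision-language model" concrete, here is a minimal sketch of what an LLM-ready multimodal prompt looks like: extracted text plus a base64-encoded page image packaged into one OpenAI-style chat message. The helper function and inputs are hypothetical illustrations, not The Pipe's actual API; only the message shape follows OpenAI's vision chat format.

```python
import base64

def build_vision_message(text: str, image_bytes: bytes) -> dict:
    """Package extracted text and one page image into a single
    OpenAI-style multimodal chat message (hypothetical helper)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }

# Stand-in inputs for what a scraper would extract from a PDF page.
message = build_vision_message("Quarterly report, page 1", b"\x89PNG...")
print(message["content"][0]["type"])  # text
print(message["content"][1]["type"])  # image_url
```

A list of such messages can then be passed straight to a chat-completions endpoint, which is what "LLM-ready prompt format" means in practice.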
Why it exists:
I tried to build an application to chat with my documents and web pages. Sounds simple, right? Boy, was I wrong. I struggled for months (yes, MONTHS) building absurdly complex custom scrapers (for PDFs, PowerPoints, Word docs, websites, CSVs, git repos, slides, etc.), since traditional scrapers wouldn't give GPT high-quality text and visual data in an LLM-ready prompt format.
I have also seen an explosion of "Chat with your X" apps on this sub lately that use GPT on the backend, so I hope this helps those of you building similar things.
What it does not do:
It does not give you free GPT-4 usage. You must use your own GPT-4 API key.
Thank you! You're spot on about the reason PyTorch is a dependency. Also, if you want to scrape text only, you can use the text_only parameter ;)
Hi biglewbowskii, yes, you can use The Pipe with other LLMs via a lightweight library aptly named "LiteLLM". There are more details in the README :)
Post Details
- Posted: 9 months ago
- External URL: github.com/emcf/thepipe
Good question! I would recommend reading the Getting Started section of the README. It contains everything you need to start feeding whatever you want into GPT Vision.
If you're up for something more advanced, you can check out this guide to building a multimodal RAG system (a.k.a. a really smart "chat with your documents" app).
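For the curious, the core retrieval step of a RAG system can be sketched in a few lines. This toy version scores chunks by word overlap with the query; a real system would use vector embeddings and an index instead, and all names and sample chunks here are illustrative.

```python
def score(query: str, chunk: str) -> float:
    """Crude relevance score: fraction of query words found in the chunk.
    A real RAG pipeline would compare embedding vectors instead."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

# Hypothetical chunks extracted from a scraped document.
chunks = [
    "Revenue grew 12% year over year.",
    "The chart on page 3 shows headcount by region.",
    "Appendix: glossary of terms.",
]
best = retrieve("what does the chart on page 3 show", chunks)
print(best[0])  # the headcount chunk scores highest
```

The retrieved chunks (text plus any page images) are then packed into the prompt alongside the user's question, which is the "augmented generation" half of RAG.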