Ok so OpenAI just announced GPT-4o, a new model that can reason across audio, vision, and text in real time (unheard of for a model of this intelligence, for those unfamiliar). According to OpenAI, GPT-4o "accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs". (See OpenAI's demo on YouTube.)
Recently, I released an open source library for extracting data in multiple modalities to feed into your AI-based Python projects. In this post, I'll show you how to use it alongside GPT-4o and the OpenAI API to build multimodal apps. I've got nothing better to do right now, so I'll walk through extracting multimodal content from different sources, preparing the input for GPT-4o, sending it to the model, and getting the results back.
Before getting into the code, let's stop and ask why we'd use GPT-4o over previous models like GPT-4 Turbo:
Multi-modal Input and Output:
GPT-4o can handle text, audio, and image inputs and generate outputs in any of these formats (see the sketch right after this list for what a mixed text-and-image request looks like).
Real-time Processing:
The model can respond to audio inputs in as little as 232 milliseconds, making it suitable for real-time applications.
Improved Performance:
GPT-4o matches GPT-4 Turbo performance on text in English and code, with significant improvements in non-English languages, vision, and audio understanding.
Cost and Speed:
GPT-4o is 50% cheaper and 2x faster than GPT-4 Turbo, with 5x higher rate limits.
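To make the first point concrete, here's a minimal sketch of what a mixed text-and-image request looks like against the standard OpenAI chat completions endpoint. The image URL is just a placeholder, and note that audio in/out wasn't exposed through this endpoint at launch:

from openai import OpenAI
# The client reads OPENAI_API_KEY from your environment
client = OpenAI()
# One user message that mixes a text part with an image part
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this figure?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/figure1.png"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)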
Ok, let's get to the code lol:
Step 1: Extract!
This can be done using The Pipe API, which can handle various file types and URLs, extracting text and images in a format that GPT-4o can understand.
For example, if we were analyzing a talk based on a scientific paper, we could combine the two sources to provide a comprehensive input to GPT-4o:
from thepipe_api import thepipe
# Extract multimodal content from a PDF
pdf = thepipe.extract("path/to/paper.pdf")
# Extract multimodal content from a YouTube video
vid = thepipe.extract("https://youtu.be/dQw4w9WgXcQ")
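Quick sanity check before moving on: the rest of this post assumes extract() returns a list of OpenAI-style message dicts (that's implied by how we concatenate them with a user query in Step 2), so you can eyeball what came out of each source like this:

# Rough sketch, assuming each extracted item is an OpenAI-style message dict
for source_name, content in [("paper", pdf), ("video", vid)]:
    print(f"{source_name}: {len(content)} message(s)")
    for message in content:
        # "content" may be a plain string or a list of text/image parts
        print("  role:", message.get("role"), "| content type:", type(message.get("content")).__name__)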
Step 2: Prepare the Input for GPT-4o
Here's an example of how to prepare the input prompt by simply combining the extracted content with a question from the user:
# Add a user query
query = [{
    "role": "user",
    "content": "Which figures from the paper would help answer the question at the end of the talk video?"
}]
# Combine the content to create the input prompt for GPT-4o
messages = pdf + vid + query
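As an optional tweak, you can prepend a system message to steer the answer; this is standard OpenAI chat formatting, nothing specific to The Pipe:

# Optional: steer the model with a system message (standard OpenAI chat format)
system = [{
    "role": "system",
    "content": "You are a research assistant. Ground your answer in the provided paper and talk."
}]
messages = system + pdf + vid + query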
Step 3: Send the Input to GPT-4o
With the input prepared, you can now send it to GPT-4o using the OpenAI API. Make sure you have your OPENAI_API_KEY set in your environment variables.
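If you're not sure the key is set, a tiny guard like this (or passing api_key= directly to the client) saves a confusing error later:

import os
# Fail fast if the key isn't configured; alternatively, pass api_key=... to OpenAI() below
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this script.")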
from openai import OpenAI
# Initialize the OpenAI client
openai_client = OpenAI()
# Send the input to GPT-4o
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
# Print the response
print(response.choices[0].message.content)
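If you'd rather print the answer as it's generated (nice for long multimodal prompts), the same call works with streaming; again, this is plain OpenAI API usage:

# Stream the response instead of waiting for the full answer
stream = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()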
All done!
PS:
If you have literally no idea what I'm talking about, check out the OpenAI GPT-4o announcement!
If you're a developer, feel free to access or contribute to The Pipe on GitHub! It is important to note that OpenAI's GPT-4o model only accepts textual and visual modalities at release; however, we will be carefully monitoring the new modalities released for GPT-4o in the coming weeks and updating the library accordingly.