Ok so OpenAI just announced GPT-4o, a new model that can reason across audio, vision, and text in real time (unheard of for a model of this intelligence, for those unfamiliar). According to OpenAI, GPT-4o "accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs". (See OpenAI's demo on YouTube.)
Recently, I released an open source library for extracting data in multiple modalities to feed into your AI-based Python projects. In this post, I'll show you how to use it alongside GPT-4o and the OpenAI API to build multimodal apps. I've got nothing better to do right now, so I'll walk through extracting multimodal content from different sources, preparing the input for GPT-4o, sending it to the model, and getting the results back.
Before getting into the code, let's stop and ask why we'd use GPT-4o over previous models like GPT-4 Turbo:
Multi-modal Input and Output:
GPT-4o can handle text, audio, and image inputs and generate outputs in any of these formats (see the sketch right after this list for what a mixed text-and-image request looks like).
Real-time Processing:
The model can respond to audio inputs in as little as 232 milliseconds, making it suitable for real-time applications.
Improved Performance:
GPT-4o matches GPT-4 Turbo performance on text in English and code, with significant improvements in non-English languages, vision, and audio understanding.
Cost and Speed:
GPT-4o is 50% cheaper and 2x faster than GPT-4 Turbo, with 5x higher rate limits.
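To make the first point concrete, here's a minimal sketch of what a mixed text-and-image request looks like against the standard OpenAI chat completions endpoint. The image URL is just a placeholder, and note that audio in/out wasn't exposed through this endpoint at launch:

from openai import OpenAI
# The client reads OPENAI_API_KEY from your environment
client = OpenAI()
# One user message that mixes a text part with an image part
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this figure?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/figure1.png"}},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)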
Ok, let's get to the code lol:
Step 1: Extract!
This can be done using The Pipe API, which can handle various file types and URLs, extracting text and images in a format that GPT-4o can understand.
For example, if we were analyzing a talk based on a scientific paper, we could combine the two sources to provide a comprehensive input to GPT-4o:
from thepipe_api import thepipe
# Extract multimodal content from a PDF
pdf = thepipe.extract("path/to/paper.pdf")
# Extract multimodal content from a YouTube video
vid = thepipe.extract("https://youtu.be/dQw4w9WgXcQ")
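Quick sanity check before moving on: the rest of this post assumes extract() returns a list of OpenAI-style message dicts (that's implied by how we concatenate them with a user query in Step 2), so you can eyeball what came out of each source like this:

# Rough sketch, assuming each extracted item is an OpenAI-style message dict
for source_name, content in [("paper", pdf), ("video", vid)]:
    print(f"{source_name}: {len(content)} message(s)")
    for message in content:
        # "content" may be a plain string or a list of text/image parts
        print("  role:", message.get("role"), "| content type:", type(message.get("content")).__name__)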
Step 2: Prepare the Input for GPT-4o
Here's an example of how to prepare the input prompt by simply combining the extracted content with a question from the user:
# Add a user query
query = [{
    "role": "user",
    "content": "Which figures from the paper would help answer the question at the end of the talk video?"
}]
# Combine the content to create the input prompt for GPT-4o
messages = pdf + vid + query
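As an optional tweak, you can prepend a system message to steer the answer; this is standard OpenAI chat formatting, nothing specific to The Pipe:

# Optional: steer the model with a system message (standard OpenAI chat format)
system = [{
    "role": "system",
    "content": "You are a research assistant. Ground your answer in the provided paper and talk."
}]
messages = system + pdf + vid + query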
Step 3: Send the Input to GPT-4o
With the input prepared, you can now send it to GPT-4o using the OpenAI API. Make sure you have your OPENAI_API_KEY set in your environment variables.
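If you're not sure the key is set, a tiny guard like this (or passing api_key= directly to the client) saves a confusing error later:

import os
# Fail fast if the key isn't configured; alternatively, pass api_key=... to OpenAI() below
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this script.")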
from openai import OpenAI
# Initialize the OpenAI client
openai_client = OpenAI()
# Send the input to GPT-4o
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)
# Print the response
print(response.choices[0].message.content)
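If you'd rather print the answer as it's generated (nice for long multimodal prompts), the same call works with streaming; again, this is plain OpenAI API usage:

# Stream the response instead of waiting for the full answer
stream = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()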
All done!
PS:
If you have literally no idea what I'm talking about, check out the OpenAI GPT-4o announcement!
If you're a developer, feel free to access or contribute to The Pipe on GitHub! It is important to note that OpenAI's GPT-4o model only accepts textual and visual modalities at release; however, we will be carefully monitoring the new modalities released for GPT-4o in the coming weeks and updating the library accordingly.