OpenAI's DevDay 2024 unveiled four major API updates set to reshape development with GenAI: the Realtime API, vision fine-tuning, Prompt Caching, and Model Distillation. Each opens new frontiers for AI development and applications.
Realtime API
The new Realtime API marks a significant leap forward in natural language processing, supporting seamless speech-to-speech conversations. Developers can now build applications similar to ChatGPT's Advanced Voice Mode. Beyond fast speech-to-speech conversations, the release also brings audio input/output to the Chat Completions API, allowing developers to pass text, audio, or both as input.
Key features:
- Seamless speech-to-speech conversations
- Audio input/output in the Chat Completions API
- Support for text, audio, or both as input
The new Realtime API works by utilizing a persistent WebSocket connection. Previously, you first had to transcribe audio with a model like OpenAI's Whisper and then pass the transcript to a model for inference. That approach had many drawbacks: tonality was lost (say, emphasis on certain words, or a speaker's accent), and worst of all, it was terribly slow and required additional pipelining to build a proper, voice-enabled end-to-end system.
Additionally, we're excited to see that the Realtime API supports function calling, opening up a world of possibilities for more dynamic and responsive applications across industries. By going beyond simple voice interactions to invoke functions and change application behavior, this feature has the potential to power more intuitive, efficient, and powerful business tools.
How it works:
- Utilizes a persistent WebSocket connection
- Eliminates the need for separate transcription and inference steps
- Preserves tonality, emphasis, and accent
- Reduces latency and simplifies the development process
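To make this concrete, here's a minimal sketch in Python (using the websockets package) of what a session might look like: it opens the persistent connection, registers a function the model may call, and listens for events. The get_weather tool, its parameters, and the session options are illustrative assumptions, not prescribed values.

import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # Note: older websockets versions use extra_headers; newer ones use additional_headers.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Configure the session: enable both modalities and expose one (hypothetical) tool.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "tools": [{
                    "type": "function",
                    "name": "get_weather",  # placeholder tool for illustration
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                }],
            },
        }))
        # Ask the model to respond; audio chunks and tool calls arrive as streamed events.
        await ws.send(json.dumps({"type": "response.create"}))
        async for message in ws:
            event = json.loads(message)
            print(event["type"])  # e.g. response.audio.delta, response.done
            if event["type"] == "response.done":
                break

asyncio.run(main())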
Initial pricing seems quite high (approximately $0.06 per minute of audio input and $0.20 per minute of audio output), although it's expected to decrease over time, as has happened with many other OpenAI pricing tables.
Pricing (as of 2024-10-04):
- Text: $5 per million input tokens, $20 per million output tokens
- Audio: $100 per million input tokens, $200 per million output tokens
Despite the costs, early adopters can gain a competitive edge by integrating this technology into their products. If you're eager to get started, use the gpt-4o-realtime-preview model if you're building with WebSockets. If you're testing out the new audio capabilities in the Chat Completions API, use gpt-4o-audio-preview.
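As a quick illustration of the Chat Completions side, here's a minimal sketch using the official openai Python SDK to request a spoken answer; the voice choice, prompt, and output file name are placeholders.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],               # request both text and audio back
    audio={"voice": "alloy", "format": "wav"},  # settings for the spoken reply
    messages=[{"role": "user", "content": "Is a golden retriever a good family dog?"}],
)

# The spoken reply arrives base64-encoded alongside the text transcript.
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("answer.wav", "wb") as f:
    f.write(wav_bytes)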
The OpenAI dev team also released a repository for developers to quickly set up a demonstration of the Realtime API: openai-realtime-console
Vision Fine-tuning
The ability to fine-tune a model with images empowers developers to create specialized visual AI models, unlocking innovative multi-modal applications across industries.
Best of all, you can continue using the same methods you used to fine-tune earlier, text-only models. In the same JSONL files, images can be provided as HTTP URLs or as data URLs containing base64-encoded images. Just make sure that your images are 10 MB or less, in JPEG, PNG, or WEBP format, and in RGB or RGBA mode. Don't use images featuring people, faces, children, or CAPTCHAs; OpenAI's moderation will remove those from your training dataset.
Each example can include up to 10 images:
{
  "messages": [
    { "role": "system", "content": "You are an assistant that identifies common and uncommon garden weeds." },
    { "role": "user", "content": "Can you help me identify these weeds?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/garden_weed1.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
          }
        }
      ]
    },
    { "role": "assistant", "content": "These are dandelions (Taraxacum officinale), a common weed with bright yellow flowers." }
  ]
}
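Assuming examples like the one above are saved to a JSONL file, kicking off the job looks exactly like a text-only fine-tune. Here's a minimal sketch with the openai Python SDK; the file name is a placeholder, and check the docs for which model snapshots support vision fine-tuning.

from openai import OpenAI

client = OpenAI()

# Upload the training data, exactly as with text-only fine-tuning.
training_file = client.files.create(
    file=open("weeds.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job against a vision-capable snapshot.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)
print(job.id, job.status)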
Key points:
- Uses the same methods as text-based fine-tuning
- Supports images provided as HTTP URLs or data URLs with base64 encoding
- Image requirements: 10 MB or less; JPEG, PNG, or WEBP; RGB or RGBA mode
- Excludes images featuring people, faces, children, or CAPTCHAs
Prompt Caching
Prompt caching was first introduced by Google and later adopted by Anthropic, so we're happy to see OpenAI introducing it, too. Prompt Caching addresses a common pain point in AI application development: the cost and latency of repetitive API calls. At a basic level, Prompt Caching stores and reuses the computation for prompt prefixes the API has already seen. Instead of processing the same prompt prefix from scratch each time, the system retrieves the pre-computed result, saving time, money, and computational resources.
Caching is available on GPT-4o, GPT-4o mini, o1-preview, and o1-mini, as well as fine-tuned versions of those models. Expect cached input tokens to cost 50% less than regular input tokens.
The best part? Caching is automatic on prompts with 1024 tokens or more! The API will cache the entire messages array (system, user, and assistant), as well as images, tools, and structured outputs.
Key benefits:
- Reduces processing time for repeated prompts
- Lowers costs associated with API calls
- Conserves computational resources
How it works:
- Automatic caching for prompts with 1024 tokens or more
- Caches entire messages (system, user, and assistant), images, tools, and structured outputs
You can monitor caching in Chat Completions API responses under usage.prompt_tokens_details.cached_tokens.
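Here's a minimal sketch of checking that field with the openai Python SDK; the long system prompt is a stand-in for any stable prefix of 1024+ tokens.

from openai import OpenAI

client = OpenAI()

long_system_prompt = "You are a support agent for ..."  # imagine 1024+ tokens here

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": long_system_prompt},  # stable prefix gets cached
        {"role": "user", "content": "Where is my order?"},
    ],
)

# On a cache hit, cached_tokens reports how many prompt tokens were reused.
details = response.usage.prompt_tokens_details
print(response.usage.prompt_tokens, details.cached_tokens)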
Model Distillation
This new feature, Model Distillation, allows you to create more cost-effective models by taking the outputs of a powerful (and expensive) model like o1-preview and fine-tuning a version of gpt-4o-mini with them. The entire process can be done within the OpenAI platform, too, giving developers even more options for controlling the balance of performance and cost with the less-capable models in the OpenAI arsenal.
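Here's a rough sketch of the flow with the openai Python SDK, using the store and metadata parameters on Chat Completions to persist the teacher model's outputs; the metadata tag and exported-file ID are placeholders.

from openai import OpenAI

client = OpenAI()

# 1. Generate and store completions from the expensive teacher model.
response = client.chat.completions.create(
    model="o1-preview",
    store=True,                                 # persist the completion for distillation
    metadata={"task": "weed-identification"},   # tag for filtering in the dashboard
    messages=[{"role": "user", "content": "Identify this weed: ..."}],
)

# 2. In the OpenAI platform, filter the stored completions by metadata,
#    export them as a fine-tuning dataset, then train the student model:
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",  # placeholder ID of the exported dataset
    model="gpt-4o-mini",
)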
Key advantages:
- Creates more affordable, specialized models
- Balances performance and cost effectively
- Integrates evaluation tools for quality assurance
What we like most about this is the introduction of evals to the distillation process. OpenAI is finally giving us all the tooling necessary not just to train and fine-tune models, but to perform important quality checks on their outputs.
Looking Ahead
These updates represent significant advancements in OpenAI's development tooling. As the technologies mature and prices eventually decrease, we can expect to see a new wave of AI applications that are more responsive, efficient, and capable than ever before.