
OpenAI's DevDay 2024 unveiled four major API updates poised to reshape how developers build with generative AI, each opening new frontiers for AI development and applications.

Realtime API

The new Realtime API marks a significant leap forward in natural language processing, enabling seamless speech-to-speech conversations. Developers can now build applications similar to ChatGPT's Advanced Voice Mode. Alongside fast speech-to-speech conversations, OpenAI is also bringing audio input/output to the Chat Completions API, letting developers pass text, audio, or both as input.

Key features:

  • Seamless speech-to-speech conversations
  • Audio input/output for the Chat Completions API
  • Support for text, audio, or both as input

The new Realtime API works over a persistent WebSocket connection. Previously, you first had to transcribe audio with a model like OpenAI's Whisper and then pass the transcript to another model for inference. That approach had real drawbacks: tonality was lost, including emphasis on certain words or a speaker's accent -- and worse, it was slow and required extra pipelining to build a proper, voice-enabled end-to-end system.

Additionally, we're excited to see that the Realtime API supports function calling, opening up a world of possibilities for more dynamic and responsive applications across industries. By going beyond simple voice interactions to invoking functions and changing application behavior, this feature can power more intuitive, efficient, and capable business tools.

How it works:

  • Utilizes a persistent WebSocket connection
  • Eliminates the need for separate transcription and inference steps
  • Preserves tonality, emphasis, and accent
  • Reduces latency and simplifies the development process
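
To make this concrete, here's a minimal Python sketch of opening that WebSocket and declaring a function the model can call. The endpoint, headers, and event shapes follow OpenAI's beta documentation at the time of writing, and get_order_status is a hypothetical tool we made up for illustration -- treat the details as assumptions and check the current reference before building on them.

import asyncio
import json
import os

import websockets  # pip install websockets

async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }

    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure the session, including a (hypothetical) function the model may call.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a concise support assistant.",
                "tools": [{
                    "type": "function",
                    "name": "get_order_status",
                    "description": "Look up the status of a customer order.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                }],
            },
        }))

        # Ask the model to start generating a response over the same connection.
        await ws.send(json.dumps({"type": "response.create"}))

        # Stream server events (text/audio deltas, tool calls) as they arrive.
        async for message in ws:
            event = json.loads(message)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())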

Initial pricing is quite high (approximately $0.06 per minute of audio input and $0.20 per minute of audio output), although we expect it to decrease over time, as has happened with pricing for many other OpenAI products.

Realtime API pricing (as of 2024-10-04):

  • Text: $5 per million input tokens, $20 per million output tokens
  • Audio: $100 per million input tokens, $200 per million output tokens

Despite the costs, early adopters can gain a competitive edge by integrating this technology into their products. If you're eager to get started, use the gpt-4o-realtime-preview model for building over WebSockets, and gpt-4o-audio-preview for testing the new audio capabilities in the Chat Completions API.
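
For the Chat Completions route, a minimal sketch with the official openai Python SDK might look like the snippet below; the voice name and output format are assumptions on our part, so double-check the current audio guide for the supported options.

import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask for both a text and an audio reply from the audio-capable preview model.
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},  # assumed voice/format choices
    messages=[
        {"role": "user", "content": "Give me a one-sentence summary of prompt caching."}
    ],
)

message = completion.choices[0].message
print(message.audio.transcript)  # text transcript of the spoken reply

# The audio itself arrives base64-encoded.
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(message.audio.data))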

The OpenAI dev team also released a repository for developers to quickly set up a demonstration of the Realtime API: openai-realtime-console

Vision Fine-tuning

The ability to fine-tune a model with images empowers developers to create specialized visual AI models, unlocking innovative multi-modal applications across industries.

Best of all, you can continue using the same methods you already use for text-only fine-tuning. In the same JSONL files, images can be provided as HTTP URLs or as data URLs containing base64-encoded images. Just make sure your images are 10 MB or less, in JPEG, PNG, or WEBP format, and in RGB or RGBA image mode. Avoid images featuring people, faces, children, or CAPTCHAs -- OpenAI's moderation will remove those from your training dataset.

Each example can include up to 10 images:

{
  "messages": [
    { "role": "system", "content": "You are an assistant that identifies common and uncommon garden weeds." },
    { "role": "user", "content": "Can you help me identify these weeds?" },
    { "role": "user", "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/garden_weed1.jpeg"
          }
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD..."
          }
        }
      ]
    },
    { "role": "assistant", "content": "These are Dandelion (Taraxacum officinale), a common weed with bright yellow flowers." }
  ]
}
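
Once your JSONL file is assembled, kicking off a job looks much like a text-only fine-tune. Here's a rough sketch with the openai Python SDK; the file name is ours, and the base model snapshot is an assumption, so confirm which vision-capable models the fine-tuning docs currently list.

from openai import OpenAI

client = OpenAI()

# Upload the JSONL training file containing text + image examples like the one above.
training_file = client.files.create(
    file=open("weed_identification.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job against a vision-capable base model (assumed snapshot name).
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",
)

print(job.id, job.status)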

Key points:

  • Uses the same methods as text-based fine-tuning
  • Supports images provided as HTTP or data URLs with base64 encoding
  • Image requirements: 10 MB or less, in JPEG, PNG, or WEBP format, RGB or RGBA mode
  • Excludes images featuring people, faces, children, or CAPTCHAs

Prompt Caching

Prompt caching was first introduced by Google and later added by Anthropic, so we're happy to see OpenAI introducing this, too. Prompt Caching addresses a common pain point in AI application development: the cost and latency associated with repetitive API calls. At a basic level, Prompt Caching involves storing and reusing the results of previous computations for similar or identical prompts. Instead of processing the same or similar prompts from scratch each time, the system can retrieve pre-computed results, saving time, money, and computational resources.

Caching is available on GPT-4o, GPT-4o mini, o1-preview, and o1-mini, as well as fine-tuned versions of those models. Expect cached input tokens to cost about 50% less than regular input tokens.

The best part? Caching is automatic on prompts with 1024 tokens or more! The API will cache the entire message (system, user, and assistant), images, tools, and structured outputs.

Key benefits:

  • Reduces processing time for repeated prompts
  • Lowers costs associated with API calls
  • Conserves computational resources

How it works:

  • Automatic caching for prompts with 1024 tokens or more
  • Caches entire messages (system, user, and assistant), images, tools, and structured outputs

You can monitor caching in Chat Completions API responses under usage.prompt_tokens_details.cached_tokens.
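
Here's a minimal sketch of reading that field with the Python SDK; long_system_prompt is a stand-in for a 1,024+ token prompt you'd define yourself, and keeping that static content at the front of the message list is what makes the prefix cacheable.

from openai import OpenAI

client = OpenAI()

# Placeholder for a long (1,024+ token) system prompt; keep static content like this
# at the start of the prompt so repeated requests share the same cacheable prefix.
long_system_prompt = "..."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "Summarize today's support tickets."},
    ],
)

usage = response.usage
print("Prompt tokens:", usage.prompt_tokens)
print("Served from cache:", usage.prompt_tokens_details.cached_tokens)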

Model Distillation

This new feature, Model Distillation, lets you create more cost-effective models by taking the outputs of a powerful (and expensive) model like o1-preview and fine-tuning a version of GPT-4o mini with them. The entire process can be done within the OpenAI platform, giving developers even more options for balancing performance and cost across the less-capable models in the OpenAI arsenal.
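
In practice, the workflow leans on stored completions: you capture the teacher model's outputs, then select them as a training dataset in the platform. Here's a rough sketch, assuming the store and metadata parameters announced alongside distillation; verify the current parameter names before relying on them.

from openai import OpenAI

client = OpenAI()

# Capture outputs from the expensive "teacher" model as stored completions,
# which can later be selected in the OpenAI platform as a distillation dataset.
response = client.chat.completions.create(
    model="o1-preview",
    store=True,  # assumed stored-completions flag from the DevDay announcement
    metadata={"purpose": "distillation", "task": "weed-identification"},
    messages=[
        {"role": "user", "content": "Identify the weed described here: ..."}
    ],
)

print(response.choices[0].message.content)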

Key advantages:

  • Creates more affordable, specialized models
  • Balances performance and cost effectively
  • Integrates evaluation tools for quality assurance

What we like most about this is the introduction of evals to the distillation process. OpenAI is finally giving us all the tooling necessary to train and fine-tune models and to perform important quality checks on their outputs.

Looking Ahead

These updates represent significant advancements in OpenAI's development tooling. As the technologies mature and prices eventually decrease, we can expect to see a new wave of AI applications that are more responsive, efficient, and capable than ever before.
