Intelligent interfaces: AI goes beyond text into voice and vision with OpenAI and Google.

May 17, 2024

Wow, was this a big week for AI announcements. OpenAI and Google went toe-to-toe with their new product releases, from multimodal agents like GPT-4o and Astra (intelligent interfaces that go beyond chat to voice and computer vision) to Google’s hint of its text-to-video model Veo, a likely rival to OpenAI’s Sora.

There are dozens of GPT-4o demos, including this one from Khan Academy founder Sal Khan, who works through a tricky math problem with his son using GPT-4o. The future of education will be incredibly different. Imagine a personalized tutor in your pocket.

I also found this “live translation” demo pretty compelling.

OpenAI GPT-4o real-time translation

vimeo.com/945587808

Amid all the hubbub, Sony Music quietly sent more than 700 letters to AI companies explicitly stating that the music label giant is opting out of any model training. Whether the companies comply is unclear, but their next moves will likely be significant for the creative industries. This came hot on the heels of a teaser of a new text-to-music model from popular voice AI startup ElevenLabs.

Enjoy the update.

Tara

🔥 Latest News

OpenAI’s GPT-4o delivers human-like AI interaction with text, audio, and vision integration: OpenAI has launched its new multimodal model, GPT-4o, which integrates text, audio, and visual inputs and outputs, promising more natural human-machine interaction. GPT-4o, where the “o” stands for “omni,” “accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs,” OpenAI announced. The model can respond to audio input in as little as 232 milliseconds, with an average response time of 320 milliseconds, which is similar to human response time in conversation.

Google takes on GPT-4o with Project Astra, an AI agent that understands dynamics of the world: At its annual I/O developer conference, Google made a ton of announcements focused on AI, including Project Astra, an effort to build a universal AI agent of the future. Only an early version was demoed at the conference, but the idea is to build a multimodal AI assistant that acts as a helper, sees and understands the dynamics of the world, and responds in real time to help with routine tasks and questions. The premise is similar to what OpenAI showcased with GPT-4o-powered ChatGPT.

TikTok is testing AI-generated search results: TikTok is testing a more robust search results page that uses generative AI. The feature appears to be new and is called “search highlights.” A snippet of AI results appears at the top of some search results pages, and clicking into the section opens a new page with the full response. A page explaining the results says that the material is generated using ChatGPT, and that TikTok displays the content “when [the algorithm] finds them relevant to your search.” The feature appears to be limited so far: not all queries have AI answers.

Google announced a text-to-video AI model called Veo, allowing for the creation of computer-generated footage based only on written prompts. The model is a clear rival to OpenAI’s Sora, which performs similar functions and is planned to be released to the public later this year.

ElevenLabs previewed its AI music generator, ElevenLabs Music, a tool that can create full-length songs with realistic vocals from a single text prompt. It also launched the ElevenLabs Dubbing API, which lets any developer add audio or video translation to their product while preserving the unique characteristics of the original speakers’ voices.

Sony Music has sent more than 700 letters to AI companies declaring that it is opting out of having its content used in any model training. The letter “expressly prohibits and opts out of any text or data mining, web scraping or similar reproductions, extractions or uses” of any Sony music content, including “musical compositions, lyrics, audio recordings, audiovisual recordings, artwork, images, data etc” for “training, developing or commercializing any AI system.” How will AI companies respond? Will they comply, or ignore the letters?

OpenAI just shipped a major improvement to data analysis in ChatGPT. Now, users can interact with tables and charts, and add files directly from Google Drive and Microsoft OneDrive.
