Category · 59 models
Live captions, speaker labels, and an action-item digest by the time the call ends.
Used for sales, support and meetings: streaming ASR with diarization, plus an LLM pass that extracts decisions, owners and next steps.
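As a rough illustration of the pipeline described above, the sketch below takes diarized ASR segments, merges them into speaker turns, and builds the prompt for the LLM pass. It is a minimal sketch under stated assumptions, not any vendor's implementation: the `Segment` shape, the `max_gap` threshold, and the prompt wording are all hypothetical, and the ASR and LLM calls themselves are left out.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str   # diarization label such as "S1" (hypothetical format)
    start: float   # seconds from the start of the call
    end: float
    text: str


def merge_turns(segments, max_gap=1.0):
    """Merge consecutive segments from the same speaker into one turn.

    max_gap is an assumed threshold (seconds) for treating a pause as
    part of the same turn.
    """
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = turns[-1] if turns else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Same speaker, short pause: extend the current turn.
            turns[-1] = Segment(last.speaker, last.start,
                                max(last.end, seg.end),
                                f"{last.text} {seg.text}")
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


def render_transcript(turns):
    """Render turns as speaker-labelled lines with start timestamps."""
    return "\n".join(f"[{t.start:.1f}s] {t.speaker}: {t.text}" for t in turns)


def build_digest_prompt(turns):
    """Build the (hypothetical) prompt for the LLM extraction pass."""
    return (
        "From the call transcript below, list the decisions, owners and "
        'next steps as JSON with keys "decisions", "owners", "next_steps".'
        "\n\n" + render_transcript(turns)
    )


segments = [
    Segment("S1", 0.0, 1.2, "Let's ship"),
    Segment("S1", 1.5, 2.0, "on Friday."),
    Segment("S2", 2.4, 3.0, "I'll own QA."),
]
turns = merge_turns(segments)
prompt = build_digest_prompt(turns)
```

In a live setting the same merge step would run incrementally as segments arrive, with the digest prompt built once the call ends.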
OpenAI
Real-time omni-model handling text, vision and voice in a single network.
Google DeepMind
High-fidelity video generation with native synchronised audio.
OpenAI
Open multilingual speech recognition and translation model.
ElevenLabs
Best-in-class expressive TTS and voice cloning across 70+ languages.
Source-grounded research assistant with audio overviews.
OpenAI
OpenAI text-to-speech voices via the audio API.
High-fidelity expressive TTS voices on Google Cloud.
Meta
Multilingual speech-to-speech and speech-to-text translation.
HeyGen
AI avatar video generator for marketing and training.
Resemble AI
Voice cloning and real-time speech synthesis platform.
Deepgram
Production-grade streaming speech-to-text model.
AssemblyAI
Highly accurate speech recognition with rich audio intelligence.
Useful Sensors
Open ASR model optimised for real-time edge inference.
Duolingo
AI-powered language tutoring features.
Pixel SoC powering on-device Gemini Nano features.
Samsung
Suite of on-device and cloud AI features for Galaxy phones (translate, edit, summarize).
BMW
Voice-first in-car AI assistant integrating Alexa LLM features.
Rabbit
Pocket AI device built around the Large Action Model paradigm.
Meta
Smart glasses with multimodal Meta AI for live look-and-ask.
Veritone
AI for evidence redaction, transcription and investigations for law enforcement.
Hyundai Motor
AI-powered software-defined vehicle OS with voice and personalization.
Zoom
AI assistant for meeting summaries, chat and email across Zoom.
Cisco
GenAI assistant for meetings, contact center and collaboration.
Oracle
Voice-enabled clinical documentation agent for clinicians.
AWS
GenAI for contact-center agents, self-service and analytics.
AWS
HIPAA-eligible service that generates clinical notes from patient conversations.
Google Cloud
Generative contact-center AI for virtual agents, agent assist and insights.
Gong
Revenue AI for call insights, forecasting and deal execution.
Twilio
GenAI and predictive AI across Twilio messaging, voice and Segment.
Zendesk
AutoQA AI that scores 100% of support conversations across voice and chat.
Twilio
Speech-to-text, summaries and language operators that analyze every call in real time.
Twilio
Build conversational AI agents over SMS, voice and WhatsApp grounded in Segment data.
Microsoft / Nuance
Ambient AI scribe for clinicians that drafts notes and orders from doctor-patient conversations.
Spotify
Personalized AI DJ that curates and narrates listening sessions in a realistic voice.
Reka AI
We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka.
OpenAI
We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.
OpenAI
We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.
NVIDIA
Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs.
Google DeepMind
Today, we’re releasing an experimental version of Gemini 2.0 Pro that responds to developer feedback on earlier experimental releases.
OpenAI
We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.
Google DeepMind
Gemini 2.5 Pro Experimental is our most advanced model for complex tasks.
OpenAI
We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.
Google DeepMind
Gemini 2.5 Pro Experimental is our most advanced model for complex tasks.
Google DeepMind
Gemini 2.5 Pro Experimental is our most advanced model for complex tasks.
Google DeepMind
To advance Gemini’s capabilities towards solving hard reasoning problems, we developed a novel reasoning approach, called Deep Think, that naturally blends in parallel thinking techniques during response generation.
Alibaba
We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts.
Google DeepMind
Our most capable vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission.
ByteDance
ByteDance's image, video and audio generation model tracked by Epoch, focused on video generation.
Google DeepMind
Google DeepMind's audio model tracked by Epoch, focused on audio generation.
Hume AI
Empathic voice interface that perceives and generates emotional speech in real time.
Inflection AI
Inflection's empathetic conversational assistant tuned for personal, supportive dialogue.
Stability AI
Stability AI's higher-capacity 2B text-to-audio diffusion model for music generation, sound-effect generation, and audio editing.
Alibaba
Alibaba's vision-enhanced real-time audio/video translation model for live multilingual interpretation across 60 languages.
Mirelo AI
Mirelo's text-to-sound-effects model for production-ready Foley, ambience, and SFX generation.
Meta
Meta's audio generation model focused on high-fidelity waveform synthesis and speech-music co-generation.