Category · 66 models

Talk & Listen (Voice)

AI you speak to that speaks back — for calls, meetings, and accessibility.

What it is

Voice AI turns speech into text and back again, often in real time. It powers phone agents, meeting transcripts, audiobooks, and natural conversations with assistants.

Real-world examples

  • Transcribe and summarize a sales call
  • Answer customer calls 24/7
  • Read articles aloud in a natural voice
  • Translate a conversation live

What to look for

  • How natural the voice sounds
  • Accuracy with accents and noisy audio
  • Latency for real-time use
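
Accuracy for speech-to-text is commonly reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model's transcript into a trusted reference, divided by the number of reference words. The sketch below is one minimal way to compute it when comparing models on your own recordings. The function name and the simple lowercase, whitespace-split normalization are assumptions made for illustration, not part of any listed product's API.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "read articles aloud in a natural voice"
    hypothesis = "read article aloud in natural voice"
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Lower is better, and the score can exceed 1.0 when the transcript contains many extra words. Running the same check on recordings with accents and background noise gives a more realistic picture than clean studio audio.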

66 models in this category

GPT-4o

OpenAI

AIDB 91

Real-time omni-model handling text, vision and voice in a single network.

Multimodal · Audio / Speech · Image Understanding
Text + Image + Audio · Proprietary

Veo 3

Google DeepMind

AIDB 93

High-fidelity video generation with native synchronised audio.

Video Generation · Audio / Speech
Video + Audio · Proprietary

Whisper v3

OpenAI

AIDB 91

Open multilingual speech recognition and translation model.

Audio / Speech
Audio · Open

ElevenLabs v3

ElevenLabs

AIDB 88

Best-in-class expressive TTS and voice cloning across 70+ languages.

Audio / Speech
Audio · Proprietary

Suno v4

Suno

AIDB 85

Generates full songs with vocals from a text prompt.

Music
Audio · Proprietary

Udio

Udio

AIDB 84

Text-to-music model focused on production-quality tracks.

Music
Audio · Proprietary

NotebookLM

Google

AIDB 93

Source-grounded research assistant with audio overviews.

Text Generation · Audio / Speech
Text + Audio · Proprietary

TTS-1 / GPT-4o Voice

OpenAI

AIDB 92

OpenAI text-to-speech voices via the audio API.

Audio / Speech
Audio · Proprietary

Lyria 2

Google DeepMind

AIDB 93

Google's professional music generation model.

Music
Audio · Proprietary

Chirp 3

Google

AIDB 93

High-fidelity expressive TTS voices on Google Cloud.

Audio / Speech
Audio · Proprietary

Seamless M4T v2

Meta

AIDB 89

Multilingual speech-to-speech and speech-to-text translation.

Audio / Speech
Audio + Text · Open

MusicGen

Meta

AIDB 88

Open text-to-music model from AudioCraft.

Music
Audio · Open

HeyGen Avatar IV

HeyGen

AIDB 86

AI avatar video generator for marketing and training.

Video Generation · Audio / Speech
Video · Proprietary

Cartesia Sonic

Cartesia

AIDB 83

Ultra-low-latency state-space TTS model.

Audio / Speech
Audio · Proprietary

PlayHT 3.0

PlayHT

AIDB 81

Conversational TTS optimised for AI agents.

Audio / Speech
Audio · Proprietary

Resemble AI

Resemble AI

AIDB 85

Voice cloning and real-time speech synthesis platform.

Audio / Speech
Audio · Proprietary

Deepgram Nova-3

Deepgram

AIDB 85

Production-grade streaming speech-to-text model.

Audio / Speech
Audio · Proprietary

AssemblyAI Universal-2

AssemblyAI

AIDB 83

Highly accurate speech recognition with rich audio intelligence.

Audio / Speech
Audio · Proprietary

Moonshine

Useful Sensors

AIDB 86

Open ASR model optimised for real-time edge inference.

Audio / Speech
Audio · Open

Stable Audio 2.0

Stability AI

AIDB 89

Generates full-length audio tracks from text.

Music
Audio · Proprietary

Riffusion

Riffusion

AIDB 85

AI music generation with vocal & instrumental control.

Music
Audio · Proprietary

Duolingo Max

Duolingo

AIDB 83

AI-powered language tutoring features.

Text Generation · Audio / Speech
Text + Audio · Proprietary

Google Tensor G5

Google

AIDB 92

Pixel SoC powering on-device Gemini Nano features.

Multimodal · Audio / Speech
On-device · Proprietary

Samsung Galaxy AI

Samsung

AIDB 87

Suite of on-device + cloud AI features for Galaxy phones (translate, edit, summarize).

Multimodal · Audio / Speech · Image Generation
Hybrid · Proprietary

BMW Intelligent Personal Assistant

BMW

AIDB 84

Voice-first in-car AI assistant integrating Alexa LLM features.

Audio / Speech · Agents
In-vehicle · Proprietary

Rabbit R1

Rabbit

AIDB 83

Pocket AI device built around the Large Action Model paradigm.

Agents · Audio / Speech
Device · Proprietary

Meta Ray-Ban (with Meta AI)

Meta

AIDB 90

Smart glasses with multimodal Meta AI for live look-and-ask.

Multimodal · Audio / Speech
Wearable · Proprietary

Veritone Public Sector

Veritone

AIDB 86

AI for evidence redaction, transcription and investigations for law enforcement.

Audio / Speech · Image Understanding
SaaS · Proprietary

Hyundai Pleos

Hyundai Motor

AIDB 83

AI-powered software-defined vehicle OS with voice and personalization.

Agents · Audio / Speech
Vehicle OS · Proprietary

Zoom AI Companion

Zoom

AIDB 87

AI assistant for meeting summaries, chat and email across Zoom.

Agents · Audio / Speech
SaaS · Proprietary

Cisco Webex AI Assistant

Cisco

AIDB 93

GenAI assistant for meetings, contact center and collaboration.

Agents · Audio / Speech
SaaS · Proprietary

Oracle Health Clinical AI Agent

Oracle

AIDB 93

Voice-enabled clinical documentation agent for clinicians.

Agents · Audio / Speech
SaaS · Proprietary

Amazon Connect AI

AWS

AIDB 95

GenAI for contact-center agents, self-service and analytics.

Agents · Audio / Speech
SaaS · Proprietary

AWS HealthScribe

AWS

AIDB 94

HIPAA-eligible service that generates clinical notes from patient conversations.

Audio / Speech · Text Generation
API · Proprietary

Customer Engagement Suite (CCAI)

Google Cloud

AIDB 93

Generative contact-center AI for virtual agents, agent assist and insights.

Agents · Audio / Speech
SaaS · Proprietary

Gong AI

Gong

AIDB 83

Revenue AI for call insights, forecasting and deal execution.

Agents · Audio / Speech
SaaS · Proprietary

Twilio CustomerAI

Twilio

AIDB 83

GenAI and predictive AI across Twilio messaging, voice and Segment.

Agents · Audio / Speech
Platform · Proprietary

Zendesk QA (Klaus)

Zendesk

AIDB 84

AutoQA AI that scores 100% of support conversations across voice and chat.

Reasoning · Audio / Speech
SaaS · Proprietary

Twilio Voice Intelligence

Twilio

AIDB 87

Speech-to-text, summaries and language operators that analyze every call in real time.

Audio / Speech · Reasoning
Platform · Proprietary

Twilio AI Assistants

Twilio

AIDB 87

Platform for building conversational AI agents over SMS, voice and WhatsApp, grounded in Segment data.

Agents · Audio / Speech
Platform · Proprietary

Dragon Copilot

Microsoft / Nuance

AIDB 95

Ambient AI scribe for clinicians that drafts notes and orders from doctor-patient conversations.

Audio / Speech · Text Generation
SaaS · Proprietary

Spotify AI DJ

Spotify

AIDB 88

Personalized AI DJ that curates and narrates listening sessions in a realistic voice.

Audio / Speech · Agents
SaaS · Proprietary

Reka Core

Reka AI

AIDB 81

Part of Reka's Core, Flash, and Edge series of multimodal language models, trained from scratch by Reka.

Audio / Speech · Code · Image Generation
Audio · Proprietary

GPT-4o (Aug 2024)

OpenAI

AIDB 91

OpenAI's flagship model that can reason across audio, vision, and text in real time.

Audio / Speech · Image Generation · Multimodal
Audio · Proprietary

GPT-4o (Nov 2024)

OpenAI

AIDB 92

OpenAI's flagship model that can reason across audio, vision, and text in real time.

Audio / Speech · Image Generation · Multimodal
Audio · Proprietary

Fugatto 1

NVIDIA

AIDB 91

Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs.

Audio / Speech · Multimodal · Text Generation
Audio · Proprietary

Gemini 2.0 Pro

Google DeepMind

AIDB 94

Experimental version of Gemini 2.0 Pro, released in response to feedback.

Audio / Speech · Code · Image Generation
Audio · Proprietary

GPT-4o (Jan 2025)

OpenAI

AIDB 92

OpenAI's flagship model that can reason across audio, vision, and text in real time.

Audio / Speech · Image Generation · Multimodal
Audio · Proprietary

Gemini 2.5 Pro (Mar 2025)

Google DeepMind

AIDB 92

Gemini 2.5 Pro Experimental, described by Google DeepMind as its most advanced model for complex tasks.

Audio / Speech · Code · Image Generation
Audio · Proprietary

GPT-4o (Mar 2025)

OpenAI

AIDB 95

OpenAI's flagship model that can reason across audio, vision, and text in real time.

Audio / Speech · Image Generation · Multimodal
Audio · Proprietary

Gemini 2.5 Pro (May 2025)

Google DeepMind

AIDB 95

Gemini 2.5 Pro Experimental, described by Google DeepMind as its most advanced model for complex tasks.

Audio / Speech · Code · Image Generation
Audio · Proprietary

Gemini 2.5 Pro (Jun 2025)

Google DeepMind

AIDB 95

Gemini 2.5 Pro Experimental, described by Google DeepMind as its most advanced model for complex tasks.

Audio / Speech · Code · Image Generation
Audio · Proprietary

Gemini 2.5 Deep Think

Google DeepMind

AIDB 94

Gemini variant built on Deep Think, a reasoning approach that blends parallel thinking techniques into response generation to tackle hard reasoning problems.

Audio / Speech · Code · Image Generation
Audio · Proprietary

Qwen3-Omni-30B-A3B

Alibaba

AIDB 87

Single multimodal model that, according to its authors, maintains state-of-the-art performance across text, image, audio, and video without degradation relative to single-modal counterparts.

Audio / Speech · Image Generation · Multimodal
Audio · Open Weights

Gemini Robotics-ER 1.5

Google DeepMind

AIDB 94

Google DeepMind's vision-language model (VLM) that reasons about the physical world, natively calls digital tools, and creates detailed, multi-step plans to complete a mission.

Audio / Speech · Image Generation · Text Generation
Audio · Proprietary

Seedance 2.0

ByteDance

AIDB 82

ByteDance's image, video, and audio generation model, tracked by Epoch and focused on video generation.

Audio / Speech · Image Generation · Video Generation
Audio · Proprietary

Gemini Flash 3.1 TTS

Google DeepMind

AIDB 94

Google DeepMind's audio model tracked by Epoch, focused on audio generation.

Audio / Speech
Audio · Proprietary

Hume EVI 3

Hume AI

AIDB 84

Empathic voice interface that perceives and generates emotional speech in real time.

Audio / Speech · Multimodal
Voice · Proprietary

Pi 3.0

Inflection AI

AIDB 86

Inflection's empathetic conversational assistant tuned for personal, supportive dialogue.

Text Generation · Audio / Speech
Text + Voice · Proprietary

Stable Audio 3 Medium

Stability AI

AIDB 87

Stability AI's 2B text-to-audio diffusion model for higher-capacity music, sound-effect generation, and audio editing.

Music · Audio / Speech
Audio · Proprietary

Qwen3.5 LiveTranslate Flash

Alibaba

AIDB 92

Alibaba's vision-enhanced real-time audio/video translation model for live multilingual interpretation across 60 languages.

Audio / Speech · Multimodal
Audio + Video + Text · Proprietary

Mirelo SFX 1.6

Mirelo AI

AIDB 82

Mirelo's text-to-sound-effects model for production-ready Foley, ambience, and SFX generation.

Audio / Speech
Audio · Proprietary

WavFlow

Meta

AIDB 92

Meta's audio generation model focused on high-fidelity waveform synthesis and speech-music co-generation.

Audio / Speech · Music
Audio · Open Weights

Dramabox

Resemble AI

AIDB 85

Resemble AI's expressive multi-character voice acting model for long-form dramatic dialogue and narration.

Audio / Speech
Audio · Proprietary

Stable Audio 3 Small SFX

Stability AI

AIDB 89

Stability AI's compact text-to-sound-effects diffusion model optimized for low-latency on-device SFX generation.

Audio / Speech
Audio · Open Weights

Stable Audio 3 Small Music

Stability AI

AIDB 93

Stability AI's compact text-to-music diffusion model tuned for short, license-friendly musical loops and stems.

Music
Audio · Open Weights
