Category · 59 models

Real-Time Call Transcription & Summary

Live captions, speaker labels, and an action-item digest by the time the call ends.

What it is

Streaming automatic speech recognition (ASR) with speaker diarization, followed by an LLM pass that extracts decisions, owners and next steps. Used for sales, support and meetings.
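As a rough illustration of the pipeline described above, the sketch below assumes the streaming ASR has already emitted diarized segments. The `Segment` type, the `merge_turns` and `build_digest_prompt` helpers, and the prompt wording are all hypothetical and not tied to any model listed on this page. It merges consecutive segments from the same speaker into turns, then builds the text an LLM pass could use to extract decisions, owners and next steps.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized ASR result (hypothetical shape, not a vendor schema)."""
    speaker: str
    start: float  # seconds from call start
    end: float    # seconds from call start
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into a single turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(
                last.speaker,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(seg)
    return turns


def build_digest_prompt(turns: list[Segment]) -> str:
    """Format turns as a speaker-labelled transcript preceded by an instruction."""
    transcript = "\n".join(
        f"[{t.start:.1f}s] {t.speaker}: {t.text}" for t in turns
    )
    # The instruction wording is an assumption; tune it for the LLM in use.
    return "Extract decisions, owners and next steps from this call.\n\n" + transcript
```

The returned string would then be sent to whichever LLM performs the summary pass; that call is left out because it depends on the provider.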

Real-world examples

· Auto-summarize a Zoom sales call into the CRM
· Caption a multilingual all-hands in real time
· Extract objections from 200 support calls per day

What to look for

· Sub-300 ms streaming latency
· Diarization + accent robustness
· Native CRM / helpdesk integration
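One way to check the latency criterion above is to log, for each streamed result, when the corresponding audio ended and when the transcript arrived, then compare a high percentile against the budget. The sketch below is an assumption-laden example: the page does not define how latency is measured, so the "audio end to result received" definition, the 95th-percentile choice, and the helper names are all hypothetical.

```python
import math

LATENCY_BUDGET_MS = 300.0  # taken from the "sub-300 ms" criterion


def latencies_ms(events: list[tuple[float, float]]) -> list[float]:
    """Convert (audio_end_s, result_received_s) pairs to latencies in ms.

    Both timestamps are assumed to come from the same clock.
    """
    return [(received - audio_end) * 1000.0 for audio_end, received in events]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


def meets_budget(
    events: list[tuple[float, float]],
    pct: float = 95.0,
    budget_ms: float = LATENCY_BUDGET_MS,
) -> bool:
    """True if the chosen latency percentile is under the budget."""
    return percentile(latencies_ms(events), pct) < budget_ms
```

Using a percentile rather than the mean keeps a few slow results from being hidden by many fast ones, which matters for live captions.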

59 models in this category

GPT-4o

OpenAI

AIDB91

Real-time omni-model handling text, vision and voice in a single network.

Multimodal · Audio / Speech · Image Understanding

Text + Image + Audio · Proprietary

Veo 3

Google DeepMind

AIDB93

High-fidelity video generation with native synchronised audio.

Video Generation · Audio / Speech

Video + Audio · Proprietary

Whisper v3

OpenAI

AIDB91

Open multilingual speech recognition and translation model.

Audio / Speech

Audio · Open

ElevenLabs v3

ElevenLabs

AIDB88

Best-in-class expressive TTS and voice cloning across 70+ languages.

Audio / Speech

Audio · Proprietary

NotebookLM

Google

AIDB93

Source-grounded research assistant with audio overviews.

Text Generation · Audio / Speech

Text + Audio · Proprietary

TTS-1 / GPT-4o Voice

OpenAI

AIDB92

OpenAI text-to-speech voices via the audio API.

Audio / Speech

Audio · Proprietary

Chirp 3

Google

AIDB93

High-fidelity expressive TTS voices on Google Cloud.

Audio / Speech

Audio · Proprietary

Seamless M4T v2

HeyGen Avatar IV

HeyGen

AIDB86

AI avatar video generator for marketing and training.

Video Generation · Audio / Speech

Video · Proprietary

Cartesia Sonic

Cartesia

AIDB83

Ultra-low-latency state-space TTS model.

Audio / Speech

Audio · Proprietary

PlayHT 3.0

PlayHT

AIDB81

Conversational TTS optimised for AI agents.

Audio / Speech

Audio · Proprietary

Resemble AI

AIDB85

Voice cloning and real-time speech synthesis platform.

Audio / Speech

Audio · Proprietary

Deepgram Nova-3

Deepgram

AIDB85

Production-grade streaming speech-to-text model.

Audio / Speech

Audio · Proprietary

AssemblyAI Universal-2

AssemblyAI

AIDB83

Highly accurate speech recognition with rich audio intelligence.

Audio / Speech

Audio · Proprietary

Moonshine

Useful Sensors

AIDB86

Open ASR model optimised for real-time edge inference.

Audio / Speech

Audio · Open

Duolingo Max

Duolingo

AIDB83

AI-powered language tutoring features.

Text Generation · Audio / Speech

Text + Audio · Proprietary

Google Tensor G5

Google

AIDB92

Pixel SoC powering on-device Gemini Nano features.

Multimodal · Audio / Speech

On-device · Proprietary

Samsung Galaxy AI

Samsung

AIDB87

Suite of on-device + cloud AI features for Galaxy phones (translate, edit, summarize).

Multimodal · Audio / Speech · Image Generation

Hybrid · Proprietary

BMW Intelligent Personal Assistant

BMW

AIDB84

Voice-first in-car AI assistant integrating Alexa LLM features.

Audio / Speech · Agents

In-vehicle · Proprietary

Rabbit R1

Rabbit

AIDB83

Pocket AI device built around the Large Action Model paradigm.

Agents · Audio / Speech

Device · Proprietary

Meta Ray-Ban (with Meta AI)

Veritone Public Sector

Veritone

AIDB86

AI for evidence redaction, transcription and investigations for law enforcement.

Audio / Speech · Image Understanding

SaaS · Proprietary

Hyundai Pleos

Hyundai Motor

AIDB83

AI-powered software-defined vehicle OS with voice and personalization.

Agents · Audio / Speech

Vehicle OS · Proprietary

Zoom AI Companion

Zoom

AIDB87

AI assistant for meeting summaries, chat and email across Zoom.

Agents · Audio / Speech

SaaS · Proprietary

Cisco Webex AI Assistant

Cisco

AIDB93

GenAI assistant for meetings, contact center and collaboration.

Agents · Audio / Speech

SaaS · Proprietary

Oracle Health Clinical AI Agent

Oracle

AIDB93

Voice-enabled clinical documentation agent for clinicians.

Agents · Audio / Speech

SaaS · Proprietary

Amazon Connect AI

AWS

AIDB95

GenAI for contact-center agents, self-service and analytics.

Agents · Audio / Speech

SaaS · Proprietary

AWS HealthScribe

AWS

AIDB94

HIPAA-eligible service that generates clinical notes from patient conversations.

Audio / Speech · Text Generation

API · Proprietary

Customer Engagement Suite (CCAI)

Google Cloud

AIDB93

Generative contact-center AI for virtual agents, agent assist and insights.

Agents · Audio / Speech

SaaS · Proprietary

Gong AI

Gong

AIDB83

Revenue AI for call insights, forecasting and deal execution.

Agents · Audio / Speech

SaaS · Proprietary

Twilio CustomerAI

Twilio

AIDB83

GenAI and predictive AI across Twilio messaging, voice and Segment.

Agents · Audio / Speech

Platform · Proprietary

Zendesk QA (Klaus)

Zendesk

AIDB84

AutoQA AI that scores 100% of support conversations across voice and chat.

Reasoning · Audio / Speech

SaaS · Proprietary

Twilio Voice Intelligence

Twilio

AIDB87

Speech-to-text, summaries and language operators that analyze every call in real time.

Audio / Speech · Reasoning

Platform · Proprietary

Twilio AI Assistants

Twilio

AIDB87

Build conversational AI agents over SMS, voice and WhatsApp grounded in Segment data.

Agents · Audio / Speech

Platform · Proprietary

Dragon Copilot

Microsoft / Nuance

AIDB95

Ambient AI scribe for clinicians that drafts notes and orders from doctor-patient conversations.

Audio / Speech · Text Generation

SaaS · Proprietary

Spotify AI DJ

Spotify

AIDB88

Personalized AI DJ that curates and narrates listening sessions in a realistic voice.

Audio / Speech · Agents

SaaS · Proprietary

Reka Core

Reka AI

AIDB81

We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka.

Audio / Speech · Code · Image Generation

Audio · Proprietary

GPT-4o (Aug 2024)

OpenAI

AIDB91

We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.

Audio / Speech · Image Generation · Multimodal

Audio · Proprietary

GPT-4o (Nov 2024)

OpenAI

AIDB92

We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.

Audio / Speech · Image Generation · Multimodal

Audio · Proprietary

Fugatto 1

NVIDIA

AIDB91

Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs.

Audio / Speech · Multimodal · Text Generation

Audio · Proprietary

Gemini 2.0 Pro

Google DeepMind

AIDB94

Experimental version of Gemini 2.0 Pro, released in response to earlier feedback.

Audio / Speech · Code · Image Generation

Audio · Proprietary

GPT-4o (Jan 2025)

OpenAI

AIDB92

We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.

Audio / Speech · Image Generation · Multimodal

Audio · Proprietary

Gemini 2.5 Pro (Mar 2025)

Google DeepMind

AIDB92

Gemini 2.5 Pro Experimental is our most advanced model for complex tasks.

Audio / Speech · Code · Image Generation

Audio · Proprietary

GPT-4o (Mar 2025)

OpenAI

AIDB95

We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.

Audio / Speech · Image Generation · Multimodal

Audio · Proprietary

Gemini 2.5 Pro (May 2025)

Google DeepMind

AIDB95

Gemini 2.5 Pro Experimental is our most advanced model for complex tasks.

Audio / Speech · Code · Image Generation

Audio · Proprietary

Gemini 2.5 Pro (Jun 2025)

Google DeepMind

AIDB95

Gemini 2.5 Pro Experimental is our most advanced model for complex tasks.

Audio / Speech · Code · Image Generation

Audio · Proprietary

Gemini 2.5 Deep Think

Google DeepMind

AIDB94

To advance Gemini’s capabilities towards solving hard reasoning problems, we developed a novel reasoning approach, called Deep Think, that naturally blends in parallel thinking techniques during response generation.

Audio / Speech · Code · Image Generation

Audio · Proprietary

Qwen3-Omni-30B-A3B

Alibaba

AIDB87

We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts.

Audio / Speech · Image Generation · Multimodal

Audio · Open Weights

Gemini Robotics-ER 1.5

Google DeepMind

AIDB94

Our most capable vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission.

Audio / Speech · Image Generation · Text Generation

Audio · Proprietary
