Bringing the power of large models to Google Cloud’s Speech API

By mullaned2002

May 19, 2023

As voice becomes an increasingly popular touchpoint between businesses and customers, our Speech-to-Text (STT) API has been one of the fastest-growing APIs from Google Cloud. The API processes more than 1 billion voice minutes per month for our enterprise customers, across a range of industries, with near-human levels of understanding for many commonly spoken languages.

Many companies are using speech services from Google Cloud to power next-generation products and customer experiences. HubSpot is using STT for their Conversational Intelligence tools, MRV uses the API to reduce customer service time by a third, and Spotify is leveraging STT for their voice interface, Car Thing.

Our goal is to provide users with the highest possible quality speech recognition for their use case. At Google Cloud, we continue to partner with our colleagues in Google Research and beyond to improve quality and develop new types of models. Today, that means we’re bringing the power of large models to our Speech API and into the hands of developers.

In March of this year, Google published research on progress towards a Universal Speech Model. Last week at Google I/O, we announced that we are bringing a new version of the Universal Speech Model, Chirp, to Cloud. Chirp will serve as a foundation model for Speech AI in Google Cloud. Today we are excited to dive deeper into how we are now applying the power of large models to our Speech API with Chirp.

Chirp is Google Cloud’s 2B-parameter speech model built via self-supervised training on millions of hours of audio and 28 billion sentences of text spanning 100+ languages. Chirp delivers 98% speech recognition accuracy in English and over 300% relative improvement in several languages with fewer than 10 million speakers.
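For readers who want to put a figure like "98% accuracy" in context, here is a minimal sketch of one common way recognition quality is measured: word error rate (WER), the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. This post does not say which metric sits behind its numbers, so treating accuracy as 1 minus WER is an assumption, and the function name `word_error_rate` is purely illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: accuracy is read as 1 - WER. The post does not state
    how its accuracy figures were computed.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the quick brown fox jumps",
                          "the quick brown box jumps")
    print(f"WER: {wer:.2f}, accuracy under this assumption: {1 - wer:.2f}")
```

Under this reading, 98% accuracy would correspond to roughly two word errors per 100 reference words.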

Chirp is not only larger than previous speech models, but also incorporates new training approaches. Chirp’s encoder was first trained with millions of hours of unsupervised (i.e., unlabeled) audio data from 100+ languages. The model was then fine-tuned for transcription in each specific language with small amounts of supervised data. This contrasts with traditional speech recognition techniques that focus on large amounts of language-specific supervised data. These techniques help Chirp to achieve such large quality improvements in languages and accents with very few speakers and small amounts of labeled training data. By adding Chirp to Cloud, we are thrilled to bring the quality of speech recognition for more languages and accents closer to that of the most widely spoken languages.

In collaboration with the Internet Archive’s TV News Archive, the GDELT Project is applying Google Cloud’s Speech-to-Text and Translation APIs to transcribe and translate television news from around the world, enabling researchers and journalists to understand and cite local events from local sources across a wide range of languages and dialects. “Television news is a major source of information for societies around the world, but the lack of searchable and translatable transcripts has largely rendered it inaccessible. Through the combination of Speech-to-Text and Translation AI from Google Cloud, GDELT to date has transcribed and translated more than 66,000 broadcasts totaling more than 328 million words. With the release of Google’s new Chirp speech model, we are now able to improve the accuracy of those transcriptions and dramatically expand the set of languages we can explore, greatly expanding our reach across the world,” said Kalev Leetaru, Founder of the GDELT Project.

We are excited to see how other companies will use Chirp to enable new Speech AI use cases across a variety of languages. Chirp is available now, in Preview, in the Speech-to-Text API. See our documentation and get started with the Speech-to-Text console today.
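As a rough illustration of what selecting Chirp in an API call might look like, the sketch below builds (but does not send) a recognize request using only the Python standard library. Everything specific in it is an assumption not stated in this post: the v2 REST endpoint shape, the `us-central1` location, the `chirp` model identifier, the field names in the request body, and the helper name `build_chirp_recognize_request`. Check the Speech-to-Text documentation for the authoritative request format before relying on any of it.

```python
import base64
import json


def build_chirp_recognize_request(project_id: str,
                                  audio_bytes: bytes,
                                  language_code: str = "en-US",
                                  location: str = "us-central1"):
    """Return (url, body) for a hypothetical Speech-to-Text v2 recognize call.

    Assumptions (not from the post): the endpoint layout, the default
    recognizer path ending in "_", the field names, and the "chirp"
    model identifier. Verify all of them against the official docs.
    """
    recognizer = f"projects/{project_id}/locations/{location}/recognizers/_"
    url = f"https://{location}-speech.googleapis.com/v2/{recognizer}:recognize"
    body = {
        "config": {
            # Ask the service to detect the audio encoding itself.
            "autoDecodingConfig": {},
            "languageCodes": [language_code],
            "model": "chirp",
        },
        # Inline audio is sent base64-encoded in a JSON request.
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return url, body


if __name__ == "__main__":
    # "my-project" and the audio bytes are placeholders for illustration.
    url, body = build_chirp_recognize_request("my-project", b"fake-audio")
    print(url)
    print(json.dumps(body, indent=2))
```

Sending the request would additionally require an authenticated HTTP client, which is left out here to keep the sketch self-contained.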

We’re excited to keep investing in our pre-trained Speech API, making it even stronger so that developers can leverage the power of voice for their businesses, programs, and applications.
