It’s been five years since we launched the Google Cloud Speech-to-Text (STT) API, and we’re awed by the things our customers have done. From powering voice-controlled apps to generating captions for videos, the API processes more than 1 billion minutes of spoken language each month. At normal speaking speeds, that’s enough to read the entirety of the Oxford English Dictionary (obsolete words included) aloud more than half a million times.
“With voice poised to become the next major disruption in human-computer interaction, technologies like Google’s Cloud Speech API are becoming increasingly important to enterprises looking to keep pace with changing consumer behaviors and expectations. In partnership with DeepMind and Google Brain, Google continues to invest in this space and bring new innovations to the market that enable organizations to quickly and easily add voice components to their consumer-facing applications,” says Ritu Jyoti, group vice president, AI and Automation Research Practice at IDC.
Familiar use cases, like giving instructions to a smartphone assistant or watching text appear as someone speaks during a video meeting, are just the beginning, with customers making more advanced and creative uses of these AI technologies each day. Once you can accurately transcribe and understand spoken language at scale, you can layer on a variety of other AI services and applications to create more engaging experiences or deeper insights from this data.
To explore new frontiers in this technology, and illustrate how your business might do more with voice, let’s examine some of the novel ways Google Cloud customers are using the Speech API, from creating better sales experiences to building friendly robots.
Moving from speech to insights and sales: InteractiveTel
Phone calls are a significant source of leads and sales for automobile dealers, but historically, dealers have struggled to collect and act on call data, in some cases failing to call back the majority of would-be buyers. Leaders at InteractiveTel, a provider of cloud-based telephony applications that help improve customer service and boost sales, recognized that AI could eliminate these challenges.
They envisioned voice data as an opportunity to provide dealers with real-time insights for more productive conversations, more reliable follow up, and ultimately, more robust sales. Early in its history, however, InteractiveTel relied on speech recognition technologies that produced inconsistent results.
This led the company to become one of the first STT API customers when the product was released in 2017. Almost immediately, the company saw a 30% improvement in transcription accuracy, and the service has grown more advanced and reliable ever since.
“The biggest KPI that speaks to our platform’s power is retention,” said co-founder Gary Graves. “We have a 96% retention rate.”
Graves noted that the Google Cloud Speech API is central to this success. “Without it, we’re just vanilla ice cream,” he stated. “When we first started, we baked the Cloud Speech API into our core. Every discussion has to be transcribed with the API, and generating that data in near real-time creates a foundation for richer services.”
For example, if a customer calls about a specific vehicle that is not available, InteractiveTel surfaces alerts for the dealer as the conversation happens, letting them know whether a similar vehicle will soon be in stock. The platform also knows if the customer has had past interactions, such as appointments at the dealership, and even includes sentiment analysis to detect events like disagreements between a customer and salesperson that may require a sales manager to join the call.
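To give a flavor of how this kind of pipeline can be wired up (InteractiveTel’s actual implementation isn’t public), here is a minimal sketch of near real-time transcription using the STT API’s streaming recognition and the google-cloud-speech Python client. The audio_chunks() generator, the telephony sample rate, and the phone_call model choice are illustrative assumptions.

```python
# Minimal sketch: streaming phone-call audio to the Speech-to-Text API for
# near real-time transcripts. Assumes the google-cloud-speech Python client
# and a hypothetical audio_chunks() generator fed by the telephony system.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,          # typical telephony sample rate
    language_code="en-US",
    model="phone_call",              # model tuned for phone audio
    enable_automatic_punctuation=True,
)
streaming_config = speech.StreamingRecognitionConfig(
    config=config,
    interim_results=True,            # partial results while the caller is still speaking
)

def audio_chunks():
    """Hypothetical generator yielding raw LINEAR16 audio bytes from the live call."""
    yield from []  # replace with the real telephony audio source

requests = (
    speech.StreamingRecognizeRequest(audio_content=chunk)
    for chunk in audio_chunks()
)

# Transcripts stream back while the call is still in progress.
for response in client.streaming_recognize(config=streaming_config, requests=requests):
    for result in response.results:
        if result.is_final:
            print(result.alternatives[0].transcript)
```

In a setup like this, the final transcripts would feed the downstream services described above, such as inventory alerts, CRM lookups, and sentiment analysis.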
“The API is pretty low maintenance,” according to Graves. “It has scaled with the company, keeping up with velocity and never causing a bottleneck.”
“I’m data driven. We tested everything out there at the time,” he added. “Google works best. Other providers reach out every six months or so, and I always tell them, ‘Try again in six months.’ That’s been happening for years.”
Fostering childhood development with a robot friend: Embodied
While InteractiveTel’s platform speaks to trends in the business world, Embodied’s Moxie robot shows how Speech AI can impact social-emotional learning, from hospitals to the home. Designed for continuous conversations, not just predefined prompts and responses, Moxie encourages children to interact with it as they might with a friend. For example, if a child says, “I like space,” Moxie can automatically shift into a conversation filled with astronomical facts, or if a child reads a book from Moxie’s Book Club, the robot can lead a targeted question and discussion session after reading.
Though a fun way for all children to work on social, emotional, and critical thinking skills, Moxie has been particularly promising for children facing adversity, from social isolation to difficulty making friends. Some parents of children with developmental disorders have shared promising feedback about their children’s social-emotional development after spending time with Moxie. The robot can discern whom to address and how to proactively engage, using subtle eye gaze signals, facial expressions, and body language as part of its response to create a lifelike, believable AI friend that can build rapport with a child.
“We want to empower parents to help children with technology,” said Paolo Pirjanian, Embodied’s founder and CEO. A former NASA scientist who previously served as CTO of iRobot, Pirjanian noted that though the market for interactive robots is in “early innings,” he’s encouraged by the reception to Moxie. The robot “provides a non-judgmental space that helps kids to share hard feelings and encourages engagement with friends and family and the world around them,” he said.
A number of AI technologies enable Moxie’s multi-modal interactions, as well as the accompanying app for parents. Computer vision technologies, for example, help to decipher a child’s body language. But as with InteractiveTel, the Cloud Speech API is the starting place for interactions, as the robot cannot tap into resources appropriate to the situation if it cannot accurately understand the child in the first place.
When Speech meets CRM: HubSpot
HubSpot is also using speech-derived data for insights, through its Conversation Intelligence products. HubSpot customers can use AI to automatically take notes in meetings, for example, and connect voice data to CRM data to measure trends, identify changes in market dynamics, and even unlock coaching opportunities.
To offer Conversation Intelligence, HubSpot uses a proprietary stack of several models built atop the STT API. HubSpot leverages a variety of the API’s features, from contextual biasing to speaker tagging, said Ian Leaman, Senior Product Manager, AI, at HubSpot.
“It had the best word error rate, and it was plug and play while still giving us the freedom to mess around and find the best configurations, as we figured out which models work best for different segments of our customer base,” he added. “It’s helped us to support happy customers, achieve faster dev times, and support more languages.”
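HubSpot’s production stack is proprietary, but the two API features Leaman mentions, contextual biasing and speaker tagging, can be sketched with the google-cloud-speech Python client. The phrase list, boost value, Cloud Storage path, and model choice below are illustrative assumptions, not HubSpot’s actual configuration.

```python
# Sketch: biasing recognition toward domain vocabulary and tagging speakers
# in a recorded meeting, using the google-cloud-speech Python client.
from google.cloud import speech

client = speech.SpeechClient()

# Contextual biasing: boost terms a generic model might otherwise miss.
speech_contexts = [
    speech.SpeechContext(
        phrases=["HubSpot", "CRM", "pipeline review"],  # hypothetical domain terms
        boost=15.0,
    )
]

# Speaker tagging: label which meeting participant spoke each word.
diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=4,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=speech_contexts,
    diarization_config=diarization_config,
    model="video",  # one option; the best model varies by customer segment
)

# Hypothetical recording uploaded to Cloud Storage.
audio = speech.RecognitionAudio(uri="gs://example-bucket/meeting.flac")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=600)

# With diarization enabled, the last result carries speaker-tagged words.
for word in response.results[-1].alternatives[0].words:
    print(f"speaker {word.speaker_tag}: {word.word}")
```

Swapping phrase lists, boost values, and models per customer segment is the kind of configuration tuning Leaman describes above.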
Conversations enable richer AI experiences and services
As these stories attest, speech AI technologies are powerful in their own right, but they are also an important starting point for more advanced and ambitious use cases that combine multiple AI services into never-before-seen experiences. Five years ago, many of the customer stories we see today would have seemed more aspirational than feasible, and we expect that, half a decade from now, we’ll continue to be humbled by the ways AI changes how we interact with machines and even one another. To learn more about Google Cloud’s Speech API, click here.