Making social robot conversations more natural with Speech-to-Text

By mullaned2002

August 7, 2023

MIXI, Inc. (MIXI) is a social networking organization that provides a diverse range of services for friends and family to enjoy together, such as the social-media platform mixi, a mobile game called Monster Strike, and a family photo and video sharing service known as FamilyAlbum. One of our current projects is Romi, a social robot launched in April 2021 that uses Speech-to-Text by Google Cloud as its speech recognition engine.

Since the late 2010s, the social robot market has been booming, and models ranging from robotic tutors that promote children's social and cognitive development to companion robots for elderly care have become increasingly affordable for consumers. What sets Romi apart from most social robots is the quality of its dialogue.

Romi's biggest feature is that AI developed in-house by MIXI generates natural conversational exchanges. About the size of a hand-held device, Romi can be placed anywhere in a room and has a screen that displays different facial expressions. It responds to conversation in context. Until now, AI has been used to interpret the intentions behind user speech, but Romi takes this a step further by generating spoken conversation itself. After all, Romi was created to offer heartwarming communication to those who are looking for it. This form of speech recognition did not exist before Romi was released. We hope users will enjoy conversing with it, including the occasional unexpected response.

Speech recognition was one of the most critical aspects of Romi. Most of Romi's infrastructure runs on the public cloud that MIXI was already using for its other services at the time. For speech recognition, we decided to try Speech-to-Text by Google Cloud, which was praised for its overwhelmingly high accuracy, and the prototype's results were very positive. We also tried other companies' services before making the final decision, but our conclusion about Speech-to-Text remained the same.

The accuracy and responsiveness of Speech-to-Text make it an effective tool for a social robot like Romi. Google Cloud also gives us a sense of security: its high reliability has been demonstrated in running Romi's workloads, and it will be able to support the continuous development of Romi's services over the long run.

With speech recognition technology developing rapidly, MIXI re-examined the speech recognition engine for Romi in June 2022, about a year after its release. We reviewed about 10 companies' Japanese-compatible speech recognition engines and found that Speech-to-Text offered the best results, so we decided to continue using it. Speech-to-Text also offers several transcription models, and we found that the latest short model, which specializes in short utterances, is more suitable for Romi than the default model.
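As a rough illustration of the model choice described above, the sketch below builds a request body for the Speech-to-Text v1 REST `speech:recognize` method with `model` set to `latest_short` and the language set to Japanese. The audio encoding, the sample rate, and the helper name `build_recognize_request` are assumptions made for illustration and do not describe Romi's actual configuration. Authentication and the HTTP call itself are omitted.

```python
import base64
import json


def build_recognize_request(audio_bytes: bytes, model: str = "latest_short") -> dict:
    """Build a JSON-serializable body for the Speech-to-Text v1 speech:recognize method.

    The helper name and the audio settings are hypothetical; only the field
    names follow the v1 REST API.
    """
    return {
        "config": {
            "encoding": "LINEAR16",     # assumption: 16-bit linear PCM audio
            "sampleRateHertz": 16000,   # assumption: 16 kHz capture
            "languageCode": "ja-JP",    # Japanese
            "model": model,             # "latest_short" targets short utterances
        },
        "audio": {
            # The REST API expects inline audio as a base64-encoded string.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }


if __name__ == "__main__":
    placeholder_audio = b"\x00\x00" * 1600  # 0.1 s of silence at 16 kHz, 16-bit
    request_body = build_recognize_request(placeholder_audio)
    print(json.dumps(request_body["config"], indent=2))
```

Passing `model="default"` to the same helper would produce the request for the default model, which makes a side-by-side comparison of the two models on the same audio straightforward.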

The cost savings that Speech-to-Text delivers are also impressive. In November, the billing unit changed from 15-second increments, rounded up, to one-second increments, so we can expect a large cost reduction for Romi. This matters to us because Romi does not use trigger phrases such as “OK Google,” which makes its conversations more natural. As a result, it recognizes and processes more speech than other social robots. While this makes for a more user-friendly experience, it also creates heavier workloads and can incur higher speech recognition costs. With the updated billing system, we are able to continue refining Romi's speech recognition accuracy while keeping costs low.
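To see why the billing change matters for short utterances, here is a minimal arithmetic sketch. It assumes that each request is billed separately and rounded up to the next whole increment; the utterance durations and the function name `billed_seconds` are hypothetical and are not Romi's real usage data.

```python
import math


def billed_seconds(utterance_durations, increment_seconds):
    """Total billed seconds when each request is rounded up to the next increment."""
    return sum(
        math.ceil(duration / increment_seconds) * increment_seconds
        for duration in utterance_durations
    )


if __name__ == "__main__":
    # Hypothetical short utterances, in seconds.
    durations = [2.0, 3.5, 1.2]

    old_total = billed_seconds(durations, 15)  # 15-second increments, rounded up
    new_total = billed_seconds(durations, 1)   # one-second increments

    print(f"15-second increments: {old_total} s")  # 15-second increments: 45 s
    print(f"1-second increments: {new_total} s")   # 1-second increments: 8 s
```

Under these assumptions, three short utterances that were billed as 45 seconds are billed as 8 seconds, which is why a robot that listens to many brief utterances benefits more than a workload made up of long recordings.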

Improving data analysis with BigQuery

Initially, we used Google Cloud only for speech recognition, but as Romi's range of services expanded, we hosted more of Romi on Google Cloud. Among these components, the machine learning platform for Romi's AI moved to Google Cloud at an early stage. Being able to use a cloud platform at an affordable cost makes Google Cloud very appealing, and Premium Support and technical account management helped us with our cost considerations.

Furthermore, MIXI started migrating the data analysis platform for Romi to BigQuery last year. We chose BigQuery because it excels at bringing together and analyzing big data in various formats, and in-depth data analysis is becoming necessary to improve Romi's services. Another attraction is that BigQuery can be queried with structured query language (SQL), a language the MIXI development team is already familiar with.

In particular, we are grateful for tools like Looker. Writing complex queries takes a lot of work, even for engineers, but with Looker, even non-engineers can intuitively perform fairly complex analysis. About half a year ago, we began holding regular briefings, mainly for employees interested in data analysis. Now they voluntarily run analyses, discuss the results, and create new projects and ideas. This has become a regular workflow for us.

Currently, the major trend in AI-based communication is the emergence of large language models (LLMs), which learn from huge amounts of data and generate natural responses on a different level than before.

To improve the conversational experience with Romi, we have been looking into relevant LLM technologies for a while now. To run proofs of concept (PoCs) quickly, it is important to be able to use high-performance GPUs as inexpensively as possible. We will continue to focus on Google Cloud services, including Compute Engine and Vertex AI.

Cloud BlogRead More
