Application Modernization

Making social robot conversations more natural with Speech-to-Text

August 7, 2023
https://storage.googleapis.com/gweb-cloudblog-publish/images/MIXI.max-2000x2000.jpg
Harumitsu Nobuta

Manager of Development Group, Romi Department, Vantage Studio, MIXI

Shinji Sakaguchi

SRE Group, CTO's office, Development Department, MIXI

MIXI, Inc. (MIXI) is a social networking organization that provides a diverse range of services for friends and family to enjoy together, such as the social-media platform mixi, a mobile game called Monster Strike, and a family photo and video sharing service known as FamilyAlbum. One of our current projects is Romi, a social robot launched in April 2021 that uses Speech-to-Text by Google Cloud as its speech recognition engine.

Since the late 2010s, the social robot market has been booming, with some models becoming increasingly affordable for consumers, from robotic tutors that promote social and cognitive development for children to companion robots for elderly care. What sets Romi apart from most social robots is the quality of its dialogue.

Romi’s biggest feature is that AI developed in-house by MIXI generates natural conversational exchanges. The size of a hand-held device, Romi can be placed anywhere in a room and has a screen that displays different facial expressions. It responds to conversation in context. Until now, AI has been used to interpret the intentions behind user speech, but Romi goes a step further and generates spoken conversation itself. After all, Romi was created to offer heartwarming communication to those who are looking for it. This form of speech recognition did not exist before Romi was released. We hope users will enjoy conversing with it, including the occasional unexpected response.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_MIXI.max-1800x1800.jpg

Speech recognition was one of the most critical aspects of Romi. Most of the infrastructure behind Romi runs on a primary public cloud that was already being used for other services at the time. For speech recognition, however, we decided to try Speech-to-Text by Google Cloud, which was praised for its overwhelmingly high accuracy, and the prototype’s results were very positive. We also tried other companies' services before making the final decision, but our conclusion about Speech-to-Text remained the same.

The accuracy and responsiveness of Speech-to-Text made it an effective tool for a social robot like Romi. Google Cloud’s high reliability, demonstrated in running Romi’s workloads, also gives us confidence that it can support the continuous development of Romi’s services over the long run.

With the rapid development of speech recognition technology, MIXI decided to re-examine the speech recognition engine for Romi in June 2022, about a year after its release. We reviewed about 10 companies' Japanese-compatible speech recognition engines, found that Speech-to-Text offered the best results, and decided to continue using it. In addition, Speech-to-Text offers several transcription models, and we found that the latest short model, which specializes in short utterances, is more suitable for Romi than the default model.
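As a rough illustration of what selecting the short-utterance model can look like, here is a minimal sketch that builds a request body for the Speech-to-Text v1 REST `speech:recognize` method using only the Python standard library. The `latest_short` model identifier corresponds to the "latest short" model mentioned above. The `ja-JP` language code, the 16 kHz LINEAR16 audio format, the helper name `build_recognize_request`, and the placeholder audio are all assumptions made for this example, not details of Romi's actual implementation.

```python
import base64
import json


def build_recognize_request(audio_bytes: bytes, model: str = "latest_short") -> dict:
    """Build a JSON-serializable body for the v1 speech:recognize REST method.

    Hypothetical helper for illustration; audio settings are assumed values.
    """
    return {
        "config": {
            "encoding": "LINEAR16",      # assumed: raw 16-bit PCM audio
            "sampleRateHertz": 16000,    # assumed sample rate
            "languageCode": "ja-JP",     # assumed: Japanese
            "model": model,              # short-utterance model instead of the default
        },
        # Inline audio must be base64-encoded in the JSON body.
        "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }


# One second of silence as placeholder audio (16000 samples x 2 bytes).
request = build_recognize_request(b"\x00\x00" * 16000)
print(json.dumps(request["config"], indent=2))
```

This sketch uses a one-shot request for simplicity. A conversational device that listens continuously would more likely use a streaming request, which accepts the same `model` setting.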

The cost savings that Speech-to-Text delivers are also impressive. In November, the billing unit changed from 15-second increments, rounded up, to one-second increments, and we can expect huge cost reductions for Romi. This matters to us because, to achieve more natural conversations, Romi does not use a trigger phrase such as “OK Google.” As a result, it recognizes and processes more speech than other social robots. While this makes for a more user-friendly experience, it also means a heavier workload and can lead to higher speech recognition costs. With the updated billing for Speech-to-Text, we can continue refining Romi’s speech recognition accuracy while keeping costs low.
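To see why the billing change matters for short utterances, consider the following small sketch. It compares the billed audio time under 15-second and one-second increments. The utterance lengths are hypothetical values chosen for illustration, not Romi data, and the sketch assumes each request is billed separately and rounded up to the next increment.

```python
import math


def billed_seconds(duration_s: float, increment_s: int) -> int:
    """Round one request's audio duration up to the next billing increment."""
    return math.ceil(duration_s / increment_s) * increment_s


# Hypothetical utterance lengths in seconds (illustrative only).
utterances = [1.2, 2.8, 4.0, 0.7, 3.5]

old_total = sum(billed_seconds(d, 15) for d in utterances)  # 15-second increments
new_total = sum(billed_seconds(d, 1) for d in utterances)   # one-second increments

print(f"15-second increments: {old_total} s billed")  # 75 s
print(f"1-second increments:  {new_total} s billed")  # 14 s
```

In this example, five short utterances totalling 12.2 seconds of audio are billed as 75 seconds under the old scheme and 14 seconds under the new one. The shorter the typical utterance, the larger the gap.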

Improving data analysis with BigQuery

Google Cloud was initially used only for speech recognition, but as Romi’s range of services expanded, more parts of Romi came to be hosted on Google Cloud. Among them, the machine learning platform for AI was moved to Google Cloud at an early stage. Being able to use a cloud platform at an affordable cost makes Google Cloud very appealing. Premium Support and technical account management helped us with our cost considerations.

Furthermore, MIXI started migrating the data analysis platform for Romi to BigQuery last year. We chose BigQuery because it excels at bringing together and analyzing big data in various formats, and in-depth data analysis is becoming necessary to improve Romi’s services. Another attraction is that BigQuery can be queried with structured query language (SQL), which the MIXI development team is already familiar with.

In particular, we are grateful for tools like Looker. Writing complex queries takes a lot of work, even for engineers, but with Looker, even non-engineers can intuitively perform fairly complex analysis. About half a year ago, we began holding regular briefings, mainly for employees interested in data analysis, and now they voluntarily run analyses, discuss the results, and create new projects and ideas. This has become a regular workflow for us.

A current trend in AI-based communication is the emergence of large language models (LLMs), which learn from huge amounts of data and generate natural responses on a different level than before.

To improve the conversational experience with Romi, we have been looking into relevant LLM technologies for a while now. To run proofs of concept (PoCs) quickly, it is important to be able to use high-performance GPUs as inexpensively as possible. We will continue to focus on Google Cloud services, including Compute Engine and Vertex AI.
