Robots in movies seem to have an endless capability to converse with people. While this capability seems far-fetched, this post will demonstrate how I created a conversational interface for my Raspberry Pi-based robot. I wanted to be able to call out its name, ask it questions, and have the robot respond with a spoken answer or perform an action. All code for this post is part of the overall project to implement the InMoov robot using ROS.
In order for the robot to have this conversation, it has to do the following:
- Listen and detect when its name is spoken. This is called hotword detection. I've elected to name my robot Jarvis after the computer assistant used by Tony Stark in the Iron Man series.
- Respond in a way that indicates it has recognized its name and is ready for me to say something.
- Record my response as audio and convert it to text. This is called Speech to Text (STT).
- Understand the intent of what I said. This is called Natural Language Processing (NLP).
- Take some action based on the intent, including a verbal response.
This post covers steps 1 and 2.
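The five steps above form a pipeline. Here is a minimal sketch of the control flow, with each stage injected as a plain callable; the function and parameter names are hypothetical, not the project's actual classes, which are spread across ROS nodes:

```python
def run_conversation(wait_for_hotword, acknowledge, listen, transcribe,
                     understand, act):
    """Hypothetical sketch of the conversation pipeline.

    Each argument stands in for one stage; in the real project these
    live in separate ROS nodes communicating over topics.
    """
    wait_for_hotword()              # step 1: block until 'Jarvis' is heard
    acknowledge()                   # step 2: e.g. say 'Yes?'
    while True:
        audio = listen()            # record the user's utterance
        text = transcribe(audio)    # step 3: speech to text
        intent = understand(text)   # step 4: natural language processing
        if not act(intent):         # step 5: act; False ends the conversation
            break
```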
Is It Hot in Here?
Here is the concept of how the name detection is done.
I used a USB microphone connected directly to the Raspberry Pi. I elected to use Snowboy for hotword detection because it does one specialized task very well and is very fast. The project uses machine learning to improve detection accuracy by crowdsourcing recordings of the phrase; you can't download the trained model until you have recorded several samples of your own pronunciation. After the trained model is downloaded, Snowboy runs entirely on the Raspberry Pi with no need for an internet connection. I've encapsulated the Snowboy detector in hotword_detector.py and made a ROS node out of it in hotword.py. When the name 'Jarvis' is detected, a ROS message is published.
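The shape of that node can be sketched as follows. The publisher and detector are injected as plain callables so the publish-on-detection logic stands alone; the class structure, message text, and hook names are assumptions for illustration, not the actual contents of hotword.py (where `publish` would be a rospy Publisher's publish() and `start_detector` would wrap Snowboy's blocking listen loop):

```python
class HotwordNode:
    """Sketch of a hotword node: when the detector fires, publish a message.

    `publish` stands in for a rospy Publisher.publish(), and
    `start_detector` for the Snowboy detector's start() call, which
    invokes a callback on each detection -- both kept abstract here.
    """

    def __init__(self, publish):
        self._publish = publish

    def on_hotword(self):
        # Called by the detector whenever 'Jarvis' is heard.
        self._publish("hotword_detected")

    def run(self, start_detector):
        # Hand our callback to the detector and block in its listen loop.
        start_detector(self.on_hotword)
```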
A conversation is started by the ConversationManager (not shown) after the hotword message is published. I'll cover this class in a future post. One of the first things it does is acknowledge the start of the conversation by saying 'Yes?'. The internal TTS service is called to convert the text response to speech, which eventually ends up in the PollyVoiceSynthesizer class in voice_pyaudio.py, where it may call the AWS Polly service if the audio file is not yet cached on the Raspberry Pi.
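That cache-then-synthesize behavior can be sketched like this. The hash-based file naming and the `synthesize` hook (which in the real class would wrap the AWS Polly client) are assumptions for illustration, not the actual layout of voice_pyaudio.py:

```python
import hashlib
import os

def get_speech_wav(text, cache_dir, synthesize):
    """Return the path to a WAV for `text`, synthesizing only on a cache miss.

    `synthesize(text, path)` stands in for the AWS Polly call; caching
    under a hash of the text is an assumed naming scheme.
    """
    name = hashlib.sha1(text.encode("utf-8")).hexdigest() + ".wav"
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        synthesize(text, path)   # network call happens only the first time
    return path                  # subsequent calls are served from disk
```

With this shape, a frequently spoken phrase like 'Yes?' costs one Polly request ever; every later conversation plays the local file.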
In my previous post about using Amazon Polly, I used PyGame to play the audio. There was little need for an entire game framework, so I converted the TTS code to use the PyAudio library to be consistent with the rest of the project's audio capabilities. Admittedly, the PyAudio documentation is pretty terrible, but the many available examples are often a good substitute.
Once the WAV file is available, it is played through PyAudio and the response is complete. The entire process takes less than a second.
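Playback amounts to streaming the WAV to an output stream in chunks. In this sketch the stream is passed in; in the real code it would come from PyAudio, opened with the sample width, channel count, and frame rate read from the wave header. Only the chunked read/write loop is shown, and the chunk size is a common choice, not necessarily the project's:

```python
import wave

CHUNK_FRAMES = 1024  # frames per write; a typical PyAudio buffer size

def stream_wav(path, stream):
    """Play a WAV file by writing it to `stream` one chunk at a time.

    `stream` only needs a write(bytes) method; with PyAudio this would
    be a blocking output stream, so write() returns as the buffer drains.
    """
    with wave.open(path, "rb") as wf:
        data = wf.readframes(CHUNK_FRAMES)
        while data:
            stream.write(data)
            data = wf.readframes(CHUNK_FRAMES)
```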
The ConversationManager is responsible for keeping the interaction going until instructed to end the conversation. It does this by continuously looping through command-and-response interactions until the command is to end the conversation. After the conversation ends, a new one can be started by speaking the 'Jarvis' hotword. I'll cover that in the next post, which will also include a demonstration.