Give your Raspberry Pi a voice with AWS Polly

Amazon announced several new services a few weeks ago.  Among them is Polly, an affordable text-to-speech service that supports 47 voices across 24 languages.  This post describes my experience getting the service up and running on a Raspberry Pi 3.

Why Polly?

I’d like to incorporate speech synthesis into my robotics project, but I haven’t been impressed with the standalone implementations.  I’d also like to use Python, to be consistent with the rest of my project.  I looked around and found that the IBM Watson Developer Cloud offering did not seem to support Python; its API documentation covers Curl, Node and Java, so I moved on.  Google’s DeepMind reportedly produces very life-like voices.  Unfortunately, it appears that the Google speech synthesis API is only for JavaScript.

Polly is a new Amazon Web Services offering that uses deep learning to support 47 voices across 24 languages.  It supports a whole host of SDKs including Android, JavaScript in the browser, iOS, Java, .NET, Node.js, Python, PHP and Ruby.  Given the other complementary services available through AWS, I elected to give this option a serious try.

Installation and Configuration

Since I am a new Amazon Web Services (AWS) user, I needed to:

  1. Establish an AWS account
  2. Use the Identity and Access Management (IAM) service to create an administrative group, an administrative user, and a development user with a default group.  I granted the development user access to the Polly services.  As part of this process, the development user is assigned a public and private key for authentication.
  3. Install the AWS Command Line Interface (CLI) using this documentation on the Raspberry Pi.
  4. Run the AWS CLI to configure access to the services from the Raspberry Pi.  I took the simplest approach, running aws configure, which prompted me for the needed information and put everything in the right place.
  5. Install the Boto3 Python client to enable access to AWS services.
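For reference, aws configure writes two small INI files under ~/.aws.  They look roughly like this (the keys and region shown here are placeholders, not my actual values):

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# ~/.aws/config
[default]
region = us-east-1
output = json
```

Boto3 picks up this same "default" profile, which is what the example code relies on.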

Overall, I found the AWS documentation to be below my expectations.  Getting started links led to pages that did not agree with the instructions.  Instructions were often technically complete but presented in an overly complex way.

Example Code

Here is the example code.  I’ll describe it below.

#! /usr/bin/env python
# Python 2 (StringIO module, print statements); use io.BytesIO and print() on Python 3
import pygame, StringIO
import sys, time, traceback
from boto3 import Session
from botocore.exceptions import BotoCoreError, ClientError
from contextlib import closing

class VoiceSynthesizer(object):
    def __init__(self, volume=0.1):
        self._volume = volume
        # Initialize the PyGame mixer for audio playback
        pygame.mixer.init()
        # Create a Polly client using the default AWS CLI profile
        session = Session(profile_name="default")
        self.__polly = session.client("polly")

    def _getVolume(self):
        return self._volume

    def say(self, text):
        self._synthesize(text)

    def _synthesize(self, text):
        # Implementation specific synthesis
        try:
            # Request speech synthesis as OGG Vorbis using the Brian voice
            response = self.__polly.synthesize_speech(Text=text,
                OutputFormat="ogg_vorbis", VoiceId="Brian")
        except (BotoCoreError, ClientError) as error:
            # The service returned an error
            exc_type, exc_value, exc_traceback = sys.exc_info()
            traceback.print_exception(exc_type, exc_value, exc_traceback,
                limit=5, file=sys.stdout)
            return

        # Access the audio stream from the response
        if "AudioStream" in response:
            # Note: Closing the stream is important as the service throttles on the
            # number of parallel connections. Here we are using contextlib.closing to
            # ensure the close method of the stream object will be called automatically
            # at the end of the with statement's scope.
            with closing(response["AudioStream"]) as stream:
                data = stream.read()
                filelike = StringIO.StringIO(data) # Gives you a file-like object
                sound = pygame.mixer.Sound(file=filelike)
                sound.set_volume(self._volume)
                sound.play()
                while pygame.mixer.get_busy():
                    time.sleep(0.1)
        else:
            # The response didn't contain audio data, exit gracefully
            print("Could not stream audio - no audio data in response")

if __name__ == "__main__":
    # Test code
    debugging = False
    try:
        synthesizer = VoiceSynthesizer(0.1)
        synthesizer.say("Attention! The blue zone is for loading and unloading only.")
    except Exception:
        print "exception occurred!"
        exc_type, exc_value, exc_traceback = sys.exc_info()
        traceback.print_exception(exc_type, exc_value, exc_traceback,
            limit=5, file=sys.stdout)
    print "done"

The code is organized as a first iteration of a VoiceSynthesizer class and test code.

First, I had to figure out how to make a sound on the Raspberry Pi.  There are many Python-based options, and luckily PyGame is part of the default distribution for Raspberry Pi.  PyGame is a general-purpose game development framework that includes support for music and sound through a mixer object.  The __init__ method initializes the mixer for use during synthesis; its remaining lines establish a Session for the default profile configured with the AWS CLI and a client capable of calling the Polly services.

All of the interesting work is done in the _synthesize method.  First, the Polly client is used to call the synthesize_speech service with the plain text, the OGG sound format and the voice named Brian.  The OGG sound format was used because it is supported by the PyGame Sound class.

Next, the response is checked for the presence of an audio stream.  The audio stream is a Boto-specific class that does not implement file-like behavior.  As a result, all of the data is read from the stream and wrapped in a StringIO object that does.  The file-like audio stream is provided to the Sound class as a “file”, the volume is set and the sound is played.  All previous examples that I could find for text-to-speech APIs first saved the stream to a local file and then played the file.  There was a great deal of latency in that approach, and I was able to convince PyGame to play an in-memory stream instead.
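The wrapping step can be demonstrated in isolation.  The post's Python 2 code uses StringIO.StringIO; io.BytesIO (shown here, available in both Python 2 and 3) plays the same role for binary data:

```python
import io

# Stand-in for the raw OGG bytes read from response["AudioStream"]
data = b"OggS" + b"\x00" * 16  # OGG files begin with the "OggS" magic bytes

# Wrap the bytes in an in-memory, file-like object
filelike = io.BytesIO(data)

# Anything that expects a file can now read and seek it,
# which is what pygame.mixer.Sound(file=...) needs
header = filelike.read(4)
filelike.seek(0)
```

This avoids the round trip to the SD card entirely, which is where the latency of the save-then-play approach comes from.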

This example spins in a busy loop while the audio plays.  One thing I’d like to add in the future is to locally save the sound file while it is playing and use that copy as a cached result when the same text is used the next time.
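A sketch of that caching idea, keyed on a hash of the text and voice (the directory and function names here are hypothetical, not part of the code above):

```python
import hashlib
import os
import tempfile

# Hypothetical cache location
CACHE_DIR = os.path.join(tempfile.gettempdir(), "polly_cache")

def _cache_path(text, voice="Brian"):
    """Derive a stable file name from the voice and text."""
    key = hashlib.sha256((voice + "|" + text).encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, key + ".ogg")

def get_cached(text, voice="Brian"):
    """Return previously synthesized audio bytes, or None on a miss."""
    path = _cache_path(text, voice)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    return None

def put_cached(text, data, voice="Brian"):
    """Save freshly synthesized audio for reuse next time."""
    if not os.path.isdir(CACHE_DIR):
        os.makedirs(CACHE_DIR)
    with open(_cache_path(text, voice), "wb") as f:
        f.write(data)
```

The synthesis code could then check get_cached before calling Polly, and call put_cached after reading the audio stream, so repeated phrases never hit the network.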

Here is an example of the synthesized speech:

Reality Check

The Polly service was announced as a Machine Learning service.  The announcement video gave the impression that there is contextual understanding behind the service by converting “The temperature in WA is 75°F” to “The temperature in Washington is 75 degrees Fahrenheit”.  That example definitely works, but change WA to another state such as MN and Polly resorts to spelling out M-N instead.  I suspect that this was done through the use of Polly lexicons – the ability to define the pronunciation of specific words – for the announcement demo.
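Polly lexicons use the W3C Pronunciation Lexicon Specification (PLS) format.  A lexicon along these lines (a hypothetical sketch, not the demo's actual lexicon) would make MN expand the same way:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>MN</grapheme>
    <alias>Minnesota</alias>
  </lexeme>
</lexicon>
```

Such a lexicon would be uploaded with Polly's put_lexicon API and then referenced by name when calling synthesize_speech.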


Integrating a VR controller with a 3-D Printed Robot Arm

This is a milestone post in my series describing my experience building a 3-D printed InMoov robot using the Raspberry Pi 3 and ROS.  The source code is now available here.  In my last post, I described the development of a touch-screen user interface for adjusting the position of the servos.  Since then, I have repackaged the hardware from a breadboard implementation to a “permanent breadboard” implementation, making it more compact and neat.  Here is a video that highlights that progress.

Taking a Leap

The next step in this project is to integrate a Leap Motion virtual reality controller.  The Leap Motion controller uses infrared cameras to track the position of your hands and fingers in the space in front of your monitor.  The intent is to use the controller to control what is happening on your computer, including virtual reality applications.  In this case, we are going to use it to have the robot hand mimic my hand.

Conceptual architecture of the Leap Motion integration with a Raspberry Pi-based robot arm
  1. The Leap Motion is plugged into my MacBook Pro via a USB cable.  I installed the V2 Desktop SDK so that I could get hand and finger position data through an API.  The Leap Motion monitors your hands and generates up to 100 frames per second of JSON data, available on a WebSocket API hosted by a daemon process.  The documentation is above average but still has a few inconsistencies.
  2. As I have mentioned previously, I am running an Ubuntu virtual machine on my MacBook Pro and have installed ROS on that virtual machine.  I originally intended to install ROS natively on the Mac, but I was unable to get that working.  As a result, a ROS Publisher runs on the Ubuntu VM and is responsible for bridging the Leap Motion data to the ROS messaging infrastructure.  This component converts the JSON-based Leap Motion data into a nearly complete version suitable for ROS messaging.  It cannot run on the Raspberry Pi because the Pi cannot keep up with the bandwidth generated by the Leap Motion.  While converting the JSON-based message into a ROS message, the publishing frequency is governed down to 20 frames per second.
  3. The Raspberry Pi subscribes to the ROS-compatible Leap Motion messages and converts the position vectors from Leap Motion into angle goals for the hand servos.  This is done through the use of a ROS Subscriber (to receive the Leap messages and convert them to a goal) and a ROS Action Server (to update the hand and finger positions).
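The core of step 3 is a linear mapping from a Leap Motion measurement onto a servo's angular range.  A minimal, self-contained sketch of that conversion (the ranges here are illustrative, not the actual calibration values from my published code):

```python
def to_servo_angle(value, in_min, in_max, out_min=0.0, out_max=180.0):
    """Linearly map a Leap Motion measurement (e.g. a finger extension
    value) into a servo angle, clamped to the servo's mechanical limits."""
    value = max(in_min, min(in_max, value))  # clamp the raw reading
    scale = (out_max - out_min) / (in_max - in_min)
    return out_min + (value - in_min) * scale

# A half-extended finger (0.5 on an assumed 0..1 scale) lands mid-travel
angle = to_servo_angle(0.5, 0.0, 1.0)
```

Clamping matters here: the Leap Motion occasionally reports values outside the expected range, and driving a servo past its mechanical limit can damage the 3-D printed linkages.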

The diagram below depicts these activities:


The published code depends very heavily on the physical implementation of the Raspberry Pi, the servo controller and the servo channel assignments.  This diagram shows how I have assembled most of the hardware:


Here is a demonstration of how well the integration works.  There is a bit of a delay between my hand moving and the robot reacting, and it appears that two fingers need some adjustment.

I am now at a point where I have to decide on my next steps in this project.  There appears to be a minor bug in the wrist rotation, so that has been commented out for now.