Give your Raspberry Pi a voice with AWS Polly

Amazon announced several new services a few weeks ago.  Among them is Polly, an affordable text-to-speech service that supports 47 voices across 24 languages.  This post describes my experience getting the service up and running on a Raspberry Pi 3.

Why Polly?

I’d like to incorporate speech synthesis in my robotics project, but I haven’t been impressed with the standalone implementations.  I’d also like to use Python to be consistent with the rest of my project.  I looked around and found that the IBM Watson Developer Cloud offering did not seem to support Python: its API documentation covers Curl, Node and Java, so I moved on.  Google’s DeepMind reportedly produces very life-like voices.  Unfortunately, it appears that the Google speech synthesis API is only available for JavaScript.

Polly is a new Amazon Web Services offering that uses deep learning to support 47 voices across 24 languages.  It can be used from a wide range of platforms and languages, including Android, JavaScript in the browser, iOS, Java, .NET, Node.js, Python, PHP and Ruby.  Given the other complementary services available through AWS, I elected to give this option a serious try.

Installation and Configuration

Since I am a new Amazon Web Services (AWS) user, I needed to:

  1. Establish an AWS account
  2. Use the Identity and Access Management (IAM) service to create an administrative group and an administrative user, then a development user and a default group.  I granted the development user access to the Polly services.  The development user is assigned an access key ID and a secret access key for authentication as part of this process.
  3. Install the AWS Command Line Interface (CLI) using this documentation on the Raspberry Pi.
  4. Run the AWS CLI to configure access to the services from the Raspberry Pi.  I took the simplest approach: I ran the aws configure command and was prompted for the needed information.  This put everything in the right place.
  5. Install the Boto3 Python client to enable access to AWS services.
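
Before moving on, it can help to confirm that step 4 actually wrote the keys.  The following is a minimal sketch, assuming Python 3 and the default credentials location used by aws configure (~/.aws/credentials).  The helper name missing_credential_keys is hypothetical and not part of the AWS tooling; it only checks that the keys are present, not that they are valid.

```python
import configparser
import os

# Keys that "aws configure" stores for each profile in the credentials file.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(path=None, profile="default"):
    """Return the required keys that are absent for `profile`.

    Hypothetical helper for illustration. An empty list means both keys
    are present in the file; it says nothing about whether AWS accepts them.
    """
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".aws", "credentials")
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file simply yields no sections
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(profile, key)]
```

Calling missing_credential_keys() with no arguments checks the default profile; anything other than an empty list suggests that aws configure should be run again.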

Overall, I found the AWS documentation to be below my expectations.  Getting started links led to pages that did not agree with the instructions.  Instructions were often technically complete but presented in an overly complex way.

Example Code

Here is the example code.  I’ll describe the code below.

#! /usr/bin/env python
import pygame, StringIO
import sys, traceback
from boto3 import Session
from botocore.exceptions import BotoCoreError, ClientError
from contextlib import closing

class VoiceSynthesizer(object):
    def __init__(self, volume=0.1):
        # Initialize the PyGame mixer used for playback
        pygame.mixer.init()
        self._volume = volume
        # Use the default profile configured with the AWS CLI
        session = Session(profile_name="default")
        self.__polly = session.client("polly")

    def _getVolume(self):
        return self._volume

    def say(self, text):
        self._synthesize(text)

    def _synthesize(self, text):
        # Implementation specific synthesis
        try:
            # Request speech synthesis
            response = self.__polly.synthesize_speech(Text=text,
                OutputFormat="ogg_vorbis", VoiceId="Brian")
        except (BotoCoreError, ClientError) as error:
            # The service returned an error
            exc_type, exc_value, exc_traceback = sys.exc_info()
            traceback.print_exception(exc_type, exc_value, exc_traceback,
                limit=5, file=sys.stdout)
            # There is no response to play, so give up on this request
            return

        # Access the audio stream from the response
        if "AudioStream" in response:
            # Note: Closing the stream is important as the service throttles on the
            # number of parallel connections. Here we are using contextlib.closing to
            # ensure the close method of the stream object will be called automatically
            # at the end of the with statement's scope.
            with closing(response["AudioStream"]) as stream:
                data = stream.read()
                filelike = StringIO.StringIO(data) # Gives you a file-like object
                sound = pygame.mixer.Sound(file=filelike)
                sound.set_volume(self._getVolume())
                sound.play()
                # Busy-wait until playback has finished
                while pygame.mixer.get_busy() == True:
                    continue
        else:
            # The response didn't contain audio data, exit gracefully
            print("Could not stream audio - no audio data in response")

if __name__ == "__main__":
    # Test code
    debugging = False
    try:
        synthesizer = VoiceSynthesizer(0.1)
        synthesizer.say("Attention! The blue zone is for loading and unloading only.")
    except Exception:
        print("exception occurred!")
        exc_type, exc_value, exc_traceback = sys.exc_info()
        traceback.print_exception(exc_type, exc_value, exc_traceback,
            limit=5, file=sys.stdout)
    print("done")

The code is organized as a first iteration of a VoiceSynthesizer class and test code.

First, I had to figure out how to make a sound on the Raspberry Pi.  There are many Python-based options, and luckily PyGame is part of the default distribution for Raspberry Pi.  PyGame is a general purpose game development framework that includes support for music and sound through a mixer object.  The init method initializes the mixer for use during synthesis.  The remaining lines of the init method establish a Session for the default profile configured with the AWS CLI and a client capable of calling the Polly services.

All of the interesting work is done in the _synthesize method.  First, the Polly client is used to call the synthesize_speech service with plain text, the OGG sound format and the voice named Brian.  I chose the OGG sound format because it is supported by the PyGame Sound class.

Next, the response is checked to see whether an audio stream is present.  The audio stream is a Boto-specific class that does not implement file-like behavior.  As a result, all of the data from the stream is read and wrapped in a StringIO object that does.  The file-like audio stream is provided to the Sound class as a “file”, the volume is set and the sound is played.  All previous examples that I could find for text-to-speech APIs first saved the stream to a local file and then played the file.  There was a great deal of latency in that approach, and I was able to convince PyGame to read from an in-memory stream instead.
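
The read-and-wrap step can be isolated into a small helper.  This is a minimal sketch, assuming Python 3, where io.BytesIO plays the role that StringIO plays for binary data in the Python 2 listing.  The name stream_to_filelike is hypothetical; the only things it assumes about the stream are the read and close methods.

```python
import io
from contextlib import closing


def stream_to_filelike(audio_stream):
    """Read a streaming body completely and wrap the bytes in a seekable buffer.

    Hypothetical helper for illustration. `audio_stream` is assumed to
    offer read() and close(), as the "AudioStream" value in the response does.
    """
    # closing() guarantees close() is called even if read() raises,
    # which releases the underlying connection promptly.
    with closing(audio_stream) as stream:
        data = stream.read()
    # BytesIO supports read() and seek(), which a plain stream may not.
    return io.BytesIO(data)
```

The returned buffer could then be handed to the Sound class in place of a file on disk, which is what avoids the save-then-play latency.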

This example spins in a busy loop while the audio plays.  One thing I’d like to add in the future is to locally save the sound file while it is playing and use that copy as a cached result when the same text is used the next time.
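
The caching idea could look something like the following.  This is only a sketch of one possible design, assuming Python 3: the function names, the SHA-256 file naming scheme and the synthesize callback are all hypothetical, and the callback stands in for whatever code actually calls Polly and returns the audio bytes.

```python
import hashlib
import os


def cache_path(cache_dir, text, voice_id="Brian", extension="ogg"):
    """Derive a stable file name from the voice and the text (hypothetical scheme)."""
    key = hashlib.sha256((voice_id + "\n" + text).encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, key + "." + extension)


def cached_speech(text, synthesize, cache_dir, voice_id="Brian"):
    """Return audio bytes for `text`, calling `synthesize` only on a cache miss.

    `synthesize(text, voice_id)` is an assumed callback that returns the
    audio as bytes, for example by reading the Polly audio stream.
    """
    path = cache_path(cache_dir, text, voice_id)
    if os.path.exists(path):
        # Cache hit: reuse the audio saved on an earlier request.
        with open(path, "rb") as cached:
            return cached.read()
    # Cache miss: synthesize once, then keep a local copy for next time.
    data = synthesize(text, voice_id)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as out:
        out.write(data)
    return data
```

Hashing the voice together with the text keeps the same phrase spoken by two different voices from colliding in the cache.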

[Audio sample of the synthesized speech]

Reality Check

The Polly service was announced as a machine learning service.  The announcement video gave the impression that the service understands context, converting “The temperature in WA is 75°F” to “The temperature in Washington is 75 degrees Fahrenheit”.  That example does work, but change WA to another state such as MN and Polly spells out M-N instead.  I suspect that the demo relied on Polly Lexicons, a feature for defining the pronunciation of specific words, configured specifically for the announcement.
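
To illustrate what such a lexicon might look like, here is a minimal sketch, assuming Python 3, that builds a Pronunciation Lexicon Specification (PLS) document mapping abbreviations to spoken aliases.  The function name and the MN example are my own assumptions; I have no confirmation that the demo was built this way.

```python
from xml.sax.saxutils import escape

# Skeleton of a PLS document; the namespace comes from the W3C PLS standard.
PLS_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="{lang}">
{lexemes}
</lexicon>
"""


def build_alias_lexicon(aliases, lang="en-GB"):
    """Build a PLS lexicon that replaces each grapheme with its alias.

    Hypothetical helper for illustration. `aliases` maps written forms
    (such as "MN") to the words that should be spoken instead.
    """
    lexemes = "\n".join(
        "  <lexeme><grapheme>{}</grapheme><alias>{}</alias></lexeme>".format(
            escape(grapheme), escape(alias))
        for grapheme, alias in sorted(aliases.items())
    )
    return PLS_TEMPLATE.format(lang=escape(lang, {'"': "&quot;"}), lexemes=lexemes)
```

The resulting string could be uploaded with the Polly client's put_lexicon call (Name and Content parameters) and then referenced through the LexiconNames parameter of synthesize_speech.  As far as I understand, the lexicon language needs to match the voice's language to take effect, which is why the default here is en-GB to go with Brian.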


3 thoughts on “Give your Raspberry Pi a voice with AWS Polly”

  1. This is exactly the basic code I need for a project I am working on with the Pi. I keep getting an error when executing it and I can’t seem to figure out what it is (it seems to be something with the filelike code):

    exception occurred!
    Traceback (most recent call last):
    File “”, line 59, in
    synthesizer.say(“Attention blue zone is for loading and unloading only.”)
    File “”, line 19, in say
    File “”, line 43, in _synthesize
    sound = pygame.mixer.Sound(file=filelike)
    TypeError: function takes exactly 1 argument (0 given)
