Amazon announced several new services a few weeks ago. Among them is Polly, an affordable text-to-speech service that supports 47 voices across 24 languages. This post describes my experience getting this service up and running on a Raspberry Pi 3.
Why Polly?
I’d like to incorporate speech synthesis in my robotics project, but I haven’t been impressed with the standalone implementations. I’d like to use Python to be consistent with the rest of my project. I looked around and found that the IBM Watson Developer Cloud offering did not seem to support Python: the API documentation covers curl, Node and Java, so I moved on. Google’s DeepMind reportedly produces very lifelike voices. Unfortunately, it appears that the Google speech synthesis API is only for JavaScript.
Polly is a new Amazon Web Service offering that uses Deep Learning to support 47 voices across 24 languages. It supports a whole host of options including Android, JavaScript in the browser, iOS, Java, .NET, Node.js, Python, PHP and Ruby. Given the other complementary services available through AWS, I elected to give this option a serious try.
Installation and Configuration
Since I am a new Amazon Web Services (AWS) user, I needed to:
- Establish an AWS account
- Use the Identity and Access Management service to create an administrative group, an administrative user, a development user and a default group. I granted the development user access to the Polly services. As part of this process, the development user is assigned a public and private key for authentication.
- Install the AWS Command Line Interface (CLI) using this documentation on the Raspberry Pi.
- Run the AWS CLI to configure access to the services from the Raspberry Pi. I used the simplest approach by using the aws configure option and was prompted for the needed information. This configured everything in the right place.
- Install the Boto3 Python client to enable access to AWS services.
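Before moving on, I found it useful to confirm that the configuration step actually wrote a usable profile. The following is a minimal Python 3 sketch, assuming the AWS CLI stored the keys in the INI-style file ~/.aws/credentials under a [default] section. The helper name has_profile is my own invention and is not part of the AWS tooling.

```python
import configparser


def has_profile(credentials_path, profile="default"):
    """Return True if the credentials file defines both keys for `profile`.

    `has_profile` is a hypothetical helper, not an AWS or Boto3 function.
    """
    parser = configparser.ConfigParser()
    parser.read(credentials_path)  # a missing file is silently ignored
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return ("aws_access_key_id" in section
            and "aws_secret_access_key" in section)
```

Calling has_profile(os.path.expanduser("~/.aws/credentials")) after importing os should return True once aws configure has completed. It only checks that the entries exist, not that AWS accepts the keys.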
Overall, I found the AWS documentation to be below my expectations. Getting started links led to pages that did not agree with the instructions. Instructions were often technically complete but presented in an overly complex way.
Example Code
Here is the example code. I’ll describe the code below.

    #!/usr/bin/env python
    # Python 2 example: the StringIO module wraps the audio bytes.
    import StringIO
    import sys
    import traceback
    from contextlib import closing

    import pygame
    from boto3 import Session
    from botocore.exceptions import BotoCoreError, ClientError


    class VoiceSynthesizer(object):
        def __init__(self, volume=0.1):
            pygame.mixer.init()
            self._volume = volume
            session = Session(profile_name="default")
            self.__polly = session.client("polly")

        def _getVolume(self):
            return self._volume

        def say(self, text):
            self._synthesize(text)

        def _synthesize(self, text):
            # Implementation specific synthesis
            try:
                # Request speech synthesis
                response = self.__polly.synthesize_speech(
                    Text=text, OutputFormat="ogg_vorbis", VoiceId="Brian")
            except (BotoCoreError, ClientError) as error:
                # The service returned an error
                print(error)
                traceback.print_exc(limit=5, file=sys.stdout)
                return  # there is no response to play

            # Access the audio stream from the response
            if "AudioStream" in response:
                # Note: Closing the stream is important as the service throttles on the
                # number of parallel connections. Here we are using contextlib.closing to
                # ensure the close method of the stream object will be called automatically
                # at the end of the with statement's scope.
                with closing(response["AudioStream"]) as stream:
                    data = stream.read()
                filelike = StringIO.StringIO(data)  # Gives you a file-like object
                sound = pygame.mixer.Sound(file=filelike)
                sound.set_volume(self._getVolume())
                sound.play()
                # Busy-wait until playback has finished
                while pygame.mixer.get_busy():
                    continue
            else:
                # The response didn't contain audio data, exit gracefully
                print("Could not stream audio - no audio data in response")


    if __name__ == "__main__":
        # Test code
        try:
            synthesizer = VoiceSynthesizer(0.1)
            synthesizer.say("Attention! The blue zone is for loading and unloading only.")
        except Exception:
            print("exception occurred!")
            traceback.print_exc(limit=5, file=sys.stdout)
        print("done")
The code is organized as a first iteration of a VoiceSynthesizer class and test code.
First, I had to figure out how to make a sound on the Raspberry Pi. There are many Python-based options and luckily PyGame is part of the default distribution for Raspberry Pi. PyGame is a general purpose game development framework that includes support for music and sound through a mixer object. The __init__ method initializes the mixer for use during synthesis. The last two lines of the __init__ method establish a Session for the default profile configured with the AWS CLI and create a client capable of calling the Polly services.
All of the interesting work is done in the _synthesize method. First, the Polly client is used to call the synthesize_speech service with plain text, the OGG Vorbis sound format and the voice named Brian. The OGG format was used because it is supported by the PyGame Sound class.
Next, the response is checked to see if an audio stream is present. The audio stream is a Boto-specific class that does not implement file-like behavior. As a result, all of the data from the stream is read and wrapped with a StringIO object that does. The file-like audio stream is provided to the Sound class as a “file”, the volume is set and the sound is played. All previous examples that I could find for text-to-speech APIs first saved the stream to a local file and then played the file. There was a great deal of latency in that approach, and I was able to convince PyGame to access an in-memory stream instead.
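For anyone on Python 3, where the StringIO module no longer exists, the same in-memory trick can be sketched with io.BytesIO. This is a minimal sketch rather than the code I ran on the robot. The function name fetch_audio is hypothetical, and the client is passed in as a parameter so the function can be exercised without AWS credentials.

```python
import io
from contextlib import closing


def fetch_audio(polly_client, text, voice_id="Brian"):
    """Return the synthesized audio as a seekable in-memory file object.

    `polly_client` is assumed to behave like a Boto3 Polly client.
    Returns None when the response carries no audio stream.
    """
    response = polly_client.synthesize_speech(
        Text=text, OutputFormat="ogg_vorbis", VoiceId=voice_id)
    stream = response.get("AudioStream")
    if stream is None:
        return None
    # Close the stream promptly: the service throttles parallel connections.
    with closing(stream):
        return io.BytesIO(stream.read())
```

With a real client created by Session(profile_name="default").client("polly"), the returned object can be handed to pygame.mixer.Sound(file=...) in the same way as the StringIO object above.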
This example spins in a busy loop while the audio plays. One thing I’d like to add in the future is to locally save the sound file while it is playing and use that copy as a cached result when the same text is used the next time.
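Here is a rough Python 3 sketch of how the caching idea might look, along with a gentler alternative to the busy loop. All of the names (cache_path, get_audio, wait_until_idle) are hypothetical, and the synthesize argument stands in for whatever function actually calls Polly. As a simplification, the sketch writes the file before returning the bytes rather than while the sound is playing.

```python
import hashlib
import os
import time


def cache_path(cache_dir, text, voice_id="Brian", extension="ogg"):
    """Derive a stable file name from the voice and the text."""
    key = hashlib.sha256((voice_id + "\n" + text).encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, key + "." + extension)


def get_audio(cache_dir, text, synthesize, voice_id="Brian"):
    """Return audio bytes, calling synthesize(text, voice_id) only on a miss."""
    path = cache_path(cache_dir, text, voice_id)
    if os.path.exists(path):
        with open(path, "rb") as cached:
            return cached.read()
    data = synthesize(text, voice_id)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as out:
        out.write(data)
    return data


def wait_until_idle(is_busy, poll_seconds=0.05):
    """Sleep between checks instead of spinning, leaving the CPU free."""
    while is_busy():
        time.sleep(poll_seconds)
```

With PyGame, wait_until_idle(pygame.mixer.get_busy) would take the place of the while loop in the listing. Keying the cache on both the voice and the text keeps a later change of voice from replaying stale audio.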
Here is an example of the synthesized speech:
Reality Check
The Polly service was announced as a Machine Learning service. The announcement video gave the impression that the text-to-speech conversion is context-aware by converting “The temperature in WA is 75°F” to “The temperature in Washington is 75 degrees Fahrenheit”. That definitely works, but change WA to another state such as MN and it resorts to spelling out M-N instead. I suspect that this was done for the announcement demo through Polly Lexicons, which let you define the pronunciation of specific words.
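I have no way of knowing how the demo was actually built, but a lexicon that expands state abbreviations could be sketched roughly as follows in Python 3. Polly lexicons are written in the W3C Pronunciation Lexicon Specification (PLS) XML format and uploaded with the client's put_lexicon call. The helper names build_lexicon and upload_lexicon and the lexicon name "states" are my own assumptions. The default language is en-GB to match the Brian voice, since my understanding of the documentation is that a lexicon only applies when its language matches the voice.

```python
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-GB"):
    """Build a PLS document mapping each written form to a spoken alias."""
    lexemes = "".join(
        "  <lexeme><grapheme>{}</grapheme><alias>{}</alias></lexeme>\n".format(
            escape(word), escape(spoken))
        for word, spoken in sorted(aliases.items()))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<lexicon version="1.0" xmlns="{}" alphabet="ipa" xml:lang="{}">\n'
        '{}</lexicon>\n'
    ).format(PLS_NAMESPACE, lang, lexemes)


def upload_lexicon(polly_client, name, aliases, lang="en-GB"):
    """Store the lexicon under `name` using a Boto3-style Polly client."""
    polly_client.put_lexicon(Name=name, Content=build_lexicon(aliases, lang))
```

After upload_lexicon(client, "states", {"MN": "Minnesota"}), passing LexiconNames=["states"] to synthesize_speech should make the voice say the full state name. A lexicon like this needs an entry for every abbreviation, which would explain why the demo sentence works and other states do not.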