
I made a cool thing last week that I wanted to share.

I've been studying for my AWS Certified Solutions Architect exam and going through the corresponding A Cloud Guru course. The course has a series of labs using Polly, Amazon's text-to-speech service; these labs inspired me to build something with Polly for my own use.

What I Built

I spend a lot of time on Wikipedia, where I often encounter words I don’t know how to pronounce, like the names of various animal genera.

Wikipedia usually spells these phonetically in International Phonetic Alphabet (IPA) notation, which looks like /ˈpɪdʒ.ən/ for the word "pigeon." Wikipedia also links the notation to its IPA help page and, if you hover over each character, provides a helpfully simplified per-character pronunciation guide.

What I wanted was something to read me the pronunciation aloud without me having to comb through interesting but complex charts — sometimes I just want to know how to pronounce the scientific name for whiskers (spoiler: it's /vaɪˈbrɪsi/).

So that's what I built.

How I Built It

I set up a Lambda function (triggered by an incoming POST request to API Gateway) to take the given IPA notation and voice selection, send them to the Polly service to be translated into speech, then handle the returned audio stream. Initially, this meant saving the audio as a file on S3; later, I decided to just return the Base64-encoded audio directly.

Lambda + Polly

My first step was to create the following IAM policy to allow a Lambda function to use Polly’s speech synthesis feature, then create an IAM role to attach the policy to.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "polly:SynthesizeSpeech"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
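
The policy above covers what the function is allowed to do. The role also needs a trust policy that lets the Lambda service assume it; the console adds one when you pick Lambda as the trusted service. Here is a minimal sketch of that document, built in Python so it can be printed or handed to other tooling. The function name is my own and not part of the original setup.

```python
import json


def lambda_trust_policy():
    """Return a trust policy allowing the Lambda service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "lambda.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print the document in the same layout as the policy above.
    print(json.dumps(lambda_trust_policy(), indent=4))
```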

I then created the Lambda function and assigned it the new role I created. The full code for the Lambda function (using Python 3) is below, but basically it:

  1. Initializes the Boto 3 Polly client
  2. Gets the IPA notation and selected Polly voice from the request
  3. Wraps the text with some structure so that Polly will pronounce it, not read it
  4. Sends the text to Polly to convert to speech
  5. Encodes and returns the audio it gets back from Polly
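
The five steps above can be sketched roughly as follows. This is not the original function, just a minimal version under some assumptions: the request arrives as a dict with hypothetical "text" and "voice" keys, the response uses a hypothetical "audio" key, and "Joanna" is picked as a default voice. The optional polly argument is also my addition, so the handler can be exercised without AWS credentials.

```python
import base64
from xml.sax.saxutils import quoteattr


def lambda_handler(event, context, polly=None):
    """Turn IPA notation into Base64-encoded MP3 audio using Polly."""
    # 1. Initialize the Boto 3 Polly client, unless a client was passed in.
    if polly is None:
        import boto3  # bundled with the Lambda Python runtime

        polly = boto3.client("polly")

    # 2. Get the IPA notation and voice from the request.
    #    The "text" and "voice" keys are assumed names.
    ipa = event["text"].strip().strip("/")
    voice = event.get("voice", "Joanna")

    # 3. Wrap the notation in SSML so Polly pronounces it instead of reading it.
    #    quoteattr() escapes and quotes the value for use as an XML attribute.
    ssml = f'<speak><phoneme alphabet="ipa" ph={quoteattr(ipa)}></phoneme></speak>'

    # 4. Send the SSML to Polly to convert to speech.
    response = polly.synthesize_speech(
        OutputFormat="mp3", Text=ssml, TextType="ssml", VoiceId=voice
    )

    # 5. Base64-encode the audio stream and return it.
    audio = response["AudioStream"].read()
    return {"audio": base64.b64encode(audio).decode("ascii")}
```

Stripping the surrounding slashes is a convenience so that notation pasted straight from Wikipedia, such as /ˈpɪdʒ.ən/, works unchanged.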

The only real configuration I had to do here was to get Polly to read the text as IPA notation rather than as regular text.

Polly reading /ˈkʌ.təl.fɪʃ/ as plain text instead of the IPA notation for pronouncing "cuttlefish."

One of the arguments taken by the synthesize_speech method is TextType, which accepts either "text" or "ssml" as its value. The default value is "text", which results in Polly reading the text as you would expect a person to read it. The "ssml" option, however, allows the use of supported Speech Synthesis Markup Language (SSML) tags to control how Polly generates speech from text. For reading IPA notation, the <phoneme> tag did exactly what I was looking for, with "ipa" specified as the alphabet and the IPA notation passed as the ph (phonetic symbols for pronunciation) value.

<phoneme alphabet="ipa" ph="ˈkʌ.təl.fɪʃ"></phoneme>

Polly reading /ˈkʌ.təl.fɪʃ/ as the IPA notation for pronouncing "cuttlefish," using the code above.
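
To make the difference between the two recordings concrete, here is a small sketch that builds the synthesize_speech arguments for each mode. The helper names and the optional fallback word are my own assumptions. Note that Polly expects SSML to be wrapped in a <speak> root element, so the helper adds one around the <phoneme> tag.

```python
from xml.sax.saxutils import escape, quoteattr


def ipa_to_ssml(ipa, word=""):
    """Wrap IPA notation in a <phoneme> tag inside the <speak> root element.

    `word` is optional fallback text placed inside the tag (an assumption;
    the tag in the post is empty).
    """
    return (
        "<speak>"
        f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'
        "</speak>"
    )


def polly_kwargs(ipa, voice, as_ipa=True):
    """Build keyword arguments for synthesize_speech in either mode."""
    if as_ipa:
        # SSML mode: Polly pronounces the phonetic symbols.
        return {"OutputFormat": "mp3", "VoiceId": voice,
                "TextType": "ssml", "Text": ipa_to_ssml(ipa)}
    # Plain text mode: Polly reads the characters as ordinary text.
    return {"OutputFormat": "mp3", "VoiceId": voice,
            "TextType": "text", "Text": ipa}
```

With a Boto 3 client in hand, either result could be passed along as polly.synthesize_speech(**polly_kwargs("ˈkʌ.təl.fɪʃ", "Joanna")).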

Originally, I set the Lambda function up to save the audio returned from Polly as an MP3 file in a bucket on S3, then to check whether the audio already existed before sending the text to Polly. I eventually decided to just Base64 encode the audio and return it directly, skipping the S3 step.

If you're interested in the implementation with the S3 upload intact, you can check it out here. (Don't forget to update your IAM policy to let Lambda access S3, too.)
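
For a rough idea of what the check-then-upload flow could look like, here is a sketch that is separate from the linked implementation. The key scheme, the function names, and the synthesize callback are all assumptions; s3 is expected to be a Boto 3 S3 client.

```python
import hashlib


def audio_key(ipa, voice):
    """Derive a stable object key from the voice and notation (assumed scheme)."""
    digest = hashlib.sha256(f"{voice}:{ipa}".encode("utf-8")).hexdigest()
    return f"audio/{digest}.mp3"


def audio_exists(s3, bucket, key):
    """Return True if the object is already in the bucket."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except s3.exceptions.ClientError as error:
        # A missing object is reported as a 404; anything else is a real error.
        if error.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return False
        raise


def get_or_create_audio(s3, bucket, ipa, voice, synthesize):
    """Call Polly (via `synthesize`) only when the audio is not cached yet."""
    key = audio_key(ipa, voice)
    if not audio_exists(s3, bucket, key):
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=synthesize(ipa, voice),  # expected to return MP3 bytes
            ContentType="audio/mpeg",
        )
    return key
```

For this version, the Lambda role would need s3:GetObject for the existence check and s3:PutObject for the upload.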

API Gateway

Once I created the Lambda function, I needed to create the trigger for it. For this, I created a new API with the API Gateway service. The API itself only took a few steps to configure:

  1. Add POST method ("Create Method" in the "Actions" menu, select "POST", and confirm)
  2. Set endpoint "Integration type" as "Lambda function"
  3. Select newly-created Lambda function in "Lambda Function" field and save
  4. Enable CORS ("Enable CORS" in the "Actions" menu)
  5. Deploy API ("Deploy API" in the "Actions" menu)

I then grabbed the resulting invoke URL for my static site to POST to, and that was it for API Gateway setup.
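
The site itself makes this request from JavaScript, but the same POST can be sketched in Python for testing the endpoint from a terminal. The invoke URL below is a placeholder, and the "text", "voice", and "audio" field names are assumptions about the request and response shape rather than anything confirmed above.

```python
import base64
import json
import urllib.request

# Hypothetical invoke URL; substitute the one API Gateway shows after deployment.
INVOKE_URL = "https://abc123.execute-api.us-east-1.amazonaws.com/prod"


def build_request(ipa, voice, url=INVOKE_URL):
    """Build the JSON POST request without sending it."""
    body = json.dumps({"text": ipa, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_audio(ipa, voice, url=INVOKE_URL):
    """POST the notation and return the decoded MP3 bytes."""
    with urllib.request.urlopen(build_request(ipa, voice, url)) as response:
        payload = json.load(response)
    return base64.b64decode(payload["audio"])
```

Writing the bytes returned by fetch_audio to a file ending in .mp3 gives something any audio player can open.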

S3 Static Site

The web page I created for interacting with the Lambda function and Polly is perhaps the least interesting part of the process, but it also runs on AWS: it's hosted as a static website in an S3 bucket. The page itself is just some HTML for structure, some JavaScript to POST the submitted form to the Lambda API and to present the audio player when the Polly audio comes back, and some CSS for fun.

/kənˈkluːʒən/

And that's it! The whole process was surprisingly simple and a lot of fun.

I've already been using the result myself, but give it a try and let me know what you think in the comments!
