Tips & Tricks for Playing Video with APL and Alexa

I am currently in the process of building an Alexa skill that contains all of the knowledge of the Star Wars Universe.  This includes characters, droids, weapons, vehicles, planets, creatures, and even different species and organizations.  It also includes the ability to request the opening crawl videos from each of the movies in the Star Wars saga, and the trailers for the movies, television shows, and video games.

It’s the videos that have brought me here to share what I have learned.

Alexa is available on a wide variety of devices.  Some small, some big, some with screens, others without.  For those devices with screens, I want to provide my users with a simple workflow:

  1. Ask for a specific video.
  2. View the requested video.
  3. Continue the conversation when the video ends.

For the first two steps, this was surprisingly easy to implement using the Alexa Presentation Language (APL).  The third step required some research and trial and error, but I have it working successfully now.

Identifying the Video a User Requested

While there is nothing complicated about identifying a user’s request, I’ll show you how I’m handling it so that, if you want to build your own version, you have everything you need.

In my Interaction Model, I have an intent called “CrawlIntent.”  This is there to handle all of the ways a user might ask to see the opening crawl of a specific film.  It looks like this:

{
  "name": "CrawlIntent",
  "slots": [
    {
      "name": "media",
      "type": "Media"
    }
  ],
  "samples": [
    "show me the {media} crawl",
    "{media} crawl",
    "can I see the {media} crawl",
    "show the crawl for {media}",
    "for the {media} crawl",
    "to show the crawl for {media}",
    "show me the {media} opening crawl",
    "{media} opening crawl",
    "can I see the {media} opening crawl",
    "show the opening crawl for {media}",
    "for the {media} opening crawl",
    "to show the opening crawl for {media}",
    "play the {media} opening crawl",
    "play the {media} crawl"
  ]
}

When a user says something to my skill like one of the utterances above, I can be confident they are looking for the opening crawl video for a specific film.  I also have a slot called media, which uses a custom slot type named Media that contains a list of all of the films and shows I want my skill to be aware of.

{
  "values": [
    {"name": { "value": "Battlefront 2","synonyms": ["battlefront 2", "battlefront"]}},
    {"name": { "value": "Clone Wars","synonyms": ["the clone wars"]}},
    {"name": { "value": "Episode 1","synonyms": ["the phantom menace"]}},
    {"name": { "value": "Episode 2","synonyms": ["attack of the clones"]}},
    {"name": { "value": "Episode 3","synonyms": ["revenge of the sith"]}},
    {"name": { "value": "Episode 4","synonyms": ["a new hope", "new hope"]}},
    {"name": { "value": "Episode 5","synonyms": ["empire", "the empire strikes back", "empire strikes back"]}},
    {"name": { "value": "Episode 6","synonyms": ["return of the jedi", "jedi"]}},
    {"name": { "value": "Episode 7","synonyms": ["the force awakens", "force awakens"]}},
    {"name": { "value": "Episode 8","synonyms": ["the last jedi", "last jedi"]}},
    {"name": { "value": "Episode 9","synonyms": ["rise of skywalker", "the rise of skywalker"]}},
    {"name": { "value": "Rebels","synonyms": ["star wars rebels"]}},
    {"name": { "value": "Resistance","synonyms": ["star wars resistance"]}},
    {"name": { "value": "Rogue One","synonyms": ["rogue one a star wars story"]}},
    {"name": { "value": "Solo","synonyms": ["han solo movie", "solo a star wars story"]}},
    {"name": { "value": "The Mandalorian","synonyms": ["the mandalorian"]}}
  ],
  "name": "Media"
}

This slot lets me match the user’s request against the list of items my skill can handle, using Entity Resolution, so I can be confident I’m choosing the right video for their request.
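
For reference, here is a minimal sketch of how the resolved value can be read from the request, assuming the ask-sdk-core module used later in this post.  The getResolvedMedia helper name is my own, for illustration:

const Alexa = require('ask-sdk-core');

// Pull the canonical "media" value out of the Entity Resolution results,
// e.g. "the empire strikes back" resolves to "Episode 5".
function getResolvedMedia(handlerInput) {
  const slot = Alexa.getSlot(handlerInput.requestEnvelope, 'media');
  const authorities = (slot && slot.resolutions && slot.resolutions.resolutionsPerAuthority) || [];
  const match = authorities.find((a) => a.status.code === 'ER_SUCCESS_MATCH');
  if (match) {
    return match.values[0].value.name;
  }
  // No match: fall back to the raw text the user said.
  return slot ? slot.value : undefined;
}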

Playing A Video Using APL

For the code of my skill, I am using the Alexa Skills Kit SDK for Node.js.  This makes parsing the JSON that Alexa provides far easier, and gives me greater control over building responses for my users.

To add APL to my skill’s response, I do something like this:

var apl = require("./apl/videoplayer.json");
apl.document.mainTemplate.items[0].items[0].source = media.fields.Crawl;
handlerInput.responseBuilder.addDirective({
  type: 'Alexa.Presentation.APL.RenderDocument',
  token: '[SkillProvidedToken]',
  version: '1.0',
  document: apl.document,
  datasources: apl.datasources
});
return handlerInput.responseBuilder.getResponse();

Line #1 loads my APL document from a JSON file; this document is the markup that tells the screen what to show.  Line #2 dynamically updates the source of the video file to be played, so that the appropriate video plays for each request.

As you’ll see in the APL document below, we define a Video element, and include a source property that indicates a specific URL for our video.

The important lesson I learned when building this is that I don’t want to include any speech or reprompt in this response.  I send this APL document to the user’s device, which immediately starts playing the video.  This is completely counter-intuitive to everything I’ve ever considered when building an Alexa skill, but it makes sense.  I’m sending them a video to watch…not trying to continue our conversation.
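
Putting these pieces together, here’s a sketch of what the full intent handler can look like.  The findMedia helper is hypothetical, standing in for however you map the resolved slot value to a record containing the crawl video URL (getResolvedMedia is from the sketch earlier):

const CrawlIntentHandler = {
  canHandle(handlerInput) {
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'IntentRequest'
      && Alexa.getIntentName(handlerInput.requestEnvelope) === 'CrawlIntent';
  },
  handle(handlerInput) {
    // findMedia is a hypothetical lookup from the resolved slot value
    // to a record whose fields include the crawl video URL.
    const media = findMedia(getResolvedMedia(handlerInput));
    const apl = require('./apl/videoplayer.json');
    apl.document.mainTemplate.items[0].items[0].source = media.fields.Crawl;
    // Deliberately no .speak() or .reprompt() here; the video should just play.
    return handlerInput.responseBuilder
      .addDirective({
        type: 'Alexa.Presentation.APL.RenderDocument',
        token: '[SkillProvidedToken]',
        version: '1.0',
        document: apl.document,
        datasources: apl.datasources
      })
      .getResponse();
  }
};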

Adding an Event to the Video When It Is Finished

Finally, I had to do some exploration to figure out how to not only identify when the video has concluded, but also prompt my skill to speak to the user in order to continue the conversation.  This is done using the onEnd event on the Video element that we created earlier.  Here is the entire APL document.

{
  "document": {
    "type": "APL",
    "version": "1.1",
    "settings": {},
    "theme": "dark",
    "import": [],
    "resources": [],
    "styles": {},
    "onMount": [],
    "graphics": {},
    "commands": {},
    "layouts": {},
    "mainTemplate": {
      "parameters": [
        "payload"
      ],
      "items": [
        {
          "type": "Container",
          "height": "100%",
          "width": "100%",
          "items": [
            {
              "type": "Video",
              "width": "100%",
              "height": "100%",
              "autoplay": true,
              "source": "https://starwarsdatabank.s3.amazonaws.com/openingcrawl/Star+Wars+Episode+I+The+Phantom+Menace+Opening+Crawl++StarWars.com.mp4",
              "scale": "best-fit",
              "onEnd": [
                {
                  "type": "SendEvent",
                  "arguments": [
                    "VIDEOENDED"
                  ],
                  "components": [
                    "idForTheTextComponent"
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  },
  "datasources": {}
}

This is the second lesson I learned when building this.  By adding this onEnd event, when the video finishes playing, your skill receives a new request type: Alexa.Presentation.APL.UserEvent.  You will need to handle this new request type, and prompt the user to say something in order to continue the conversation.  I included the argument “VIDEOENDED” so that I could be confident I was handling the appropriate UserEvent.  Here is my example code for handling it:

const VideoEndedIntent = {
  canHandle(handlerInput) {
    // arguments[0] carries the 'VIDEOENDED' string sent by the SendEvent command.
    return Alexa.getRequestType(handlerInput.requestEnvelope) === 'Alexa.Presentation.APL.UserEvent'
      && handlerInput.requestEnvelope.request.arguments[0] === 'VIDEOENDED';
  },
  handle(handlerInput) {
    const actionQuery = "What would you like to know about next?";
    return handlerInput.responseBuilder
      .speak(actionQuery)
      .reprompt(actionQuery)
      .getResponse();
  }
};

With these few additions to my Alexa skill, I was able to play videos for my users and bring them back to the conversation once the video concludes.

Have you built anything using APL?  Have you published an Alexa skill?  I’d love to hear about it.  Share your creations in the comments!

Getting Alexa To Pronounce Ordinals

Today, I’m working on a project that requires Alexa to say things like “first,” “second,” or “twenty-first.”  I’ve gone through a few iterations of creating these ordinal strings.

First: Brute Force Attempt

I started the easy way: I created a hard-coded switch statement for the values from 1 – 10, and used a helper function to feed me the appropriate return value as a string.  Not the most elegant, but it got the job done.

Second: Slightly More Elegant and Scalable

As my application grew, I realized that I would now need the values from 1 – 50 available in my application.  I added to my switch statement…until I got to 15.  At that point, I realized I needed a new solution that could scale to any number I passed in.  So I started writing some logic to append “st” to numbers that ended in 1, “nd” to numbers that ended in 2, “rd” to numbers that ended in 3, and “th” to pretty much everything else.  I had to write some exception cases for 11, 12, and 13.
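
Here’s a minimal sketch of that suffix logic, reconstructed from the description above rather than my exact code:

// Append the ordinal suffix: 1 -> "1st", 22 -> "22nd", 13 -> "13th".
function ordinalSuffix(n) {
  const lastTwo = n % 100;
  // 11, 12, and 13 are the exceptions: they all take "th".
  if (lastTwo >= 11 && lastTwo <= 13) {
    return n + 'th';
  }
  switch (n % 10) {
    case 1: return n + 'st';
    case 2: return n + 'nd';
    case 3: return n + 'rd';
    default: return n + 'th';
  }
}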

It was at this point that I made an amazing discovery.

Third: Alexa is already too smart for me.

While playing with my second solution, I used the Voice Simulator that is available when you are building an Alexa skill.  I wanted to see if Alexa would pronounce the words the same if I just appended the suffixes like “th” or “nd” to the actual number value, rather than trying to convert the whole thing to a word.

This is where the discovery was made.

I tried getting her to say “4th,” and she pronounced it as I expected: “fourth.”

On a whim, I added “th” to the number 2, which would normally be incorrect.  She pronounced it “second.”  I had the same experience with “1th,” which she still got correct as “first.”

If you append “th” to the end of any number, Alexa will pronounce the appropriate ordinal.
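
With that discovery, the helper collapses to a one-liner.  A sketch, leaning on the pronunciations I observed above (the response usage is illustrative):

// Alexa pronounces "1th" as "first", "2th" as "second", and "4th" as "fourth".
const ordinal = (n) => n + 'th';

return handlerInput.responseBuilder
  .speak('You are ' + ordinal(2) + ' in line.') // spoken as "second"
  .getResponse();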

My mind was slightly blown today.  Thanks, Alexa.

My Frustrations with “Smart Home”

That’s not a fair title.  I LOVE the smart home movement.  I love being able to open/close my garage remotely.  I love having rooms light up as I walk into them.  I love concocting recipes on IFTTT to mash up my smart devices into even smarter experiences.  I love telling Alexa to control my home with only my voice.

“If it is 10:30pm, and the garage door is open, close it.”

What I don’t love, however, is that much of the experience and joy that smart devices are meant to create seems to have been designed exclusively for one person living alone.  Let me give you a few examples:

Smart Bulbs

[Image: Lifx Color 1000]

Smart lightbulbs can be controlled by my phone.  They can change colors, be turned on and off, and even dance to my music.  Amazing, right?  Where this story falls apart quickly, however, is the traditional light switch.

If I turn the bulb off from my phone, the light switch becomes non-functional.

If I turn the bulb off with the switch, I lose all of the “smart” features.

If I am a single person living by myself, I can consciously decide to control the bulbs only from my phone, and everything is harmonious.  Once you introduce roommates, a spouse, or, even worse, children, this entire experiment falls apart.  The consistency you require evaporates instantly.

Smart Plugs

[Image: WeMo Switch]

This is another example of power management that has so much potential.  Plug this into the wall, and now you can control a lamp, a stereo, or really anything else that requires power.  You can even set timers, so it’s an incredible way to manage those random lamps you have around your home!

That is, until someone turns that lamp off in the traditional way.

Now your smart plug is a $40 brick that can control nothing.  It’s incredibly frustrating, and most of the frustration comes from the fact that our homes are not constructed with the idea of a smart home in mind.

Geofencing

Geofencing might be one of the coolest ideas around when it comes to smart home functionality.

“When I pull into my driveway, open the garage, turn on the lights, set the thermostat to 71F, and turn on my favorite music.”

“When I am gone for more than 18 hours, set the entire house to away.  Light bulbs on timers, thermostat as low energy as possible, all doors closed and locked.”

If I lived by myself, This. Would. Be. Awesome.  Instead, it becomes an incredible way to scare my entire family to death as I dramatically announce my home arrival.  There HAS to be a better way.

Summary

In short, I love smart home stuff.  But as a software developer, my brain wants more.  Today, in our homes, we basically get the equivalent of a solitary IF statement.

IF I pull in the driveway, THEN do this stuff.

In order for this smart home stuff to be truly impressive (and accessible) to everyone, we need to be able to add as many conditions as we possibly can.

IF I pull in the driveway AND my family is home, THEN open the garage.
ELSEIF my family isn’t home, THEN open the garage AND turn the house up to eleven.

The smart home is still in its infancy for consumers.  If we want to make it more accessible, we need to provide this level of customization in an easy-to-use format.  IFTTT and Stringify have made huge strides here, but we still have a long way to go.

I, for one, look forward to the very near future.  This stuff is amazing, even if it’s also frustrating sometimes.