Building Cross-Platform Voice Assistants

You’ve probably experienced it already: building a cross-platform voice app is hard.

Managing multiple codebases is a nightmare for developers.

Not anymore! Jovo is the first open-source framework that lets you build voice apps for Amazon Alexa and Google Assistant with one codebase.

Now that we have Jovo, what should we do with it?

We got the idea to rebuild one of our Alexa skills, Broken Sequence, using Jovo.

This article will show you how we did it. Apologies for the length!

What we are building

We are building a voice app that gives the user an incomplete sequence of words and asks them to find the missing last word. An example looks like this:

Sun: Day
Moon: _____

For every correct answer, the user gets a coin as a reward. If they answer 5 consecutive sequences correctly, they get 5 extra coins as a bonus.

We will be using the Alexa Presentation Language (APL) features for the Alexa version of our voice app and a voice-only experience for the Google Assistant version.

Install Jovo CLI and Create Project

The Jovo Command Line Interface makes development easier. Open a terminal and install the Jovo CLI as shown below:

npm install -g jovo-cli

Now we are going to create a Jovo project. We will choose the “Hello World” template, which creates the project with all the necessary dependencies. In the command line, enter the command below:

jovo new Broken-Sequence

This will create a new project named ‘Broken-Sequence’ on your machine, in a folder of the same name. This is what a typical Jovo project looks like:

 models/
   └── en-US.json
 src/
   ├── app.js
   ├── config.js
   └── index.js
 project.js

You may also see some additional folders and files like node_modules, test, db etc., but the basic structure and the core logic live in the folders and files specified above. Let’s run the basic app to see how it works.

Go back to the terminal, change into the ‘Broken-Sequence’ folder, and enter:

jovo run

You should now see the following message in the terminal:

Local server listening on port 3000.

This is your webhook url: https://webhook.jovo.cloud/851*********3178

Go to the webhook URL and you will see the Jovo Debugger. We can test our app within the debugger and inspect the requests and responses and their formats. We will use this feature in the following sections to test our apps.

Build The Language Model

Now, we are going to build the language model for our voice app. Open the file models --> en-US.json. You can see that this is the only file within the models folder.

This folder holds the language-specific user experience of the voice app. In the JSON file, you can see a field named “invocation”, whose value is “my test app” by default.

This is the name that users will use to invoke or open your voice app, so we give it a more meaningful and appropriate name: “broken sequence”. Please note that the invocation name should not be longer than three words and should be easy to pronounce.
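
In en-US.json, this just means changing the invocation field at the top of the file to something like:

    "invocation": "broken sequence",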

You can also see that there are two intents defined: "HelloWorldIntent" and "MyNameIsIntent". These are user-defined intents. The "HelloWorldIntent" is triggered when the user invokes the skill; it asks for their name and then waits for the user to say it.

When the user says their name, the second intent gets triggered and welcomes them by name. You can see how the responses are generated when these intents are triggered in the file src --> app.js.

We don’t need these intents; we need to ask the user for sequences. So, delete all the intents (everything within the square brackets of the intents field in the language model).

Create a new intent named “AskIntent”, which asks the user a new sequence when they are ready to begin or continue playing the game.

    {
        "name": "AskIntent",
        "phrases": [
            "yes",
            "start",
            "sure",
            "start playing",
            "Play the Game",
            "start the game"
        ]
    }

In the "AskIntent" handler, we ask a sequence to the user. When the user says an answer, we need an intent to catch their response. To do so, we are defining our second intent "AnswerIntent".

The "AnswerIntent" has a slot (argument of the intent) which captures the answer value. So the "AnswerIntent" definition would be:

    {
        "name": "AnswerIntent",
        "phrases": [
            "{Answer}",
            "the answer is {Answer}",
            "my answer is {Answer}",
            "answer is {Answer}",
            "select {Answer}",
            "lock on {Answer}",
            "that was {Answer}",
            "it is {Answer}",
            "is it {Answer}",
            "I think it is {Answer}"
        ],
        "inputs": [
            {
                "name": "Answer",
                "type": "ANSWER"
            }
        ]
    }

In this definition, we can see an additional field compared to the previous intent: inputs. We created a slot named "Answer" and declared its type as "ANSWER", a custom slot type.

Every slot (built-in or custom) used in an intent needs to be declared within the inputs field. So far we have defined "AnswerIntent", but we have not yet defined the custom slot type "ANSWER".

Custom slot types are defined outside the intents list. So, outside the intents list, define the type and provide the values that apply to the Answer slot:

"inputTypes": [
    {
        "name": "ANSWER",
        "values": [
            {
                "value": "I don't know"
            },
            {
                "value": "don't know"
            },
            {
                "value": "Moon"
            },
            {
                "value": "FIFA"
            },
            {
                "value": "Facebook"
            },
            {
                "value": "Bath"
            },
            {
                "value": "April First",
                "synonyms": [
                    "1st of April",
                    "April 1st",
                    "April 1",
                    "April one"
                 ]
            },
            {
                "value": "Ball"
            },
            {
                "value": "Protein"
            }
        ]
    }
]

We need to add all possible values that the slot "Answer" can have. For simplicity, the snippet above contains only a sample of values.

Please refer to the complete language model for a more elaborate list of values. Now we need two more intents, "HelpIntent" and "StopIntent", defined as:

    {
        "name": "StopIntent",
        "phrases": [
            "no",
            "nope",
            "stop"
        ]
    },
    {
        "name": "HelpIntent",
        "phrases": [
             "help",
             "help me"
        ]
    }

Keep the other fields (the platform-specific intents) as they are.

Now we need to declare that we are using the Alexa Presentation Language interface. Open the project.js file in the root folder and add the lines below to the alexaSkill field.

     nlu: 'alexa',
     manifest: {
       apis: {
         custom: {
           interfaces: [
             {
               type: 'ALEXA_PRESENTATION_APL'
             }
           ]
         }
       }
     }
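
For context, these lines sit inside the alexaSkill field of project.js; the surrounding structure looks roughly like this, with the rest of the file left as generated by the template:

     module.exports = {
       alexaSkill: {
         nlu: 'alexa',
         manifest: {
           // ...the apis/custom/interfaces block shown above
         }
       },
       // ...googleAction and endpoint stay as generated by the template
     };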

We also need storage that keeps the user information (the total coins they have earned and a sequence counter). So let’s use MongoDB as the storage solution.

Create a MongoDB database with any of the providers (https://mlab.com/ is a good place) and copy the connection URI.

Go to the file src --> config.js and add these:

   db: {
     MongoDb: {
       databaseName: '<Name_of_the_db>',
       collectionName: 'UserData',
       uri: '<Your_URI>'
     }
   }
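
Depending on your Jovo setup, you may also need to install the MongoDB integration (npm install --save jovo-db-mongodb) and register it in src/app.js. A minimal sketch:

   // src/app.js -- registering the MongoDB integration (sketch)
   const { MongoDb } = require('jovo-db-mongodb')

   app.use(
     // ...the platform and debugger plugins already registered by the template,
     new MongoDb()
   )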

Writing Intent Handlers

Now we are going to write handlers for these intents. The handlers define the logical operations that the skill performs. We split the handlers into two files: one for Alexa and one for Google Assistant.

The APL files for the Alexa responses (both the APL design and data files) go in a separate folder named "apl". In the app.js file, we register these handlers like this:

 const alexaHandlers = require('./alexa_handlers.js')
 const googleAssistantHandlers = require('./google_handlers.js')
 app.setAlexaHandler(alexaHandlers)
 app.setGoogleAssistantHandler(googleAssistantHandlers)
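
Each handler file simply exports an object that maps intent names to handler functions. A rough sketch of the shape, with the intent bodies filled in over the next sections:

   // src/alexa_handlers.js -- rough shape only
   module.exports = {
      LAUNCH() {
         // welcome or welcome-back message, plus the APL directive
      },
      AskIntent() {
         // read the next sequence and ask it with SSML breaks
      },
      AnswerIntent() {
         // compare the user's answer and award coins
      }
   }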

When the skill is invoked, we want to give a welcome message to users who are coming for the first time and a welcome-back message to those who have already had a conversation with the skill. For this, we store the user details in the DB. When a user comes in, we check whether any details already exist for that user.

Storing and retrieving data across conversations can be done in Jovo as:

 this.$user.$data.<Your_Variable> = value
 var name = this.$user.$data.<Variable_Name>

So, checking whether the user is new or returning can be done like this:

 if (this.$user.$data.newUser === undefined) {
    this.$user.$data.newUser = true
    var reply = 'Welcome to Broken Sequence.'
    var prompt = 'Are you ready to take your first sequence?'
 }
 else {
    var reply = 'Welcome back to Broken Sequence.'
    var prompt = 'Are you ready to take your next sequence?'
 }

Now, we have to display APL in Alexa and give a speech-only response in Google Assistant. So, in the Alexa handler, add the display directive to the "LAUNCH" handler like this:

 this.$alexaSkill.addDirective({
       type: 'Alexa.Presentation.APL.RenderDocument',
       version: '1.0',
       document: require(`./apl/welcome.json`),
       datasources: require(`./apl/welcome_data.json`)
 })
 this.ask(reply + prompt, prompt)

In the Google Assistant handler for "LAUNCH", the reply is generated like:

this.ask(reply + prompt, prompt)

For a new user, in the "LAUNCH" handler, we set the user data parameter counter to one, like this:

this.$user.$data.counter = 1

We will use this counter to pick the sequence item from a list of sequences stored in a JSON file.
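
Putting these pieces together, the Alexa "LAUNCH" handler could look roughly like the sketch below; it simply combines the snippets above (the Google Assistant version is the same minus the APL directive):

 LAUNCH() {
    var reply
    var prompt
    if (this.$user.$data.newUser === undefined) {
       // first visit: remember the user and start the sequence counter
       this.$user.$data.newUser = true
       this.$user.$data.counter = 1
       reply = 'Welcome to Broken Sequence. '
       prompt = 'Are you ready to take your first sequence?'
    } else {
       reply = 'Welcome back to Broken Sequence. '
       prompt = 'Are you ready to take your next sequence?'
    }
    // render the welcome screen on the Alexa device
    this.$alexaSkill.addDirective({
       type: 'Alexa.Presentation.APL.RenderDocument',
       version: '1.0',
       document: require('./apl/welcome.json'),
       datasources: require('./apl/welcome_data.json')
    })
    this.ask(reply + prompt, prompt)
 }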

Now we move on to the next handler, the "AskIntent" handler. Here, we ask the sequence using SSML. SSML lets you control the way Alexa and Google Assistant speak a response to the user; here, we insert a break between each word in the sequence.

This is how we do it:

'<speak>' + words[0] + '<break time="0.50s"/>' + words[1] + '<break time="0.50s"/>' + words[2] + '<break time="1s"/>' + 'what will come next?' + '</speak>'

We also set another parameter to indicate that the user has been given a sequence to answer:

this.$user.$data.questionMode = true
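
An "AskIntent" handler along these lines could look like the sketch below. The sequences.json file name and its structure (a words array plus an answer per entry) are assumptions for illustration:

 // src/alexa_handlers.js -- AskIntent sketch
 // assumes a hypothetical ./sequences.json like:
 // [ { "words": ["Sun", "Day", "Moon"], "answer": "Night" }, ... ]
 AskIntent() {
    const sequences = require('./sequences.json')
    const counter = this.$user.$data.counter || 1
    const words = sequences[(counter - 1) % sequences.length].words
    const speech = '<speak>' + words[0] + '<break time="0.50s"/>'
       + words[1] + '<break time="0.50s"/>' + words[2]
       + '<break time="1s"/>' + 'what will come next?' + '</speak>'
    // flag that the user now has a sequence to answer
    this.$user.$data.questionMode = true
    this.ask(speech, 'What will come next?')
 }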

Now the user says an answer word. If the user says the correct answer, the "AnswerIntent" will be triggered, since all the correct-answer values are trained into the "Answer" slot. We then compare the correct answer with the one the user said.

If they are the same, we give them a coin. If the user says a wrong answer that is not trained among our slot's sample values, it may trigger the fallback intent instead. In the fallback handler, we check whether "questionMode" is true. If so, we give the wrong-answer response; otherwise, a normal fallback response is given.
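
Here is a sketch of what the answer check could look like. The sequences.json file and the coins and streak field names are illustrative assumptions, not the original implementation; the same questionMode check shown here is what the fallback handler would use to decide between a wrong-answer response and a generic fallback:

 // src/alexa_handlers.js -- AnswerIntent sketch (field names are illustrative)
 AnswerIntent() {
    const sequences = require('./sequences.json')
    const counter = this.$user.$data.counter || 1
    const correctAnswer = sequences[(counter - 1) % sequences.length].answer
    const userAnswer = this.$inputs.Answer.value
    this.$user.$data.questionMode = false
    var reply
    if (userAnswer && userAnswer.toLowerCase() === correctAnswer.toLowerCase()) {
       // one coin per correct answer, plus a 5-coin bonus for 5 in a row
       this.$user.$data.coins = (this.$user.$data.coins || 0) + 1
       this.$user.$data.streak = (this.$user.$data.streak || 0) + 1
       if (this.$user.$data.streak % 5 === 0) {
          this.$user.$data.coins += 5
       }
       reply = 'Correct! You earned a coin. '
    } else {
       this.$user.$data.streak = 0
       reply = 'Sorry, that is not correct. '
    }
    this.$user.$data.counter = counter + 1
    this.ask(reply + 'Ready for the next sequence?', 'Ready for the next sequence?')
 }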

Test the Voice App

We can test the app by running it locally with Jovo. Run the following command:

jovo run

This will start the test server. Copy the endpoint URL. Go to the Alexa Developer Console and select the Endpoint menu from the left panel. Alexa supports two types of endpoints: "Lambda" and "HTTPS".

Select the "https" option and paste the URL there and save the changes.

Now go to "Dialogflow" console and select the menu option "Fulfillment". Use the URL we copied in the URL field there and click save button.

Now we have set up the endpoints for both Alexa and the Google Action. Go to the Alexa console and select the Test tab at the top. Type “open broken sequence” in the text area of the testing console and hit Enter.

This invokes the Alexa skill.

Make sure you have selected the "Skill I/O" and "Device Display" options so that you can see the request Alexa received and the response and parameters it set. Scroll down the window and you can see the APL that is rendered in response to the invocation of the skill.

You can continue testing within the console, and you can also try it with your voice. Now go to the Actions Console (https://console.actions.google.com), select the action, and open the Simulator, where you can test the action we created.

Continue testing the apps in the Alexa and Actions test consoles and make sure everything works as expected. Then we can proceed to deploying the app. If you face any problems with the endpoint URL, run the server with the command below and use the URL it gives you:

jovo run --bst-proxy

Deploy the Voice App

The next step is to build the platform-specific models for our voice app. For that, run:

jovo build

You will see a new folder named “platforms” with two sub-folders, one for the Alexa skill and one for the Google Action. Now we need to deploy the app to the platforms. To deploy our app to Alexa, we first need to set up the ASK CLI:

npm install -g ask-cli

ask init

You can find documentation on how to install and set up the ASK CLI in the ASK CLI reference. After setting it up, enter the following command:

jovo deploy

Now you can see that the skill has been created and enabled in the Alexa Developer Console.

For Google Assistant, go to https://console.dialogflow.com and create an agent. Then go to Settings --> Export and Import, click the "RESTORE FROM ZIP" button, and select the zip file platforms --> googleAction --> dialogflow_agent.zip. Click the Save button, and the model is imported into the agent.

Deploy Source Code with AWS Lambda

Now, we are going to deploy our source code to AWS Lambda. Refer to the tutorial at https://www.jovo.tech/tutorials/host-google-action-on-lambda to create the Lambda function and connect it to the Dialogflow fulfillment. Then go to the Alexa Developer Console, change the endpoint type to "Lambda", and use the ARN value to connect with the Lambda function.
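
Note that the index.js generated by the "Hello World" template already exports a Lambda handler, so the source code itself should not need changes for Lambda; the relevant part looks roughly like this:

 // src/index.js -- Lambda entry point (as generated by the template, sketch)
 const { Lambda } = require('jovo-framework')
 const { app } = require('./app.js')

 exports.handler = async (event, context, callback) => {
    await app.handle(new Lambda(event, context, callback))
 }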

You can refer to the complete source code here.

About Author

Jyothish G

Chatbot and Voicebot Developer