
Speech Recognition Server API Specification

Purpose

The purpose of this document is to share the Speech Recognition Server & Keyword Spotting API specification so that potential GoVivace customers can test their integration. The contents of this document are GoVivace proprietary and subject to change.

Introduction

The Speech Recognition Server is a real-time, full-duplex speech recognition server. Communication with the server is based on WebSockets. The client sends speech to the server in small chunks, while the server sends partial and full recognition hypotheses back to the client via the same WebSocket, thus enabling full-duplex communication. The client can be implemented in JavaScript, enabling browser-based speech recognition, or on mobile devices, tablets, or laptops using a WebSockets library.

Client-Server Protocol

This document describes the client-server protocol used by the system.

Opening a session

To open a session, connect to the specified server WebSocket address (e.g. wss://test.govivace.com:49155/client/ws/speech). The content type is generally determined by inspecting the header of the incoming stream. For best accuracy, send content with a sampling rate of 16 kHz or better, unless you have been told that it is a telephony ASR, in which case you can send audio at 8 kHz or better. A number of different codecs are supported. In particular, Ogg files (native to many HTML5 web browsers), WAV formats, A-law and u-law (in WAV containers), and MP3 are supported. If you are not sure whether your format is supported, email us an example and we will verify it for you.
Sometimes you may plan to send a raw PCM stream instead. In that case, it is possible to specify the data encoding in the URI at the time of opening the connection. For example:
content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)44100,+format=(string)S16LE,+channels=(int)1
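For instance, appended as a query string to the example server address above, a complete connection URI would look like this (the placement is shown for illustration):

wss://test.govivace.com:49155/client/ws/speech?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)44100,+format=(string)S16LE,+channels=(int)1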

Sending audio

The speech should be sent to the server in raw blocks of data, using the encoding specified when the session was opened or in the header of the connection. It is recommended that a new block be sent at least 4 times per second (less frequent blocks would increase recognition lag). Blocks do not have to be of equal size.
After the last block of speech data, a special 3-byte ANSI-encoded string “EOS” (“end-of-stream”) needs to be sent to the server. This tells the server that no more speech is coming and the recognition can be finalized.

After sending “EOS”, the client has to keep the WebSocket open to receive recognition results from the server. The server closes the connection itself when all recognition results have been sent to the client. No more audio can be sent via the same WebSocket after an “EOS” has been sent; in order to process a new audio stream, the client has to create a new WebSocket connection.
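As a minimal sketch of this send loop (Python with the third-party websockets package; the file name, chunk size, and pacing below are assumptions for the example):

import asyncio
import websockets

async def stream_audio():
    # Example server address from this document; adjust for your deployment.
    uri = "wss://test.govivace.com:49155/client/ws/speech"
    async with websockets.connect(uri) as ws:
        # Send the audio in small binary chunks, at least 4 per second.
        with open("english_test.wav", "rb") as audio:
            while chunk := audio.read(4000):
                await ws.send(chunk)
                await asyncio.sleep(0.25)  # pace the stream roughly in real time
        # Tell the server that no more speech is coming.
        await ws.send("EOS")
        # Keep the socket open; the server closes it after the last result.
        async for message in ws:
            print(message)

asyncio.run(stream_audio())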

Reading results

The server sends recognition results and other information to the client using the JSON format. The response can contain the following fields:

status: the response status (integer).
message: a message regarding the above status value (optional).
result: recognition result (optional), containing the following fields:
    hypotheses: recognized words, a list with each item containing the following:
        transcript: recognized words
        confidence: the confidence of the hypothesis (float, 0..1) (optional)
    final: true when the hypothesis is final, i.e., does not change any more

The following status codes (integer value) are currently in use:
0: Success. Usually used when recognition results are sent.
1: No speech. Sent when the incoming audio contains a large portion of silence or non-speech.
2: Aborted. Recognition was aborted for some reason.
9: Not available. Used when all recognizer processes are currently in use and recognition cannot be performed.

The WebSocket is always closed by the server after sending a non-zero status update.

Examples of server response

➢ {"status": 9}
➢ {"status": 0, "result": {"hypotheses": [{"transcript": "see on"}], "final": false}}
➢ {"status": 0, "result": {"hypotheses": [{"transcript": "this is awesome"}], "final": true}}
The server segments incoming audio on the fly. For each segment, many non-final hypotheses, followed by one final hypothesis, are sent. Non-final hypotheses are used to present partial recognition results to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment. After sending a final hypothesis for a segment, the server starts decoding the next segment, or closes the connection if all audio sent by the client has been processed.

The client is responsible for presenting the results to the user in a way suitable for the application.
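As an illustration, here is a minimal Python sketch of one way to handle these messages (the print-based display is only an example, not part of the API):

import json

def handle_message(raw):
    # Parse one JSON message received on the recognition WebSocket.
    response = json.loads(raw)
    if response["status"] != 0:
        # Non-zero status: the server will close the WebSocket.
        print("Error", response["status"], response.get("message", ""))
        return
    result = response.get("result")
    if not result:
        return
    transcript = result["hypotheses"][0]["transcript"]
    if result.get("final"):
        print("Final:", transcript)    # this segment will not change any more
    else:
        print("Partial:", transcript)  # may still change

# Example messages taken from the server responses above:
handle_message('{"status": 0, "result": {"hypotheses": [{"transcript": "see on"}], "final": false}}')
handle_message('{"status": 0, "result": {"hypotheses": [{"transcript": "this is awesome"}], "final": true}}')
handle_message('{"status": 9}')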

Alternative usage through HTTP API

One can also use the server through a very simple HTTP-based API. This allows you to simply send audio via a PUT or POST request to https://test.govivace.com:49155/client/dynamic/recognize and read the JSON output. Note that the JSON output is structured differently from the output of the WebSocket-based API.

The HTTP API supports chunked transfer encoding, which means that the server can read and decode the audio stream before the POST is complete.
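For instance, a minimal sketch of a chunked upload from Python using the third-party requests library, which switches to chunked transfer encoding when the request body is a generator (the file name and chunk size are assumptions):

import requests

def audio_chunks(path, chunk_size=4000):
    # Yielding the body in pieces makes requests use chunked transfer encoding.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

response = requests.post(
    "https://test.govivace.com:49155/client/dynamic/recognize",
    data=audio_chunks("english_test.wav"),
)
print(response.json())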

Example of sending audio to the server

curl -T test/data/english_test.wav "https://test.govivace.com:49155/client/dynamic/recognize" --cacert gd_bundle-g2-g1.crt

Response

{
  "status": 0,
  "result": {
    "final": true,
    "hypotheses": [
      {
        "speaking_rate": 1.1363637447357178,
        "transcript": "how can I help you today",
        "likelihood": 1.0,
        "formatted_transcript": " How can I help you today,",
        "word-alignment": [
          {"word": "how", "start": 3.8399999141693115, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "can", "start": 4.1100001335144043, "length": 0.20999999344348907, "confidence": 1.0},
          {"word": "I", "start": 4.3199996948242188, "length": 0.08999999612569809, "confidence": 1.0},
          {"word": "help", "start": 4.4099998474121094, "length": 0.23999999463558197, "confidence": 1.0},
          {"word": "you", "start": 4.6500000953674316, "length": 0.11999999731779099, "confidence": 1.0},
          {"word": "today", "start": 4.7699999809265137, "length": 0.50999999046325684, "confidence": 1.0}
        ]
      }
    ]
  },
  "segment": 1.0,
  "segment-start": 9.5759992599487305,
  "segment-length": 5.2799997329711914,
  "total-length": 15.720005989074707,
  "id": "07.30.2019_05.39.29_AM_telephony_9_24522360"
}

Send audio using chunked transfer encoding at the audio byte rate; decoding starts as soon as the first chunks have been received:
curl --header "Transfer-Encoding: chunked" --limit-rate 32000 -T test/data/english_test.ogg "https://test.govivace.com:49155/client/dynamic/recognize" --cacert gd_bundle-g2-g1.crt

Response

{
  "status": 0,
  "result": {
    "final": true,
    "hypotheses": [
      {
        "speaking_rate": 1.1363637447357178,
        "transcript": "how can I help you today",
        "likelihood": 1.0,
        "formatted_transcript": " How can I help you today,",
        "word-alignment": [
          {"word": "how", "start": 3.8399999141693115, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "can", "start": 4.1100001335144043, "length": 0.20999999344348907, "confidence": 1.0},
          {"word": "I", "start": 4.3199996948242188, "length": 0.08999999612569809, "confidence": 1.0},
          {"word": "help", "start": 4.4099998474121094, "length": 0.23999999463558197, "confidence": 1.0},
          {"word": "you", "start": 4.6500000953674316, "length": 0.11999999731779099, "confidence": 1.0},
          {"word": "today", "start": 4.7699999809265137, "length": 0.50999999046325684, "confidence": 1.0}
        ]
      }
    ]
  },
  "segment": 1.0,
  "segment-start": 9.5759992599487305,
  "segment-length": 5.2799997329711914,
  "total-length": 15.720005989074707,
  "id": "07.30.2019_05.39.29_AM_telephony_9_24522360"
}

Keyword spotting & corresponding intent through the HTTP or WS API

Our Speech Recognition API can also be used for finding particular keywords in the ASR transcripts. It will return the phrases that you want, the start and end times (in seconds) of the phrases, a score, and, if required, the intent. You have to pass the phrases as the validation_string in our API. If you also need a particular intent for a phrase, give it by separating phrase and intent with a comma. If your phrases contain spaces, replace them with %20 in the API call. To detect multiple phrases in a single API call, write all phrases pipe-separated (|) in the validation_string field. See the format list below, the URL-building sketch that follows it, and the examples for more information.
Note: This API works on .wav (8 kHz, 16-bit signed PCM, mono channel) audio only.

➢ For HTTP request with intent:
"https://test.govivace.com:49155/telephony?validation_string=users_phrases,intent"
➢ If intent is not required, use:
"https://test.govivace.com:49155/telephony?validation_string=users_phrases"
➢ For multiple phrases:
"https://test.govivace.com:49155/telephony?validation_string=phrase1|phrase2|phrase3"
➢ For WS request with intent:
"wss://test.govivace.com:49155/telephony?validation_string=users_phrases,intent"
➢ If intent is not required, use:
"wss://test.govivace.com:49155/telephony?validation_string=users_phrases"
➢ For multiple phrases:
"wss://test.govivace.com:49155/telephony?validation_string=phrase1|phrase2|phrase3"
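For example, a minimal Python sketch of building such a URL (the phrase and intent come from the examples below; urllib.parse.quote performs the %20 replacement described above):

from urllib.parse import quote

base = "https://test.govivace.com:49155/telephony"

# Each entry is "phrase,intent"; multiple entries are pipe-separated.
pairs = ["please record your message,answering machine"]
validation_string = "|".join(quote(p, safe=",|") for p in pairs)
url = base + "?validation_string=" + validation_string
print(url)
# https://test.govivace.com:49155/telephony?validation_string=please%20record%20your%20message,answering%20machine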

Some examples for the formats given above

1. For HTTP with intent:
curl --request POST --data-binary "@audio.wav" "https://test.govivace.com:49155/telephony?key=xxxxxxxxxxxxxxx&validation_string=please%20record%20your%20message,answering%20machine"
Here the phrase is 'please record your message' and 'answering machine' is the intent of that phrase.
2. For HTTP without intent:
curl --request POST --data-binary "@audio.wav" "https://test.govivace.com:49155/telephony?key=xxxxxxxxxxxxxxxxx&validation_string=please%20record%20your%20message"

Response
{
  "status": 0,
  "result": {
    "final": true,
    "hypotheses": [
      {
        "speaking_rate": 3.5897436141967773,
        "transcript": "at the tone please record your message when you've finished recording you may hang up or press one for more options",
        "likelihood": 1.0,
        "word-alignment": [
          {"word": "at", "start": 0.14999999105930328, "length": 0.14999999105930328, "confidence": 1.0},
          {"word": "the", "start": 0.29999998211860657, "length": 0.11999999731779099, "confidence": 1.0},
          {"word": "tone", "start": 0.44999998807907104, "length": 0.32999998331069946, "confidence": 1.0},
          {"word": "please", "start": 0.81000000238418579, "length": 0.23999999463558197, "confidence": 1.0},
          {"word": "record", "start": 1.0499999523162842, "length": 0.38999998569488525, "confidence": 1.0},
          {"word": "your", "start": 1.4399999380111694, "length": 0.08999999612569809, "confidence": 1.0},
          {"word": "message", "start": 1.5299999713897705, "length": 0.38999998569488525, "confidence": 1.0},
          {"word": "when", "start": 2.1299998760223389, "length": 0.17999999225139618, "confidence": 1.0},
          {"word": "you've", "start": 2.309999942779541, "length": 0.11999999731779099, "confidence": 1.0},
          {"word": "finished", "start": 2.4600000381469727, "length": 0.32999998331069946, "confidence": 1.0},
          {"word": "recording", "start": 2.7899999618530273, "length": 0.44999998807907104, "confidence": 1.0},
          {"word": "you", "start": 3.2400000095367432, "length": 0.08999999612569809, "confidence": 1.0},
          {"word": "may", "start": 3.3299999237060547, "length": 0.14999999105930328, "confidence": 1.0},
          {"word": "hang", "start": 3.4800000190734863, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "up", "start": 3.75, "length": 0.17999999225139618, "confidence": 1.0},
          {"word": "or", "start": 4.0199999809265137, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "press", "start": 4.3199996948242188, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "one", "start": 4.679999828338623, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "for", "start": 4.9499998092651367, "length": 0.14999999105930328, "confidence": 1.0},
          {"word": "more", "start": 5.0999999046325684, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "options", "start": 5.369999885559082, "length": 0.47999998927116394, "confidence": 1.0}
        ],
        "keywords": [
          {
            "phrase": "please record your message",
            "start": 0.80999994277954102,
            "end": 1.9199999570846558,
            "score": 1.0,
            "intent": "answering machine"
          }
        ]
      }
    ]
  },
  "segment": 1.0,
  "segment-start": 10.9320068359375,
  "segment-length": 5.8499999046325684,
  "total-length": 18.100021362304688,
  "id": "09.22.2018_04.05.41_PM_telephony_9_10375308"
}
Note: The "keywords" field in the response contains the keyword-spotting results.

Assigning a unique ID to an audio file and its response from the ASR API

Our Speech Recognition API can also assign a unique ID to the input audio and its response, which is useful for IVR systems, data analysis, call dashboard systems, etc. In our ASR server API there is a field called "session_id", in which you can assign any string to the audio response for identification and uniqueness.
1. For HTTP:
"https://test.govivace.com:49155/telephony?session_id=usr_assignment"
2. For WS:
"wss://test.govivace.com:49155/telephony?session_id=usr_assignment"
Here, in place of usr_assignment you can give your own ID or string.
The format and response are explained in the example below:
curl --request POST --data-binary "@audio.wav" "https://test.govivace.com:49155/telephony?key=xxxxxxxxxxxxxxxxxx&session_id=10375308"
Response:
{
  "status": 0,
  "result": {
    "final": true,
    "hypotheses": [
      {
        "speaking_rate": 3.5897436141967773,
        "transcript": "at the tone please record your message when you've finished recording you may hang up or press one for more options",
        "likelihood": 1.0,
        "word-alignment": [
          {"word": "at", "start": 0.14999999105930328, "length": 0.14999999105930328, "confidence": 1.0},
          {"word": "the", "start": 0.29999998211860657, "length": 0.11999999731779099, "confidence": 1.0},
          {"word": "tone", "start": 0.44999998807907104, "length": 0.32999998331069946, "confidence": 1.0},
          {"word": "please", "start": 0.81000000238418579, "length": 0.23999999463558197, "confidence": 1.0},
          {"word": "record", "start": 1.0499999523162842, "length": 0.38999998569488525, "confidence": 1.0},
          {"word": "your", "start": 1.4399999380111694, "length": 0.08999999612569809, "confidence": 1.0},
          {"word": "message", "start": 1.5299999713897705, "length": 0.38999998569488525, "confidence": 1.0},
          {"word": "when", "start": 2.1299998760223389, "length": 0.17999999225139618, "confidence": 1.0},
          {"word": "you've", "start": 2.309999942779541, "length": 0.11999999731779099, "confidence": 1.0},
          {"word": "finished", "start": 2.4600000381469727, "length": 0.32999998331069946, "confidence": 1.0},
          {"word": "recording", "start": 2.7899999618530273, "length": 0.44999998807907104, "confidence": 1.0},
          {"word": "you", "start": 3.2400000095367432, "length": 0.08999999612569809, "confidence": 1.0},
          {"word": "may", "start": 3.3299999237060547, "length": 0.14999999105930328, "confidence": 1.0},
          {"word": "hang", "start": 3.4800000190734863, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "up", "start": 3.75, "length": 0.17999999225139618, "confidence": 1.0},
          {"word": "or", "start": 4.0199999809265137, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "press", "start": 4.3199996948242188, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "one", "start": 4.679999828338623, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "for", "start": 4.9499998092651367, "length": 0.14999999105930328, "confidence": 1.0},
          {"word": "more", "start": 5.0999999046325684, "length": 0.26999998092651367, "confidence": 1.0},
          {"word": "options", "start": 5.369999885559082, "length": 0.47999998927116394, "confidence": 1.0}
        ],
        "keywords": [
          {
            "phrase": "please record your message",
            "start": 0.80999994277954102,
            "end": 1.9199999570846558,
            "score": 1.0,
            "intent": "answering machine"
          }
        ]
      }
    ]
  },
  "segment": 1.0,
  "segment-start": 10.9320068359375,
  "segment-length": 5.8499999046325684,
  "total-length": 18.100021362304688,
  "id": "09.22.2018_04.05.41_PM_telephony_9_10375308"
}

Note: See the "id" field in the response. It contains the date (mm.dd.yyyy), the time (hh.mm.ss_AM/PM), the uri_id (grammarname_id), and the session_id (usr_uniqueid), with all fields separated by underscores (_).
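For instance, a minimal Python sketch of splitting such an "id" back into its parts (the variable names are just labels for this illustration):

# id format: date_time_AM/PM_grammarname_uriid_sessionid, underscore-separated
response_id = "09.22.2018_04.05.41_PM_telephony_9_10375308"
date, time, meridiem, grammar_name, uri_id, session_id = response_id.split("_")
print(date)                  # 09.22.2018
print(time, meridiem)        # 04.05.41 PM
print(grammar_name, uri_id)  # telephony 9
print(session_id)            # 10375308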

Find Status

We provide a feature to check the status of the server. The total number of allowed clients is set in the server's authentication file, and this status gives you the number of currently available clients. To find the number of available clients, make a request to the server with the keyword 'status'.
1. For HTTP:
"https://test.govivace.com:49155/status"
2. For WS:
"wss://test.govivace.com:49155/status"
The format and response are explained in the example below:

curl --request PUT "https://test.govivace.com:49155/status"
Response:

200
Available clients : 10

Note: The "Available clients" line in the response shows the number of currently available clients.

Closing a session

By default, the session closes once the audio file has been uploaded completely. A different situation arises when the number of allowed clients is small and there are more session-opening requests than the total number of allowed clients. In this case, the server immediately disconnects the newly requested session with a status code and result.
This immediate disconnect only works with a key that enables the feature; we will provide a key for your API and assign the total number of allowed clients accordingly.
In the response, only two fields will be present: status & result.
For example, suppose the total number of allowed clients is 4. If 4 clients are already connected to the server and you try to connect a 5th client, the server will immediately disconnect this 5th client because all 4 allowed clients are in use; you cannot connect another client until one of these 4 clients is free.
Response
{
  "status": 0,
  "result": {
    "final": true
  }
}

Dying Gasp

If silence is sent on the audio stream and the final output does not contain any words, the server returns a dying-gasp value on disconnect: an empty message confirming that it received the audio. The message is shown below.

Response:
{
  "status": 0,
  "result": {
    "final": true
  }
}