Speech Recognition Server and Keyword Spotting Specification with Sentiment

Purpose

This is a real-time, full-duplex speech recognition server. Communication with the server is based on WebSockets. The client sends speech to the server in small chunks, while the server sends partial and full recognition hypotheses back to the client over the same WebSocket, which makes the communication full-duplex. The client can be implemented in JavaScript, enabling browser-based speech recognition, or on a mobile phone, tablet, or laptop using a WebSocket library.

Client-server protocol

This document describes the client-server protocol used by the system.

Opening a session

To open a session, connect to the specified server WebSocket address (e.g. wss://services.govivace.com:49149/telephony?key=xxxxxxxxxxxxxxx).

Sending audio

The speech should be sent to the server in raw blocks of data, using the encoding specified when the session was opened or in the header of the connection. It is recommended that a new block is sent at least 4 times per second (less frequent blocks would increase the recognition lag). Blocks do not have to be of equal size.

After the last block of speech data, a special 3-byte ANSI-encoded string “EOS” (“end-of-stream”) needs to be sent to the server. This tells the server that no more speech is coming and the recognition can be finalized.

After sending “EOS”, the client has to keep the WebSocket open to receive recognition results from the server. The server closes the connection itself when all recognition results have been sent to the client. No more audio can be sent via the same WebSocket after an “EOS” has been sent. In order to process a new audio stream, a new WebSocket connection has to be created by the client.
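
The sending rules above can be sketched in Python without committing to a particular WebSocket library. This is a minimal sketch, not the reference client: `iter_blocks` and `stream_audio` are hypothetical helper names, `send` stands for whatever binary-send function the chosen library provides, and sending the marker as the bytes b"EOS" is an assumption about how the 3-byte string is framed.

```python
import time
from typing import Callable, Iterator

EOS = b"EOS"  # 3-byte end-of-stream marker (framing as raw bytes is an assumption)


def iter_blocks(audio: bytes, byte_rate: int, blocks_per_second: int = 4) -> Iterator[bytes]:
    """Yield raw audio blocks sized so that `blocks_per_second` blocks are sent per second."""
    block_size = max(1, byte_rate // blocks_per_second)
    for offset in range(0, len(audio), block_size):
        yield audio[offset:offset + block_size]


def stream_audio(send: Callable[[bytes], None], audio: bytes, byte_rate: int,
                 blocks_per_second: int = 4,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Send every block at roughly `byte_rate` bytes/sec, then the EOS marker.

    Returns the number of audio blocks sent (the EOS marker is not counted).
    """
    count = 0
    for block in iter_blocks(audio, byte_rate, blocks_per_second):
        send(block)
        count += 1
        sleep(len(block) / byte_rate)  # pace the stream to the requested rate
    send(EOS)  # no more audio may follow on this connection
    return count
```

After `stream_audio` returns, the caller would keep the connection open and read results until the server closes it.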

Usage

python client.py --uri "wss://services.govivace.com:49149/telephony?key=xxxxxxxxxxxxxxx" --save-json-filename sample1_transcription.json --rate 4200 sample1.wav

Options

  • --save-json-filename: Save the intermediate JSON to the specified file
  • --rate: Rate in bytes/sec at which audio should be sent to the server
  • --uri: Server WebSocket URI

Reading results

The server sends recognition results and other information to the client using the JSON format. The response can contain the following fields:

  • status: response status (integer), see codes below
  • message: (optional) status message
  • result: (optional) recognition result, containing the following fields:
      • hypotheses: recognized words, a list with each item containing the following:
          • transcript: recognized words
          • confidence: (optional) confidence of the hypothesis (float, 0..1)
      • final: true when the hypothesis is final, i.e., doesn’t change anymore

The following status codes are currently in use:

  • 0 — Success. Usually used when recognition results are sent
  • 1 — No speech. Sent when the incoming audio contains a large portion of silence or non-speech
  • 2 — Aborted. Recognition was aborted for some reason.
  • 9 — Not available. Used when all recognizer processes are currently in use and recognition cannot be performed.

The WebSocket is always closed by the server after it sends a non-zero status update.
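
As an illustration of the fields and status codes above, the following Python sketch turns one server message into a small summary. The function name `parse_response` and the shape of the returned dictionary are hypothetical, and treating the first item of `hypotheses` as the preferred one is an assumption.

```python
import json

# Status codes as listed above.
STATUS_NAMES = {0: "success", 1: "no speech", 2: "aborted", 9: "not available"}


def parse_response(raw: str) -> dict:
    """Summarize one JSON message from the server."""
    msg = json.loads(raw)
    status = msg["status"]
    result = msg.get("result") or {}          # "result" is optional
    hypotheses = result.get("hypotheses") or []
    best = hypotheses[0] if hypotheses else {}  # assumption: first item is preferred
    return {
        "status": status,
        "status_name": STATUS_NAMES.get(status, "unknown"),
        "message": msg.get("message"),        # optional status message
        "transcript": best.get("transcript"),
        "confidence": best.get("confidence"),  # optional, float in 0..1
        "final": bool(result.get("final", False)),
        # The server closes the WebSocket after any non-zero status.
        "server_will_close": status != 0,
    }
```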

Examples of server responses

{
  "status": 0,
  "segment-start": 9.487499237060547,
  "segment-length": 5.87999963760376,
  "total-length": 15.735008239746094,
  "sentiment": "Positive",
  "result": {
    "hypotheses": [
      {
        "likelihood": 0.9973933100700378,
        "speaking_rate": 1.020408272743225,
        "transcript": "how can I help you today",
        "word-alignment": [
          {"start": 0.0, "length": 4.289999961853027, "word": "how", "confidence": 1.0},
          {"start": 4.289999961853027, "length": 0.08999999612569809, "word": "can", "confidence": 1.0},
          {"start": 4.380000114440918, "length": 0.05999999865889549, "word": "I", "confidence": 0.9843596816062927},
          {"start": 4.440000057220459, "length": 0.26999998092651367, "word": "help", "confidence": 1.0},
          {"start": 4.739999771118164, "length": 0.11999999731779099, "word": "you", "confidence": 1.0},
          {"start": 4.859999656677246, "length": 1.0199999809265137, "word": "today", "confidence": 1.0}
        ],
        "formatted_transcript": " How can I help you today,"
      }
    ],
    "final": true
  },
  "segment": 1.0,
  "id": "03.26.2019_06.49.47_AM_telephony_9_92550624"
}

The sentiment field can be Neutral, Positive, or Negative.

{"status": 9}

{"status": 0, "result": {"hypotheses": [{"transcript": "see on"}], "final": false}}

{"status": 0, "result": {"hypotheses": [{"transcript": "see on teine lause."}], "final": true}}

The server segments incoming audio on the fly. For each segment, it sends many non-final hypotheses followed by one final hypothesis. Non-final hypotheses are used to present partial recognition results to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment. After sending the final hypothesis for a segment, the server starts decoding the next segment, or closes the connection if all audio sent by the client has been processed.
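
One way a client might track this sequence of partial and final hypotheses is sketched below in Python. The helper name `collect_transcript` is hypothetical, and the sketch assumes the first item of `hypotheses` is the one to display.

```python
import json
from typing import Iterable, List, Optional, Tuple


def collect_transcript(messages: Iterable[str]) -> Tuple[List[str], Optional[str]]:
    """Return (final transcript of each finished segment, pending partial or None)."""
    finals: List[str] = []
    partial: Optional[str] = None
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("status") != 0:
            break  # a non-zero status ends the session
        result = msg.get("result")
        if not result or not result.get("hypotheses"):
            continue  # nothing to show for this message
        text = result["hypotheses"][0]["transcript"]
        if result.get("final"):
            finals.append(text)  # segment is done; its text will not change
            partial = None
        else:
            partial = text       # overwrite the previous partial for this segment
    return finals, partial
```

A user interface would typically show the joined `finals` as settled text and the `partial` value as provisional text that may still change.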

The client is responsible for presenting the results to the user in a way suitable for the application.

Keyword spotting and corresponding intent through the HTTP or WebSocket API

The Speech Recognition API can also find particular keywords in the ASR transcripts. For each phrase you request, it returns the phrase, its start and end times in seconds, a score, and an intent if required. Pass the phrases in the validation_string parameter of the API call. If you need an intent for a phrase, supply the phrase and its intent separated by a comma. If a phrase contains a space, replace the space with %20 in the API call. To detect multiple phrases in a single call, list all of them in validation_string separated by the pipe character (|). See the format and examples below for more information.

Note: This API works on .wav audio (8 kHz, 16-bit signed PCM, mono).
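
Before uploading, a client could check a file against this format with Python's standard `wave` module. This is a sketch only; the function name `wav_format_problems` is hypothetical, and it accepts either a file path or an open binary file object.

```python
import wave


def wav_format_problems(source) -> list:
    """Return a list of mismatches against 8 kHz, 16-bit, mono; empty if the file conforms."""
    problems = []
    # wave.open raises wave.Error for files it cannot parse as PCM WAV.
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != 8000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 8000")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {8 * wav.getsampwidth()} bits, expected 16")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
    return problems
```

For example, `wav_format_problems("sample1.wav")` would return an empty list for a conforming file.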

Format:
For HTTP

"https://services.govivace.com:49149/telephony?key=xxxxxxxxxxxxxxx&validation_string=users_phrases". For multiple phrases: "https://services.govivace.com:49149/telephony?key=xxxxxxxxxxxxxxx&validation_string=phrase%201|phrase%202|phrase%203".

For WebSocket client

"wss://services.govivace.com:49149/telephony?key=xxxxxxxxxxxxxxx&validation_string=users_phrases". For multiple phrases: "wss://services.govivace.com:49149/telephony?key=xxxxxxxxxxxxxxx&validation_string=phrase%201|phrase%202|phrase%203".
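
The URL format can also be assembled programmatically. The Python sketch below percent-encodes each phrase so that spaces become %20 and joins the phrases with the pipe character. `build_uri` is a hypothetical helper name, and leaving commas unencoded (for phrase and intent pairs) is an assumption about what the server expects.

```python
from typing import List
from urllib.parse import quote


def build_uri(base: str, key: str, phrases: List[str]) -> str:
    """Build a request URI with a pipe-separated validation_string."""
    # quote() turns a space into %20; commas are kept literal (assumption).
    encoded = "|".join(quote(phrase, safe=",") for phrase in phrases)
    return f"{base}?key={key}&validation_string={encoded}"


uri = build_uri(
    "wss://services.govivace.com:49149/telephony",
    "xxxxxxxxxxxxxxx",  # placeholder API key
    ["airlines", "reservations", "check status"],
)
print(uri)
```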

Examples of the formats above

For WebSocket client

python client.py --uri "wss://services.govivace.com:49149/telephony?key=xxxxxxxxxxxxxxx&validation_string=airlines|reservations|check%20status" --save-json-filename sample1_phrases.json --rate 4200 sample1.wav

For HTTP

curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49149/telephony?key=xxxxxxxxxxxxxxx&validation_string=airlines|reservations|check%20status"

Response:

{
  "status": 0,
  "segment-start": 0.0,
  "segment-length": 8.549999237060547,
  "total-length": 9.487499237060547,
  "sentiment": "Negative",
  "result": {
    "hypotheses": [
      {
        "transcript": "thank you for calling alaska airlines but ice by noon open doors",
        "speaking_rate": 1.4035089015960693,
        "keywords": [
          {
            "phrase": "airlines",
            "score": 0.8999999761581421,
            "end": 5.939999580383301,
            "start": 4.230000019073486
          }
        ],
        "formatted_transcript": " Thank you for calling Alaska Airlines, but ice by noon, Open doors,",
        "likelihood": 0.8808204531669617,
        "word-alignment": [
          {"start": 0.0, "length": 2.9700000286102295, "word": "thank", "confidence": 1.0},
          {"start": 2.9700000286102295, "length": 0.11999999731779099, "word": "you", "confidence": 1.0},
          {"start": 3.0899999141693115, "length": 0.14999999105930328, "word": "for", "confidence": 1.0},
          {"start": 3.240000009536743, "length": 0.41999998688697815, "word": "calling", "confidence": 1.0},
          {"start": 3.68999981880188, "length": 0.5099999904632568, "word": "alaska", "confidence": 1.0},
          {"start": 4.230000019073486, "length": 1.709999918937683, "word": "airlines", "confidence": 1.0},
          {"start": 5.940000057220459, "length": 0.08999999612569809, "word": "but", "confidence": 1.0},
          {"start": 6.029999732971191, "length": 0.26999998092651367, "word": "ice", "confidence": 0.5158590078353882},
          {"start": 6.299999713897705, "length": 0.08999999612569809, "word": "by", "confidence": 0.9764412641525269},
          {"start": 6.420000076293945, "length": 0.44999998807907104, "word": "noon", "confidence": 0.9842000007629395},
          {"start": 6.899999618530273, "length": 0.29999998211860657, "word": "open", "confidence": 0.5237200856208801},
          {"start": 7.230000019073486, "length": 1.3199999332427979, "word": "doors", "confidence": 0.5696256756782532}
        ]
      }
    ],
    "final": true
  },
  "segment": 0.0,
  "id": "03.26.2019_06.49.47_AM_telephony_9_92550624"
}

In the response above, the keywords, transcript, and sentiment fields carry the keyword spotting, transcription, and sentiment results.
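
As a sketch of how a client might read the keyword spotting results, the Python function below pulls the spotted phrases out of a response. The field names are taken from the example response above; the function name `spotted_keywords` and the optional `min_score` filter are illustrative assumptions, not part of the API.

```python
import json
from typing import List, Tuple


def spotted_keywords(raw: str, min_score: float = 0.0) -> List[Tuple[str, float, float, float]]:
    """Return (phrase, start, end, score) tuples sorted by start time in seconds."""
    msg = json.loads(raw)
    hits = []
    for hypothesis in msg.get("result", {}).get("hypotheses", []):
        # "keywords" appears only when validation_string phrases were found.
        for keyword in hypothesis.get("keywords", []):
            if keyword["score"] >= min_score:
                hits.append((keyword["phrase"], keyword["start"],
                             keyword["end"], keyword["score"]))
    return sorted(hits, key=lambda hit: hit[1])
```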
