Language and Accent Id Specification

Purpose

The purpose of this document is to share the Emotion Identification API specification so that GoVivace potential customers could test their integration. The contents of this document are GoVivace Proprietary and Subject to change.

Introduction

This API strives to expose Language and Accent Identification routines as a restful web service. The emotion identification process assumes that the audio file input is an 8KHz 16 bit linear PCM file. If wav format is used, the first 44 bytes are just treated like audio and have been found to work fine.

Usage

The language and accent identification service accepts post requests with the audio in the body of the message at the specified URI. For example, using the curl command, one could do-

Curl Command

curl –request POST –data-binary “@sample1.wav” “https://services.govivace.com:7686/LanguageId?action=identify&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx”

where sample1.wav is an 8KHz sampling rate 16 bit linear PCM file. The body of the post would contain the entire audio file in 16bit linear PCM 8KHz format.

Websocket API

wss://services.govivace.com:7686/LanguageId?action=identify&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx

For Websocket API

After the last block of speech data, a special 3-byte ANSI-encoded string “EOS” (“end-of-stream”) needs to be sent to the server. This tells the server that no more speech is coming.

After sending “EOS”, the client has to keep the WebSocket open to receive the result from the server. The server closes the connection itself when results have been sent to the client. No more audio can be sent via the same WebSocket after an “EOS” has been sent. In order to process a new audio stream, a new WebSocket connection has to be created by the client.

Python Client

python client.py –uri “wss://services.govivace.com:7686/LanguageId?action=identify&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx” –save-json-filename sample1_language.json –rate 4200 sample1.wav

Options

➢–save-json-filename: Save the intermediate JSON to this specified file
➢–rate: Rate in bytes/sec at which audio should be sent to the server
➢–uri: Server websocket URI
➢–key: Authentication key
➢–action: Action value which we want to perform like identify
➢–file_format: Define file format (default is 8K_PCM16)

Response

{
“enrollment_audio_time”:0.0,
“message”:”Language and Accent identification is successful”,
“status”:0,
“processing_time”:7.1924380000000001,
“identified_language”:”english”,
“score”:0.98004209995269775,
“languages_identified”:[
{
“language”:”english”,
“accents_identified”:[
{
“accent”:”India”,
“accent_identification_score”:0.35017871856689453
},
{
“accent”:”Russian”,
“accent_identification_score”:0.22314415872097015
},
{
“accent”:”France”,
“accent_identification_score”:0.14898049831390381
},
{
“accent”:”Germany”,
“accent_identification_score”:0.14805249869823456
}
],
“identification_score”:0.98004209995269775
},
{
“language”:”spanish”,
“identification_score”:0.0031394604593515396
},
{
“language”:”hindustani”,
“identification_score”:0.0026037667412310839
},
{
“language”:”bengali”,
“identification_score”:0.0020343062933534384
}
]
}

The server sends language and accent identification results and other information to the client using the JSON format. The response can contain the following fields:

●status: response status (integer), see codes below
●message: status message
●processing_time: total amount of time spent at the server side to process the audio
●identified_language: depends on maximum identification score from languages_identified
●score: confidence of identified language its less than one
●languages_identified: identified more than one languages
➢identification score: confidence score
➢language: English, Spanish, Hindi and so on.
➢accents_identified:contains accent_identification_score and accent

The following status codes are currently in use-

●0 – Success: Usually used when recognition result sent
●1 – No speech: Send when the incoming audio contains a large portion of silence or non-speech
●2 – Aborted: Recognition was aborted for some reason
●9 – Not Available: Used when all recognizer processes are currently in use and recognition cannot be performed

Languages supported and their codes:

English – 0
Thai – 1
Bengali – 2
Hindustani – 3
Russian – 4
Japanese – 5
Chinese – 6
Vietnamese – 7
Korean – 8
Farsi – 9
Arabic – 10
Spanish – 11
Tamil – 12
German – 13