
Speaker Identification API Specification

Purpose

The purpose of this document is to share the Speaker Identification API specification so that potential GoVivace customers can test their integration. The contents of this document are GoVivace proprietary and subject to change.

Introduction

This API exposes speaker identification routines as a RESTful web service. It supports five operations: enrollment, identification, deletion, enabling, and disabling of a speaker. The enrollment process takes multiple examples of the same speaker saying different things and generates a voiceprint from them. A voiceprint is a digital representation of the speaker’s voice. Voiceprints are stored on the server and are used to identify the voice in the future. Once enrollment for a particular speaker is complete, it does not need to be repeated.

The identification step receives a new utterance from the same speaker and verifies that it’s the same voice using the voiceprint.

It is also possible to delete, disable and enable a speaker with the help of curl commands.

Enrollment

The purpose of enrollment is to collect examples of a speaker’s voice and extract a “voiceprint” from them. The voiceprint, like the more familiar fingerprint, is a succinct representation of the speaker’s voice. Enrolling a speaker tells the speaker identification system what that speaker sounds like. In many situations, it is desirable to use multiple examples of a speaker’s voice to create the voiceprint. This is particularly true when the speaker is to be identified from a relatively short phrase. Therefore, the speaker identification server allows multiple enrollment examples to be sent. Since these examples could be collected over a period of time, the server remembers the previous examples.

 

An example POST to the speaker identification server could look like:

POST http://server.url//SpeakerId HTTP/1.1
Accept: */*
Content-Length: 125324
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue
Host: server.url.port
User-Agent: curl/7.38.0
action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx

CURL Command

curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49162/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"

The body of the POST contains the entire audio file in 16-bit linear PCM format at 8 kHz.

This POST request takes four arguments: action, speaker_name, format and key. “action” selects one of the possible actions; in this case, the action is to enroll the speaker. “speaker_name” is a required argument and is a unique string identifier for the speaker. This string needs to be unique for each speaker, and any future reference to the same speaker must use exactly the same string. The string is case sensitive and can consist of letters, numbers and the underscore “_”. “format” names the audio format of the body; 8K_PCM16 in the example corresponds to the 16-bit linear PCM 8 kHz audio described above.
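Since the speaker identifier is restricted to a small character set, a client may want to check it before sending a request. The following is a minimal sketch, assuming that “alphabet” means the ASCII letters A-Z and a-z; the helper name is_valid_speaker_name is hypothetical and not part of the API.

```python
import re

# Assumption: "alphabet" in the specification means ASCII letters only.
_SPEAKER_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def is_valid_speaker_name(name: str) -> bool:
    """Return True if name is non-empty and uses only letters, digits and '_'."""
    return _SPEAKER_NAME_RE.fullmatch(name) is not None
```

Because the identifier is case sensitive, the check does not normalize case: “Sample1” and “sample1” would refer to two different speakers.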

It is recommended that the enrollment method be called repeatedly with new examples of the enrollee’s voice until enrollment_audio_time is 60 seconds or more.

Key: the key is generated by GoVivace and is used for authentication. It is an alphanumeric string.
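The curl command above can also be issued from Python. The following is a minimal sketch using only the standard library; the function names build_enroll_url and enroll are hypothetical, and the code assumes that the server returns a JSON body as described under Enrollment Response.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://services.govivace.com:49162/SpeakerId"


def build_enroll_url(speaker_name, key, audio_format="8K_PCM16", base_url=BASE_URL):
    """Build the enrollment URL with the four query arguments."""
    query = urllib.parse.urlencode({
        "action": "enroll",
        "speaker_name": speaker_name,
        "format": audio_format,
        "key": key,
    })
    return f"{base_url}?{query}"


def enroll(wav_path, speaker_name, key):
    """POST the whole audio file as the request body and return the parsed reply."""
    with open(wav_path, "rb") as audio_file:
        audio = audio_file.read()
    request = urllib.request.Request(
        build_enroll_url(speaker_name, key), data=audio, method="POST"
    )
    # Assumption: the reply is a JSON document encoded as UTF-8.
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as enroll("sample1.wav", "sample1", key) would then correspond to the curl example, with key replaced by the key issued to you.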

Websocket API for Enrollment

wss://services.govivace.com:49162/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx

Python Client

python client.py --uri "wss://services.govivace.com:49162/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx" --save-json-filename sample1_enroll.json --rate 4200 sample1.wav

(For Websocket API)

After the last block of speech data, a special 3-byte ANSI-encoded string “EOS” (“end-of-stream”) needs to be sent to the server. This tells the server that no more speech is coming.

After sending “EOS”, the client has to keep the WebSocket open to receive the result from the server. The server closes the connection itself when results have been sent to the client. No more audio can be sent via the same WebSocket after an “EOS” has been sent. In order to process a new audio stream, a new WebSocket connection has to be created by the client.
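The message sequence described above can be sketched as a generator that yields the audio in blocks followed by the end-of-stream marker. This is an illustration only: the name iter_stream_messages and the block size are assumptions, and the actual sending (and the choice of WebSocket client library) is left out.

```python
from typing import Iterator

# The 3-byte end-of-stream marker described in the text.
END_OF_STREAM = "EOS".encode("ascii")


def iter_stream_messages(audio: bytes, chunk_size: int = 4000) -> Iterator[bytes]:
    """Yield the audio in blocks of chunk_size bytes, then the EOS marker.

    chunk_size is an arbitrary value chosen for illustration.
    """
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
    # Nothing may be sent on this connection after the marker.
    yield END_OF_STREAM
```

A client would send each yielded message over the open WebSocket, then keep reading until the server delivers the result and closes the connection.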

 

Enrollment Response

The server sends enrollment results and other information to the client using the JSON format. The response can contain the following fields:

  • status : response status (integer), see codes below
  • message : status message
  • enrollment_audio_time : Total amount of audio collected for this speaker so far in seconds
  • processing_time : Total amount of time spent at the server side to process the audio.

 

The following status codes are currently in use:

  • 0: Success. Enrollment happened. In this case, the result will be sent back.
  • 1: No speech. Sent when the incoming audio is too short or consists largely of silence or non-speech.
  • 2: Unusually high energy. Generally sent when the incoming audio is so noisy that it does not look like real audio data. This can happen if the input is not 16-bit linear PCM, or if the silence at the beginning and end of the utterance has been stripped out completely.
  • 3: Invalid request or a request format error
  • 4: Input not matched (input rejected) – The audio did not seem to match the text string. Therefore the input was rejected.
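On the client side, the status codes listed above can be given symbolic names. The following is a small sketch; the enum and its member names are hypothetical labels chosen for readability and are not defined by the API.

```python
from enum import IntEnum


class Status(IntEnum):
    """Client-side names for the documented status codes (names are assumptions)."""
    SUCCESS = 0
    NO_SPEECH = 1
    HIGH_ENERGY = 2
    INVALID_REQUEST = 3
    INPUT_NOT_MATCHED = 4


def describe_status(code: int) -> str:
    """Return the symbolic name for a code, or a marker for undocumented codes."""
    try:
        return Status(code).name
    except ValueError:
        return f"UNKNOWN({code})"
```

Handling unknown codes explicitly keeps the client working if further codes are added later.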

Examples of server responses:

{"status": 1, "processing_time": 5.969066, "message": "Enrollment audio is not sufficient. Please send more examples.", "enrollment_audio_time": 30.84000015258789}

{"status": 0, "message": "Enrollment done. Sending more enrollment will only help.", "enrollment_audio_time": 61.680000305175781, "processing_time": 11.857492000000001}
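A client can use enrollment_audio_time to decide whether to send further examples. The sketch below applies the recommended 60-second target; the function name enrollment_complete is hypothetical, and the response body is assumed to be valid JSON.

```python
import json

# Recommended minimum amount of enrollment audio, in seconds.
TARGET_SECONDS = 60.0


def enrollment_complete(response_text: str, target: float = TARGET_SECONDS) -> bool:
    """Return True once the server reports at least target seconds of audio."""
    response = json.loads(response_text)
    # A reply without the field is treated as "no audio collected yet".
    return response.get("enrollment_audio_time", 0.0) >= target
```

With the two example responses, the first (about 30.8 seconds) would call for more audio and the second (about 61.7 seconds) would not.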

 

Search

The speaker search finds the closest match among all enrolled speakers. The audio is once again sent in the body of the message. An example POST is shown below.

POST http://server.url//SpeakerId HTTP/1.1
Accept: */*
Content-Length: 125324
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue
Host: server.url.port
User-Agent: curl/7.38.0
action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx

 

CURL Command

curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49162/SpeakerId?action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"

 

Websocket API for Search

wss://services.govivace.com:49162/SpeakerId?action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx

Python Client

python client.py --uri "wss://services.govivace.com:49162/SpeakerId?action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx" --save-json-filename sample1_search.json --rate 4200 sample1.wav

Here the action is “search”. All the parameters are required. “number_of_matches” specifies the maximum number of matches desired in the result. “confidence_threshold” puts a threshold on the match score; any match that scores below it is not returned. The search result is returned as JSON. An example of a server response is:

{"matches": [{"speaker": "sample1", "identification_score": 237.36494445800781}], "status": 0, "message": "Speaker identification is successful", "enrollment_audio_time": 30.840000152587891, "processing_time": 6.8625569999999998}

Here is a description of the fields:

  • status: response status (integer), see codes above
  • message: (optional) status message
  • matches: list of matching speakers, each with a speaker and an identification_score
  • speaker: the speaker_name of the matched speaker
  • identification_audio_time: Total amount of spoken audio in this particular utterance in seconds
  • identification_score: a floating-point number less than 1. Positive values mean a match.
  • processing_time: Time spent in processing the audio in seconds

In these responses, most of the fields are the same as before. identification_audio_time shows the length of spoken audio that was found in the identification utterance, and the identification score specifies a floating-point number corresponding to the match between the voiceprint of the speaker and the identification utterance. The identification score may have a maximum value of 1.0. A reasonable threshold for determining a match may vary from application to application, and identification string to identification string, but is generally expected to be higher than 0.5. 

Status codes retain the same meanings.
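Picking a result out of a search response can be sketched as follows. The function name best_match is hypothetical, the response body is assumed to be valid JSON with the structure of the example above, and the threshold is left to the caller because a suitable value varies from application to application.

```python
import json
from typing import Optional


def best_match(response_text: str, threshold: float) -> Optional[str]:
    """Return the speaker with the highest score at or above threshold, if any."""
    response = json.loads(response_text)
    if response.get("status") != 0:
        return None  # anything other than success carries no usable match
    candidates = [
        match for match in response.get("matches", [])
        if match["identification_score"] >= threshold
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda match: match["identification_score"])["speaker"]
```

Filtering on the client side in this way complements the confidence_threshold argument, which already limits what the server returns.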

Delete

This action deletes all stored information about a speaker. The deletion is permanent and cannot be undone. An example POST is shown below.
 

POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 107
action=delete&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxx

CURL Command

curl --request POST "https://services.govivace.com:49162/SpeakerId?action=delete&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"

This POST request takes the following arguments. “action” selects one of the possible actions; in this case, the action is delete. “speaker_name” is a required argument and is the unique string identifier for the speaker. The string is case sensitive and can consist of letters, numbers and the underscore “_”. The key is generated by GoVivace and is used for authentication.

Disable 

This action disables a speaker. While a speaker is disabled, no action can be performed on that speaker until it is enabled again with the enable command.

An example POST is shown below.

POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 111
action=disable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx

CURL Command

curl --request POST "https://services.govivace.com:49162/SpeakerId?action=disable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"

Here the action is disable, and the other arguments are the same as in the delete command.

Enable

This action enables a speaker. If the speaker was disabled, it is enabled again by this operation, and actions can once more be performed on that speaker. An example POST is shown below.

POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 110
action=enable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx

CURL Command

curl --request POST "https://services.govivace.com:49162/SpeakerId?action=enable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"

Here the action is enable, and the other arguments are the same as in the delete command.
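The delete, disable and enable requests differ only in the action argument, so one helper can build all three URLs. The following is a minimal sketch; the name build_management_url is hypothetical, and the format argument is included only because the curl examples include it.

```python
import urllib.parse

BASE_URL = "https://services.govivace.com:49162/SpeakerId"
MANAGEMENT_ACTIONS = ("delete", "disable", "enable")


def build_management_url(action, speaker_name, key,
                         audio_format="8K_PCM16", base_url=BASE_URL):
    """Build the URL for a delete, disable or enable request."""
    if action not in MANAGEMENT_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    query = urllib.parse.urlencode({
        "action": action,
        "speaker_name": speaker_name,
        "format": audio_format,
        "key": key,
    })
    return f"{base_url}?{query}"
```

The resulting URL would be sent with a POST that carries no audio, as in the curl commands for these three actions.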