Purpose

The purpose of this document is to share the Speaker Verification API specification so that potential GoVivace customers can test their integration. The contents of this document are GoVivace proprietary and subject to change.

Introduction

This API exposes Speaker Verification routines as a RESTful web service. The speaker verification process consists of two steps: enrollment and verification.

The enrollment process takes multiple examples of the same speaker speaking something specific to generate a voiceprint. Voiceprints are a digital representation of the speaker’s voice. They are stored internally in the server and are used to verify the voice in the future. Once the enrollment for a particular user is complete, it does not need to be repeated.

The verification step receives a new utterance from the same speaker and uses the voiceprint to verify that it is the same voice.
It is also possible to delete, disable, and enable a speaker; each of these operations is shown later in this document with a curl command.

Enrollment

The purpose of enrollment is to collect examples of a speaker’s voice and extract a ‘voiceprint’ from them. The voiceprint, much like the commonly used fingerprint, is a succinct representation of the speaker’s voice. Enrolling a speaker tells the speaker verification system what that speaker sounds like. In many situations, it is desirable to build the voiceprint from multiple examples of the speaker’s voice. This is particularly true when the speaker will be verified using a relatively short phrase. Therefore, the speaker verification server allows multiple enrollment examples to be sent. Since these examples may be collected over a period of time, the server remembers the previous examples.

An example POST or PUT to the Speaker Verification Server could look like:

POST http://server.url//SpeakerId HTTP/1.1
Accept: */*
Content-Length: 125324
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue
Host: server.url.port
User-Agent: curl/7.38.0
action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx

CURL Command:
curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49162/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"
The body of the POST contains the entire audio file in 16-bit linear PCM format sampled at 8 kHz.

This POST request takes four arguments: action, speaker_name, format, and key. “action” selects one of the possible actions; in this case, the action is to enroll the speaker. “speaker_name” is a required argument. It is a unique string identifier for the speaker: the string must be unique for each speaker, and any future reference to the same speaker must use exactly the same string. The string is case sensitive and may consist of letters, digits, and the underscore “_”.

It is recommended to call the enrollment method repeatedly with new examples of the enrollee’s voice until enrollment_audio_time reaches 60 seconds or more.
Key: The key is an alphanumeric string generated by GoVivace and used for authentication.
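
The repeated-enrollment recommendation above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not official client code: the helper names (build_enroll_url, post_audio, enroll_until_sufficient) and the injectable send argument are illustrative. Only the query parameters and the enrollment_audio_time field come from this document.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://services.govivace.com:49162/SpeakerId"


def build_enroll_url(speaker_name, key, audio_format="8K_PCM16", base_url=BASE_URL):
    """Build the enrollment URL with the query parameters described above."""
    query = urllib.parse.urlencode({
        "action": "enroll",
        "speaker_name": speaker_name,
        "format": audio_format,
        "key": key,
    })
    return f"{base_url}?{query}"


def post_audio(url, audio_bytes):
    """POST raw audio bytes in the request body and return the decoded JSON reply."""
    request = urllib.request.Request(url, data=audio_bytes, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def enroll_until_sufficient(url, audio_samples, target_seconds=60.0, send=post_audio):
    """Send enrollment examples one by one until enough audio has been collected.

    `send` is a hypothetical hook so the loop can be exercised without a server.
    Returns the last server reply, or None if no samples were given.
    """
    reply = None
    for audio_bytes in audio_samples:
        reply = send(url, audio_bytes)
        if reply.get("enrollment_audio_time", 0.0) >= target_seconds:
            break
    return reply
```

A caller would read each WAV file as bytes and pass the list to enroll_until_sufficient. The loop stops as soon as the server reports at least 60 seconds of collected audio.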

Websocket API for Enrollment

wss://services.govivace.com:49162/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx

Python Client:
python client.py --uri wss://services.govivace.com:49162/SpeakerId --action enroll --spk sample1 --file_format 8K_PCM16 --key xxxxxxxxxxxxxx --rate 16000 --save-json-filename sample1_enroll.json sample1.wav

Options:
➢ --save-json-filename: Save the intermediate JSON to the specified file
➢ --rate: Rate in bytes/sec at which audio should be sent to the server
➢ --uri: Server websocket URI
➢ --key: Authentication key
➢ --action: Action to perform, such as enroll
➢ --file_format: Audio file format (default is 8K_PCM16)
➢ --spk: Speaker name. Necessary for enrollment and searching
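
The --rate option implies that the client paces the audio instead of sending the whole file at once. This document does not show how client.py does that, so the sketch below only illustrates one possible way to split audio into chunks that add up to a given byte rate. The 0.25-second interval, the function names, and the send callable (a stand-in for whatever writes to the websocket) are assumptions. For reference, 8 kHz 16-bit mono audio amounts to 16000 bytes per second, so a rate of 16000 corresponds to real-time streaming for that format.

```python
import time


def paced_chunks(audio_bytes, rate, interval=0.25):
    """Yield successive chunks sized so that one chunk per `interval` seconds
    gives roughly `rate` bytes per second."""
    chunk_size = max(1, int(rate * interval))
    for start in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[start:start + chunk_size]


def stream(audio_bytes, rate, send, interval=0.25, sleep=time.sleep):
    """Pass each chunk to `send`, sleeping between chunks to hold the rate.

    `send` is a hypothetical callable standing in for the websocket write.
    Returns the number of chunks sent.
    """
    count = 0
    for chunk in paced_chunks(audio_bytes, rate, interval):
        send(chunk)
        sleep(interval)
        count += 1
    return count
```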

Enrollment Response

The server sends enrollment results and other information to the client using the JSON format. The response can contain the following fields:
● status: response status (integer), see codes below
● message: status message
● enrollment_audio_time: Total amount of audio collected for this speaker so far in seconds
● processing_time: Total amount of time spent at the server side to process the audio.
● model_quality_score: Score that evaluates the quality of the speaker’s existing model against the current enrollment signal. It indicates how the quality of the speaker’s model improves over time as more enrollment audio arrives. If the speaker has no previously enrolled model (i.e., the current enrollment signal is the speaker’s first enrollment), model_quality_score takes the default value -1000.

The following status codes are currently in use:
● 0: Success. Enrollment happened. In this case, the result will be sent back.
● 1: No speech. Sent when the incoming audio is too short, contains a large portion of silence or non-speech.
● 2: Invalid request or a request format error.
● 4: Beginning of speech – Server provides a signal when it finds a speech chunk for the first time.

Examples of server responses:
Status 1:
{
  "model_quality_score": 103.34664916992188,
  "processing_time": 6.5869710000000001,
  "status": 1,
  "message": "Enrollment audio is not sufficient. Please send more examples.",
  "enrollment_audio_time": 30.540000915527344,
  "final": true
}
Status 0:
{
  "model_quality_score": 137.3951416015625,
  "processing_time": 5.5276949999999996,
  "status": 0,
  "message": "Enrollment done. Sending more enrollment will only help.",
  "enrollment_audio_time": 65.339996337890625,
  "final": true
}
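
A client might interpret these enrollment responses along the following lines. This is a minimal sketch: the function name summarize_enrollment and the shape of the returned dictionary are illustrative assumptions, while the field names, the status codes, and the -1000 default come from the description above.

```python
import json

# Status codes listed in this document for enrollment.
ENROLL_STATUS = {
    0: "success",
    1: "no speech",
    2: "invalid request",
    4: "beginning of speech",
}

# Default model_quality_score when the speaker had no previous model.
NO_PREVIOUS_MODEL = -1000


def summarize_enrollment(raw_json):
    """Turn a raw enrollment response into a small summary dictionary."""
    reply = json.loads(raw_json)
    status = reply.get("status")
    return {
        "status": ENROLL_STATUS.get(status, "unknown"),
        "done": status == 0,
        "audio_seconds": reply.get("enrollment_audio_time"),
        "first_enrollment": reply.get("model_quality_score") == NO_PREVIOUS_MODEL,
        "message": reply.get("message", ""),
    }
```

The server's own message field is kept in the summary because it carries the most specific explanation of the outcome.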

Search

The speaker search finds the closest matches among all enrolled speakers. As with enrollment, the audio is sent in the body of the request. An example POST request is shown below.

POST http://server.url//SpeakerId HTTP/1.1
Accept: */*
Content-Length: 125324
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue
Host: server.url.port
User-Agent: curl/7.38.0
action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx

CURL Command:
curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49162/SpeakerId?action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"
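
The same search URL can be assembled programmatically. The Python sketch below is illustrative only: the function name build_search_url and its default values are assumptions that mirror the curl example above, not part of an official client.

```python
import urllib.parse

BASE_URL = "https://services.govivace.com:49162/SpeakerId"


def build_search_url(key, confidence_threshold=0.0, number_of_matches=20,
                     audio_format="8K_PCM16", base_url=BASE_URL):
    """Build the search URL with the query parameters used in the curl example."""
    if number_of_matches < 1:
        raise ValueError("number_of_matches must be at least 1")
    query = urllib.parse.urlencode({
        "action": "search",
        "confidence_threshold": confidence_threshold,
        "number_of_matches": number_of_matches,
        "format": audio_format,
        "key": key,
    })
    return f"{base_url}?{query}"
```

The audio itself is not part of the URL; as in the curl command, it travels in the body of the POST request.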

Websocket API for Search

wss://services.govivace.com:49162/SpeakerId?action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx

Python Client:
python client.py --uri wss://services.govivace.com:49162/SpeakerId --action search --confidence_threshold -200 --number_of_matches 20 --file_format 8K_PCM16 --key xxxxxxxxxxxxxx --rate 16000 --save-json-filename sample1_search.json sample1.wav

Options:
➢ --save-json-filename: Save the intermediate JSON to the specified file
➢ --rate: Rate in bytes/sec at which audio should be sent to the server
➢ --uri: Server websocket URI
➢ --key: Authentication key
➢ --action: Action to perform, such as search
➢ --file_format: Audio file format (default value is 8K_PCM16)
➢ --spk: Speaker name. Necessary for enrollment and searching
➢ --confidence_threshold: Threshold value (default value is -200)
➢ --number_of_matches: Total number of matching speakers (default value is 20)

Here the action is “search”. All the parameters are required. “number_of_matches” specifies the maximum number of matches desired in the result. “confidence_threshold” sets a threshold on the match score; any match that scores below it is not returned. The search result is returned as JSON. An example of a server response is:

{
  "matches": [
    {
      "speaker": "sample1",
      "identification_score": 237.36494445800781
    }
  ],
  "status": 0,
  "message": "Speaker identification is successful",
  "enrollment_audio_time": 30.840000152587891,
  "processing_time": 6.8625569999999998,
  "final": true
}

Here is a description of the fields:
● status: response status (integer), see codes below
● message: (optional) status message
● identification_audio_time: Total amount of spoken audio in this particular utterance in seconds
● identification_score: a floating-point number. Positive values mean a match.
● processing_time: Time spent in processing the audio in seconds.

In these responses, most of the fields are the same as before. identification_audio_time shows the length of spoken audio that was found in the identification utterance, and the identification score specifies a floating-point number corresponding to the match between the voiceprint of the speaker and the identification utterance. The identification score may have a maximum value of 1.0. A reasonable threshold for determining a match may vary from application to application, and identification string to identification string, but is generally expected to be higher than 0.5.

The following status codes are currently in use:
● 0: Success. Identification succeeded. In this case, the result will be sent back.
● 1: No speech. Sent when the incoming audio is too short, contains a large portion of silence or non-speech.
● 2: Invalid request or a request format error.
● 3: No match. No matching speaker was found.
● 4: Beginning of speech – Server provides a signal when it finds a speech chunk for the first time.
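
The response fields and status codes above could be consumed as in the following Python sketch. The function name best_match and the min_score parameter are illustrative assumptions. The appropriate threshold depends on the application, so the default of 0.0 here only reflects the statement that positive scores mean a match.

```python
import json


def best_match(raw_json, min_score=0.0):
    """Return (speaker, score) for the highest-scoring match, or None.

    None is returned when the status is not 0 (success) or when no match
    reaches `min_score`.
    """
    reply = json.loads(raw_json)
    if reply.get("status") != 0:
        return None
    candidates = [
        m for m in reply.get("matches", [])
        if m["identification_score"] >= min_score
    ]
    if not candidates:
        return None
    top = max(candidates, key=lambda m: m["identification_score"])
    return top["speaker"], top["identification_score"]
```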

Delete

This action deletes all information stored for the speaker. It cannot be undone, because the information is deleted permanently. An example POST request is shown below.

POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 107
action=delete&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx

CURL Command:
curl --request POST "https://services.govivace.com:49162/SpeakerId?action=delete&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"

This POST request takes several arguments. “action” selects one of the possible actions; in this case, the action is “delete”. “speaker_name” is a required argument. It is a unique string identifier for the speaker and must be unique for each speaker. The string is case sensitive and may consist of letters, digits, and the underscore “_”. “key” is generated by GoVivace and used for authentication.

Disable

This action disables a speaker. While a speaker is disabled, no action can be performed on that speaker until the speaker is enabled again with the enable curl command.
An example POST request is shown below.

POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 111
action=disable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxx

CURL Command:
curl --request POST "https://services.govivace.com:49162/SpeakerId?action=disable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"

Here the action is “disable”, and the other arguments are the same as in the delete curl command.

Enable

This action enables a speaker. If the speaker was disabled, the speaker is enabled after this operation, and actions can again be performed on that speaker. An example POST request is shown below.

POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 110
action=enable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxx

CURL Command:
curl --request POST "https://services.govivace.com:49162/SpeakerId?action=enable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"

Here the action is “enable”, and the other arguments are the same as in the delete curl command.
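
Because delete, disable, and enable differ only in the action value, a client could share one helper for all three. The Python sketch below is a hedged illustration: the function name build_management_url is an assumption, and the speaker_name check simply encodes the rule stated earlier (letters, digits, and the underscore).

```python
import re
import urllib.parse

BASE_URL = "https://services.govivace.com:49162/SpeakerId"
MANAGEMENT_ACTIONS = ("delete", "disable", "enable")


def build_management_url(action, speaker_name, key,
                         audio_format="8K_PCM16", base_url=BASE_URL):
    """Build the URL for a delete, disable, or enable request."""
    if action not in MANAGEMENT_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    # speaker_name may contain only letters, digits, and the underscore.
    if not re.fullmatch(r"[A-Za-z0-9_]+", speaker_name):
        raise ValueError(f"invalid speaker_name: {speaker_name!r}")
    query = urllib.parse.urlencode({
        "action": action,
        "speaker_name": speaker_name,
        "format": audio_format,
        "key": key,
    })
    return f"{base_url}?{query}"
```

Since delete is permanent, a calling application may want to ask for confirmation before issuing that request.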