The purpose of this document is to share the Speaker Identification API specification so that potential GoVivace customers can test their integration. The contents of this document are GoVivace proprietary and subject to change.
This API exposes speaker identification routines as a RESTful web service. The speaker identification process supports five operations: enrollment, identification, deletion, enabling, and disabling of a speaker. The enrollment process takes multiple examples of the same speaker, each with different spoken content, to generate a voiceprint. A voiceprint is a digital representation of the speaker’s voice. Voiceprints are stored internally on the server and are used to identify the voice in the future. Once enrollment for a particular speaker is complete, it does not need to be repeated.
The identification step receives a new utterance from the same speaker and verifies that it’s the same voice using the voiceprint.
It is also possible to delete, disable, and enable a speaker using curl commands.
The purpose of enrollment is to collect examples of a speaker’s voice and extract a “voiceprint” from them. The voiceprint, like the commonly used fingerprint, is a succinct representation of the speaker’s voice. Enrolling a speaker tells the speaker identification system what that particular speaker sounds like. In many situations, it is desirable to use multiple examples of a speaker’s voice to create a voiceprint. This is particularly true if the process is used to identify a speaker from a relatively short phrase. Therefore, the speaker identification server allows multiple enrollment examples to be sent. Since these examples could be collected over a period of time, the server remembers the previous examples.
An example POST to the speaker identification server could look like:
POST http://server.url//SpeakerId HTTP/1.1
Accept: */*
Content-Length: 125324
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue
Host: server.url.port
User-Agent: curl/7.38.0
action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx
CURL Command
curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49162/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"
The body of the POST contains the entire audio file in 16-bit linear PCM format at 8 kHz.
This POST request takes several arguments. “action” selects one of the possible actions. In this case, the action is to enroll the speaker. “speaker_name” is a required argument. It is a unique string identifier for the speaker. This string must be unique for each speaker, and any future reference to the same speaker must use exactly the same string. The string is case sensitive and can consist of letters, numbers, and the underscore “_”.
It is recommended that repeated calls be made to the enrollment method with new examples of the enrollee’s voice until enrollment_audio_time is 60 seconds or more.
Key: the key is generated by GoVivace and is used for authentication purposes. It is an alphanumeric string.
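The repeated-enrollment recommendation above can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not GoVivace's client: the names build_enroll_url, enroll_until_sufficient and post_audio are hypothetical, and the transport is left to the caller. Only the query parameters, the service URL from the curl example, and the enrollment_audio_time field come from this document.

```python
from urllib.parse import urlencode

BASE_URL = "https://services.govivace.com:49162/SpeakerId"  # from the curl example
TARGET_SECONDS = 60.0  # recommended minimum amount of enrollment audio


def build_enroll_url(speaker_name, key, audio_format="8K_PCM16"):
    """Return the enrollment URL carrying the documented query parameters."""
    query = urlencode({
        "action": "enroll",
        "speaker_name": speaker_name,
        "format": audio_format,
        "key": key,
    })
    return f"{BASE_URL}?{query}"


def enroll_until_sufficient(wav_paths, speaker_name, key, post_audio):
    """Send examples one at a time until enough audio has been enrolled.

    post_audio(url, wav_path) is a hypothetical callable (an assumption of
    this sketch) that POSTs the file as the request body and returns the
    decoded JSON response as a dict.
    """
    url = build_enroll_url(speaker_name, key)
    response = None
    for path in wav_paths:
        response = post_audio(url, path)
        # Stop as soon as the server reports 60 seconds or more.
        if response.get("enrollment_audio_time", 0.0) >= TARGET_SECONDS:
            break
    return response
```

Passing the transport in as a callable keeps the loop independent of any particular HTTP library; the last response is returned so the caller can still inspect its status and message.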
wss://services.govivace.com:49162/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx
python client.py --uri "wss://services.govivace.com:49162/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx" --save-json-filename sample1_enroll.json --rate 4200 sample1.wav
(For Websocket API)
After the last block of speech data, a special 3-byte ANSI-encoded string “EOS” (“end-of-stream”) needs to be sent to the server. This tells the server that no more speech is coming.
After sending “EOS”, the client has to keep the WebSocket open to receive the result from the server. The server closes the connection itself when results have been sent to the client. No more audio can be sent via the same WebSocket after an “EOS” has been sent. In order to process a new audio stream, a new WebSocket connection has to be created by the client.
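The streaming order described above (audio blocks first, then the 3-byte “EOS” marker, then nothing more) can be sketched as a generator. This is an illustrative sketch only: the name iter_stream_frames and the default chunk size are assumptions, and this document does not state whether the marker should go out as a text or a binary WebSocket frame, so the sketch simply yields bytes.

```python
def iter_stream_frames(audio, chunk_size=4096):
    """Yield the audio in fixed-size chunks, followed by the b"EOS" marker.

    chunk_size is an arbitrary value chosen for this sketch; it is not a
    value taken from the API specification.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
    # End-of-stream marker: no more audio may follow on this connection.
    yield b"EOS"
```

A client would send each yielded frame over its open WebSocket, then keep the connection open and read until the server sends the result and closes the connection.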
Enrollment Response
The server sends enrollment results and other information to the client using the JSON format. The response can contain the following fields:
The following status codes are currently in use:
Examples of server responses:
{"status": 1, "processing_time": 5.969066, "message": "Enrollment audio is not sufficient. Please send more examples.", "enrollment_audio_time": 30.84000015258789}
{"status": 0, "message": "Enrollment done. Sending more enrollment will only help.", "enrollment_audio_time": 61.680000305175781, "processing_time": 11.857492000000001}
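A client might interpret an enrollment response along the following lines. This is a hedged sketch: the function name summarize_enrollment is hypothetical, and the reading of status 0 as “enrollment complete” and any other value as “not complete yet” is an assumption inferred only from the two example responses above, not from a status code table.

```python
import json


def summarize_enrollment(raw_json):
    """Reduce an enrollment response to the fields a client usually checks."""
    reply = json.loads(raw_json)
    return {
        # Assumption: status 0 means enrollment is complete (see example).
        "complete": reply.get("status") == 0,
        "seconds_enrolled": reply.get("enrollment_audio_time"),
        "message": reply.get("message", ""),
    }
```

Using get() rather than direct indexing keeps the sketch tolerant of responses that omit a field.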
The speaker search finds the closest match among all enrolled speakers. The audio is once again sent in the body of the message. An example POST is shown below.
POST http://server.url//SpeakerId HTTP/1.1
Accept: */*
Content-Length: 125324
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue
Host: server.url.port
User-Agent: curl/7.38.0
action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx
CURL Command
curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49162/SpeakerId?action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"
wss://services.govivace.com:49162/SpeakerId?action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx
Python Client
python client.py --uri "wss://services.govivace.com:49162/SpeakerId?action=search&confidence_threshold=0.0&number_of_matches=20&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx" --save-json-filename sample1_search.json --rate 4200 sample1.wav
Here the action is “search”. All the parameters are required. “number_of_matches” specifies the maximum number of matches desired in the result. “confidence_threshold” sets a threshold on the match score; any match that scores below this threshold is not returned. The search result is returned as JSON. An example of a server response is:
{"matches": [{"speaker": "sample1", "identification_score": 237.36494445800781}], "status": 0, "message": "Speaker identification is successful", "enrollment_audio_time": 30.840000152587891, "processing_time": 6.8625569999999998}
Here is a description of the fields:
In these responses, most of the fields are the same as before. identification_audio_time shows the length of spoken audio that was found in the identification utterance, and the identification score specifies a floating-point number corresponding to the match between the voiceprint of the speaker and the identification utterance. The identification score may have a maximum value of 1.0. A reasonable threshold for determining a match may vary from application to application, and identification string to identification string, but is generally expected to be higher than 0.5.
Status codes retain the same meanings.
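Picking the closest speaker out of a search response could look like the sketch below. The function name best_match and the optional min_score argument are hypothetical, and the sketch assumes that a higher identification_score means a closer match, which this document implies but does not state outright. The field names matches, speaker and identification_score are taken from the example response.

```python
import json


def best_match(raw_json, min_score=None):
    """Return (speaker, score) for the highest-scoring match, or None."""
    reply = json.loads(raw_json)
    matches = reply.get("matches", [])
    if min_score is not None:
        # Optional client-side threshold on top of the server-side one.
        matches = [m for m in matches if m["identification_score"] >= min_score]
    if not matches:
        return None
    top = max(matches, key=lambda m: m["identification_score"])
    return top["speaker"], top["identification_score"]
```

The client-side threshold is a convenience only; the server already applies the threshold passed in the request.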
Delete
This action deletes all information stored for the speaker. The deletion is permanent and cannot be undone. An example POST is shown below.
POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 107
action=delete&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxx
CURL Command
curl --request POST "https://services.govivace.com:49162/SpeakerId?action=delete&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"
This POST request takes several arguments. “action” selects one of the possible actions. In this case, the action is “delete”. “speaker_name” is a required argument. It is a unique string identifier for the speaker. This string must be unique for each speaker. The string is case sensitive and can consist of letters, numbers, and the underscore “_”. The key is generated by GoVivace and is used for authentication purposes.
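Because a delete cannot be undone, a client may want to check the speaker name before sending the request. The sketch below validates the character rules stated above. The function name is_valid_speaker_name is hypothetical, and the sketch assumes that the allowed letters are ASCII letters only and that an empty name is invalid, neither of which this document specifies.

```python
import re

# Letters, digits and underscore only; assumes ASCII letters.
_SPEAKER_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_speaker_name(name):
    """Return True if name uses only the documented character set."""
    return bool(_SPEAKER_NAME.fullmatch(name))
```

Names are case sensitive, so the check deliberately does not normalize case: “Sample1” and “sample1” would refer to different speakers.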
Disable
This action disables a speaker. While a speaker is disabled, no action can be performed on that speaker until the speaker is enabled again with the enable curl command.
An example POST is shown below.
POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 111
action=disable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx
CURL Command
curl --request POST "https://services.govivace.com:49162/SpeakerId?action=disable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"
Here the action is “disable”, and the other arguments are the same as in the delete curl command.
Enable
This action enables a speaker. If the speaker was in the disabled state, the speaker is enabled after this operation, and actions can again be performed on that speaker. An example POST is shown below.
POST https://server.url//SpeakerId HTTP/1.1
User-Agent: curl/7.58.0
Accept: */*
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
content-type: application/json
content-length: 110
action=enable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx
CURL Command
curl --request POST "https://services.govivace.com:49162/SpeakerId?action=enable&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxx"
Here the action is “enable”, and the other arguments are the same as in the delete curl command.
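Since delete, disable and enable differ only in the action value, the three requests can be built by one helper. This is a minimal sketch using Python's standard library: the name build_management_request is hypothetical, the URL is the one from the curl examples, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://services.govivace.com:49162/SpeakerId"  # from the curl examples
MANAGEMENT_ACTIONS = {"delete", "disable", "enable"}


def build_management_request(action, speaker_name, key, audio_format="8K_PCM16"):
    """Build (but do not send) a POST request for delete, disable or enable."""
    if action not in MANAGEMENT_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    query = urlencode({
        "action": action,
        "speaker_name": speaker_name,
        "format": audio_format,
        "key": key,
    })
    # Like the curl examples, the request carries no body.
    return Request(f"{BASE_URL}?{query}", method="POST")
```

Passing the returned object to urllib.request.urlopen would send it; rejecting unknown actions up front guards against accidentally issuing a different call than intended.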