Speaker Verification API User Manual

Purpose

The purpose of this document is to share the Speaker Verification API specification with potential GoVivace customers. The contents of this document are GoVivace proprietary and subject to change.

Introduction

This API exposes speaker verification routines as a RESTful web service. The speaker verification process consists of two steps: enrollment and verification.

The enrollment process takes multiple examples of the same speaker speaking something specific to generate a voiceprint. Voiceprints are a digital representation of the speaker’s voice. They are stored internally in the server and are used to verify the voice in the future. Once the enrollment for a particular user is complete, it does not need to be repeated.

The verification step receives a new utterance from the same speaker and verifies that it’s the same voice using the voiceprint.

Enrollment

The purpose of enrollment is to collect examples of a speaker’s voice and extract a ‘voiceprint’ from them. The voiceprint, like a fingerprint, is a succinct representation of the speaker’s voice. Enrolling a speaker tells the speaker verification system what that speaker sounds like. In many situations, it is desirable to build the voiceprint from multiple examples of a speaker’s voice. This is particularly true when the speaker is verified using a relatively short phrase. Therefore, the speaker verification server allows multiple enrollment examples to be sent. Since these examples could be collected over a period of time, the server remembers the previous examples.

An example POST or PUT to the Speaker Verification Server could look like:

● PUT /SpeakerId?action=enroll&speaker_name=speaker1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxxxxx HTTP/1.1
● Host: services.govivace.com:port
● User-Agent: curl/7.58.0
● Accept: */*
● Content-Length: 105788
● Content-Type: application/x-www-form-urlencoded
● Expect: 100-continue

You can make a request to the server in two ways, over either HTTP or WebSocket:
1. For HTTP:
"https://services.govivace.com:49180/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxxxxxxxx"

2. For WS:
"wss://services.govivace.com:49180/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxxxxxxxx"

Important parameters for this request:
action: enroll (required)
speaker_name: A unique string identifier for the speaker. It must be unique for each speaker, and any future reference to the same speaker must use exactly the same string. The string is case sensitive and can consist of letters, digits, and the underscore "_". (required)
format: 8K_PCM16 (16-bit linear PCM at 8 kHz). (required)
key: 32bit unique secret key provided by us. (required)

Note: speaker_name value can be alphanumeric and case sensitive.
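
The parameters above can be assembled into a request URL programmatically. The following is a minimal Python sketch, not an official client: the helper name build_url and the placeholder key are assumptions made for illustration, while the host, port, path, and parameter names are taken from the example URLs in this manual. Passing "verify" as the action yields the corresponding verification URL.

```python
import re
from urllib.parse import urlencode

# Host and port copied from the example URLs in this manual.
BASE = "services.govivace.com:49180"
# Letters, digits and underscore, as described for speaker_name.
SPEAKER_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def build_url(action, speaker_name, key, scheme="https", audio_format="8K_PCM16"):
    """Return a /SpeakerId request URL (hypothetical helper, not part of the API)."""
    if action not in ("enroll", "verify"):
        raise ValueError("action must be 'enroll' or 'verify'")
    if scheme not in ("https", "wss"):
        raise ValueError("scheme must be 'https' or 'wss'")
    if not SPEAKER_NAME_RE.fullmatch(speaker_name):
        raise ValueError("speaker_name may contain only letters, digits and '_'")
    query = urlencode({
        "action": action,
        "speaker_name": speaker_name,
        "format": audio_format,
        "key": key,
    })
    return f"{scheme}://{BASE}/SpeakerId?{query}"


# YOUR_KEY is a placeholder for the secret key issued to you.
print(build_url("enroll", "sample1", "YOUR_KEY"))
```

Rejecting an invalid speaker_name on the client side avoids a round trip that the server would presumably answer with an invalid-request status.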

The HTTP request format and response are shown in the example below:
curl --request PUT --data-binary "@sample1.wav" "https://services.govivace.com:49180/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxxxxxxxx"

Response:
{
  "processing_time": 1.795606,
  "status": 1,
  "message": "Enrollment audio is not sufficient. Please send more examples.",
  "model_quality_score": 90.855392456054688,
  "Enrollment_audio_time": 11.600000381469727,
  "Spoof_attack_probability": 0.2,
  "Playback_probability": 0.1
}
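
For clients that do not shell out to curl, the same PUT request can be issued from Python's standard library. This is a rough sketch under stated assumptions: the helper names make_put_request and send are invented for illustration, the URL in the demo is a placeholder, and the body is sent as raw file bytes in the same way curl's --data-binary does.

```python
import json
import urllib.request


def make_put_request(url, audio_bytes):
    """Build (but do not send) a PUT request carrying the audio in the body."""
    return urllib.request.Request(url, data=audio_bytes, method="PUT")


def send(request, timeout=30):
    """Send the request and decode the JSON reply.

    Requires network access and a valid key; not called in this sketch.
    """
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))


# Placeholder URL and placeholder bytes; in practice read sample1.wav in "rb" mode.
url = ("https://services.govivace.com:49180/SpeakerId"
       "?action=enroll&speaker_name=sample1&format=8K_PCM16&key=YOUR_KEY")
req = make_put_request(url, b"\x00\x00" * 8000)
print(req.get_method(), len(req.data))
```

Separating request construction from sending keeps the construction step easy to check without contacting the server.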

For WebSocket API:
After the last block of speech data, the client must send a special 3-byte ANSI-encoded string "EOS" ("end-of-stream") to the server. This tells the server that no more speech is coming.
After sending "EOS", the client has to keep the WebSocket open to receive results from the server. The server closes the connection itself once the results have been sent to the client. No more audio can be sent over the same WebSocket after "EOS" has been sent. To process a new audio stream, the client has to create a new WebSocket connection.
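
The message sequence described above (audio blocks followed by the "EOS" marker) can be sketched as a small generator that is independent of any particular WebSocket library. The chunk size of 4000 bytes (a quarter of a second of 8 kHz 16-bit audio) and the function name are assumptions for illustration; the manual does not state a required block size, nor whether "EOS" should travel as a text or a binary frame.

```python
EOS = b"EOS"  # 3-byte end-of-stream marker described above


def websocket_messages(audio_bytes, chunk_size=4000):
    """Yield the messages to send: audio chunks in order, then the EOS marker."""
    for start in range(0, len(audio_bytes), chunk_size):
        yield audio_bytes[start:start + chunk_size]
    # Nothing may follow this marker on the same connection.
    yield EOS


audio = bytes(10000)  # placeholder audio, all zero bytes
messages = list(websocket_messages(audio))
print([len(m) for m in messages])  # [4000, 4000, 2000, 3]
```

A real client would pass each yielded message to its WebSocket library's send call and then keep reading until the server closes the connection.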

The WS request format and response are shown in the example below:
python client.py --uri "wss://services.govivace.com:49180/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx" --save-json-filename sample1_enroll.json --rate 8000 sample1.wav

Response:
{
  "status": 4,
  "message": "Beginning of speech"
}
{
  "processing_time": 0.952599,
  "status": 1,
  "message": "Enrollment audio is not sufficient. Please send more examples.",
  "model_quality_score": -100000.0,
  "Enrollment_audio_time": 5.800000190734863,
  "Spoof_attack_probability": 0.2,
  "Playback_probability": 0.1
}

It is recommended that the enrollment method be called repeatedly with new examples of the enrollee’s voice until enrollment_audio_time reaches 20-30 seconds or more. Every call to the enrollment method MUST use a subset of words drawn from a fixed list of words, so that across the enrollment text for any particular speaker_name each word is repeated several times.
For example, if the set of words is {red, green, blue, three, four, five}, then a valid enrollment string for this speaker is any combination of these words, and when six enrollment strings of four words each are sent, each word can appear four times. However, the enrollment strings may differ from speaker to speaker. The same set of words will be used at verification time for that particular user, but the order or specific combination is not important. It’s recommended that each validation string amount to at least 5 seconds of audio.
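
The arithmetic in this example (six words, six strings of four words, four occurrences of each word) can be reproduced with a short prompt generator. This is only one possible way to build balanced prompts; the function name and the shuffling strategy are assumptions and are not prescribed by the API.

```python
import random
from collections import Counter


def enrollment_strings(words, repeats=4, words_per_string=4, seed=None):
    """Cut `repeats` shuffled passes over the word list into fixed-size strings."""
    total = len(words) * repeats
    if total % words_per_string:
        raise ValueError("len(words) * repeats must be divisible by words_per_string")
    rng = random.Random(seed)
    pool = []
    for _ in range(repeats):
        one_pass = list(words)
        rng.shuffle(one_pass)  # vary the word order between passes
        pool.extend(one_pass)
    return [" ".join(pool[i:i + words_per_string])
            for i in range(0, total, words_per_string)]


words = ["red", "green", "blue", "three", "four", "five"]
prompts = enrollment_strings(words, seed=0)
counts = Counter(w for p in prompts for w in p.split())
print(len(prompts), sorted(counts.values()))  # 6 [4, 4, 4, 4, 4, 4]
```

Because every pass contains each word exactly once, the totals are balanced regardless of the shuffle, although a single prompt may occasionally contain the same word twice.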

Enrollment Response

The server sends enrollment results and other information to the client in JSON format. The response can contain the following fields:
status: Response status (integer); see the codes below.
message: Status message.
enrollment_audio_time: Total amount of audio collected for this speaker so far, in seconds.
processing_time: Total amount of time spent on the server side to process the audio.
model_quality_score: Quality score of the speech.
Spoof_attack_probability: Probability that the speech was generated through voice conversion or text-to-speech.
Playback_probability: Probability that the audio is a replayed recording of the speaker’s speech.
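
A client can use these fields to decide whether to ask the speaker for another enrollment example. The sketch below makes two assumptions: it lower-cases all keys, because the sample responses in this manual spell the field as Enrollment_audio_time while the list above uses enrollment_audio_time, and it takes 20 seconds (the lower end of the recommended 20-30 seconds) as the target. The function names are illustrative only.

```python
import json

TARGET_SECONDS = 20.0  # lower end of the 20-30 second recommendation


def normalize(response_text):
    """Parse a JSON reply and lower-case its top-level keys."""
    return {key.lower(): value for key, value in json.loads(response_text).items()}


def needs_more_enrollment(fields, target=TARGET_SECONDS):
    """True while the accumulated enrollment audio is below the target."""
    return float(fields.get("enrollment_audio_time", 0.0)) < target


reply = '{"status": 1, "Enrollment_audio_time": 11.600000381469727}'
fields = normalize(reply)
print(needs_more_enrollment(fields))  # True
```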

The following status codes are currently in use:
0: Success. Enrollment happened. In this case, the result will be sent back.
1: No speech. Sent when the incoming audio is too short or contains a large portion of silence or non-speech.
2: Unusually high energy. This code is generally sent when the incoming audio is so noisy that it does not look like real audio data. This could happen if the input is not 16-bit linear PCM, or if the silence portions at the beginning and end of the utterance are stripped out entirely.
3: Invalid request or a request format error.
4: Beginning of speech.
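
In client code, the status codes listed above can be given names so that comparisons stay readable. The enum member names below are assumptions chosen for this sketch; only the integer values and their meanings come from the list. Treating status 4 as an interim notice rather than a final result is inferred from the WebSocket sample responses.

```python
from enum import IntEnum


class Status(IntEnum):
    SUCCESS = 0
    NO_SPEECH = 1
    HIGH_ENERGY = 2
    INVALID_REQUEST = 3
    BEGINNING_OF_SPEECH = 4


def describe(code):
    """Return the symbolic name for a status code, tolerating unknown values."""
    try:
        return Status(code).name
    except ValueError:
        return f"UNKNOWN({code})"


def is_interim(code):
    """Status 4 announces the start of speech; a further message is expected."""
    return code == Status.BEGINNING_OF_SPEECH


print(describe(0), describe(4), describe(9))
```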

Verification

Once a speaker has been enrolled in the system, a new recording of the speaker’s voice can be used to verify if it’s the same speaker. The audio is once again sent in the body of the message.
An example POST or PUT to the Speaker Verification Server could look like:
● PUT /SpeakerId?action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx HTTP/1.1
● Host: services.govivace.com:49180
● User-Agent: curl/7.58.0
● Accept: */*
● Content-Length: 105788
● Content-Type: application/x-www-form-urlencoded
● Expect: 100-continue
You can make a request to the server in two ways, over either HTTP or WebSocket:
1. For HTTP:
"https://services.govivace.com:49180/SpeakerId?action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxxxxxxxx"
2. For WS:
"wss://services.govivace.com:49180/SpeakerId?action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxxxxxxxx"
This format is similar to the enrollment format except that the action is ‘verify’. All the parameters are required. If the audio does not match the specified string at a reasonable level of confidence, then the verification process fails. If the audio string matches, then the verification score is returned along with the string match results.
Important parameters for this request:
action: verify (required)
speaker_name: A unique string identifier for the speaker. It must be unique for each speaker, and any future reference to the same speaker must use exactly the same string. The string is case sensitive and can consist of letters, digits, and the underscore "_". (required)
format: 8K_PCM16 (16-bit linear PCM at 8 kHz). (required)
key: 32bit unique secret key provided by us. (required)

Note: speaker_name value can be alphanumeric and case sensitive.

The HTTP request format and response are shown in the example below:
curl --request PUT --data-binary "@sample1.wav" "https://services.govivace.com:49180/SpeakerId?action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxxxxxxxxx"

Response:
{
  "processing_time": 0.73345400000000005,
  "status": 0,
  "message": "Speaker verification is successful",
  "speaker": "sample1",
  "verification_score": 98.361328125,
  "enrollment_audio_time": 5.8000001907348633
}

The WS request format and response are shown in the example below:
python client.py --uri "wss://services.govivace.com:49180/SpeakerId?action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx" --save-json-filename sample1_enroll.json --rate 8000 sample1.wav

Response:
{
  "status": 4,
  "message": "Beginning of speech"
}
{
  "processing_time": 0.918749,
  "status": 0,
  "message": "Speaker verification is successful",
  "speaker": "sample1",
  "verification_score": 90.83572387695312,
  "enrollment_audio_time": 5.800000190734863
}
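
The WebSocket response above consists of two JSON objects printed back to back: a status 4 notice followed by the actual result. If a client receives or stores them as one string, a plain json.loads call would fail on the second object. The following sketch, whose function names are assumptions, splits such a string with the standard library's JSONDecoder.raw_decode and picks the final result.

```python
import json


def split_json_objects(text):
    """Decode back-to-back JSON objects from one string, in order."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


def final_result(objects):
    """Return the last object that is not a status 4 notice, or None."""
    for obj in reversed(objects):
        if obj.get("status") != 4:
            return obj
    return None


raw = '{"status":4,"message":"Beginning of speech"} {"status":0,"speaker":"sample1"}'
print(final_result(split_json_objects(raw)))
```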

Verification Response

The server sends verification results and other information to the client using the JSON format. Most of the fields are the same as before. The response can contain the following fields:

status: Response status (integer); see the codes above.
message: Status message.
enrollment_audio_time: Total amount of spoken audio in this particular utterance, in seconds.
verification_score: A floating-point number no greater than 1.0. Positive values mean a match.
processing_time: Time spent processing the audio, in seconds.
Spoof_attack_probability: Probability that the speech was generated through voice conversion or text-to-speech.
Playback_probability: Probability that the audio is a replayed recording of the speaker’s speech.

Verification scores have a maximum value of 1.0, and negative values are possible. A reasonable threshold for determining a match may vary from application to application and from verification string to verification string, but is generally expected to be higher than 0.05.
Status codes retain the same meanings.
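
An accept/reject decision based on the response could be sketched as follows. The default threshold of 0.05 is the lower bound mentioned above and should be tuned per application; the function name is an assumption. Note that the sample responses in this manual show scores in the 90s, so it may be worth confirming which score scale your server returns before fixing a threshold.

```python
def is_match(response, threshold=0.05):
    """Accept only when status is 0 and the score exceeds the threshold."""
    if response.get("status") != 0:
        return False
    score = response.get("verification_score")
    return score is not None and score > threshold


print(is_match({"status": 0, "verification_score": 0.3}))   # True
print(is_match({"status": 0, "verification_score": -0.2}))  # False
print(is_match({"status": 1, "verification_score": 0.9}))   # False
```

Checking the status first ensures that a score attached to a failed or interim response is never treated as a match.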