Speaker Verification API Specification

Purpose

The purpose of this document is to share the Speaker Verification API specification with GoVivace's potential customers. The contents of this document are GoVivace proprietary and subject to change.

Introduction

This API exposes speaker verification routines as a RESTful web service. The speaker verification process consists of two steps: enrollment and verification. The enrollment process takes multiple examples of the same speaker saying something specific and generates a voiceprint. A voiceprint is a digital representation of the speaker's voice. Voiceprints are stored internally on the server and are used to verify the voice in the future. Once enrollment for a particular user is complete, it does not need to be repeated.

The verification step receives a new utterance from the same speaker and uses the voiceprint to verify that it is the same voice.

Enrollment

The purpose of enrollment is to collect examples of a speaker's voice and extract a "voiceprint" from them. The voiceprint, much like a fingerprint, is a succinct representation of the speaker's voice. By enrolling a speaker, we let the speaker verification system know what a particular speaker sounds like. In many situations it is desirable to use multiple examples of a speaker's voice to create the voiceprint. This is particularly true if the process is being used to verify a speaker using a relatively short phrase. The speaker verification server therefore allows multiple enrollment examples to be sent. Since these examples could be collected over a period of time, the server remembers the previous examples.

An example POST to the speaker verification server could look like:

POST http://server.url/SpeakerId HTTP/1.1
Accept: */*
Content-Length: 125324
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue
Host: server.url:port
User-Agent: curl/7.38.0

action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx

Curl Command

curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49180/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx"

The body of the POST contains the entire audio file in 16-bit linear PCM, 8 kHz format.
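The same enrollment request can be issued from Python. The sketch below uses only the standard library; the endpoint and key are placeholders from the examples above, and the helper names (`build_url`, `enroll`) are our own, not part of the API.

```python
# Sketch of an enrollment POST using only the Python standard library.
# The base URL and key are placeholders; helper names are illustrative.
import urllib.parse
import urllib.request

def build_url(base, action, speaker_name, key, fmt="8K_PCM16"):
    """Assemble the /SpeakerId URL with its four query arguments."""
    query = urllib.parse.urlencode({"action": action,
                                    "speaker_name": speaker_name,
                                    "format": fmt,
                                    "key": key})
    return f"{base}?{query}"

def enroll(audio_path, speaker_name, key,
           base="https://services.govivace.com:49180/SpeakerId"):
    """POST a 16-bit linear PCM, 8 kHz audio file as the raw request body."""
    with open(audio_path, "rb") as f:
        audio = f.read()
    req = urllib.request.Request(build_url(base, "enroll", speaker_name, key),
                                 data=audio, method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```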

This POST request has four arguments. "action" selects one of the possible actions; in this case, the action is to enroll the speaker. speaker_name is a required argument: a unique string identifier for the speaker. This string must be unique for each speaker, and any future reference to the same speaker must use exactly the same string. The string is case sensitive and may consist of letters, digits, and the underscore "_".

It is recommended that repeated calls be made to the enrollment method with new examples of the enrollee's voice until enrollment_audio_time reaches 20-30 seconds or more. All calls to the enrollment method MUST use a subset of words from a fixed list of words, so that the enrollment text for any particular speaker_name has each word repeated several times. For example, if the set of words is {red, green, blue, three, four, five}, then a valid enrollment string for this speaker can be any combination of these words, and when six enrollment strings of four words each are sent, it is possible to have seen four examples of each word. The enrollment strings may differ between speakers. The same set of words will be used at verification time for that particular user, but the order or specific combination is not important. It is recommended that each enrollment string amount to at least 5 seconds of audio.
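One way a client might generate enrollment phrases that satisfy the repetition requirement above is to cycle through the word set. This is a sketch: the word list matches the example in the text, but the helper name and phrase counts are our own.

```python
# Sketch: build enrollment phrases from a fixed word set so that every
# word recurs equally often across the phrases, as recommended above.
import itertools

def make_enrollment_phrases(words, n_phrases=6, words_per_phrase=4):
    """Cycle through `words` so each word appears the same number of times."""
    cycle = itertools.cycle(words)
    return [" ".join(next(cycle) for _ in range(words_per_phrase))
            for _ in range(n_phrases)]

WORDS = ["red", "green", "blue", "three", "four", "five"]
phrases = make_enrollment_phrases(WORDS)
# Six phrases of four words each: 24 words total, so each of the six
# words appears exactly four times across the enrollment strings.
```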

Websocket API for Enrollment

wss://services.govivace.com:49180/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx

Python Client

python client.py --uri "wss://services.govivace.com:49180/SpeakerId?action=enroll&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx" --save-json-filename sample1_enroll.json --rate 4200 sample1.wav

(For Websocket API)

After the last block of speech data, a special 3-byte ANSI-encoded string “EOS” (“end-of-stream”) needs to be sent to the server. This tells the server that no more speech is coming.

After sending “EOS”, the client has to keep the WebSocket open to receive the result from the server. The server closes the connection itself when results have been sent to the client. No more audio can be sent via the same WebSocket after an “EOS” has been sent. In order to process a new audio stream, a new WebSocket connection has to be created by the client.
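The send-then-"EOS" protocol above can be sketched as a small client routine. It assumes a WebSocket object in the style of the `websocket-client` library (with `send_binary`, `send`, and `recv` methods); the chunk size and function name are our own choices, not part of the API.

```python
def stream_and_get_result(ws, audio, chunk_size=4000):
    """Send raw PCM audio over an open WebSocket in chunks, terminate
    with the 3-byte "EOS" marker, then block until the server returns
    its single JSON result (after which the server closes the socket)."""
    for start in range(0, len(audio), chunk_size):
        ws.send_binary(audio[start:start + chunk_size])
    ws.send("EOS")       # no more audio may follow on this connection
    return ws.recv()     # keep the socket open until the result arrives
```

A new WebSocket connection must be created for each audio stream, since the server closes the connection after replying.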

Enrollment Response
The server sends enrollment results and other information to the client using the JSON format. The response can contain the following fields:

  • status: response status (integer), see codes below
  • message: status message
  • enrollment_audio_time: Total amount of audio collected for this speaker so far in seconds
  • processing_time: Total amount of time spent at the server side to process the audio.

The following status codes are currently in use:

  • 0: Success. Enrollment happened. In this case, the result will be sent back.
  • 1: No speech. Sent when the incoming audio is too short, contains a large portion of silence or non-speech.
  • 2: Unusually high energy. This code is generally sent when the incoming audio contains so much noise that it does not look like real audio data. This can happen if the input is not 16-bit linear PCM, or if the silence at the beginning and end of the utterance has been completely stripped out.
  • 3: Invalid request or a request format error
  • 4: Input not matched (input rejected) – The audio did not seem to match the text string. Therefore the input was rejected.
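Based on the status codes above, a client might decide whether enrollment is complete as sketched below. The 20-second target comes from the recommendation earlier in this document; the helper name is hypothetical.

```python
# Sketch: decide from an enrollment response whether to send more audio.
# Status 0 means enrollment happened; 1-4 are failure codes (see above).
import json

def enrollment_complete(response_text, target_seconds=20.0):
    """Return True when enrollment succeeded and enough audio is banked."""
    resp = json.loads(response_text)
    if resp["status"] != 0:   # no speech, noise, bad request, or mismatch
        return False
    return resp.get("enrollment_audio_time", 0.0) >= target_seconds
```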

Examples of server responses

{"message": "Enrollment done. Sending more enrollment will only help.", "status": 0, "enrollment_audio_time": 61.680000305175781, "processing_time": 12.259484}

{"status": 1, "processing_time": 7.295596, "message": "Enrollment audio is not sufficient. Please send more examples.", "enrollment_audio_time": 30.850000381469727}

Verification

Once a speaker has been enrolled in the system, a new recording of the speaker's voice can be used to verify that it is the same speaker. The audio is once again sent in the body of the message. An example POST is shown below.

POST http://server.url/SpeakerId HTTP/1.1
Accept: */*
Content-Length: 125324
Content-Type: application/x-www-form-urlencoded
Expect: 100-continue
Host: server.url:port
User-Agent: curl/7.38.0

action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx

CURL Command

curl --request POST --data-binary "@sample1.wav" "https://services.govivace.com:49180/SpeakerId?action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx"

Websocket API for Verify

wss://services.govivace.com:49180/SpeakerId?action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx

Python Client

python client.py --uri "wss://services.govivace.com:49180/SpeakerId?action=verify&speaker_name=sample1&format=8K_PCM16&key=xxxxxxxxxxxxxxxxxxx" --save-json-filename sample1_verify.json --rate 4200 sample1.wav

This request is similar to the enrollment request except that the action is "verify". All parameters are required. If the audio does not match the specified string at a reasonable level of confidence, the verification process fails. If it matches, the verification score is returned along with the string-match results. Examples of server responses are:

{“message”: “Speaker verification is successful”, “status”: 0, “verification_score”: 237.35360717773438, “speaker”: “sample1”, “enrollment_audio_time”: 30.840000152587891, “processing_time”: 7.0539759999999996}

Here is a description of the fields:

  • status: response status (integer), see codes above
  • message: (optional) status message
  • enrollment_audio_time: Total amount of spoken audio found in this particular utterance, in seconds
  • verification_score: a floating-point number less than 1. Positive values mean a match.
  • processing_time: Time spent in processing the audio in seconds.

In these responses, most of the fields are the same as before. enrollment_audio_time shows the length of spoken audio that was found in the verification utterance, and verification_score is a floating-point number corresponding to the match between the speaker's voiceprint and the verification utterance. The verification score may have a maximum value of 1.0, but negative values are possible. A reasonable threshold for determining a match varies from application to application, and from verification string to verification string, but is generally expected to be higher than 0.05.
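A client-side accept/reject decision on the verification response might look like the sketch below. The 0.05 default reflects the guidance above, but any real threshold should be tuned per application; the helper name is our own.

```python
# Sketch: turn a /SpeakerId verify response into an accept/reject
# decision. The 0.05 default follows the guidance in the text; tune it
# per application and verification string.
import json

def verify_decision(response_text, threshold=0.05):
    """Return (accepted, score) from a verify response."""
    resp = json.loads(response_text)
    if resp["status"] != 0:          # request itself failed (codes 1-4)
        return False, None
    score = resp["verification_score"]
    return score >= threshold, score
```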

Status codes retain the same meanings.