GRXML ASR Specification

Purpose

This document describes how the GRXML-based grammars work. It explains each grammar and the corresponding output in JSON format.

Introduction

The audio input is an 8 kHz, 16-bit linear PCM file.
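The format requirement above can be checked before a file is uploaded. The following is a minimal sketch using Python's standard `wave` module. It assumes the audio is stored in a WAV container (as the `.wav` file names in the usage examples suggest); the function name `check_wav_format` is hypothetical and not part of the API.

```python
import wave


def check_wav_format(source):
    """Return (ok, details) for a WAV file path or binary file-like object.

    ok is True when the audio is sampled at 8 kHz with 16-bit samples.
    """
    with wave.open(source, "rb") as wav:
        details = {
            "sample_rate_hz": wav.getframerate(),
            "sample_width_bits": wav.getsampwidth() * 8,  # bytes -> bits
            "channels": wav.getnchannels(),
        }
    ok = details["sample_rate_hz"] == 8000 and details["sample_width_bits"] == 16
    return ok, details
```

The channel count is reported but not checked, because this document does not state whether mono audio is required.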

There are three grammars: answer, location, and duration.

  • Answer only contains the inputs “yes” and “no”.
  • Location only contains the inputs “local” and “area”.
  • Duration contains the inputs “less than”, “more than”, “days”, and “hours”, combined with a number.

Examples of duration inputs:

  • Less than 24 days
  • More than 10 hours
  • Less than 1 day
  • Less than 1 hour

Usage

01. For answer grammar

curl --request POST --data-binary "@yes.wav" "http://198.199.70.106:49162/answer?key=xxxxxxxxxxxxxxxxxx"
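The same request can be built in Python. This is a minimal sketch using only the standard library; the host, port, and endpoint names are copied from the curl examples in this document, while `BASE_URL` and `build_request` are hypothetical names introduced for illustration. The function only builds the request, so it can be inspected without contacting the server.

```python
import urllib.parse
import urllib.request

# Host and port copied from the curl examples; adjust for your deployment.
BASE_URL = "http://198.199.70.106:49162"

# The three grammar endpoints described in this document.
GRAMMARS = ("answer", "location", "duration")


def build_request(grammar, audio_bytes, key):
    """Build (but do not send) a POST request for one grammar endpoint."""
    if grammar not in GRAMMARS:
        raise ValueError(f"unknown grammar: {grammar}")
    query = urllib.parse.urlencode({"key": key})
    url = f"{BASE_URL}/{grammar}?{query}"
    # The raw audio bytes are sent as the request body, like --data-binary.
    return urllib.request.Request(url, data=audio_bytes, method="POST")
```

To send it, read the audio file in binary mode, pass its bytes to `build_request`, and hand the result to `urllib.request.urlopen`.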

Response

{
  "status": 0,
  "result": {
    "final": true,
    "hypotheses": [
      {
        "transcript": "[noise] yes",
        "likelihood": 1.0,
        "final_output": "rule_ref: yes_no ,output_tag: yes",
        "word-alignment": [
          {
            "word": "[noise]",
            "start": 0.0,
            "length": 0.0099999997764825821,
            "confidence": 1.0
          },
          {
            "word": "yes",
            "start": 1.2899999618530273,
            "length": 0.52999997138977051,
            "confidence": 1.0
          }
        ]
      }
    ]
  },
  "segment": 0.0,
  "segment-start": 0.0,
  "segment-length": 1.8199999332427979,
  "total-length": 3.1695003509521484,
  "id": "03.27.2019_08.36.47_AM_answer_9_59613404"
}

Note: In the response for this grammar, you need to extract the output_tag value from the final_output field.

Extraction from the JSON

Input: yes

Output: “yes”
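The extraction step can be sketched in Python as follows. This is an illustrative sketch, not the service's own client code: it assumes the `final_output` string always follows the `rule_ref: NAME ,output_tag: VALUE` shape seen in the sample responses, and that the first entry in `hypotheses` is the one to use. The name `extract_output_tags` is hypothetical.

```python
import json
import re

# Assumed shape, taken from the sample responses:
#   "rule_ref: NAME ,output_tag: VALUE"
_TAG_PATTERN = re.compile(r"output_tag:\s*(\S+)")


def extract_output_tags(response_text):
    """Return the output_tag values of the first hypothesis, in order."""
    response = json.loads(response_text)
    if response.get("status") != 0:
        raise ValueError(f"recognition failed with status {response.get('status')}")
    hypotheses = response["result"]["hypotheses"]
    # final_output is optional, so fall back to an empty string.
    final_output = hypotheses[0].get("final_output", "")
    return _TAG_PATTERN.findall(final_output)


# Trimmed-down version of the answer response shown above.
sample = (
    '{"status": 0, "result": {"final": true, "hypotheses": ['
    '{"transcript": "[noise] yes", '
    '"final_output": "rule_ref: yes_no ,output_tag: yes"}]}}'
)
print(extract_output_tags(sample))  # ['yes']
```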

02. For location grammar

curl --request POST --data-binary "@area.wav" "http://198.199.70.106:49162/location?key=xxxxxxxxxxxxxxxxxx"

Response

{
  "status": 0,
  "result": {
    "final": true,
    "hypotheses": [
      {
        "transcript": "[noise] area",
        "likelihood": 1.0,
        "final_output": "rule_ref: location ,output_tag: area",
        "word-alignment": [
          {
            "word": "[noise]",
            "start": 0.0,
            "length": 0.0099999997764825821,
            "confidence": 1.0
          },
          {
            "word": "area",
            "start": 0.15999999642372131,
            "length": 0.56999999284744263,
            "confidence": 1.0
          }
        ]
      }
    ]
  },
  "segment": 0.0,
  "segment-start": 0.0,
  "segment-length": 0.72999995946884155,
  "total-length": 5.2586245536804199,
  "id": "03.27.2019_08.39.18_AM_location_9_25370808"
}

Input: area

Output: “area”

03. For duration grammar

curl --request POST --data-binary "@lessthanonehour.wav" "http://198.199.70.106:49162/duration?key=xxxxxxxxxxxxxxxxxx"

Response

{
  "status": 0,
  "result": {
    "final": true,
    "hypotheses": [
      {
        "speaking_rate": 2.1367521286010742,
        "transcript": "[noise] less than one hour",
        "likelihood": 1.0,
        "final_output": "rule_ref: less_more ,output_tag: LESS_THAN rule_ref: numbers ,output_tag: 1 rule_ref: hour_day ,output_tag: HOUR",
        "word-alignment": [
          {
            "word": "[noise]",
            "start": 0.0,
            "length": 0.0099999997764825821,
            "confidence": 1.0
          },
          {
            "word": "less",
            "start": 1.1899999380111694,
            "length": 0.35999998450279236,
            "confidence": 1.0
          },
          {
            "word": "than",
            "start": 1.5499999523162842,
            "length": 0.22999998927116394,
            "confidence": 1.0
          },
          {
            "word": "one",
            "start": 1.7999999523162842,
            "length": 0.22999998927116394,
            "confidence": 1.0
          },
          {
            "word": "hour",
            "start": 2.0299999713897705,
            "length": 0.31000000238418579,
            "confidence": 1.0
          }
        ]
      }
    ]
  },
  "segment": 0.0,
  "segment-start": 0.0,
  "segment-length": 2.3399999141693115,
  "total-length": 3.5537502765655518,
  "id": "03.27.2019_08.40.19_AM_duration_9_57707448"
}

Input: less than one hour

Output: “LESS_THAN 1 HOUR”
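For the duration grammar, `final_output` carries several rule/tag pairs rather than one. A minimal Python sketch of turning that string into the output shown above follows. It assumes the pairs always appear as `rule_ref: NAME ,output_tag: VALUE` separated by whitespace, as in the sample response; `parse_final_output` and `duration_output` are hypothetical helper names.

```python
import re

# Assumed shape from the sample response: repeated
#   "rule_ref: NAME ,output_tag: VALUE"
# pairs separated by whitespace.
_PAIR_PATTERN = re.compile(r"rule_ref:\s*(\S+)\s*,\s*output_tag:\s*(\S+)")


def parse_final_output(final_output):
    """Return a list of (rule_ref, output_tag) pairs in spoken order."""
    return _PAIR_PATTERN.findall(final_output)


def duration_output(final_output):
    """Join the output tags with single spaces."""
    return " ".join(tag for _, tag in parse_final_output(final_output))


final_output = (
    "rule_ref: less_more ,output_tag: LESS_THAN "
    "rule_ref: numbers ,output_tag: 1 "
    "rule_ref: hour_day ,output_tag: HOUR"
)
print(parse_final_output(final_output))
# [('less_more', 'LESS_THAN'), ('numbers', '1'), ('hour_day', 'HOUR')]
print(duration_output(final_output))  # LESS_THAN 1 HOUR
```

Keeping the rule names alongside the tags makes it possible to pick out the comparison, the number, and the unit individually instead of relying on their position in the joined string.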

 

JSON Fields

01. Status – Represents the status of the response

  • 0 – successful
  • otherwise – unsuccessful

02. Result

  • final – Indicates whether the result is partial or final
      • true – final result
      • false – partial result
  • hypotheses – List of recognition hypotheses, described in the next section

03. Hypotheses

  • speaking_rate – The number of words spoken per second
  • transcript – Contains the whole transcript of a segment
  • likelihood – Represents the probabilistic likelihood; used only for debugging
  • final_output – Represents the grammar output tag. This is an optional field (it is present only in GRXML-based grammar APIs)
  • word-alignment – Contains information about each word
      • word – The one-best word as it appears in the transcript
      • start – Start time of the word in seconds
      • length – Length of the word in seconds
      • confidence – Scaled probability estimate that the word was identified correctly

04. Segment fields (top level of the response, alongside status and result)

  • segment – The number of the current segment
  • segment-start – Start time of the current segment in seconds
  • segment-length – Length of the current segment in seconds
  • total-length – Total length of speech decoded
  • id – The speech ID
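The fields above can be read from a parsed response as in the following Python sketch. It is illustrative only: it assumes the first entry in `hypotheses` is the one of interest, and the function name `summarize` is hypothetical. The end time of each word is derived as `start + length`.

```python
import json


def summarize(response_text):
    """Return (transcript, [(word, start, end), ...]) for a final result.

    Returns None when status is non-zero or the result is only partial.
    """
    response = json.loads(response_text)
    if response["status"] != 0 or not response["result"]["final"]:
        return None
    best = response["result"]["hypotheses"][0]  # assumption: first entry is used
    words = [
        (w["word"], w["start"], w["start"] + w["length"])
        for w in best["word-alignment"]
    ]
    return best["transcript"], words


# Shortened response with rounded timings, for illustration only.
sample = json.dumps({
    "status": 0,
    "result": {
        "final": True,
        "hypotheses": [{
            "transcript": "[noise] yes",
            "word-alignment": [
                {"word": "[noise]", "start": 0.0, "length": 0.25, "confidence": 1.0},
                {"word": "yes", "start": 1.25, "length": 0.5, "confidence": 1.0},
            ],
        }],
    },
    "segment": 0.0,
})
print(summarize(sample))
# ('[noise] yes', [('[noise]', 0.0, 0.25), ('yes', 1.25, 1.75)])
```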