Master’s Project
IVR (Interactive Voice Application) UB On Line
Submitted By: Thomas Philip
Student ID: 0516801
Project Guide: Prof. Ausif Mahmood
Department of Computer Science
· VoiceXML
· Features of VoiceXML
· A Typical Application
· Fig 1
· Differences
· Fig 2
· Fig 3
· Header
· Main Body
· Development Environments
· Voice Gateways
· Reference
Have you ever talked to your
computer? (And no, yelling at it when the Internet connection goes down or
making polite chit-chat with it as you wait for all 25 MB of that very important
file to download doesn't count.) Have you really talked to your computer? Where
it actually recognized what you said and then did something as a result? If you
have, then you've used a technology known as speech recognition.
VoiceXML takes speech recognition even further. Instead of talking to the computer, you're essentially talking to a web site, and you're doing this over the phone, an arrangement sometimes known as interactive voice response (IVR).
OK, you say, well, what exactly
is speech recognition? Simply put, it is the process of converting spoken input
to text. Speech recognition is thus sometimes referred to as speech-to-text.
Speech recognition allows you to provide input to an application with your voice. Just as clicking with your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows you to provide input by speaking. For example, you might say something like "checking account balance", to which your bank's VoiceXML application replies "one million, two hundred twenty-eight thousand, six hundred ninety-eight dollars and thirty-seven cents."
Or, in response to hearing "Please say coffee, tea, or milk," you say "coffee" and the VoiceXML application you're calling tells you what the flavor of the day is and then asks if you'd like to place an order.
2. Advantages of IVR/Speech Recognition
This is what I found in the latest news:
Internet powerhouse Yahoo! (Nasdaq: YHOO) announced a free Internet telephone service
Tuesday that will allow users to access e-mail, news and other information from
remote locations.
The company also unveiled a joint
venture with Net2Phone, Inc. (Nasdaq:
NTOP) to provide a service that allows its customers to make free calls from
personal computers to telephones.
Microsoft Research Spawns a New
Era in Speech Technology: Simpler, Faster, and Easier Speech Application
Development
Microsoft is equally determined
to move speech technology into the mainstream, making it a widespread industry.
Kokanee is the codename for a major research and
development effort at Microsoft designed to grow the speech industry. Its
focus, the delivery of the .Net Speech Platform, will make speech-enabled
application development and deployment simpler, faster, and easier. If Kokanee succeeds in stimulating the creation of killer
speech applications, we may soon find ourselves talking with computers on a
more frequent basis and actually enjoying the experience.
“By 2005 there will be 128
million users of speech applications, with nearly 50% of those users considered
regular users of speech-enabled applications.” (The Kelsey Group)
Speech Recognition Market To Exceed $5 Billion by 2008
Speech recognition has slowly been building its reputation, as accuracy rates are
much improved and applications to meet users’ needs are being developed. Allied
Business Intelligence (ABI) projects this market to increase to $897.8 million
in 2003, up from $677 million in 2002. Over the longer term, the speech
recognition market is forecast to grow to $5.3 billion by 2008.
SALT and VoiceXML, the latter backed by the VoiceXML Forum and the W3C Voice Browser working group, have at times been considered
competitors, since SALT can also be used for telephony applications that are not multimodal.
a. Microsoft speech technology
SALT (Speech Application Language
Tags) is an extension of HTML and other markup languages (cHTML,
XHTML, WML) that adds a powerful speech interface to Web pages, while
maintaining and leveraging all the advantages of the Web application model.
These tags are designed to be used for both voice-only browsers (for example, a
browser accessed over the telephone) and multimodal browsers.
SALT (Speech Application Language
Tags) is a small set of XML elements, with associated attributes and DOM object
properties, events, and methods, which may be used in conjunction with a source
markup document to apply a speech interface to the source page. The SALT
formalism and semantics are independent of the nature of the source document,
so SALT can be used equally effectively within HTML and all its flavors, or
with WML, or with any other SGML-derived markup.
Multimodal access will enable
users to interact with an application in a variety of ways: they will be able
to input data using speech, a keyboard, keypad, mouse and/or stylus, and
produce data as synthesized speech, audio, plain text, motion video, and/or
graphics. Each of these modes will be able to be used independently or
concurrently.
The main top-level elements of SALT are:
· <prompt ...> for speech synthesis configuration and prompt playing
· <listen ...> for speech recognizer configuration, recognition execution and post-processing, and recording
· <dtmf ...> for configuration and control of DTMF collection
· <smex ...> for general-purpose communication with platform components
The input elements <listen> and <dtmf>
also contain grammars and binding controls
· <grammar ...> for specifying input grammar resources
· <bind ...> for processing of recognition results
and <listen> also contains the facility to
record audio input
· <record ...> for recording audio input
A call control object is also
provided for control of telephony functionality.
There are several advantages to
using SALT with a mature display language such as HTML. Most notably: The event
and scripting models supported by visual browsers can be used by SALT
applications to implement dialog flow and other forms of interaction processing
without the need for extra markup.
The addition of speech
capabilities to the visual page provides a simple and intuitive means of
creating multimodal applications.
In this way, SALT is a
lightweight specification, which adds a powerful speech interface to Web pages,
while maintaining and leveraging all the advantages of the Web application
model.
SALT also provides DTMF and call control
capabilities for telephony browsers running voice-only applications through a
set of DOM object properties, methods, and events.
VoiceXML, the Voice Extensible Markup Language, is a markup language for
writing voice-enabled IVR and web services and applications. VoiceXML is the
'HTML' for telephony-based speech applications. It hides the complexities of
the telephony platform from developers and provides an easy way of developing
feature-rich and media-rich speech applications. It uses speech recognition and
DTMF for user input, and prerecorded audio and text-to-speech for output.
VoiceXML was proposed by the VoiceXML Forum (http://www.voicexml.org) and is an
international standard for writing telephony-based voice applications. The
VoiceXML Forum is a group of about 500 companies worldwide and is still growing.
4.1 Features of VoiceXML
The following diagram shows a typical VoiceXML application at
work: 
Fig 1:
A user connects with your application by dialing the appropriate phone number. The VoiceXML interpreter answers the call and starts executing your VoiceXML document. Under the document's control, the interpreter may perform actions such as:
· Sending vocal prompts, messages, or other audio material (such as music or sound effects) to the user
· Accepting numeric input that the user enters by DTMF (telephone key tone) signals.
· Accepting voice input and recognizing the words.
· Accepting voice input and simply recording it, without trying to recognize any words.
· Sending the user's information to a web site or other Internet server.
· Receiving information from the Internet and passing it to the user.
In addition, VoiceXML
documents can perform programming functions such as arithmetic and text
manipulation. This allows a document to check the validity of the user's input.
Also, a user's session need not be a simple sequence that runs the same way
every time. The document may include "if-then-else" decision-making
and other complex structures.
4.3 Differences
There are substantial differences, however, between SALT and VoiceXML. The SALT
specification defines a set of “lightweight” tags as extensions to commonly
used Web-based programming languages, such as Java or ECMAScript, that are
already well developed. It also uses the W3C standards it has in common with
VoiceXML and some Internet standards from the Internet Engineering Task Force
(IETF). This use of existing, well-tested standards and programming languages
can accelerate the maturing of the new standard. VoiceXML, by contrast, is a
self-contained language that does not require other programming languages.
Another difference is that VoiceXML implements some functions at a higher
level, particularly the “form” function for gathering specific information,
which avoids the need to program that function explicitly. SALT operates at a
lower level, giving more control to the programmer but, in its raw form,
requiring more effort in cases where the form function would have satisfied
the application's needs.
A fundamental difference is that VoiceXML does not
currently deal with multimodal interactions, while SALT was designed from the
start to handle multimodal extensions. A practical difference is that VoiceXML
is at version 2.0 and has many fielded deployments, while SALT 1.0 has just
been introduced.
5. Architecture of VoiceXML
VoiceXML builds on existing data networking standards, such as XML, HTTP, and TCP/IP; and on telephone standards in the Public Switched Telephone Network (PSTN) and Integrated Services Digital Network (ISDN).
The voice web consists of the PSTN, VoiceXML applications on the Internet, and a VoiceXML gateway between the Internet and the PSTN. The VoiceXML gateway hosts specialized hardware and software that enable voice browsing. Some of these resources, such as automatic speech recognition (ASR) and text-to-speech (TTS), may be located on separate network elements and accessed remotely.
In a voice browser session, the user's phone call goes over the PSTN to a VoiceXML gateway. Based on the number that the user dialed, the gateway downloads and possibly caches the corresponding VoiceXML application from the Internet. The gateway then steps through the VoiceXML, interacting with the user as defined in the application.
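The session flow just described (look up the dialed number, download the VoiceXML, possibly cache it) can be sketched roughly as follows. This is only an illustrative model in Python, not how any particular gateway is implemented: the class name `VoiceGateway`, the `fetch` callback, the phone number, and the URL are all hypothetical, and the download is injected as a function so the sketch runs without network access.

```python
from typing import Callable, Dict


class VoiceGateway:
    """Toy model of a gateway's dialed-number-to-URL lookup and document cache."""

    def __init__(self, number_to_url: Dict[str, str], fetch: Callable[[str], str]):
        self.number_to_url = dict(number_to_url)
        self.fetch = fetch  # injected stand-in for an HTTP GET (assumption)
        self.cache: Dict[str, str] = {}

    def document_for(self, dialed_number: str) -> str:
        """Return the VoiceXML text for a call, downloading only on a cache miss."""
        url = self.number_to_url[dialed_number]  # KeyError if the number is unknown
        if url not in self.cache:
            self.cache[url] = self.fetch(url)
        return self.cache[url]


# Usage with a fake download function (hypothetical number and URL).
downloads = []


def fake_fetch(url: str) -> str:
    downloads.append(url)
    return '<vxml version="2.0"/>'


gateway = VoiceGateway({"2035550100": "http://example.com/ub/main.vxml"}, fake_fetch)
first = gateway.document_for("2035550100")
second = gateway.document_for("2035550100")  # served from the cache
print(len(downloads))  # 1
```

Keeping the cache keyed by URL rather than by phone number means that two numbers mapped to the same application would share one downloaded copy; a real gateway would also need some expiry policy, which is left out here.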
Fig 2: A typical voice network. The caller's telephone connects over the PSTN to the VoiceXML gateway, which connects over the Internet to the VoiceXML web server.
Following is a description of the elements involved in the network diagram:
· Caller Telephone. The telephone that the caller uses to access a VoiceXML application. The figure illustrates a call over the PSTN.
· VoiceXML Gateway. A gateway that bridges the PSTN and IP worlds and hosts the VoiceXML browser, speech hardware and software, and dialed number-to-URL mapping.
· Web Server. This is the server hosting a VoiceXML application in this network. By editing the MIME types supported by an HTTP server, VoiceXML can be delivered from any web server.
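As a rough illustration of the MIME-type point in the Web Server entry, the Python sketch below registers a media type for VoiceXML files using the standard `mimetypes` module. The media type `application/voicexml+xml` is the one registered for VoiceXML in RFC 4267; the `.vxml` extension and the helper name `content_type_for` are assumptions made for this example, and a production server (Apache, IIS, and so on) would be configured through its own settings instead.

```python
import mimetypes

# Map the (assumed) ".vxml" extension to the VoiceXML media type from RFC 4267.
mimetypes.add_type("application/voicexml+xml", ".vxml")


def content_type_for(path: str) -> str:
    """Pick a Content-Type header value for a file the web server is about to send."""
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or "application/octet-stream"


print(content_type_for("main.vxml"))
```

Any HTTP server that consults this mapping when building its response headers would then deliver `.vxml` files with a type the gateway can recognize.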
Here's a description of the networks involved in this network diagram:
· PSTN. Public Switched Telephone Network, also known as Plain Old Telephone Service (POTS). This is the telephone service most of us have in our homes, and it carries our speech and DTMF interactions, such as prompts played by the VoiceXML gateway and responses that the caller speaks.
· Internet. The Internet, for which we pay an Internet service provider (ISP) to provide access. It carries the gateway's request to the web server for VoiceXML and returns it to the gateway.
Note that this network has a
fairly traditional architecture that doesn't capture scenarios involving Voice
over Internet Protocol (VoIP).
To summarize important points about the network, the voice web consists of the PSTN, VoiceXML applications on the Internet, and a VoiceXML gateway between the Internet and the PSTN.
Fig 3: [figure; labels: VoiceXML Engine, Audio Recording]
The key component of any system built on VoiceXML is the voice browser. This is the software that renders the VoiceXML markup as a sequence of two-way dialogs between the system and the user. It consists of a core VoiceXML interpreter integrated with software components for text-to-speech (TTS) and audio file output, and for speech recording and recognition (that is, automatic speech recognition, or ASR). The voice browser is commonly referred to as the client, since it is here that the code is interpreted and “displayed”.
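To make the idea of "rendering" markup as a dialog more concrete, here is a minimal Python sketch of one dialog turn. It is a toy model under stated assumptions, not a real voice browser: the embedded document is a hypothetical, self-contained News/Weather/Sports menu, the grammar is treated as a flat list of single words, and the `utterance` argument stands in for whatever an ASR engine would return. The function names `load_field` and `run_turn` are invented for this example.

```python
import xml.etree.ElementTree as ET

VXML_NS = {"v": "http://www.w3.org/2001/vxml"}

# Hypothetical, self-contained sample document (an assumption for this sketch).
SAMPLE_VXML = """\
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form>
    <field name="selection">
      <prompt>Please choose News, Weather, or Sports.</prompt>
      <grammar type="application/x-nuance-gsl">[ news weather sports ]</grammar>
    </field>
  </form>
</vxml>
"""


def load_field(vxml_text):
    """Return (field name, prompt text, accepted words) for the first field."""
    root = ET.fromstring(vxml_text)
    field = root.find("v:form/v:field", VXML_NS)
    prompt = field.find("v:prompt", VXML_NS).text.strip()
    grammar = field.find("v:grammar", VXML_NS).text
    # Treat the alternation "[ a b c ]" as a flat list of single words.
    words = grammar.replace("[", " ").replace("]", " ").split()
    return field.get("name"), prompt, words


def run_turn(vxml_text, utterance):
    """Simulate one turn: the prompt goes to TTS, the utterance comes from ASR."""
    name, prompt, words = load_field(vxml_text)
    heard = utterance.strip().lower()
    # None means the input did not match; a real browser would reprompt.
    return {"prompt": prompt, name: heard if heard in words else None}


print(run_turn(SAMPLE_VXML, "Weather"))
```

The interpreter part is the parsing and field logic; in a real voice browser the prompt string would be handed to a TTS engine and the match against the grammar would be done by the recognizer itself.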
Let's look at a
simple VoiceXML document, shown below (the line numbers are not
part of the document). This document causes the
request, "Please choose News, Weather, or Sports," to be spoken to
the user. Then it accepts the user's response and passes the response to
another document (actually a server-side script named select.jsp) that presumably
provides the service that the user selected.
1. <?xml version="1.0"?>
2. <!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
   "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form>
    <field name="selection">
      <prompt>Please choose News, Weather, or Sports.</prompt>
      <grammar type="application/x-nuance-gsl">[ news weather sports ]</grammar>
    </field>