Master’s Project

IVR (Interactive Voice Response) Application - UB On Line

Submitted By: Thomas Philip

Student ID: 0516801

Project Guide: Prof. Ausif Mahmood

Department of Computer Science

University of Bridgeport

 

 

 

INDEX

  • Introduction
  • Advantages of IVR
  • Future Growth
  • Overview of Different Voice Technologies
    • Microsoft Speech Technology
    • VoiceXML
    • Features of VoiceXML
    • A Typical Application
    • Fig 1
    • Differences
  • Architecture of VoiceXML
    • Fig 2
    • Fig 3
  • Basics of VoiceXML
    • Basic Syntax
    • Header
    • Main Body
  • VoiceXML Development Environments
    • Development Environments
    • Voice Gateways
  • Voice Developer Environments at a Glance
  • Architecture
  • The Development Environment
  • Project Design
  • Code
  • Results
  • References

1. An Introduction to Speech Recognition

 

Have you ever talked to your computer? (And no, yelling at it when the Internet connection goes down, or making polite chit-chat with it while you wait for all 25 MB of that very important file to download, doesn't count.) Have you really talked to your computer, so that it actually recognized what you said and then did something as a result? If you have, then you've used a technology known as speech recognition.

VoiceXML takes speech recognition even further. Instead of talking to your computer, you're essentially talking to a web site, and you're doing this over the phone. This kind of system is sometimes known as interactive voice response (IVR).

OK, you say, well, what exactly is speech recognition? Simply put, it is the process of converting spoken input to text. Speech recognition is thus sometimes referred to as speech-to-text.

Speech recognition allows you to provide input to an application with your voice. Just as clicking with your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows you to provide input by speaking. For example, you might say something like "checking account balance", to which your bank's VoiceXML application replies "one million, two hundred twenty-eight thousand, six hundred ninety-eight dollars and thirty-seven cents."

Or, in response to hearing "Please say coffee, tea, or milk," you say "coffee" and the VoiceXML application you're calling tells you what the flavor of the day is and then asks if you'd like to place an order.

 

2. Advantages of IVR/Speech Recognition

 

 

  • Users may be mobile
  • No expensive device or software needed
  • One-dimensional interface (time): no information persistence makes dialog design a challenge
  • Wide variance in individual preference regarding directed vs. natural language interface
  • Interface suited to transactions rather than surfing
  • Session duration is several minutes

 

3. Future Growth

This is what I found in the latest news:

Internet powerhouse Yahoo! (Nasdaq: YHOO) announced a free Internet telephone service Tuesday that will allow users to access e-mail, news and other information from remote locations.

The company also unveiled a joint venture with Net2Phone, Inc. (Nasdaq: NTOP) to provide a service that allows its customers to make free calls from personal computers to telephones.

 

Microsoft Research Spawns a New Era in Speech Technology: Simpler, Faster, and Easier Speech Application Development

Microsoft is equally determined to move speech technology into the mainstream, making it a widespread industry. Kokanee is the codename for a major research and development effort at Microsoft designed to grow the speech industry. Its focus, the delivery of the .Net Speech Platform, will make speech-enabled application development and deployment simpler, faster, and easier. If Kokanee succeeds in stimulating the creation of killer speech applications, we may soon find ourselves talking with computers on a more frequent basis and actually enjoying the experience.

 

“By 2005 there will be 128 million users of speech applications with nearly 50% of those users considered as regular users of speech enabled applications” - The Kelsey Group

 

Speech Recognition Market To Exceed $5 Billion by 2008


Speech recognition has slowly been building its reputation as accuracy rates have improved and applications that meet users’ needs have been developed. Allied Business Intelligence (ABI) projects this market to increase to $897.8 million in 2003, up from $677 million in 2002. Over the longer term, the speech recognition market is forecast to grow to $5.3 billion by 2008.

 

 

4. Overview of Different Voice Technologies

 

SALT and VoiceXML, the latter backed by the VoiceXML Forum and the W3C Voice Browser working group, have at times been considered competitors, since SALT can also be used for telephony applications that are not multimodal.

 

a. Microsoft speech technology

SALT (Speech Application Language Tags) is an extension of HTML and other markup languages (cHTML, XHTML, WML) that adds a powerful speech interface to Web pages, while maintaining and leveraging all the advantages of the Web application model. These tags are designed to be used for both voice-only browsers (for example, a browser accessed over the telephone) and multimodal browsers.

SALT is a small set of XML elements, with associated attributes and DOM object properties, events, and methods, which may be used in conjunction with a source markup document to apply a speech interface to the source page. The SALT formalism and semantics are independent of the nature of the source document, so SALT can be used equally effectively within HTML and all its flavors, with WML, or with any other SGML-derived markup.

Multimodal access will enable users to interact with an application in a variety of ways: they will be able to input data using speech, a keyboard, keypad, mouse and/or stylus, and receive output as synthesized speech, audio, plain text, motion video, and/or graphics. Each of these modes can be used independently or concurrently.


The main top-level elements of SALT are:

  • <prompt ...>: for speech synthesis configuration and prompt playing
  • <listen ...>: for speech recognizer configuration, recognition execution and post-processing, and recording
  • <dtmf ...>: for configuration and control of DTMF collection
  • <smex ...>: for general-purpose communication with platform components

The input elements <listen> and <dtmf> also contain grammars and binding controls:

  • <grammar ...>: for specifying input grammar resources
  • <bind ...>: for processing of recognition results

<listen> also contains the facility to record audio input:

  • <record ...>: for recording audio input

A call control object is also provided for control of telephony functionality.

There are several advantages to using SALT with a mature display language such as HTML. Most notably:

  • The event and scripting models supported by visual browsers can be used by SALT applications to implement dialog flow and other forms of interaction processing without the need for extra markup.
  • The addition of speech capabilities to the visual page provides a simple and intuitive means of creating multimodal applications.

In this way, SALT is a lightweight specification that adds a powerful speech interface to Web pages while maintaining and leveraging all the advantages of the Web application model. SALT also provides DTMF and call control capabilities for telephony browsers running voice-only applications through a set of DOM object properties, methods, and events.

 

b. VoiceXML

VoiceXML, or the Voice Extensible Markup Language, is a markup language for writing voice-enabled IVR and web services and applications. VoiceXML is the 'HTML' of telephony-based speech applications. It hides the complexities of the telephony platform from developers and provides an easy way of developing feature-rich and media-rich speech applications. It uses speech recognition and DTMF for user input, and prerecorded audio and text-to-speech for output. VoiceXML was proposed by the VoiceXML Forum (http://www.voicexml.org) and is an international standard for writing telephony-based voice applications. The VoiceXML Forum is a group of about 500 companies worldwide and is still growing.

 

4.1 Features of VoiceXML

 

  • Application logic is separated from the voice interface. This has two main advantages:
    • Businesses can reuse their existing investments in web technologies and infrastructure.
    • Businesses can outsource the voice interface design and hosting while retaining full control over the application logic.
  • Because VoiceXML is an international standard, you can write the application once and run it anywhere.
  • VoiceXML is independent of the speech and telephony platform. This gives developers the flexibility to choose their platform.
  • VoiceXML is a simple markup language. Application developers can develop applications with ease, without worrying about the complexities of the platform.

 

 

4.2 A Typical Application

The following diagram shows a typical VoiceXML application at work:

Fig 1: A typical VoiceXML application at work

 

A user connects with your application by dialing the appropriate phone number. The VoiceXML interpreter answers the call and starts executing your VoiceXML document. Under the document's control, the interpreter may perform actions such as:

 

  • Sending vocal prompts, messages, or other audio material (such as music or sound effects) to the user.
  • Accepting numeric input that the user enters by DTMF (telephone key tone) signals.
  • Accepting voice input and recognizing the words.
  • Accepting voice input and simply recording it, without trying to recognize any words.
  • Sending the user's information to a web site or other Internet server.
  • Receiving information from the Internet and passing it to the user.

In addition, VoiceXML documents can perform programming functions such as arithmetic and text manipulation. This allows a document to check the validity of the user's input. Also, a user's session need not be a simple sequence that runs the same way every time. The document may include "if-then-else" decision-making and other complex structures.
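As a rough illustration of the validity checking and "if-then-else" branching described above, the following Python sketch mimics the kind of logic a document might apply to DTMF input. It is an analogue written in Python, not VoiceXML itself. The function name `next_step`, the four-digit PIN rule, the retry limit, and the step names are all assumptions made for this example.

```python
def next_step(dtmf_input: str, attempts: int, max_attempts: int = 3) -> str:
    """Decide what the dialog should do after the caller keys in a PIN.

    `attempts` is the number of failed tries before this one.
    """
    # Validity check: exactly four key presses, each one a digit 0-9.
    is_valid = len(dtmf_input) == 4 and all(c in "0123456789" for c in dtmf_input)

    if is_valid:
        return "submit"      # send the input on to the server
    elif attempts + 1 < max_attempts:
        return "reprompt"    # ask the caller to try again
    else:
        return "transfer"    # out of retries: hand the call off


if __name__ == "__main__":
    print(next_step("1234", attempts=0))  # valid input
    print(next_step("12#4", attempts=0))  # invalid, retries remain
    print(next_step("12", attempts=2))    # invalid, no retries left
```

The point is only that the session need not follow one fixed path: the same input step can lead to three different outcomes depending on what the caller entered and how often they have tried.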

 

4.3 Differences

There are substantial differences, however, between SALT and VoiceXML. The SALT specification defines a set of “lightweight” tags as extensions to commonly used Web-based programming languages, such as Java or ECMAScript, that are already well developed. It also uses the W3C standards it has in common with VoiceXML, along with some Internet standards from the Internet Engineering Task Force (IETF). This use of existing, well-tested standards and programming languages can accelerate the maturing of the new standard. VoiceXML, by contrast, is a programming language that does not require other programming languages. Another difference is that VoiceXML implements some functions at a higher level, particularly the “form” function for gathering specific information, which avoids the need to program that function explicitly. SALT operates at a lower level, which gives the programmer more control but, in its raw form, requires more effort in cases where the form function would have satisfied the application's needs.

A fundamental difference is that VoiceXML does not currently deal with multimodal interactions, while SALT was designed from the start to handle multimodal extensions. A practical difference is that VoiceXML is at version 2.0 and has many fielded deployments, whereas SALT 1.0 has just been introduced.
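To make the higher-level "form" function more concrete, here is a minimal Python sketch of the kind of loop a VoiceXML interpreter runs on the developer's behalf: pick the next field that is still unfilled, play its prompt, collect an answer, and repeat until the form is complete. With SALT, the page author would script comparable flow by hand. The `Field` class, `run_form`, and the `ask` callback are hypothetical names, and the loop is a simplification: the form interpretation algorithm in the VoiceXML specification also handles events, guard conditions, and other cases not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Field:
    name: str
    prompt: str
    choices: List[str]            # words this field's grammar accepts
    value: Optional[str] = None   # None means "not yet filled"


def run_form(fields: List[Field], ask: Callable[[str], str]) -> Dict[str, str]:
    """Fill every field in order, re-prompting on unrecognized input."""
    while True:
        # Select: the first field that still has no value.
        pending = next((f for f in fields if f.value is None), None)
        if pending is None:
            break  # every field is filled, so the form is complete
        # Collect: play the prompt and take the caller's answer.
        answer = ask(pending.prompt).strip().lower()
        # Process: accept the answer only if it matches the grammar.
        if answer in pending.choices:
            pending.value = answer
    return {f.name: f.value for f in fields}


if __name__ == "__main__":
    scripted = iter(["latte", "coffee", "large"])  # stands in for a caller
    form = [
        Field("drink", "Please say coffee, tea, or milk.", ["coffee", "tea", "milk"]),
        Field("size", "Small or large?", ["small", "large"]),
    ]
    print(run_form(form, lambda prompt: next(scripted)))
```

In the scripted run, the unrecognized answer "latte" simply causes the first prompt to be played again. Note that this sketch would loop forever if the caller never gave a valid answer; a real interpreter counts retries and raises events instead.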

 

5. Architecture of VoiceXML

VoiceXML builds on existing data networking standards, such as XML, HTTP, and TCP/IP; and on telephone standards in the Public Switched Telephone Network (PSTN) and Integrated Services Digital Network (ISDN).

The voice web consists of the PSTN, VoiceXML applications on the Internet, and a VoiceXML gateway between the Internet and the PSTN. The VoiceXML gateway hosts specialized hardware and software that enable voice browsing. Some of these resources, such as automatic speech recognition (ASR) and text-to-speech (TTS), may be located on separate network elements and accessed remotely.

In a voice browser session, the user's phone call goes over the PSTN to a VoiceXML gateway. Based on the number that the user dialed, the gateway downloads and possibly caches the corresponding VoiceXML application from the Internet. The gateway then steps through the VoiceXML, interacting with the user as defined in the application.
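The session flow above (look up the application by dialed number, download it, possibly cache it) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular gateway is built: the `VoiceGateway` class, its method names, the phone number, and the URL are all invented for illustration, and the `fetch` callback stands in for an HTTP download.

```python
from typing import Callable, Dict


class VoiceGateway:
    """Toy model of a gateway's dialed-number lookup and document cache."""

    def __init__(self, number_to_url: Dict[str, str], fetch: Callable[[str], str]):
        self.number_to_url = number_to_url  # dialed number -> application URL
        self.fetch = fetch                  # downloads a document given its URL
        self.cache: Dict[str, str] = {}     # URL -> previously fetched document

    def document_for_call(self, dialed_number: str) -> str:
        """Return the VoiceXML document for the number the caller dialed."""
        url = self.number_to_url[dialed_number]  # KeyError if the number is unknown
        if url not in self.cache:
            self.cache[url] = self.fetch(url)    # download only on a cache miss
        return self.cache[url]


if __name__ == "__main__":
    downloads = []

    def fake_fetch(url: str) -> str:
        # Stand-in for an HTTP GET; records each download so caching is visible.
        downloads.append(url)
        return '<vxml version="2.0"/>'

    gateway = VoiceGateway({"5550100": "http://example.com/ubonline.vxml"}, fake_fetch)
    gateway.document_for_call("5550100")
    gateway.document_for_call("5550100")
    print(len(downloads))  # the second call is served from the cache
```

A real gateway would also need to decide when a cached document has gone stale, typically by following the web server's HTTP caching headers; the sketch leaves that out.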

 

 

Fig 2: A typical voice network. The caller's telephone reaches the VoiceXML gateway over the PSTN, and the gateway reaches the web server over the Internet.

Following is a description of the elements involved in the network diagram:

 

  • Caller Telephone. The telephone that the caller uses to access a VoiceXML application. The figure illustrates a call over the PSTN.
  • VoiceXML Gateway. A gateway that bridges the PSTN and IP worlds and hosts the VoiceXML browser, speech hardware and software, and dialed number-to-URL mapping.
  • Web Server. This is the server hosting a VoiceXML application in this network. By editing the MIME types supported by an HTTP server, VoiceXML can be delivered from any web server.
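To illustrate the MIME type point in the Web Server entry, the sketch below registers a VoiceXML media type in Python's standard `mimetypes` table and then looks up the content type for a file. It assumes that VoiceXML files use the `.vxml` extension and are served as `application/voicexml+xml`, the media type registered for VoiceXML; the helper `content_type_for` and the file name are hypothetical. Other web servers have their own configuration mechanisms for the same mapping.

```python
import mimetypes

# Assumption: VoiceXML documents carry the ".vxml" extension and are served
# with the media type registered for VoiceXML.
VXML_MIME_TYPE = "application/voicexml+xml"
mimetypes.add_type(VXML_MIME_TYPE, ".vxml")


def content_type_for(path: str) -> str:
    """Return the Content-Type a simple static file server might send."""
    guessed, _encoding = mimetypes.guess_type(path)
    # Fall back to a generic binary type when the extension is unknown.
    return guessed or "application/octet-stream"


if __name__ == "__main__":
    print(content_type_for("menu.vxml"))
```

Once the extension is mapped, the server needs no other knowledge of VoiceXML: it delivers the document like any other static file, and the gateway does the interpreting.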

                         

Here's a description of the networks involved in this network diagram:

 

  • PSTN. Public Switched Telephone Network, also known as Plain Old Telephone Service (POTS). This is the telephone service most of us have in our homes, and it carries our speech and DTMF interactions, such as prompts played by the VoiceXML gateway and responses that the caller speaks.
  • Internet. The Internet, for which we pay an Internet service provider (ISP) to provide access. It carries the gateway's request for a VoiceXML document to the web server and returns the document to the gateway.

Note that this network has a fairly traditional architecture that doesn't capture scenarios involving Voice over Internet Protocol (VoIP).

To summarize important points about the network, the voice web consists of the PSTN, VoiceXML applications on the Internet, and a VoiceXML gateway between the Internet and the PSTN.

 

Fig 3: Voice browser components (diagram labels: VoiceXML Engine, Audio Recording)


The key component of any system built on VoiceXML is the voice browser. This is the software that renders the VoiceXML markup as a sequence of two-way dialogs between the system and the user. It consists of a core VoiceXML interpreter integrated with software components for text-to-speech (TTS) and audio file output, and for speech recording and recognition (that is, automatic speech recognition, or ASR). The voice browser is commonly referred to as the client, since it is here that the code is interpreted and “displayed”.
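The division of labor inside a voice browser can be sketched as follows: an interpreter reads the markup, hands prompt text to a TTS component, and checks what the ASR component heard against the field's grammar. This is a deliberately small Python sketch, not a real voice browser. The sample document, the function `render_field`, and the `speak` and `listen` callbacks are assumptions made for illustration, and the grammar is reduced to a plain space-separated word list rather than a real grammar format.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

VXML_NS = {"v": "http://www.w3.org/2001/vxml"}

# A tiny document invented for this sketch: one form with one field.
SAMPLE = """<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <field name="drink">
      <prompt>Please say coffee, tea, or milk.</prompt>
      <grammar>coffee tea milk</grammar>
    </field>
  </form>
</vxml>"""


def render_field(document: str,
                 speak: Callable[[str], None],
                 listen: Callable[[], str]) -> Optional[str]:
    """Interpret the first <field>: speak its prompt, then recognize a reply."""
    root = ET.fromstring(document)
    field = root.find("v:form/v:field", VXML_NS)
    prompt = field.findtext("v:prompt", default="", namespaces=VXML_NS).strip()
    words = field.findtext("v:grammar", default="", namespaces=VXML_NS).split()

    speak(prompt)                      # TTS component, supplied by the caller
    heard = listen().strip().lower()   # ASR component, supplied by the caller
    return heard if heard in words else None  # None models a "no match"


if __name__ == "__main__":
    # print() plays the role of TTS; a fixed string plays the role of ASR.
    result = render_field(SAMPLE, speak=print, listen=lambda: "Tea")
    print(result)
```

Passing the TTS and ASR components in as callbacks mirrors the architecture described above, where those resources may sit on separate network elements from the interpreter.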

 

6. Basics of VoiceXML

Let's look at a simple VoiceXML document (the line numbers are not part of the document). This document causes the request, "Please choose News, Weather, or Sports," to be spoken to the user. It then accepts the user's response and passes it to another document (actually a server-side script named select.jsp) that presumably provides the service that the user selected.

1.               <?xml version="1.0"?>

2.                <!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN"
   "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">

3.               <vxml version="2.0"
xmlns="http://www.w3.org/2001/vxml"
xml:lang="en-US">

4.               <form>

5.               <field name="selection">

6.               <prompt>

7.               Please choose News, Weather, or Sports.

8.               </prompt>

9.               <grammar type="application/x-nuance-gsl">

10.           [ news weather sports ]

11.           </grammar>

12.           </field>