Pronexus Wpaper Practical Guide2QPO-10202006-2424


System Architectures for Speech

A Practical Guide to Lowering Costs

Andrew Kozminski

260 Terence Matthews Cr.

Kanata, Ontario, Canada, K2M 2C7

Ph: 877-766-3987 Ext. 549

[email protected]

First presented at

AVIOS SPEECH DEVELOPERS

CONFERENCE & EXPO

March 31st - April 3rd, 2003

February 2003

Disclaimer: This paper presents personal views and opinions of the author at this time, which are not binding on Pronexus and are subject to change.

A Powerful Voice in Communication Solutions



Table of Contents

1 EXECUTIVE SUMMARY

2 INTRODUCTION

3 WHY ISN'T SPEECH UBIQUITOUS?

3.1 COST OF COMPLEXITY

3.2 COST OF SOPHISTICATED BUT IMPERFECT TECHNOLOGY

3.3 COST OF CORE SPEECH TECHNOLOGY (LICENSES)

3.4 COST OF TUNING

4 ARCHITECTING FOR LOWER COST

4.1 SIMPLIFY THE COMPLEX

4.2 BREAKUP INTO MODULES

4.3 MAINTAIN VENDOR INDEPENDENCE

4.4 CONSERVE YOUR RESOURCES

4.5 DON'T SKIMP ON TUNING!

4.6 SELECT PLATFORM TO FIT YOUR APPLICATION

4.7 FALLBACK TO TOUCHTONE

5 CONCLUSION

6 ABOUT PRONEXUS

7 GLOSSARY



1 Executive Summary

Despite the great progress in speech technology over the past twenty years, speech applications are still not widely used by carriers and enterprises. Although there are various reasons for the limited deployment of speech applications, we believe that cost is the main barrier to wider adoption. The high cost of developing and deploying speech applications stems from several factors, including complexity, technology inaccuracy and licensing costs, as well as post-installation efforts.

Building a voice-enabled application is a complex task since it involves the use of multiple core technologies from multiple vendors. The complexity of the application design is also increased by the lower accuracy of speech recognition (compared to DTMF) and the lack of restriction on the user input, both of which require additional programming logic dedicated to compensating for errors and failures.

In addition to design complexities and technology inaccuracies, which require time and human resources, speech applications also necessitate monetary investments in expensive ASR and TTS licenses. Due to their complicated nature, speech applications require attention and tuning even after they are deployed (adding and deleting subscribers in grammars, for example), increasing the cost of maintenance.

With the above in mind, is it possible to design a lower cost speech application? There are several guidelines that can help developers lower system costs and design for the best cost-functionality balance.

Since building a modern speech application requires integration of components provided by different vendors, the most efficient way to create an application is to use a high-level Rapid Application Development (RAD) tool. RAD tools hide low-level complexity and provide a uniform development environment for multi-vendor components. In addition, the use of a development environment will help developers maintain vendor independence when it comes to speech resources and telephony hardware.

Development and deployment costs can be reduced by using a modular architecture. Among other things, a modular architecture allows for independent provisioning and software hot-swaps, eliminating costly downtimes and providing increased reliability, while enabling resource sharing. Using royalty-free engines, proper engineering of the license manager, and floating licenses can lower the cost of ASR and TTS licenses.

These guidelines provide a framework for lowering the cost of speech-enabled applications. The remainder of this document describes them in more detail.

2 Introduction

It's certainly no secret that the last few years have brought some dramatic changes to the high-tech landscape. The harsh economic reality tested many business formulas, technologies and products. Unfortunately, many of the once widely hyped ideas did not survive and all of us have learned an important lesson: at the end of the day, the only successful products/solutions are those that make money. In other words, it's all about cost, price and, above all, Return on Investment (ROI).

So, how can we build a compelling ROI case for speech technology? Of course the return depends on the business side of your application/project. Because this is a technical paper, we'll only briefly touch on this aspect. But the investment is mostly about technology: it's the cost of the building blocks that you use and the engineering and tuning hours that you spend. We'll take a closer look at this aspect, because the technical decisions made early in a project can have a dramatic impact on the final cost. No matter how much engineers hate this, more often than not, it is the cost that kills great product ideas.

With the above in mind, this paper will review different technologies, architectures and implementation strategies for building speech applications, with a special focus on costs versus functionality.

Throughout this paper we'll refer to the example of the Airport Assistant, a real-life speech-enabled application implemented by one of our partners. Airport Assistant allows callers to select by name one of several hundred airport services, request real-time flight information and be notified (at a provided cell number) about any changes in the schedule. Airport Assistant has been successfully deployed at the biggest airport in Canada - Toronto's Pearson International.

3 Why isn't speech ubiquitous?

First, let's put things in perspective by asking another question: how much is speech used today? According to Datamonitor, the total supply-side market for "voice business technologies and services" (meaning all systems employing TTS and ASR) is growing from $629 million in 2001 to $4.3 billion in 2007. That's a compound annual growth rate (CAGR) of 38%. It looks impressive, but only until you realize a key point: these are total numbers that must be further broken down into verticals. For example, if you are in the healthcare and pharmaceuticals business, your share is only 2% (again, a Datamonitor number), which for 2001 translated to $12.5 million worldwide. Obviously, this represents a low penetration by any stretch of the imagination.
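The growth figure above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are ours, and the inputs are the Datamonitor numbers quoted in the text.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Figures quoted in the text (Datamonitor).
market_2001 = 629e6   # USD
market_2007 = 4.3e9   # USD

growth = cagr(market_2001, market_2007, 2007 - 2001)

# Healthcare and pharmaceuticals share of the 2001 market (2%).
healthcare_2001 = 0.02 * market_2001

print(f"CAGR 2001-2007: {growth:.1%}")
print(f"Healthcare share in 2001: ${healthcare_2001 / 1e6:.2f} million")
```

The computed rate rounds to the 38% quoted above, and the 2% share works out to roughly $12.6 million, in line with the rounded $12.5 million figure.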

So why, despite almost twenty years of continuous improvements in technology, are speech applications still so slow to sell? Most analysts agree that the technology itself is "ready for prime time" and that customers are ready to accept its clear benefits. Where is the problem? Many different reasons have been cited, from numerous false starts of immature technology, to bad design, to cultural aversion against "talking to a machine". They all are true, but in our opinion, a major remaining barrier to wider adoption is cost. The remainder of this section discusses the main factors contributing to the high cost of speech applications today.

3.1 Cost of Complexity

Despite many years of great technological progress, building voice applications is still a complex task. It typically involves integration of multiple core technologies: ASR, TTS, speaker verification, speech object libraries, telephony hardware, call processing, web services and databases, all tied together with thousands of lines of custom code.

Although all vendors strive to position themselves as one-stop shopping, none are leaders in all areas, which forces system architects to pick and choose best-of-breed components from multiple suppliers. A classic example is multilingual TTS - no vendor has the best voice quality in all languages.

Of course, components coming from different vendors are not designed to work easily together, which makes implementation difficult and prolongs the learning curve. It also requires a team with a diverse and sophisticated skill set: telephony hardware, protocols, real-time programming and Web development, to name a few. Even today, experienced developers and voice system designers are not cheap and their salaries quickly add to the cost of a project.

3.2 Cost of Sophisticated but Imperfect Technology

When compared to touchtone, there is no doubt that speech technology allows for a much more effective and satisfying user experience. However, those benefits come at a price: building a good speech application requires a lot more effort invested in the human-machine interface (HMI). Touchtone-based user interfaces were easy to design: the user input was always limited to one out of ten digits and the DTMF detectors guaranteed almost 100% accuracy. With speech and natural language, the situation is vastly different. Not only is recognition never 100% accurate but also the user input is not restricted in any way. This puts a lot of extra burden on the application design and dramatically increases its complexity - it is not uncommon that more programming logic is dedicated to compensating for errors and failures (low accuracy, ambiguity, out-of-grammar vocabulary, and so on) than to the actual business features. As a result, designing a good speech interface is a difficult art requiring the interdisciplinary talents of computer science, linguistics and human factors psychology.
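To make the point about error-compensation logic concrete, here is a minimal sketch of the kind of decision code a speech dialog typically needs and a touchtone menu does not. Everything in it is an assumption for illustration: the `RecognitionResult` type, the confidence thresholds and the action names are hypothetical and do not correspond to any particular engine's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecognitionResult:
    """Hypothetical recognizer output, not a real engine's API."""
    text: Optional[str]   # None means nothing in-grammar was recognized
    confidence: float     # 0.0 .. 1.0

# Assumed thresholds; real values would come out of tuning.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.50
MAX_RETRIES = 2

def next_action(result: RecognitionResult, retries: int) -> str:
    """Decide what the dialog should do with one recognition attempt."""
    if result.text is None or result.confidence < CONFIRM_THRESHOLD:
        # Out-of-grammar or too uncertain: ask again, then give up on speech.
        if retries >= MAX_RETRIES:
            return "fallback_to_touchtone"
        return "reprompt"
    if result.confidence < ACCEPT_THRESHOLD:
        # Plausible but ambiguous: confirm with the caller ("Did you say ...?").
        return "confirm"
    return "accept"

print(next_action(RecognitionResult("parking", 0.93), retries=0))
print(next_action(RecognitionResult("parking", 0.60), retries=0))
print(next_action(RecognitionResult(None, 0.0), retries=2))
```

Even this toy version has three failure paths for a single prompt; with DTMF input the equivalent logic is essentially one digit comparison.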




3.3 Cost of Core Speech Technology (Licenses)

Today, the core speech technology is offered by a handful of vendors who have spent heavily on development of their products and are now trying (rightly so) to recoup and capitalize on their investment. Consequently, TTS and ASR licenses continue to be very expensive.

As an example, the ASR licenses (no TTS was used) for the Toronto Airport Assistant application, described in the introduction, originally accounted for almost 50% of the total system cost. The remaining 50% paid for everything else, including redundant hardware, development tools, database server, UPS and more. The cost of ASR licenses was later reduced to less than 23% by better license management that took advantage of the specific call patterns. This approach is discussed in more detail in the following sections.

With such high licensing costs, it is unfortunate that many commercial speech platforms seem to completely ignore the issue and continue to use licenses very ineffectively. It is not uncommon that one application port could require two or more ASR licenses, especially if multiple languages or "always active hot-words" are involved. Obviously, doubling or tripling the licensing cost has a dramatic impact on the end user price of the finished application.

3.4 Cost of Tuning

The high cost of speech applications doesn't end with the first installation. Even the best-designed systems require a lot of tuning before they can be put into full production. Furthermore, some applications (such as dial-by-name auto-attendants needing constant addition or deletion of subscribers in their grammars) require ongoing tuning throughout their life cycle. Tuning typically requires the attention of a computational linguist, who is still a rare and therefore hard-to-find professional. This is in sharp contrast to the traditional touchtone systems, which, once tested by a software QA team, would run virtually maintenance-free in production.

4 Architecting for Lower Cost

We've identified cost as one of the main barriers to the wider adoption of speech applications, in particular in mid-market environments. High up-front costs result in marginal ROI stories, and it doesn't matter how elegant or efficient the application is if no one buys it. Thus, it is important to design for the best cost-functionality balance.

Today, system architects and developers of speech-based telephony applications face many difficult choices regarding platforms, tools, speech technologies, and so on. The abundance of new standards only adds to the overall confusion. Of course, no single system architecture could meet all possible requirements, but at the same time, selecting the right architecture has a fundamental impact, especially on cost. This section offers a number of specific recommendations based on our real-life experiences.

4.1 Simplify the Complex

Modern speech applications are very rarely (if ever) built from scratch. Instead, multiple ready-to-go building blocks are put together, including ASR and TTS engines, object libraries, call processing frameworks and development tools. As a result, building a modern speech application has become a matter of integration. This has certainly simplified the task, but it hasn't made it easy - integration comes with problems of its own.

Typically, speech building blocks (cards or speech engines) come with native APIs, that is, low-level interfaces requiring programming in C++. But writing a speech application in C++ is not for the faint of heart. It may sound like a fun challenge to developers, but certainly not to a project manager who is responsible for budgeting and scheduling. Low-level details (such as call state machines, resource management, multi-threading, voice buffering, ActiveX, COM, and sockets) will very quickly distract developers from solving the actual business problem at hand. Learning APIs and telephony abstraction models from different vendors will dramatically extend the learning curve. Don't count on the telephony standards for help: unfortunately, despite multiple attempts (such as TAPI, SAPI, TSAPI, and S.100), there is no universal standard API for low-level telephony functionality. Adoption by vendors is random at best, and interoperability of particular implementations can never be taken for granted.

Can this complexity of native APIs be avoided? Absolutely - by leveraging the work done by others. In practice, this means using one of the high-level Rapid Application Development (RAD) tools. RAD tools hide low-level complexity and abstract the mishmash of multi-vendor components into a uniform development environment, focused on the business logic, not technology. Some RAD tools reduce the learning curve even more by leveraging the power of one of the industry-standard Integrated Development Environments (IDE), for example Microsoft Visual Studio, and popular development languages, such as Visual Basic or Java.

Properly selected development tools can save a lot of money. On the Toronto Airport Assistant project, both the developers and project manager claimed that the switch to an appropriate tool cut the delivery time by more than 50%.

Of course, not all RAD tools are created equal and they should be carefully selected for a specific project and its development team. Ideally, you should look for a tool that combines a high-level visual design environment with a flexible programming environment.

Visual Design - Most RAD tools use drag-and-drop GUI interfaces, which increase programmer productivity and enhance structure and readability of the source code. However, when selecting a tool, consider inherent limitations of visual design and programming. Ready-to-use building blocks work very well as long as all functionality is available. Some applications may benefit from the simplicity of this approach; however, most practical applications require functionality that goes beyond ready-to-go functions. Sooner or later, you'll need to customize blocks or integrate them with external third-party components. In other words, choose tools that combine the productivity gains of visual programming with a powerful programming environment.

Programming Environment - Building speech applications today is all about integration and customization. Any serious application requires some custom programming. The right development and debugging tools can save your project when you least expect it. Make absolutely sure that your tools support a serious, industry-standard programming language, source-level debugging, seamless invocation of component libraries (such as DCOM, ActiveX, and CORBA) and control over multithreading. If you're building an application on Windows, don't miss out on the benefits of the next-generation technology from Microsoft - your tool must support .NET!

As a typical example, one of our customers built a large outbound dialing application that depended on answering machine detection. The initial implementation used the original detection algorithm embedded into Dialogic cards. Unfortunately, statistical accuracy (which depends highly on the target calling area) couldn't be verified before field trials. When the first results came in, accuracy levels were around 80%, significantly below expectations. Because their RAD tool fully integrated into Microsoft Visual Studio, the programmers were able to devise a custom solution that boosted accuracy to 96%. This would be impossible without a flexible programming environment.

4.2 Breakup into Modules

The ability to break your speech application into cooperating modules is a must. Not only does it improve scalability, reliability and performance of your system, but it also saves you money in both development and production.

A modular system is cheaper to build and maintain. In development, programmers benefit from working in parallel on well-defined modules. In production, independent module provisioning and software hot-swaps eliminate costly system downtimes. At the same time, separating application logic from telephony and speech processing allows resource sharing, which in turn leads to more efficient utilization. Finally, distributing your modules across a LAN enables load balancing and effortless scalability - again resulting in savings on system maintenance.




The biggest benefit, however, comes from increased reliability of a modular system. Nothing is more frustrating to callers than a system that crashes into "dead silence" in the middle of a transaction. An unreliable system will soon be pulled out of production, which always means significant financial losses.

A monolithic executable is only as reliable as its weakest component, while a modular system can stay operational even after losing one of its modules. Therefore, it is very important that application modules execute properly separated from each other and from the system processes, so that a fatal error in one doesn't bring down the whole system. The modules should run out-of-process, or even better, distributed across a LAN. Ideally, modules should be compiled directly into stand-alone executables, not into intermediate scripts or p-code. Not only does this speed up program execution, it also removes the dependency on a shared runtime engine as a single point of failure.
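The out-of-process idea can be sketched in a few lines. The example below is a toy illustration under stated assumptions, not a production supervisor: the module names and commands are hypothetical, each "module" is simulated by a short child process, and a real system would run long-lived executables and monitor them continuously.

```python
import subprocess
import sys

# Hypothetical modules, each started as a separate OS process.
MODULES = {
    "flight_info": [sys.executable, "-c", "print('flight_info ok')"],
    # Simulated fatal error: the child exits with a non-zero status.
    "notifier": [sys.executable, "-c", "import sys; sys.exit(3)"],
}

def run_module(name, max_restarts=2):
    """Run one module out-of-process and restart it if it crashes."""
    attempts = 0
    while True:
        attempts += 1
        result = subprocess.run(MODULES[name], capture_output=True, text=True)
        if result.returncode == 0:
            return {"ok": True, "attempts": attempts}
        if attempts > max_restarts:
            # Give up on this module only; the others are unaffected.
            return {"ok": False, "attempts": attempts}

def run_all():
    """Supervise every module independently and report its status."""
    return {name: run_module(name) for name in MODULES}

status = run_all()
for name, info in status.items():
    print(name, info)
```

Because each module lives in its own process, the simulated crash in `notifier` is contained: the supervisor retries it and reports the failure, while `flight_info` completes normally.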

4.3 Maintain Vendor Independence

From a cost perspective, vendor-independence comes into play with respect to speech resources and telephony hardware.

Speech Resources - Selecting the right speech components for your application is very important. Speech technology is still complex and very expensive, but the quality and accuracy of the chosen engines could ultimately make the difference between success and failure of your project.

In practice, when it comes to speech processing, the "one size fits all" approach does not work. This is true for ASR, but even more so for TTS - not all engines and languages are equal. Therefore, it pays to carefully research and evaluate different vendors before deciding on a TTS product. Remember that perception of quality is subjective and may depend on your audience and application. As a result you may end up working with more than one vendor at a time and changing vendors as your application evolves. Unfortunately, this may require re-doing the integration work many times, resulting in additional cost. If you're lucky, the engine of your choice may support a standard API (like SAPI), but not all do. And even if it does support SAPI, these interfaces are often far from perfect. Fine variations in timing, buffering schemes and performance can result in irritating gaps, clicks and delays. Usually, better results are achieved through individually crafted, native APIs, but this again means additional development costs.

The best way to achieve vendor-independence is to use a middleware abstraction layer, which in turn works with a number of alternative engines. Again, a RAD tool is appropriate: it will protect your investment in application development should a shift in requirements, technology or vendor strategies necessitate the move to another engine. It will also allow you to experiment with multiple speech products to find the best price-quality balance for your application.
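As a rough illustration of what such an abstraction layer looks like from the application's side, consider the sketch below. All class and function names are hypothetical, and the two "vendor" adapters are stubs that return tagged bytes instead of calling a real engine's native API.

```python
from abc import ABC, abstractmethod

class TtsEngine(ABC):
    """Uniform interface the application codes against (hypothetical)."""

    @abstractmethod
    def synthesize(self, text: str, language: str) -> bytes:
        """Return audio for `text`; here a placeholder byte string."""

class VendorATts(TtsEngine):
    # Stub adapter: a real one would wrap vendor A's native API.
    def synthesize(self, text, language):
        return f"A[{language}]:{text}".encode("utf-8")

class VendorBTts(TtsEngine):
    # Stub adapter: a real one would wrap vendor B's native API.
    def synthesize(self, text, language):
        return f"B[{language}]:{text}".encode("utf-8")

# Per-language engine choice lives in configuration, not in business logic.
ENGINES = {"en": VendorATts(), "fr": VendorBTts()}

def speak(text, language, engines=ENGINES):
    """Application-level call; unaware of which vendor is behind it."""
    return engines[language].synthesize(text, language)

print(speak("Gate changed", "en"))
print(speak("Porte modifiée", "fr"))
```

Switching the French voice to another vendor then means changing one entry in `ENGINES`; the code that calls `speak` stays untouched, which is the investment protection described above.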

Telephony Hardware - Similar to speech resources, vendor-independence of telephony hardware can save you money. It is not uncommon for speech applications to run on more than one brand of hardware or to switch vendors for better pricing. In general, the smaller telephony hardware suppliers tend to be less expensive and much more accommodating when it comes to technical support plans, and if you're new to computer telephony you will definitely require support. On the other hand, smaller vendors may not support all cards and protocols. The most popular ones (analog, T1/E1, ISDN-PRI, H.323, R2-CAS, etc.) are a must. However, it's also worth paying attention to the less obvious capabilities, such as SIP, PBX set-emulator cards and transfer-on-CO protocols like TBCT, RLT and Q.SIG. Again, make sure that your middleware framework allows experimenting with different cards and protocols from multiple vendors, and keep in mind that your requirements may change with time.

PBX Integration - Most analysts predict that call centers will account for the biggest slice of the "speech market pie" in the coming years. If your organization is engaged in the call-center market, you know that developing applications for just one PBX brand is not enough. Given that speech is so much more expensive than touchtone, PBX vendor-independence is becoming more important than ever. Again, make sure that your development tools or middleware offer a good PBX integration story.




What does this mean in practical terms? Today, many PBX vendors are changing their traditional proprietary architectures and have begun opening up to third-party applications. Since PBXs are built for the enterprise (with its strong Microsoft presence), this trend is most visible in Windows, where in recent years we've observed a renewed interest in TAPI as the integration technology of choice. As a result, you may expect a complete and well-tested TAPI Service Provider for almost any switch, especially for modern IP-PBXs. In our opinion, building your speech application on TAPI is the best strategy for widening your customer base and consequently maximizing your ROI in the call-center market.

4.4 Conserve your Resources

As noted earlier, ASR and TTS licenses are the most expensive, yet also the most misused resources in speech applications. Some commercially available platforms use two or more ASR licenses per application port, particularly in multilingual or hot-word applications. Below we present a few practical guidelines for saving money:

Royalty Free Engines - Yes, they are available! Companies like Microsoft and Aculab offer license-free ASR and TTS technologies of high quality that may be perfect for your application. One word of caution: customers may accept a lower quality TTS (as long as the message is understandable), but they have much less tolerance for imperfect ASR. From our experience, speech recognition has to work close to perfectly, or it will be deemed useless and dropped. In other words, carefully evaluate your ASR alternatives.

As with the other elements of your application, maintaining vendor-independence works to your advantage, allowing you, for example, to experiment with free engines before deciding to spend your dollars on licenses. (As a side note, none of the speech vendors that we know of accepts returns of purchased licenses.) Again, picking a middleware framework that supports both free and commercial engines is, in our opinion, the best strategy.

One license per channel - If you decide to use a commercial speech engine from one of the industry leaders, invest some time to properly engineer your license manager - a lot of money can be saved by this effort. There is no technical reason to use more than one engine license (TTS or ASR) per application channel. Even systems using multiple languages or parallel grammars to implement hot-words can be designed to use one license per channel at any given time. Make sure that your middleware doesn't force you to unnecessarily double your resources.
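A minimal sketch of the one-license-per-channel rule follows. It is an illustration only: the `ChannelLicenseManager` class is hypothetical, and it assumes that the engine lets a channel re-target the license it already holds to another language or grammar, which has to be verified for any specific product.

```python
class LicenseError(RuntimeError):
    """Raised when no license is free for a new channel."""

class ChannelLicenseManager:
    """Hypothetical manager that never holds more than one license per channel."""

    def __init__(self, total_licenses):
        self.total = total_licenses
        self.held = {}   # channel id -> language currently loaded

    def activate(self, channel, language):
        """Take a license for a new channel, or re-target the one already held."""
        if channel not in self.held and len(self.held) >= self.total:
            raise LicenseError("no free license")
        # Switching language re-uses the channel's existing license.
        self.held[channel] = language

    def in_use(self):
        return len(self.held)

manager = ChannelLicenseManager(total_licenses=2)
manager.activate(channel=1, language="en")
manager.activate(channel=1, language="fr")   # language switch, still 1 license
manager.activate(channel=2, language="en")
print(manager.in_use())
```

The design choice is that the license is keyed by channel rather than by language or grammar, so a bilingual call consumes the same single license as a monolingual one.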

Floating licenses - You should also keep in mind that many applications don't require ASR and TTS for the whole duration of a call. As an example, consider a pre-paid calling-card system. It uses speech recognition to identify callers, checks account balances and then bridges calls to outbound trunks. In this scenario, speech resources (ASR and TTS) are only required for a small fraction of a call, possibly as low as 10%. Once a call is bridged, the resources can be redeployed to serve other channels - this presents a great savings opportunity. If licenses could float dynamically between channels, in theory the savings could be as high as 90%. Unfortunately, not many platforms allow floating speech resources, but some systems do. Given the savings, it pays to ensure that the tool you choose allows for proper license management to take advantage of the specific calling patterns in your application.
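The floating-license idea can be sketched as a small shared pool plus a back-of-the-envelope sizing. The class and function names are hypothetical, and the sizing is an average only: it assumes all channels are busy and ignores peaks, so a real deployment would need headroom on top of it.

```python
import math

class FloatingLicensePool:
    """Hypothetical pool of speech licenses shared by all channels."""

    def __init__(self, size):
        self.size = size
        self.in_use = set()

    def acquire(self, channel):
        """Borrow a license for the speech phase of a call."""
        if channel in self.in_use:
            return True
        if len(self.in_use) >= self.size:
            return False   # caller must wait or fall back to touchtone
        self.in_use.add(channel)
        return True

    def release(self, channel):
        """Return the license, e.g. once the call has been bridged."""
        self.in_use.discard(channel)

def average_licenses_needed(channels, speech_fraction):
    """Average number of licenses busy if every channel is in a call."""
    return math.ceil(channels * speech_fraction)

pool = FloatingLicensePool(size=1)
print(pool.acquire(channel=1))   # speech phase of call 1
print(pool.acquire(channel=2))   # refused while call 1 still needs speech
pool.release(channel=1)          # call 1 bridged to an outbound trunk
print(pool.acquire(channel=2))   # license now floats to call 2

print(average_licenses_needed(channels=48, speech_fraction=0.10))
```

With 48 channels and speech needed for 10% of each call, the average demand is about 5 licenses instead of 48, which is where the theoretical savings of up to 90% come from.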

4.5 Don't Skimp on Tuning!

Any non-trivial speech application requires extensive testing and tuning, much more than a traditional touchtone system. This aspect is new, and often comes as a surprise to designers coming from an old IVR background.

Tuning is much more than just tweaking grammars. It is an iterative process of analyzing system performance and repeatedly applying the best design practices in order to arrive at the most satisfying user experience and to work around technology imperfections. As a result, the tuning phase can take many months and requires an interdisciplinary team of professionals, including not only developers and testers but also experts in linguistics and often psychology. The resulting cost is substantial, but in our experience this is money well spent.

Planning for tuning is difficult, because speech systems, unlike touchtone, are highly dependent on the demographics, local accents, language mix and even the culture of the target audience. Some applications, such as speech-enabled auto-attendants, may require regular, on-going tuning as the grammar (i.e. a list of employee names) changes over time. Typically, end users are not able and should not be expected to perform tuning themselves and the application should be specifically designed for on-going, remote maintenance by the vendor.

Tuning large grammars, such as a city's phone book, tends to be particularly challenging and should be approached with special caution. The experienced speech technology providers seem to be well aware of the possible problems: one highly recognized vendor would sell us a ready-to-use grammar containing tens of thousands of names, but would not venture into signing a contract to get it working in the field.

Unfortunately, saving on tuning is not easy and may jeopardize the final quality of your product. Some savings may be achieved by employing off-the-shelf speech component libraries such as Nuance Speech Objects or SpeechWorks Dialog Modules. But in general, tuning is not the area in which to be penny-pinching. In speech recognition applications, users typically have very little patience for shaky technology. The system has to be almost perfect or it will not be used. There is no middle ground.

4.6 Select Platform to Fit Your Application

Over the last few years, we've observed two promising trends impacting telephony and speech applications: open source operating systems (mainly Linux) and XML-based scripting languages (mainly VoiceXML and SALT). But before you bet your budget, take a careful look at the cost - the bottom-line ROI of your application is the criterion of success.

Operating System - The choice of operating system fundamentally impacts many aspects of a speech application. Today, there are three main choices: Unix (mainly Solaris), Linux or Windows. While discussing the merits of each OS is beyond the scope of this paper, we will discuss some important considerations specific to telephony and speech.

First, keep your target market in mind. The old bias against Windows still holds strong in some traditional telephony environments, especially among carriers in North America. Even recently, we've seen an already completed application being ported to Solaris after approaching carriers with a Windows version. However, other regions of the world regard Windows much more favorably. Even in North America, the situation is much different in the enterprise, where Windows naturally fits into the desktop and business back-end dominated by Microsoft.

The most widely quoted complaints against Windows are reliability and price. We believe that this continued bias is no longer justified - the modern Windows is reliable enough for speech applications, both for carriers and enterprise. As for the price, Windows compares favorably to Unix, and while Linux is free, the price of the OS alone is often almost negligible, especially for systems deployed in small numbers. For example, in the case of the Toronto Airport Assistant (48 lines, Nuance), the price of the Windows operating system was not even 1% of the total cost.

Therefore, the price of the operating system is secondary to the availability of strong development tools, component libraries and middleware. The resulting increase in developer productivity has the potential to far outweigh the savings on the purchase price of the operating system.

Open Standards for Speech - Recent years have brought multiple exciting initiatives to standardize the development of speech applications, including the well-known VoiceXML, SALT, X+V, and CCXML. A discussion of their respective technical merits is beyond the scope of this paper, but there is a wealth of relevant information available from many sources. Similarly, we will not attempt to speculate which standard will ultimately prevail in the future. Instead, we will point out a few less obvious aspects that may impact your immediate strategy today. We will focus on VoiceXML, as it is the only standard in deployment today. The fundamental question is: will your project benefit from using the standard or would you be better off with a proprietary system? Unfortunately, the answer is not always straightforward. Below we present a few ideas to consider.




First, try to articulate the exact benefits of VoiceXML for your particular application. Next, look at the cost of your respective choices. Request quotes for equivalent VoiceXML and proprietary platforms and analyze them carefully. You will most likely find VoiceXML environments to be substantially more expensive. We received a quote for a typical 24-port VoiceXML Gateway (including hardware, but without ASR or TTS licenses) for $1265 per port. You can build an equivalent system using one of the popular proprietary RAD tools at half this cost.
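To make the comparison concrete, the quoted figures can be worked through in a few lines. This is only a sketch of the arithmetic: the port count and per-port price come from the quote above, the 0.5 ratio encodes "half this cost", and the function and variable names are our own illustration. ASR and TTS licenses are excluded, as in the quote.

```python
# Figures taken from the quote in the text; names are illustrative only.
PORTS = 24
VOICEXML_PRICE_PER_PORT = 1265  # USD, hardware included, no ASR/TTS licenses
RAD_COST_RATIO = 0.5            # "half this cost", as stated in the text


def platform_cost(ports: int, price_per_port: float) -> float:
    """Total platform cost for a system with the given number of ports."""
    return ports * price_per_port


voicexml_total = platform_cost(PORTS, VOICEXML_PRICE_PER_PORT)
rad_total = voicexml_total * RAD_COST_RATIO
savings = voicexml_total - rad_total

print(f"VoiceXML gateway: ${voicexml_total:,.0f}")
print(f"Proprietary RAD:  ${rad_total:,.0f}")
print(f"Difference:       ${savings:,.0f}")
```

Running the same calculation on your own quotes, with license costs added back in, gives a quick first check on which platform fits the budget.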

Secondly, again consider the productivity of your programmers. VoiceXML, a complex language in itself, is not enough to build an application. You still need scripts (CGI, Java, ASP, JavaScript or VBScript) to implement your business logic and then you need to host it all on a web server. The resulting environment is complex and does not offer a ready-to-use framework specific to telephony. Consequently, you'll need to think about multithreading your telephony channels, dealing with call state machines and synchronizing access to the global data structures, just to name a few.
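To give a feel for the plumbing mentioned above, here is a minimal sketch of one thread per telephony channel, a small call state machine, and a lock around shared data. Everything in it is hypothetical: the states, events and class names are simplified illustrations, not the API of any telephony board or gateway.

```python
import threading
from enum import Enum, auto


class CallState(Enum):
    IDLE = auto()
    RINGING = auto()
    CONNECTED = auto()
    DISCONNECTED = auto()


# Allowed transitions for one channel (hypothetical and much simplified).
TRANSITIONS = {
    (CallState.IDLE, "incoming"): CallState.RINGING,
    (CallState.RINGING, "answer"): CallState.CONNECTED,
    (CallState.RINGING, "hangup"): CallState.DISCONNECTED,
    (CallState.CONNECTED, "hangup"): CallState.DISCONNECTED,
    (CallState.DISCONNECTED, "reset"): CallState.IDLE,
}


class CallStats:
    """Global data shared by all channels; every update takes the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.completed_calls = 0

    def record_completed(self):
        with self._lock:
            self.completed_calls += 1


class Channel:
    """One telephony port with its own call state."""

    def __init__(self, port, stats):
        self.port = port
        self.state = CallState.IDLE
        self.stats = stats

    def handle(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"port {self.port}: {event!r} invalid in {self.state.name}")
        self.state = TRANSITIONS[key]
        if self.state is CallState.DISCONNECTED:
            self.stats.record_completed()


def run_channel(channel, events):
    for event in events:
        channel.handle(event)


def simulate(ports=24, calls_per_port=10):
    """Run one thread per port, each replaying a scripted sequence of calls."""
    stats = CallStats()
    script = ["incoming", "answer", "hangup", "reset"] * calls_per_port
    threads = [
        threading.Thread(target=run_channel, args=(Channel(p, stats), script))
        for p in range(ports)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats.completed_calls
```

Even this toy version needs explicit decisions about invalid events and shared counters. A telephony-specific framework makes those decisions for you, which is where much of the productivity difference comes from.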

Debugging your application is another important consideration. We don't know of any VoiceXML Gateway that offers an IDE supporting the complete environment. Even for VoiceXML alone, the development tools are in short supply. Furthermore, most tools are merely GUI overlays on top of VoiceXML syntax, good only for creating static pages (as opposed to dynamic pages, which are generated on-the-fly from database queries and program logic). Therefore, before committing to a VoiceXML gateway, ask about source-level debugging, handling of call state machines, multithreading of application code, accessing databases and other basic programming tasks. Our point is that today's VoiceXML development environment is still primitive when compared to the industry-standard IDEs, like Microsoft Visual Studio.
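The difference between static and dynamic pages is easier to see in code. The sketch below generates a VoiceXML 2.0-style page on the fly from a list of rows. The rows stand in for the result of a database query, and the function name, form id and prompt wording are assumptions made up for illustration. Only the element names (vxml, form, block, prompt) and the namespace come from the VoiceXML specification.

```python
import xml.etree.ElementTree as ET

VXML_NAMESPACE = "http://www.w3.org/2001/vxml"


def build_balance_page(account_rows):
    """Build a VoiceXML page that reads out one prompt per account row.

    account_rows is a list of (account_name, balance) tuples, standing in
    for the rows a real application would fetch from its database.
    """
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NAMESPACE})
    form = ET.SubElement(vxml, "form", {"id": "balances"})
    block = ET.SubElement(form, "block")

    if not account_rows:
        ET.SubElement(block, "prompt").text = "You have no accounts on file."

    for name, balance in account_rows:
        prompt = ET.SubElement(block, "prompt")
        prompt.text = f"Your {name} balance is {balance:.2f} dollars."

    # Serialize to a string that a web server could return to the gateway.
    return ET.tostring(vxml, encoding="unicode")


page = build_balance_page([("chequing", 120.5), ("savings", 2300.0)])
print(page)
```

A GUI tool that only edits VoiceXML markup cannot express the loop or the empty-result branch. That logic lives in the server-side script, which is exactly the part most VoiceXML tools leave you to debug on your own.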

Finally, take a look at the planned functionality of your application. VoiceXML, by definition, is limited by its own specification. VoiceXML has been designed specifically for speech-based user dialogs and that's where it excels. If your application is about call control, then beware: your only hope will be proprietary extensions, which in turn tie you to a specific platform and negate the benefits of vendor-independence and application portability.

So, are we advocating ignoring open standards? No, to the contrary, we strongly believe in the value of open standards and their future wide acceptance. VoiceXML will steadily gain popularity, especially once CCXML addresses the current shortcomings in call control, once the platform prices are reduced and once better developer tools emerge. Similarly, SALT offers a great future promise because of its tight integration with the Microsoft environment (including rich development tools). Until this happens, however, a proper RAD tool can get the job done much quicker and cheaper.

Our recommendation: don't be afraid of using proprietary environments if they are a better fit for your application and especially when you can realize significant savings. However, make sure that your tools have a well-defined migration path, should the open-standard market develop for your application in the future. In other words, ensure that your tools either support VoiceXML or are properly integrated with VoiceXML products. You could even consider a hybrid solution to combine the best of both worlds. For example, a front-end node built on a proprietary system could handle the heavy-duty call processing and call a VoiceXML gateway to execute best-of-breed third-party VoiceXML components and applications.

4.7 Fallback to Touchtone

Yes indeed, this is the last resort. For the record, we truly believe in the superiority of speech recognition. But don't discard the old touchtone just yet. At the end of the day, reverting to touchtone may cut the cost enough to get your budget approved. Whether we like it or not, many of our customers today opt to save through touchtone, and the fact remains that some applications won't benefit significantly from speech recognition.

One possible cost-saving strategy is again a hybrid solution, where speech recognition is applied selectively to the areas that bring the most benefit to the application. For example, an auto-attendant and voice mail system may be speech-enabled on the customer-facing side and touchtone for the employees.




Make sure that the tools and middleware that you buy are as good for traditional IVR as they are for sophisticated speech applications. Touchtone technology is not going away any time soon.

5 Conclusion

Speech today is ready for prime time. Thanks to reliable, accurate and commercially available speech engines, many compelling applications have become possible, and many have been implemented already. At the same time, we continue to see customers walking away from great speech applications and settling for the old-style touchtone solutions. The primary culprit is cost - in our view the most important factor barring speech from wider acceptance. Unfortunately, the cost will stay high as long as speech remains limited to a niche market. Our industry has yet to come up with a creative way to get out of this impasse.

Nevertheless, we believe that speech applications present opportunities for cost savings, even with today's high-priced licenses and platforms. This paper has presented a number of practical guidelines for lowering the cost of speech by properly selecting tools and technologies. We hope that applying these guidelines will help you to build a better business case for speech on your next project.

6 About Pronexus

With nearly a decade of experience and more than 3000 clients and partners around the world, Pronexus Inc. has established itself as a leader in Computer Telephony and Speech applications for wired and wireless environments. The company is the developer of the award-winning VBVoice, a Rapid Application Development tool for building business-critical CT and speech solutions. It also provides professional services for businesses requiring custom applications and develops OnCall, a line of turnkey business solutions for a variety of industries and applications. Comprehensive support services and acclaimed training complete the firm's offerings.




7 Glossary

API - Application Programming Interface
ASR - Automatic Speech Recognition
BRI - Basic Rate Interface
CAGR - Compound Annual Growth Rate
CO - Central Office
GUI - Graphical User Interface
HMI - Human-Machine Interface
IDE - Integrated Development Environment
ISDN - Integrated Services Digital Network
OS - Operating System
PBX - Private Branch Exchange
PRI - Primary Rate Interface
RAD - Rapid Application Development
RLT - Release Link Transfer
ROI - Return on Investment
TAPI - Telephony Application Programming Interface
TBCT - Two B-Channel Transfer
TTS - Text to Speech
VUI - Voice User Interface

