Spoken Document Understanding and Organization

[The key to future efficient retrieval/browsing applications]

[Lin-shan Lee and Berlin Chen]

Speech is the primary and most convenient means of communication between individuals [1]. In the future network era, the digital content over the network will include all the information activities for human life, from real-time information to knowledge archives, from working environments to private services. Apparently, the most attractive form of the network content will be in multimedia, including speech information. Such speech information usually provides insight concerning the subjects, topics, and concepts of the multimedia content. As a result, the spoken documents associated with the network content will become key for retrieval and browsing.

On the other hand, the rapid development of network and wireless technologies is making it possible for people to access the network content not only from the office/home but from anywhere, at any time, via small handheld devices such as personal digital assistants (PDAs) or cell phones. Today, network access is primarily text based. The users enter instructions by words or texts, and the network or search engine offers text materials from which the user can select. The users interact with the network or search engine and obtain the desired information via text-based media. In the future, it can be imagined that almost all such functions of text can also be performed with speech. The user's instructions can be entered not only by text but possibly through speech as well, since speech is a convenient user interface for a variety of user terminals, especially small handheld devices. The network content may be indexed/retrieved and browsed not only by text but possibly also by the associated spoken documents as mentioned above. The users may also interact with the network or the search engine via either text-based media or spoken/multimodal dialogs. Text-to-speech synthesis can be used to transform the text information in the content into speech when required. This is the general environment of retrieval/browsing applications for multimedia content with associated spoken documents.

SPOKEN DOCUMENT UNDERSTANDING AND ORGANIZATION

When considering the above network content access environment, we must keep in mind that, unlike the written documents that are better structured with titles and paragraphs and thus easier to retrieve and browse, multimedia/spoken documents are merely video/audio signals, or a very long sequence of words including errors even if automatically transcribed. Examples include a three-hour video of a course lecture, a two-hour movie, or a one-hour news episode. In general, they are not segmented into paragraphs and no titles are given for the paragraphs. Thus, they are much more difficult to retrieve and browse because the user simply cannot browse through each from the beginning to the end. As a result, better approaches are required for the understanding and organization of spoken documents (or the associated multimedia content) for easier retrieval/browsing. This should include at least the following:

1) Named-entity extraction from spoken documents. Named entities (NEs) are usually the keywords in the spoken documents (or associated multimedia content). They are the key to understanding the subject matter of the documents, although they are usually difficult to extract from the spoken documents.

2) Spoken document segmentation. Spoken documents (or the associated multimedia content) are automatically segmented into short paragraphs, each with some central concept or topic.

3) Information extraction for spoken documents. This step involves automatically extracting the key information (such as who, when, where, what, and how) for the events described in the segmented short paragraphs. Very often, the relationships among the NEs in the paragraphs are extracted.

4) Spoken document summarization. This step involves automatically generating a summary (in text or speech form) for each segmented short paragraph of the spoken documents (or associated multimedia content).

5) Title generation for spoken documents. This step involves automatically generating a title (in text or speech form) for each short paragraph of the spoken documents (or the associated multimedia content).

6) Topic analysis and organization. This step involves automatically analyzing the subject topics of the segmented short paragraphs of the spoken documents (or the associated multimedia content), clustering them into groups with topic labels, constructing the relationships among the groups, and organizing them into some hierarchical visual presentation that is easier for users to browse.

When all the above tasks can be properly performed, the spoken documents (or associated multimedia content) are in fact better understood and reorganized in a way that retrieval/browsing can be performed easily. For example, they are now in the form of short paragraphs, properly organized in some hierarchical visual presentation with titles/summaries/topic labels as references for retrieval and browsing. The retrieval can be performed based on either the full content, the summaries/titles/topic labels, or both. In this article, this is referred to as spoken document understanding and organization for efficient retrieval/browsing applications.

Note that while some of the areas mentioned above have been studied or explored to a good extent, some of them have not at this point in time. Yet, most of the work has been performed independently within respective individual scopes and reported as separate analyses and results. Great efforts have been made to integrate several of these independent projects into a specific application domain, and in recent years several well-known research projects and systems have been successfully developed towards this goal. Examples include the Informedia System at Carnegie Mellon University [2], the Multimedia Document Retrieval Project at Cambridge University [3], the Rough'n'Ready System at BBN Technologies [4], the Speech Content-based Audio Navigator (SCAN) System at AT&T Labs-Research [5], the Broadcast News Navigator at MITRE Corporation [6], the SpeechBot Audio/Video Search System at Hewlett-Packard (HP) Labs [7], and the National Project of Spontaneous Speech Corpus and Processing Technology of Japan [8]. All of these successful projects and systems have evidenced the importance and potential of improved technologies towards the common goal of efficient retrieval/browsing for massive quantities of spoken documents and multimedia content.

The purpose of this article, however, is to present a concise, comprehensive, and integrated overview of these related areas (including relevant problems and issues, general principles, and basic approaches) in a unified context of spoken document understanding and organization for efficient retrieval/browsing applications. In addition, we present an initial prototype system we developed at National Taiwan University as a new example of integrating the various technologies and functionalities. Note that similar efforts can be made based on information of other media in the multimedia content, for example based on the text, graphic, or video information. Here we focus only on those based on the associated spoken documents. As mentioned previously, speech information is usually the key compared to other information. Also, note that speech understanding has existed as a research topic for a long time; statistical approaches to speech understanding have been reviewed in the article by Wang et al. [9]. Previously, speech understanding usually referred to understanding the speaker's intention in such applications as spoken dialogues within specific task domains (for example, for air travel reservation or conference registration). Here, the domains for network content can be arbitrary and almost unlimited. Therefore, the technologies needed will have to be domain independent and quite different from those used in spoken dialogues. Of course, for retrieval/browsing purposes, the required accuracy for understanding is also different from that required in applications such as spoken dialogs, where any error in understanding may directly lead to a wrong system response.

The majority of the approaches and technologies to be reviewed and discussed in this article were analyzed and verified using various broadcast news archives. This is apparently due to the availability of huge quantities of broadcast news archives and the fact that many of the statistical approaches mentioned here have to be based on huge quantities of corpora. Therefore, in this article, if not otherwise mentioned, broadcast news archives are taken as the default example spoken documents. Other types of spoken documents, for example technical presentations, will be mentioned when specific approaches for such tasks are discussed.

BRIEF REVIEW OF INFORMATION RETRIEVAL AND AUDIO INDEXING

Although this article is focused on spoken document understanding and organization, the purpose is to explore efficient retrieval and browsing. So we will start with a brief review of information retrieval (IR) and audio indexing. In the past two decades, most of the research efforts in IR were focused on text document retrieval, and the Text Retrieval Conference (TREC) evaluations [10] of the 1990s are good examples. In conventional text document retrieval, a collection of documents D = {d_i, i = 1, 2, ..., N} is to be retrieved by a user's query Q. The retrieval is based on a set of indexing terms specifying the semantics of the documents and the query, which are very often a set of keywords or all the words used in all the documents. The document retrieval problem can then be viewed as a clustering problem, i.e., selecting the documents out of the collection that are in the class relevant to the query Q. The documents are usually ranked by a retrieval model (or ranking algorithm) based on the relevance scores between each of the documents d_i and the query Q evaluated with the indexing terms (i.e., those documents on the top of the list are most likely to be relevant). The retrieval models are usually characterized by two different matching strategies, namely, literal term matching and concept matching. These two strategies are briefly reviewed below.

LITERAL TERM MATCHING

The vector space model (VSM) is the most popular example of literal term matching [11]. In VSM, every document d_i is represented as a vector \vec{d}_i. Each component w_{i,t} of this vector is a value associated with the statistics of a specific indexing term (or word) t, both within the document d_i and across all the documents in the collection D,

w_{i,t} = f_{i,t} \times \ln(N/N_t),    (1)

where f_{i,t} is the normalized term frequency (TF) for the term (or word) t in d_i, used to measure the intradocument weight for the term, while \ln(N/N_t) is the inverse document frequency (IDF), where N_t is the total number of documents in the collection that include the term (or word) t, and N is the total number of documents in the collection D. IDF measures the interdocument discriminativity of the term t, reflecting the fact that indexing terms appearing in more documents are less useful in identifying the relevant documents. The query Q is also represented by a vector \vec{Q} constructed in exactly the same way, i.e., with components w_{q,t} in exactly the same form as in (1). The cosine measure is then used to estimate the query-document relevance scores:

R(\vec{Q}, \vec{d}_i) = (\vec{Q} \cdot \vec{d}_i) / (\|\vec{Q}\| \cdot \|\vec{d}_i\|),    (2)

which apparently matches Q and d_i literally based on the terms. This model has been widely used because of its simplicity and satisfactory performance.
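
To make the VSM concrete, here is a minimal Python sketch (ours, not from the article) of TF-IDF weighting as in (1) and the cosine relevance score of (2); the toy documents, query, and term lists are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vector(doc_terms, df, n_docs):
    """TF-IDF weights as in (1): w_{i,t} = f_{i,t} * ln(N / N_t)."""
    counts = Counter(doc_terms)
    total = sum(counts.values())
    return {t: (c / total) * math.log(n_docs / df[t])
            for t, c in counts.items() if df.get(t, 0) > 0}

def cosine(u, v):
    """Cosine relevance score as in (2)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy collection: each document is a list of indexing terms.
docs = [["speech", "retrieval", "broadcast", "news"],
        ["image", "retrieval", "video"],
        ["speech", "recognition", "news"]]
df = Counter(t for d in docs for t in set(d))  # N_t per term
N = len(docs)

q_vec = tfidf_vector(["speech", "news"], df, N)
ranked = sorted(((cosine(q_vec, tfidf_vector(d, df, N)), i)
                 for i, d in enumerate(docs)), reverse=True)
print(ranked)  # documents ranked by relevance to the query
```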

The literal term matching can also be performed with probabilities, and the N-gram-based [12] and hidden Markov model (HMM)-based [13] approaches are good examples. In these models, each document d_i is interpreted as a generative model composed of a mixture of N-gram probability distributions for observing a query Q, while the query Q is considered as observations, expressed as a sequence of indexing terms (or words) Q = t_1 t_2 ... t_j ... t_n, where t_j is the jth indexing term in Q and n is the length of the query (as illustrated in Figure 1). The N-gram distributions for the terms t_j, for example P(t_j | d_i) and P(t_j | t_{j-1}, d_i) for unigrams and bigrams, are estimated from the document d_i and then linearly interpolated with the background unigram and bigram models estimated from a large outside text corpus C, P(t_j | C) and P(t_j | t_{j-1}, C). The relevance score for a document d_i and the query Q = t_1 t_2 ... t_j ... t_n can then be expressed as (with unigram and bigram models)

P(Q | d_i) = [m_1 P(t_1 | d_i) + m_2 P(t_1 | C)] \times \prod_{j=2}^{n} [m_1 P(t_j | d_i) + m_2 P(t_j | C) + m_3 P(t_j | t_{j-1}, d_i) + m_4 P(t_j | t_{j-1}, C)],    (3)


[FIG1] An illustration of the HMM-based retrieval model: the model for document d_i generates the query Q = t_1 t_2 ... t_j ... t_n by mixing P(t_j | d_i), P(t_j | C), P(t_j | t_{j-1}, d_i), and P(t_j | t_{j-1}, C) with weights m_1, ..., m_4.


which again matches Q and d_i based literally on the terms. The unigram and bigram probabilities, as well as the weighting parameters m_1, ..., m_4, can be further optimized. For example, they can be optimized by the expectation-maximization (EM) or minimum classification error (MCE) training algorithms, given a training set of query exemplars with the corresponding query-document relevance information [14].
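
The score of (3) is easy to compute once the document and corpus N-gram models are in hand. The following hedged sketch (not from the article) uses maximum-likelihood unigram/bigram estimates and fixed interpolation weights m_1, ..., m_4; in practice the weights would be trained by EM or MCE as noted above, and the toy data are invented.

```python
import math
from collections import Counter

def ngram_models(terms):
    """Maximum-likelihood unigram and bigram estimates from a term sequence."""
    uni = Counter(terms)
    bi = Counter(zip(terms, terms[1:]))
    total = len(terms)
    p_uni = {t: c / total for t, c in uni.items()}
    p_bi = {(a, b): c / uni[a] for (a, b), c in bi.items()}
    return p_uni, p_bi

def relevance(query, doc_terms, corpus_terms, m=(0.4, 0.3, 0.2, 0.1)):
    """Log of the relevance score P(Q|d_i) of (3), interpolating the
    document model with the background (corpus) unigram/bigram models."""
    m1, m2, m3, m4 = m
    pd_uni, pd_bi = ngram_models(doc_terms)
    pc_uni, pc_bi = ngram_models(corpus_terms)
    score = math.log(m1 * pd_uni.get(query[0], 0.0)
                     + m2 * pc_uni.get(query[0], 1e-9))
    for prev, t in zip(query, query[1:]):
        score += math.log(m1 * pd_uni.get(t, 0.0)
                          + m2 * pc_uni.get(t, 1e-9)
                          + m3 * pd_bi.get((prev, t), 0.0)
                          + m4 * pc_bi.get((prev, t), 0.0))
    return score

corpus = "speech retrieval news speech recognition video news".split()
doc = "speech recognition for broadcast news".split()
print(relevance("speech news".split(), doc, corpus))
```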

CONCEPT MATCHING

Both approaches mentioned above are based on matching the terms (or words) and thus very often suffer from the problem of word-usage diversity (or vocabulary mismatch) when the query and its relevant documents use quite different sets of words. In contrast, the concept matching strategy tries to discover the latent topical information inherent in the query and documents, based on which the retrieval is performed; the latent semantic indexing (LSI) model is a good example [15]. LSI starts with a "term-document" matrix W, describing the intra- and interdocument statistical relationships between all the terms and all the documents in the collection D, in which each term t is characterized by a row vector and each document d_i in D is characterized by a column vector of W. Singular value decomposition (SVD) is then performed on the matrix W to project all the term and document vectors onto a single latent semantic space with significantly reduced dimensionality R:

W \approx \hat{W} = U \Sigma V^T,    (4)

where \hat{W} is the rank-R approximation to the term-document matrix W, U is the left singular matrix, \Sigma is the R × R diagonal matrix of the R largest singular values, V is the right singular matrix, and T denotes matrix transposition. In this way, the row/column vectors representing the terms/documents in the original matrix W can all be mapped to vectors in the same latent semantic space with dimensionality R. As shown in Figure 2, in this latent semantic space, each dimension is defined by a singular vector and represents some kind of latent semantic concept. Each term t and each document d_i can now be properly represented in this space, with components in each dimension related to the weights of the term t and document d_i with respect to the dimension, or the associated latent semantic concept. The query Q or other documents that are not represented in the original analysis can be folded in, i.e., similarly represented in this space via some simple matrix operations. In this way, indexing terms describing related concepts will be near each other in the latent semantic space even if they never co-occur in the same document, and documents describing related concepts will be near each other in the latent semantic space even if they never use the same set of words. Thus, this is concept matching rather than literal term matching. The relevance score between the query Q and a document d_i is estimated by computing the cosine measure between the corresponding vectors in this latent semantic space. A more detailed elucidation of LSI and its applications can be found in [16].
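
A small numerical sketch of LSI follows (invented data, not from the article): truncated SVD of a toy term-document matrix, folding a query into the latent space via q_hat = q^T U_R Σ_R^{-1} (one common fold-in formula; a modeling choice), and cosine ranking there.

```python
import numpy as np

# Toy term-document matrix W: rows = terms, columns = documents.
terms = ["speech", "news", "retrieval", "video", "image"]
W = np.array([[2., 0., 1.],
              [1., 0., 2.],
              [1., 1., 0.],
              [0., 2., 0.],
              [0., 1., 0.]])

# SVD as in (4): W ~ U Sigma V^T, truncated to R latent dimensions.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = 2
U_r, S_r, V_r = U[:, :R], np.diag(s[:R]), Vt[:R, :].T  # V_r rows: documents

# Fold the query into the same latent space.
q = np.array([1., 1., 0., 0., 0.])     # query uses "speech" and "news"
q_hat = q @ U_r @ np.linalg.inv(S_r)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rank documents by cosine similarity in the latent semantic space.
print(sorted(((cos(q_hat, V_r[i]), i) for i in range(V_r.shape[0])),
             reverse=True))
```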

In recent years, new efforts have been made to establish a probabilistic framework for the above latent topical approaches, including improved model training algorithms. The probabilistic latent semantic analysis (PLSA) or aspect model [17] is often considered a representative of this category. PLSA introduces a set of latent topic variables {T_k, k = 1, 2, ..., K} to characterize the term-document co-occurrence relationships, as shown in Figure 3. A query Q is again treated as a sequence of observed terms, Q = t_1 t_2 ... t_j ... t_n, while the document d_i and a term t_j are both assumed to be independently conditioned on an associated latent topic T_k. The conditional probability of a document d_i generating a term t_j thus can be parameterized by

P(t_j | d_i) = \sum_{k=1}^{K} P(t_j | T_k) P(T_k | d_i).    (5)

[FIG2] Three-dimensional schematic representation of the latent semantic space and the LSI retrieval model, with a term t, a document d_i, and the query Q represented as points in the space.

[FIG3] Graphical representation of the PLSA-based retrieval model: documents d_1, d_2, ..., d_N generate latent topics T_1, T_2, ..., T_K with probabilities P(T_k | d_i), and the latent topics generate the terms of the query Q = t_1 t_2 ... t_j ... t_n with probabilities P(t_j | T_k).


When the terms in the query Q are further assumed to be independent given the document, the relevance score between the query and document can be expressed as

P(Q | d_i) = \prod_{j=1}^{n} \left[ \sum_{k=1}^{K} P(t_j | T_k) P(T_k | d_i) \right].    (6)

Notice that this relevance score is not obtained directly from the frequency of the respective query term t_j occurring in d_i but instead through the frequency of t_j in the latent topic T_k as well as the likelihood that d_i generates the latent topic T_k. A query and a document thus may have a high relevance score even if they do not share any terms in common; this, therefore, is concept matching. The PLSA model can be optimized by the EM algorithm, either in an unsupervised manner using each individual document in the collection as a query exemplar to train its own PLSA model or in a supervised manner using a training set of query exemplars with the corresponding query-document relevance information.
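
A hedged sketch of the scoring side of (6), assuming the PLSA parameters P(t_j | T_k) and P(T_k | d_i) have already been trained by EM; all probabilities, topic names, and vocabulary here are invented.

```python
import math

# Hypothetical trained PLSA parameters (in practice estimated by EM).
P_t_given_T = {                       # P(t_j | T_k)
    "T1": {"speech": 0.5, "news": 0.3, "audio": 0.2},
    "T2": {"stock": 0.6, "market": 0.3, "news": 0.1},
}
P_T_given_d = {                       # P(T_k | d_i)
    "d1": {"T1": 0.9, "T2": 0.1},
    "d2": {"T1": 0.2, "T2": 0.8},
}

def log_relevance(query_terms, doc):
    """Log of P(Q|d_i) as in (6): product over query terms of the
    topic-mixture probability sum_k P(t_j|T_k) P(T_k|d_i)."""
    score = 0.0
    for t in query_terms:
        p = sum(P_t_given_T[k].get(t, 0.0) * P_T_given_d[doc][k]
                for k in P_T_given_d[doc])
        score += math.log(max(p, 1e-12))  # floor to avoid log(0)
    return score

# "speech news" matches d1 through topic T1 even though the scoring
# never compares query terms with document terms directly.
for d in ("d1", "d2"):
    print(d, log_relevance(["speech", "news"], d))
```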

SPOKEN DOCUMENTS/QUERIES

All the retrieval models mentioned above can be equally applied to text or spoken documents with text or spoken queries. The primary difficulties for spoken documents and/or queries are the inevitable speech-recognition errors, including the problems with spontaneous speech (such as pronunciation variation and disfluencies) and the out-of-vocabulary (OOV) problem for words outside the vocabulary of the speech recognizer. A principal approach to the former, in addition to the many approaches for improving the recognition accuracy, is to develop more robust indexing terms for audio signals. For example, multiple recognition hypotheses obtained from N-best lists, word graphs, or sausages can provide alternative representatives for the confusing portions of the spoken query or documents [18]. Improved scoring methods using different confidence measures (for example, posterior probabilities incorporating acoustic and language model likelihoods) or other measures considering relationships among the recognized word hypotheses [19], [20], as well as prosodic features including pitch, energy, stress, and duration measures [21], can also help to properly weight the term hypotheses. The use of subword units (e.g., phonemes for English and syllables for Chinese), or segments of them, rather than words as the indexing terms t mentioned above has also been shown to be very helpful [18], [19], [22]. Because an incorrectly recognized spoken word may still include several correctly recognized subword units, matching on the subword level has the advantage of partial matching. Moreover, all words, whether within the vocabulary of the recognizer or not, are composed of a few subword-level units. Therefore, matching on the subword level is very often an effective approach for bypassing the OOV problem mentioned above, because the words in the spoken query/documents may be reasonably matched even if they are not in the vocabulary and cannot be correctly recognized. Furthermore, because word-level terms possess more semantic information, whereas subword-level terms are more robust against the speech recognition errors and the OOV problem, there are good reasons to fuse the information obtained from the two different levels of terms t. In addition, another set of approaches tries to expand the representation of the query/document using conventional IR techniques (such as pseudo-relevance feedback). Techniques based on the acoustic confusion statistics and/or semantic relationships among the word- or subword-level terms derived from some training corpus have been shown to be very helpful as well [20], [23]. More detailed issues regarding retrieval of spoken documents are reviewed by Koumpis and Renals [24].
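
As a toy illustration of subword-level matching (our sketch, not a system from the article), the following scores a spoken query against a document by the overlap of syllable bigrams, so that partially misrecognized or OOV words can still match; the syllable strings are invented.

```python
def subword_ngrams(units, n=2):
    """All overlapping n-grams of subword units (e.g., syllables)."""
    return {tuple(units[i:i + n]) for i in range(len(units) - n + 1)}

def subword_match(query_units, doc_units, n=2):
    """Fraction of the query's subword n-grams found in the document,
    a simple partial-matching score robust to word-level errors."""
    q = subword_ngrams(query_units, n)
    d = subword_ngrams(doc_units, n)
    return len(q & d) / len(q) if q else 0.0

# Toy example with invented syllable transcriptions: even if a word was
# misrecognized at the word level, most syllable bigrams still match.
query_syl = ["xin", "zhu", "ke", "xue", "yuan", "qu"]
doc_syl = ["jin", "tian", "xin", "zhu", "ke", "xue", "yuan", "qu", "de"]
print(subword_match(query_syl, doc_syl))
```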

TECHNOLOGY AREAS FOR SPOKEN DOCUMENT UNDERSTANDING AND ORGANIZATION

As mentioned previously, a number of technology areas are involved in spoken document understanding and organization for efficient retrieval/browsing. Each of these areas will be briefly reviewed in this section.

NAMED ENTITY EXTRACTION FROM SPOKEN DOCUMENTS

Named entities (NEs) include 1) proper nouns such as names of persons, locations, organizations, artifacts, and so on; 2) temporal expressions such as "Oct. 10 2003" or "1:40 p.m."; and 3) numerical quantities such as "fifty dollars" or "thirty percent." They are usually the keywords of the documents. The temporal expressions and numerical quantities can be easily modeled and extracted by rules; therefore, they will not be further mentioned here. The person/location/organization names, however, are much more difficult to identify in text or spoken documents. Sometimes it is even difficult to identify which kind of name they are; for example, "White House" can be either an organization name or a location name based on the context [25]. The NEs, as well as their categories, are the first and most fundamental knowledge required for understanding and organizing the content of the documents. The task of automatic extraction of NEs originated from the Message Understanding Conferences (MUC) sponsored by a U.S. Defense Advanced Research Projects Agency (DARPA) program [26] in the 1990s; this program was aimed at the extraction of information from text documents. More research work on NE extraction from spoken documents has been carried out on American English broadcast news (also under DARPA-sponsored programs) since the late 1990s, and it has been extended to many other languages in the past several years [27], [28]. Substantial work has been done in developing rule-based approaches for locating the NEs [29]. For example, the cue word "Co." possibly indicates the existence of a company name in the span of its predecessor words, and the cue word "Mr." possibly indicates the existence of a person name in the span of its successor words. Such an approach has been very useful. However, the rules may become very complicated when we wish to cover all different possibilities. It will be very time consuming and difficult to handcraft all the rules, especially when the task domain becomes more general or when new sources of documents are being handled. This may be the reason that model-based (or machine-learning-based) approaches were proposed later on, such that some model can be initially trained using a sufficient quantity of data annotated with the types and locations of the NEs, and the model can then be steadily improved through use [25]. In many cases, the rule-based and model-based approaches can be properly integrated to take advantage of the nice properties of both.

In model-based NE extraction, the goal is usually to find the sequence of NE labels (person name, location name, or other words), E = n_1 n_2 ... n_j ... n_n, for a sentence or term sequence, S = t_1 t_2 ... t_j ... t_n, that maximizes the probability P(E | S). S can be a sequence of recognized words, with recognition errors and incorrect sentence boundaries, in the transcription of spoken documents, and n_j is the NE label for the term t_j. The HMM, as depicted in Figure 4, is probably the most representative model used in this category [25]. This model consists of one state modeling each type of NE (person, location, or organization), plus one state modeling other words in the general language (non-named-entity words), with possible transitions between states. Each state is characterized by a bigram or trigram language model estimated for that state, and state-transition probabilities can be similarly trained. The Viterbi algorithm can thus be used to find the most likely state sequence, or NE label sequence E, for the input sentence; a segment of consecutive words in the same NE state is taken as an NE. As another important representative, in the recent past the maximum entropy (ME) modeling approach has been used in NE extraction. In this approach, many different linguistic and statistical features, such as part-of-speech (POS) information, rule-based knowledge, and term frequencies, can all be represented and integrated in the framework of a statistical model with which the NEs can be identified. It was shown that very promising results can be obtained with this approach [28], [30].
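
The state-based decoding is easy to see in miniature. The sketch below is ours, with all probabilities invented: Viterbi over the four NE states with per-state unigram emissions and "sticky" transitions; the article's per-state bigram/trigram language models and trained transition probabilities would slot in the same way.

```python
import math

STATES = ["PERSON", "LOCATION", "ORGANIZATION", "GENERAL"]

# Invented per-state unigram emission probabilities; unseen words
# back off to a small floor.
EMIT = {
    "PERSON":       {"mr.": 0.2, "lee": 0.4},
    "LOCATION":     {"taipei": 0.5},
    "ORGANIZATION": {"national": 0.2, "taiwan": 0.2, "university": 0.3},
    "GENERAL":      {"visited": 0.1, "the": 0.2},
}
FLOOR = 1e-4

def trans(prev, cur):
    """Sticky transitions: staying in the same state is more likely."""
    return 0.7 if prev == cur else 0.1

def viterbi_ne(words):
    """Most likely NE-state sequence; consecutive words decoded into the
    same non-GENERAL state would be read out as one named entity."""
    V = [{s: math.log(EMIT[s].get(words[0], FLOOR)) for s in STATES}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in STATES:
            best = max(STATES,
                       key=lambda p: V[-1][p] + math.log(trans(p, s)))
            row[s] = (V[-1][best] + math.log(trans(best, s))
                      + math.log(EMIT[s].get(w, FLOOR)))
            ptr[s] = best
        V.append(row)
        back.append(ptr)
    state = max(STATES, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(zip(words, reversed(path)))

print(viterbi_ne("mr. lee visited national taiwan university".split()))
```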

At National Taiwan University, we recently developed a new approach for NE extraction with the help of global information from the entire document, using the PAT tree data structure [31]. Very often, some NEs are difficult to identify within a single sentence. However, if the scope of observation is extended to the entire document, it may be found that the same entity appears several times in several different sentences; it then has a higher likelihood of being an NE when all those occurrences in different sentences are considered jointly. The PAT tree is an efficient data structure successfully used in information retrieval. It is especially efficient for indexing a continuous data stream, such that the frequency counts of all segments of terms in a document can be obtained easily from a tree structure constructed with that document. It was shown that by incorporating the framework of various approaches of NE extraction with a PAT tree for the entire document, significantly improved performance can be achieved. In all cases, the NEs are very often OOV, or unknown, words. A typical approach for dealing with this problem, for example in the HMM modeling mentioned above, is to divide the training data into two parts during training. In each half, every segment of terms or words that does not appear in the other half is marked as "unknown," such that the probabilities for both known and unknown words occurring in the respective NE states can be properly estimated. During testing, any segment of terms that has not been seen before can thus be labeled "unknown," and the Viterbi algorithm can be carried out to give the desired results [25].

All the approaches mentioned above are equally useful for NE extraction from both text and spoken documents. Usually, the spoken documents are first transcribed into a sequence of terms or words, and the same approaches for text can then be applied. However, NE extraction from spoken documents is much more difficult than that from text documents, not only because of the presence of recognition errors and problems of spontaneous speech but also due to the absence of textual clues such as punctuation or sentence boundaries (in particular the capital letters indicating proper nouns). In fact, a more difficult problem is that many NEs are OOV words and thus cannot be correctly recognized in the transcriptions in the first place [27]. Extra measures are, therefore, definitely needed in the case of spoken documents. Performing the extraction on multiple recognition hypotheses, such as word graphs or sausages, and using confidence scores have been two typical approaches. Class-based language models, with different POS classes for the general-language state in the above HMM approach, were also found useful [31]-[33].

In recent years, several approaches were developed to try to utilize the rich resources of the Internet as background knowledge for extracting from spoken documents those NEs that are OOV words. In such approaches, selected words with higher confidence measures in the transcription obtained with the initial recognition of the spoken document can be used to construct queries to retrieve from the Internet text documents relevant to the spoken document being considered. NE extraction can then be performed on these retrieved text documents, the obtained NEs can be added to the recognition vocabulary, and the language model can be adapted using the retrieved text documents. A second-pass transcription of the spoken document based on the updated vocabulary and language model may then be able to correctly recognize the OOV words. In some similar approaches, for the recognized words in the initial transcription with lower confidence measures, the associated phone (or syllable) sequence can also be directly matched with the relevant text documents or the newly obtained NEs, such that the incorrectly recognized OOV words may possibly be recovered. In both cases, the achievable improvements depend heavily on the accurate retrieval of highly relevant documents that successfully include the OOV words of the spoken documents, which in turn relies on a good choice of the indexing terms and the construction of good queries [31], [34].

[FIG4] An illustration of HMM-based named-entity extraction: states for person, location, organization, and the general language generate the sentence S = t_1 t_2 ... t_j ... t_n.

SPOKEN DOCUMENT SEGMENTATION

Automatic spoken document segmentation is used to set the boundaries between the different small topics mentioned in long streams of audio signals and to divide the spoken documents into a set of cohesive paragraphs sharing some common central topic. This is less critical for text documents, which are usually well segmented into paragraphs, sections, or chapters by the authors, although much work has also been done on text document segmentation. For multimedia or spoken documents, such as a three-hour video of a course lecture, a two-hour movie, or a one-hour news episode, such segmentation becomes important for efficient retrieval and browsing. In such cases, the automatic segmentation is to be performed on long streams of transcribed words with recognition errors, even without sentence boundaries. Substantial efforts have been made on feature-based segmentation approaches, in which cue words or phrases indicating the beginning/ending of a topic or a switching point between topics can be identified as features for segmentation, either statistically or with the help of human knowledge [35]. For example, a set of discourse markers consisting of cue words obtained with word statistics was successfully used in conjunction with pause information for the task of detecting paragraph breaks in a technical presentation archive [36].

Approaches based on similarity among sentences have also been extensively investigated. Here, a "sentence" may be a short sequence of transcribed words with fixed length, or the sentence boundaries may be assumed at longer pause durations in the acoustic signal. As an example of such approaches, the similarity between two sentences can be estimated by the number of terms commonly used, or by cosine measures for the vector representations of the sentences obtained by VSM or LSI in IR. This similarity measure can be evaluated for each pair of sentences within a paragraph hypothesis, as well as for each pair of sentences taken from two adjacent paragraph hypotheses across a segmentation point candidate. By comparing these two different sets of similarities while shifting the segmentation point candidate, the segmentation point can be identified. As another example, assuming a sequence of sentences S_1 S_2 ... S_i has been determined to describe the same topic, the sentences can be considered as a document and the relevance scores between the next sentence S_{i+1} and this document can be evaluated with any of the above IR approaches (VSM or LSI); based on this, it can be decided whether S_{i+1} belongs to the same paragraph or is the beginning of a new paragraph.
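
A hedged sketch of the cross-boundary similarity idea (invented toy sentences, raw term counts as sentence vectors): slide the candidate boundary, compare the w sentences on each side with the cosine measure, and place the boundary where the similarity dips.

```python
import math
from collections import Counter

def cos(a, b):
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundary_scores(sentences, w=1):
    """Cosine similarity across each candidate boundary, comparing the
    w sentences before the boundary with the w sentences after it."""
    scores = []
    for i in range(w, len(sentences) - w + 1):
        left = Counter(t for s in sentences[i - w:i] for t in s)
        right = Counter(t for s in sentences[i:i + w] for t in s)
        scores.append((i, cos(left, right)))
    return scores

sents = [s.split() for s in [
    "the stock market fell sharply today",
    "investors sold shares amid market fears",
    "the home team won the baseball game",
    "the pitcher threw a perfect game",
]]
# The lowest cross-boundary similarity marks the best segmentation point.
print(min(boundary_scores(sents), key=lambda x: x[1]))  # boundary at i=2
```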

Recent work on segmentation seems to be more focused on model-based approaches, primarily the HMM approach [37], [38]. In this approach (as shown in Figure 5), a total of I topic clusters, {u_k, k = 1, 2, ..., I}, form the I states of the HMM. These topic clusters are trained with a training corpus of segmented text documents with labeled topics, or by, for example, the K-means algorithm if the training segments are not labeled. The unsegmented transcription of spoken documents is taken as the observations in the form of a sequence of sentences, S_1 S_2 ... S_i ... S_M. Similar to the above, the "sentences" here may not have correct boundaries. The probability for each "sentence" S_i = t_1 t_2 ... t_j ... t_n, where t_j is the jth term, to be observed for each topic cluster (or state) u_k can then be evaluated by N-gram probabilities trained from the training documents in the topic cluster u_k:

P(S_i | u_k) = m_1 P(t_1 | u_k) \times \prod_{j=2}^{n} [m_1 P(t_j | u_k) + m_2 P(t_j | t_{j-1}, u_k)],    (7)

assuming unigram and bigram probabilities are used, where m_1 and m_2 are weighting parameters. The above equation is very similar to (3) for the HMM for IR, except that the document d_i in (3) is replaced by the topic cluster u_k here, and we may not need an outside corpus C for smoothing the unigram and bigram models if a large enough training corpus is available. Each topic cluster (state) may have a fixed transition probability p_1 for transition to a different state, as well as another probability p_2 for remaining in the same state. The transition probabilities may also be estimated using the frequency counts of transitions between clusters in the training set. Various approaches can be developed to improve this model. For example, p_1 can be further modified by a pause duration model (so a longer pause in the audio signal implies a higher probability of transiting to a different topic), and p_2 can be further modified by a paragraph-length model (e.g., p_2 can be smaller when the present paragraph has been long enough), and so on [38]. The Viterbi algorithm can then be performed on the input observations, and each state transition obtained is taken as a segmentation point. This HMM-based approach has recently been extended to embed the PLSA model in the representation of the state observation probabilities, and a considerable performance gain was indicated [39].
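
In miniature (invented topic models and sentences, a unigram-only version of (7)): compute log P(S_i | u_k) per sentence and topic cluster, then Viterbi-decode the topic sequence with stay probability p_2 and switch probability p_1; a change of state marks a segment boundary.

```python
import math

# Hypothetical topic-cluster unigram models; (7) would also
# interpolate bigrams.
TOPICS = {
    "finance": {"stock": 0.3, "market": 0.3, "shares": 0.2},
    "sports":  {"game": 0.3, "team": 0.3, "pitcher": 0.2},
}
FLOOR = 1e-4
P_STAY, P_SWITCH = 0.8, 0.2   # p_2 and p_1 in the article's notation

def log_sent_prob(sentence, topic):
    """log P(S_i | u_k): product of per-term unigram probabilities."""
    return sum(math.log(TOPICS[topic].get(t, FLOOR)) for t in sentence)

def segment(sentences):
    """Viterbi over topic states; a state change marks a boundary."""
    names = list(TOPICS)
    V = [{u: log_sent_prob(sentences[0], u) for u in names}]
    back = []
    for s in sentences[1:]:
        row, ptr = {}, {}
        for u in names:
            def score(p):
                jump = P_STAY if p == u else P_SWITCH / (len(names) - 1)
                return V[-1][p] + math.log(jump)
            best = max(names, key=score)
            row[u] = score(best) + log_sent_prob(s, u)
            ptr[u] = best
        V.append(row)
        back.append(ptr)
    state = max(names, key=lambda u: V[-1][u])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

sents = [["stock", "market", "fell"], ["shares", "sold"],
         ["team", "won", "game"], ["pitcher", "perfect", "game"]]
print(segment(sents))   # boundary where the topic label changes
```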


[FIG5] HMM-based spoken document segmentation: topic clusters u_1, u_2, ..., u_I form the states, with probability p_1 of switching to a different cluster and p_2 of staying in the same cluster, generating the sentence sequence S_1 S_2 ... S_i ... S_M.


INFORMATION EXTRACTION FOR SPOKEN DOCUMENTS

Information extraction (IE) has long been considered a very important area in natural language understanding. It usually refers to the processes of extracting the salient facts for some prespecified templates of information regarding the entities or events (in particular the relationships among them) from the documents (or paragraphs of documents, as discussed here) and then organizing some semantic representation of such information for efficient use. The approaches used in IE vary significantly for different cases, but the core processes include lexical processing, syntactic analysis, semantic analysis, and output presentation [40], [41]. POS tagging is usually an important component in lexical analysis. In this component, the most likely lexical class tag is automatically assigned to each word in a sentence. The NE extraction discussed previously is commonly considered a part of this component, because the person/location/organization names are very important lexical classes that can be tagged to the words in addition to other lexical classes. In fact, in many cases, the final output of IE is exactly the relationships among the NEs in the documents. Syntactic analysis, on the other hand, is very often accomplished by syntactic parsing, in which the core constituent structures of sentences (e.g., the noun, verb, and prepositional phrases) are identified based on a grammar, an example of which is a set of finite state rules with or without probability distributions. Then, the head words in each of the identified constituents usually give the key information needed. For example, the head words in the subject-noun phrase and the object-noun phrase are very often some domain-specific NEs, the verb that takes them as arguments may represent the relationship between them, and all these together may describe some event. The prepositional phrases may reveal the temporal and location information for the objects or events. Semantic analysis then tries to further explore some additional linguistic cues via semantics. As a simple example, it is often necessary to identify the various forms of the same objects or NEs throughout a given context (e.g., acronyms or aliases) and to resolve the references of pronouns or demonstratives, in most cases based on the earlier mentioned terms. The final step of IE is to organize or present the extracted information in some form of appropriate templates (or abstract data structures) for efficient use, for example being automatically entered into a database to be used in indexing/retrieval or question answering, or being used in composing a summary or a title in natural language.

All the processes mentioned above can be accomplished by either rule- or model-based approaches or a combination of both. Take the component of POS tagging as an example [42]. As mentioned above, POS tagging can be considered an extended or generalized version of the NE extraction problem, because other lexical class tags in addition to the NEs are also to be assigned to all the words in a sentence. It can therefore be imagined that approaches and issues similar to those for NE extraction equally apply here. For example, the rule-based approaches are apparently useful, in which a large set of disambiguation rules, derived either manually or statistically, are able to solve many of the problems. On the other hand, HMM-based tagging is a good example of the model-based approaches. In this approach, the best sequence of tags Q = p_1 p_2 ... p_j ... p_n for a sentence S = t_1 t_2 ... t_j ... t_n, where p_j is the POS tag for the word (or term) t_j, can be chosen by maximizing the probability P(Q | S) based on a probabilistic generative model or an HMM. This model has an extended form similar to that in Figure 4 and can be trained on a pretagged corpus. There is also the so-called transformation-based tagging, which is in fact an integration of both rule- and model-based approaches. In this case, a set of templates (or transformations) that condition the tag specification of a given word on the context of its preceding and/or succeeding words is developed for deducing the rules that can capture the interdependencies between the words and the corresponding tags. With the aid of a pretagged training corpus, these rules can then be incrementally learned by selecting and sequencing the transformations that transform the training sentences into a set of POS tag sequences closest to the corresponding training tag sequences.

All these approaches apply equally to text and spoken documents, although at the moment most work on IE is focused on text documents and relatively limited work on spoken documents has been reported [32], [33]. In the case of spoken documents, apparently more difficult problems (such as those due to speech recognition errors and spontaneous speech) must be addressed.

SPOKEN DOCUMENT SUMMARIZATION

Research work on automatic summarization of text documents dates back to the late 1950s, and the efforts have continued through the decades. The World Wide Web not only led to a renaissance in this area but extended it to cover a wider range of new tasks, including multidocument, multilingual, and multimedia summarization [43]. Summarization can, in general, be either extractive or abstractive. Extractive summarization tries to select a number of indicative sentences, passages, or paragraphs from the original document according to a target summarization ratio and then sequence them to form a summary. Abstractive summarization, on the other hand, tries to produce a concise abstract of desired length that can reflect the key concepts of the document. The latter seems to be more difficult, and recent approaches have focused more on the former. As one example, the VSM model for IR can be used to represent each sentence of the document, as well as the whole document, in vector form. The sentences that have the highest relevance scores to the whole document, as evaluated with (2), are selected to be included in the summary. When it is desired to cover more important but different concepts in the summary, after the first sentence with the highest relevance score is selected, the indexing terms in that sentence can be removed from the rest of the sentences and the vectors can be reconstructed; based on this, the next sentence can be selected, and so on [44]. As another example, the LSI model for IR can be used to represent each sentence of a document as a vector in the latent semantic space for that document, which is constructed by performing SVD on the "term-sentence" matrix W' for that document. The right singular vectors with larger singular values represent dimensions for more important latent semantic concepts in that document. Therefore, the sentences with vector representations having the largest components in these dimensions can be included in the summary [44]. As still another example, each sentence in the document, S = t_1 t_2 ... t_j ... t_n, represented as a sequence of terms, can simply be given a score I(S), which is evaluated based on some statistical measure s(t_j) (such as TF/IDF in (1) or similar) and linguistic measure l(t_j) (e.g., NEs and different parts of speech are given different weights, and function words are not counted) for all the terms t_j in the sentence,

I(S) = \frac{1}{n} \sum_{j=1}^{n} [\lambda_1 s(t_j) + \lambda_2 l(t_j)],    (8)

where λ_1 and λ_2 are weighting parameters, and the sentence selection can be based on this score [45]. The selected sentences in all the above cases can also be further condensed by removing some less important terms if a higher compression ratio is desired.

The above approaches apply equally to both text and spoken documents. However, spoken documents bring added difficulties such as recognition errors, problems with spontaneous speech, and the lack of correct sentence or paragraph boundaries. To avoid the redundant or incorrect parts while selecting the important and correct information, multiple recognition hypotheses, confidence scores, language model scores, and other grammatical knowledge have been utilized [46]. As an example, (8) may be extended as

I(S) = \frac{1}{n} \sum_{j=1}^{n} [\lambda_1 s(t_j) + \lambda_2 l(t_j) + \lambda_3 c(t_j) + \lambda_4 g(t_j)] + \lambda_5 b(S),    (9)

where c(t_j) and g(t_j) are obtained from the confidence score and N-gram score for the term t_j, b(S) is obtained from the grammatical structure of the sentence S, and λ_3, λ_4, and λ_5 are weighting parameters. In addition, prosodic features (intonation, pitch, energy, or pause duration) can be used as important clues for summarization, although reliable and efficient approaches to the use of these prosodic features are still under active research [46]. The summary of spoken documents can be in either text or speech form. The text form has the advantages of easier browsing and further processing but is inevitably subject to speech recognition errors as well as the loss of the speaker/emotional/prosodic information carried only by the speech signals. The speech form of summary can preserve the latter information. It is free from recognition errors but comes with the difficult problem of smooth concatenation of speech segments.
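
A hedged sketch of the extended score (9) follows; all measures, weights, and data are invented. The per-term statistical, linguistic, confidence, and N-gram measures are passed in as functions, plus a sentence-level grammatical score b(S).

```python
def sentence_score(terms, s, l, c, g, b_score,
                   lam=(1.0, 0.5, 0.5, 0.3, 0.2)):
    """Significance score I(S) of (9): per-term statistical, linguistic,
    confidence, and N-gram measures, plus a grammatical structure score
    for the whole sentence."""
    l1, l2, l3, l4, l5 = lam
    n = len(terms)
    per_term = sum(l1 * s(t) + l2 * l(t) + l3 * c(t) + l4 * g(t)
                   for t in terms) / n
    return per_term + l5 * b_score

# Invented toy measures: s = TF-IDF-like weight, l = POS/NE weight,
# c = recognition confidence, g = language model score.
weights = {"earthquake": 2.0, "taipei": 1.5}
conf = {"earthquake": 0.9, "taipei": 0.8, "hit": 0.95}
sent = ["earthquake", "hit", "taipei"]
score = sentence_score(
    sent,
    s=lambda t: weights.get(t, 0.1),
    l=lambda t: 1.0 if t in weights else 0.2,  # e.g., NEs weighted higher
    c=lambda t: conf.get(t, 0.5),
    g=lambda t: 0.5,                           # flat N-gram score here
    b_score=0.7)
print(score)   # sentences with the highest I(S) go into the summary
```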

TITLE GENERATION FOR SPOKEN DOCUMENTS

A title is different from a summary, beyond being especially short. The summary is supposed to tell the key points, major conclusions, or central concepts of the document. A title only tells what the document is about, using only several words or phrases concatenated in a readable form. Generally, a title is at most a single sentence, if not shorter. But the title exactly complements the summary during browsing. The user can easily select the desired document with a glance at the list of titles. Title generation seems to be more difficult than summarization, and much less work has been reported. A few years ago, the Informedia Project at CMU compared several basic approaches based on a framework assuming the availability of some training corpus of text documents with human-generated titles [47]. This assumption is reasonable at least for broadcast news, because all text news stories do have human-generated titles. Let D = {d_j, j = 1, 2, ..., L} be the set of L training documents and Z = {z_j, j = 1, 2, ..., L} be the set of corresponding human-generated titles, all in text form. The basic idea is then to learn the relationships between d_j and its corresponding human-generated title z_j using D and Z during training and to try to extend such relationships to a given spoken document d_i to obtain the automatically generated title z_i. Such a framework is also quite different from document summarization.

There are several basic approaches to title generation. In the K-nearest-neighbor (KNN) approach [48], for each given spoken document d_i, rather than creating a new title z_i, one may try to find an appropriate title in the training corpus, z_j ∈ Z, for a training document d_j ∈ D that is the nearest to d_i. The distance measure between documents can be evaluated with statistics of the indexing terms such as TF/IDF in (1). Because the title is actually generated by a human, it is well readable. Yet, this approach has the fatal problem of requiring some training documents highly correlated to the given spoken documents. It cannot perform well at all for a document telling a completely new story. In the naive Bayesian approach with limited terms (NBL) [49], one tries to identify the document-term/title-term pairs t co-occurring in all pairs of a document and its corresponding human-generated title (d_j, z_j) in the training corpus, where t ∈ d_j, d_j ∈ D and t ∈ z_j, z_j ∈ Z. Then, one estimates the probability for a term t in the training document d_j to be selected and used in its human-generated title z_j. This estimation was achieved by some statistical measures based on the term frequencies for all such document-term/title-term pairs t in the respective title z_j and document d_j in the training corpus, as well as some other statistics from the training corpus. For a given spoken document, the terms to be used in the automatically generated title are then selected from the terms in the transcription based on the probabilities mentioned above and other statistical parameters, such as the term frequencies for all the terms t in the transcription. In still another approach, the terms to be used in the titles can simply be selected from the extracted NEs with higher scores, and the TF/IDF scores can also be used. But in these latter approaches, the selected terms may not be in good order; therefore, they may need to be resequenced to produce a readable title. One way to achieve this is to perform the Viterbi algorithm on an HMM for the selected terms, in which each selected term is a state and N-gram probabilities are assigned as state transition probabilities, as sketched below.
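
A toy sketch of the resequencing step (invented bigram probabilities): score each ordering of the selected title terms with a bigram model and keep the best. Exhaustive search is fine for a handful of terms; the Viterbi/beam formulation described in the text scales to longer titles.

```python
import math
from itertools import permutations

# Hypothetical bigram probabilities P(next | prev) estimated from a
# title corpus; "<s>" marks the title start.
BIGRAM = {
    ("<s>", "typhoon"): 0.3, ("typhoon", "hits"): 0.4,
    ("hits", "taiwan"): 0.5, ("<s>", "taiwan"): 0.1,
    ("taiwan", "typhoon"): 0.05,
}
FLOOR = 1e-6

def order_score(terms):
    """Log bigram probability of one ordering of the selected terms."""
    seq = ["<s>"] + list(terms)
    return sum(math.log(BIGRAM.get((a, b), FLOOR))
               for a, b in zip(seq, seq[1:]))

def best_title(terms):
    """Exhaustive search over orderings of the selected terms."""
    return max(permutations(terms), key=order_score)

print(best_title(["taiwan", "hits", "typhoon"]))
# -> ('typhoon', 'hits', 'taiwan')
```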

At National Taiwan University, we successfully developed an improved approach [50], [51] that tries to integrate the nice properties of the above approaches. We first extract the NEs when transcribing the given spoken document d_i and obtain the top j training documents in the training corpus D nearest to this transcription using the KNN approach. The human-generated titles of these j training documents are then rescored based on the term scores obtained from the probabilities estimated in the NBL approach mentioned above, and the best training title z*_i for the given spoken document d_i is chosen. But this title is still for a training document. So the terms that appear in the selected training title z*_i but do not appear in the transcription of the given spoken document d_i should be replaced by the NEs extracted from d_i with the highest scores that do not yet appear in the selected title z*_i. All the terms in the title thus obtained should be resequenced by the Viterbi algorithm for the terms, as mentioned above, to produce the final title for the given spoken document. This new approach was shown to produce better titles than the previous approaches. The basic structure of the title is human generated and, therefore, well readable. The keywords of the given spoken document, very often NEs or OOV words, can be selected and used in the title thanks to the NE extraction and to the probabilities for terms to be used in the titles estimated by the NBL approach.

TOPIC ANALYSIS AND ORGANIZATION

Topic analysis and organization refers to analyzing the inherent topics discussed in each document and offering an overall knowledge of the semantic content of the entire document collection in some form of hierarchical structure with concise visual presentation. The purpose is to enable comprehensive and efficient access to the document collection and to derive a best set of query terms for each topic cluster. This is usually some kind of data-driven organization. It can help the users to compose better queries and browse across the documents efficiently, as well as to better grasp the relationships among the documents. BBN's Rough'n'Ready system [4] may represent one of the few early efforts for spoken documents in this direction. In this system, each spoken document (broadcast news story) was automatically given a short list of topic labels describing the main themes of the document (serving the purposes of the titles mentioned above), and all documents were organized in a tree-structured hierarchy classified by dates, sources of the news, and so on.

The WebSOM approach [52], [53] is another typical example of data-driven topic organization for a document collection. In this approach, each document in the collection can be represented by a vector using the LSI model in IR. These vectors are then taken as the input to derive the self-organizing map (SOM) for the documents. SOM is a well-known neural network model that can be trained in an unsupervised way [54]. The documents can thus be clustered based on the latent semantic concepts, and the document clusters and the relationships among the clusters can be presented as a two-dimensional (2-D) map. On this map, each topic cluster is represented as a lattice point (or a neuron), and closely located lattice points (or neurons) in nearby regions of the map represent related topic clusters. Each document can be assigned a unique coordinate corresponding to the closest cluster representative. In this way, the users are able to browse the semantically related documents on the map, either within the same lattice point (or topic cluster) or across neighboring lattice points.
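The SOM step itself is simple to sketch. The following Python fragment is a minimal, generic SOM trainer over document vectors in the spirit of WebSOM; the map size, learning-rate schedule, and Gaussian neighborhood are illustrative assumptions, not WebSOM's actual settings.

```python
import numpy as np

# Minimal SOM sketch for clustering documents on a 2-D map.
def train_som(doc_vectors, rows=8, cols=8, epochs=20, lr0=0.5, sigma0=3.0):
    dim = doc_vectors.shape[1]
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(rows, cols, dim))          # one neuron per lattice point
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)  # lattice coordinates
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)                   # decaying learning rate
        sigma = max(sigma0 * (1 - epoch / epochs), 0.5)   # shrinking neighborhood
        for x in doc_vectors:
            # Best-matching unit: the neuron closest to the document vector.
            d = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Pull nearby neurons toward the document (Gaussian neighborhood).
            g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1)
                       / (2 * sigma ** 2))
            weights += lr * g[..., None] * (x - weights)
    return weights

def assign_coordinates(doc_vectors, weights):
    """Each document gets the coordinate of its closest cluster representative."""
    coords = []
    for x in doc_vectors:
        d = np.linalg.norm(weights - x, axis=-1)
        coords.append(np.unravel_index(np.argmin(d), d.shape))
    return coords
```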

The ProbMap [55] is yet another typical example with purposes similar to the above but extended from the PLSA model. The documents are organized into latent topic clusters, and the relationships among the clusters can be presented as a 2-D map on which every latent topic cluster is a lattice point. In this approach, an additional set of latent variables {Yk, k = 1, 2, . . . , K} is introduced with respect to the latent topics {Tk, k = 1, 2, . . . , K}. Each latent variable Yk defines a probability distribution {P(Tl|Yk), l = 1, 2, . . . , K} that represents the statistical correlation between the latent topic Tk and each of the other latent topics Tl. This distribution not only describes the semantic similarity among the latent topics, but it can also blend in additional semantic contributions from related latent topics Tl to a given latent topic Tk. In this way, the conditional probability of observing a term tj in a document di, P(tj|di), previously expressed by (5) in PLSA, can be modified as

P(t_j|d_i) = \sum_{k=1}^{K} P(T_k|d_i) P(t_j|T_k) = \sum_{k=1}^{K} P(T_k|d_i) \left[ \sum_{l=1}^{K} P(T_l|Y_k) P(t_j|T_l) \right].   (10)

The probability distribution P(Tl|Yk) can be expressed as a neighborhood function in terms of some distance measure between the location of the latent topic Tk and those of the other latent topics Tl on the 2-D map. This model can be trained in an unsupervised way by maximizing the total log-likelihood LT of the document collection {di, i = 1, 2, . . . , N} in terms of the unigram P(tj|di) of all terms tj observed in the document collection, using the EM algorithm:

L_T = \sum_{i=1}^{N} \sum_{j=1}^{N'} c(t_j, d_i) \log P(t_j|d_i),   (11)

where N is the total number of documents in the collection, N′ is the total number of different terms observed in the document collection, c(tj, di) is the frequency count of the term tj in the document di, and P(tj|di) is the probability obtained in (10). In this way, the probability of observing a term tj in a latent topic Tk is further contributed to by the probabilities of observing this term tj in all the different latent topics Tl, weighted by the distance between Tk and Tl on the map, as indicated in (10). By maximizing the total likelihood LT in (11), those latent topics with higher statistical correlation come to be located close to each other on the map. Different from the WebSOM mentioned above, in which each document di is assigned to a single topic cluster, here a document di can be assigned to many different latent topics with different probabilities; this intuitively seems more reasonable. Also, with the total log-likelihood adopted in training,


quite a few theoretically attractive machine learning algorithms can be applied.
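As a concrete illustration of (10) and (11), the following Python sketch computes the smoothed term probabilities and the total log-likelihood from given parameter arrays. The Gaussian neighborhood function and the toy map size are assumptions; the EM training of the parameters themselves is not shown.

```python
import numpy as np

# Sketch of the ProbMap quantities in (10) and (11).
K = 16                              # latent topics laid out on a 4 x 4 map
side = int(np.sqrt(K))

def neighborhood(sigma=1.0):
    """P(Tl|Yk): a normalized Gaussian of the map distance between the
    lattice points of topics k and l (rows: k, columns: l)."""
    xy = np.array([(k // side, k % side) for k in range(K)], dtype=float)
    d2 = ((xy[:, None, :] - xy[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2 / (2 * sigma ** 2))
    return P / P.sum(axis=1, keepdims=True)

def term_given_doc(P_topic_doc, P_term_topic, P_nbr):
    """Equation (10): each topic blends in its neighbors' term distributions.
    P_topic_doc: (N, K); P_term_topic: (K, V); P_nbr: (K, K)."""
    smoothed = P_nbr @ P_term_topic     # row k: sum_l P(Tl|Yk) P(tj|Tl)
    return P_topic_doc @ smoothed       # (N, V) matrix of P(tj|di)

def log_likelihood(counts, P_term_doc):
    """Equation (11): LT = sum_i sum_j c(tj, di) log P(tj|di);
    counts: (N, V) term-frequency matrix."""
    return float((counts * np.log(P_term_doc + 1e-12)).sum())
```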

INTEGRATION RELATIONSHIPS AMONG THE INVOLVED TECHNOLOGY AREAS
As can be understood, all the various technology areas discussed above are closely related to one another and should be properly integrated to achieve the desired functionalities and the best possible performance. Figure 6 is a diagram of such integration relationships among these areas in a hierarchical structure. As in the figure, the various technology areas may be considered as being on four different levels from bottom to top: the term, concept, summary, and topic levels. NE extraction belongs to the lowest, term level. It provides the most important terms to be used by all the other processes; therefore, it is the base for all higher-level processes. Spoken document segmentation divides a long spoken document stream into short paragraphs based on the concepts mentioned. Information extraction further identifies the exact concepts within each paragraph. Therefore, they both belong to the concept level. Precise NEs are certainly important for both of these processes. On the other hand, it is believed that information extraction actually offers a very good framework helpful in composing good summaries. Yet, it is still not well known today how this can be achieved, because most of the spoken document summarization techniques today

are actually structured in a different way, for example by selecting the important sentences rather than via information extraction. Similarly, it is also believed that information extraction can offer some key knowledge very helpful in producing good titles, although today, again, titles are not generated in this way. The terms used in the titles may naturally be the key terms in the summaries, and the titles may be considered very concise summaries as well. Therefore, not only can summarization and title generation both be considered to be on the summary level, but information extraction can also be considered to extend from the concept level up to the summary level; thus, the concepts identified on the concept level can be further concretized into summaries and titles on the summary level. Finally, the titles usually include key terms indicating the topic area, so title generation can also be considered to extend from the summary level up to the topic level. On the topic level, the topic analysis and labels may help to produce good titles, and the titles may help in the topic analysis. Topic analysis/organization and title generation on the topic level are again helpful to each other.

The above description provides the bottom-up relationships in Figure 6. The complete relationships in Figure 6 also include those that are top down. For example, the knowledge about the topic areas of the documents is definitely helpful to all lower-level processes, for example, to the NE extraction on the lowest term level and to the spoken document segmentation and information extraction on the second lowest, concept level.


[FIG6] Integration relationships among the involved technology areas for spoken document understanding and organization for retrieval/browsing applications.

(Diagram: Named-Entity Extraction from Spoken Documents on the term level; Spoken Document Segmentation and Information Extraction for Spoken Documents on the concept level; Spoken Document Summarization and Title Generation for Spoken Documents on the summary level; Topic Analysis and Organization on the topic level; all supporting Information Retrieval and Browsing.)


In fact, it is easy to see that the understanding achieved on each level is helpful to all the other levels, both upward and downward. As a result, an efficient interaction among the different processes on the different levels will be very beneficial, and a good approach incorporating both bottom-up and top-down integration is highly desired. One possible way to achieve this may be to begin the processes bottom-up and then feed the information or knowledge obtained in some higher-level processes back to some lower levels. The lower-level processes can then be restarted and the bottom-up processes reexecuted. This can be carried out recursively in several iterations to achieve the best results, although it is still not quite clear at the moment how exactly this can be accomplished. Finally, all these processes are to serve the purposes of efficient retrieval/browsing applications and can therefore be considered extended areas for today's information retrieval.

AN INITIAL PROTOTYPE SYSTEM
An initial prototype system of spoken document understanding and organization for efficient retrieval/browsing applications, as discussed above, has been successfully developed at National Taiwan University. This system is briefly summarized here. Broadcast news segments (primarily in speech form only, but some including video parts) are taken as the example spoken/multimedia documents. The document collection to be retrieved and browsed, D = {di, i = 1, 2, . . . , N}, includes roughly 130 hours of about 7,000 broadcast news stories, all in Mandarin Chinese. They were all recorded from radio/TV stations in Taipei from February 2002 to May 2003. Because of the special structure of the Chinese language, it has been found that special efforts in selecting the indexing terms t mentioned above (usually words for western languages, but they can be segments of one or more syllables, characters, or words in Chinese) may result in significantly better performance in both IR [14], [18] and spoken document understanding and organization [50], [51], [56].

BRIEF SUMMARY FOR THE TECHNOLOGIES USED IN THE INITIAL PROTOTYPE SYSTEM
The spoken document retrieval system was implemented using a combination of several different syllable/word-level indexing terms [14], [18]. Only the VSM model for literal term matching for IR was used, for simplicity, although it has been shown that the hybrid retrieval model, or those with a structure similar to PLSA for concept matching, gave better performance for Chinese spoken document retrieval [57], [58].

For NE extraction, the broadcast news segments were first transcribed into word graphs, on which words with higher confidence scores were identified. These words were used to construct queries for retrieval of the text news of the same time period (available over the Internet) to obtain text documents relevant to the spoken documents being considered. The retrieved relevant text documents were used for NE extraction and for vocabulary/language model updating for second-pass transcription. Selected parts of these corpora were also used for directly matching NEs, as mentioned previously. This is the way to extract the many NEs in Chinese news that are OOV and cannot be correctly transcribed in the word graphs. Both forward and backward PAT trees were constructed to provide complete data structures for the global context information for each of the spoken documents as well as for the retrieved relevant text documents. Both rule- and HMM-based approaches for NE extraction were incorporated. A multilayered Viterbi search was performed with a class-based language model [59] to handle the problem that an NE in Chinese may be a concatenation of several NEs of different types. In this approach, a lower-layer Viterbi search was first performed within those longer NEs, while a top-layer Viterbi search was finally performed over the whole sentence [31].
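The two-pass flow just described can be outlined as follows. This is only an outline of the processing steps in Python form; every object and method here (recognizer, text_news_archive, ne_tagger, and the confidence threshold) is a placeholder for a system component, not a real API.

```python
# Outline of the two-pass NE extraction flow described above. Every
# function here is a placeholder for a system component, not a real API.

def extract_named_entities(audio, text_news_archive, recognizer, ne_tagger):
    # Pass 1: transcribe into a word graph and keep confident words.
    graph = recognizer.transcribe(audio)
    confident_words = [w for w in graph.words()
                       if w.confidence > 0.8]       # threshold is assumed

    # Retrieve contemporaneous text news with the confident words as queries.
    relevant_texts = text_news_archive.search(confident_words)

    # Update the vocabulary and language model with the retrieved texts,
    # then re-transcribe (pass 2) so OOV NEs become recognizable.
    recognizer.update_lexicon_and_lm(relevant_texts)
    graph = recognizer.transcribe(audio)

    # Tag NEs on the second-pass transcription; global context from
    # forward/backward PAT trees and direct matching against the
    # retrieved texts would be folded in at this step.
    return ne_tagger.tag(graph.best_path(), context=relevant_texts)
```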

For spoken document segmentation, the purpose is to segment the one-hour-long news episodes into short news stories. The HMM-based segmentation approach mentioned previously was adopted.

[FIG7] Block diagram of the initial prototype system for Chinese broadcast news.

(Diagram: the Chinese Broadcast News Archive feeds Named-Entity Extraction, Segmentation, Topic Analysis and Organization, Summarization, Title Generation, and Information Retrieval; the outputs are the two-dimensional tree structure for the organized topics, the titles and summaries, and the retrieval results for an input query.)


Approaches based on sentence similarities were also implemented and tested but were found to have slightly worse performance. The total number I of topic clusters (or states) of the HMM was carefully chosen during the K-means training using a training corpus without topic labeling; it was found that I = 200 offered reasonably good results. Adaptive algorithms to adjust the transition probabilities were also implemented, including one for p1, the probability of transitioning to a different topic cluster, based on a pause duration model, and one for p2, the probability of remaining in the same topic cluster, based on a story length model. Confidence scores for the recognized terms were also considered [56], [60]. Information extraction, on the other hand, has not yet been implemented.
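The core of such an HMM segmenter is a Viterbi search in which each state is a topic cluster and a change of state marks a story boundary. The following sketch illustrates only this search; the emission scores, the fixed p1/p2 values, and the uniform split of p1 over the other clusters are simplifying assumptions (the actual system adapts p1 and p2 as described above).

```python
import numpy as np

# Minimal Viterbi sketch for HMM-based story segmentation: each state is
# a topic cluster; p2 is the probability of staying in the same cluster,
# and p1 (split evenly over the other clusters) of switching, which marks
# a story boundary. Emission scores are assumed given.
def segment(log_emission, p1=0.1, p2=0.9):
    """log_emission: (T, I) array, log P(sentence_t | topic cluster i)."""
    T, I = log_emission.shape
    stay, switch = np.log(p2), np.log(p1 / (I - 1))
    delta = log_emission[0].copy()                 # best log-score per state
    back = np.zeros((T, I), dtype=int)
    for t in range(1, T):
        # For each target state, the best predecessor either stays or switches.
        scores = delta[None, :] + np.where(np.eye(I, dtype=bool), stay, switch)
        back[t] = scores.argmax(axis=1)
        delta = scores.max(axis=1) + log_emission[t]
    # Trace back the best state sequence; boundaries are state changes.
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t][states[-1]]))
    states.reverse()
    boundaries = [t for t in range(1, T) if states[t] != states[t - 1]]
    return states, boundaries
```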

For spoken document summarization, every segmented news story was automatically given a summary, but only some preliminary work was completed. The most important sentences in the documents were automatically selected and directly concatenated to form a summary; the selected sentences were not further condensed. Different approaches for choosing the most important sentences were tested, and the two with the best performance were finally selected. The first approach used the term frequency and inverse document frequency as well as the VSM in IR. The other approach was in fact very similar, except that a simplified version of (9) mentioned above was used [61], [62]. In both cases, the sentences with the highest scores were concatenated in their original order and played in audio form as the summary, such that no recognition errors would be perceived [63]. The audio form of the summaries also complements the titles in text form.
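A minimal version of such a sentence-selection summarizer might look as follows; the length-normalized TF-IDF sentence score used here is an assumption standing in for the exact formulations in [61], [62].

```python
import math
from collections import Counter

# Sketch of a sentence-selection summarizer: score each sentence by the
# TF-IDF weights of its terms and keep the top-scoring sentences in their
# original order. The scoring details are illustrative assumptions.
def summarize(sentences, doc_freq, n_docs, ratio=0.3):
    """sentences: list of term lists; doc_freq: term -> #docs containing it."""
    def score(sent):
        tf = Counter(sent)
        return sum(freq * math.log(n_docs / (1 + doc_freq.get(t, 0)))
                   for t, freq in tf.items()) / max(len(sent), 1)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    budget = max(1, int(ratio * len(sentences)))
    keep = sorted(ranked[:budget])          # restore the original order
    return [sentences[i] for i in keep]
```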

For automatic title generation, the improved approach developed at National Taiwan University was used. It generated a title in text form for each segmented news story, in addition to the summary.

Topic analysis for spoken documents, on the other hand, was performed with the ProbMap approach based on the PLSA model. All the news stories in the document collection were first automatically divided into 20 categories, such as international political news and local business news. All the news stories within a category were then clustered into m × m latent topics Tk based on the PLSA model and organized as an m × m map. Every latent topic Tk was then characterized by the several terms (words including NEs) selected with the probabilities P(tj|Tk). These terms (words) served as the topic labels for the cluster. This map was displayed as a square of m × m blocks, with the topic labels shown on each block to indicate what the documents in this cluster are all about. As mentioned previously, the distance between two blocks on the map has to do with the relationship between the latent topics represented by the blocks. All the documents in each block (or cluster) could then be further analyzed and organized into l × l smaller latent topic clusters and represented as another l × l map in the next layer, in which the blocks or clusters again represent the fine structures of the smaller latent topics. In this manner, the relationships among the topics of the segmented news stories can be organized into a 2-D tree structure, or a multilayer map, both of which enable much easier retrieval and browsing [64].

BRIEF SUMMARY FOR THE FUNCTIONALITIES OF THE INITIAL PROTOTYPE SYSTEM
Figure 7 is the block diagram of the initial prototype system for Chinese broadcast news. There are three parts in the system: the automatic generation of titles and summaries is on the left, the IR is on the right, and the rest of the modules are in the middle. The output below is the 2-D tree structure for the organized topics of the broadcast news.

A typical example broadcast news story di within the document collection is shown in Figure 8. The waveform of the complete news story (with length 63.0 s) is shown in Figure 8(a). The waveform of the automatically generated summary, which is the concatenation of a few selected sentences (with length 8.5 s), is shown in Figure 8(b). The text of this summary translates as, "A military coup took place in Sao Tome, a country in West Africa. The prime minister and many government officers were detained. UN Secretary General Annan issued a statement of strong condemnation."


[FIG8] A typical example of the spoken document (a broadcast news story) in the collection: (a) the complete audio signal waveform, (b) the waveform of the automatically generated summary (a few selected sentences), and (c) the automatically generated title (in text form).



The transcription of this summary includes several character errors (not shown here); however, since the speech waveform of the summary was simply extracted from the original audio signal, the summary as played is completely correct. The automatically generated title in text form is printed in Figure 8(c); its English translation is "UN Secretary General Annan strongly condemned the military coup in Sao Tome." This title is not a sentence in the original news story but was automatically put together from a set of the most important NEs. It should be pointed out that all the NEs here are OOV words; yet they are correctly recovered, and the title is actually smoothly readable. Note that the summary and title here may not be very good for the original news story; while certainly worse than those generated by humans, they may be reasonably good for retrieval/browsing purposes. The algorithms for automatically generating the summary and the title shown in Figure 8 were applied to all the 7,000 news stories in the collection as mentioned above. The generated summaries and titles were all stored in the collection together with the original news stories. Figure 8 is a typical example out of the 7,000 stories.

[FIG9] The top-down browsing and bottom-up retrieval functionalities of the initial prototype system.

(Screen contents: the "Broadcast News Retrieval/Browsing System" home page (a) lists the news categories, e.g., [International Political News], [Local Political News], [International Business], [Local Business], [International Entertainment], [Local Entertainment], [International Sports], and [Local Sports], each linked to a topic map; (b) and (c) show two layers of 2-D topic maps; (d) lists the automatically generated titles, each with a [sum.] button and a date, for the stories in a leaf node; (e) lists the retrieval results with timestamps and [Summary] buttons, together with "Go to Level-1"/"Go to Level-2" navigation.)


The functionalities of the initial broadcast news retrieval/browsing prototype system are shown in Figure 9 and described below. First consider the top-down browsing functionalities. The home page of the browsing system lists the 20 news categories, as in Figure 9(a) (not completely shown). When the user clicks the first category, "international political news," as shown here, a 2-D map with a 3 × 3 latent topic structure (nine blocks) appears, as shown in Figure 9(b) (only four blocks are shown here). Each block represents a major latent topic in the area of "international political news" for the news collection, characterized by roughly four topic labels (terms selected with probabilities) shown in the block. As can be found, the block in the upper-right corner has the labels (in Chinese) "Israel," "Arafat," "Palestine," and "Gaza City," which clearly specify the topic. The block to its left, on the other hand, has the labels "Iraq," "Baghdad," "American Army," and "Marine Corps," whereas the block below it, in the middle right, has the labels "United Nations," "Security Council," "military inspectors," and "weapons." Clearly, these are all different but related topics, and the distance between the blocks has to do with the relationships between the latent topics. The user can then click one of the blocks (for example, the one in the upper-right corner, as shown here) to see the next-layer 3 × 3 map for the fine structure of smaller latent topics for this cluster, as shown in Figure 9(c). As can be found in Figure 9(c), the block in the upper-right corner now has the labels "Israel," "Shilom," "Jordan River," and "USA," while the block below it has the labels "Middle East," "Powell," "peace," and "roadmap," and so on. Apparently, the collection of broadcast news stories is now organized in a 2-D tree structure, or a multilayer map, for better indexing and easier browsing. Here the second-layer clusters are in fact the leaf nodes, and the user may wish to see all the news stories within such a node. With just a click, the automatically generated titles for all the news stories clustered into that node are shown in a list, as in Figure 9(d) for the upper-middle small block in Figure 9(c), labeled with "Arafat." This list includes the automatically generated titles for the five news stories clustered into this block, together with the position of this node within the 2-D tree, as shown in the lower-right corner of the screen. The user can further click the "summary" button after each title to listen to the automatically generated summary, or click the title to listen to the complete news story. This 2-D tree structure with topic labels and the titles/summaries is therefore very helpful for browsing the news stories.

The retrieval functionalities, on the other hand, are generally bottom-up. The screen of the retrieval system output for an input query (in either speech or text form), "Please find news stories relevant to Israel and Arafat" (in Chinese), is shown in Figure 9(e). A nice feature of this system is that all the retrieved news stories, as listed in the upper half of Figure 9(e), have automatically generated titles and summaries. The user can therefore select the news stories by browsing through the titles or listening to the summaries, rather than listening to an entire news story only to find that it was not the one he was looking for. The user can also click another functional button to see how a selected retrieved news item is located within the 2-D tree structure mentioned previously, in a bottom-up process. For example, if he selects the second item in the title list of Figure 9(e), "Arafat objected to Israel's proposal for conditions of lifting the siege" (in Chinese), he can see the list of news titles in Figure 9(d), including the titles of all the news stories clustered in that smaller latent topic (or leaf node). Alternatively, he can go one layer up to see the structure of the different smaller latent topics in Figure 9(c), or go up one layer further to see the structure of the different major latent topics in Figure 9(b), and so on. This bottom-up process is very helpful for the user to identify the desired news stories or find related news stories, even if they are not retrieved in the first step shown in Figure 9(e).

We also successfully implemented a prototype subsystem that allows the user to retrieve the broadcast news via a PDA using speech queries. A small client program was implemented on the PDA to transmit the acoustic features of the speech query to the IR server. The retrieved results are then sent back to the PDA, and the user can use the PDA to browse through the titles and the 2-D tree described above, listen to the summaries of the retrieved documents, and click to play the audio files for the complete news stories sent from the audio streaming server.

PERFORMANCE EVALUATION FOR THE INITIAL PROTOTYPE SYSTEM
The performance evaluation for the initial prototype system, especially for each of the individual modules, is reported below.

TRANSCRIPTION AND RETRIEVAL OF THE BROADCAST NEWS
For transcription of the broadcast news, a corpus of 112 hours of radio and TV broadcast news collected in Taipei from 1998 to 2004 was used in acoustic model training. This is different from the 130 hours of the document collection D = {di, i = 1, 2, . . . , N} used in the initial prototype system. Out of this corpus, 7.7 hours of speech were equipped with orthographic transcriptions, of which 4.0 hours were used to bootstrap the acoustic model training and the other 3.7 hours were used for testing. The other 104.3 hours of untranscribed speech were used for lightly supervised acoustic model training [65]. The language model used consisted of bigram and trigram models estimated from a news text corpus of roughly 170 million Chinese characters with Katz smoothing. Character and syllable error rates of 14.29% and 8.91%, respectively, were achieved for the transcription task [65]. Note that the word error rate (WER) is not a good performance measure for the Chinese language in general, due to the ambiguity in segmenting a Chinese sentence into words; therefore, it is not used here.

In the retrieval experiments, a set of 20 simple queries with lengths of one to several words, in both text and speech forms, was manually created. Four speakers (two males and two females) produced the 20 queries using an Acer n20 PDA with its original microphone in an environment of slight background noise. To recognize these spoken queries, another read speech corpus, consisting of 8.5 hours of speech produced by an additional 39 male and 38 female speakers over the same type of PDA, was used for training the speaker-independent HMMs for recognition of the spoken queries. Significantly higher character and syllable error rates of 27.61% and 19.47%, respectively, were obtained for the spoken queries as compared to those for the broadcast news mentioned above. The retrieval experiments were performed with respect to a collection of about 21,000 broadcast news stories, all recorded in Taipei from 2001 to 2004 (including the 7,000 used in the initial prototype system presented here). The results in terms of the mean average precision at a document cutoff of ten were 0.8038 and 0.6237 for text and spoken queries, respectively. At a document cutoff of 30, the mean average precisions were 0.6692 and 0.5232 for text and spoken queries, respectively.
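For reference, mean average precision at a document cutoff can be computed as in the following sketch, using the usual IR definition; the exact variant used in these experiments may differ in detail.

```python
# Mean average precision at a document cutoff, in the usual IR sense.
def average_precision_at(ranked_ids, relevant_ids, cutoff=10):
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant hit
    denom = min(len(relevant_ids), cutoff)
    return precision_sum / denom if denom else 0.0

def mean_average_precision(runs, cutoff=10):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision_at(r, rel, cutoff)
               for r, rel in runs) / len(runs)
```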

PERFORMANCE EVALUATION FOR NE EXTRACTION
The performance evaluation for NE extraction was performed with both text documents and the broadcast news (spoken documents). For the test with text documents, the Chinese test corpus of the Multilingual Entity Task (MET-2) of MUC-7 [26] was used, which included 100 Chinese documents on the same topic with manually identified reference NEs. The evaluation was based on the recall rate r1, the precision rate r2, and the F1 score

F_1 = \frac{2 r_1 r_2}{r_1 + r_2},   (12)

as defined by the MUC-7 test references. The results are listed in the upper half of Table 1, in which the baseline (BSL) was based on a recently published, very successful algorithm including the multilayered Viterbi search with some special approaches [59]. The improved approach (IMP) was that described earlier, using PAT trees to consider global context information but not including the techniques specifically developed for spoken documents. Very significant improvements can be found with IMP as compared to BSL; with IMP, the recall/precision rates and F1 scores for the three types of NEs are all well above 0.90. In fact, these numbers for IMP in Table 1 represent the best published results for this MET-2 task for the Chinese language up to this moment. For the test with broadcast news, 200 news stories broadcast in September 2002 and recorded in Taipei were used as the test corpus. The manually identified NEs were taken as references, including 315 person names, 457 location names, and 500 organization names. Many of them are actually OOV words. The text news corpora searched to recover the NEs in the broadcast news that are OOV words were the Chinese text news available from the "Yahoo! Kimo News Portal" [66] for the whole month of September 2002, including about 38,800 text news stories. The results are listed in the lower half of Table 1, in which BSL denotes exactly the same approach as the BSL used for text documents mentioned above, performed directly on the transcriptions of the broadcast news. IMP denotes exactly the approaches reported earlier, including using the knowledge obtained from the retrieved text news corpora, using PAT trees to analyze the global context information, and considering confidence scores on the word graphs. There was a clear gap between the results for spoken documents and those for text documents, but the proposed IMP approach offered very significant improvements as compared to BSL in almost all cases. The overall F1 score for IMP reached exactly 0.80, which was very satisfactory for Chinese broadcast news that included many OOV words.

EXPERIMENTS      APPROACH  NAMED ENTITIES     RECALL (r1)  PRECISION (r2)  F1 SCORE
TEXT DOCUMENTS   BSL       PERSON NAME        0.94         0.66            0.775
                           LOCATION NAME      0.90         0.77            0.831
                           ORGANIZATION NAME  0.75         0.89            0.814
                 IMP       PERSON NAME        0.95         0.96            0.955
                           LOCATION NAME      0.95         0.92            0.935
                           ORGANIZATION NAME  0.91         0.96            0.934
BROADCAST NEWS   BSL       PERSON NAME        0.65         0.73            0.688
                           LOCATION NAME      0.81         0.62            0.702
                           ORGANIZATION NAME  0.77         0.44            0.560
                           OVERALL            0.74         0.53            0.618
                 IMP       PERSON NAME        0.73         0.85            0.785
                           LOCATION NAME      0.87         0.91            0.890
                           ORGANIZATION NAME  0.67         0.95            0.740
                           OVERALL            0.74         0.88            0.800

[TABLE 1] PERFORMANCE EVALUATION FOR CHINESE NAMED ENTITY EXTRACTION FROM TEXT DOCUMENTS AND BROADCAST NEWS.

PERFORMANCE EVALUATION FOR BROADCAST NEWS SEGMENTATION, SUMMARIZATION, AND TITLE GENERATION
Preliminary tests for the segmentation of Chinese broadcast news were performed with the TDT 2001 evaluation data [67]. TDT-2 was used as the training corpus, including 55.3 hours of audio signals, or about 3,000 news stories. TDT-3 was employed as the testing corpus, including 127.0 hours of audio signals, or about 4,600 news stories, all in Mandarin Chinese. The segmentation cost defined by the TDT evaluation, which includes the costs for false alarms and misses, was used here [67]. Very good initial results were obtained. It was found that the confidence measures and the adaptive adjustment of the transition probabilities p1 and p2 (by a pause duration model and a story length model, respectively) for the HMM, as mentioned previously, can offer very significant improvements. Proper choice of the terms t (e.g., segments of two syllables or two characters), considering the structure of the Chinese language, to replace the role of the words commonly used for western languages certainly made the difference. The lowest segmentation cost is only slightly above 0.08 [56], [60].

In the preliminary tests for broadcast news summarization, the training corpus included roughly 150,000 news stories in text form from January to December 2000, provided by the Central News Agency of Taipei. These samples were used to calculate the IDF and other statistical parameters. The testing corpus included 200 news stories broadcast in August 2001 by a few radio stations in Taipei. Three human subjects (students at National Taiwan University) were requested to do the human summarization in two forms: first, rank the importance of the sentences in each transcribed news story from the top to the middle (since here we simply try to select the most important sentences as the summary); and second, write an abstract for the news story with a length roughly 25% of that of the original news story. Two summarization ratios, 20% and 30%, were tested; these are the ratios of the summary length to the total length. In each case, the two human-produced summaries were used: the first, y1, was the concatenation of the top several important sentences selected by the human subject; the second, y2, was simply the abstract he wrote himself. The summarization accuracy for the jth news story, Aj, was then the averaged similarity score [56], [63], [68] for the machine-produced summary, y, with respect to y1 and y2,

A_j = \frac{1}{2} \left[ S(y, y_1) + S(y, y_2) \right],   (13)

where the similarity scores S(y, y1) and S(y, y2) were calculated based on the vectors of the TF/IDF values. In this way, higher accuracy is obtained when more of the words that were important in the news stories are included in the machine-produced summaries. The final summarization accuracy was then the average of Aj in (13) over all the 200 news stories and all three human subjects. The final summarization accuracy was found to be slightly above 0.381 and 0.422 for the 20% and 30% summarization ratios, respectively. Proper choice and reasonable combination of a few word- and subword-level terms t during the process of automatic summarization were certainly key to achieving better accuracy [56], [63].
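The accuracy measure in (13) is straightforward to compute once the TF/IDF vectors are built. The following sketch uses a plain cosine similarity for S; the exact weighting used in [56], [63], [68] may differ.

```python
import math
from collections import Counter

# Sketch of the accuracy measure in (13): cosine similarity between
# TF-IDF vectors of the machine summary and each human summary, averaged.
def tfidf_vector(terms, doc_freq, n_docs):
    tf = Counter(terms)
    return {t: f * math.log(n_docs / (1 + doc_freq.get(t, 0)))
            for t, f in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def accuracy(machine, human_ranked, human_written, doc_freq, n_docs):
    """A_j = (1/2)[S(y, y1) + S(y, y2)] with S a TF-IDF cosine similarity."""
    y = tfidf_vector(machine, doc_freq, n_docs)
    y1 = tfidf_vector(human_ranked, doc_freq, n_docs)
    y2 = tfidf_vector(human_written, doc_freq, n_docs)
    return 0.5 * (cosine(y, y1) + cosine(y, y2))
```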

In the preliminary tests for title generation, the same training corpus used in the summarization experiments (roughly 150,000 news stories in text form) was used in training, except that here the human-generated titles for all the text news stories were used in training as well; 210 broadcast news stories recorded in 2001 were used in testing. The reference titles for these broadcast news stories were produced by students of the Graduate Institute of Journalism of National Taiwan University and were used in the performance measures presented below. The objective performance measure was based on the F1 score in (12), where the precision and recall rates were calculated from the number of identical Chinese characters in the automatically generated and human-generated titles. In addition, a five-level subjective human evaluation was also performed, where five was the best and one was the worst. Two different metrics were used in the subjective human evaluation: "relevance," calibrating the correlation between the automatically generated titles and the broadcast news, and "readability," indicating whether the automatically generated title was readable. In performing the subjective human evaluation, each subject was given in advance example titles with reference scores of five, three, and one for both "relevance" and "readability" for some example broadcast news stories. They were then asked to follow the calibration of the examples in giving scores between five and one, so the results could be more consistent across different subjects. The best results were obtained with proper choice of the terms t (segments of two or three syllables/characters to replace the role of words in western languages) and careful integration of them considering the structure of the Chinese language. It was found that the new approach developed at National Taiwan University performed much better than the few previously proposed approaches. An F1 score slightly above 0.356 was obtained, with average scores of 3.294 and 4.615 for "relevance" and "readability," respectively, in the subjective human evaluation [50], [51], [64].
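The objective measure is easy to reproduce: character-level recall and precision plugged into (12). In the sketch below, matching character multisets is an assumption about how the identical characters were counted.

```python
from collections import Counter

# Character-level precision/recall/F1 between an automatic title and a
# reference title, per (12); multiset matching is an assumption.
def title_f1(auto_title, ref_title):
    common = sum((Counter(auto_title) & Counter(ref_title)).values())
    if not common:
        return 0.0
    r1 = common / len(ref_title)     # recall
    r2 = common / len(auto_title)    # precision
    return 2 * r1 * r2 / (r1 + r2)
```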

PERFORMANCE EVALUATION FOR TOPIC ANALYSIS AND ORGANIZATION FOR BROADCAST NEWS
Very rigorous performance evaluation for the ProbMap approach has been performed based on the TDT-3 Chinese broadcast news corpus. A total of about 4,600 news stories in this corpus were used to train the 2-D tree structure for the topics. A total of 47 different topics have been manually defined in TDT-3, and each news story was manually assigned to one of the topics or assigned as "out of topic." These 47 classes of news stories with given topics were used as the reference for the two evaluation measures defined below.

Intuitively, those news stories manually assigned to the same topic should be located on the map as close to each other as possible, while those manually assigned to different topics should be located on the map as far away from each other as possible. We therefore define the "between-class to within-class" distance ratio R as

R = \frac{h_B}{h_w},   (14)

where hB is the average of the distances on the map over all pairs of news stories manually assigned to different topics (thus the "between-class distance"), and hw is the similar average, but over all pairs of news stories manually assigned to identical topics (thus the "within-class distance"). So the ratio R in (14) describes how distant the news stories with different manually defined topics are on the map. Apparently, the higher the value of R, the better.

On the other hand, for each news story di, the probability P(Tk|di) for each latent topic Tk, k = 1, 2, . . . , K, is given by the model. Thus, the total entropy for the topic distribution of the whole document collection with respect to the organized topic clusters can be defined as

H = \sum_{i=1}^{N} \sum_{k=1}^{K} P(T_k|d_i) \log \frac{1}{P(T_k|d_i)},   (15)

where N is the total number of news stories used in the evaluation. Apparently, lower total entropy means the news stories have probability distributions more focused on fewer topics.
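Both measures are simple to compute from the model outputs, as the following sketch shows; the inputs (map coordinates per story, manual topic labels, and the matrix of P(Tk|di)) are illustrative.

```python
import numpy as np

# Sketch of the two evaluation measures: the between/within distance
# ratio R in (14) over map coordinates, and the topic entropy H in (15).
def distance_ratio(coords, labels):
    """coords: (N, 2) float array of map positions; labels: (N,) topics."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    h_within = d[same & off_diag].mean()      # pairs with identical topics
    h_between = d[~same].mean()               # pairs with different topics
    return h_between / h_within

def topic_entropy(P_topic_doc):
    """P_topic_doc: (N, K) array of P(Tk|di)."""
    P = np.clip(P_topic_doc, 1e-12, None)
    return float(-(P * np.log(P)).sum())
```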

Table 2 lists the results for the two performance measures proposed above. There are several choices of terms considering the special structure of the Chinese language, i.e., W (words), S2 (segments of two syllables), C2 (segments of two characters), and their combinations. As can be seen, words [W, row 1] were certainly not a good choice of terms for the purposes of topic analysis here. Segments of two syllables [S2, row 2] were apparently better, with a much higher distance ratio R and a much lower total entropy H. Segments of two characters [C2, row 3] turned out to be even better. The last row indicates that the integration of S2 and C2 may be another good choice, with a better distance ratio R, though a slightly higher total entropy H. This is consistent with the results in retrieval, segmentation, summarization, and title generation: in all these cases, the word is not a good choice of term for Chinese spoken documents, considering the structure of the Chinese language.

CHOICE OF TERMS   DISTANCE RATIO R   TOTAL ENTROPY H
1) W              2.34               5135.62
2) S2             3.38               4637.71
3) C2             3.65               3489.21
4) S2 + C2        3.78               4096.68

[TABLE 2] PERFORMANCE EVALUATION FOR THE PROBMAP APPROACH FOR TOPIC ANALYSIS AND ORGANIZATION OF BROADCAST NEWS.

CONCLUSION
Speech usually carries the core concepts for the ever-increasing multimedia content in the network era, and therefore spoken document understanding and organization will be the key to efficient retrieval/browsing applications in the future. This article presented a concise, comprehensive, and integrated overview of the various technology areas reaching towards such a goal in a unified context. The technology areas covered here include NE extraction, segmentation, and information extraction for spoken documents, as well as automatic summarization, title generation, and topic analysis and organization. The relevant problems and issues, general principles, and basic approaches for each area were briefly reviewed. A framework for properly integrating all these different technology areas was proposed, in which four different levels of processes were defined (the term, concept, summary, and topic levels) and the bottom-up and top-down relationships were discussed. An initial prototype system for such purposes, recently developed at National Taiwan University, was also presented. This system used broadcast news in Mandarin Chinese as the example spoken documents. Preliminary performance results for the various functionalities of the initial prototype system were reported as well.

AUTHORS
Lin-shan Lee received a B.S. degree from National Taiwan University in 1974 and a Ph.D. from Stanford University in 1977. He has been a professor at National Taiwan University since 1982. He also holds a joint appointment with Academia Sinica as a research fellow. His research areas include digital communications and spoken language processing. He developed several of the earliest versions of Chinese spoken language processing systems in the world, including text-to-speech systems, natural language analyzers, voice dictation systems, spoken document retrieval systems, and spoken dialogue systems. He is a member of the Permanent Council of the International Conference on Spoken Language Processing (ICSLP) and a board member of the International Speech Communication Association (ISCA). He was the vice president for International Affairs and the Awards Committee chair of the IEEE Communications Society. He is a Fellow of the IEEE.

Berlin Chen received his B.S. and M.S. degrees in computer science and information engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1994 and 1996, respectively. He received his Ph.D. in computer science and information engineering in 2001 from National Taiwan University, Taipei. He is currently an assistant professor at National Taiwan Normal University. His research interests include acoustic and language modeling, search algorithms for large-vocabulary continuous speech recognition, and speech IR and organization.

REFERENCES
[1] B.H. Juang and S. Furui, "Automatic recognition and understanding of spoken language—A first step toward natural human-machine communication," Proc. IEEE, vol. 88, no. 8, pp. 1142–1165, 2000.

[2] CMU Informedia Digital Video Library project [Online]. Available: http://www.informedia.cs.cmu.edu/

[3] Multimedia Document Retrieval project at Cambridge University [Online]. Available: http://mi.eng.cam.ac.uk/research/Projects/Multimedia_Document_Retrieval/

[4] D.R.H. Miller, T. Leek, and R. Schwartz, "Speech and language technologies for audio indexing and retrieval," Proc. IEEE, vol. 88, no. 8, pp. 1338–1353, 2000.

[5] S. Whittaker, J. Hirschberg, J. Choi, D. Hindle, F. Pereira, and A. Singhal, "SCAN: Designing and evaluating user interface to support retrieval from speech archives," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1999, pp. 26–33.

[6] A. Merlino and M. Maybury, "An empirical study of the optimal presentation of multimedia summaries of broadcast news," in Automated Text Summarization, I. Mani and M. Maybury, Eds. Cambridge, MA: MIT Press, 1999, pp. 391–401.

[7] SpeechBot Audio/Video Search at Hewlett-Packard (HP) Labs [Online]. Available: http://www.speechbot.com/

[8] S. Furui, "Recent advances in spontaneous speech recognition and understanding," in Proc. ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003, pp. 1–6.

[9] Y.Y. Wang, L. Deng, and A. Acero, "Spoken language understanding," IEEE Signal Processing Mag., vol. 22, no. 5, pp. 16–31, Sept. 2005.

[10] Text Retrieval Conference [Online]. Available: http://trec.nist.gov/

[11] G. Salton and M.E. Lesk, "Computer evaluation of indexing and text processing," J. ACM, vol. 15, no. 1, pp. 8–36, 1968.



[12] J.M. Ponte and W.B. Croft, "A language modeling approach to information retrieval," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1998, pp. 275–281.

[13] D.R.H. Miller, T. Leek, and R. Schwartz, "A hidden Markov model information retrieval system," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1999, pp. 214–221.

[14] B. Chen, H.M. Wang, and L.S. Lee, "A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents," ACM Trans. Asian Lang. Inform. Processing, vol. 3, no. 2, pp. 128–145, 2004.

[15] G.W. Furnas, S. Deerwester, S.T. Dumais, T.K. Landauer, R. Harshman, L.A. Streeter, and K.E. Lochbaum, "Information retrieval using a singular value decomposition model of latent semantic structure," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1988, pp. 465–480.

[16] J.R. Bellegarda, "Latent semantic mapping," IEEE Signal Processing Mag., vol. 22, no. 5, pp. 70–80, Sept. 2005.

[17] T. Hofmann, "Probabilistic latent semantic indexing," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1999, pp. 50–57.

[18] B. Chen, H.M. Wang, and L.S. Lee, "Discriminating capabilities of syllable-based features and approaches of utilizing them for voice retrieval of speech information in Mandarin Chinese," IEEE Trans. Speech Audio Processing, vol. 10, no. 5, pp. 303–314, 2002.

[19] K. Ng and V.W. Zue, "Subword-based approaches for spoken document retrieval," Speech Commun., vol. 32, no. 3, pp. 157–186, 2000.

[20] S. Srinivasan and D. Petkovic, "Phonetic confusion matrix based spoken document retrieval," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 2000, pp. 81–87.

[21] B. Chen, H.M. Wang, and L.S. Lee, "Improved spoken document retrieval by exploring extra acoustic and linguistic cues," in Proc. European Conf. Speech Communication and Technology, 2001, pp. 299–302.

[22] E. Chang, F. Seide, H. Meng, Z. Chen, Y. Shi, and Y.C. Li, "A system for spoken query information retrieval on mobile devices," IEEE Trans. Speech Audio Processing, vol. 10, no. 8, pp. 531–541, 2002.

[23] A. Singhal and F. Pereira, "Document expansion for speech retrieval," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1999, pp. 34–41.

[24] K. Koumpis and S. Renals, "Content-based access to spoken audio," IEEE Signal Processing Mag., vol. 22, no. 5, pp. 61–70, Sept. 2005.

[25] D.M. Bikel, R. Schwartz, and R.M. Weischedel, "An algorithm that learns what's in a name," Mach. Learn., vol. 34, no. 1–3, pp. 211–231, 1999.

[26] Message Understanding Conference [Online]. Available: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/

[27] M. Federico, N. Bertoldi, and V. Sandrini, "Bootstrapping named entity recognition for Italian broadcast news," in Proc. ACL Conf. Empirical Methods in Natural Language Processing, 2002, pp. 296–303.

[28] A. Kobayashi, F.J. Och, and H. Ney, "Named entity extraction from Japanese broadcast news," in Proc. European Conf. Speech Communication and Technology, 2003, pp. 1125–1128.

[29] D.E. Appelt, R.H. Jerry, J. Bear, D. Israel, M. Kameyama, A. Kehler, D. Martin, K. Myers, and M. Tyson, "SRI International FASTUS system MUC-6 test results and analysis," in Proc. 6th Message Understanding Conf. (MUC-6), 1995, pp. 237–248.

[30] A. Borthwick, "A maximum entropy approach to named entity recognition," Ph.D. dissertation, New York Univ., New York, 1999.

[31] Y.I. Liu, "An initial study on named entity extraction from Chinese text/spoken documents and its potential applications," M.S. thesis, National Taiwan Univ., July 2004.

[32] D.D. Palmert, M. Ostendorf, and J.D. Burger, "Robust information extraction from spoken language data," in Proc. European Conf. Speech Communication and Technology, 1999, pp. 1035–1038.

[33] D.D. Palmert, "Modeling uncertainty for information extraction from speech data," Ph.D. dissertation, Univ. Washington, 2001.

[34] A. Fujii, K. Itou, and T. Ishikawa, "A method for open-vocabulary speech-driven text retrieval," in Proc. 2002 Conf. Empirical Methods in Natural Language Processing, 2002, pp. 188–195.

[35] D. Beeferman, A. Berger, and J. Lafferty, "Statistical models for text segmentation," Mach. Learn., vol. 34, no. 1–3, pp. 1–34, 1999.

[36] T. Kawahara, M. Hasegawa, K. Shitaoka, T. Kitade, and H. Nanjo, "Automatic indexing of lecture presentations using unsupervised learning of presumed discourse markers," IEEE Trans. Speech Audio Processing, vol. 12, no. 4, pp. 409–419, 2004.

[37] J.P. Yamron, I. Carp, L. Gillick, S. Lowe, and P. van Mulbreget, "A hidden Markov model approach to text segmentation and event tracking," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 1998, vol. 1, pp. 333–336.

[38] W. Greiff, A. Morgan, R. Fish, M. Richards, and A. Kundu, "Fine-grained hidden Markov modeling for broadcast-news story segmentation," in Proc. Human Language Technology Conf., 2001.

[39] D.M. Blei and P.J. Moreno, "Topic segmentation with an aspect hidden Markov model," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 2001, pp. 343–348.

[40] D.E. Appelt and D.J. Israel, "Introduction to information extraction technology," in Proc. Int. Joint Conf. Artificial Intelligence, 1999.

[41] R. Engels and B. Bremdal, "Information extraction: State-of-the-art report," On-To-Knowledge Consortium, CognIT a.s, Asker, Norway, IST Project IST-1999-10132, 2000.

[42] C. Manning and H. Schutze, Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.

[43] I. Mani and M.T. Maybury, Eds., Advances in Automatic Text Summarization. Cambridge, MA: MIT Press, 1999.

[44] Y. Gong and X. Liu, "Generic text summarization using relevance measure and latent semantic analysis," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 2001, pp. 19–25.

[45] J. Goldstein, M. Kantrowitz, V. Mittal, and J. Carbonell, "Summarizing text documents: Sentence selection and evaluation metrics," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1999, pp. 121–128.

[46] S. Furui, T. Kikuchi, Y. Shinnaka, and C. Hori, "Speech-to-text and speech-to-speech summarization of spontaneous speech," IEEE Trans. Speech Audio Processing, vol. 12, no. 4, pp. 401–408, 2004.

[47] R. Jin and A.G. Hauptmann, "Title generation for spoken broadcast news using a training corpus," in Proc. Int. Conf. Spoken Language Processing, 2000, pp. 680–683.

[48] Y. Yang and C.G. Chute, "An example-based mapping method for text classification and retrieval," ACM Trans. Inform. Syst., vol. 12, no. 3, pp. 252–277, 1994.

[49] M. Witbrock and V. Mittal, "Ultra-summarization: A statistical approach to generating highly condensed non-extractive summaries," in Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1999, pp. 315–316.

[50] S.C. Chen and L.S. Lee, "Automatic title generation for Chinese spoken documents using an adaptive K-nearest-neighbor approach," in Proc. European Conf. Speech Communication and Technology, 2003, pp. 2813–2816.

[51] L.S. Lee and S.C. Chen, "Automatic title generation for Chinese spoken documents considering the special structure of the language," in Proc. European Conf. Speech Communication and Technology, 2003, pp. 2325–2328.

[52] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela, "Self organization of a massive document collection," IEEE Trans. Neural Networks, vol. 11, no. 3, pp. 574–585, 2000.

[53] M. Kurimo, "Thematic indexing of spoken documents by using self-organizing maps," Speech Commun., vol. 38, no. 1, pp. 29–45, 2002.

[54] T. Kohonen, Self-Organizing Maps, 2nd ed. Berlin: Springer, 1997.

[55] T. Hofmann, "ProbMap—A probabilistic approach for mapping large document collections," J. Intell. Data Anal., vol. 4, no. 2, pp. 149–164, 2000.

[56] L.S. Lee, Y. Ho, J.F. Chen, and S.C. Chen, "Why is the special structure of the language important for Chinese spoken language processing?—Examples on spoken document retrieval, segmentation and summarization," in Proc. European Conf. Speech Communication and Technology, 2003, pp. 49–52.

[57] C.J. Wang, B. Chen, and L.S. Lee, "Improved Chinese spoken document retrieval with hybrid modeling and data-driven indexing features," in Proc. Int. Conf. Spoken Language Processing [CD-ROM], 2002, pp. 1985–1988.

[58] B. Chen, J.W. Kuo, Y.M. Huang, and H.M. Wang, "Statistical Chinese spoken document retrieval using latent topical information," in Proc. Int. Conf. Spoken Language Processing, 2004, pp. 1621–1625.

[59] J. Sun, J. Gao, L. Zhang, M. Zhou, and C. Huang, "Chinese named entity identification using class-based language model," in Proc. Int. Conf. Computational Linguistics, Taipei, 2002, pp. 967–973.

[60] J.F. Chen, "Chinese spoken document segmentation with consideration of features, language models and extra information—Examples using broadcast news," M.S. thesis, National Taiwan Univ., July 2003.

[61] T. Kikuchi, S. Furui, and C. Hori, "Two-stage automatic speech summarization by sentence extraction and compaction," in Proc. IEEE and ISCA Workshop on Spontaneous Speech Processing and Recognition, 2003, pp. 207–210.

[62] C. Hori and S. Furui, "Automatic speech summarization based on word significance and linguistic likelihood," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2000, vol. 3, pp. 1579–1582.

[63] Y. Ho, "An initial study on automatic summarization of Chinese spoken documents," M.S. thesis, National Taiwan Univ., July 2003.

[64] S.C. Chen, "Initial studies on Chinese spoken document analysis—Topic segmentation, title generation, and topic organization," M.S. thesis, National Taiwan Univ., July 2004.

[65] B. Chen, J.W. Kuo, and W.H. Tsai, "Lightly supervised and data-driven approaches to Mandarin broadcast news transcription," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2004, vol. 1, pp. 777–780.

[66] Yahoo! Kimo News Portal [Online]. Available: http://tw.news.yahoo.com/

[67] The Topic Detection and Tracking 2001 (TDT-2001) Evaluation Plan [Online]. Available: http://www.nist.gov/speech/tests/tdt/tdt2001/evalplan.htm

[68] E. Hovy and D. Marcu, "Automated text summarization tutorial," in Proc. 36th Ann. Meeting Association for Computational Linguistics and 17th Int. Conf. Computational Linguistics, Montreal, Quebec, Canada, 1998.
