Audio mining

Presented by: Patel Prashant (CE 137)

The Web, databases, and other digitized information storehouses contain a growing volume of audio content.

Examples include newscasts, sporting events, telephone conversations, recordings of meetings, Webcasts, and documentary archives.

Users want to make the most of this material by searching and indexing the digitized audio content.

In the past, companies had to create and manually analyze written transcripts of audio content because using computers to recognize, interpret, and analyze digitized speech was difficult.

However, the development of faster microprocessors, larger storage capacities, and better speech-recognition algorithms has made audio mining easier.

Audio mining, also called audio searching, takes a text-based query and locates the search term or phrase in an audio file.

This helps users by, for example, letting them quickly get to specific places in a recorded conversation or determine when a company is mentioned in a newscast.

Audio indexing uses speech recognition to analyze an entire file and produce a searchable index of content-bearing words and their locations.

This is critical because audio content is in a binary format that is otherwise not readily searchable.

Indexing audio content thus enables searching.
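As a rough illustration of what such an index might look like, the Python sketch below maps each content-bearing word to the times at which it is spoken. The transcript, the timestamps, and the stop-word list are made-up assumptions, not the output of any real recognizer.

```python
from collections import defaultdict

# Hypothetical recognizer output: (word, start time in seconds).
transcript = [
    ("the", 0.0), ("market", 0.4), ("closed", 0.9), ("higher", 1.3),
    ("as", 1.8), ("the", 2.0), ("market", 2.2), ("rallied", 2.7),
]

# Function words that carry little content (illustrative list only).
STOP_WORDS = {"the", "as", "a", "of"}

def build_index(words):
    """Map each content-bearing word to the times at which it is spoken."""
    index = defaultdict(list)
    for word, start in words:
        if word not in STOP_WORDS:
            index[word].append(start)
    return dict(index)

index = build_index(transcript)
print(index.get("market", []))  # times at which "market" occurs
```

A text query then becomes a dictionary lookup, and the stored times let a player jump straight to the matching places in the recording.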

There are two main approaches to audio mining:

1. Text-based indexing:

o It converts speech to text and then identifies words in a dictionary that can contain several hundred thousand entries. If a word or name is not in the dictionary, the system chooses the most similar word it can find.

o The system uses language understanding to create a confidence level for its findings. For findings with less than a 100 percent confidence level, the system offers other possible word matches.

2. Phoneme-based indexing:

o It doesn’t convert speech to text but instead works only with sounds.

o The system first analyzes and identifies sounds in a piece of audio content to create a phonetic-based index. It then uses a dictionary of several dozen phonemes to convert a user’s search term to the correct phoneme string.

o Phonemes are the smallest units of speech in a language; every word is a sequence of phonemes.

o Finally, the system looks for the search term’s phoneme string in the index.
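The phoneme-based steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the pronunciation dictionary, the ARPAbet-style symbols, and the phonetic index are invented for the example, and exact sequence matching stands in for the approximate matching a real system would need.

```python
# Toy pronunciation dictionary (hypothetical; ARPAbet-style symbols).
PRONUNCIATIONS = {
    "cat": ["K", "AE", "T"],
    "catalog": ["K", "AE", "T", "AH", "L", "AO", "G"],
    "ray": ["R", "EY"],
}

def phoneticize(term):
    """Convert a search term into its phoneme string."""
    return PRONUNCIATIONS[term.lower()]

def find_matches(index, query):
    """Return every position where the query phonemes occur in the index."""
    n = len(query)
    return [i for i in range(len(index) - n + 1) if index[i:i + n] == query]

# Phonetic index for hypothetical audio in which "ray catalog" is spoken.
phonetic_index = ["R", "EY", "K", "AE", "T", "AH", "L", "AO", "G"]

print(find_matches(phonetic_index, phoneticize("catalog")))
print(find_matches(phonetic_index, phoneticize("cat")))
```

Both queries report a hit at the same position even though only "catalog" was spoken: the short term "cat" matches part of the longer word.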

A phonetic system requires a more efficient search tool because it must phoneticize the query term, then try to match it with the existing phonetic string output. This is considerably more complex than using one of the many existing text-based search tools.

Phoneme-based searches can result in more false matches than the text-based approach, particularly for short search terms, because many words sound alike or sound like parts of other words.

However, phonetic indexing can still be useful if the analyzed material contains important words that are likely to be missing from a text system’s dictionary, such as foreign terms and names of people and places.

Text- and phoneme-based systems operate in much the same way, except that the former uses a text-based dictionary and the latter uses a phonetic dictionary.

A speech recognizer converts the observed acoustic signal into the corresponding written representation of the spoken words.

Speech recognition software contains acoustic models of the way in which all phonemes are represented.

Also, there is a statistical language model that indicates how likely words are to follow each other in a specific language.

By using these capabilities, as well as complex probability analysis, the technology can take a speech signal of unknown content and convert it to a series of words.
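One simple form a statistical language model can take is a bigram model, which estimates how likely each word is to follow the previous one from counts in training text. The Python sketch below assumes a tiny made-up corpus and plain maximum-likelihood estimates; it is not the model used by any particular product.

```python
from collections import Counter

# Tiny made-up training corpus; real models are trained on far more text.
corpus = "the market closed higher the market rallied the index fell".split()

# Count adjacent word pairs, and how often each word has a successor.
bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def next_word_probability(prev, word):
    """Maximum-likelihood estimate of P(word follows prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(next_word_probability("the", "market"))
print(next_word_probability("the", "index"))
```

A recognizer can combine such probabilities with the acoustic scores to prefer word sequences that are both acoustically plausible and likely in the language.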

Figure: ScanSoft Audio Mining System

Since music is often described by genre, we would like to annotate our music data with genre.

Classification by genre is useful for music search and retrieval and also for playlist generation.

Linear and non-linear neural networks

Gaussian Classification

Gaussian mixture models

Hidden Markov model

Let us consider a simple weather forecasting problem and build a model that predicts tomorrow's weather from today’s condition. In this example the weather stays the same for a whole day and takes one of three states: sunny (S), cloudy (C), or rainy (R). From the history of the weather of the town under investigation we have the following table.

•We refer to the weather conditions by state q, sampled at instant t. The problem is to find the probability of tomorrow's weather condition given today's condition:

P(q_{t+1} / q_t).

•An acceptable approximation for a history of n instants is:

•P(q_{t+1} / q_t, q_{t-1}, q_{t-2}, ..., q_{t-n}) ≈ P(q_{t+1} / q_t).

•This is a first-order Markov chain, as the history is considered to be one instant only.

Let us now ask this question: given that today is sunny (S), what is the probability under the above model that the next five days are S, C, C, R, and S?

The answer follows from the first-order Markov chain:

P(q_1 = S, q_2 = S, q_3 = C, q_4 = C, q_5 = R, q_6 = S)

= P(S) × P(q_2 = S / q_1 = S) × P(q_3 = C / q_2 = S) × P(q_4 = C / q_3 = C) × P(q_5 = R / q_4 = C) × P(q_6 = S / q_5 = R)

= 1 × 0.7 × 0.2 × 0.8 × 0.15 × 0.15

= 0.00252

The initial probability P(S) = 1, as it is assumed that today is sunny.
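The same calculation can be checked with a few lines of Python. The sketch uses only the transition probabilities that appear in the worked example; the rest of the table is not reproduced, and the function name is chosen for illustration.

```python
# Transition probabilities used in the worked example, keyed as
# (today, tomorrow). Other table entries are not reproduced here.
transition = {
    ("S", "S"): 0.7,
    ("S", "C"): 0.2,
    ("C", "C"): 0.8,
    ("C", "R"): 0.15,
    ("R", "S"): 0.15,
}

def sequence_probability(states, initial_probability=1.0):
    """Probability of a state sequence under a first-order Markov chain."""
    probability = initial_probability
    # Multiply the transition probability of each consecutive pair of days.
    for current, following in zip(states, states[1:]):
        probability *= transition[(current, following)]
    return probability

p = sequence_probability(["S", "S", "C", "C", "R", "S"])
print(round(p, 5))  # 0.00252
```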

The model is completely defined by three sets of parameters, A, B, and π. A model of N states and M observation symbols can be referred to by:

λ = (A, B, π), where A = {a_{ij}}, B = {b_j(w_k)}, 1 <= i, j <= N and 1 <= k <= M.

a_{ij} represents the probability of a state transition (the probability of being in state S_j at the next instant given state S_i now):

a_{ij} = P(q_{t+1} = S_j / q_t = S_i).

b_j(w_k) is the probability of observing symbol w_k in state S_j.

w is the alphabet of observation symbols, and k indexes its M symbols.

π = {1, 0, 0} is the initial state probability distribution.
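Given a model of this form, the probability of an observation sequence can be computed with the forward algorithm, a standard hidden Markov model technique that these slides do not cover. The Python sketch below assumes made-up values for the transition and observation probabilities (three states, two symbols); only the initial distribution {1, 0, 0} is taken from the text.

```python
# Hypothetical 3-state, 2-symbol model; only pi comes from the text.
A = [  # A[i][j]: probability of moving from state S_i to state S_j
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
B = [  # B[j][k]: probability of observing symbol w_k in state S_j
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
]
pi = [1.0, 0.0, 0.0]  # the model always starts in the first state

def forward(observations):
    """Probability of the observed symbol sequence under the model."""
    n = len(pi)
    # Initialisation: start in state j and emit the first symbol.
    alpha = [pi[j] * B[j][observations[0]] for j in range(n)]
    # Induction: move to state j from any state i, then emit the symbol.
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
            for j in range(n)
        ]
    # Termination: sum over all possible final states.
    return sum(alpha)

print(forward([0, 1, 1]))
```

Observations are given as symbol indices (0 or 1 here). Summing over all states at each step avoids enumerating every possible state sequence.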

Each phoneme (unit of sound) could be represented by a state of different and varying duration.

Accordingly, the transition between different phonemes to form a word can be represented by A = {a_{ij}}.

The observations in this case are the sounds produced in each position. Because each sound varies as it evolves, the observations can also be represented by a probabilistic function B = {b_j(w_k)}.

A major challenge for speech recognition tools has been recognizing the speech of different users in different environments.

With this in mind, BBN, IBM, Fast-Talk, and ScanSoft have designed their audio mining technology to be largely speaker independent.

For example, Fast-Talk’s acoustic models are trained to recognize numerous speakers via exposure to audio data from speakers representing various ages, dialects, and speaking styles.

Some audio mining technology uses acoustic models tuned to understand speech from different environments, such as telephony, TV, or radio.

Designing filters to reduce the background noise that can interfere with accurate speech recognition.

Creating efficient data structures for representing content.

Developing algorithms that quickly work through the data structures during indexing and searching.

Precision is improving but it is still a key issue impeding the technology’s widespread adoption, particularly in such accuracy-critical applications as court reporting and medical dictation.

Processing conversational speech can be particularly difficult because of such factors as overlapping words and background noise.

Breakthroughs in natural language understanding will eventually lead to big improvements, but until then audio mining will get better only incrementally.

It’s currently seen as a ‘nice to have,’ not quite yet a ‘need to have’ technology.

Companies could use audio mining to analyze customer-service and helpdesk conversations or even voice mail.

Law enforcement and intelligence organizations could use the technology to analyze intercepted phone conversations.

Broadcast companies like CNN and Radio Free Asia are already using audio mining to quickly retrieve important background information from previous broadcasts when new stories break.

A US prison is using ScanSoft’s audio mining product to analyze recordings of prisoners’ phone calls to identify illegal activity.

Musical audio mining relates to the identification of perceptually important characteristics of a piece of music such as melodic, harmonic or rhythmic structure.

Searches can then be carried out to find pieces of music that are similar in terms of their melodic, harmonic and/or rhythmic characteristics.

This type of analysis can also be used in music to determine characteristics like beats per minute (BPM), musical key, and musical structure, information that is employed to classify music.

Music downloading sites that categorize music by genre use audio mining to organize the music.
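To picture how a characteristic such as BPM could feed genre classification, the Python sketch below fits one Gaussian per genre and assigns a new track to the genre with the highest likelihood. The tempo values and genre names are made up, and a single feature is an assumption for brevity; a practical classifier would use many features.

```python
import math
from statistics import mean, pstdev

# Made-up tempo values (beats per minute) for two illustrative genres.
training = {
    "ambient": [60.0, 70.0, 80.0],
    "techno": [120.0, 130.0, 140.0],
}

# Fit one Gaussian (mean, standard deviation) per genre.
models = {genre: (mean(v), pstdev(v)) for genre, v in training.items()}

def log_likelihood(x, mu, sigma):
    """Log of the Gaussian density with mean mu and std deviation sigma."""
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def classify(bpm):
    """Pick the genre whose Gaussian gives the tempo the highest likelihood."""
    return max(models, key=lambda g: log_likelihood(bpm, *models[g]))

print(classify(75.0))
print(classify(128.0))
```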

Research paper: “Let’s Hear It for Audio Mining” by Neal Leavitt.

Research paper: “Tendencies, Perspectives, and Opportunities of Musical Audio-Mining” by Ghent University, Belgium.

wikipedia.org/wiki/Audio_mining

Thank You
