Content-Based Classification,
Search & Retrieval of Audio
Erling Wold, Thom Blum, Douglas Keislar, James Wheaton
Presented By: Adelle C. Knight
Agenda
Introduction
Previous Research
Analysis Techniques
Statistical Techniques
Performance
Applications
Future Work
Previous Research
Sounds traditionally described by pitch, loudness, duration, timbre
Tones of the same timbre can be identified by their similar spectral energy distributions
– Too much variation across the range of pitches and dynamic levels to “fingerprint” an instrument by a single tone
Algorithms that extract audio structure (e.g. find first occurrence of G-sharp)
– Algorithms were tuned to specific musical constructs and not appropriate for all sounds
Neural nets to index audio databases
– Some success, but it was difficult for the user to specify which features were important and which to ignore
Methods To Access Sounds
Simile
Acoustical/Perceptual Features
Subjective Features
Onomatopoeia
Accomplish Methods
1. Analysis Techniques
• Reduce sound to small set of parameters
2. Statistical Techniques
• To accomplish classification & retrieval
Analysis Techniques
Analysis & Retrieval Engine
Exact Text Search
Sound Level
Fuzzy Text Search
Speech or Musical content
1. Measure variety of acoustical features of each sound
   1. Loudness
   2. Pitch
   3. Brightness
   4. Bandwidth
   5. Harmonicity
2. Set of N features is represented as an N-vector
3. Different aural properties map to different regions of N-space
Acoustical Features:Loudness
Approximated by signal’s Root-Mean-Square (RMS) level, measured in decibels
– RMS calculated by taking a series of windowed frames of the sound and computing the square root of the sum of the squares of the sample values
Human ear: 120 dB range
Software: 100 dB range from 16-bit recordings
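The loudness measurement above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, frame size, hop size, rectangular window and the -100 dB floor are all assumptions, and the sketch divides by the frame length (the usual mean-square definition of RMS). Samples are assumed to be floats in the range -1 to 1.

```python
import math

def frame_rms_db(samples, frame_size=1024, hop=512, floor_db=-100.0):
    """Return one RMS level in dB per frame (hypothetical helper).

    0 dB corresponds to a constant full-scale signal (all samples 1.0).
    """
    levels = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Mean of the squared samples; its square root is the RMS value.
        mean_square = sum(x * x for x in frame) / frame_size
        if mean_square > 0.0:
            # 10*log10(mean_square) equals 20*log10(rms).
            levels.append(max(10.0 * math.log10(mean_square), floor_db))
        else:
            levels.append(floor_db)  # silence: clamp to the assumed floor
    return levels
```

The resulting list is the loudness trajectory of the sound, one value per frame.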
Acoustical Features:Pitch
Estimated by taking series of short-time Fourier spectra
Frequencies & amplitudes of peaks measured for each frame
Approximate Greatest Common Divisor algorithm to calculate estimate of pitch
Store as log frequency
Human ear: 20 Hz – 20 kHz
Software: 50 Hz – 10 kHz
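One way to picture an approximate greatest-common-divisor step is the sketch below. The slides do not give the algorithm's details, so everything here is an assumption for illustration: the function name, the tolerance, and the strategy of trying integer subdivisions of the lowest peak until every peak sits close to a whole-number multiple of the candidate.

```python
def approximate_gcd(peak_freqs, tolerance=0.05, max_divisor=10):
    """Estimate a fundamental frequency from spectral peak frequencies in Hz.

    Returns the largest candidate (lowest peak / k) for which every peak is
    within `tolerance` of an integer multiple, or None if none is found.
    """
    if not peak_freqs:
        return None
    lowest = min(peak_freqs)
    for k in range(1, max_divisor + 1):
        candidate = lowest / k
        # Distance of each peak's harmonic number from the nearest integer.
        if all(abs(f / candidate - round(f / candidate)) <= tolerance
               for f in peak_freqs):
            return candidate
    return None
```

For example, peaks at 880, 1320 and 1760 Hz share no harmonic series starting at 880 Hz, but they are the 2nd, 3rd and 4th harmonics of 440 Hz, which is what the sketch returns.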
Acoustical Features:Brightness
Measure of higher frequency content of signal
Computed as centroid of the short-time Fourier magnitude spectra
Stored as log frequency
Varies over same range as pitch
Can’t be less than pitch estimate at any given instant
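The centroid computation can be sketched for a single frame as below. This is an illustrative sketch only: the function names are hypothetical, a naive DFT stands in for a real FFT, no window is applied, and the result is returned in Hz (taking its logarithm would give the stored log-frequency value).

```python
import cmath
import math

def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies, magnitudes) for DFT bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags

def brightness(frame, sample_rate):
    """Spectral centroid in Hz: magnitude-weighted mean of the bin frequencies."""
    freqs, mags = magnitude_spectrum(frame, sample_rate)
    total = sum(mags)
    if total == 0.0:
        return 0.0  # silent frame: no meaningful centroid
    return sum(f * m for f, m in zip(freqs, mags)) / total
```

A pure tone has a centroid at its own frequency, which is consistent with the point that brightness cannot fall below the pitch estimate.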
Acoustical Features:Bandwidth
Difference between each frequency component and the centre frequency is taken
Summation of differences
Divide by number of components to get average
Examples:
– Single sine wave has bandwidth of 0
– Ideal white noise has infinite bandwidth
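A possible reading of the bandwidth steps is sketched below. It assumes the centre frequency is the spectral centroid and that the average is weighted by each component's magnitude; that weighting is an assumption, chosen because it makes a single sine wave come out with a bandwidth of 0 as in the example. The function names and the naive DFT are hypothetical illustration choices.

```python
import cmath
import math

def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies, magnitudes) for DFT bins 0..N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags

def bandwidth(frame, sample_rate):
    """Magnitude-weighted average distance (Hz) of components from the centroid."""
    freqs, mags = magnitude_spectrum(frame, sample_rate)
    total = sum(mags)
    if total == 0.0:
        return 0.0
    centroid = sum(f * m for f, m in zip(freqs, mags)) / total
    return sum(abs(f - centroid) * m for f, m in zip(freqs, mags)) / total
```

With two equal-amplitude tones at 1000 Hz and 3000 Hz the centroid sits at 2000 Hz and the sketch reports a bandwidth of about 1000 Hz.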
Acoustical Features:Harmonicity
Harmonic vs. Inharmonic vs. Noise
Computed by measuring deviation of sound’s line spectrum from a perfectly harmonic spectrum
Normalized range 0-1
Optional feature
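The slides do not specify how the deviation is measured or which end of the 0-1 range means "harmonic", so the following is only one hypothetical way to turn the idea into a number. It assumes a pitch estimate is already available, measures how far each spectral peak lies from the nearest whole-number multiple of that pitch, and maps a perfectly harmonic spectrum to 1.

```python
def harmonicity(peak_freqs, fundamental):
    """Hypothetical 0-1 harmonicity score (1 = perfectly harmonic line spectrum)."""
    if not peak_freqs or fundamental <= 0:
        return 0.0
    # Each deviation lies in [0, 0.5]: distance from the nearest harmonic number.
    deviations = [abs(f / fundamental - round(f / fundamental))
                  for f in peak_freqs]
    mean_deviation = sum(deviations) / len(deviations)
    return 1.0 - 2.0 * mean_deviation  # rescale so the result lies in [0, 1]
```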
Storage – Feature Vector
Trajectory in time computed but not stored
For each trajectory, computes & stores:
– Average
– Variance
– Autocorrelation
– Duration of sound
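The summary statistics listed above can be sketched for one feature trajectory as follows. The function name, the dictionary layout, the choice of a single small lag for the autocorrelation, and its normalization are assumptions made for illustration; the slides only name the four stored quantities.

```python
def summarize_trajectory(values, frame_seconds, lag=1):
    """Reduce a per-frame feature trajectory to the stored statistics.

    `values` is a non-empty list with one feature value per analysis frame;
    `frame_seconds` is the assumed time step between frames.
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    energy = sum(d * d for d in deviations)
    variance = energy / n  # population variance
    if energy == 0.0 or lag >= n:
        autocorr = 0.0
    else:
        # Normalized autocorrelation of the mean-removed trajectory at `lag`.
        autocorr = sum(deviations[t] * deviations[t + lag]
                       for t in range(n - lag)) / energy
    return {
        "mean": mean,
        "variance": variance,
        "autocorrelation": autocorr,
        "duration": n * frame_seconds,
    }
```

Concatenating these statistics for loudness, pitch, brightness, bandwidth and (optionally) harmonicity would give the N-vector for a sound.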
Training The System
For each sound entered into the database, the N-vector a is computed
Mean vector µ and covariance matrix R for the a vectors in each class are calculated, where M is the number of sounds in the class:
$\mu = \frac{1}{M} \sum_{j=1}^{M} a[j]$
$R = \frac{1}{M} \sum_{j=1}^{M} (a[j]-\mu)(a[j]-\mu)^T$
Mean + Covariance = System’s model of the perceptual property being trained by the user
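The training step can be sketched directly from the two formulas. The function name and the plain list-of-lists representation are assumptions for illustration; the 1/M normalization follows the formulas above.

```python
def train_class(vectors):
    """Return (mean, covariance) for a non-empty list of equal-length feature vectors."""
    m = len(vectors)       # number of training sounds in the class
    n = len(vectors[0])    # number of features per sound
    mean = [sum(v[d] for v in vectors) / m for d in range(n)]
    covariance = [
        [sum((v[r] - mean[r]) * (v[c] - mean[c]) for v in vectors) / m
         for c in range(n)]
        for r in range(n)
    ]
    return mean, covariance
```

Note that with very few training sounds the covariance matrix can be singular, in which case it cannot be inverted for the distance measure without some extra handling.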
Statistical Techniques
Classifying Sounds
When a new sound needs to be classified, a distance measure is calculated between the new sound’s a vector and the class model (µ, R)
Using weighted L2 (Euclidean) distance:
$D = \left((a-\mu)^T R^{-1} (a-\mu)\right)^{1/2}$
Likelihood value L based on normal distribution and given by:
$L = \exp(-D^2/2)$
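The distance and likelihood formulas can be sketched in plain Python as below. Rather than forming the inverse of R explicitly, the sketch solves R z = (a - µ) by Gaussian elimination; that choice, the function names, and the singularity threshold are assumptions made for illustration.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def distance(a, mean, covariance):
    """Weighted distance D = sqrt((a - mu)^T R^-1 (a - mu))."""
    diff = [ai - mi for ai, mi in zip(a, mean)]
    z = solve(covariance, diff)  # z = R^-1 (a - mu)
    return math.sqrt(sum(d * zi for d, zi in zip(diff, z)))

def likelihood(a, mean, covariance):
    """Likelihood L = exp(-D^2 / 2)."""
    d = distance(a, mean, covariance)
    return math.exp(-d * d / 2.0)
```

A sound lying exactly on the class mean has D = 0 and L = 1; the likelihood falls off as the sound moves away along directions in which the class has little variance.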
Retrieving Sounds
Sort sounds by all acoustic features
Example: retrieve top M sounds in a class
– Get all sounds in a hyper-rectangle centered around the mean with volume V such that V/V0 = M/M0 (V0: volume of the hyper-rectangle bounding the whole database; M0: total number of sounds)
– Compute distance measure for all sounds in the hyper-rectangle
– Return closest M sounds
– Increase the ratio & iterate if not enough sounds are returned
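The retrieval loop can be sketched as below. Several simplifications are assumptions of this sketch and not part of the described system: it scans the list linearly instead of using pre-sorted feature indexes, it uses plain Euclidean distance (`math.dist`, Python 3.8+) in place of the covariance-weighted distance, and it doubles the volume ratio on each retry. The function name and the `growth` parameter are hypothetical.

```python
import math

def retrieve_closest(vectors, mean, m, growth=2.0):
    """Return indices of up to m vectors nearest to `mean`, using a box pre-filter."""
    if not vectors:
        return []
    n_dims = len(mean)
    total = len(vectors)
    m = min(m, total)
    lows = [min(v[d] for v in vectors) for d in range(n_dims)]
    highs = [max(v[d] for v in vectors) for d in range(n_dims)]
    ratio = m / total  # start with V/V0 = M/M0
    while ratio < 1.0:
        # Shrink every side equally so the box volume is ratio * V0.
        scale = ratio ** (1.0 / n_dims)
        half = [0.5 * (highs[d] - lows[d]) * scale for d in range(n_dims)]
        candidates = [i for i, v in enumerate(vectors)
                      if all(abs(v[d] - mean[d]) <= half[d]
                             for d in range(n_dims))]
        if len(candidates) >= m:
            break
        ratio *= growth  # not enough sounds: enlarge the box and retry
    else:
        candidates = list(range(total))  # box would cover everything
    candidates.sort(key=lambda i: math.dist(vectors[i], mean))
    return candidates[:m]
```

The pre-filter keeps the expensive distance computation to a small subset of the database when the requested M is much smaller than M0.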
2 Quality Measures
1. Magnitude of covariance matrix R
• Measure of the compactness of the class
• Quality measure of classification
2. Size of individual elements of covariance matrix R
• Measure of particular dimension’s importance to the class
• User can see if feature is too important or not important enough
Segmentation
Apply acoustical analyses
Look for transitions
Transitions define segments of the signal to be treated like individual sounds
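A very simple way to look for transitions is sketched below. The slides do not say how transitions are detected, so the rule used here (a frame-to-frame jump in a single feature trajectory larger than a fixed threshold) and the function name are assumptions for illustration only.

```python
def find_segments(trajectory, threshold):
    """Split a feature trajectory into (start, end) frame ranges at large jumps."""
    if not trajectory:
        return []
    boundaries = [0]
    for i in range(1, len(trajectory)):
        # A jump larger than the threshold marks the start of a new segment.
        if abs(trajectory[i] - trajectory[i - 1]) > threshold:
            boundaries.append(i)
    boundaries.append(len(trajectory))
    return [(boundaries[k], boundaries[k + 1])
            for k in range(len(boundaries) - 1)]
```

Each returned range (end index exclusive) could then be analysed and classified as if it were a separate sound.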
Performance & Results
Laughter classification
Touchtone classification
Example: Laughter classification
Returned:
• Laughing sounds
• Animal sounds
Example: Touchtone classification
Returned:
• 1 recording out of training set
• Low likelihood: touchtone 7-digit telephone number
• High likelihood: single-digit tones
Applications
Audio databases & file systems
– Fields: file name, sample rate, sample size, file format, channels, dates, keywords, analysis feature vector, etc.
Audio database browser
– Front-end database application (e.g. SoundFisher) lets user search for sounds using queries that can be content based
– Permits general maintenance of entries: adding, deleting, describing sounds
Applications
Audio editors
– Include knowledge of audio content
– Search commands like queries, build new classes on the fly
Surveillance
– Identical to editor but identification & classification done in real time
– Detect sounds associated with criminal activity (e.g. glass breaking, screams)
Automatic segmentation of audio & video
– For large archives of raw audio & video
– Audio-to-MIDI (Studio Vision Pro 3.0)
Future Work
More analytic features
General phrase-level content based retrieval
Source separation
Sound synthesis
Conclusions