Audio Fingerprinting Based on Multiple
-
Upload
kvkumar128 -
Category
Documents
-
view
216 -
download
0
Transcript of Audio Fingerprinting Based on Multiple
-
8/8/2019 Audio Fingerprinting Based on Multiple
1/4
IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 6, JUNE 2009 525
Audio Fingerprinting Based on MultipleHashing in DCT Domain
Yu Liu, Hwan Sik Yun, and Nam Soo Kim, Member, IEEE
AbstractAudio fingerprinting techniques aim at successfullyperforming content-based audio identification even when the audiosignals are slightly or seriouslydistorted. In this letter, we propose anovel audio fingerprinting technique based on multiple hashing. Inorder to improve the robustness of hashing, multiple hash stringsare generated through the discrete cosine transform (DCT) whichis applied to the temporal energy sequence in each subband. Ex-perimental results show that the proposed algorithm outperformsthe Philips Robust Hash (PRH) algorithm [1] under various distor-tions.
Index TermsAudio fingerprinting, content-based audio identi-fication, discrete cosine transform (DCT), robust hashing.
I. INTRODUCTION
AN AUDIO fingerprint is a compact low-level con-
tent-based digest of an audio signal. It provides the
ability to identify short, unlabeled audio clips in a fast and
reliable way. The applications of audio fingerprinting include
broadcast monitoring, audio connecting, file sharing and auto-
matic music library organization [1]. There are several practical
requirements which a successful audio fingerprinting system
should satisfy [2]. First, it should be able to identify corrupted
audio clips in spite of degradations. Second, it should be ableto identify the clips of only a few seconds long. Finally, it
should be computationally efficient, both in calculating the
fingerprints and in searching for the best match in the database.
The application demands and the difficulties of designing such
systems boost the interest in audio fingerprinting techniques,
and many practical issues have been studied.
Among various algorithms, the system developed by A. Wang
from Shazam [3] has been considered as a successful and wide-
spread work. Besides, the Philips Robust Hash (PRH) algorithm
[1] is also a well studied content-based audio identification tech-
nique. The robustness of the PRH algorithm has been verified
mathematically via analyzing the overall bit error probability [4]
or bit error rate (BER) [5]. It is important for any fingerprinting
algorithm that it not only results in few bit errors, but also al-
lows for efficient searching [6]. Specifically, the PRH algorithm
Manuscript received December 01, 2008; revised February 05, 2009. Currentversion published April 24, 2009. This work was supported in part by the KoreaResearch Foundation Grant funded by the Korean Government (MOEHRD)(KRF-2008-313-D00783) and in part by the Korea Science and EngineeringFoundation (KOSEF) Grant funded by the Korean Government (MOST) (R0A-2007-000-10022-0). The associate editor coordinating the review of this manu-script and approving it for publication was Prof. Vesa Valimaki.
The authors are with the School of Electrical Engineering and the Institute ofNew Media and Communications, Seoul National University, Seoul 151-742,Korea (e-mail: [email protected]).
Digital Object Identifier 10.1109/LSP.2009.2016837
relies on the assumption that at least one of the so called subfin-
gerprints is invariant to noise. Although this assumption holds
in mild conditions (e.g., MP3 compression, downsampling, and
equalization), it fails to be valid for some seriously corrupted
audio clips such as those recorded in a noisy environment, hence
results in a serious performance degradation.
For such reasons, some efforts have been delivered to im-
prove the robustness of subfingerprints. One possible method
mentioned in [1] suggests to generate a list of most probable
candidates for each subfingerprint based on the reliability infor-
mation obtained from soft coding, albeit that the reliability in-
formation is in fact not reliable and less spectacular in practical
implementations. Another method is to view extracting subfin-
gerprints from the spectrogram as 2-D filtering in spectro-tem-
poral domain and try to substitute the filters [7], which is empir-
ical and not founded on a theoretical basis [4]. These approaches
attempt to enhance the robustness of individual subfingerprints
separately while joint improvement is considered more desir-
able.
In this letter, we propose a novel audio fingerprinting tech-
nique that generatesmultiple subfingerprints for each frame, and
the corresponding database searching scheme is also extended.
Specifically, discrete cosine transform (DCT) is applied to the
temporal sequence of energies in each subband, and one sub-fingerprint is generated for each DCT coefficient. Experimental
results show that the proposed approach outperforms the PRH
algorithm under various environments.
II. PRH ALGORITHM
There are two steps in the PRH algorithm: fingerprint extrac-
tion and database searching phases. The overall block diagram
for the fingerprint extraction stage is illustrated in Fig. 1. First,
the audio signal is divided into overlapping frames with the
length of about 370 ms, and the frame shift is 1/32 of the framelength. Second, power spectrum is obtained by performing FFT,
and then the energies for 33 non-overlapping logarithmically
spaced subbands (e.g., Bark Scale) covering the frequency range
from 300 Hz to 2000 Hz are calculated. Finally, hash strings
(referred to as subfingerprints) are computed from the subband
energies in each frame as follows:
(1)
and
(2)
1070-9908/$25.00 2009 IEEE
-
8/8/2019 Audio Fingerprinting Based on Multiple
2/4
526 IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 6, JUNE 2009
Fig. 1. Overall block diagram of the fingerprint extraction stage in the PRHalgorithm.
Fig. 2. Illustration of generating candidates from the hash table in the PRHalgorithm.
in which
(3)
In (1), denotes the th subband energy in the th
frame, and is the output difference. is the
32-bit subfingerprint of frame and is the th bit of
it.
For the audio files stored in the database, all the subfinger-
prints computed are registered in a hash table with the subfin-
gerprints being treated as the keys. Each entry of the hash table
stores a list of pointers to the positions in the audio files where
the subfingerprint occurs. In the stage of database searching,
256 subfingerprints which amount to approximately 3 seconds
are extracted from the query audio, and each subfingerprint ismatched with the hash table contents to find the candidate po-
sitions where it may come from. The process to generate the
candidates is illustrated in Fig. 2. A fingerprint block with the
same size as the query block ( bits) from the
candidate position is obtained, BER between the two blocks is
computed and compared with a threshold which is set to 0.35
in [1]. If the BER is less than the threshold, the two signals are
considered similar and the candidate audio is declared as the re-
sult.
III. MULTIPLE HASHING ALGORITHM
The subfingerprints generated from the PRH algorithm arethe spectro-temporal differences between adjacent frames and
Fig. 3. Overall block diagram of the fingerprint extraction stage in the MLHmethod.
subbands. This method provides an efficient way to summa-
rize the discriminative information of the audio spectra. Since,
however, it only makes use of the information in two neigh-
boring frames, it may be vulnerable to the possible interfer-
ences. The bits in subfingerprints can be flipped due to the cor-
ruption of local noise and mislead the audio fingerprint search.
The problem can be alleviated by including more frames and
performing low-pass filtering. By applying a set of low-pass fil-
ters, the information that is invariant to noise can be extractedfor a robust discrimination. The set of filters are designed to be
orthogonal so that the output features obtained can be more dis-
tinguished from each other.
In this section, we propose a new audio fingerprinting tech-
nique called the multiple hashing (MLH) method. In the pro-
posed algorithm, DCT is applied to the temporal sequence of
energies in each subband, and a subfingerprint is constructed for
each DCT coefficient stream. The reasons for employing DCT
in the MLH method are twofold. First, among all the orthogonal
transforms, the decorrelation performance of DCT is closest to
the KarhunenLove transform [9]. Second, DCT has a strong
energy compaction property [8] implying that most of the signal
energy tends to be concentrated in a few low-frequency compo-
nents. The decorrelation property ensures that each subfinger-
print can be treated separately and performance improvement
is possible via generating more subfingerprints. If the subband
energies are evolving slowly, only a few DCT coefficients are
sufficient to describe the subfingerprints.
The framework of fingerprint extraction in the MLH system
is depicted in Fig. 3. The first three parts, i.e., framing, FFT and
band energy calculation are the same as those in the PRH algo-
rithm. However, in contrast to the PRH algorithm, before com-
puting the hash strings, -point DCT is performed on the con-
secutive subband energies
. Among the DCT coefficients, only the lower-or-dered values are retained for the computation of subfinger-
prints. As a result, we obtain coefficients for each frame
and subband denoted by . Then,
each DCT coefficient is processed in a way similar to the PRH
algorithm as follows:
(4)
where represents the th output difference of sub-
band in frame . Let denote the th subfingerprint in
frame . Then,
(5)
-
8/8/2019 Audio Fingerprinting Based on Multiple
3/4
LIU et al.: AUDIO FINGERPRINTING BASED ON MULTIPLE HASHING IN DCT DOMAIN 527
Fig. 4. Illustration of generating candidates from the hash tables in the MLH
method.
in which
(6)
Note that in (4) we use and
instead of and to ensure that
they are obtained based on band energies which do not overlap
with those used to compute and .As in the PRH system, the subfingerprints extracted from
the database are registered in hash tables to enable an efficient
searching algorithm. Since subfingerprints are computed for
each frame, hash tables are constructed, for example, the
subfingerprints obtained from the second DCT coefficients in
each frame are registered in the second hash table. The data-
base searching scheme consists of three steps. First, the query
audio is divided into 256 frames, and subfingerprints are ob-
tained in each frame as in the fingerprint extraction phase. Con-
sequently, the query fingerprint block consists of
bits, and forms the finger-
print block . Second, the candidate positions are generated in
each hash table separately, i.e., the subfingerprints in fingerprintblock are matched with the contents in the th hash table as in
the PRH algorithm, and a candidate list is created by accumu-
lating all the search results in all included hash tables. Finally,
BERs are computed by comparing the query fingerprint block
with those stored at the candidate positions in the database, and
the most hit candidatewith BERless than the specified threshold
is returned as the result. The process to generate the candidates
is illustrated in Fig. 4.
IV. EXPERIMENTAL RESULTS
To evaluate the performance of the proposed MLH approach,
we conducted several experiments under various environments.
The database used in the experiments included 1500 music files
collected from commercial compact discs. The database con-
sisted of three groups of 500 files from classical, pop and rock
genres. As for the positive queries, 1200 music clips with the
length of about 3 s were randomly chosen from the database
among which there were 393 rock, 426 classical, and 381 pop
pieces. On the other hand, in order to compute the false positive
rates, 200 music files that were not included in the database were
collected from which another 1200 music clips were randomly
extracted to form the negative queries. To assess the robustness
of the algorithm, the following distortions were applied to both
kinds of queries.
Set 1: Additive white noise with the SNR at 5 db.
Set 2: Additive white noise with the SNR at 0 db.
Set 3: Playing and recording in a very quiet environment.
Set 4: Playing and recording in office noise environment.
There are several parameters which should be determined for
the implementation of the MLH method. In our experiments, the
DCT length was set to be 16, which was considered to provide
a good compromise between the frequency and time resolutions.
As for ,we s et since more than 90% of the total energywas found to concentrate on the first four coefficients in the
tested materials. Finally, to speed up computation, we applied
the running DCT algorithm [10] since the computation of DCT
shifts one sample at each time. To compare the performances,
we also implemented the PRH algorithm [1].
As can be seen from the description of the algorithms, there
are two types of false negatives: First, a distorted positive query
clip gets rejected when the BER is higher than the threshold.
Second, dismissal of a distorted positive query clip happens
when there exist no error-free subfingerprints in the query block,
even when the BER between the query fingerprint block and
the candidate it comes from in the database is less than thethreshold. The first type of false negative is due to the tradeoff
when trying to set a proper threshold, while the second type of
false negative comes with the hashing structure in the searching
algorithm. On the other hand, there is only one category of false
positives, in which a negative query clip is accepted as in the
database since the BER between the obtained fingerprint block
and the candidate pointed by the hash table is less than the
threshold. In order to evaluate the robustness of the system, we
plotted 1- false positive rate (FPR) versus 1-false negative rate
(FNR) by varying the threshold, and the obtained receiver op-
erating characteristic (ROC) curves are shown in Fig. 5 from
which we can discover the superiority of the proposed method.
Also note that we use the term recognition rate in the followingexperiments to represent 1-FNR, therefore it differs from the
-
8/8/2019 Audio Fingerprinting Based on Multiple
4/4
528 IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 6, JUNE 2009
Fig. 5. ROC curves for four query sets; (a) Set 1. (b) Set 2. (c) Set 3. (d) Set 4.
TABLE IRECOGNITION RATES (%) OF THE PRH AND MLH ALGORITHMS
WITH DIFFERENT COMBINATIONS OF HASH TABLES
conventional recall rate in that the latter excludes only the first
type of false negatives.
In addition, to show the performance variation with different
combinations of hash tables, we set the threshold to be 0.35 as in
[1] and measured the results. The four hash tables employed are
denoted as HT1, HT2, HT3 and HT4. Here HT was built from
the th DCT coefficients, for example, HT1 was constructed
from the DC components. For comparison, the hash table con-structed in the PRH algorithm is represented as HT0. For each
query set, the MLH algorithms were applied using different
combinations of hash tables. The recognition rate achieved by
each algorithm is given in Table I. Note that HT in HT is
omitted in the table for simplicity, for example, 1, 2 means
the MLH method using hash tables HT1 and HT2.
It can be seen that the performance using HT1 was better than
that of the PRH algorithm. Moreover, as more hash tables wereadded, the recognition rate improved dramatically. However, it
is worth noting that if more hash tables are used, the memory
usage and the computational burden also increase. Specifically,
when hash tables are used, the memory usage is times
that of the PRH algorithm, and the computation complexity, in
the worst case, also becomes times. Thus it requires a careful
consideration on how many hash tables are needed depending on
the environment in which the audio fingerprinting system would
be deployed. As canbe seen from the experimental results, when
the clips are corrupted by mild noises, the MLH method using
only one hash table already achieves satisfactory results, how-
ever, for the seriously distorted clips, more hash tables should be
added to achieve a better performance at the expense of highermemory usage and computational burden.
V. CONCLUSION
In this letter, we have presented a novel audio fingerprinting
technique based on multiple hashing in DCT domain. Specif-
ically, DCT is applied to the temporal sequence of energies
in each subband and only the lower-ordered coefficients are
retained to compute the subfingerprints. Multiple hash tables
are built corresponding to the multiple subfingerprints in each
frame, and different combinations of hash tables are evaluated.
Experimental results have shown that the proposed MLH
scheme outperformed the conventional PRH algorithm undervarious conditions. Future works may include the investigation
of an efficient method for which reduced number of hash tables
are used while maintaining the high accuracy of MLH.
REFERENCES
[1] J. Haitsma and T. Kalker, A highly robust audio fingerprintingsystem, in Proc. 3rd Int. Conf. Music Information Retrieval, Oct.2002, pp. 107115.
[2] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, A review of audio fin-gerprinting, J. VLSI Signal Process., vol. 41, no. 3, pp. 271284, Nov.2005.
[3] A. Wang, An industrial strength audio search algorithm, in Proc. 4thInt. Conf. Music Information Retrieval, Oct. 2003, pp. 713.
[4] F. Balado, N. Hurley, E. McCarthy, and G. Silvestre, Performanceanalysis of robust audio hashing, IEEE Trans. Inform. Forensics Se-curity, vol. 2, no. 2, pp. 254266, June 2007.
[5] P. Doets and R. Lagendijk, Distortion estimation in compressedmusic using only audio fingerprints, IEEE Trans. Audio, Speech,
Lang. Process., vol. 16, no. 2, pp. 302317, Feb. 2008.[6] J. Haitsma and T. Kalker, Speed-change resistant audio fingerprinting
using auto-correlation, in Proc. Int. Conf. Acoustics, Speech, andSignal Processing, Apr. 2003, vol. 4, pp. 728731.
[7] M. Park, H. Kim, Y. Ro, and M. Kim, Frequency filtering for a highlyrobust audio fingerprintingscheme in a real-noise environment,IEICETrans. Inform. Syst., vol. E89-D, no. 7, pp. 23242327, July 2006.
[8] K. Rao and P. Yip , Discrete Cosine Transform: Algorithms, Advan-tages, Applications. New York: Academic, 1990.
[9] N. Ahmed, T. Natarajan, and K. Rao, Discrete cosine transform,IEEE Trans. Comput., pp. 9093, Jan. 1974.
[10] J. Xi and J. Chicharo, Computing running DCTs and DSTs based ontheir second-order shift properties, IEEE Trans. Circuits Syst. I: Fund.Theory Applicat., vol. 47, no. 5, pp. 779783, May 2000.