
IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 6, JUNE 2009 525

Audio Fingerprinting Based on Multiple Hashing in DCT Domain

Yu Liu, Hwan Sik Yun, and Nam Soo Kim, Member, IEEE

Abstract: Audio fingerprinting techniques aim at successfully performing content-based audio identification even when the audio signals are slightly or seriously distorted. In this letter, we propose a novel audio fingerprinting technique based on multiple hashing. In order to improve the robustness of hashing, multiple hash strings are generated through the discrete cosine transform (DCT), which is applied to the temporal energy sequence in each subband. Experimental results show that the proposed algorithm outperforms the Philips Robust Hash (PRH) algorithm [1] under various distortions.

Index Terms: Audio fingerprinting, content-based audio identification, discrete cosine transform (DCT), robust hashing.

I. INTRODUCTION

An audio fingerprint is a compact low-level content-based digest of an audio signal. It provides the ability to identify short, unlabeled audio clips in a fast and reliable way. The applications of audio fingerprinting include broadcast monitoring, audio connecting, file sharing and automatic music library organization [1]. There are several practical requirements which a successful audio fingerprinting system should satisfy [2]. First, it should be able to identify corrupted audio clips in spite of degradations. Second, it should be able to identify clips that are only a few seconds long. Finally, it should be computationally efficient, both in calculating the fingerprints and in searching for the best match in the database. The application demands and the difficulties of designing such systems have boosted the interest in audio fingerprinting techniques, and many practical issues have been studied.

Among various algorithms, the system developed by A. Wang from Shazam [3] has been considered a successful and widespread work. In addition, the Philips Robust Hash (PRH) algorithm [1] is a well-studied content-based audio identification technique. The robustness of the PRH algorithm has been verified mathematically by analyzing the overall bit error probability [4] or bit error rate (BER) [5]. It is important for any fingerprinting algorithm that it not only results in few bit errors, but also allows for efficient searching [6]. Specifically, the PRH algorithm relies on the assumption that at least one of the so-called subfingerprints is invariant to noise. Although this assumption holds in mild conditions (e.g., MP3 compression, downsampling, and equalization), it fails for some seriously corrupted audio clips, such as those recorded in a noisy environment, and this results in a serious performance degradation.

Manuscript received December 01, 2008; revised February 05, 2009. Current version published April 24, 2009. This work was supported in part by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2008-313-D00783) and in part by the Korea Science and Engineering Foundation (KOSEF) Grant funded by the Korean Government (MOST) (R0A-2007-000-10022-0). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Vesa Valimaki.

The authors are with the School of Electrical Engineering and the Institute of New Media and Communications, Seoul National University, Seoul 151-742, Korea (e-mail: [email protected]).

Digital Object Identifier 10.1109/LSP.2009.2016837

For such reasons, some efforts have been made to improve the robustness of subfingerprints. One possible method mentioned in [1] suggests generating a list of the most probable candidates for each subfingerprint based on the reliability information obtained from soft coding, although this reliability information is in fact not reliable and the method is less spectacular in practical implementations. Another method is to view the extraction of subfingerprints from the spectrogram as 2-D filtering in the spectro-temporal domain and to try substituting the filters [7], which is empirical and not founded on a theoretical basis [4]. These approaches attempt to enhance the robustness of individual subfingerprints separately, while joint improvement is considered more desirable.

In this letter, we propose a novel audio fingerprinting technique that generates multiple subfingerprints for each frame, and the corresponding database searching scheme is also extended. Specifically, the discrete cosine transform (DCT) is applied to the temporal sequence of energies in each subband, and one subfingerprint is generated for each DCT coefficient. Experimental results show that the proposed approach outperforms the PRH algorithm under various environments.

II. PRH ALGORITHM

There are two steps in the PRH algorithm: fingerprint extraction and database searching. The overall block diagram for the fingerprint extraction stage is illustrated in Fig. 1. First, the audio signal is divided into overlapping frames with a length of about 370 ms, and the frame shift is 1/32 of the frame length. Second, the power spectrum is obtained by performing an FFT, and then the energies for 33 non-overlapping logarithmically spaced subbands (e.g., Bark scale) covering the frequency range from 300 Hz to 2000 Hz are calculated. Finally, hash strings (referred to as subfingerprints) are computed from the subband energies in each frame as follows:

$$ED(n,m) = E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) \tag{1}$$

and

$$SF(n) = [F(n,0), F(n,1), \ldots, F(n,31)] \tag{2}$$

in which

$$F(n,m) = \begin{cases} 1, & ED(n,m) > 0 \\ 0, & ED(n,m) \le 0. \end{cases} \tag{3}$$

In (1), $E(n,m)$ denotes the $m$th subband energy in the $n$th frame, and $ED(n,m)$ is the output difference. $SF(n)$ is the 32-bit subfingerprint of frame $n$ and $F(n,m)$ is the $m$th bit of it.

Fig. 1. Overall block diagram of the fingerprint extraction stage in the PRH algorithm.

Fig. 2. Illustration of generating candidates from the hash table in the PRH algorithm.

1070-9908/$25.00 © 2009 IEEE
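As an illustration of (1)-(3), the following minimal Python sketch derives one subfingerprint per frame from a matrix of subband energies. It assumes the sign-of-difference rule of the PRH algorithm in [1]: each bit is 1 when the energy difference between adjacent subbands increases from the previous frame to the current one. The function name `prh_subfingerprints` and the bit-packing order (first subband pair in the most significant bit) are choices made here for illustration and are not taken from the letter.

```python
def prh_subfingerprints(energies):
    """Compute PRH-style subfingerprints.

    energies[n][m] is the energy of subband m in frame n. With 33 subbands
    per frame, each returned integer holds 32 bits. Frame 0 has no
    predecessor, so the output starts at frame 1.
    """
    subfingerprints = []
    for n in range(1, len(energies)):
        cur, prev = energies[n], energies[n - 1]
        value = 0
        for m in range(len(cur) - 1):
            # Difference across adjacent subbands, then across adjacent frames.
            ed = (cur[m] - cur[m + 1]) - (prev[m] - prev[m + 1])
            bit = 1 if ed > 0 else 0
            # Assumption: the bit for m = 0 ends up most significant.
            value = (value << 1) | bit
        subfingerprints.append(value)
    return subfingerprints


if __name__ == "__main__":
    # Toy input: 2 frames, 3 subbands, so each subfingerprint has 2 bits.
    print(prh_subfingerprints([[1, 1, 1], [3, 1, 2]]))
```

Because only the sign of each difference is kept, any positive rescaling of the energies leaves the subfingerprints unchanged.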

For the audio files stored in the database, all the computed subfingerprints are registered in a hash table with the subfingerprints being treated as the keys. Each entry of the hash table stores a list of pointers to the positions in the audio files where the subfingerprint occurs. In the database searching stage, 256 subfingerprints, which amount to approximately 3 seconds, are extracted from the query audio, and each subfingerprint is matched with the hash table contents to find the candidate positions where it may come from. The process to generate the candidates is illustrated in Fig. 2. A fingerprint block with the same size as the query block ($256 \times 32$ bits) is obtained from the candidate position, and the BER between the two blocks is computed and compared with a threshold, which is set to 0.35 in [1]. If the BER is less than the threshold, the two signals are considered similar and the candidate audio is declared as the result.
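The lookup-and-verify procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [1]: the database is a hypothetical dictionary from track identifier to a list of 32-bit subfingerprints, the helper names are invented here, and when several candidates fall below the threshold the sketch returns the one with the lowest BER, a tie-breaking rule the text does not specify.

```python
from collections import defaultdict


def build_hash_table(database):
    """Map each subfingerprint to the (track_id, position) pairs where it occurs."""
    table = defaultdict(list)
    for track_id, subfingerprints in database.items():
        for pos, sf in enumerate(subfingerprints):
            table[sf].append((track_id, pos))
    return table


def bit_error_rate(block_a, block_b, bits=32):
    """Fraction of differing bits between two equally long fingerprint blocks."""
    errors = sum(bin(a ^ b).count("1") for a, b in zip(block_a, block_b))
    return errors / (bits * len(block_a))


def prh_search(query, database, table, threshold=0.35):
    """Return (track_id, start, ber) for the best candidate below threshold, else None."""
    best = None
    checked = set()
    for i, sf in enumerate(query):
        # An exact hash hit at query offset i implies the block starts at pos - i.
        for track_id, pos in table.get(sf, []):
            start = pos - i
            if (track_id, start) in checked:
                continue
            checked.add((track_id, start))
            ref = database[track_id]
            if start < 0 or start + len(query) > len(ref):
                continue
            ber = bit_error_rate(query, ref[start:start + len(query)])
            if ber < threshold and (best is None or ber < best[2]):
                best = (track_id, start, ber)
    return best


if __name__ == "__main__":
    db = {"a": [1, 2, 3, 4, 5, 6], "b": [9, 8, 7, 6, 5, 4]}
    print(prh_search([3, 5, 5], db, build_hash_table(db)))
```

Note that a query with no exactly matching subfingerprint produces no candidates at all, regardless of how low its BER against the true position would be.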

III. MULTIPLE HASHING ALGORITHM

The subfingerprints generated from the PRH algorithm are the spectro-temporal differences between adjacent frames and subbands. This method provides an efficient way to summarize the discriminative information of the audio spectra. However, since it only makes use of the information in two neighboring frames, it may be vulnerable to interference. The bits in subfingerprints can be flipped by local noise, which misleads the audio fingerprint search. The problem can be alleviated by including more frames and performing low-pass filtering. By applying a set of low-pass filters, the information that is invariant to noise can be extracted for robust discrimination. The filters in the set are designed to be orthogonal so that the output features obtained are more distinct from each other.

Fig. 3. Overall block diagram of the fingerprint extraction stage in the MLH method.

In this section, we propose a new audio fingerprinting technique called the multiple hashing (MLH) method. In the proposed algorithm, DCT is applied to the temporal sequence of energies in each subband, and a subfingerprint is constructed for each DCT coefficient stream. The reasons for employing DCT in the MLH method are twofold. First, among all the orthogonal transforms, the decorrelation performance of DCT is closest to the Karhunen-Loève transform [9]. Second, DCT has a strong energy compaction property [8], implying that most of the signal energy tends to be concentrated in a few low-frequency components. The decorrelation property ensures that each subfingerprint can be treated separately and that performance improvement is possible via generating more subfingerprints. If the subband energies are evolving slowly, only a few DCT coefficients are sufficient to describe the subfingerprints.

The framework of fingerprint extraction in the MLH system is depicted in Fig. 3. The first three parts, i.e., framing, FFT and band energy calculation, are the same as those in the PRH algorithm. However, in contrast to the PRH algorithm, before computing the hash strings, an $N$-point DCT is performed on the consecutive subband energies $\{E(n,m), E(n+1,m), \ldots, E(n+N-1,m)\}$. Among the DCT coefficients, only the $K$ lower-ordered values are retained for the computation of subfingerprints. As a result, we obtain $K$ coefficients for each frame $n$ and subband $m$, denoted by $\{C_k(n,m),\ k = 1, \ldots, K\}$. Then, each DCT coefficient is processed in a way similar to the PRH algorithm as follows:

$$ED_k(n,m) = C_k(n,m) - C_k(n,m+1) - \bigl(C_k(n-N,m) - C_k(n-N,m+1)\bigr) \tag{4}$$

where $ED_k(n,m)$ represents the $k$th output difference of subband $m$ in frame $n$. Let $SF_k(n)$ denote the $k$th subfingerprint in frame $n$. Then,

$$SF_k(n) = [F_k(n,0), F_k(n,1), \ldots, F_k(n,31)] \tag{5}$$

in which

$$F_k(n,m) = \begin{cases} 1, & ED_k(n,m) > 0 \\ 0, & ED_k(n,m) \le 0. \end{cases} \tag{6}$$

Note that in (4) we use $C_k(n-N,m)$ and $C_k(n-N,m+1)$ instead of $C_k(n-1,m)$ and $C_k(n-1,m+1)$ to ensure that they are obtained based on band energies which do not overlap with those used to compute $C_k(n,m)$ and $C_k(n,m+1)$.

As in the PRH system, the subfingerprints extracted from the database are registered in hash tables to enable an efficient searching algorithm. Since $K$ subfingerprints are computed for each frame, $K$ hash tables are constructed; for example, the subfingerprints obtained from the second DCT coefficients in each frame are registered in the second hash table. The database searching scheme consists of three steps. First, the query audio is divided into 256 frames, and $K$ subfingerprints are obtained in each frame as in the fingerprint extraction phase. Consequently, the query fingerprint block consists of $K \times 256 \times 32$ bits, and the set $\{SF_k(n)\}$ over the 256 frames forms fingerprint block $k$. Second, the candidate positions are generated in each hash table separately, i.e., the subfingerprints in fingerprint block $k$ are matched with the contents of the $k$th hash table as in the PRH algorithm, and a candidate list is created by accumulating the search results from all included hash tables. Finally, BERs are computed by comparing the query fingerprint block with those stored at the candidate positions in the database, and the most frequently hit candidate with a BER less than the specified threshold is returned as the result. The process to generate the candidates is illustrated in Fig. 4.

Fig. 4. Illustration of generating candidates from the hash tables in the MLH method.
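The extraction and candidate-generation steps of this section can be sketched in Python as follows. This is a minimal illustration under several assumptions that the text does not fix: the DCT window for frame n is taken to cover frames n to n + dct_len - 1, an unnormalized DCT-II is used (only the sign of each difference matters, so the scaling is irrelevant), the reference coefficients lag by dct_len frames so that the two windows do not overlap, coefficients are indexed from zero, and the hash tables are plain dictionaries from subfingerprint to (track_id, position) pairs. All function names are hypothetical.

```python
import math
from collections import Counter


def dct_lowest(window, num_coeffs):
    """First num_coeffs coefficients of the unnormalized DCT-II of window."""
    size = len(window)
    return [sum(v * math.cos(math.pi * (t + 0.5) * k / size)
                for t, v in enumerate(window))
            for k in range(num_coeffs)]


def mlh_subfingerprints(energies, dct_len=16, num_coeffs=4):
    """Return sf[k][j]: subfingerprint from DCT coefficient k at usable frame j.

    energies[n][m] is the energy of subband m in frame n.
    """
    num_frames, num_bands = len(energies), len(energies[0])
    num_windows = num_frames - dct_len + 1
    # coeffs[n][m][k]: coefficient k of the window starting at frame n, subband m.
    coeffs = [[dct_lowest([energies[n + t][m] for t in range(dct_len)], num_coeffs)
               for m in range(num_bands)]
              for n in range(num_windows)]
    sf = [[] for _ in range(num_coeffs)]
    for n in range(dct_len, num_windows):
        ref = n - dct_len  # non-overlapping reference window
        for k in range(num_coeffs):
            value = 0
            for m in range(num_bands - 1):
                ed = ((coeffs[n][m][k] - coeffs[n][m + 1][k])
                      - (coeffs[ref][m][k] - coeffs[ref][m + 1][k]))
                value = (value << 1) | (1 if ed > 0 else 0)
            sf[k].append(value)
    return sf


def mlh_candidates(query_sf, tables):
    """Accumulate hash hits per (track_id, start) over all hash tables.

    query_sf[k] is query fingerprint block k; tables[k] maps a subfingerprint
    to the (track_id, position) pairs registered in hash table k.
    """
    hits = Counter()
    for k, table in enumerate(tables):
        for i, value in enumerate(query_sf[k]):
            for track_id, pos in table.get(value, []):
                start = pos - i
                if start >= 0:
                    hits[(track_id, start)] += 1
    return hits


if __name__ == "__main__":
    toy = [[1, 1], [2, 1], [4, 1], [8, 1], [16, 1]]
    print(mlh_subfingerprints(toy, dct_len=2, num_coeffs=2))
```

The candidates returned by `mlh_candidates` can then be visited in order of `hits.most_common()`, with the BER check against the threshold applied to each until one passes.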

IV. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed MLH approach, we conducted several experiments under various environments. The database used in the experiments included 1500 music files collected from commercial compact discs. The database consisted of three groups of 500 files from the classical, pop and rock genres. As for the positive queries, 1200 music clips with a length of about 3 s were randomly chosen from the database, among which there were 393 rock, 426 classical, and 381 pop pieces. On the other hand, in order to compute the false positive rates, 200 music files that were not included in the database were collected, from which another 1200 music clips were randomly extracted to form the negative queries. To assess the robustness of the algorithm, the following distortions were applied to both kinds of queries.

Set 1: Additive white noise with the SNR at 5 dB.

Set 2: Additive white noise with the SNR at 0 dB.

Set 3: Playing and recording in a very quiet environment.

Set 4: Playing and recording in an office noise environment.
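For Sets 1 and 2, a distortion of this kind can be generated as in the following minimal Python sketch. It assumes Gaussian white noise and defines the SNR as the ratio of average signal power to average noise power over the whole clip; the letter does not state the noise distribution or how the SNR was measured, and the function name `add_white_noise` and the fixed seed are illustrative choices.

```python
import math
import random


def add_white_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise at the requested SNR in dB.

    The signal must have nonzero average power.
    """
    rng = random.Random(seed)
    signal_power = sum(s * s for s in signal) / len(signal)
    # SNR_dB = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


if __name__ == "__main__":
    clean = [math.sin(0.01 * t) for t in range(20000)]
    noisy = add_white_noise(clean, snr_db=5.0)
    noise = [a - b for a, b in zip(noisy, clean)]
    ps = sum(s * s for s in clean) / len(clean)
    pn = sum(v * v for v in noise) / len(noise)
    print(round(10 * math.log10(ps / pn), 2))
```

At 0 dB the noise carries as much power as the signal itself, which is why Set 2 is the harsher of the two additive-noise conditions.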

There are several parameters which should be determined for the implementation of the MLH method. In our experiments, the DCT length $N$ was set to 16, which was considered to provide a good compromise between the frequency and time resolutions. As for $K$, we set $K = 4$ since more than 90% of the total energy was found to be concentrated in the first four coefficients in the tested materials. Finally, to speed up computation, we applied the running DCT algorithm [10], since the DCT window shifts by one sample at each step. To compare the performances, we also implemented the PRH algorithm [1].
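The energy-concentration criterion used to choose the number of retained coefficients can be checked along the lines of the following Python sketch. It assumes an orthonormal DCT-II, for which the sum of squared coefficients equals the sum of squared samples, so the share of energy in the lowest coefficients is well defined. The function names and the synthetic slowly varying test sequence are illustrative and are not the material used in the experiments.

```python
import math


def orthonormal_dct(x):
    """Orthonormal DCT-II, which preserves the sum of squares of x."""
    size = len(x)
    coeffs = []
    for k in range(size):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
        coeffs.append(scale * sum(v * math.cos(math.pi * (t + 0.5) * k / size)
                                  for t, v in enumerate(x)))
    return coeffs


def low_order_energy_fraction(x, num_coeffs):
    """Share of the total energy held by the first num_coeffs DCT coefficients.

    The sequence x must not be all zeros.
    """
    coeffs = orthonormal_dct(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:num_coeffs]) / total


if __name__ == "__main__":
    # A slowly rising 16-frame energy trajectory for one subband.
    trajectory = [10.0 + 0.5 * t for t in range(16)]
    print(low_order_energy_fraction(trajectory, 4) > 0.9)
```

For a slowly evolving sequence such as this one, most of the energy sits in the DC term, which is consistent with the energy compaction argument given for the DCT.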

As can be seen from the description of the algorithms, there are two types of false negatives. First, a distorted positive query clip gets rejected when the BER is higher than the threshold. Second, dismissal of a distorted positive query clip happens when there exist no error-free subfingerprints in the query block, even when the BER between the query fingerprint block and the candidate it comes from in the database is less than the threshold. The first type of false negative is due to the tradeoff involved in setting a proper threshold, while the second type of false negative comes with the hashing structure in the searching algorithm. On the other hand, there is only one category of false positives, in which a negative query clip is accepted as being in the database since the BER between the obtained fingerprint block and the candidate pointed to by the hash table is less than the threshold. In order to evaluate the robustness of the system, we plotted 1 - false positive rate (FPR) versus 1 - false negative rate (FNR) by varying the threshold, and the obtained receiver operating characteristic (ROC) curves are shown in Fig. 5, from which we can see the superiority of the proposed method. Also note that we use the term recognition rate in the following experiments to represent 1 - FNR; it therefore differs from the conventional recall rate in that the latter excludes only the first type of false negatives.

Fig. 5. ROC curves for four query sets; (a) Set 1. (b) Set 2. (c) Set 3. (d) Set 4.

TABLE I. RECOGNITION RATES (%) OF THE PRH AND MLH ALGORITHMS WITH DIFFERENT COMBINATIONS OF HASH TABLES
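The threshold sweep behind such ROC curves can be sketched in Python as follows. The sketch assumes that each query has already been reduced to the lowest BER found among its candidates, or to None when the hash lookup returned no candidate at all (the second type of false negative). It is a simplification: a positive query that matches the wrong track is not treated separately. The function name `roc_points` and the input layout are hypothetical.

```python
def roc_points(positive_bers, negative_bers, thresholds):
    """Return a list of (1 - FPR, 1 - FNR) pairs, one per threshold.

    Each list entry is the lowest candidate BER for a query, or None when the
    hash lookup produced no candidate.
    """
    points = []
    for th in thresholds:
        accepted_pos = sum(1 for b in positive_bers if b is not None and b < th)
        accepted_neg = sum(1 for b in negative_bers if b is not None and b < th)
        # Both kinds of false negatives count: BER too high, or no candidate.
        fnr = 1 - accepted_pos / len(positive_bers)
        fpr = accepted_neg / len(negative_bers)
        points.append((1 - fpr, 1 - fnr))
    return points


if __name__ == "__main__":
    pos = [0.1, 0.3, None, 0.4]
    neg = [0.45, None, 0.5, 0.2]
    print(roc_points(pos, neg, [0.35]))
```

Queries marked None stay rejected at every threshold, so the second type of false negative caps the attainable recognition rate no matter how the threshold is set.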

In addition, to show the performance variation with different combinations of hash tables, we set the threshold to 0.35 as in [1] and measured the results. The four hash tables employed are denoted as HT1, HT2, HT3 and HT4. Here HT$k$ was built from the $k$th DCT coefficients; for example, HT1 was constructed from the DC components. For comparison, the hash table constructed in the PRH algorithm is represented as HT0. For each query set, the MLH algorithm was applied using different combinations of hash tables. The recognition rate achieved by each algorithm is given in Table I. Note that "HT" in HT$k$ is omitted in the table for simplicity; for example, "1, 2" means the MLH method using hash tables HT1 and HT2.

It can be seen that the performance using HT1 was better than that of the PRH algorithm. Moreover, as more hash tables were added, the recognition rate improved dramatically. However, it is worth noting that if more hash tables are used, the memory usage and the computational burden also increase. Specifically, when $K$ hash tables are used, the memory usage is $K$ times that of the PRH algorithm, and the computational complexity, in the worst case, also becomes $K$ times as large. Thus, how many hash tables are needed requires careful consideration, depending on the environment in which the audio fingerprinting system would be deployed. As can be seen from the experimental results, when the clips are corrupted by mild noise, the MLH method using only one hash table already achieves satisfactory results. For seriously distorted clips, however, more hash tables should be added to achieve better performance at the expense of higher memory usage and computational burden.

V. CONCLUSION

In this letter, we have presented a novel audio fingerprinting technique based on multiple hashing in the DCT domain. Specifically, DCT is applied to the temporal sequence of energies in each subband and only the lower-ordered coefficients are retained to compute the subfingerprints. Multiple hash tables are built corresponding to the multiple subfingerprints in each frame, and different combinations of hash tables are evaluated. Experimental results have shown that the proposed MLH scheme outperformed the conventional PRH algorithm under various conditions. Future work may include the investigation of an efficient method that uses a reduced number of hash tables while maintaining the high accuracy of MLH.

REFERENCES

[1] J. Haitsma and T. Kalker, "A highly robust audio fingerprinting system," in Proc. 3rd Int. Conf. Music Information Retrieval, Oct. 2002, pp. 107-115.

[2] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, "A review of audio fingerprinting," J. VLSI Signal Process., vol. 41, no. 3, pp. 271-284, Nov. 2005.

[3] A. Wang, "An industrial strength audio search algorithm," in Proc. 4th Int. Conf. Music Information Retrieval, Oct. 2003, pp. 7-13.

[4] F. Balado, N. Hurley, E. McCarthy, and G. Silvestre, "Performance analysis of robust audio hashing," IEEE Trans. Inform. Forensics Security, vol. 2, no. 2, pp. 254-266, June 2007.

[5] P. Doets and R. Lagendijk, "Distortion estimation in compressed music using only audio fingerprints," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 2, pp. 302-317, Feb. 2008.

[6] J. Haitsma and T. Kalker, "Speed-change resistant audio fingerprinting using auto-correlation," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Apr. 2003, vol. 4, pp. 728-731.

[7] M. Park, H. Kim, Y. Ro, and M. Kim, "Frequency filtering for a highly robust audio fingerprinting scheme in a real-noise environment," IEICE Trans. Inform. Syst., vol. E89-D, no. 7, pp. 2324-2327, July 2006.

[8] K. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. New York: Academic, 1990.

[9] N. Ahmed, T. Natarajan, and K. Rao, "Discrete cosine transform," IEEE Trans. Comput., pp. 90-93, Jan. 1974.

[10] J. Xi and J. Chicharo, "Computing running DCTs and DSTs based on their second-order shift properties," IEEE Trans. Circuits Syst. I: Fund. Theory Applicat., vol. 47, no. 5, pp. 779-783, May 2000.
