Audio Fingerprinting Based on Multiple

download Audio Fingerprinting Based on Multiple

of 4

Transcript of Audio Fingerprinting Based on Multiple

  • 8/8/2019 Audio Fingerprinting Based on Multiple

    1/4

    IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 6, JUNE 2009 525

    Audio Fingerprinting Based on MultipleHashing in DCT Domain

    Yu Liu, Hwan Sik Yun, and Nam Soo Kim, Member, IEEE

    AbstractAudio fingerprinting techniques aim at successfullyperforming content-based audio identification even when the audiosignals are slightly or seriouslydistorted. In this letter, we propose anovel audio fingerprinting technique based on multiple hashing. Inorder to improve the robustness of hashing, multiple hash stringsare generated through the discrete cosine transform (DCT) whichis applied to the temporal energy sequence in each subband. Ex-perimental results show that the proposed algorithm outperformsthe Philips Robust Hash (PRH) algorithm [1] under various distor-tions.

    Index TermsAudio fingerprinting, content-based audio identi-fication, discrete cosine transform (DCT), robust hashing.

    I. INTRODUCTION

    AN AUDIO fingerprint is a compact low-level con-

    tent-based digest of an audio signal. It provides the

    ability to identify short, unlabeled audio clips in a fast and

    reliable way. The applications of audio fingerprinting include

    broadcast monitoring, audio connecting, file sharing and auto-

    matic music library organization [1]. There are several practical

    requirements which a successful audio fingerprinting system

    should satisfy [2]. First, it should be able to identify corrupted

    audio clips in spite of degradations. Second, it should be ableto identify the clips of only a few seconds long. Finally, it

    should be computationally efficient, both in calculating the

    fingerprints and in searching for the best match in the database.

    The application demands and the difficulties of designing such

    systems boost the interest in audio fingerprinting techniques,

    and many practical issues have been studied.

    Among various algorithms, the system developed by A. Wang

    from Shazam [3] has been considered as a successful and wide-

    spread work. Besides, the Philips Robust Hash (PRH) algorithm

    [1] is also a well studied content-based audio identification tech-

    nique. The robustness of the PRH algorithm has been verified

    mathematically via analyzing the overall bit error probability [4]

    or bit error rate (BER) [5]. It is important for any fingerprinting

    algorithm that it not only results in few bit errors, but also al-

    lows for efficient searching [6]. Specifically, the PRH algorithm

    Manuscript received December 01, 2008; revised February 05, 2009. Currentversion published April 24, 2009. This work was supported in part by the KoreaResearch Foundation Grant funded by the Korean Government (MOEHRD)(KRF-2008-313-D00783) and in part by the Korea Science and EngineeringFoundation (KOSEF) Grant funded by the Korean Government (MOST) (R0A-2007-000-10022-0). The associate editor coordinating the review of this manu-script and approving it for publication was Prof. Vesa Valimaki.

    The authors are with the School of Electrical Engineering and the Institute ofNew Media and Communications, Seoul National University, Seoul 151-742,Korea (e-mail: [email protected]).

    Digital Object Identifier 10.1109/LSP.2009.2016837

    relies on the assumption that at least one of the so called subfin-

    gerprints is invariant to noise. Although this assumption holds

    in mild conditions (e.g., MP3 compression, downsampling, and

    equalization), it fails to be valid for some seriously corrupted

    audio clips such as those recorded in a noisy environment, hence

    results in a serious performance degradation.

    For such reasons, some efforts have been delivered to im-

    prove the robustness of subfingerprints. One possible method

    mentioned in [1] suggests to generate a list of most probable

    candidates for each subfingerprint based on the reliability infor-

    mation obtained from soft coding, albeit that the reliability in-

    formation is in fact not reliable and less spectacular in practical

    implementations. Another method is to view extracting subfin-

    gerprints from the spectrogram as 2-D filtering in spectro-tem-

    poral domain and try to substitute the filters [7], which is empir-

    ical and not founded on a theoretical basis [4]. These approaches

    attempt to enhance the robustness of individual subfingerprints

    separately while joint improvement is considered more desir-

    able.

    In this letter, we propose a novel audio fingerprinting tech-

    nique that generatesmultiple subfingerprints for each frame, and

    the corresponding database searching scheme is also extended.

    Specifically, discrete cosine transform (DCT) is applied to the

    temporal sequence of energies in each subband, and one sub-fingerprint is generated for each DCT coefficient. Experimental

    results show that the proposed approach outperforms the PRH

    algorithm under various environments.

    II. PRH ALGORITHM

    There are two steps in the PRH algorithm: fingerprint extrac-

    tion and database searching phases. The overall block diagram

    for the fingerprint extraction stage is illustrated in Fig. 1. First,

    the audio signal is divided into overlapping frames with the

    length of about 370 ms, and the frame shift is 1/32 of the framelength. Second, power spectrum is obtained by performing FFT,

    and then the energies for 33 non-overlapping logarithmically

    spaced subbands (e.g., Bark Scale) covering the frequency range

    from 300 Hz to 2000 Hz are calculated. Finally, hash strings

    (referred to as subfingerprints) are computed from the subband

    energies in each frame as follows:

    (1)

    and

    (2)

    1070-9908/$25.00 2009 IEEE

  • 8/8/2019 Audio Fingerprinting Based on Multiple

    2/4

    526 IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 6, JUNE 2009

    Fig. 1. Overall block diagram of the fingerprint extraction stage in the PRHalgorithm.

    Fig. 2. Illustration of generating candidates from the hash table in the PRHalgorithm.

    in which

    (3)

    In (1), denotes the th subband energy in the th

    frame, and is the output difference. is the

    32-bit subfingerprint of frame and is the th bit of

    it.

    For the audio files stored in the database, all the subfinger-

    prints computed are registered in a hash table with the subfin-

    gerprints being treated as the keys. Each entry of the hash table

    stores a list of pointers to the positions in the audio files where

    the subfingerprint occurs. In the stage of database searching,

    256 subfingerprints which amount to approximately 3 seconds

    are extracted from the query audio, and each subfingerprint ismatched with the hash table contents to find the candidate po-

    sitions where it may come from. The process to generate the

    candidates is illustrated in Fig. 2. A fingerprint block with the

    same size as the query block ( bits) from the

    candidate position is obtained, BER between the two blocks is

    computed and compared with a threshold which is set to 0.35

    in [1]. If the BER is less than the threshold, the two signals are

    considered similar and the candidate audio is declared as the re-

    sult.

    III. MULTIPLE HASHING ALGORITHM

    The subfingerprints generated from the PRH algorithm arethe spectro-temporal differences between adjacent frames and

    Fig. 3. Overall block diagram of the fingerprint extraction stage in the MLHmethod.

    subbands. This method provides an efficient way to summa-

    rize the discriminative information of the audio spectra. Since,

    however, it only makes use of the information in two neigh-

    boring frames, it may be vulnerable to the possible interfer-

    ences. The bits in subfingerprints can be flipped due to the cor-

    ruption of local noise and mislead the audio fingerprint search.

    The problem can be alleviated by including more frames and

    performing low-pass filtering. By applying a set of low-pass fil-

    ters, the information that is invariant to noise can be extractedfor a robust discrimination. The set of filters are designed to be

    orthogonal so that the output features obtained can be more dis-

    tinguished from each other.

    In this section, we propose a new audio fingerprinting tech-

    nique called the multiple hashing (MLH) method. In the pro-

    posed algorithm, DCT is applied to the temporal sequence of

    energies in each subband, and a subfingerprint is constructed for

    each DCT coefficient stream. The reasons for employing DCT

    in the MLH method are twofold. First, among all the orthogonal

    transforms, the decorrelation performance of DCT is closest to

    the KarhunenLove transform [9]. Second, DCT has a strong

    energy compaction property [8] implying that most of the signal

    energy tends to be concentrated in a few low-frequency compo-

    nents. The decorrelation property ensures that each subfinger-

    print can be treated separately and performance improvement

    is possible via generating more subfingerprints. If the subband

    energies are evolving slowly, only a few DCT coefficients are

    sufficient to describe the subfingerprints.

    The framework of fingerprint extraction in the MLH system

    is depicted in Fig. 3. The first three parts, i.e., framing, FFT and

    band energy calculation are the same as those in the PRH algo-

    rithm. However, in contrast to the PRH algorithm, before com-

    puting the hash strings, -point DCT is performed on the con-

    secutive subband energies

    . Among the DCT coefficients, only the lower-or-dered values are retained for the computation of subfinger-

    prints. As a result, we obtain coefficients for each frame

    and subband denoted by . Then,

    each DCT coefficient is processed in a way similar to the PRH

    algorithm as follows:

    (4)

    where represents the th output difference of sub-

    band in frame . Let denote the th subfingerprint in

    frame . Then,

    (5)

  • 8/8/2019 Audio Fingerprinting Based on Multiple

    3/4

    LIU et al.: AUDIO FINGERPRINTING BASED ON MULTIPLE HASHING IN DCT DOMAIN 527

    Fig. 4. Illustration of generating candidates from the hash tables in the MLH

    method.

    in which

    (6)

    Note that in (4) we use and

    instead of and to ensure that

    they are obtained based on band energies which do not overlap

    with those used to compute and .As in the PRH system, the subfingerprints extracted from

    the database are registered in hash tables to enable an efficient

    searching algorithm. Since subfingerprints are computed for

    each frame, hash tables are constructed, for example, the

    subfingerprints obtained from the second DCT coefficients in

    each frame are registered in the second hash table. The data-

    base searching scheme consists of three steps. First, the query

    audio is divided into 256 frames, and subfingerprints are ob-

    tained in each frame as in the fingerprint extraction phase. Con-

    sequently, the query fingerprint block consists of

    bits, and forms the finger-

    print block . Second, the candidate positions are generated in

    each hash table separately, i.e., the subfingerprints in fingerprintblock are matched with the contents in the th hash table as in

    the PRH algorithm, and a candidate list is created by accumu-

    lating all the search results in all included hash tables. Finally,

    BERs are computed by comparing the query fingerprint block

    with those stored at the candidate positions in the database, and

    the most hit candidatewith BERless than the specified threshold

    is returned as the result. The process to generate the candidates

    is illustrated in Fig. 4.

    IV. EXPERIMENTAL RESULTS

    To evaluate the performance of the proposed MLH approach,

    we conducted several experiments under various environments.

    The database used in the experiments included 1500 music files

    collected from commercial compact discs. The database con-

    sisted of three groups of 500 files from classical, pop and rock

    genres. As for the positive queries, 1200 music clips with the

    length of about 3 s were randomly chosen from the database

    among which there were 393 rock, 426 classical, and 381 pop

    pieces. On the other hand, in order to compute the false positive

    rates, 200 music files that were not included in the database were

    collected from which another 1200 music clips were randomly

    extracted to form the negative queries. To assess the robustness

    of the algorithm, the following distortions were applied to both

    kinds of queries.

    Set 1: Additive white noise with the SNR at 5 db.

    Set 2: Additive white noise with the SNR at 0 db.

    Set 3: Playing and recording in a very quiet environment.

    Set 4: Playing and recording in office noise environment.

    There are several parameters which should be determined for

    the implementation of the MLH method. In our experiments, the

    DCT length was set to be 16, which was considered to provide

    a good compromise between the frequency and time resolutions.

    As for ,we s et since more than 90% of the total energywas found to concentrate on the first four coefficients in the

    tested materials. Finally, to speed up computation, we applied

    the running DCT algorithm [10] since the computation of DCT

    shifts one sample at each time. To compare the performances,

    we also implemented the PRH algorithm [1].

    As can be seen from the description of the algorithms, there

    are two types of false negatives: First, a distorted positive query

    clip gets rejected when the BER is higher than the threshold.

    Second, dismissal of a distorted positive query clip happens

    when there exist no error-free subfingerprints in the query block,

    even when the BER between the query fingerprint block and

    the candidate it comes from in the database is less than thethreshold. The first type of false negative is due to the tradeoff

    when trying to set a proper threshold, while the second type of

    false negative comes with the hashing structure in the searching

    algorithm. On the other hand, there is only one category of false

    positives, in which a negative query clip is accepted as in the

    database since the BER between the obtained fingerprint block

    and the candidate pointed by the hash table is less than the

    threshold. In order to evaluate the robustness of the system, we

    plotted 1- false positive rate (FPR) versus 1-false negative rate

    (FNR) by varying the threshold, and the obtained receiver op-

    erating characteristic (ROC) curves are shown in Fig. 5 from

    which we can discover the superiority of the proposed method.

    Also note that we use the term recognition rate in the followingexperiments to represent 1-FNR, therefore it differs from the

  • 8/8/2019 Audio Fingerprinting Based on Multiple

    4/4

    528 IEEE SIGNAL PROCESSING LETTERS, VOL. 16, NO. 6, JUNE 2009

    Fig. 5. ROC curves for four query sets; (a) Set 1. (b) Set 2. (c) Set 3. (d) Set 4.

    TABLE IRECOGNITION RATES (%) OF THE PRH AND MLH ALGORITHMS

    WITH DIFFERENT COMBINATIONS OF HASH TABLES

    conventional recall rate in that the latter excludes only the first

    type of false negatives.

    In addition, to show the performance variation with different

    combinations of hash tables, we set the threshold to be 0.35 as in

    [1] and measured the results. The four hash tables employed are

    denoted as HT1, HT2, HT3 and HT4. Here HT was built from

    the th DCT coefficients, for example, HT1 was constructed

    from the DC components. For comparison, the hash table con-structed in the PRH algorithm is represented as HT0. For each

    query set, the MLH algorithms were applied using different

    combinations of hash tables. The recognition rate achieved by

    each algorithm is given in Table I. Note that HT in HT is

    omitted in the table for simplicity, for example, 1, 2 means

    the MLH method using hash tables HT1 and HT2.

    It can be seen that the performance using HT1 was better than

    that of the PRH algorithm. Moreover, as more hash tables wereadded, the recognition rate improved dramatically. However, it

    is worth noting that if more hash tables are used, the memory

    usage and the computational burden also increase. Specifically,

    when hash tables are used, the memory usage is times

    that of the PRH algorithm, and the computation complexity, in

    the worst case, also becomes times. Thus it requires a careful

    consideration on how many hash tables are needed depending on

    the environment in which the audio fingerprinting system would

    be deployed. As canbe seen from the experimental results, when

    the clips are corrupted by mild noises, the MLH method using

    only one hash table already achieves satisfactory results, how-

    ever, for the seriously distorted clips, more hash tables should be

    added to achieve a better performance at the expense of highermemory usage and computational burden.

    V. CONCLUSION

    In this letter, we have presented a novel audio fingerprinting

    technique based on multiple hashing in DCT domain. Specif-

    ically, DCT is applied to the temporal sequence of energies

    in each subband and only the lower-ordered coefficients are

    retained to compute the subfingerprints. Multiple hash tables

    are built corresponding to the multiple subfingerprints in each

    frame, and different combinations of hash tables are evaluated.

    Experimental results have shown that the proposed MLH

    scheme outperformed the conventional PRH algorithm undervarious conditions. Future works may include the investigation

    of an efficient method for which reduced number of hash tables

    are used while maintaining the high accuracy of MLH.

    REFERENCES

    [1] J. Haitsma and T. Kalker, A highly robust audio fingerprintingsystem, in Proc. 3rd Int. Conf. Music Information Retrieval, Oct.2002, pp. 107115.

    [2] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, A review of audio fin-gerprinting, J. VLSI Signal Process., vol. 41, no. 3, pp. 271284, Nov.2005.

    [3] A. Wang, An industrial strength audio search algorithm, in Proc. 4thInt. Conf. Music Information Retrieval, Oct. 2003, pp. 713.

    [4] F. Balado, N. Hurley, E. McCarthy, and G. Silvestre, Performanceanalysis of robust audio hashing, IEEE Trans. Inform. Forensics Se-curity, vol. 2, no. 2, pp. 254266, June 2007.

    [5] P. Doets and R. Lagendijk, Distortion estimation in compressedmusic using only audio fingerprints, IEEE Trans. Audio, Speech,

    Lang. Process., vol. 16, no. 2, pp. 302317, Feb. 2008.[6] J. Haitsma and T. Kalker, Speed-change resistant audio fingerprinting

    using auto-correlation, in Proc. Int. Conf. Acoustics, Speech, andSignal Processing, Apr. 2003, vol. 4, pp. 728731.

    [7] M. Park, H. Kim, Y. Ro, and M. Kim, Frequency filtering for a highlyrobust audio fingerprintingscheme in a real-noise environment,IEICETrans. Inform. Syst., vol. E89-D, no. 7, pp. 23242327, July 2006.

    [8] K. Rao and P. Yip , Discrete Cosine Transform: Algorithms, Advan-tages, Applications. New York: Academic, 1990.

    [9] N. Ahmed, T. Natarajan, and K. Rao, Discrete cosine transform,IEEE Trans. Comput., pp. 9093, Jan. 1974.

    [10] J. Xi and J. Chicharo, Computing running DCTs and DSTs based ontheir second-order shift properties, IEEE Trans. Circuits Syst. I: Fund.Theory Applicat., vol. 47, no. 5, pp. 779783, May 2000.