
Audio Engineering Society

Convention Paper 9905
Presented at the 143rd Convention

2017 October 18–21, New York, NY, USA

This convention paper was selected based on a submitted abstract and 750-word précis that have been peer reviewed by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced from the author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. This paper is available in the AES E-Library (http://www.aes.org/e-lib), all rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Blind Estimation of the Reverberation Fingerprint of Unknown Acoustic Environments

Prateek MurgaiÂđ, Mark RauÂđ, and Jean-Marc JotÂē

Âđ Center for Computer Research in Music and Acoustics (CCRMA), Stanford University
Âē Magic Leap

Correspondence should be addressed to Prateek Murgai ([email protected])

ABSTRACT

Methods for blind estimation of a room’s reverberation properties have been proposed for applications including speech dereverberation and audio forensics. In this paper, we evaluate algorithms for online estimation of a room’s “reverberation fingerprint”, defined by its volume and its frequency-dependent diffuse reverberation decay time. Both quantities are derived adaptively by analyzing a single-microphone reverberant signal recording, without access to acoustic source reference signals. The accuracy and convergence of the proposed techniques are evaluated experimentally against the ground truth obtained from geometric and impulse response data. The motivations of the present study include the development of improved headphone 3D audio rendering techniques for mobile computing devices.

1 Introduction

Blind reverberation identification has been the object of extensive research targeting applications to speech communication and speech recognition [1]. More recently, it was studied in the context of audio forensics [2] and suggested for controlling headphone 3D audio artificial reverberation processing for augmented reality applications [3].

In these applications, it is desirable to estimate certain reverberation properties of a local environment by analyzing a single-microphone recording of the reverberant signal, although no reference/source signal is available to the system. The definition of the “roomprint” [2] or “reverberation fingerprint” [3] includes the room reverberation time vs. frequency (i.e. measured in sub-bands) and the reverberation power level (or, alternatively, the room’s cubic volume). An advantage of such a characterization is that, in typical enclosures, the reverberation time is independent of source or receiver position, directivity or orientation, and thus may be considered a property of the room itself. On the other hand, these source and receiver parameters may affect reverberation power and spectrum [3] [4], and must be eliminated or compensated for when it is desired to characterize the room itself. In [3], it is proposed to isolate the effect of the room by restricting the “reverberation fingerprint” characterization to the diffuse part of the reverberation decay and exploiting a stochastic reverberation model previously proposed in [4].


Augmented (or mixed) reality (AR/MR) presents a new challenge where the space in which the user is situated may have audibly different acoustic properties than the space inhabited by the virtual audio sources. Jot and Lee proposed a method to alter a known room response so as to simulate a room having a different reverberation fingerprint, in order to achieve a more natural virtual 3D audio reproduction [3]. This method assumes that the local acoustic environment’s reverberation fingerprint is known but does not propose an online technique for measuring it. The present work focuses on the blind online estimation of reverberation decay time and room volume, applicable to augmented reality audio applications.

The sub-band reverberation time is an essential measure for characterizing the acoustics of a space and quantifying the subjective persistence of a sound source within that space, thus making its estimation of prime importance for AR applications. As explicit measurement of reverberation decay time using test signals is not practical while using AR applications in unknown spaces, we propose a method which blindly estimates the reverberation time by passively listening to the sound sources that are already present in the space. Several previous studies describe offline methods for the blind estimation of the sub-band reverberation time from recorded speech signals [5] [6] [7]. We build on this prior work to propose an algorithm that employs a running energy envelope estimator coupled with a peak detection algorithm in order to segment intermittent decays that occur in a continuous speech signal recording. The method assumes that the reverberation time will be revealed by the fastest decay observed among the detected decay segments, and converges to a reverberation time estimate based on that assumption. In the following, an online implementation of this decay time estimation technique is described.

The diffuse-field reverberation power can be assumed to be independent of source or listener positions in a “well-behaved” (or “mixing”) room. For a given decay time, source and listener, the power of the diffuse-field reverberation is inversely proportional to the cubic volume of the room [4]. If the volume and reverberation time of an unknown room are detected, a reverberation impulse response measured in a known room can be modified and scaled accordingly, in order to match the unknown room’s diffuse-field reverberation power, as proposed in [3]. Statistical methods similar to that of Shabtai [8] may be applicable to the blind prediction of the cubic volume of an unknown room from locally recorded speech signals. In the following, we explore an adaptation of this approach to online blind prediction of diffuse-field reverberation power.

2 Methods

2.1 Reverberation Decay Rate Prediction

A blind reverberation time estimation technique that monitors intermittent decays in running speech is proposed. The method assumes that, in a given frequency band, the reverberation time is derived from the fastest decay rate observed among all the energy envelope decays that occur in the recorded speech signal. The technique is broken down into the following steps.

2.1.1 Energy Envelope Estimation

In a speech signal recorded in a room, the rate of envelope decays is limited (slowed down) by the room’s reverberation decay rate. The reverberation time can be measured during speech pauses and determines the fastest decay rate observed during signal energy envelope decay segments. In order to detect each of these decay segments, we first employ a running energy estimator based on Equation 1.

e[k] = α · x[k−1]Âē + (1 − α) · x[k]Âē  (1)

Where x is the input speech signal, e[k] is the estimated running energy and α is the forgetting factor which sets the weighting between the current and past sample in the power calculation. In our experiment, α is set to 0.2.
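As a minimal sketch of Equation 1 (the function name and vectorized form are ours, not from the paper; only α = 0.2 is taken from the text):

```python
import numpy as np

def running_energy(x, alpha=0.2):
    """Running energy estimate per Equation 1:
    e[k] = alpha * x[k-1]^2 + (1 - alpha) * x[k]^2,
    a weighted combination of the previous and current squared samples."""
    x = np.asarray(x, dtype=float)
    x_prev = np.concatenate(([0.0], x[:-1]))  # x[k-1], with zero before the first sample
    return alpha * x_prev**2 + (1.0 - alpha) * x**2
```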

Once the running energy has been estimated, a leaky-integrator root mean square detector is used to approximate its envelope. The envelope detector is given by Equation 2.

y[k] = √( ÎČ Â· y[k−1]Âē + (1 − ÎČ) · e[k]Âē )  (2)

Where y[k] is the current sample of the detector and e[k] is the current sample of the running energy derived from Equation 1. ÎČ = e^(−1/(τ·Fs)), where Fs is the sampling rate of the input speech signal and τ is the time constant of the detector. To ensure that the detector follows the input closely, τ is assigned a small value of 5 milliseconds.
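A sketch of the Equation 2 detector, with ÎČ computed from the 5 ms time constant as above (function and variable names are illustrative):

```python
import numpy as np

def rms_envelope(e, fs=16000, tau=0.005):
    """Leaky-integrator RMS detector per Equation 2:
    y[k] = sqrt(beta * y[k-1]^2 + (1 - beta) * e[k]^2),
    with beta = exp(-1 / (tau * fs)). tau = 5 ms keeps the
    detector tight to the input, as stated in the text."""
    beta = np.exp(-1.0 / (tau * fs))
    y = np.zeros(len(e))
    y_sq = 0.0
    for k, ek in enumerate(e):
        y_sq = beta * y_sq + (1.0 - beta) * ek**2
        y[k] = np.sqrt(y_sq)
    return y
```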

AES 143rd Convention, New York, NY, USA, 2017 October 18–21Page 2 of 6

Page 3: Engineering Convention Paper 9905 - AES

Murgai, Rau, and Jot Reverberation Fingerprint Estimation

Fig. 1: Blind reverberation time estimation process

2.1.2 Decay Start and Stop Estimation

Our next task is to locate the points in the energy envelope where each decay begins. For this purpose, we design a simple local peak estimation technique which looks at the immediate past and future neighboring e[k] samples to designate the current sample in the energy envelope as a local peak.

We further select the peaks based on a threshold amplitude value that depends on the largest peak previously detected in the energy envelope. A selected peak is designated as a valid decay start point only if the signal beyond this point decreases monotonically for a duration at least equal to a threshold amount of time given by τthreshold. For our application, we set τthreshold to 50 milliseconds. The times where the signal stops decreasing monotonically are tagged as decay stop points.
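The peak selection and monotone-decay check above can be sketched as follows. Only τthreshold = 50 ms comes from the text; the relative peak threshold of 0.1 and the scanning strategy are illustrative assumptions:

```python
import numpy as np

def decay_segments(env, fs, tau_threshold=0.05, peak_rel_thresh=0.1):
    """Find (start, stop) index pairs of valid decays in an energy envelope.
    A sample is a candidate peak if it exceeds both immediate neighbours and
    a fraction of the largest peak seen so far; it is a valid decay start
    only if the envelope then decreases monotonically for at least
    tau_threshold seconds. The decay stop is where the decrease ends."""
    min_len = int(tau_threshold * fs)
    segments = []
    largest = 0.0
    for k in range(1, len(env) - 1):
        if env[k] > env[k - 1] and env[k] > env[k + 1]:
            largest = max(largest, env[k])
            if env[k] < peak_rel_thresh * largest:
                continue  # too small relative to the largest peak so far
            stop = k
            # advance while the envelope keeps decreasing monotonically
            while stop + 1 < len(env) and env[stop + 1] < env[stop]:
                stop += 1
            if stop - k >= min_len:
                segments.append((k, stop))
    return segments
```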

2.1.3 Intermittent Decay Time Estimation

We fit a straight line characterized by slope Îł to each of the legitimate decays in the running energy envelope (on a dB scale). Then the decay time td for a given decaying region is approximately given by:

td ≈ −60 / Îł  (3)

Equation 3 estimates the time it takes for the detected signal to drop by 60 dB. As a 60-dB drop is usually not available, we use Equation 3 to find the approximate decay time.
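The line fit and Equation 3 can be sketched as below; the function name and noise floor are assumptions, and the slope Îł is in dB per second:

```python
import numpy as np

def decay_time_from_segment(env, start, stop, fs, floor=1e-12):
    """Fit a straight line to one envelope decay on a dB scale and
    extrapolate to a 60 dB drop (Equation 3): t_d = -60 / gamma."""
    t = np.arange(start, stop + 1) / fs
    level_db = 20.0 * np.log10(np.maximum(env[start:stop + 1], floor))
    gamma = np.polyfit(t, level_db, 1)[0]  # fitted slope in dB/s
    return -60.0 / gamma
```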

Depending on the number of legitimate decays detected, we build a set which stores all the decay times that have been observed in the running signal:

T = {td1, td2, td3, td4, ...}  (4)

Based on our assumption that the reverberation time (tRT) is the fastest decay among all the decays, tRT will be given by:

tRT = min(T)  (5)

Figure 1 depicts the complete blind reverberation time estimation process.

2.1.4 Reverberation Time Selection From Minimum Decay Times

While we are deriving minimum decay times from a running speech signal, it is important that, before designating a decay time as the reverberation time of the acoustic space, we obtain a sufficient recurrence of consistent decay rate estimates. For this purpose, we save all the minimum decay times, and if the number of minimum decay times which fall within a tolerance limit Δ is greater than η, then the final reverberation time is given by the average of those decay times. In our experiment, Δ is set to 0.1 and η is set to 5. These values are tuned based on multiple experimental runs. The results below illustrate the speed of convergence observed with this technique.
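One way to realize this selection rule is sketched below. The text does not specify how the tolerance window is anchored, so treating each estimate as a cluster center is our assumption; Δ = 0.1 and η = 5 are from the experiment:

```python
def select_reverberation_time(min_decay_times, eps=0.1, eta=5):
    """Among the accumulated minimum decay times, look for at least eta
    estimates that fall within a tolerance eps of some estimate; once
    found, report their average as the reverberation time. Returns None
    until enough consistent estimates have accumulated."""
    for t in min_decay_times:
        cluster = [u for u in min_decay_times if abs(u - t) <= eps]
        if len(cluster) >= eta:
            return sum(cluster) / len(cluster)
    return None
```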

2.2 Room Volume Prediction

A statistical approach similar to that of Shabtai et al. [8] was taken to predict the cubic volume of an unknown room from speech recordings. Acoustic features related to the volume of a room were calculated based on impulse response approximations from reverberant speech in known rooms. These features were used to train a predictive model.

2.2.1 Feature Extraction

Room impulse response measurements captured in a quiet room using reliable equipment would be ideal, but are not practical in a real-world setting. Here, a substitute for the room impulse response is taken as the signal following a speech utterance. While the speech signal does not resemble an impulse, this substitution will still provide some insight into the room’s impulse response characteristics. To find the temporal locations where speech fragments stop, a root mean square level detector was fit to the signal as described in Section 2.1.1. When no increases in level are detected over a certain time threshold, a “speech stop” was deemed found. An example of this process on sample speech audio is shown in Figure 2.


Fig. 2: Example of energy envelope based speech stop detection.

The acoustic features used were T60, C80, C50, density of early reflections, kurtosis in the time domain, density of low-frequency room modes, and kurtosis in the frequency domain. These features were chosen as their definitions are related to the room volume [9]. Frequency-dependent T60, C80, and C50 measures were calculated in octave bands from 20 Hz to 20 kHz.
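For reference, the clarity measures C50 and C80 are the early-to-late energy ratio of an impulse response in dB; this standard definition is assumed here (the function name is ours):

```python
import numpy as np

def clarity(ir, fs, t_ms):
    """Clarity index C_t in dB: ratio of the energy in the first t_ms
    milliseconds of the impulse response to the remaining energy.
    C50 and C80 use t_ms = 50 and 80 respectively."""
    n = int(fs * t_ms / 1000)
    early = np.sum(ir[:n] ** 2)
    late = np.sum(ir[n:] ** 2)
    return 10.0 * np.log10(early / late)
```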

The density of early reflections is related to the room volume, as there will be a higher number of early reflections in a small room compared to a large room. An approximation to the density of early reflections was made by calculating the number of local peaks in the first 100 ms of the impulse response. A high kurtosis value means that more of the signal variance is due to infrequent, large signal variations, while a low value means that there are frequent small signal fluctuations; hence the kurtosis is related to the room volume. The time-domain kurtosis was calculated over the first 100 ms of the estimated impulse response.

The density of low-frequency modes is related to the room volume, as a small room will have room modes starting at higher frequencies. The density of low-frequency room modes was estimated by calculating the number of local peaks in the lower 20% of the frequency spectrum. The frequency-domain kurtosis was also measured in the lower 20% of the frequency spectrum.
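Two of these features can be sketched as follows; the local-peak criterion and the use of the population (non-excess) kurtosis are our assumptions, with the 100 ms window taken from the text:

```python
import numpy as np

def volume_features(ir, fs):
    """Illustrative sketch of two Sec. 2.2.1 features.
    Early-reflection density: number of local maxima in the first
    100 ms of the impulse response. Time-domain kurtosis over the
    same window (high kurtosis = energy in a few large peaks)."""
    n = int(0.1 * fs)
    head = np.asarray(ir[:n], dtype=float)
    # a local peak exceeds both immediate neighbours
    peaks = np.sum((head[1:-1] > head[:-2]) & (head[1:-1] > head[2:]))
    centered = head - head.mean()
    kurt = np.mean(centered**4) / (np.mean(centered**2) ** 2)
    return int(peaks), float(kurt)
```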

2.2.2 Prediction

A Gaussian process regression model was used to predict the volume of an unknown room from speech data. Gaussian process regression is a supervised learning method which uses a measure of the similarity between points to predict the value of an unknown point from training data [10]. The extracted features and volumes of known rooms were used to train the model. The same features are extracted from a speech recording made in an unknown room, and the model is used to predict the volume of this room. Since real speech is continuous, multiple volume predictions can be made as time progresses; to achieve a more reliable estimate, a running average of these volume predictions is used.
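The training/prediction loop can be sketched with scikit-learn, the toolbox named in Section 3.2. The feature vectors and volumes below are hypothetical placeholders, and the RBF kernel choice is an assumption (the paper does not state its kernel):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical training data: one feature vector per known room and the
# room's cubic volume in m^3 (real features are those of Sec. 2.2.1).
X_train = np.array([[0.3, 10.0], [0.6, 6.0], [1.2, 3.0]])
y_train = np.array([50.0, 200.0, 800.0])

# Fit the Gaussian process regression model.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                               normalize_y=True).fit(X_train, y_train)

# Predict the volume of an "unknown" room from its extracted features;
# averaging successive predictions over time gives the running estimate.
prediction = gpr.predict(np.array([[0.6, 6.0]]))
```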

3 Experimental Setup and Results

3.1 Reverberation Decay Rate Prediction

To perform our experiments, we employ impulse responses from the Open AIR LibraryÂđ [11], convolve them with anechoic speech recordings and use the convolved outputs to blindly estimate the reverberation times. As reverberation times are frequency dependent, we filter the convolved outputs with an octave filter bank and run our blind estimation technique on each of the filtered output signals. Figure 3 shows the comparison between the actual reverberation times and the blind estimates for four different acoustic spaces.
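A sketch of this test pipeline for one octave band is given below; the Butterworth band-pass is an assumed stand-in for the paper's octave filter, and the function name is ours:

```python
import numpy as np
from scipy.signal import butter, sosfilt, fftconvolve

def band_limited_reverberant(speech, ir, fs, f_center):
    """Convolve anechoic speech with a room impulse response, then
    band-pass to one octave band centred on f_center (band edges at
    f_center / sqrt(2) and f_center * sqrt(2))."""
    reverberant = fftconvolve(speech, ir)
    lo, hi = f_center / np.sqrt(2.0), f_center * np.sqrt(2.0)
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, reverberant)
```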

Next, we will look at how fast the blind estimate converges towards stable reverberation values, as this is an important factor when designing real-time applications.

Fig. 3: Comparison between blind estimate and actual reverberation time

Âđ openairlib.net


Fig. 4: Convergence of reverberation time estimate (500 Hz frequency band)

Figure 4 illustrates the speed of convergence of the reverberation time estimate, in the four acoustic spaces considered in Figure 3.

3.2 Room Volume Prediction

The volume prediction method was tested with synthesized room impulse responses generated with the image source method using a MATLAB toolbox [12]. A total of 154 shoebox-shaped rooms ranging in volume from 8 mÂģ to 932 mÂģ were synthesized. The impulse responses were convolved with 12 recordings of anechoic speech, each 15 seconds in length. This resulted in 1848 total speech samples. The speech samples were then analyzed to find locations where the speech was stopped, resulting in an impulse response approximation. A total of 4380 room impulse response approximations were found, and the features mentioned in Section 2.2 were extracted. These features and the associated room volumes were used to train a Gaussian process regression model which was implemented in Python using the scikit-learn toolbox [13].

To test the prediction model, a set of 51 room impulse responses over the same range, but having different volumes, was generated. As with the training, these room impulse responses were convolved with anechoic speech data, speech stops were found, and the features were extracted. Volume predictions were made and then averages were taken for each speech signal used. Figure 5 shows volume predictions and the true labels over a range of volumes for speech signals of length 15 seconds.

Fig. 5: Room volume estimates compared to actual volumes, for various room sizes.

Figure 6 illustrates how the volume prediction evolves over time as more predictions are made. The top and bottom graphs show the average estimated volumes for 200 mÂģ and 400 mÂģ rooms, respectively. There were 56 volume predictions obtained over a 3.5 minute reverberant speech recording.

4 Summary and Discussion

We present a preliminary study exploring the feasibility of continuous blind estimation of the reverberation fingerprint of an unknown room by monitoring recorded speech signals. To this aim, we developed online variants of previously published methods for the blind estimation of a room’s cubic volume and of its sub-band reverberation time. We tested the proposed methods with recorded speech samples convolved with measured or calculated room impulse responses.

For reverberation time estimation, our initial results are encouraging, both in terms of convergence time and in terms of accuracy, especially in rooms having relatively long reverberation times. Our results suggest that an improvement of the proposed online blind estimation method would be beneficial in order to avoid underestimation bias for rooms having shorter reverberation times, and to improve accuracy at low frequencies.

Generalizing tentatively from the presented results, convergence to a close estimate of the actual reverberation time appears achievable after just a few seconds of speech activity in the room.


Fig. 6: Room volume estimate with increasing number of predictions, in two rooms having volumes 200 mÂģ (top) and 400 mÂģ (bottom).

Room volume estimation based solely on the analysis of speech recordings is a challenging task. The method explored here, based on a machine learning approach, nevertheless tentatively appears capable of converging, after 10 seconds of speech activity in the room, to within 25 percent of the actual volume (which translates to approximately a 1 dB error in reverberation power). Further experiments will be necessary in order to verify whether this objective is attainable in typical indoor environments.

References

[1] Naylor, P. and Gaubitch, N., Speech Dereverberation, Berlin, Germany: Springer-Verlag, 2010.

[2] Moore, A. H., Brookes, M., and Naylor, P. A., “Room Identification Using Roomprints,” in Proc. Audio Engineering Society 54th International Conference: Audio Forensics, 2014.

[3] Jot, J.-M. and Lee, K. S., “Augmented Reality Headphone Environment Rendering,” in Proc. Audio Engineering Society Conference on Audio for Virtual and Augmented Reality, 2016.

[4] Jot, J.-M., Cerveau, L., and Warusfel, O., “Analysis and synthesis of room reverberation based on a statistical time-frequency model,” in Proc. 103rd Audio Engineering Society Convention, 1997.

[5] Prego, T. d. M., de Lima, A. A., Netto, S. L., Lee, B., Said, A., Schafer, R. W., and Kalker, T., “A blind algorithm for reverberation-time estimation using subband decomposition of speech signals,” Journal of the Acoustical Society of America, 131(4), pp. 2811–2816, 2012.

[6] Eaton, J., Gaubitch, N. D., and Naylor, P. A., “Noise-robust reverberation time estimation using spectral decay distributions with reduced computational cost,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 161–165, 2013.

[7] Diether, S., Bruderer, L., Streich, A., and Loeliger, H.-A., “Efficient blind estimation of subband reverberation time from speech in non-diffuse environments,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 743–747, 2015.

[8] Shabtai, N., Rafaely, B., and Zigel, Y., “Room volume classification from reverberant speech,” in Proc. International Workshop on Acoustic Signal Enhancement, 2010.

[9] Shabtai, N. R., Zigel, Y., and Rafaely, B., “Room volume classification from room impulse response using statistical pattern recognition and feature selection,” Journal of the Acoustical Society of America, 128(3), pp. 1155–1162, 2010.

[10] Rasmussen, C. E. and Williams, C. K., Gaussian Processes for Machine Learning, volume 1, MIT Press, 2006.

[11] Murphy, D. T. and Shelley, S., “OpenAIR: An interactive auralization web resource and database,” in Proc. 129th Audio Engineering Society Convention, 2010.

[12] Wabnitz, A., Epain, N., Jin, C., and van Schaik, A., “Room acoustics simulation for multichannel microphone arrays,” in Proc. International Symposium on Room Acoustics, pp. 1–6, 2010.

[13] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, 12(Oct), pp. 2825–2830, 2011.
