
Study on method for protecting speech privacy by actively controlling speech transmission index in simulated room

Masashi Unoki, Yuta Kashihara, Maori Kobayashi, and Masato Akagi

School of Information Science, Japan Advanced Institute of Science and Technology

1-1 Asahidai, Nomi, Ishikawa 923–1292 Japan

E-mail: {unoki, kashihara, maori-k, akagi}@jaist.ac.jp

Abstract—Protecting speech privacy in a specific room is an important challenge in room acoustics. However, protecting people’s conversation from being overheard by an unintended listener, that is, making it not understandable, is difficult. This paper proposes a method for protecting speech privacy by actively controlling the speech transmission index (STI) in a simulated room containing an unintended listener. In this method, the STI in the simulated room can be controlled by manipulating the parameters of the simulated room impulse response (RIR). We can control the STI by convolving speech with the simulated RIR because the presentation of speech and additive delayed-manipulated speech can be regarded as the convolution of speech with late reverberation in the simulated room. Three experiments (word intelligibility, listening difficulty, and annoyance tests) were conducted to compare the proposed method with two conventional methods (noise masking and reverberation). The results showed that speech privacy can be protected by controlling the STI derived by manipulating the simulated RIR. The results also showed that the proposed method can protect the privacy of conversations as effectively as the other methods while using lower noise levels and shorter reverberation.

I. INTRODUCTION

Problems arising from our private conversations in semi-open spaces, e.g., with bank tellers, pharmacists, or health-care professionals, must be solved to ensure speech privacy [1], [2]. Essentially, a private conversation should be protected in these spaces by ensuring that unintended listeners cannot overhear it. Therefore, private conversations should not be intelligible or understandable to nearby unintended listeners. The most important thing is that these listeners are unable to understand the contents of these conversations.

Several methods have already been used for protecting speech privacy. The most straightforward method is to construct sound-proof spaces by partitioning off semi-open spaces [3]. However, this approach is not reasonable for protecting speech privacy because these spaces are physically constraining. A common approach is to mask the target speech using various kinds of noise such as pink or babble noise [4], [5]. This approach has been used in many practical applications, but it causes smearing that disturbs quiet environments because noise maskers are added. Donley et al. proposed a method for speech privacy between bright and quiet zones in multizone reproduction scenarios [6]. Space-time domain masker signals are used to protect speech privacy in bright zones, so a large number of loudspeakers is required. Another approach is to reverberate the target speech to reduce speech intelligibility [7], [8], [9]. However, extreme over-reverberation causes annoyance because of the difficulty of taking into account the exact room reverberation in semi-open spaces in advance.

Meanwhile, the speech transmission index (STI) [10] is an objective measure for evaluating speech quality in room acoustics and is the preferred method for predicting speech intelligibility for listeners in a room. The STI is calculated from the room impulse response (RIR) on the basis of the concept of the modulation transfer function (MTF) [11]. The STI correlates highly with listening difficulty [7], [12] and can be predicted from the observed signal in the room by using the stochastic RIR model [13]. For this reason, the STI can be used to assess the degree of speech privacy protection in these spaces.

The RIR generally consists of three parts: direct sound, early reverberation, and late reverberation. From our previous studies, we know that the STI is especially dominated by late reverberation. Thus, we speculate that the STI can be directly controlled by manipulating the parameters of the RIR model. Because directly controlling the room acoustics during real communication is inconvenient, physically controlling the STI in a room is difficult. Fortunately, we speculate that the STI can be controlled by convolving speech with a simulated RIR that we can manipulate, because the presentation of speech and additive delayed-manipulated speech can be regarded as the convolution of speech with late reverberation. Hence, we speculate that speech privacy in a simulated room can be controlled by manipulating the parameters of the RIR model in relation to the simulated room and by then controlling the derived STI.

In this paper, we present our investigation on the possibility of enabling speech privacy protection by actively controlling the STI related to the simulated RIR in which the parameters of the RIR model were manipulated. We also propose a method based on this approach for protecting speech privacy in a simulated room. The highlight of the proposed method is that it protects conversations by making them unintelligible and not understandable to unintended listeners by actively controlling the STI in the simulated room at any time.

Proceedings of APSIPA Annual Summit and Conference 2017, 12–15 December 2017, Malaysia

978-1-5386-1542-3@2017 APSIPA, APSIPA ASC 2017

Fig. 1. Block diagram of speech privacy protection by actively controlling speech transmission index.

II. METHOD OF SPEECH PRIVACY PROTECTION

A. Concept statement

Figure 1 shows a block diagram of protecting speech privacy by actively controlling the STI in a simulated room where an unintended listener is near speakers having a private conversation. It is assumed that the private conversation by the speakers, x(t), leaks out and that the unintended listener subsequently perceives the observed signal y(t) in the room where the actual RIR is h0(t). In this case, the observed signal y(t) consists of the direct sound x(t) ∗ h0(t) from the speaker and the delayed reverberant sound x(t) ∗ hL(t) as additive noise x̂(t), i.e.,

y(t) = x(t) ∗ h(t) = x(t) ∗ h0(t) + x(t) ∗ hL(t). (1)

The observed signal y(t) can be regarded as a convolution of the original sound x(t) with the simulated RIR h(t). Thus, the STI in the simulated room can be derived from the simulated RIR, h(t), and the degree to which the protected conversation is made unintelligible or not understandable is then predicted by using the actively controlled STI in the simulated room with the unintended listener. Because our method of speech privacy protection can blindly estimate the total STI in the simulated room from the observed signal y(t) at any time, it can also keep the actively controlled STI in the simulated room low, thereby making the speech unintelligible or not understandable to the unintended listener.
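The equivalence stated in Eq. (1) follows from the linearity of convolution and can be checked numerically. The following is a minimal sketch, not the authors' implementation: the sampling rate, the noise signal standing in for speech, and the decaying-noise late part are all illustrative assumptions, and h0(t) is taken as δ(t).

```python
import numpy as np

fs = 8000                                # assumed sampling rate (Hz)
rng = np.random.default_rng(0)

x = rng.standard_normal(fs // 2)         # 0.5 s of noise standing in for speech x(t)
h0 = np.array([1.0])                     # assumed direct path only, h0(t) = delta(t)

# Hypothetical late part: exponentially decaying white noise (illustration only).
tau = 0.05                               # delay time (s)
n_tau = int(round(tau * fs))
t = np.arange(int(0.25 * fs)) / fs
h_late = 0.1 * np.exp(-6.9 * t / 0.5) * rng.standard_normal(t.size)

# Simulated RIR h(t) = h0(t) + hL(t - tau), zero-padded to a common length.
n = max(h0.size, n_tau + h_late.size)
h = np.zeros(n)
h[:h0.size] += h0
h[n_tau:n_tau + h_late.size] += h_late

# Left-hand side of Eq. (1): a single convolution with the simulated RIR.
y_direct = np.convolve(x, h)

# Right-hand side of Eq. (1): leaked speech plus delayed, reverberated speech.
y_sum = np.zeros(x.size + n - 1)
leak = np.convolve(x, h0)
late = np.convolve(x, h_late)
y_sum[:leak.size] += leak
y_sum[n_tau:n_tau + late.size] += late

max_err = np.max(np.abs(y_direct - y_sum))
print(f"largest difference between both sides: {max_err:.2e}")
```

The two signals agree up to floating-point rounding, which is why adding delayed, manipulated speech to the leaked speech can be treated as changing the RIR itself.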

B. Manipulating room impulse response

In our approach, the simulated RIR is defined as

h(t) = h0(t) + hL(t− τ), (2)

where h0(t) is the actual RIR in the room, τ is the delay time, and hL(t) is the late reverberation. The h0(t) can be represented as the stochastic RIR. The hL(t) can also be modeled by using the extended RIR model as follows:

hL(t) = hext(t− t0), t0 ≥ 0 (3)

$$
h_{\mathrm{ext}}(t) =
\begin{cases}
a \exp(6.9\,t/T_h)\, c_h(t), & t < 0 \\
a \exp(-6.9\,t/T_t)\, c_h(t), & t \ge 0
\end{cases}
\tag{4}
$$

where a is a gain factor of late reverberation, Th is a parameter of the growth of the power envelope of the RIR, Tt is a parameter of the decay of the power envelope of the RIR, that is, the reverberation time, t0 is the global peak position of the RIR in the time domain, and ch(t) is a random variable such as a white noise carrier.
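Equations (3) and (4) can be sampled directly. The sketch below is an illustration under assumptions rather than the authors' code: the function name `extended_rir`, the Gaussian white-noise carrier, the sampling rate, and the parameter values in the usage lines are all hypothetical choices.

```python
import numpy as np

def extended_rir(a, T_h, T_t, t0, fs, duration, seed=0):
    """Sample hL(t) = hext(t - t0) of Eqs. (3)-(4) for t in [0, duration)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(round(duration * fs))) / fs
    u = t - t0                                  # time relative to the peak at t0
    # Growth with T_h before the peak (u < 0), decay with T_t after it (u >= 0).
    T = np.where(u < 0, T_h, T_t)
    env = a * np.exp(-6.9 * np.abs(u) / T)
    carrier = rng.standard_normal(t.size)       # assumed white-noise carrier ch(t)
    return t, env, env * carrier

# Illustrative values only: T_h = 0.1 s, T_t = 0.8 s, peak at t0 = 0.1 s.
T_t = 0.8
a = np.exp(-6.9 * 0.05 / T_t)                   # gain rule quoted for tau = 50 ms
t, env, h_L = extended_rir(a, T_h=0.1, T_t=T_t, t0=0.1, fs=16000, duration=1.0)

# The envelope should fall by about 60 dB over T_t seconds after the peak.
peak = int(np.argmax(env))
drop_db = 20 * np.log10(env[peak + int(T_t * 16000)] / env[peak])
print(f"envelope change after T_t: {drop_db:.1f} dB")
```

The factor 6.9 in the amplitude envelope corresponds to 13.8 in the power envelope, so Tt plays the role of a 60 dB reverberation time.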

The MTF of the simulated RIR, m(fm), can be represented as

m(fm) = g0 ·m0(fm) + gL ·mL(fm, Th, Tt), (5)

where g0 and gL are the weights, $g_0 = \int h_0^2(t)\,dt \,/ \int h^2(t)\,dt$ and $g_L = \int h_L^2(t)\,dt \,/ \int h^2(t)\,dt$. Here, m0(fm) is the MTF of h0(t) and mL(fm, Th, Tt) is the MTF of hL(t), which can be represented as

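$$
m_L(f_m, T_h, T_t) = \frac{1}{\sqrt{\left(1 + \left(2\pi f_m \dfrac{T_h}{13.8}\right)^2\right)\left(1 + \left(2\pi f_m \dfrac{T_t}{13.8}\right)^2\right)}}. \tag{6}
$$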
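Equation (6) is a closed-form expression and is easy to evaluate. The sketch below is a direct transcription; the function name `mtf_late` and the modulation frequencies printed in the loop are illustrative choices, while Th = 0.1 s and the Tt values are those quoted for Fig. 3.

```python
import math

def mtf_late(f_m, T_h, T_t):
    """Eq. (6): MTF of the extended RIR at modulation frequency f_m (Hz)."""
    growth = 1.0 + (2.0 * math.pi * f_m * T_h / 13.8) ** 2
    decay = 1.0 + (2.0 * math.pi * f_m * T_t / 13.8) ** 2
    return 1.0 / math.sqrt(growth * decay)

# Tabulate m_L for T_h = 0.1 s at a few modulation frequencies.
for T_t in (0.09, 0.8, 1.8, 3.9, 10.5):
    row = [round(mtf_late(f, 0.1, T_t), 3) for f in (0.63, 2.0, 12.5)]
    print(f"T_t = {T_t:5.2f} s -> m_L at 0.63, 2.0, 12.5 Hz: {row}")
```

For Th approaching zero the expression reduces to the familiar single-pole form 1/sqrt(1 + (2π fm Tt/13.8)^2), and a longer Tt always lowers the modulation index at a given fm.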

The infinite impulse response (IIR) filter of the MTF, Eh(z), is derived by using the impulse invariant method as

$$
E_h(z) = E_{h_0}(z) + \frac{g_L(\alpha - \beta)\, z^{-(t_0+\tau) f_s}}{(1 - \alpha z^{-1})(1 - \beta z^{-1})}, \tag{7}
$$

where Eh0(z) is the IIR filter of the MTF of h0(t), fs is the sampling frequency, $n_0 = t_0/f_s$, $\alpha = a^2 \exp(-13.8/(T_t f_s))$, and $\beta = a^2 \exp(13.8/(T_h f_s))$.



TABLE I
RELATIONSHIP BETWEEN STI AND SPEECH QUALITY.

Quality   Bad         Poor        Fair        Good        Excellent
STI       0.00–0.29   0.30–0.44   0.45–0.59   0.60–0.74   0.75–1.00

Fig. 2. Example of indirectly manipulated RIR model: (a) power envelope and (b) corresponding h(t). [Labels in the figure: Th, Tt, τ, δ(t), and a.]

C. Calculation of Speech Transmission Index

The correspondence between the STI and the quality of speech transmission in room acoustics is summarized in Table I. For example, the quality of speech transmission is “Bad” when the STI is 0.00 and “Excellent” when the STI is 1.00.

The method of calculating the STI has been standardized in IEC 60268-16 [10]. It can be summarized in five steps: (1) calculating MTFs in seven octave bands, (2) calculating SNRs from the MTFs, (3) calculating transmission indices (TIs) by normalizing the SNRs, (4) calculating modulation transmission indices (MTIs) by averaging the TIs, and (5) calculating the STI as a weighted sum of the MTIs.
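The five steps can be sketched compactly as follows. This is a simplified illustration, not a conforming implementation of the standard: the function name `sti_from_mtf` is hypothetical, the octave-band weights are assumed equal, and any corrections the standard applies beyond the five listed steps are omitted. The ±15 dB limit and the 14 modulation frequencies follow common STI practice.

```python
import math

# 14 modulation frequencies (Hz) in one-third-octave steps from 0.63 to 12.5 Hz.
MOD_FREQS = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
             3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]

def sti_from_mtf(mtf_per_band, band_weights=None):
    """mtf_per_band: one list of modulation indices in [0, 1] per octave band."""
    n_bands = len(mtf_per_band)
    if band_weights is None:
        # ASSUMPTION: equal weights; the standard prescribes band-specific ones.
        band_weights = [1.0 / n_bands] * n_bands
    mtis = []
    for m_values in mtf_per_band:                    # step 1: MTF per band
        tis = []
        for m in m_values:
            m = min(max(m, 1e-6), 1.0 - 1e-6)        # keep the logarithm finite
            snr = 10.0 * math.log10(m / (1.0 - m))   # step 2: apparent SNR (dB)
            snr = min(max(snr, -15.0), 15.0)         # limit to +/-15 dB
            tis.append((snr + 15.0) / 30.0)          # step 3: TI in [0, 1]
        mtis.append(sum(tis) / len(tis))             # step 4: MTI per band
    return sum(w * mti for w, mti in zip(band_weights, mtis))  # step 5: STI

# Usage: a modulation index of 0.5 everywhere means 0 dB apparent SNR.
flat = [[0.5] * len(MOD_FREQS) for _ in range(7)]
print(sti_from_mtf(flat))
```

A modulation index of 0.5 in every band and at every modulation frequency maps to an apparent SNR of 0 dB and hence to the middle of the STI scale.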

Figure 2(a) shows, as an example, the power envelope of the RIR composed of direct sound and late reverberation, where h0(t) is the direct path δ(t) (no reverberation). Figure 2(b) shows the corresponding RIR. Here, a = exp(−6.9 · 0.05/Tt), where τ is 50 ms, which is related to Deutlichkeit (D50) [14]. In this case, the predicted MTF can be represented as a function of Th and Tt in Eq. (5).

Figure 3 shows the predicted MTF as a function of Th and Tt. In this figure, the dashed curves show the MTF at Th = 0.0001 s and the solid curves show the MTF at Th = 0.1 s. Parameter a was set to a = exp(−6.9τ/Tt), where τ = 50 ms, which is related to Deutlichkeit (D50). Figure 4 shows the predicted STI as a function of Th and Tt obtained by using the predicted MTF. The results showed that the two parameters of the extended RIR model, Th and Tt, can control the STI in the range from 0.23 to 1.0. For example, the STI was 0.23 when Th = 0.1 s and Tt = 10.5 s in this figure.

Fig. 3. Predicted MTF as a function of Th and Tt. [Horizontal axis: modulation frequency Fm (Hz), 0 to 20; vertical axis: modulation index, 0 to 1; curves for Tt = 0.09, 0.8, 1.8, 3.9, and 10.5 s.]

Fig. 4. Predicted STI as a function of Th and Tt. [Annotated point: Th = 0.1 s, Tt = 10.5 s.]

D. Actively controlling Speech Transmission Index

As shown in Sects. II-B and II-C, our method can control the STI of the simulated RIR, h(t) in Eq. (2), in a space having an unintended listener by manipulating the parameters (τ, Th, and Tt) of the extended RIR as late reverberation. In addition, the STI estimation [15], [16], [17] in the proposed method shown in Fig. 1 can estimate the STI from the observed signal y(t) at the unintended listener’s position. The current STI before control is labeled STI0. The total STI after control is labeled STIcnt. The target STI for protecting speech privacy is labeled STItgt. Our method can then actively control the estimated STI for the unintended listener, STIcnt, to make it match the target STI, STItgt, given the current STI, STI0, by manipulating the parameters of the extended RIR, as shown in Fig. 1. In these evaluations, STItgt was set to 0.23 (“Bad” quality) for protecting speech privacy, with STI0 equal to 1.0.
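One way to picture the control step is as a one-dimensional search over Tt until a predicted STI reaches STItgt. The sketch below is only an illustration of that idea and is not the authors' control scheme: `toy_sti` is a hypothetical stand-in predictor that uses Eq. (6) alone (it ignores the direct-path term of Eq. (5), uses a single band, and averages the TIs with equal weights), and `tune_decay_time` is a plain bisection that assumes the prediction decreases as Tt grows.

```python
import math

MOD_FREQS = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
             3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]

def toy_sti(T_t, T_h=0.1):
    """Hypothetical predictor: STI-like value from Eq. (6) only."""
    tis = []
    for f in MOD_FREQS:
        m = 1.0 / math.sqrt((1 + (2 * math.pi * f * T_h / 13.8) ** 2)
                            * (1 + (2 * math.pi * f * T_t / 13.8) ** 2))
        snr = 10.0 * math.log10(m / (1.0 - m))       # apparent SNR (dB)
        tis.append((min(max(snr, -15.0), 15.0) + 15.0) / 30.0)
    return sum(tis) / len(tis)

def tune_decay_time(sti_target, predict=toy_sti, lo=0.01, hi=20.0, tol=1e-4):
    """Bisection on T_t; assumes predict(T_t) decreases as T_t grows."""
    if not predict(hi) <= sti_target <= predict(lo):
        raise ValueError("target STI is outside the reachable range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if predict(mid) > sti_target:
            lo = mid          # still too intelligible: lengthen the decay
        else:
            hi = mid          # at or below the target: shorten the decay
    return 0.5 * (lo + hi)

T_t = tune_decay_time(sti_target=0.23)
print(f"T_t = {T_t:.2f} s gives a predicted value of {toy_sti(T_t):.3f}")
```

Because the toy predictor omits the direct sound, the Tt it returns is not expected to match the values reported in the paper; in the proposed system the prediction would come from the full model of Eq. (5) or from the blind STI estimate of y(t).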



Fig. 5. Results of verification test of proposed method: (a) word intelligibility, (b) listening difficulty, and (c) annoyance. [Each panel: horizontal axis STI (0.23, 0.375, 0.525, 0.675, 0.875, 1); vertical axis from 0 to 1 (word intelligibility, listening difficulty rate, and annoyance rate, respectively); curves for WF = 1.0–2.5, 2.5–4.0, 4.0–5.5, and 5.5–7.0.]

III. EVALUATIONS

A. Measures

Three measures (word intelligibility, listening difficulty, and annoyance) were used to assess the ability of the system to ensure speech privacy. Three corresponding experiments (word intelligibility, listening difficulty, and annoyance tests) were conducted to determine whether or not speech privacy can be protected using our method.

In the word intelligibility test, subjects were asked to repeat words by typing characters as they listened. Word intelligibility was calculated by counting the number of words correctly heard and converting the count to a rate.

In the listening difficulty test, the subjects were asked to choose one of four levels: (I) Not difficult, (II) Slightly difficult, (III) Fairly difficult, and (IV) Extremely difficult. The listening difficulty rate (LDR) is defined as

$$
\mathrm{LDR} = \frac{N - \mathrm{Count(I)}}{N}, \tag{8}
$$

where N is the total number of stimuli and Count(I) is the number of stimuli rated “Not difficult.”

In the annoyance test, the subjects were asked to choose one of four levels: (i) Not annoying, (ii) Slightly annoying, (iii) Moderately annoying, and (iv) Extremely annoying. The annoyance rate (ANR) is defined as

$$
\mathrm{ANR} = \frac{N - \mathrm{Count(i)}}{N}, \tag{9}
$$

where Count(i) is the number of stimuli rated “Not annoying.”
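Equations (8) and (9) share the same form: the fraction of stimuli that were not given the lowest rating. A minimal sketch follows; the function name `rate_above_lowest` and the response lists are made-up examples, not data from the experiments.

```python
from collections import Counter

def rate_above_lowest(responses, lowest_label):
    """Fraction of stimuli NOT rated with the lowest category, Eqs. (8)-(9)."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    n = len(responses)
    return (n - counts[lowest_label]) / n

# Hypothetical ratings for ten stimuli (illustration only).
difficulty = ["I", "II", "I", "IV", "III", "I", "II", "II", "I", "III"]
annoyance = ["i", "i", "ii", "i", "iv", "i", "i", "iii", "i", "i"]

ldr = rate_above_lowest(difficulty, "I")   # 4 of 10 rated "I"  -> (10 - 4) / 10
anr = rate_above_lowest(annoyance, "i")    # 7 of 10 rated "i"  -> (10 - 7) / 10
print(ldr, anr)
```

Both rates therefore lie between 0 (every stimulus rated in the lowest category) and 1 (no stimulus rated in the lowest category).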

B. Stimuli

The test stimuli were chosen from the Familiarity-controlled Word-lists 2007 (FW07) [18]. The words were composed of four morae. The stimuli were spoken by a male speaker (mya), and the word familiarity (WF) ranged from 1.0 to 7.0. Twenty word-lists with the same WF were randomly selected for each subject. All speech signals had a sampling frequency of 48 kHz.

C. Subjects

Ten males and one female aged between 23 and 31 participated in the experiments. All listeners had normal hearing and were native speakers of Japanese.

IV. RESULTS

A. Test of proposed method

Three experiments were conducted to determine whether or not speech privacy can be protected by controlling the STI in our method. Five RIRs were generated using Eq. (2), where h0(t) = δ(t), and the corresponding STIs were 0.875, 0.675, 0.525, 0.375, and 0.230. Test stimuli were generated by convolving the speech signals in FW07 with the RIRs. The total number of words was 1200 (5 STI conditions × 4 WF conditions × 20 word-lists × 3 experiments).

The experimental results are shown in Fig. 5. The four curves correspond to the WF categories. Figure 5(a) shows that, overall, word intelligibility decreased as the STI decreased. These trends depended on the WF categories. These results demonstrated that word intelligibility can be controlled by manipulating the STI for each word familiarity category.



Figure 5(b) shows that the listening difficulty rate (LDR) increased and then saturated at 1.0 as the STI decreased. These trends did not depend on the WF categories. Therefore, the results demonstrate that listening difficulty can be controlled by manipulating the STI.

Figure 5(c) shows that the annoyance rate (ANR) drastically increased and then saturated at 1.0 as the STI decreased. These trends did not depend on the WF categories. Therefore, the results demonstrate that annoyance can be controlled by manipulating the STI.

B. Performances in base and simulated tests

Three experiments were conducted to evaluate how well our method can protect speech privacy in comparison with two other methods (masking and reverberation) in the free field (base test) and with a real RIR (simulated test). In the first test, h0(t) was set to δ(t) (free field as the base test). In the masking method, pink noise was used to protect speech privacy, and the SNR was set so that the STI was 0.23. In the reverberation method, Schroeder’s RIR was used to protect speech privacy, and the STI of this RIR was 0.23. The test stimuli were generated by convolving speech signals in FW07 with RIRs or by adding pink noise to the speech signals. To reduce the cost, two WF conditions (highest and lowest) were used. The total number of words was 360 (1 STI condition × 3 methods × 2 WF conditions × 20 word-lists × 3 experiments).

The experimental results in the base test are shown in Fig. 6. In these cases, all figures were redrawn as a function of the SNR instead of the STI, where the SNR is derived using the following equation:

$$
\mathrm{SNR} = 10 \log_{10} \frac{m(f_m)}{1 - m(f_m)}. \tag{10}
$$
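Equation (10) and its inverse are short enough to evaluate by hand or in code. The sketch below is illustrative; the function names `snr_from_m` and `m_from_snr` and the sample values are assumptions chosen for the example.

```python
import math

def snr_from_m(m):
    """Eq. (10): apparent SNR in dB for a modulation index 0 < m < 1."""
    return 10.0 * math.log10(m / (1.0 - m))

def m_from_snr(snr_db):
    """Inverse of Eq. (10): modulation index for a given SNR in dB."""
    return 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))

for m in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"m = {m:4.2f} -> SNR = {snr_from_m(m):6.2f} dB")
```

A modulation index of 0.5 corresponds to 0 dB, and the mapping is antisymmetric about that point: m and 1 − m give SNRs of equal magnitude and opposite sign.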

In the figure, the blue and red symbols indicate the results for the lowest WF and highest WF, respectively. In almost all the results, the protection performances of the three methods were almost the same. However, our method had the highest SNR (lowest noise level) for protecting speech privacy.

The experimental results in the second (simulated) test are shown in Fig. 7. In this case, the real RIR (#26, T60 = 0.62 s, meeting room (130 m³) in the SMILE2004 dataset [19]) was used as h0(t). The other conditions were the same as those in the base test. We found that the same tendency can be seen in all tests. Therefore, our method can improve the SNR by 5 dB in comparison with the noise masking method while effectively ensuring speech privacy.

V. CONCLUSION

We investigated whether or not speech privacy in the simulated room can be protected by controlling the speech transmission index (STI) of the extended room impulse response (RIR) model related to late reverberation. The results of three experiments on word intelligibility, listening difficulty, and annoyance demonstrated that word intelligibility is decreased and listening difficulty and annoyance are increased by manipulating the RIR and thereby reducing the STI. A comparison with two other methods (noise masking and reverberation) revealed that the proposed method offers the same protection for speech privacy, while the required SNR is 5 dB higher than with noise masking. These results suggest that speech privacy can be protected by actively controlling the STI in the simulated room.

ACKNOWLEDGEMENTS

This work was supported by a Grant-in-Aid for Challenging Exploratory Research (No. 16K12458) and Innovative Areas (No. 16H01669) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan, and by the Secom Science and Technology Foundation.

REFERENCES

[1] Cavanaugh, W. J., Farrell, W. R., Hirtle, P. W., and Watters, B. G., “Speech privacy in buildings,” J. Acoust. Soc. Am., vol. 34, no. 4, pp. 475–492, 1962.

[2] Sato, H. and Shimizu, Y., “History and recent topics of studies on speech privacy,” J. Acoust. Soc. Jpn., vol. 68, no. 8, pp. 475–480, 2008 (written in Japanese with English abstract).

[3] Lee, P. J. and Jeon, J. Y., “A laboratory study for assessing speech privacy in a simulated open-plan office,” Indoor Air, vol. 24, no. 3, pp. 307–314, 2014.

[4] Saeki, T., Yamaguchi, S., and Tamesue, T., “Study on achieving speech privacy using masking noise,” J. Acoust. Soc. Jpn., vol. 61, no. 10, pp. 571–575, 2005 (written in Japanese with English abstract).

[5] Chanaud, R. C., “Progress in sound masking,” Acoustics Today, vol. 3, no. 4, pp. 21–26, 2007.

[6] Donley, J., Ritz, C., and Kleijn, W. B., “Improving speech privacy in personal sound zones,” Proc. ICASSP2016, pp. 311–315, 2016.

[7] Morimoto, M., Sato, H., and Kobayashi, M., “Using listening difficulty ratings of conditions for speech communication in rooms,” J. Acoust. Soc. Am., vol. 116, pp. 1607–1613, 2004.

[8] Sato, H., Bradley, J. S., and Morimoto, M., “Using listening difficulty ratings of conditions for speech communication in rooms,” J. Acoust. Soc. Am., vol. 117, no. 3, pp. 1157–1167, 2005.

[9] Hioka, Y., Tang, J. W., and Wan, J., “Effect of adding artificial reverberation to speech-like masking sound,” Applied Acoustics, vol. 114, pp. 171–178, 2016.

[10] IEC 60268-16:2003, Sound system equipment – Part 16: Objective rating of speech intelligibility by speech transmission index.

[11] Houtgast, T. and Steeneken, H. J. M., “The modulation transfer function in room acoustics as a predictor of speech intelligibility,” Acustica, vol. 28, pp. 66–73, 1973.

[12] Morimoto, M., Sato, H., and Kobayashi, M., “Listening difficulty as a subjective measure for evaluation of speech transmission performance in public spaces,” J. Acoust. Soc. Am., vol. 116, pp. 1607–1613, 2004.

[13] Kashihara, Y. and Unoki, M., “Study on speech transmission index using models of modulation transfer function,” IEICE Technical Report, EA 2016-33, pp. 13–18, 2016.

[14] Kuttruff, H., “Room Acoustics,” 3rd ed. (Elsevier Science Publishers Ltd., London), 1991.

[15] Unoki, M., Sasaki, K., Miyauchi, R., Akagi, M., and Kim, N. S., “Blind method of estimating speech transmission index from reverberant speech signals,” Proc. EUSIPCO2013, Marrakech, Morocco, Sept. 2013.

[16] Miyazaki, A., Morita, S., and Unoki, M., “Study on blind method of estimating speech transmission index from noisy reverberant amplitude-modulated signals,” J. Signal Processing, vol. 18, no. 4, pp. 201–204, July 2014.

[17] Unoki, M., Morita, S., Miyazaki, A., and Akagi, M., “Preliminary study on blind estimation of room acoustic parameters in noisy reverberant environments,” Proc. 12th Western Pacific Acoustics Conference 2015 (WESPAC2015), pp. 428–435, Singapore, Dec. 2015.

[18] Kondo, T., Sakamoto, S., Amano, S., and Suzuki, Y., “Compensation for list-difference of word intelligibility by conditioning signal-to-noise ratio: Validation by using the familiarity-controlled word lists 2007 (FW07),” J. Acoust. Soc. Jpn., vol. 69, no. 5, pp. 224–231, 2013 (written in Japanese with English abstract).

[19] Architectural Institute of Japan, “Sound library of architecture and environment,” Gihodo Shuppan Co., Ltd., Tokyo, 2004.



Fig. 6. Results of base test of proposed method in comparison with two other methods (masking and reverberation): (a) word intelligibility, (b) listening difficulty, and (c) annoyance.

Fig. 7. Results of simulated test of proposed method in comparison with two other methods (masking and reverberation): (a) word intelligibility, (b) listening difficulty, and (c) annoyance.


