
Speech Communication 83 (2016) 42–53

Contents lists available at ScienceDirect

Speech Communication

journal homepage: www.elsevier.com/locate/specom

Near-field signal acquisition for smartglasses using two acoustic vector-sensors

Dovid Y. Levin a,∗, Emanuël A.P. Habets b,1, Sharon Gannot a

a Bar-Ilan University, Faculty of Engineering, Building 1103, Ramat-Gan, 5290002, Israel
b International Audio Laboratories Erlangen, Am Wolfsmantel 33, 91058 Erlangen, Germany

Article info

Article history:

Received 21 February 2016

Accepted 12 July 2016

Available online 18 July 2016

PACS:

43.60.Fg

43.60.Mn

43.60.Hj

Keywords:

Beamforming

Acoustic vector-sensors

Smartglasses

Adaptive signal processing

Abstract

Smartglasses, in addition to their visual-output capabilities, often contain acoustic sensors for receiving the user's voice. However, operation in noisy environments may lead to significant degradation of the received signal. To address this issue, we propose employing an acoustic sensor array which is mounted on the eyeglasses frames. The signals from the array are processed by an algorithm with the purpose of acquiring the desired near-field speech signal produced by the wearer while suppressing noise signals originating from the environment. The array is comprised of two acoustic vector-sensors (AVSs) which are located at the fore of the glasses' temples. Each AVS consists of four collocated subsensors: one pressure sensor (with an omnidirectional response) and three particle-velocity sensors (with dipole responses) oriented in mutually orthogonal directions. The array configuration is designed to boost the input power of the desired signal, and to ensure that the characteristics of the noise at the different channels are sufficiently diverse (lending towards more effective noise suppression). Since changes in the array's position correspond to the desired speaker's movement, the relative source-receiver position remains unchanged; hence, the need to track fluctuations of the steering vector is avoided. Conversely, the spatial statistics of the noise are subject to rapid and abrupt changes due to sudden movement and rotation of the user's head. Consequently, the algorithm must be capable of rapid adaptation toward such changes. We propose an algorithm which incorporates detection of the desired speech in the time-frequency domain, and employs this information to adaptively update estimates of the noise statistics. The speech detection plays a key role in ensuring the quality of the output signal. We conduct controlled measurements of the array in noisy scenarios. The proposed algorithm performs favorably with respect to conventional algorithms.

© 2016 Elsevier B.V. All rights reserved.

∗ Corresponding author.
E-mail addresses: [email protected] (D.Y. Levin), [email protected] (E.A.P. Habets), [email protected] (S. Gannot).
1 A joint institution of the Friedrich-Alexander-University Erlangen-Nürnberg (FAU) and Fraunhofer IIS, Germany.

http://dx.doi.org/10.1016/j.specom.2016.07.002
0167-6393/© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Recent years have witnessed an increased interest in wearable computers (Barfield, 2016; Randell, 2005). These devices consist of miniature computers worn by users which can perform certain tasks; the devices may incorporate various sensors and feature networking capabilities. For example, a smartwatch may be used to display email messages, aid in navigation, and monitor the user's heart rate (in addition to functioning as a timepiece).

One specific type of wearable computer which has garnered much attention is the smartglasses, a device which displays computer-generated information supplementing the user's visual field. A number of companies have been conducting research and development towards smartglasses intended for consumer usage (Cass and Choi, 2015) (e.g., Google Glass (Ackerman, 2013) and Microsoft HoloLens). In addition to their visual-output capabilities, smartglasses may incorporate acoustic sensors. These sensors are used for hands-free mobile telephony applications, and for applications using a voice-control interface to convey commands and information to the device.

The performance of both of these applications suffers when operating in a noisy environment: in telephony, noise degrades the quality of the speech signal transmitted to the other party; similarly, the accuracy of automatic speech recognition (ASR) systems is reduced when the desired speech is corrupted by noise. A review of one prominent smartglasses prototype delineated these two issues as requiring improvement (Sung, 2014).

To deal with these issues, we propose a system for the acquisition of the desired near-field speech in a noisy environment. The system is based on an acoustic array embedded in eyeglasses


frames worn by the desired speaker. The multiple signals received by the array contain both desired speech as well as undesired components. These signals are processed by an adaptive beamforming algorithm to produce a single output signal with the aim of retaining the desired speech with little distortion while suppressing undesired components.

The scenario of a glasses-mounted array presents some challenging features which are not encountered in typical speech processing. Glasses frames constitute a spatially compact platform, with little room to spread the sensors out. Typically, when sensors are closely spaced the statistical qualities of the noise at each sensor are highly correlated, presenting difficulties in robust noise suppression (Bitzer and Simmer, 2001). Hence, special care must be taken in the design of the array.

The proposed array consists of two AVSs located, respectively, at the fore of the glasses' right and left temples. In contrast to conventional sensors that measure only the pressure component of a sound field (which is a scalar quantity), an AVS measures both the pressure and particle-velocity components. An AVS consists of four subsensors with different spatial responses: one omnidirectional sensor (corresponding to pressure) and three orthogonally oriented dipole sensors (corresponding to the components of the particle-velocity vector). Hence, the array contains a total of eight channels (four from each AVS). Since each subsensor possesses a markedly different spatial response, the statistical properties of the noise at the different subsensors are diverse. Consequently, robust beamforming is possible in spite of the limited spatial aperture. Another advantage afforded by the use of AVSs is that the dipole sensors amplify near-field signals more so than conventional omnidirectional sensors. Due to these sensors, the desired speech signal (which is in the near-field due to the proximity to the sensors) undergoes a relative gain and is amplified with respect to the noise. The relative gain is explained and quantified in Section 2. The interested reader is referred to the Appendix for further information on AVSs.

The configuration in which the array is mounted on the speaker's glasses differs from the typical scenario in which a microphone array is situated in the environment of the user. The glasses configuration possesses particular properties which lead to a number of benefits with respect to processing: (i) The close proximity of the desired source to the sensors leads to a high signal-to-noise ratio (SNR), which is favorable. (ii) For similar reasons, the reverberation of the desired speech is negligible with respect to its direct component, rendering dereverberation a nonissue. (iii) Any change in the location of the desired source brings about a corresponding movement of the array which is mounted thereon. Consequently, the relative source-sensors configuration is essentially constant, precluding the need for tracking changes of the desired speaker's position.

Conversely, the glasses-mounted configuration presents a specific challenge. The relative positions of the undesired acoustic sources with respect to the sensor array are liable to change rapidly. For instance, when the user rotates his/her head the relative position of the array to external sound sources undergoes significant and abrupt changes. This necessitates that the signal processing stage be capable of swift adaptation.

The proposed algorithm is based on minimum variance distortionless response (MVDR) beamforming, which is designed to minimize the residual noise variance under the constraint of maintaining a distortionless desired signal. This type of beamforming was proposed by Capon (1969) in the context of spatial spectrum analysis of seismic arrays. Frost (1972) employed this idea in the field of speech processing using a time-domain representation of the signals. Later, Gannot et al. (2001) recast the MVDR beamformer in the time-frequency domain. In the current work, we adopt the time-frequency formulation.

In the proposed algorithm, the noise covariance matrix is adaptively estimated on an ongoing basis from the received signals. Since the received signals contain both desired and undesired components, the covariance matrix obtained from a naive implementation would contain significant contributions of energy from the desired speech. This is detrimental to the performance of the processing. To prevent desired speech from contaminating the noise covariance estimation, a speech detection component is employed. Time-frequency bins which are deemed likely to contain desired speech are not used for estimating the noise covariance.

To further reduce noise, the output of the MVDR stage undergoes post-processing by a single-channel Wiener filter (SWF). It has been shown (Simmer et al., 2001) that application of MVDR beamforming followed by a SWF is optimal in the sense of minimizing the mean square error (MSE) [since it is equivalent to the multichannel Wiener filter (MWF)].

The paper is structured as follows: Section 2 describes the motivation guiding our specific array design. In Section 3, we introduce the notation used to describe the scenario in which the array operates and then present the problem formulation. Section 4 presents the proposed algorithm and how its various components interrelate. Section 5 evaluates the performance of the proposed algorithm, and Section 6 concludes with a brief summary.

2. Motivation for array design

In this section, we discuss the considerations which lead to our choices for the placement of the sensors and the types of sensors used.

Fig. 1. The proposed sensor locations are indicated in red. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

An AVS is located at the fore of each of the glasses' temples (see Fig. 1). The reason for selecting this location is that there is a direct "line of sight" path from the speaker's mouth to the sensors. For other locations on the frames, such as the temples' rear sections or the areas above the lenses, the direct path is obstructed by human anatomy or the physical structure of the glasses. The areas underneath the lenses were also considered as they do have an unobstructed line to the mouth; however, embedding a microphone array at this locale was deemed to render the resulting frame structure too cumbersome.

Choosing an AVS-based array, rather than using conventional sensors, leads to several advantages. Firstly, the inherent directional properties of an AVS lend to the distinction between the desired source and sound arriving from other directions. In contrast, a linear arrangement of conventional omnidirectional sensors along a temple of the glasses frame would exhibit a degree of directional ambiguity – it is known that the response of such linear arrays maintains a conical symmetry (Van Trees, 2002).



Secondly, an AVS performs well with a compact spatial configuration, whereas conventional arrays suffer from low robustness when array elements are closely spaced (Bitzer and Simmer, 2001). Although this problem could be alleviated by allowing larger spacing between elements, this would necessitate placing sensors at the rear of the temple with no direct path to the source. Thirdly, the near-field frequency response of dipoles amplifies lower frequencies. This effect, which results from the near-field acoustic impedance, tends to increase the SNR since noise originating in the far-field does not undergo this amplification.

To illustrate this last point, we consider the sensors' frequency responses to an ideal spherical wave² (Pierce, 1991). The response of the monopole sensors is proportional to $1/r$, where $r$ is the distance from the wave's origin (i.e., they have a flat frequency response). The response of the dipole elements is proportional to $\frac{1}{r}\left(1 + \frac{c}{r}\frac{1}{j\omega}\right)$, where $c$ is the velocity of sound propagation and $\omega$ is the angular frequency. Consequently, the dipoles have a relative gain of $\left(1 + \frac{c}{r}\frac{1}{j\omega}\right)$ over an omnidirectional sensor. This becomes particularly significant at short distances and lower frequencies where $r \ll c/\omega$. Stated differently, when the distance is significantly shorter than the wavelength, dipole sensors exhibit noticeable gain.

² The actual propagation is presumably more complicated than an ideal spherical model and is difficult to model precisely. For instance, the human mouth does not radiate sound uniformly in all directions. Furthermore, the structure of the human face may lead to some diffraction and reflection. Nevertheless, the ideal spherical model is useful as it depicts overall trends.
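To make the trend concrete, the following is a minimal numerical sketch (ours, not from the paper) of this relative gain, evaluated at the approximate 10.5 cm mouth-to-sensor distance reported in Section 5; the frequency grid is arbitrary.

```python
import numpy as np

# Relative gain of a dipole over a monopole for an ideal spherical wave,
# |1 + c/(j*omega*r)|, per the derivation in Section 2. Values illustrative.
c = 343.0                                   # speed of sound [m/s]
r = 0.105                                   # source-sensor distance [m]
f = np.array([100.0, 500.0, 1000.0, 3270.0, 8000.0])  # frequency [Hz]
omega = 2 * np.pi * f
gain_db = 20 * np.log10(np.abs(1 + c / (1j * omega * r)))
for fi, g in zip(f, gain_db):
    # gain grows as the wavelength exceeds r, i.e., toward low frequencies
    print(f"{fi:7.0f} Hz: {g:5.1f} dB")
```

At 100 Hz the sketch yields roughly 14.5 dB of relative gain, while near 3.3 kHz (where the wavelength matches r) the gain has essentially vanished, in line with the sub-wavelength argument above.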

3. Notation and problem formulation

This section presents the scenario in which the array operates and the notation used to describe it. The problem formulation is then presented using this notation.

Let us denote the clean source signal as $s[n]$, and let $z_1[n], \ldots, z_P[n]$ denote the $P$ interference signals. These signals propagate from their respective sources to the sensors and may also undergo reflections inducing reverberation. These processes are modeled as linear time-invariant (LTI) systems represented by impulse responses. Let $h_m[n]$ denote the response of the $m$-th sensor to an impulse produced by the desired source and $g_{m,p}[n]$ denote the impulse response from the $p$-th undesired source to the $m$-th sensor. Each of the $M = 8$ sensors is also subject to ambient noise and internal sensor noise; these will be denoted $e_m[n]$. The resulting signal $x_m[n]$ received by the $m$-th sensor consists of all the above components and can be written as

$$x_m[n] = s[n] \ast h_m[n] + \left( \sum_{p=1}^{P} z_p[n] \ast g_{m,p}[n] \right) + e_m[n]. \tag{1}$$

Concatenating the respective elements into column-vectors, (1) can be reformulated as

$$\mathbf{x}[n] = s[n] \ast \mathbf{h}[n] + \left( \sum_{p=1}^{P} z_p[n] \ast \mathbf{g}_p[n] \right) + \mathbf{e}[n]. \tag{2}$$

The impulse response $\mathbf{h}[n]$ can be decomposed into direct-arrival and reverberation components, $\mathbf{h}[n] = \mathbf{h}_d[n] + \mathbf{h}_r[n]$. The received signals can be expressed as

$$\mathbf{x}[n] = s[n] \ast \mathbf{h}_d[n] + \mathbf{v}[n], \tag{3}$$

where $\mathbf{v}[n]$ incorporates the undesired sound sources, ambient and sensor noise, and reverberation of the desired source. The vector $\mathbf{v}[n]$ and all its different subcomponents are referred to generically as noise in this paper. Since the sensors are mounted in close proximity to the mouth of the desired speaker, it can be assumed that

the direct component is dominant with respect to reverberation (i.e., the direct-to-reverberation ratio (DRR) is high).

The received signals are transformed to the time-frequency domain via the short-time Fourier transform (STFT):

$$\mathbf{x}[n] \mapsto \mathbf{x}(\ell, k) = \begin{bmatrix} X_1(\ell, k) \\ X_2(\ell, k) \\ \vdots \\ X_M(\ell, k) \end{bmatrix}, \tag{4}$$

where the subscript denotes the channel and the indexes $\ell$ and $k$ represent time and frequency indexes, respectively. A convolution in the time domain can be aptly approximated as multiplication in the STFT domain provided that the analysis window is sufficiently long vis-à-vis the length of the impulse response (Gannot et al., 2001). Since the direct component of the impulse response is highly localized in time, $\mathbf{h}_d[n]$ satisfies this criterion. Consequently, (3) can be approximated in the STFT domain as

$$\mathbf{x}(\ell, k) = \mathbf{h}_d(k)\, s(\ell, k) + \mathbf{v}(\ell, k). \tag{5}$$

Often the transfer function $\mathbf{h}_d(k)$ is not available; therefore, it is convenient to use the relative transfer function (RTF) representation,

$$\mathbf{x}(\ell, k) = \tilde{\mathbf{h}}_d(k)\, \tilde{s}(\ell, k) + \mathbf{v}(\ell, k), \tag{6}$$

where $\tilde{s}$ is typically the direct component of the clean source signal received at the first channel (or some linear combination of the different channels), and $\tilde{\mathbf{h}}_d(k)$ is the RTF of this signal with respect to the sensors. Expressed formally,

$$\tilde{s}(\ell, k) = \mathbf{c}^H(k)\, \mathbf{h}_d(k)\, s(\ell, k), \tag{7}$$

where the vector $\mathbf{c}(k)$ determines the linear combination (e.g., $\mathbf{c}(k) = [1\ 0\ \cdots\ 0]^T$ selects the first channel). The RTF vector, $\tilde{\mathbf{h}}_d(k)$, is related to the transfer function vector, $\mathbf{h}_d(k)$, by

$$\tilde{\mathbf{h}}_d(k) = \frac{\mathbf{h}_d(k)}{\mathbf{c}^H(k)\, \mathbf{h}_d(k)}. \tag{8}$$

We refer to $\tilde{s}$ as the desired signal.

The signal processing system receives $\mathbf{x}[n]$ [or equivalently $\mathbf{x}(\ell, k)$] as an input and returns $\hat{s}[n]$, an estimate of the desired speech signal $\tilde{s}[n]$, as the output. The estimate should effectively suppress noise and interference while maintaining low distortion and high intelligibility. The algorithm should have low latency to facilitate real-time operation, and should be able to adapt rapidly to changes in the scenario.

4. Proposed algorithm

The various stages of the proposed algorithm are presented in this section. The signal processing employs beamforming to suppress undesired components. A speech detection method is used to determine which time-frequency bins are dominated by the desired speech and thus facilitate accurate estimation of the statistics used by the beamformer. The beamformer's single-channel output undergoes post-processing to further reduce noise. Initial calibration procedures for the estimation of RTFs and sensor noise are also described.

4.1. Beamforming framework

A beamformer forms a linear combination of the input channels with the objective of enhancing the signal. This operation can be described as

$$y(\ell, k) = \mathbf{w}^H(\ell, k)\, \mathbf{x}(\ell, k), \tag{9}$$



where $y(\ell, k)$ is the beamformer output, and $\mathbf{w}(\ell, k)$ is the weight vector. This can be in turn presented as a combination of filtered desired and undesired components,

$$y(\ell, k) = \mathbf{w}^H(\ell, k)\, \tilde{\mathbf{h}}_d(k)\, \tilde{s}(\ell, k) + \mathbf{w}^H(\ell, k)\, \mathbf{v}(\ell, k). \tag{10}$$

The power of the undesired component at the beamformer's output is

$$E\{ |\mathbf{w}^H(\ell, k)\, \mathbf{v}(\ell, k)|^2 \} = \mathbf{w}^H(\ell, k)\, \Phi_{vv}(\ell, k)\, \mathbf{w}(\ell, k), \tag{11}$$

where $\Phi_{vv}(\ell, k) = E\{ \mathbf{v}(\ell, k)\, \mathbf{v}^H(\ell, k) \}$. The level of desired-signal distortion can be expressed as

$$|\mathbf{w}^H(\ell, k)\, \tilde{\mathbf{h}}_d(k) - 1|^2. \tag{12}$$

There is a certain degree of discrepancy between the dual objectives of reducing noise (11) and reducing distortion (12). The MVDR beamformer minimizes noise under the constraint that no distortion is allowed. Formally,

$$\mathbf{w}_{\mathrm{MVDR}}(\ell, k) = \operatorname*{argmin}_{\mathbf{w}(\ell, k)} \{ \mathbf{w}^H(\ell, k)\, \Phi_{vv}(\ell, k)\, \mathbf{w}(\ell, k) \} \quad \text{s.t.} \quad \mathbf{w}^H(\ell, k)\, \tilde{\mathbf{h}}_d(k) = 1. \tag{13}$$

The solution to (13) is

$$\mathbf{w}_{\mathrm{MVDR}}(\ell, k) = \frac{\Phi_{vv}^{-1}(\ell, k)\, \tilde{\mathbf{h}}_d(k)}{\tilde{\mathbf{h}}_d^H(k)\, \Phi_{vv}^{-1}(\ell, k)\, \tilde{\mathbf{h}}_d(k)}. \tag{14}$$
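As a concrete illustration, a minimal sketch (ours, not the authors' code) of the weight computation in (14) for a single time-frequency bin might read:

```python
import numpy as np

# MVDR weights per Eq. (14) for one time-frequency bin.
# Phi_vv: (M, M) noise covariance estimate; h_tilde: (M,) RTF vector.
def mvdr_weights(Phi_vv, h_tilde):
    num = np.linalg.solve(Phi_vv, h_tilde)   # Phi_vv^{-1} h_tilde
    return num / (h_tilde.conj() @ num)      # enforces w^H h_tilde = 1
```

Solving the linear system rather than explicitly inverting the covariance matrix is the standard numerically safer choice.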

In contrast to MVDR beamforming's constrained minimization of noise, the MWF performs unconstrained minimization of the MSE (i.e., distortion is allowed). This leads to improved noise suppression but introduces distortion. Formally, the MWF is defined as

$$\mathbf{w}_{\mathrm{MWF}}(\ell, k) = \operatorname*{argmin}_{\mathbf{w}(\ell, k)} E\{ |\mathbf{w}^H(\ell, k)\, \mathbf{x}(\ell, k) - \tilde{s}(\ell, k)|^2 \}. \tag{15}$$

It has been shown (Spriet et al., 2004) that the MWF is equivalent to performing post-processing to the output of an MVDR beamformer with a single-channel Wiener filter (SWF):

$$\mathbf{w}_{\mathrm{MWF}}(\ell, k) = \mathbf{w}_{\mathrm{MVDR}}(\ell, k) \cdot W(\ell, k). \tag{16}$$

An SWF, $W(\ell, k)$, is determined from the SNR at its input (in this case the output of a beamformer). The relationship is given by

$$W(\ell, k) = \frac{1}{1 + \mathrm{SNR}^{-1}(\ell, k)}. \tag{17}$$

We adopt this two-stage perspective and split the processing into an MVDR stage followed by Wiener-based post-processing.

For the MVDR stage, knowledge of the RTF $\tilde{\mathbf{h}}_d(k)$ and of the noise covariance $\Phi_{vv}(\ell, k)$ are required to compute the beamformer weights of (14). The RTF $\tilde{\mathbf{h}}_d(k)$ can be assumed to remain constant since the positions of the sensors with respect to the mouth of the desired source are fixed. Therefore, $\tilde{\mathbf{h}}_d(k)$ can be estimated once during a calibration procedure and used during all subsequent operation. A framework for estimating the RTF is outlined in Section 4.5.2.

The noise covariance $\Phi_{vv}(\ell, k)$ does not remain constant as it is influenced by changes in the user's position as well as changes of the characteristics of the undesired sources. Therefore, $\Phi_{vv}(\ell, k)$ must be estimated on a continual basis. The estimation of $\Phi_{vv}(\ell, k)$ is described in Section 4.2.

The post-processing stage described in Section 4.4 incorporates a scheme for estimating the SNR of (17). Measures to limit the distortion associated with Wiener filtering are also employed.

4.2. Noise covariance estimation

Since the noise covariance matrix $\Phi_{vv}(\ell, k)$ may be subject to rapid changes, it must be continually estimated from the signal $\mathbf{x}(\ell, k)$. This may be accomplished by performing, for each frequency band, a weighted time-average which ascribes greater significance to more recent time samples. As a further requirement, we wish to exclude bins which contain the desired speech component from the average, since their inclusion introduces bias to the estimate and is detrimental to the beamformer's performance. We estimate the noise variance as

$$\hat{\Phi}_{vv}(\ell, k) = \alpha(\ell, k)\, \hat{\Phi}_{vv}(\ell - 1, k) + \big(1 - \alpha(\ell, k)\big)\, \mathbf{x}(\ell, k)\, \mathbf{x}^H(\ell, k), \tag{18}$$

where $\alpha$ is the relative weight ascribed to the previous estimate and $1 - \alpha$ is the relative weight of the current time instant. If desired speech is detected during a given bin, $\alpha$ is set to 1, effectively ignoring that bin. Otherwise, $\alpha$ is set to $\alpha_0 \in (0, 1)$. Formally,

$$\alpha(\ell, k) = \begin{cases} 1, & \text{if desired speech is detected} \\ \alpha_0, & \text{otherwise.} \end{cases} \tag{19}$$

The parameter $\alpha_0$ is a smoothing parameter which corresponds to a time-constant $\tau$ specifying the effective duration of the estimator's memory. They are related by

$$\alpha_0 = e^{-\frac{R}{\tau F_s}} \;\Leftrightarrow\; \tau = \frac{-R}{F_s \ln(\alpha_0)}, \tag{20}$$

where $F_s$ is the sample rate, $R$ is the hop size (i.e., number of time samples between successive time frames), and $\tau$ is measured in seconds.

In certain scenarios, $\hat{\Phi}_{vv}(\ell, k)$ is ill-conditioned and (14) produces exceptionally large weight vectors (Cox et al., 1987). To counter this phenomenon, we constrain the norm of $\mathbf{w}$ to a maximal value,³

$$\mathbf{w}_{\mathrm{reg}}(\ell, k) = \begin{cases} \mathbf{w}_{\mathrm{MVDR}}(\ell, k), & \text{if } \| \mathbf{w}_{\mathrm{MVDR}}(\ell, k) \| \le \rho \\ \rho\, \frac{\mathbf{w}_{\mathrm{MVDR}}(\ell, k)}{\| \mathbf{w}_{\mathrm{MVDR}}(\ell, k) \|}, & \text{otherwise.} \end{cases} \tag{21}$$

In (21), $\mathbf{w}_{\mathrm{reg}}$ represents the regularized weight vector and $\rho$ is the norm constraint.

³ It should be noted that (21) is fairly rudimentary. Other regularization methods which are more advanced exist, such as diagonal loading (Cox et al., 1987; Gilbert and Morgan, 1955), alternative loading schemes (Levin, Habets, Gannot, et al., 2013), and eigenvalue thresholding (Harmanci et al., 2000). We decided to use (21) due to its computational simplicity: no weighting coefficient must be determined, nor is eigenvalue decomposition called for.
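The following is a minimal sketch (ours, not the paper's) of the recursive update (18)–(19) and the norm constraint (21); parameter values follow Table 1.

```python
import numpy as np

# Adaptive noise-covariance update, Eqs. (18)-(19), and weight
# regularization, Eq. (21), for one time-frequency bin.
# Per Eq. (20): alpha0 = 0.98 with Fs = 16 kHz, R = 128 gives tau ~ 0.396 s.
def update_noise_cov(Phi_prev, x, speech_detected, alpha0=0.98):
    alpha = 1.0 if speech_detected else alpha0      # Eq. (19)
    return alpha * Phi_prev + (1.0 - alpha) * np.outer(x, x.conj())  # Eq. (18)

def regularize_weights(w, rho=15.0):
    norm = np.linalg.norm(w)
    return w if norm <= rho else rho * w / norm     # cap ||w|| at rho
```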

4.3. Narrowband near-field speech detection

To determine whether the desired speech is present in a specific time-frequency bin, we propose the test statistic

$$T(\ell, k) = \frac{ |\mathbf{x}^H(\ell, k)\, \tilde{\mathbf{h}}_d(k)|^2 }{ \| \mathbf{x}(\ell, k) \|^2\, \| \tilde{\mathbf{h}}_d(k) \|^2 }. \tag{22}$$

Geometrically, $T$ corresponds to the square of the cosine of the angle between the two vectors $\mathbf{x}$ and $\tilde{\mathbf{h}}_d$. The highest value which $T$ may obtain is 1; this occurs when $\mathbf{x}$ is proportional to $\tilde{\mathbf{h}}_d$, corresponding to complete affinity between the received data $\mathbf{x}$ and the RTF vector $\tilde{\mathbf{h}}_d$. Speech detection is determined by comparison with a threshold value $\eta$: for $T(\ell, k) \ge \eta$, speech is detected; otherwise, speech is deemed absent. This criterion determines the value of $\alpha(\ell, k)$ in (19).
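A direct transcription of (22) into code (a sketch under our naming, not the authors') is essentially a one-liner:

```python
import numpy as np

# Narrowband near-field speech detector, Eq. (22): T is the squared cosine
# of the angle between the snapshot x and the RTF vector h_tilde.
def detect_speech(x, h_tilde, eta=0.9):             # eta = 0.9 as in Table 1
    T = abs(x.conj() @ h_tilde) ** 2 / (
        np.linalg.norm(x) ** 2 * np.linalg.norm(h_tilde) ** 2)
    return T >= eta
```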

4.4. Post-processing

Post-filtering achieves further noise reduction at the expense of increased distortion. Wiener filtering (17) applies an SNR-dependent attenuation (low SNRs incur higher levels of attenuation). However, the SNR is not known and needs to be estimated. We use


a variation of Ephraim and Malah's "decision-directed" approach (Ephraim and Malah, 1984), i.e.,

$$\gamma(\ell, k) = \frac{ |\mathbf{w}_{\mathrm{reg}}^H(\ell, k)\, \mathbf{x}(\ell, k)|^2 }{ \mathbf{w}_{\mathrm{reg}}^H(\ell, k)\, \hat{\Phi}_{vv}(\ell, k)\, \mathbf{w}_{\mathrm{reg}}(\ell, k) }$$
$$\widehat{\mathrm{SNR}}(\ell, k) = \beta\, |W(\ell - 1, k)|^2\, \gamma(\ell - 1, k) + (1 - \beta) \max\{ \gamma(\ell, k) - 1,\ \mathrm{SNR}_{\min} \}$$
$$W(\ell, k) = \max\left\{ \frac{1}{1 + \widehat{\mathrm{SNR}}^{-1}(\ell, k)},\ W_{\min} \right\}. \tag{23}$$

The parameters $\mathrm{SNR}_{\min}$ and $W_{\min}$ set thresholds that, respectively, prevent the value of the estimated SNR of the current sample and the amplitude of the Wiener filter from being overly low. This limits the distortion levels and reduces the prevalence of artifacts (such as musical tones). However, this comes at the expense of less powerful noise reduction.
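A compact sketch of this decision-directed update (our illustration; the values of β, SNR_min, and W_min below are placeholders, as they are not listed here in the paper) could be:

```python
import numpy as np

# Decision-directed post-filter, Eq. (23), for one bin. y_pow is |w^H x|^2,
# noise_pow is w^H Phi_vv w; W_prev and gamma_prev hold the previous frame's
# gain and a-posteriori SNR. beta, snr_min, w_min are assumed placeholders.
def wiener_gain(y_pow, noise_pow, W_prev, gamma_prev,
                beta=0.98, snr_min=0.1, w_min=0.1):
    gamma = y_pow / noise_pow
    snr_hat = beta * W_prev**2 * gamma_prev + (1 - beta) * max(gamma - 1, snr_min)
    W = max(1.0 / (1.0 + 1.0 / snr_hat), w_min)     # limited Wiener gain
    return W, gamma
```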

Application of the post-processing stage to the output of the MVDR stage yields the estimated speech signal:

$$\hat{s}(\ell, k) = W(\ell, k)\, \mathbf{w}_{\mathrm{reg}}^H(\ell, k)\, \mathbf{x}(\ell, k). \tag{24}$$

The signal $\hat{s}(\ell, k)$, which is in the time-frequency domain, is then converted to the time domain to produce the system's final output, $\hat{s}[n]$.

It should be noted that the above approach is fairly rudimentary. Other more elaborate post-processing techniques have been developed and tested (e.g., Gannot and Cohen (2004); Lefkimmiatis et al. (2006); McCowan and Bourlard (2003)). Our simpler choice of post-processing algorithm serves for the purpose of demonstrating a complete system of beamforming and post-processing applied to a smartglasses system. This concise algorithm could, in principle, be replaced with a more sophisticated one. Furthermore, such algorithms could integrate information about speech activity contained in the test statistic $T(\ell, k)$.

4.5. Calibration procedures

We describe calibration procedures which are done prior to running the algorithm.

4.5.1. Sensor noise estimation

Offline, the covariance matrix of sensor noise is estimated. This is done by recording a segment in which only sensor noise is present. Let the STFT of this signal be denoted $\mathbf{a}(\ell, k)$, and the number of time frames be denoted by $L_a$. Since sensor noise is assumed stationary, the time-averaged covariance matrix, $\Phi_{aa}(k) = \frac{1}{L_a} \sum_{\ell=1}^{L_a} \mathbf{a}(\ell, k)\, \mathbf{a}^H(\ell, k)$, serves as an estimate of the covariance of sensor noise.

This is used in (18) as the initial condition value $\hat{\Phi}_{vv}(\ell - 1, k)$. Specifically, we set $\hat{\Phi}_{vv}(\ell = 0, k) = \Phi_{aa}(k)$. It should be noted that setting the initial value as zeros would be problematic since this leads to a series of singular matrices.

4.5.2. RTF estimation

System identification generally requires knowledge of a reference signal and the system's output signals. Let $\mathbf{b}(\ell, k)$ represent the STFT of speech signals produced by a user wearing the glasses in a noise-free environment. For RTF estimation, the reference signal is $\mathbf{c}^H(k)\, \mathbf{b}(\ell, k)$ and the system's outputs are $\mathbf{b}(\ell, k)$. An estimate of the RTF vector is given by

$$\tilde{\mathbf{h}}_{\mathrm{est}}(k) = \frac{ \sum_{\ell=1}^{L_b} \mathbf{b}(\ell, k)\, \mathbf{b}^H(\ell, k)\, \mathbf{c}(k) }{ \sum_{\ell=1}^{L_b} |\mathbf{c}^H(k)\, \mathbf{b}(\ell, k)|^2 }, \tag{25}$$

where $L_b$ denotes the number of time frames, and division is to be understood in an element-wise fashion.

Since the desired RTF $\tilde{\mathbf{h}}_d(k)$ is comprised only of the direct component, the input and output signals would ideally need to be acquired in an anechoic environment. The availability of such a measuring environment is often not feasible, especially if these measurements are to be performed by the end consumer. This being the case, reliance on (25) can be problematic, since reverberation is also incorporated into $\tilde{\mathbf{h}}_{\mathrm{est}}(k)$. (We note that this information about reverberation in the training stage is not useful for the algorithm, since during actual usage the reverberation changes.)

To overcome the problem, we suggest a method based on (25) in which the estimation of $\tilde{\mathbf{h}}_d(k)$ may be conducted in a reverberant environment. We propose that the RTF be estimated from measurements in which the speaker's position shifts during the course of the measurement procedure. Alternatively, we may apply (25) to segments recorded at different positions, and average the resulting estimates of $\tilde{\mathbf{h}}_d(k)$. The rationale for this approach is that the reverberant components of the RTF change with the varying positions and therefore tend to cancel out. The direct component, on the other hand, is not influenced by the desired speaker's position.
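A sketch of the estimator in (25) follows (our illustration; the array shapes and names are assumptions); averaging its output over recordings at several positions implements the procedure proposed above.

```python
import numpy as np

# RTF calibration per Eq. (25). b: (L, M, K) STFT of a calibration
# recording (L frames, M channels, K bins); c: (M,) combination vector.
def estimate_rtf(b, c):
    ref = np.einsum('m,lmk->lk', c.conj(), b)      # reference c^H b(l,k)
    num = np.einsum('lmk,lk->mk', b, ref.conj())   # sum_l b(l,k) [c^H b(l,k)]*
    den = np.sum(np.abs(ref) ** 2, axis=0)         # sum_l |c^H b(l,k)|^2
    return num / den                               # h_est per channel and bin

# Averaging over recordings made at different positions, as proposed above:
# h_d = np.mean([estimate_rtf(b_i, c) for b_i in recordings], axis=0)
```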

4.6. Summary of algorithm

A block diagram depicting how the different components of the algorithm interrelate is presented in Fig. 2. The algorithm receives the multichannel input signal $\mathbf{x}[n]$ and produces a single-channel output $\hat{s}[n]$. Below, we describe the operations of the various blocks and refer the reader to the relevant formulas; a code sketch combining these blocks follows the list.

Fig. 2. Block diagram schematic of the proposed algorithm.

• The STFT and ISTFT blocks perform the short-time Fourier transform and inverse short-time Fourier transform, respectively. The main body of the algorithm operates in the time-frequency domain, and these conversion steps are required at the beginning and end of the process.


• The Initial calibration procedures estimate the RTF and the initial noise matrix (as described in Section 4.5).
• The Calculate test statistic block calculates $T(\ell, k)$ from (22).
• The Determine smoothing parameter block determines the value of $\alpha(\ell, k)$ according to (19). The criterion $T(\ell, k) \ge \eta$ signifies speech detection.
• The Estimate noise covariance block calculates (18).
• The Create MVDR weights block calculates (14).
• The Regularization of weights block calculates (21).
• The Apply beamforming block calculates (9), producing a regularized MVDR beamformer output.
• The post-processing block corresponds to Section 4.4.
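Tying the sketches above together, one frame of the pipeline could look as follows (a schematic illustration under our assumed shapes, not the authors' implementation):

```python
import numpy as np

# One STFT frame of the proposed pipeline (cf. Fig. 2). x_frame: (K, M)
# snapshots per bin; h_tilde: (K, M) calibrated RTFs; Phi: (K, M, M) noise
# covariances; W_prev, gamma_prev: (K,) post-filter state, updated in place.
def process_frame(x_frame, h_tilde, Phi, W_prev, gamma_prev):
    K = x_frame.shape[0]
    s_hat = np.zeros(K, dtype=complex)
    for k in range(K):
        x = x_frame[k]
        speech = detect_speech(x, h_tilde[k])               # Section 4.3
        Phi[k] = update_noise_cov(Phi[k], x, speech)        # Section 4.2
        w = regularize_weights(mvdr_weights(Phi[k], h_tilde[k]))  # (14), (21)
        y = w.conj() @ x                                    # Eq. (9)
        noise_pow = np.real(w.conj() @ Phi[k] @ w)          # residual noise power
        W, gamma_prev[k] = wiener_gain(abs(y) ** 2, noise_pow,
                                       W_prev[k], gamma_prev[k])
        W_prev[k] = W
        s_hat[k] = W * y                                    # Eq. (24)
    return s_hat                                            # one frame of s_hat
```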

5. Performance evaluation

This section describes experiments conducted to evaluate the proposed algorithm. The ensuing quantitative results are presented and compared to other algorithms. The reader is referred to the associated audio samples website in order to listen to various audio signals.

5.1. Recording setup

First, we describe experiments which were conducted to evaluate the performance of the proposed algorithm. To perform the measurements, we used two Microflown USP probes (de Bree, 2003). Each of these probes consists of a pressure sensor and three particle-velocity sensors. The physical properties of particle-velocity correspond to a dipole directivity pattern.

The USP probes were fastened to the temples of eyeglasses⁴ which were placed on the face of a head and torso simulator (HATS) (Brüel & Kjær 4128C). The HATS is a dummy which is designed to mimic the acoustic properties of a human being. For the sake of brevity, we refer to our HATS by the nickname 'Cnut'. The distance from the center of Cnut's mouth to the AVSs is approximately 10.5 cm. Fig. 3 shows a photograph of Cnut wearing the glasses array.

⁴ In this setup, the sensors are connected to external electronic equipment. The recorded signals were processed afterwards on a PC. The setup serves as a "proof of concept" validation of the algorithm preceding the development of an autonomous device.

Fig. 3. Photograph of Cnut, our head and torso simulator (HATS), wearing the eyeglasses array. The distance from the center of Cnut's mouth to the AVSs is approximately 10.5 cm.

Recordings of several acoustic sources were performed in a controlled acoustic room. These include the following five distinct voices:

1. A male voice, emitted from an internal loudspeaker located in Cnut's mouth. This recording was repeated four separate times, with changes made to Cnut's position or orientation between recordings. Three of the recordings were used for RTF estimation (as described in Section 4.5.2). The fourth recording was used to evaluate performance. All other sources used for evaluation (i.e., #2–#5) were recorded with Cnut in this fourth position and orientation.
2. A male voice emitted from an external static (i.e., non-moving) loudspeaker.
3. A female voice emitted from an external static loudspeaker.
4. A male voice emitted from an external static loudspeaker.
5. A male voice produced by one of the authors while walking in the acoustic room.

These five separate voices were each recorded independently. Sources #2, #3, and #4 were located respectively at the front-right, front, and front-left of Cnut and were positioned at a distance of approximately 1 m from it. Source #5 walked along different positions of a semicircular path in the vicinity of sources #1–#3. A rough schematic of the relative positions of the sources is given in Fig. 4.

Fig. 4. Schematic of the acoustic sources. #1: Cnut (the HATS); #2–#4: static loudspeakers; #5: moving human speaker. The distance between the HATS and loudspeakers is approximately 1 m.

The recordings were conducted in the acoustic room at the speech and acoustics lab at Bar-Ilan University. The room dimensions are approximately 6 m × 6 m × 2.3 m. During the recordings the reverberation level was medium-low.

The use of independent recordings for each source allows for the creation of more complex scenarios by forming linear combinations of the basic recordings. These scenarios can be carefully controlled to examine the effects of different SNRs. Furthermore, since both the desired speech and undesired speech components are known, we can inspect how the algorithm affects each of these.

The recordings were resampled from 48 kHz to 16 kHz. Since the frequency-response of the USP sensors is not flat (de Bree, 2003), we applied a set of digital filters to equalize the channels. These filters are also designed to counter nonuniformity of amplification across different channels and to remove low frequencies containing noise and high frequencies in the vicinity of 8 kHz (i.e., half the sampling rate of the resampled signals).

It should be noted that a sound wave with a frequency of 3.2 kHz has a wavelength of approximately 10.5 cm, which corresponds to the distance between the center of Cnut's mouth and


Fig. 5. A spectrogram depicting a segment of the speech signal emitted from Cnut's mouth as recorded by the monopole sensors.

Table 1
Parameter values used for testing the proposed algorithm.

Sampling frequency: F_s = 16,000 Hz
Analysis window: 512-sample Hamming window
Hop size: R = 128 samples
FFT size: 1024 samples (due to zero-padding)
Synthesis window: 512-sample Hamming window
Smoothing parameter: α0 = 0.98 (corresponds to τ ≈ 0.396 s)
Norm constraint: ρ = 15
Speech detection threshold: η = 0.9
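For reference, the STFT configuration of Table 1 corresponds to the following analysis/synthesis setup (a sketch using SciPy; the signal x is a stand-in for a recorded channel):

```python
import numpy as np
from scipy.signal import stft, istft

# STFT analysis/synthesis with the Table 1 parameters: 16 kHz sampling,
# 512-sample Hamming windows, hop R = 128, 1024-point zero-padded FFT.
fs = 16000
x = np.random.randn(fs)                        # placeholder 1-second signal
f, t, X = stft(x, fs=fs, window='hamming', nperseg=512,
               noverlap=512 - 128, nfft=1024)
# ... per-bin processing of X happens here ...
_, x_hat = istft(X, fs=fs, window='hamming', nperseg=512,
                 noverlap=512 - 128, nfft=1024)
```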


the AVSs. For a typical speech signal, the bulk of the spectral energy is located beneath 3.4 kHz. Furthermore, the power spectrum of speech signals decays with increasing frequency (Heute, 2008). Both of these characteristics are qualitatively evident in Fig. 5, which portrays the spectrogram of a segment of the signal emitted from Cnut's mouth as recorded by the monopole sensors. (This signal serves as the clean signal for evaluation purposes in Section 5.3.) Accordingly, the vast majority of the desired speech signal's power corresponds to sub-wavelength propagation.

5.2. Processing details

The calibration for sensor-noise was done with silent segments of the recordings, and the RTF estimation was done with recordings of speaker #1 at different positions and orientations and no other speakers mixed in. The average of the two omnidirectional channels was used as the reference signal $\tilde{s}(\ell, k)$ for estimating the RTF $\tilde{\mathbf{h}}_d(k)$. This corresponds to designating $\mathbf{c}$ of (7) as $\frac{1}{2} [1\ 0\ 0\ 0\ 1\ 0\ 0\ 0]^T$.

The input signals upon which processing is performed correspond to two scenarios. In the first scenario, three static speakers (#2–#4) were combined with equal levels of mean power. Afterwards, they were added to the desired source (#1) at different SNR levels. In the second scenario, source #5 was combined with source #1 at different SNR levels.

Each of the 8 channels is converted to the time-frequency domain by the STFT, and processed with the algorithm proposed in Section 4. Presently, the post-processing stage is omitted; it is evaluated separately in Section 5.5. An inverse STFT transform converts the output back into the time domain. The values for the parameters used are specified in Table 1.

Several other algorithms are also examined as a basis for comparison. These algorithms include variations of MVDR and minimum power distortionless response (MPDR) beamforming and are described below:

1. Fixed-MVDR, which uses a training segment in which only undesired components are present (i.e., prior to the onset of the desired speech) in order to calculate the sample covariance-matrix $\hat{\Phi}_{vv}(k)$ which serves as an estimate of the noise covariance matrix. This matrix $\hat{\Phi}_{vv}(k)$ is estimated only once and then used for the duration of the processing.
2. Fixed-MPDR, which uses the sample-covariance matrix calculated from the segment to be processed. In MPDR beamforming (Van Trees, 2002), both desired and undesired signals contribute to the covariance matrix: the matrix $\Phi_{vv}$ is replaced by $\Phi_{xx}(k)$. In contrast to the fixed-MVDR algorithm, no separate training segment is used. Instead, the covariance matrix is estimated once from the entire segment to be processed, and used for the entire duration of the processing.
3. Adaptive-MPDR, which uses a time-dependent estimate of the covariance matrix of the received signals, $\hat{\Phi}_{xx}(\ell, k)$. This is done by running the proposed algorithm with the threshold parameter set at $\eta = 1$, effectively ensuring that $\alpha(\ell, k) = \alpha_0$ for all $\ell$ and $k$.
4. Oracle adaptation MVDR, which uses a time-dependent estimate of the noise covariance matrix, $\hat{\Phi}_{vv}(\ell, k)$, based on the pure noise-component [i.e., $\mathbf{x}(\ell, k)$ of (18) is replaced by the undesired component $\mathbf{v}(\ell, k)$, and $\eta$ is set to 1]. The pure noise component is unobservable in practice, hence the denomination 'oracle'; the algorithm is used for purposes of comparison.
5. The unprocessed signal (i.e., the average of the two omnidirectional sensors) is used for comparison with respect to the short-time objective intelligibility (STOI) measure. By definition, the noise reduction for the unprocessed signal is 0 dB and distortion is absent.

All algorithms use all eight channels of data. Furthermore, we apply the proposed algorithm and the oracle adaptation MVDR to the data from a reduced array containing only the two monopole channels⁵. The obtained results provide an indication of the performance enhancement due to the additional six dipole channels present in the AVSs.

⁵ Estimations of the noise covariance matrix and the RTF vector used for beamforming and estimation of $T(\ell, k)$ are then of reduced size, being based on only two channels.

The feasibility of using these algorithms in practical settings varies. The fixed-MVDR algorithm presumes that segments containing only noise are available. The fixed-MPDR algorithm requires knowledge of the entire segment to be processed and hence cannot be used in real-time. Both the adaptive MPDR and the proposed algorithm are capable of operation in real-time. The oracle adaptation is not realizable since pure undesired components are unobservable in practice. Its function is purely for purposes of comparison.

5.3. Performance

In this subsection, we conduct an analysis of the proposed algorithm's performance and compare it to the performance of other algorithms. Three measures are examined: (i) noise reduction (i.e., the amount by which the undesired component is attenuated); (ii) distortion (i.e., the amount by which the desired component of the output differs from its true value); (iii) the STOI measure (Taal et al., 2011).

Each of these three measures entails comparing components of the processed signal to some baseline reference. Since the signals at the eight different channels possess widely different characteristics, an arbitrary choice of a single channel to serve as the baseline


Fig. 6. Noise reduction attained from processing with five different algorithms for varying SNR levels in two scenarios: (a) 3 static interferers; (b) 1 moving interferer.

Fig. 7. Distortion levels resulting from processing with five different algorithms for varying SNR levels in two scenarios: (a) 3 static interferers; (b) 1 moving interferer.


may produce misleading results. For instance, if a signal originates from the side of the array, the user's head will shield some of the sensors. Selecting a signal from the right temple as opposed to the left temple to serve as the baseline may produce very different results. We elect to use the average of powers at the two omnidirectional sensors to define a signal's power. In a similar fashion, the clean reference signal is taken as the average of the desired components received at the two omnidirectional sensors (which should be similar due to the geometrical symmetry of the corresponding direct paths). These definitions are stated formally below [(26)–(28)]. It should be stressed that this procedure relates to the evaluation but has no impact on the processing itself.

The procedure for testing an algorithm is as follows. An algorithm normally receives the data $\mathbf{x}[n]$, calculates the weights $\mathbf{w}_{\mathrm{alg}}(\ell, k)$, and applies them to produce the output $\hat{s}_{\mathrm{alg}}[n]$. In our controlled experiments, the desired and noise components of $\mathbf{x}[n]$ are known. Hence, we can apply $\mathbf{w}_{\mathrm{alg}}(\ell, k)$ to the desired and noise components, producing $d_{\mathrm{alg}}[n]$ and $v_{\mathrm{alg}}[n]$, respectively. Let us define $\mathbf{w}_L(\ell, k)$ and $\mathbf{w}_R(\ell, k)$ as weights which select only the left and right omnidirectional channel (i.e., the weight value for the selected channel is 1, and all other channel weights are 0 valued). In a similar manner, $d_L[n]$, $d_R[n]$, $v_L[n]$, and $v_R[n]$ are produced. The noise reduction is defined as

$$\text{noise reduction} = \frac{ \frac{1}{2} \sum_n \big( v_L^2[n] + v_R^2[n] \big) }{ \sum_n v_{\mathrm{alg}}^2[n] }. \tag{26}$$

The distortion level is defined as

$$\text{distortion} = \frac{ \sum_n \big( d_{\mathrm{alg}}[n] - \frac{1}{2} ( d_L[n] + d_R[n] ) \big)^2 }{ \sum_n \big( \frac{1}{2} ( d_L[n] + d_R[n] ) \big)^2 }. \tag{27}$$

For calculating the STOI level, we compare the algorithm's output $\hat{s}_{\mathrm{alg}}[n]$ with

$$s_{\mathrm{clean}}[n] = \frac{1}{2} \big( d_L[n] + d_R[n] \big) \tag{28}$$

functioning as the clean reference signal.
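In code, the measures (26)–(28) amount to a few lines (our sketch; the signals are time-domain NumPy arrays):

```python
import numpy as np

# Evaluation measures of Eqs. (26)-(28). v_L, v_R, d_L, d_R are the noise and
# desired components at the two omnidirectional sensors; v_alg, d_alg are the
# corresponding components after beamforming with the algorithm under test.
def noise_reduction(v_L, v_R, v_alg):
    return 0.5 * np.sum(v_L**2 + v_R**2) / np.sum(v_alg**2)      # Eq. (26)

def distortion(d_L, d_R, d_alg):
    clean = 0.5 * (d_L + d_R)                                    # Eq. (28)
    return np.sum((d_alg - clean)**2) / np.sum(clean**2)         # Eq. (27)
```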

We proceed to analyze the results of the algorithms under test which utilize all eight channels (these are marked by solid lines). Afterwards, we return to the results which use only data from a reduced array (marked in a dotted line) and compare. Fig. 6 portrays the noise reduction results, Fig. 7 portrays the distortion results, and Fig. 8 portrays the STOI results. The SNRs examined range from −20 dB to 10 dB with increments of 5 dB; furthermore, an SNR of


Fig. 8. STOI levels resulting from processing with five different algorithms for varying SNR levels in two scenarios: (a) 3 static interferers; (b) 1 moving interferer. (Note that for the static scenario (a), the fixed-MVDR and the proposed algorithm are nearly identical.)


1000 dB is also examined (appearing at the right edge of the horizontal axis). This exceptionally high SNR is useful for checking robustness in extreme cases.

The results shown in the figures indicate that although both MPDR based algorithms perform reasonably well for low SNRs, there is a rapid degradation in performance as SNR increases. This can be explained by the contamination of the estimated covariance-matrix by desired speech, which is inherent in these methods. For very low SNRs the contamination is negligible, but at higher SNRs the contamination becomes significant. Due to this issue, the MPDR based algorithms cannot be regarded as viable.

With respect to distortion, the other algorithms (i.e., fixed-MVDR and proposed) score fairly well with levels between −20 dB and −18 dB. However, they differ with regards to noise reduction. For the static scenario, the fixed-MVDR attains a noise reduction of 21.8 dB. The proposed algorithm does slightly better at low SNRs (−10 dB and lower). At an SNR of −5 dB, the fixed-MVDR is slightly better, and as the SNR increases the proposed algorithm's noise reduction drops by several decibels (reaching 17.6 dB for an SNR of 10 dB). This is not decidedly troublesome since at high SNRs the issue of noise reduction is of lesser consequence.

For the case of moving interference, the proposed algorithm significantly outperforms fixed-MVDR. The fixed-MVDR algorithm reduces noise by 16.1 dB, whereas the proposed algorithm yields a reduction of 29.3 dB at −20 dB. As the SNR increases, the noise reduction gradually decreases but typically remains higher than the fixed-MVDR. For example, at SNRs of −10, 0, and 10 dB the respective noise reductions are 26.2, 22, and 18.5 dB. Due to the changing nature of the interference, the initial covariance estimate of the fixed-MVDR algorithm is deficient. In contrast, the proposed algorithm constantly adapts and consequently manages to effectively reduce noise. We note that the proposed algorithm is more successful in this dynamic case than in the case of 3 static interferers. This can be explained by the increased challenge of suppressing multiple sources.

The proposed algorithm significantly outperforms the fixed-MVDR algorithm in the scenario of a moving interferer with respect to STOI. Interestingly, in the static scenario the two algorithms have virtually indistinguishable STOI scores. This is despite the fact that there are differences in their noise reduction.

We now discuss the performance of the two algorithms tested with a reduced array (RA). These are labeled 'oracle adaptation (RA)' and 'proposed algorithm (RA)' in Figs. 6, 7, and 8. The noise reduction attainable with the reduced array with the proposed algorithm is roughly 6 dB, which is close to the limit set by

gorithm is roughly 6 dB which is close to the limit set by ora-

le adaptation with a reduced array. Full use of all channels form

he AVSs provided an improvement of approximately 15 to 25 dB.

he performance with respect to STOI with a reduced array is only

lightly better than the unprocessed signal. The distortion levels of

he reduced array are in the vicinity of -30 dB which is an im-

rovement over the full array with distortion of approximately -18

o -20 dB. This improvement is apparently due to the fact that the

nprocessed signal was defined as the average of the two omnidi-

ectional channels used by the reduced array. In any case, the full

rray preforms satisfactorily in terms of distortion and improve-

ent is of negligible significance; utilization of all channels does

rovide significant improvements with respect to noise reduction

nd STOI.

5.4. Threshold parameter sensitivity

In this subsection, we examine the impact of the threshold parameter η on the performance of the proposed algorithm. If η is set too low, then too many bins are mistakenly labeled as containing desired speech. This may lead to poor noise estimation since fewer bins are used in the estimation process. Conversely, if η is set too high, bins which do contain desired speech will not be detected as such, which may lead to contamination of the noise estimate (as seen in the MPDR-based algorithms). Presumably, a certain region of values in between these extremes will yield desirable results with respect to these conflicting goals.
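
As a rough illustration of how such a threshold can gate the noise-statistics update, consider the following sketch (our simplified stand-in for the paper's detector: the statistic is a normalized matched-filter measure in [0, 1], not necessarily the exact test used in the paper):

    import numpy as np

    def update_noise_cov(R_v, x, d, eta, alpha=0.95):
        """Per-bin speech detection gating a recursive noise-covariance update.

        R_v   : (M, M) current noise covariance estimate
        x     : (M,)   observed STFT vector for this time-frequency bin
        d     : (M,)   steering vector of the desired near-field speaker
        eta   : detection threshold in [0, 1]
        alpha : recursive smoothing factor
        """
        Ri_x = np.linalg.solve(R_v, x)
        Ri_d = np.linalg.solve(R_v, d)
        # Normalized match between x and d (cosine^2 in the R_v^{-1} metric):
        # close to 1 when the bin is dominated by the desired speech.
        stat = np.abs(d.conj() @ Ri_x) ** 2 / (
            np.real(d.conj() @ Ri_d) * np.real(x.conj() @ Ri_x))
        speech_detected = stat >= eta
        if not speech_detected:
            # Noise-only bin: fold it into the covariance estimate. Bins
            # flagged as speech are excluded, which avoids the MPDR-style
            # contamination discussed above.
            R_v = alpha * R_v + (1 - alpha) * np.outer(x, x.conj())
        return R_v, speech_detected

In this form, a low η flags nearly every bin as speech and starves the estimator, while a high η lets speech-dominated bins leak into the estimate, mirroring the two failure modes described above.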

We repeatedly executed the algorithm with η taking on different values (the other parameters in Table 1 remain unchanged). This was done for SNRs ranging from −20 dB to 10 dB. The noise reduction results are plotted in Fig. 9, the distortion results in Fig. 10, and the STOI measure in Fig. 11. The STOI measures peak in the vicinity of η = 0.9 and the curve is fairly flat, indicating robustness. Similarly, η = 0.9 is a fairly good choice with respect to noise reduction, although Fig. 9 indicates that a slight increase in η is beneficial at low SNRs and, conversely, a slight decrease is beneficial at high SNRs.

The distortion levels are somewhat better in the vicinity of η = 0.6. However, since the distortion is minor, this slight improvement does not justify the accompanying, more substantial degradation in noise reduction and STOI.

5.5. Post-processing results

In this subsection, we examine the effects of post-processing.

Three parameters influence the post-processing: β, SNR_min, and W_min.

Fig. 9. Noise reduction [dB] attained with the proposed algorithm as a function of the speech detection threshold (η, log scale) for SNR levels from −20 dB to 10 dB. Panels: (a) 3 static interferers; (b) 1 moving interferer.

Fig. 10. Distortion levels [dB] resulting from applying the proposed algorithm as a function of the speech detection threshold (η, log scale) for SNR levels from −20 dB to 10 dB. Panels: (a) 3 static interferers; (b) 1 moving interferer.

Fig. 11. STOI levels attained with the proposed algorithm as a function of the speech detection threshold (η, log scale) for SNR levels from −20 dB to 10 dB. Panels: (a) 3 static interferers; (b) 1 moving interferer.


Fig. 12. Effects of post-processing on (a) noise reduction [dB], (b) distortion [dB], and (c) STOI, plotted against SNR (−20 dB to 10 dB, plus the 1000 dB extreme case) for 'no post-processing', 'post1', and 'post2'.

Table 2
Parameter values in post-processing.

Parameter   post1     post2
β           0.9       0.9
SNR_min     −10 dB    −24 dB
W_min       −8 dB     −20 dB


Setting the latter two parameters at lower values corresponds to a more aggressive suppression of noise, whereas higher values correspond to a more conservative approach regarding signal distortion. To illustrate this trade-off, we test the two sets of parameters whose values are given in Table 2; these two sets are referred to as 'post1' and 'post2', respectively. (Note that SNR_min describes a ratio of powers whereas W_min describes a filter's amplitude; consequently, the former is converted to decibel units via 10log10(·) and the latter via 20log10(·).) In general, post-processing parameters are determined empirically; the designer tests which values yield results which are satisfactory for a particular application.
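
As a concrete (hypothetical) reading of these parameters, the sketch below applies a Wiener-style gain per time-frequency bin: β smooths the SNR estimate over time, SNR_min floors the estimated SNR, and W_min floors the filter amplitude; the defaults correspond to the 'post1' column of Table 2. The paper's actual post-processor may differ in detail; this only illustrates the roles of the three parameters and the two decibel conventions noted above.

    import numpy as np

    def postfilter_gain(snr_prev, speech_pow, noise_pow,
                        beta=0.9, snr_min_db=-10.0, w_min_db=-8.0):
        """Wiener-style post-filter gain for one time-frequency bin (a sketch).

        beta       : recursive smoothing of the SNR estimate over time
        snr_min_db : floor on the estimated SNR (a power ratio -> 10*log10)
        w_min_db   : floor on the filter amplitude (-> 20*log10)
        Lower floors suppress noise more aggressively ('post2'); higher
        floors are more conservative regarding distortion ('post1').
        """
        snr_inst = speech_pow / max(noise_pow, 1e-12)      # instantaneous SNR
        snr_est = beta * snr_prev + (1 - beta) * snr_inst  # smoothed by beta
        snr_est = max(snr_est, 10.0 ** (snr_min_db / 10))  # SNR floor
        gain = snr_est / (1.0 + snr_est)                   # Wiener gain
        gain = max(gain, 10.0 ** (w_min_db / 20))          # amplitude floor
        return gain, snr_est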

Fig. 12 portrays the effects of post-processing on performance (using the three-speaker scenario as a test case). Post-processing reduces noise but increases distortion and adversely affects intelligibility as measured by STOI (this degradation is very minor for 'post1' and more prominent for 'post2'). The parameters of 'post1' are more conservative and the parameters of 'post2' are more aggressive with respect to noise reduction. The former do not reduce as much noise, but introduce less distortion and only a minor degradation of the STOI score. The latter reduce more noise at the expense of greater distortion and lower STOI; with the latter, audio artifacts also have a stronger presence. In general, the parameters may be adjusted to attain a desirable balance.

6. Conclusion

We proposed an array which consists of two AVSs mounted on an eyeglasses frame. This array configuration provides high input SNR and removes the need for tracking changes in the steering vector. An algorithm for suppressing undesired components was also proposed. This algorithm adapts to changes of the noise characteristics by continuously estimating the noise covariance matrix. A speech detection scheme is used to identify time-frequency bins containing desired speech and to prevent them from corrupting the estimate of the noise covariance matrix. The speech detection plays a pivotal role in ensuring the quality of the output signal; in the absence of a speech detector, the higher levels of noise and distortion which are typical of MPDR processing are present. Experiments confirm that the proposed system performs well in both static and changing scenarios. The proposed system may be used to improve the quality of speech acquisition in smartglasses.

Acknowledgments

We wish to acknowledge technical assistance from Microflown Technologies (Arnhem, Netherlands) related to calibration and reduction of sensor noise. This contribution was essential for attaining quality audio results.

Appendix A. Background on AVSs

A sound field can be described as a combination of two coupled fields: a pressure field and a particle-velocity field. The former is a scalar field and the latter is a vector field consisting of three Cartesian components.

Conventional sensors which are typically used in acoustic signal processing measure the pressure field. Acoustic vector-sensors (AVSs) also measure the particle-velocity field, and thus provide more information: each sensor provides four components rather than one.


Fig. A.13. The magnitudes of the directivity patterns of an AVS: a monopole and three mutually orthogonal dipoles.


An AVS consists of four collocated subsensors: one monopole and three orthogonally oriented dipoles. For a plane wave, each subsensor has a distinct directivity response. The response of a monopole element is

$D_{\text{mon}} = 1$,  (A.1)

and the response of a dipole element is

$D_{\text{dip}} = \mathbf{q}^{T}\mathbf{u}$,  (A.2)

where $\mathbf{u}$ is a unit vector denoting the wave's direction of arrival (DOA), and $\mathbf{q}$ is a unit vector denoting the subsensor's orientation. From the definition of the scalar (dot) product, it follows that $D_{\text{dip}}$ corresponds to the cosine of the angle between the signal's DOA and the subsensor's orientation. The orientations of the three dipole subsensors are $\mathbf{q}_x = [1\ 0\ 0]^T$, $\mathbf{q}_y = [0\ 1\ 0]^T$, and $\mathbf{q}_z = [0\ 0\ 1]^T$. The monopole response, which is independent of DOA, corresponds to the pressure field, and the three dipole responses correspond to a scaled version of the Cartesian particle-velocity components. Fig. A.13 portrays the magnitudes of the four spatial responses.
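
These ideal responses are easy to evaluate numerically; below is a minimal sketch (ours; the azimuth/elevation convention is chosen for illustration):

    import numpy as np

    def avs_response(azimuth, elevation):
        """Ideal plane-wave responses of one AVS:
        [monopole, dipole_x, dipole_y, dipole_z].

        The monopole response is 1 regardless of DOA; each dipole responds
        with q^T u, i.e. the cosine of the angle between its orientation q
        and the DOA unit vector u.
        """
        u = np.array([np.cos(elevation) * np.cos(azimuth),
                      np.cos(elevation) * np.sin(azimuth),
                      np.sin(elevation)])       # unit DOA vector
        Q = np.eye(3)                           # rows are q_x, q_y, q_z
        return np.concatenate(([1.0], Q @ u))   # D_mon = 1, D_dip = q^T u

    # A wave arriving along the x-axis excites only the monopole and the
    # x-oriented dipole:
    print(avs_response(0.0, 0.0))   # -> [1. 1. 0. 0.]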

For a spherical wave, the acoustical impedance is frequency-dependent. It can be shown that the dipole elements undergo a relative gain of $1 + \frac{c}{j\omega r}$ over an omnidirectional sensor (as discussed in Section 2), where r denotes the source-receiver distance and c the speed of sound. This phenomenon is manifested particularly at low frequencies, for which the wavelength is significantly longer than the source-receiver distance.
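
A quick numerical check of this gain (a sketch; the 10 cm source-receiver distance is an assumed value, roughly of the order of the mouth-to-sensor distance for glasses-mounted sensors):

    import numpy as np

    def dipole_nearfield_gain(f_hz, r_m, c=343.0):
        """Relative gain of a dipole over an omnidirectional sensor for a
        spherical wave: 1 + c / (j * omega * r)."""
        omega = 2.0 * np.pi * f_hz
        return 1.0 + c / (1j * omega * r_m)

    # Assumed ~10 cm source-receiver distance; the boost is strongest at
    # low frequencies and vanishes as frequency grows:
    for f in (100.0, 1000.0, 8000.0):
        g = dipole_nearfield_gain(f, r_m=0.1)
        print(f"{f:6.0f} Hz: |gain| = {abs(g):.2f} "
              f"({20 * np.log10(abs(g)):.1f} dB)")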

A standard omnidirectional microphone functions as a monopole element. Several approaches are available for constructing the dipole components of an AVS. One approach applies differential processing of closely-spaced omnidirectional sensors (Derkx and Janse, 2009; Elko and Pong, 1995; 1997; Olson, 1946). An alternative approach employs acoustical sensors with inherent directional properties (Derkx, 2010; Shujau et al., 2009). Recently, an AVS based on microelectromechanical systems (MEMS) technology has been developed (de Bree, 2003) and has become commercially available. The experiments discussed in Section 5 use such devices.

The different approaches mentioned produce approximations of the ideal responses. For instance, the subsensors can be placed close to each other but are not strictly collocated; spatial derivatives are estimated, etc. The approaches mentioned above differ with respect to attributes such as robustness, sensor noise, and cost. A discussion of these characteristics is beyond the scope of the current paper.

References

Audio samples: smartglasses. URL www.eng.biu.ac.il/gannot/speech-enhancement/smart-glasses/.
Ackerman, E., 2013. Google gets in your face [2013 Tech To Watch]. IEEE Spectrum 50 (1), 26–29. http://spectrum.ieee.org/consumer-electronics/gadgets/google-gets-in-your-face.
Barfield, W. (Ed.), 2016. Fundamentals of Wearable Computers and Augmented Reality. CRC Press.
Bitzer, J., Simmer, K., 2001. Superdirective microphone arrays. In: Brandstein, M., Ward, D. (Eds.), Microphone Arrays: Signal Processing Techniques and Applications. Springer-Verlag, pp. 18–38 (chapter 2).
de Bree, H.-E., 2003. An overview of Microflown technologies. Acta Acust. United Acust. 89 (1), 163–172.
Capon, J., 1969. High-resolution frequency-wavenumber spectrum analysis. Proc. IEEE 57 (8), 1408–1418.
Cass, S., Choi, C.Q., 2015. Google Glass, HoloLens, and the real future of augmented reality. IEEE Spectr. 52 (3), 18. http://spectrum.ieee.org/consumer-electronics/audiovideo/google-glass-hololens-and-the-real-future-of-augmented-reality.
Cox, H., Zeskind, R., Owen, M., 1987. Robust adaptive beamforming. IEEE Trans. Acoust. Speech Signal Process. 35 (10), 1365–1376.
Derkx, R.M.M., 2010. First-order adaptive azimuthal null-steering for the suppression of two directional interferers. EURASIP J. Adv. Signal Process. 2010, 1.
Derkx, R.M.M., Janse, K., 2009. Theoretical analysis of a first-order azimuth-steerable superdirective microphone array. IEEE Trans. Audio Speech Lang. Process. 17 (1), 150–162.
Elko, G., Pong, A.-T. N., 1995. A simple adaptive first-order differential microphone. In: IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 169–172.
Elko, G., Pong, A.-T. N., 1997. A steerable and variable first-order differential microphone array. In: IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), vol. 1, pp. 223–226.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 32 (6), 1109–1121.
Frost, O., 1972. An algorithm for linearly constrained adaptive array processing. Proc. IEEE 60 (8), 926–935.
Gannot, S., Burshtein, D., Weinstein, E., 2001. Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Trans. Signal Process. 49 (8), 1614–1626.
Gannot, S., Cohen, I., 2004. Speech enhancement based on the general transfer function GSC and postfiltering. IEEE Trans. Speech Audio Process. 12 (6), 561–571.
Gilbert, E., Morgan, S., 1955. Optimum design of directive antenna arrays subject to random variations. Bell Syst. Tech. J. 34, 637–663.
Harmanci, K., Tabrikian, J., Krolik, J., 2000. Relationships between adaptive minimum variance beamforming and optimal source localization. IEEE Trans. Signal Process. 48 (1), 1–12.
Heute, U., 2008. Speech-transmission quality: aspects and assessment for wideband vs. narrowband signals. In: Martin, R., Heute, U., Antweiler, C. (Eds.), Advances in Digital Speech Transmission. John Wiley & Sons.
Lefkimmiatis, S., Dimitriadis, D., Maragos, P., 2006. An optimum microphone array post-filter for speech applications. In: Interspeech – Int. Conf. on Spoken Lang. Proc.
Levin, D., Habets, E., Gannot, S., et al., 2013. Robust beamforming using sensors with nonidentical directivity patterns. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 91–95.
McCowan, I.A., Bourlard, H., 2003. Microphone array post-filter based on noise field coherence. IEEE Trans. Speech Audio Process. 11 (6), 709–716.
Olson, H.F., 1946. Gradient microphones. J. Acoust. Soc. Am. 17 (3), 192–198.
Pierce, A.D., 1991. Acoustics: An Introduction to its Physical Principles and Applications. Acoustical Society of America.
Randell, C., 2005. Wearable computing: a review. Technical Report CSTR-06-004. University of Bristol.
Shujau, M., Ritz, C.H., Burnett, I.S., 2009. Designing acoustic vector sensors for localisation of sound sources in air. In: 17th European Signal Processing Conference, pp. 849–853.
Simmer, K.U., Bitzer, J., Marro, C., 2001. Post-filtering techniques. In: Microphone Arrays. Springer, pp. 39–60.
Spriet, A., Moonen, M., Wouters, J., 2004. Spatially pre-processed speech distortion weighted multi-channel Wiener filtering for noise reduction. Signal Processing 84 (12), 2367–2387.
Sung, D., 2014. What's wrong with Google Glass?: the improvements the Big G needs to make before Glass hits the masses. URL www.wareable.com/google-glass/google-glass-improvements-needed.
Taal, C., Hendriks, R., Heusdens, R., Jensen, J., 2011. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19 (7), 2125–2136.
Van Trees, H., 2002. Detection, Estimation, and Modulation Theory, Part IV: Optimum Array Processing. Wiley, New York, USA.