
ECE 476 Spring 2018 Final Paper

Improving the Workflow of Spatial Audio

Madhu Ashok

University of Rochester

Department of Electrical and Computer Engineering

Submitted to Ming-Lun Lee

Musicians and audio engineers are beginning to implement spatial audio in post-production. Pro Tools HD versions 12.8.2 and above now support ambisonic input, output, and bus paths, joining Reaper among ambisonic-capable DAWs. This has motivated many audio companies to develop ambisonic plugins, such as the 360 Ambisonics Tools by Waves, which can remap audio busses into a full sphere of virtual sources. A Max patch has been developed to route mono and stereo audio files in a Pro Tools or Reaper session into an ambisonic panner for output to monitors and recording. The patch was custom designed for the CSB 505 Multichannel Studio, which has 25 loudspeakers in a spherical arrangement.


Introduction

For the past two and a half years I have been working on multichannel research, and I

often ask myself: Do we even need spatial audio? Multichannel audio systems are expensive and

require specialized equipment and software to operate, leading many audio engineers (and

consumers) to disregard surround sound entirely. Even with a 25-channel system, such as the

multichannel studio in CSB 505, custom solutions need to be developed for spatial audio

panning if one has not purchased Pro Tools Ultimate ($2499.00) or the 360 Ambisonics Tools

by Waves ($400). A majority of music available is in stereo, with occasional 5.1 mixes, so why

even bother? A recent study done by Dr. SungYoung Kim and me demonstrated an increase in

clarity (C80) and a reduction in early decay time (EDT) when increasing from 2- to 22-channel

audio in the following three studios [1]. Subjective evaluation revealed the changes to be

perceivable.

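| Room                                                       | Width (m) | Length (m) | Height (m) | Volume (m3) | RT60 1kHz (ms) |
|------------------------------------------------------------|-----------|------------|------------|-------------|----------------|
| Canada (McGill University) - CIRMMT A820                   | 6         | 7.8        | 3.2        | 149.8       | 168            |
| USA (Rochester Institute of Tech. - RIT) - Conference Room | 7         | 5.9        | 3.4        | 140.4       | 640            |
| Japan (Tokyo Univ. of the Arts - Geidai) - Studio B        | 6.8       | 6.8        | 4.5        | 208.1       | 340            |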

Table 1: Room dimensions, and reverberation time (RT60 – 1kHz) for the studios used in this study [1]

All three studios were comparable in size, with reverberation times ranging from 168 ms to 640 ms due to the different acoustic treatments. Clarity and EDT were simulated using CATT-Acoustic.
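For readers unfamiliar with the clarity metric, the following is a minimal sketch of how C80 is conventionally defined: the ratio, in dB, of the energy arriving in the first 80 ms of an impulse response to the energy arriving afterwards. It is not the CATT-Acoustic computation used in the study; the function name `clarity_c80` and the toy impulse response are assumptions made purely for illustration.

```python
import math

def clarity_c80(impulse_response, sample_rate):
    """Return C80 in dB: early (0-80 ms) energy over late (after 80 ms) energy."""
    split = int(round(0.080 * sample_rate))  # index of the first "late" sample
    early = sum(x * x for x in impulse_response[:split])
    late = sum(x * x for x in impulse_response[split:])
    if early == 0 or late == 0:
        raise ValueError("need energy on both sides of the 80 ms boundary")
    return 10.0 * math.log10(early / late)

# Hypothetical toy response sampled at 1 kHz: a direct sound at 0 ms
# and a single reflection at 100 ms with half the amplitude.
toy_ir = [1.0] + [0.0] * 99 + [0.5]
c80_toy = clarity_c80(toy_ir, 1000)
print(f"C80 of toy response: {c80_toy:.2f} dB")
```

A higher C80 means more of the energy arrives early, which is consistent with the higher values reported for 22.2-channel reproduction.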

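Table 2: Clarity (C80 in dB - left) and Early decay time (EDT in seconds - right) calculated from IRs in CATT-Acoustic for 1kHz.

| Reproduction | C80 McGill (dB) | C80 RIT (dB) | C80 Geidai (dB) | EDT McGill (s) | EDT RIT (s) | EDT Geidai (s) |
|--------------|-----------------|--------------|-----------------|----------------|-------------|----------------|
| 2 Channel    | 38.99           | 9.8          | 24.39           | 0.15           | 0.61        | 0.24           |
| 22.2 Channel | 46.06           | 18.79        | 29.84           | 0.01           | 0.15        | 0.03           |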

Fig. 1: A comparison of simulated EDT and clarity (right) with subjective perceptual space (left) for 2- and 22-channel audio in the McGill, RIT, and Geidai studios.

A subjective evaluation was given to eleven audio engineers at RIT to rate the similarity of two

audio files, which were simulated binaural impulse responses convolved with the music (in 2- or

22-channel for each room). The perceptual map (left) indicates that subjects can differentiate

between 2- and 22-channel audio through perceptual Dimension 1. The increase to 22-channel content nearly doubles the C80 value in the RIT studio (from 9.8 dB to 18.79 dB) and raises clarity while shortening EDT in all three studios. This study also demonstrates the importance of acoustically treating a room for stereo or multichannel audio, as subjects used perceptual Dimension 2 to differentiate between rooms. In a follow-up study by Dr. SungYoung Kim [2], listeners gave higher preference and clarity ratings to 22-channel reproduction than to stereo.

Another benefit from increasing channels is an inherent increase in the width of the sound

field. Subjective evaluation in Dr. Kim’s study supported this statement, where subjects rated

22-channel playback higher in depth and width [2]. Increasing to 25 speakers provides nearly 4π

steradians of radiated sound at the sweet spot, allowing virtual sources to be placed anywhere


within the sphere of speakers using ambisonic (or VBAP) encoding and decoding. This can be a

very useful tool in post-production for mapping sources within the sphere of speakers and can be

beneficial when there are many instruments or voices. In theory, one could move around virtual

sources much like a conductor can change the seating arrangement of an orchestra, but with a

much greater range of source locations. The idea of spatial audio panning can be implemented

into music (both live and recorded) easily, since the musician does not have to change anything

about the performance or microphone positions.

Even with all the benefits that can be seen from increasing channel number, the CSB 505

studio and RIT 3D audio lab are rarely frequented. One of the main reasons for this is a lack of

user-friendly compatibility with Pro Tools and other DAWs like Reaper. Last semester I

implemented a Max patch that used the ICST Ambisonics Toolbox [3] for mapping a mono

plugin to a user-defined location in the sphere of speakers.

Fig. 2: Presentation view of 16-channel panning patch from last semester (top section). A horizontal panning control with a vertical panning slider.

The patch worked as designed but offered no way to record or save the work, as

most DAWs do. Since there are so many DAWs available, each with its own benefits, it seemed


logical to find a way to send audio busses into the Max patch for spatial audio panning and allow

for a user to record back into the DAW as a final 25-channel audio file. The goal was to be able

to take existing recorded and mixed sessions and map each of the audio channels to various

spaces in the 25-channel sphere as a pre-mastering step in the production. Very little needs to

change in the recording process other than using ambisonic microphones to capture the room.

Stereo audio files can be remastered in ambisonics, provided that the Pro Tools session was saved.

A Max patch was developed to improve the workflow of producing spatial audio.

Methods

In the simplest panning scheme, one could have 25 faders to control each of the channels

in the CSB 505 sound system. There are nine channels in the upper layer, consisting of one

‘voice of god’ channel directly above the listener and eight channels around the top of the

listener. The middle layer has eight channels at ear level, equally spaced around

the listener. The bottom layer houses the two subwoofers, along with eight

speakers surrounding the listener’s feet. Each speaker is equidistant from the sweet spot to avoid

delays and is pointed towards the center of the sphere. A CATT-Acoustic simulation of the CSB

505 studio was generated to find the surfaces that needed acoustic treatment. Twenty-five

speakers are modelled using the directivity files of the JBL Control 1 Pro and are placed in the

center of the room. Three windows are modelled in the room in addition to a scattering column

in the back. The CSB 505 studio has a drop ceiling (approximately 3 feet in height) with

removable gypsum panels (purple in CATT-Acoustic), so a bass trap was implemented by

removing panels that showed the highest energy from reflections in the CATT-Acoustic

simulation.
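The three-layer speaker arrangement described above can be sketched as a list of (azimuth, elevation) pairs. This is only an illustration: the text gives the number of speakers per layer but not their angles, so the ring elevations of +45 and -45 degrees, the starting azimuth, and the names `ring` and `speaker_layout` are all assumptions, and the two subwoofers are left out.

```python
def ring(elevation_deg, count=8, start_azimuth_deg=0.0):
    """Return (azimuth, elevation) pairs for speakers equally spaced on a ring."""
    step = 360.0 / count
    return [((start_azimuth_deg + i * step) % 360.0, elevation_deg)
            for i in range(count)]

# ASSUMPTION: placeholder elevations, not measured values from CSB 505.
UPPER_ELEVATION_DEG = 45.0
LOWER_ELEVATION_DEG = -45.0

def speaker_layout():
    """Build a 25-speaker layout: one overhead speaker plus three rings of eight."""
    layout = [(0.0, 90.0)]                # 'voice of god', directly overhead
    layout += ring(UPPER_ELEVATION_DEG)   # eight upper-layer speakers
    layout += ring(0.0)                   # eight ear-level speakers
    layout += ring(LOWER_ELEVATION_DEG)   # eight lower-layer speakers
    return layout

speakers = speaker_layout()
print(len(speakers), "speakers; first ear-level speaker at", speakers[9])
```

A table like this is the kind of information an ambisonic decoder needs about each loudspeaker, since the decoding gains depend on where each speaker sits on the sphere.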


Fig. 3: CATT-Acoustic simulation of CSB 505 with the gypsum drop ceiling.

Fig. 4: TUCT surface rendering of 1kHz reflections. The yellow areas indicate higher reflected dB SPL.


Fig. 5: Eleven gypsum tiles were removed from the drop ceiling to help reduce reflections in the room and reduce the reverberation time. These tiles were chosen based on the surface irradiation from CATT-Acoustic.

This was the cheapest acoustic treatment I could think of for the space, but there is a great deal of

improvement that can be made. An untreated room can cause the localization of sources to

change, which would be highly undesirable in spatial audio panning applications.

When a person is asked to localize a sound, a typical response is to point in the direction the source seems to come from. This includes the horizontal and vertical angles relative to the listener's head direction. It therefore seems necessary to provide control over the positioning of sources, so as to mimic the realism of a helicopter moving overhead

or a train moving across the left side of the listener. There are various formats, such as 5.1, 6.1,


and 7.1, that use horizontal panning so that audio can follow movement in video content. One tool for this kind of panning is a Max toolbox developed at the Institute for Computer Music and Sound Technology (ICST) in Zurich, which is available as the ICST Ambisonics Toolbox [3].

One of the example patches (Basic_01) demonstrates moving four virtual sound sources within

the circular arrangement of 8 speakers:

Fig. 6: Basic_01 from ICST Ambisonics Toolbox. The numbered sources (1, 2, 3, 4) can be moved within the grey circle (horizontal plane).

This patch uses ambiencode~ O N and ambidecode~ O M to map N virtual sources to M speakers

in the horizontal plane, where O is the ambisonic order. The example above has 4 mono .wav files mapped to 8 speakers at ear

level, equally spaced around the listener. This is done by implementing the semi-normalized


form of the Furse-Malham set for encoding and decoding higher order Ambisonics [3]. The

toolbox allows control over the ambisonic order O, which was set to 3rd to match the example

given by ICST. This patch was adapted to work with 25 speakers arranged as they are in CSB

505.
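To make the encoding step more concrete, here is a minimal sketch of first-order ambisonic encoding with Furse-Malham weighting, together with the usual channel-count formulas. It is not the ICST implementation: the patch described here runs at 3rd order, the function names are hypothetical, and the azimuth convention (degrees, counter-clockwise from the front) is an assumption that may differ from the toolbox.

```python
import math

def ambisonic_channel_count(order, full_sphere=True):
    """Channels needed at a given order: (O+1)^2 in 3D, 2*O+1 horizontal-only."""
    return (order + 1) ** 2 if full_sphere else 2 * order + 1

def encode_first_order_fuma(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into first-order B-format (W, X, Y, Z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)               # omnidirectional, FuMa -3 dB weight
    x = sample * math.cos(az) * math.cos(el)  # front-back component
    y = sample * math.sin(az) * math.cos(el)  # left-right component
    z = sample * math.sin(el)                 # up-down component
    return (w, x, y, z)

print("3rd-order full-sphere channels:", ambisonic_channel_count(3))
print("Source straight ahead:", encode_first_order_fuma(1.0, 0.0, 0.0))
```

Moving a virtual source then amounts to changing the two angles passed to the encoder; the decoder, not shown here, turns the encoded channels into one feed per loudspeaker.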

The first improvement from last year was to implement user control over the whole

sphere for source placement. The top space is for horizontal movement of sources and the

bottom space is for vertical movement of sources. Additionally, the monitor locations were

modified to output to 25 speakers with the known locations of each speaker embedded in the

patches located above each set of speakers. In the example below, there are 11 input sources

spread around the front of the listener, all located at ear level as indicated in the vertical space

(bottom left). These 11 sources will be encoded into ambisonics and decoded through the 25-

speaker system.

Fig. 7: Final Max patch implementation of source placement (left), and monitor location (right).


The second improvement was to modify the input section of the Max patch. The two most common ways to transport audio from one DAW to another are ReWire and Soundflower. Soundflower was chosen as the audio interface because it is open source. Soundflower has 64 inputs and outputs that can be mapped to and from any digital audio workstation. The adc~ object was used to input channels 1-16 into the ambiencode~ 3 20 block, which provides 20 inputs to the ambisonic encoder with 3rd-order ambisonic channels.

Fig. 8: Final Max patch implementation of input control. The top is the presentation view, and the bottom shows the block diagram.


Channels 1 and 2 are left for opening audio files to be moved within the sphere. Channel 3 is

used for plugins directly from Max (in this case Massive). There is also a test tone on channel 4,

cycle~ 440, that can be used to check the outputs when troubleshooting.

In order to correctly run audio from a DAW through Soundflower to Max, one must

change the audio settings to match the following:

Fig. 9: Settings for Reaper (left) and Max (right). The buffer size may need to be changed if the computer slows down.

In this example I used Reaper, but it can work with Pro Tools or any other DAW as long as Soundflower is supported. Reaper outputs its audio through Soundflower, which is then selected as the input device in Max. The output device in Max can be set to the 25-channel aggregate device for playback, or to Soundflower (64ch) for recording back into Reaper. A meter block is added to each input to help the user's workflow when multiple sources are present. There are a total of 20 inputs and 25 outputs, approaching the 64-channel limit. Due to the constraint on channels, playback and recording cannot happen simultaneously.
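The channel bookkeeping can be checked with a small script. The sketch below assumes that audio sent from the DAW to Max and audio recorded back into the DAW share the same 64 Soundflower channels, so the two ranges must not collide. The send range 1-16 comes from the text and the record range 30-57 from the Fig. 10 caption; the helper names are hypothetical.

```python
SOUNDFLOWER_CHANNELS = 64  # size of the Soundflower (64ch) device

def channel_range(first, last):
    """Inclusive set of 1-based channel numbers."""
    return set(range(first, last + 1))

def validate_routing(sends, returns, limit=SOUNDFLOWER_CHANNELS):
    """Return a list of problems; an empty list means the routing fits."""
    problems = []
    overlap = sorted(sends & returns)
    if overlap:
        problems.append(f"channels used in both directions: {overlap}")
    outside = sorted(c for c in sends | returns if c < 1 or c > limit)
    if outside:
        problems.append(f"channels outside 1-{limit}: {outside}")
    return problems

daw_to_max = channel_range(1, 16)   # DAW sends feeding the Max inputs
max_to_daw = channel_range(30, 57)  # 28-channel record bus back into Reaper
routing_problems = validate_routing(daw_to_max, max_to_daw)
print("routing OK" if not routing_problems else routing_problems)
```

Running a check like this before a session is one way to catch a send that lands on a channel already reserved for the record bus.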

A recording feature was implemented using a gate in Max to switch between output to the 25 monitors and recording into the DAW as a 25-channel mix. The audio settings need to be changed to output through Soundflower (64ch) when recording. This feature is only available for Reaper, as I do not own a copy of Pro Tools HD. The recording will save a 28-channel track (28 is the closest available channel count to 25 in Reaper) of the given ambisonic mix generated in the Max patch. This 28-channel track can be used to directly output to the speakers in CSB 505 without having to encode and decode the audio files.

Fig. 10: Final Max patch presentation view for outputs (top) and patching mode (middle). A gate was used to toggle between all outlets off (0), open to monitors (2), and open to Recording (3). The bottom image shows the Reaper track with inputs 30-57 that will be recorded from the spatial audio panner.

Even with 4 sources running through Pro Tools there were occasional clicks and artifacts from dropped samples. When this happens the I/O vector size needs to be modified. Issues with latency were present with Pro Tools when there were more than 6 inputs. Reaper did not have any latency issues when tested with 4 or fewer audio sources. More channels will have to be tested for latency in Reaper.
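The gate-based switch between monitoring and recording can be pictured with the sketch below. It is an illustration in Python rather than the Max patch itself. The gate values 0, 2, and 3 are taken from the Fig. 10 caption, while the function name `route_frame` and the idea that the three extra channels of the 28-channel track are simply silent are assumptions.

```python
GATE_OFF, GATE_MONITORS, GATE_RECORD = 0, 2, 3  # values from the Fig. 10 caption

SPEAKER_COUNT = 25

def route_frame(frame, gate_state, record_width=28):
    """Route one 25-channel frame; return (monitor_frame, record_frame)."""
    if len(frame) != SPEAKER_COUNT:
        raise ValueError("expected one sample per speaker (25 values)")
    silent_monitors = [0.0] * SPEAKER_COUNT
    silent_record = [0.0] * record_width
    if gate_state == GATE_MONITORS:
        return list(frame), silent_record
    if gate_state == GATE_RECORD:
        # ASSUMPTION: pad the unused record channels with silence.
        padded = list(frame) + [0.0] * (record_width - SPEAKER_COUNT)
        return silent_monitors, padded
    return silent_monitors, silent_record  # gate closed: nothing passes

frame = [float(i) for i in range(1, 26)]
monitors, record = route_frame(frame, GATE_RECORD)
print(len(monitors), "monitor channels,", len(record), "record channels")
```

Only one of the two destinations carries signal at a time, which mirrors the constraint that playback and recording cannot happen simultaneously.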

Conclusions

Fig. 11: Final Max patch presentation view for the spatial audio panner.

In the end, the spatial audio panner was able to work in the CSB 505 studio with a limited

number of channels, and very slight modification in the Pro Tools preferences and audio settings.

The patch was tested with a recording David Kunstmann and I made during the semester, which

had more than 10 channels of audio. At first, we tried mapping all the channels at once, and Pro

Tools crashed immediately. When we reduced the number of channels and increased the I/O


vector size we were able to get spatial panning of individual sources at user-defined locations. In

order to add a channel into the inputs of Max, simply use an insert to map to a desired channel of

Soundflower, or directly send from the output to a given channel 1-16. Mono or stereo files can

be sent through Soundflower. With stereo files, the user can adjust the width and location of the

stereo input projected into the ambisonic listening space.
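One way to think about the width and location controls for a stereo input is as two virtual sources placed symmetrically around a center direction. The sketch below illustrates that idea only. The function name, the use of degrees, and the convention that positive azimuth is to the listener's left are assumptions, not details taken from the patch.

```python
def stereo_pair_azimuths(center_deg, width_deg):
    """Return (left, right) azimuths in degrees, wrapped to [-180, 180)."""
    def wrap(angle):
        return (angle + 180.0) % 360.0 - 180.0
    half = width_deg / 2.0
    # ASSUMPTION: positive azimuth is counter-clockwise (towards the left).
    return wrap(center_deg + half), wrap(center_deg - half)

left, right = stereo_pair_azimuths(0.0, 60.0)
print("left channel at", left, "degrees; right channel at", right, "degrees")
```

Setting the width to zero collapses both channels onto the same direction, while changing the center rotates the whole stereo image around the listener.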

Fig. 12: Pro Tools inserts, sends, and I/O settings (top). This track will output to 5-6 (Out 5-6) and 11-12 (Insert 11-12) in Soundflower. These two stereo output pairs (now 8-9 and 14-15 on the bottom panel) are encoded to ambisonics based on the user defined position of the stereo pairs.

Running this Max patch with Pro Tools has latency issues that are very audible when the

I/O vector size is mismatched or too small. When a single source is moved

around the sphere, the perceived source may not sound identical at every position. This could be due to the room


reflections, or the mapping to the speakers. Other gain artifacts appear when the sources are

moved to a position directly above the listener, or when images are horizontally centered.
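The trade-off behind the I/O vector size mentioned above can be estimated with simple arithmetic: each buffer of N samples adds roughly N divided by the sample rate in delay, so larger vectors reduce dropouts at the cost of latency. The sketch below shows this relationship. The 48 kHz sample rate is an assumption, since the session rate is not stated, and real round-trip latency through Soundflower will include further buffering.

```python
def buffer_latency_ms(vector_size, sample_rate):
    """Delay contributed by one buffer of `vector_size` samples, in milliseconds."""
    return 1000.0 * vector_size / sample_rate

ASSUMED_SAMPLE_RATE = 48000  # ASSUMPTION: not stated in the paper

for size in (64, 128, 256, 512, 1024, 2048):
    latency = buffer_latency_ms(size, ASSUMED_SAMPLE_RATE)
    print(f"I/O vector size {size:5d}: {latency:6.2f} ms per buffer")
```

Doubling the vector size doubles the per-buffer delay, which is why the setting has to be balanced between avoiding clicks and keeping latency tolerable.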

Due to the computational intensity of this specific Max patch, a plugin in JUCE would be

highly desirable over sending audio through the Soundflower driver. Max is a great prototyping

environment but is limited in processing speed when running in parallel with other DAWs.

Overall, Reaper produces lower latency when running with the spatial audio panner and makes it easier

to manipulate ambisonic and multichannel tracks. Although this prototype of a spatial audio

panner has its limitations, we can improve upon the workflow of getting recording sessions

converted into spatial audio mixes. Hopefully, once the bugs are worked out, I can record

sessions in the Rettner Studio using spatial and non-spatial microphones and then bring the

session to CSB 505 for spatial audio mixing/mastering.


References

[1] Ashok, M., King, R., Kamekawa, T., and Kim, S. “Acoustic and Subjective Evaluation of 22.2- and 2-Channel Reproduced Sound Fields in Three Studios,” in Proc. Audio Engineering Society 144th Int. Conv., AES, Milan, Italy, 2018.

[2] Kim, S., King, R., Kamekawa, T., and Sakamoto, S. “Recognition of an auditory environment: investigating room-induced influences on immersive experience,” in Proc. Audio Engineering Society Int. Conference on Spatial Reproduction, AES, Tokyo, Japan, 2018.

[3] Schacher, J. C., and Kocher, P. “Ambisonics spatialization tools for Max/MSP,” Omni 500.1 (2006).
