ECE 476 Spring 2018 Final Paper
Improving the Workflow of Spatial Audio
Madhu Ashok
University of Rochester, Department of Electrical and Computer Engineering
Submitted to Ming-Lun Lee
Musicians and audio engineers are beginning to implement spatial audio in post-production. Pro Tools HD version 12.8.2 and above now supports ambisonic
input, output, and bus paths, joining Reaper among ambisonic-capable DAWs. This has motivated many audio companies to develop ambisonic plugins, such as the
360 Ambisonics Tools by Waves, which can remap audio busses into a full sphere of virtual sources. A Max patch has been developed to route mono and
stereo audio files in a Pro Tools or Reaper session into an ambisonic panner for output to monitors and recording. The patch was custom-designed for the CSB
505 Multichannel Studio, which has 25 loudspeakers in a spherical arrangement.
Introduction
For the past two and a half years I have been working on multichannel research, and I
often ask myself: Do we even need spatial audio? Multichannel audio systems are expensive and
require specialized equipment and software to operate, leading many audio engineers (and
consumers) to disregard surround sound entirely. Even with a 25-channel system, such as the
multichannel studio in CSB 505, custom solutions need to be developed for spatial audio
panning if one has not purchased Pro Tools Ultimate ($2499.00) or the 360 Ambisonics Tools
by Waves ($400). The majority of available music is in stereo, with occasional 5.1 mixes, so why
even bother? A recent study by Dr. SungYoung Kim and me demonstrated an increase in
clarity (C80) and a reduction in early decay time (EDT) when moving from 2- to 22-channel
audio in the following three studios [1]. Subjective evaluation revealed the changes to be
perceivable.
Room                                                         Width (m)  Length (m)  Height (m)  Volume (m3)  RT60 1kHz (ms)
Canada (McGill University) - CIRMMT A820                     6          7.8         3.2         149.8        168
USA (Rochester Institute of Tech. - RIT) - Conference Room   7          5.9         3.4         140.4        640
Japan (Tokyo Univ. of the Arts - Geidai) - Studio B          6.8        6.8         4.5         208.1        340
Table 1: Room dimensions, and reverberation time (RT60 – 1kHz) for the studios used in this study [1]
All three studios were comparable in size, with differences in reverberation time due to the
different acoustic treatments. Clarity and EDT were simulated using CATT-Acoustic.
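For readers unfamiliar with the metric, C80 is conventionally defined as the ratio, in dB, of the energy arriving in the first 80 ms of an impulse response to the energy arriving afterwards. The following is a minimal sketch of that definition and not the CATT-Acoustic computation; the function names and the synthetic, exponentially decaying impulse response are assumptions made for illustration.

```python
import math

def clarity_c80(ir, fs):
    """Return C80 in dB: energy in the first 80 ms over the energy after 80 ms."""
    split = int(round(0.080 * fs))  # sample index of the 80 ms boundary
    early = sum(x * x for x in ir[:split])
    late = sum(x * x for x in ir[split:])
    return 10.0 * math.log10(early / late)

def synthetic_ir(rt60, fs, seconds):
    """Hypothetical test signal: amplitude decays by 60 dB over rt60 seconds."""
    k = math.log(1000.0) / rt60  # 60 dB corresponds to an amplitude factor of 1000
    return [math.exp(-k * n / fs) for n in range(int(seconds * fs))]

fs = 8000  # assumed sample rate for the synthetic example
dry = clarity_c80(synthetic_ir(0.2, fs, 2.0), fs)  # short decay
wet = clarity_c80(synthetic_ir(0.6, fs, 2.0), fs)  # long decay
```

A shorter decay pushes more of the energy into the first 80 ms, so the drier synthetic room yields the higher C80. This is the same direction as the ordering of the McGill and RIT rooms in Tables 1 and 2.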
Table 2: Clarity (C80 in dB - left) and Early decay time (EDT in seconds - right) calculated from IRs in CATT-Acoustic for 1kHz.
C80 1kHz (dB)    McGill   RIT     Geidai
2 Channel        38.99    9.8     24.39
22.2 Channel     46.06    18.79   29.84

EDT 1kHz (s)     McGill   RIT     Geidai
2 Channel        0.15     0.61    0.24
22.2 Channel     0.01     0.15    0.03
Fig. 1: A comparison of simulated EDT and clarity (right) with subjective perceptual space (left) for 2- and 22-channel audio in the McGill, RIT, and Geidai studios.
In a subjective evaluation, eleven audio engineers at RIT rated the similarity of pairs of
audio files, each of which was music convolved with simulated binaural impulse responses (in 2- or
22-channel for each room). The perceptual map (left) indicates that subjects can differentiate
between 2- and 22-channel audio through perceptual Dimension 1. The increase to 22-channel
content nearly doubles the clarity in the RIT studio (from 9.8 dB to 18.79 dB), and overall improves the acoustics in all
three studios. This study demonstrates the importance of acoustically treating a room for stereo
or multichannel audio, as perceptual Dimension 2 was used by subjects to differentiate
between rooms. In a follow-up study by Dr. SungYoung Kim [2], listeners gave higher
preference and clarity ratings to reproduction in 22 channels (over stereo).
Another benefit of increasing the number of channels is an inherent increase in the width of the sound
field. Subjective evaluation in Dr. Kim’s study supported this statement, where subjects rated
22-channel playback higher in depth and width [2]. Increasing to 25 speakers allows nearly 4π
steradians of radiated sound at the sweet spot, allowing virtual sources to be placed anywhere
within the sphere of speakers using ambisonic (or VBAP) encoding and decoding. This can be a
very useful tool in post-production for mapping sources within the sphere of speakers and can be
beneficial when there are many instruments or voices. In theory, one could move around virtual
sources much like a conductor can change the seating arrangement of an orchestra, but with a
much greater range of source locations. Spatial audio panning can be applied
to music (both live and recorded) easily, since the musician does not have to change anything
about the performance or microphone positions.
Even with all the benefits that can be seen from increasing channel number, the CSB 505
studio and RIT 3D audio lab are rarely frequented. One of the main reasons for this is a lack of
user-friendly compatibility with Pro Tools and other DAWs like Reaper. Last semester I
implemented a Max patch that used the ICST Ambisonics Toolbox [3] for mapping a mono
plugin to a user defined location in the sphere of speakers.
Fig. 2: Presentation view of 16-channel panning patch from last semester (top section). A horizontal panning control with a vertical panning slider.
The patch worked as designed but offered no way to record or save the work, as
most DAWs do. Since there are so many DAWs available, each with its own benefits, it seemed
logical to find a way to send audio busses into the Max patch for spatial audio panning and allow
for a user to record back into the DAW as a final 25-channel audio file. The goal was to be able
to take existing recorded and mixed sessions and map each of the audio channels to various
spaces in the 25-channel sphere as a pre-mastering step in the production. Very little needs to
change in the recording process other than using ambisonic microphones to capture the room.
Stereo audio files can be remastered in ambisonics, provided that the Pro Tools session was saved.
A Max patch was developed to improve the workflow of producing spatial audio.
Methods
In the simplest panning scheme, one could have 25 faders to control each of the channels
in the CSB 505 sound system. There are nine channels in the upper layer, consisting of one
‘voice of god’ channel directly above the listener and eight channels around the top of the
listener. The middle layer has eight channels at ear level, equally spaced around
the listener. The bottom layer houses the two subwoofers, along with eight
speakers surrounding the listener’s feet. Each speaker is equidistant from the sweet spot to avoid
delays and is pointed towards the center of the sphere. A CATT-Acoustic simulation of the CSB
505 studio was generated to find the surfaces that needed acoustic treatment. Twenty-five
speakers are modelled using the directivity files of the JBL Control 1 Pro and are placed in the
center of the room. Three windows are modelled in the room in addition to a scattering column
in the back. The CSB 505 studio has a drop ceiling (approximately 3 feet in height) with
removable gypsum panels (purple in CATT-Acoustic), so a bass trap was implemented by
removing the panels that showed the highest energy from reflections in the CATT-Acoustic
simulation.
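The layer description at the start of this section can be written down as a list of loudspeaker directions. This is a sketch only: the ring elevations of +45 and -45 degrees, the starting azimuth of 0 degrees, and the unit radius are assumptions, not measurements of CSB 505, and the two subwoofers are left out.

```python
import math

UPPER_EL = 45.0   # assumed elevation of the upper ring, in degrees
LOWER_EL = -45.0  # assumed elevation of the lower ring, in degrees

def ring(elevation_deg, count=8):
    """Return `count` equally spaced (azimuth, elevation) pairs in degrees."""
    return [(i * 360.0 / count, elevation_deg) for i in range(count)]

def to_cartesian(az_deg, el_deg, radius=1.0):
    """Convert a direction to x, y, z on a sphere centred on the sweet spot."""
    az, el = math.radians(az_deg), math.radians(el_deg)
    return (radius * math.cos(az) * math.cos(el),
            radius * math.sin(az) * math.cos(el),
            radius * math.sin(el))

# One 'voice of god' speaker overhead, then the upper, middle, and lower rings.
layout = [(0.0, 90.0)] + ring(UPPER_EL) + ring(0.0) + ring(LOWER_EL)
positions = [to_cartesian(az, el) for az, el in layout]
```

Because every position shares the same radius, all 25 speakers are equidistant from the center, which matches the equal-delay arrangement described above.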
Fig. 3: CATT-Acoustic simulation of CSB 505 with the gypsum drop ceiling.
Fig. 4: TUCT surface rendering of 1kHz reflections. The yellow areas indicate higher reflected dB SPL.
Fig. 5: Eleven gypsum tiles were removed from the drop ceiling to help reduce reflections in the room and reduce the reverberation time. These tiles were chosen based on the surface irradiation from CATT-Acoustic.
This was the cheapest acoustic treatment I could think of for the space, but there is a great deal of
improvement that can be made. An untreated room can cause the localization of sources to
change, which would be highly undesirable in spatial audio panning applications.
When asked to localize a sound, a listener typically points in the direction the source
seems to come from. This direction includes the horizontal
and vertical angles relative to the listener’s head direction. It therefore seems necessary to provide
control over the positioning of sources, so as to mimic the realism of a helicopter moving overhead
or a train moving across the left side of the listener. There are various formats, such as 5.1, 6.1,
and 7.1, that use horizontal panning to match audio to on-screen movement in video content.
An example of horizontal panning is found in a Max toolbox developed at the Institute for Computer Music and
Sound Technology (ICST) in Zurich, which is available as the ICST Ambisonics Toolbox [3].
One of the example patches (Basic_01) demonstrates moving four virtual sound sources within
a circular arrangement of 8 speakers:
Fig. 6: Basic_01 from ICST Ambisonics Toolbox. The numbered sources (1, 2, 3, 4) can be moved within the grey circle (horizontal plane).
This patch uses ambiencode O N and ambidecode O M (where O is the Ambisonics order) to map N virtual sources to M speakers
in the horizontal plane. The example above has 4 mono .wav files mapped to 8 speakers at ear
level equally spaced around the listener. This is done by implementing the semi-normalized
form of the Furse-Malham set for encoding and decoding higher order Ambisonics [3]. The
toolbox allows control over the Ambisonics order O, which was set to 3 to match the example
given by ICST. This patch was adapted to work with 25 speakers arranged as they are in CSB
505.
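To make the encode and decode steps concrete, here is a first-order illustration with the Furse-Malham weighting (W scaled by 1/sqrt(2)) and a basic decode to a regular horizontal ring. It is a simplified sketch, not the ICST toolbox code: the patch runs at 3rd order, and the function names, the counter-clockwise azimuth convention, and the decoder gains are assumptions chosen for this example.

```python
import math

def encode_fuma_first_order(sample, az_deg, el_deg=0.0):
    """Encode one mono sample to first-order B-format (W, X, Y, Z)."""
    az, el = math.radians(az_deg), math.radians(el_deg)
    w = sample / math.sqrt(2.0)               # Furse-Malham weight on W
    x = sample * math.cos(az) * math.cos(el)  # front-back component
    y = sample * math.sin(az) * math.cos(el)  # left-right component
    z = sample * math.sin(el)                 # up-down component
    return w, x, y, z

def decode_to_ring(bformat, speaker_count=8):
    """Basic decode to equally spaced speakers at ear level (Z is ignored)."""
    w, x, y, _ = bformat
    feeds = []
    for i in range(speaker_count):
        sp = 2.0 * math.pi * i / speaker_count  # speaker azimuth in radians
        gain = math.sqrt(2.0) * w + 2.0 * (x * math.cos(sp) + y * math.sin(sp))
        feeds.append(gain / speaker_count)
    return feeds

def channel_count(order):
    """Number of full-sphere Ambisonics channels for a given order."""
    return (order + 1) ** 2

# A unit-amplitude source at 90 degrees azimuth, decoded to 8 speakers.
feeds = decode_to_ring(encode_fuma_first_order(1.0, 90.0))
```

A full-sphere signal set of order O has (O + 1)^2 channels, so the 3rd-order setting used here carries 16 channels. With the source at 90 degrees, the largest feed goes to the speaker at 90 degrees and the feeds sum to the original amplitude.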
The first improvement from last year was to implement user control over the whole
sphere for source placement. The top space is for horizontal movement of sources and the
bottom space is for vertical movement of sources. Additionally, the monitor locations were
modified to output to 25 speakers with the known locations of each speaker embedded in the
patches located above each set of speakers. In the example below, there are 11 input sources
spread around the front of the listener, all located at ear level as indicated in the vertical space
(bottom left). These 11 sources will be encoded into ambisonics and decoded through the 25-
speaker system.
Fig. 7: Final Max patch implementation of source placement (left), and monitor location (right).
The second improvement was to modify the input section of the Max patch. The two
most common ways to transport audio from one DAW to another are ReWire and Soundflower.
Soundflower was chosen as the audio interface because it is open source. Soundflower has 64
inputs and outputs that can be mapped to and from any digital audio workstation. The adc
object was used to bring channels 1-16 into the ambiencode 3 20 block, which provides 20
inputs to the ambisonic encoder at 3rd order.
Fig. 8: Final Max patch implementation of input control. The top is the presentation view, and the bottom shows the block diagram.
Channels 1 and 2 are left for opening audio files to be moved within the sphere. Channel 3 is
used for plugins directly from Max (in this case Massive). There is also a test tone on channel 4,
cycle 440, that can be used to check the outputs when troubleshooting.
In order to correctly run audio from a DAW through Soundflower to Max, one must
change the audio settings to match the following:
Fig. 9: Settings for Reaper (left) and Max (right). The buffer size may need to be changed if the computer slows down.
In this example I used Reaper, but the patch can work with Pro Tools or any other DAW as long as
Soundflower is supported. Reaper outputs through Soundflower, which is set as
the input device for Max. The output device in Max can be set to the 25-channel aggregate
device for playback, or to Soundflower (64ch) for recording back into Reaper. A meter block is
added to each input to help the user’s workflow when multiple sources are present. There are a
total of 20 inputs and 25 outputs, approaching the 64-channel limit. Due to the constraint on
channels, playback and recording cannot happen simultaneously.
A recording feature was implemented using a gate in Max to switch between output to
the 25 monitors and recording into the DAW as a 25-channel mix. The audio settings need to be
changed to output through Soundflower (64ch) when recording. This feature is only available
for Reaper, as I do not own a copy of Pro Tools HD. The recording will save a 28-channel track
(28 is the closest channel number to 25 in Reaper) of the given ambisonic mix generated in the
Max patch. This 28-channel track can be used to directly output to the speakers in CSB 505
without having to encode and decode the audio files. Even with 4 sources running through Pro
Tools there were occasional clicks and artifacts from dropped samples. When this happens the
I/O vector size needs to be increased. Issues with latency were present with Pro Tools when
there were more than 6 inputs. Reaper did not have any latency issues when tested with 4 or fewer
audio sources. More channels still need to be tested for latency in Reaper.
Fig. 10: Final Max patch presentation view for outputs (top) and patching mode (middle). A gate was used to toggle between all outlets off (0), open to monitors (2), and open to Recording (3). The bottom image shows the Reaper track with inputs 30-57 that will be recorded from the spatial audio panner.
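The trade-off behind the I/O vector size can be estimated with simple arithmetic: one vector of N samples at sample rate fs lasts N / fs seconds, so a larger vector gives the computer more time per block (fewer dropped samples) at the cost of added delay. A rough sketch, assuming a 48 kHz session (the actual rate is not stated here) and a hypothetical helper name:

```python
SAMPLE_RATE = 48000  # assumed sample rate in Hz; use the session's real value

def vector_latency_ms(vector_size, sample_rate=SAMPLE_RATE):
    """Duration of one I/O vector in milliseconds."""
    return 1000.0 * vector_size / sample_rate

# Delay contributed by a single vector for a few common sizes.
latencies = {size: vector_latency_ms(size) for size in (64, 256, 512, 1024)}
for size, ms in latencies.items():
    print(f"{size:5d} samples -> {ms:.2f} ms")
```

Audio passing from the DAW through Soundflower and Max is buffered at more than one stage, so the audible delay is likely to be a multiple of these single-vector figures.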
Conclusions
Fig. 11: Final Max patch presentation view for the spatial audio panner.
In the end, the spatial audio panner was able to work in the CSB 505 studio with a limited
number of channels and only slight modifications to the Pro Tools preferences and audio settings.
The patch was tested with a recording David Kunstmann and I made during the semester, which
had more than 10 channels of audio. At first, we tried mapping all the channels at once, and Pro
Tools crashed immediately. When we reduced the number of channels and increased the I/O
vector size we were able to get spatial panning of individual sources at user-defined locations. To
add a channel to the inputs of Max, use an insert mapped to the desired channel of
Soundflower, or send the track output directly to one of channels 1-16. Mono or stereo files can
be sent through Soundflower. With stereo files, the user can adjust the width and location of the
stereo input projected into the ambisonic listening space.
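One way to think about the width and location controls for a stereo file is as two virtual sources placed symmetrically about a center azimuth. The sketch below is an assumption about how such a mapping could work, not a description of the patch internals; the function name and the convention that positive azimuth is to the listener’s left are hypothetical.

```python
def stereo_pair_azimuths(center_deg, width_deg):
    """Return (left, right) source azimuths in degrees, wrapped to [-180, 180)."""
    def wrap(angle):
        return (angle + 180.0) % 360.0 - 180.0

    half = width_deg / 2.0
    # Assumed convention: positive azimuth is counter-clockwise (to the left).
    return wrap(center_deg + half), wrap(center_deg - half)

front_pair = stereo_pair_azimuths(0.0, 60.0)    # a 60-degree pair straight ahead
rear_pair = stereo_pair_azimuths(170.0, 40.0)   # a pair straddling the rear
```

Setting the width to 0 collapses both channels to the same point, which behaves like a mono source at the center azimuth.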
Fig. 12: Pro Tools inserts, sends, and I/O settings (top). This track will output to 5-6 (Out 5-6) and 11-12 (Insert 11-12) in Soundflower. These two stereo output pairs (now 8-9 and 14-15 on the bottom panel) are encoded to ambisonics based on the user defined position of the stereo pairs.
Running this Max patch with Pro Tools has latency issues that are very audible when the
I/O vector size is mismatched or too small. When a single source is moved
around the sphere, the perceived source may not sound identical at every position. This could be due to the room
reflections, or the mapping to the speakers. Gain artifacts also appear when sources are
moved to a position directly above the listener or to a horizontally centered position.
Due to the computational intensity of this specific Max patch, a plugin written in JUCE would be
preferable to sending audio through the Soundflower driver. Max is a great prototyping
environment but is limited in processing speed when running in parallel with other DAWs.
Overall, Reaper gives lower latency when running with the spatial audio panner and makes it easier
to manipulate ambisonic and multichannel tracks. Although this prototype of a spatial audio
panner has its limitations, it shows that we can improve upon the workflow of converting recording sessions
into spatial audio mixes. Hopefully, once the bugs are worked out, I can record
sessions in the Rettner Studio using spatial and non-spatial microphones and then bring the
session to CSB 505 for spatial audio mixing/mastering.
References
[1] Ashok, M., King, R., Kamekawa, T., and Kim, S. “Acoustic and Subjective Evaluation of 22.2- and 2-Channel Reproduced Sound Fields in Three Studios,” in Proc. Audio Engineering Society 144th Int. Conv., AES, Milan, Italy, 2018.
[2] Kim, S., King, R., Kamekawa, T., and Sakamoto, S. “Recognition of an auditory environment: investigating room-induced influences on immersive experience,” in Proc. Audio Engineering Society Int. Conference on Spatial Reproduction, AES, Tokyo, Japan, 2018.
[3] Schacher, J. C., and Kocher, P. “Ambisonics spatialization tools for Max/MSP,” Omni 500.1 (2006).