9348 209. - Thieme Medical Publishers

28 Laryngeal High-Speed Videoendoscopy

Dimitar Deliyski

The purpose of this chapter is to lay the foundation for broader understanding of one of the most promising laryngeal imaging techniques, high-speed videoendoscopy (HSV). This technique may have significant impact in helping us uncover new phenomena in the mechanism of voice production and to better understand laryngeal pathology along with its impact on voice quality. HSV is the most powerful tool for the examination of vocal fold vibration to date. It will provide further insights into the biomechanics of laryngeal sound production, as well as enable more accurate functional assessment of the pathophysiology of voice disorders leading to refinements in the diagnosis and management of vocal fold pathology. For the clinic, HSV is capable of providing unmatched functional and structural information about the larynx, which will ultimately improve clinical practice in speech-language pathology and otolaryngology.

Without claims of completeness, this chapter describes the origins and principles of the HSV technique, the technical considerations important for making HSV clinically useful, the advantages of HSV over videostroboscopy, the unsolved challenges to HSV delaying its wide clinical implementation, and the directions in which HSV is expected to improve our research and clinical abilities. Several clinical applications of HSV are reviewed.

◆ Origins and Principles of High-Speed Videoendoscopy

During phonation, the vocal folds usually open and close over 100 times per second and vibrate at velocities approaching 1 meter per second, making it impossible to view this activity with the unaided eye.1 For centuries, scientists and clinicians have been trying to build instruments allowing visualization of this fast vibration. To present such fast vibration to the human eye, one has to “slow it down.” There are three methods for slowing down fast motion.

The most obvious method to “slow down” vocal fold vibration is by optically photographing the fast-vibrating vocal folds at speeds several times faster than the frequency of vibration, then presenting those images to the human eye at significantly slower rates. This is the principle of high-speed imaging (Fig. 28.1). Until the late 19th century, little was known about the limits of visual perception, and building a high-speed imaging machine required technologies not available at the time. Therefore, scientists and engineers had to search for alternative methods.

Another “indirect” approach to the evaluation of vocal fold vibration is by recording signals that result from the vibration and presenting them in a graphic format to the human eye. One can then infer conclusions about the characteristics of vocal fold vibration by analyzing the graphic images. The invention of the phonograph and the gramophone allowed for obtaining “visible graphic recordings” (ie, acoustic waveforms).2 Consequently, the advances in acoustic voice analysis, in electroglottography (EGG) and photoglottography, and the ability to record signals for transglottal airflow and intraoral and subglottal pressure provided invaluable indirect information about the vibration of the vocal folds. The knowledge learned via these technologies helped to refine the models and theory of voice production and stimulated the building of instrumentation that improved clinical voice assessment.

Fig. 28.1 Example of a sequence of color HSV images containing two glottal cycles. The frequency of vibration of the vocal folds (male subject) is 126 Hz and the HSV frame rate is 4000 fps, producing a sequence of ≈32 images per glottal cycle (Video Clip 53).

The third approach for “slowing down” the vibration of the vocal folds is by taking advantage of the stroboscopic effect, which is possible due to the quasi-periodic nature of the vocal fold vibration. In the late 19th century, Oertel published the earliest application of stroboscopic principles for observing vocal fold vibration.3 Later, combining indirect voice signals, acoustic or EGG, with a film or video camera, led to the invention of the most widely used instrument for laryngeal imaging today, the videostroboscopic system.4 Chapter 11 provides full details about the principles and clinical application of videostroboscopy.

The first high-speed motion picture machine was built in the 1930s, leading immediately to several studies of vocal fold vibration.5,6 These and other later studies have become some of the most important works in understanding laryngeal physiology.5–8 But the technology for high-speed imaging was too impractical until the mid-1990s, when two types of high-speed imaging systems became commercially available, the videokymography (VKG) and the HSV systems.9,10 VKG could scan a single line across the vocal folds at a speed of 7800 lines per second, and the HSV systems could scan a full image at speeds up to 2000 frames per second (fps). These first systems provided monochromatic images with poor resolution and image quality. The VKG was faster and less expensive than HSV but could scan only one section on the anterior-posterior plane of the vocal folds and lacked a mechanism for feedback about which line was being scanned. A new-generation VKG system has resolved many of these concerns.11 In the meantime, technological gains in machine vision helped to improve the HSV technology tremendously.12

Today, high-speed cameras can record at frame rates up to 1,000,000 fps. They can record in color (Fig. 28.1), with high spatial resolution and excellent image quality, for longer durations.13 Before overwhelming the reader by presenting technical parameters, it is important to explain: What makes HSV superior to videostroboscopy?

◆ Advantages of High-Speed Videoendoscopy over Videostroboscopy

Videostroboscopy

To elicit an effect of “slow motion,” videostroboscopy relies on the assumption of the near-periodic nature of vocal fold vibration. Figure 28.2A reiterates an illustration of the principle of videostroboscopy from Chapter 11. In videostroboscopy, the resulting “slow-motion” glottal cycle is artificially assembled from images sampled from consecutive phases taken from different glottal cycles. The strobe light flashes are short (10 to 20 μsec), delivered only during the cycle phases for which images are taken.
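The sampling principle described above can be sketched numerically. This is a minimal illustration under stated assumptions, not the algorithm of any actual videostroboscopic system: it assumes perfectly periodic vibration and one flash per glottal cycle, with each flash delayed by a fixed extra fraction of a period. The function name and its parameters are hypothetical.

```python
def strobe_phases(f0_hz, n_flashes, phase_step):
    """Phase (0..1) of the glottal cycle captured by each strobe flash.

    Assumes perfectly periodic vibration at f0_hz and one flash per cycle,
    where every flash interval is one period plus `phase_step` of a period.
    """
    period = 1.0 / f0_hz
    phases = []
    for k in range(n_flashes):
        # Flash k fires slightly later within cycle k than flash k-1 did
        # within cycle k-1, so the captured phase advances step by step.
        flash_time = k * period * (1.0 + phase_step)
        phases.append((flash_time / period) % 1.0)
    return phases


if __name__ == "__main__":
    # Eight flashes spread over eight different cycles cover the phases
    # 0, 1/8, 2/8, ... of one apparent "slow-motion" cycle.
    print(strobe_phases(125.0, 8, 0.125))
```

Under these idealized conditions the captured phases increase steadily, which is what produces the apparent slow motion; the sketch says nothing yet about what happens when the period varies.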

It is important to realize that in the case of an aperiodic signal, the near-periodic assumptions do not hold. When the acoustic or EGG signal is aperiodic, the timing of strobe flashes does not correspond with the phases of the glottic cycle in the desired sequence. Even subtle variations in periodicity can produce completely distorted or unrealistic videostroboscopic sequences. Depending on the type of aperiodicity, the distortions may produce random-appearing vibrations, may change the balance between the timing of the opening and closing phases of the glottal cycle, may produce a reverse-appearing motion during a portion of the cycle or through the entire cycle, or may “lock” out of the closed phase, making it appear that the glottis never closes completely. All these effects can occur even in vocal fold vibration with very good overall periodicity, where the irregularity is so slight that it is not visually perceivable. At the same time, relatively pronounced aperiodic patterns may be able to synchronize the strobe light well, producing an illusory “regular” cycle. There are at least three reasons for these effects:

1. Videostroboscopy is a hybrid between an acoustic analysis system and an imaging system. The videostroboscopic effect relies on analysis of the acoustic or EGG waveform. We typically classify videostroboscopy under the category of laryngeal imaging systems. That is partially true, because the end product of a videostroboscopic exam is a series of images. However, which images are being presented depends on acoustic analysis. Therefore, from the point of view of the analysis of vocal fold vibration, videostroboscopy is an acoustic analysis technique, not a true imaging technique. The videostroboscopic vocal fold vibratory patterns are determined by the acoustic waveform, not by the actual biomechanical vibration. All limitations to acoustic voice analysis are present in videostroboscopy.14


Fig. 28.2 Illustration of the principle of sampling in videostroboscopy (A) and in HSV (B, C). (A) In videostroboscopy, the resulting “slow-motion” glottal cycle (below) is artificially assembled from images sampled from consecutive phases in different glottal cycles (above); whereas (B) in HSV, each cycle (below) is represented by images sampled within that very same cycle (above); that is, there is a “true” intracycle slow-motion viewing achieved by zooming, or warping the timescale. (C) When increasing the frame rate, HSV represents more accurately the details of the vibration within the glottal cycle.


2. For each video frame, videostroboscopy relies on pitch tracking through a laryngeal contact microphone or EGG to predict the phase of the upcoming glottal cycles. Thus, the acoustic or EGG signal during one 30-msec video frame is used to predict the frequency of the next phase of vibration that will be recorded in the upcoming video frame, assuming that the period will not change within the next 30 msec. Obviously, in the case of aperiodic vibration, or aperiodic acoustic or EGG waveforms, the videostroboscopic images will present in random or chaotic order and will not be representative of the actual vibration pattern.

3. It is also important to note that high aperiodicity of the acoustic or EGG signal does not always mean that the period of vibration is highly irregular. A visible irregular pattern of vocal fold vibration would always cause severe irregularities in the acoustic waveform, translating into perceptual effects of severe dysphonia or aphonia. However, period perturbations of the acoustic waveform do not necessarily mean that the period of vibration is visibly aperiodic. Most of the increased acoustic perturbation cases are not related to “visible” period irregularities of the vibration.15 The variations of the acoustic period are not necessarily caused by variations in the period of the glottal cycle. The glottal cycle period may be stable overall, but the cycle-to-cycle variations of local (intracycle) vibratory features, such as glottal width, symmetry, open quotient, mucosal wave, mucus bridges, and/or loss of contact, may be producing acoustic period perturbations. Videostroboscopy interprets these acoustic period perturbations as vibratory period instability, leading to an overdiagnosis of aperiodicity.15,16

In summary, aperiodic vibration or aperiodic acoustic waveforms cause the strobe light to lose synchrony with the actual phase of vocal fold movement, preventing visualization in “slow motion.” As a result, videostroboscopy cannot be used on persons whose voice disorder has caused their vocal fold movement, or acoustic waveform, to become aperiodic. Thus, many patients, mainly those exhibiting dysphonia, cannot benefit from the technology of videostroboscopy even though it is considered the current gold standard for laryngeal imaging.
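The loss of synchrony summarized above can be illustrated with a small numerical sketch. It is only an illustration under stated assumptions: flashes are scheduled at a fixed interval derived from an assumed constant period, while the real cycle onsets may deviate from it. The function names are hypothetical, and the jitter in the example is deliberately exaggerated to keep the numbers easy to follow.

```python
from bisect import bisect_right


def captured_phases(cycle_onsets, flash_times):
    """Actual glottal-cycle phase (0..1) seen at each strobe flash.

    cycle_onsets: sorted start times of the real glottal cycles.
    flash_times: times at which the strobe fires (same time unit).
    Flashes that fall outside the recorded cycles are skipped.
    """
    phases = []
    for t in flash_times:
        i = bisect_right(cycle_onsets, t) - 1  # cycle containing time t
        if i < 0 or i + 1 >= len(cycle_onsets):
            continue
        start, end = cycle_onsets[i], cycle_onsets[i + 1]
        phases.append((t - start) / (end - start))
    return phases


def is_monotonic(phases):
    """True if the captured phases never step backward."""
    return all(b >= a for a, b in zip(phases, phases[1:]))


if __name__ == "__main__":
    flashes = [0, 9, 18, 27, 36]            # msec, predicted from an 8-msec period
    periodic = [0, 8, 16, 24, 32, 40]       # real onsets match the prediction
    jittered = [0, 8, 17.5, 24, 33, 40]     # real onsets deviate from it
    print(is_monotonic(captured_phases(periodic, flashes)))
    print(is_monotonic(captured_phases(jittered, flashes)))
```

With the periodic onsets the captured phases advance in order; with the jittered onsets one flash lands at an earlier phase than its predecessor, which would appear on screen as a brief backward motion.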

Videostroboscopy is a technique that revolutionized clinical management of voice disorders and laryngeal pathology. However, it is applicable only on sustained phonation tasks for individuals with stable phonatory characteristics. Accurate and reliable assessment cannot be achieved for individuals with pronounced dysphonia. Videostroboscopy is not applicable for evaluating transient vocal fold vibratory behaviors, such as phonatory breaks, laryngeal spasms, and the onset and offset of phonation. It cannot be used for tasks involving vocal attack, coughing, throat clearing, laughing, and other activities including rapid laryngeal maneuvers.

High-Speed Videoendoscopy

In contrast with videostroboscopy, HSV is the only technique that captures the true intracycle vibratory behavior through a true series of full-frame images of the vocal folds. Therefore, HSV, by default, overcomes the above limitations of videostroboscopy, providing for the possibility of a more reliable and accurate objective quantification of the vocal fold vibratory behavior regardless of whether this behavior is periodic or aperiodic.

Figure 28.2B illustrates the principle of HSV sampling in comparison with stroboscopic sampling (Fig. 28.2A). In HSV, each resulting glottal cycle is represented by several images sampled within that very same cycle (ie, there is a “true” intracycle slow-motion viewing achieved by zooming the timescale). The lighting is constant, not intermittent as in stroboscopy. HSV is recording constantly, and no information can be missing between the frames. HSV data contains all frames, not just selected ones. Therefore, HSV supersedes videostroboscopy. We have demonstrated that videostroboscopy can be produced from HSV using simulated stroboscopy with audio (SSA).13 The advantage of SSA is that it uses the actual vibration, not an indirect acoustic signal, to establish the glottal cycle phases in producing the stroboscopic effect; thus, it does not suffer from the tracking errors typical of videostroboscopy.
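The idea of deriving a stroboscopic sequence from HSV can be sketched as follows. This is not the SSA method of the cited work, whose details are not given here; it is a hypothetical illustration that assumes the frame index at which each glottal cycle begins has already been detected from the HSV images, and it ignores the audio track entirely.

```python
def simulated_strobe_frames(cycle_onsets, n_phases):
    """Pick one HSV frame per glottal cycle at a steadily advancing phase.

    cycle_onsets: frame indices at which successive glottal cycles begin
        (assumed to come from analysis of the HSV images themselves).
    n_phases: number of phase steps making up one simulated slow-motion cycle.
    Returns the list of selected frame indices.
    """
    selected = []
    for k in range(len(cycle_onsets) - 1):
        start, end = cycle_onsets[k], cycle_onsets[k + 1]
        phase = (k % n_phases) / n_phases
        # The phase is applied to the measured length of this particular
        # cycle, so period variation does not disturb the phase ordering.
        selected.append(start + int(phase * (end - start)))
    return selected


if __name__ == "__main__":
    print(simulated_strobe_frames([0, 32, 64, 96, 128], 4))
    print(simulated_strobe_frames([0, 30, 64, 96, 130], 4))
```

Because each phase is computed within the cycle actually recorded, irregular cycle lengths shift the chosen frame indices but not the order of the phases shown.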

In addition, HSV is a superset of VKG, because it contains all the VKG lines from the anterior to the posterior, whereas VKG contains only one line.13 Therefore, kymography can be produced from HSV by selecting a particular line across the anterior-posterior axis, a process termed digital kymography (DKG).
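Selecting a line from an HSV recording can be sketched in a few lines. The data layout is an assumption made for illustration: the recording is a list of frames, each frame a list of image rows, and each row runs left to right across the glottis so that the row index moves along the anterior-posterior axis. The function name is hypothetical.

```python
def digital_kymogram(frames, row):
    """Extract the same image row from every frame of an HSV recording.

    frames: list of images; each image is a list of rows, and each row is
        a list of pixel intensities (assumed to run left-right).
    row: index of the line to follow over time.
    Returns a list with one row per frame, i.e. time runs down the kymogram.
    """
    return [list(frame[row]) for frame in frames]


if __name__ == "__main__":
    # Two tiny 2x2 "frames"; the kymogram for row 1 stacks that row over time.
    frames = [[[1, 2], [3, 4]],
              [[5, 6], [7, 8]]]
    print(digital_kymogram(frames, 1))
```

Changing `row` yields the kymogram at a different anterior-posterior position from the same recording, which is the sense in which HSV contains every VKG line.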

These properties make HSV uniquely suitable for representing the same content either spatially and dynamically (ie, as a movie) or kymographically. Not only does HSV record the true glottal cycle; it records a series of many glottal cycles, allowing for the study of cycle-to-cycle variation in the local (intracycle) vibratory features over time.

Assessment of Vibratory Features of Sustained Phonation

The purpose of videostroboscopy is the assessment of vocal fold vibratory features in sustained phonation. In a stroboscopic exam, the voice assessment protocol includes vibratory features such as periodicity, symmetry, mucosal wave, open quotient, glottal closure, and mucus aggregation. HSV can be used to elicit all features of the stroboscopic protocol. However, due to the higher temporal resolution and tracking reliability of the HSV technique, some of the features appear differently, and new important aspects of these features can be observed.

Symmetry

Vibratory symmetry of the glottal cycle can be regarded in several ways. In a videostroboscopic evaluation, asymmetry is judged in the left-right dimension evidenced by amplitude and phase differences between the left and right vocal folds. A recent systematic categorization of asymmetry differentiated four aspects of left-right asymmetry: amplitude, phase and frequency differences, and axis shifts.17 Another important aspect of asymmetry is the anterior-posterior phase asymmetry, which is often manifested through the hourglass or zipper effects during vocal fold closure.18 Anterior-posterior phase asymmetry is defined as the anterior and posterior portion of one vocal fold reaching maximal glottal opening at different times within the glottal cycle. Left-right phase asymmetry is defined as the two vocal folds reaching maximal glottal opening at different times within the glottal cycle. Left-right amplitude asymmetry is defined as the two vocal folds having different maximal amplitudes of glottal opening within the glottal cycle. Left-right frequency asymmetry is defined as the two vocal folds vibrating at different frequencies. Axis shifts are defined as the spatial location of the opening of the vocal folds within the glottal cycle shifting to the left or to the right from the location of last contact. HSV allows for the objective visualization of all five aspects of asymmetry, whereas videostroboscopy can be used only for two, left-right amplitude and phase asymmetry.18,19 In videostroboscopy, these two features are usually judged together because it is difficult to perceptually separate them, and their visualization is limited only to periodic vibration and acoustic waveforms.
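The definitions of left-right amplitude and phase asymmetry given above can be turned into a small computation. The sketch below is a hypothetical illustration: the normalizations (amplitude difference divided by the amplitude sum, timing difference divided by the cycle length) are assumptions chosen for simplicity and are not the quotients used in the cited studies.

```python
def left_right_asymmetry(left, right):
    """Amplitude and phase asymmetry of the two vocal folds in one cycle.

    left, right: displacement of each fold from the glottal midline, one
        sample per HSV frame, both covering exactly the same glottal cycle.
    Returns (amplitude_asymmetry, phase_asymmetry):
        amplitude_asymmetry in -1..1, positive when the left fold opens wider;
        phase_asymmetry as a fraction of the cycle, negative when the left
        fold reaches its maximal opening first.
    """
    if not left or len(left) != len(right):
        raise ValueError("left and right must be non-empty and equally long")
    a_left, a_right = max(left), max(right)
    if a_left + a_right == 0:
        raise ValueError("no vocal fold displacement in this cycle")
    amplitude = (a_left - a_right) / (a_left + a_right)
    phase = (left.index(a_left) - right.index(a_right)) / len(left)
    return amplitude, phase


if __name__ == "__main__":
    left = [0, 1, 2, 3, 2, 1, 0, 0]
    right = [0, 0, 1, 2, 1, 0, 0, 0]
    print(left_right_asymmetry(left, right))
```

Because the two values are computed separately, amplitude and phase differences that are hard to tell apart visually can be reported independently.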

Period and Glottal Width Irregularity

Regularity, or periodicity, of vocal fold vibration can be defined as the exact repetition of a spatial-temporal pattern. Thus, irregularity and aperiodicity refer to any change of this pattern over time. The most common visually judged features of vocal fold vibratory regularity are glottal period regularity and glottal width regularity, which reflect the two aspects of the spatial-temporal pattern.15,19 Both HSV and videostroboscopy can be used for assessing period and glottal width regularity. However, the reliability of videostroboscopy suffers significantly in the presence of irregularity due to tracking problems, whereas HSV can visualize any irregular vibratory pattern. In videostroboscopy, the determination of irregularity is essentially based on the acoustic or EGG signal’s properties, not on the actual vibration properties. In addition to the ability to precisely record irregular patterns, HSV allows for presenting these patterns in a spatial-temporal domain using DKG, making them more comprehensible.

Mucosal Wave

Mucosal wave is one feature that is generally thought to be a good global indicator of vibratory behavior. Mucosal wave is the propagation of the epithelium and superficial layer of the lamina propria from the inferior to the superior surface of the vocal folds during phonation. The presence, magnitude, and symmetry of the mucosal wave are indicators of tension and pliability of the underlying vocal fold tissue and are essential to the production of good voice quality.20 Due to the anatomic configuration of the vocal folds and the superficial viewpoint of rigid endoscopy relative to them, the mucosal wave is viewed through two different aspects. The first viewing aspect is the lateral propagation of the mucosal wave between the vocal folds, where the mucosal wave is seen as the differential between the lower and the upper margins of the vocal folds during closing. This view begins with the closing phase, from the moment of adduction of the lower margins of the vocal folds through the end of the adduction of the entire folds. The second viewing aspect is the propagation of the mucosal wave on the upper surface of the vocal folds. This view begins during the closing phase from the upper margins of the vocal folds. While the vocal folds are adducting to close the glottis, the mucosal wave is traveling in the opposite direction, toward the exterior margins of the vocal folds. Both HSV and videostroboscopy allow visualization of the two mucosal wave aspects. However, HSV provides more objective visualization, especially through the use of DKG. Due to its high velocity of propagation, the mucosal wave is the feature most sensitive to the frame rate of the HSV system.20 Our investigations show that for achieving full viewing of the mucosal wave features, the frame rate has to be at least 16 times higher than the frequency of vibration. That is, for a man with a fundamental frequency (F0) of 125 Hz, the frame rate has to be at least 2000 fps, for a woman with F0 = 300 Hz, it has to be at least 4800 fps, and for a woman producing falsetto with F0 = 1000 Hz, it has to be at least 16,000 fps to track the detail of mucosal wave propagation.
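The frame rate arithmetic in this paragraph is easy to reproduce. The short sketch below applies the factor of 16 stated in the text; the function names are hypothetical, and the second helper simply divides frame rate by vibration frequency, as in the caption of Fig. 28.1.

```python
def min_frame_rate(f0_hz, frames_per_cycle=16):
    """Lowest frame rate (fps) giving `frames_per_cycle` images per cycle."""
    return f0_hz * frames_per_cycle


def images_per_cycle(frame_rate_fps, f0_hz):
    """Average number of HSV images recorded during one glottal cycle."""
    return frame_rate_fps / f0_hz


if __name__ == "__main__":
    for f0 in (125, 300, 1000):
        print(f0, "Hz ->", min_frame_rate(f0), "fps")
    # Values from Fig. 28.1: 4000 fps at 126 Hz is roughly 32 images per cycle.
    print(round(images_per_cycle(4000, 126)))
```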

Open Quotient

Open quotient is the amount of time the vocal folds are in the opening and closing phase, versus the duration of the entire vibratory cycle.19,21 HSV allows for measuring open quotient because it provides the true intracycle information for each glottal cycle.
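Given one cycle of intracycle data, the open quotient reduces to a simple ratio. The sketch below is a minimal illustration that assumes a glottal area (or glottal width) value is already available for every HSV frame of the cycle; the function name and the threshold parameter are hypothetical.

```python
def open_quotient(glottal_area, threshold=0.0):
    """Fraction of one glottal cycle during which the glottis is open.

    glottal_area: one value per HSV frame, covering exactly one cycle.
    threshold: area above which the glottis counts as open (assumed; a
        nonzero value can absorb measurement noise).
    """
    if not glottal_area:
        raise ValueError("glottal_area must contain at least one frame")
    open_frames = sum(1 for area in glottal_area if area > threshold)
    return open_frames / len(glottal_area)


if __name__ == "__main__":
    # Eight frames in the cycle, five of them with an open glottis.
    print(open_quotient([0, 0, 1, 3, 5, 3, 1, 0]))
```

Applying the function to each successive cycle of a recording gives a cycle-by-cycle open quotient series rather than a single averaged value.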

Contact and Loss of Contact

Glottal closure is the pattern of vocal fold contact at the closed phase of vibration. It is generally categorized as closed, hourglass, anterior gap, posterior gap, or irregular. This feature can be viewed through both HSV and videostroboscopy. However, it is very important to report whether the realization of contact and loss of contact are changing from one cycle to the next. Only HSV can provide this information due to its inherent true cycle-to-cycle visualization.

Mucus and Mucus Bridges

Vocal fold mucus aggregation is common in persons with voice disorders. It is known that an increase in vocal fold mass, from mucus, will change vocal fold vibratory behavior. Mucus has been noted as the causal factor of rough vocal quality. The presence, type, thickness, location, and pooling of mucus aggregation are important indicators of how mucus is impacting vocal quality.22 Mucus can be evaluated with both HSV and videostroboscopic techniques, and videostroboscopy is generally more sensitive due to its better spatial resolution and image quality. However, another feature important for voice quality, the cycle-to-cycle variation of mucus bridges forming between the vocal folds during loss of contact, can be studied only through HSV.

As indicated, many vibratory features can be studied by either videostroboscopy or HSV. However, most of the features appear different from videostroboscopy when viewed using HSV.15,16,18,20,22,23 Voice clinicians, including speech-language pathologists and laryngologists, have been highly trained to use videostroboscopy. When using HSV, they may attempt to interpret vibratory features relative to the norms used in the clinic with videostroboscopy. Thus, there is a risk that a new and very different technique may not be found useful, unless a smooth transition is realized. An important first step in such a transition is to generate HSV-specific clinical norms. This topic is covered later in the section “High-Speed Videoendoscopy in the Clinical Speech-Language Pathology Practice.”

Analysis of Aperiodic Phenomena

HSV is uniquely suited for studying aperiodic vibration and other fast movements. This is an area in which videostroboscopy has no utility. Videostroboscopy cannot be used on persons whose voice disorder has caused vibration with perturbed periodicity. Not only can HSV visualize such vibration, but also it allows measuring the degree of perturbation. HSV is applicable for evaluating most transient vocal fold vibratory behaviors.

Phonatory Breaks, Laryngeal Spasms, Onset and Offset of Phonation

HSV is the only imaging technique that can effectively record and visualize transient phonatory events. A better understanding of the nature and occurrence of these events is a very important area of voice research with strong implications for clinical practice, from the functional evaluation and diagnosis of various voice disorders through treatment planning and intervention.

Phonatory breaks are transient instabilities or short interruptions of the phonatory process. They are a typical phenomenon associated with several voice disorders. We have also sometimes seen them in vocally normal populations, especially within the first 100 msec after phonatory onset. HSV allows for precisely tracking the phonatory breaks, visualizing them, and assessing their temporal pattern and duration.

Laryngeal spasms can result in vocal fold abduction or adduction. They are thought to occur in neurologically based voice disorders and are most typical in spasmodic dysphonia. HSV allows for precisely tracking, categorizing, and measuring laryngeal spasms.

The characteristics of the onset and offset of phonation may be indicative of a specific type of voice disorder. Little is known in this area. The evaluation of vocal offset can provide invaluable information about vocal fold pliability by judging how quickly and orderly the end of phonation occurs. The evaluation of vocal onset can provide objective information about the maneuvers the patient performs in reaching the optimal phonatory threshold to begin phonation or about asymmetries due to left-right differences in mass and tension, which may not be visible during stable sustained phonation. HSV places this information at our fingertips, and it is only a matter of conducting sufficient research to create better protocols for objective voice evaluation.

Vocal Attack Time

The speed with which the vocal folds adduct to the midline is considered an important variable in the etiology of some voice disorders and may also be a meaningful indicator of central or peripheral neural dysfunction. Measuring vocal attack time has been addressed by Moore and by Werner-Kukuk and von Leden.5,8 HSV allows for precisely recording the voice onset for different types of glottal attacks and measuring useful physiologic characteristics. Recently, we have used HSV to successfully validate a vocal attack time measurement, which is discussed in more detail later in the section “High-Speed Videoendoscopy as a Research Tool for Voice Science.”24

Coughing, Throat Clearing, Laughing, and Other Activities Involving Rapid Laryngeal Maneuvers

Coughing and throat clearing are considered to be potentially harmful to the vocal folds from the point of view of vocal hygiene. Clinicians typically recommend “safer” mucus clearing behaviors, such as “soft” cough and clear. But little is known about the biomechanics of these processes and how they are actually harmful to the tissue. HSV allows for precise visualization, registration, and measurement of the physical attributes of these behaviors. Our ongoing research effort in this area may provide clinically useful data. On a separate note, laughing, clearing, and other similar quickly varying laryngeal tasks are clinically useful as media for eliciting phonation in aphonic patients. HSV can be used for visualizing the vocal fold vibration during the short phonatory segments elicited through such clinical techniques.

Alaryngeal Speech

Developing instrumental or perceptual techniques for the evaluation of alaryngeal voice has always been a significant challenge. These voices do not qualify for acoustic voice analysis, present difficulties in using perceptual scales, and cannot be documented via EGG or videostroboscopy.14 HSV has been successfully used for visualization of the vibratory characteristics of the substitute voice generator and for automatic image segmentation of the neoglottis.25 The ability of HSV to visualize and measure vibration after laryngectomy is important for evaluating the success of voice restoration.

Objective Automated Analysis

After everything said about HSV, it is important to make a clarification. HSV is a lot more than a slow-motion movie. Videostroboscopy is a technique designed primarily for slow-motion visualization of the fast vocal fold vibration during sustained phonation in real time, which is very limited in terms of objective measurement of the vibration. HSV is fundamentally different in that respect. The visualization is very accurate, presented in warped (delayed) time. Every characteristic of the visualized vibration is potentially measurable, because it is inherently accurately represented in the recording. HSV can be described as a data “cube,” which has two spatial coordinates, x (left-right) and y (posterior-anterior), and one temporal coordinate, t (time). Every point in this three-dimensional space is described by the intensity of the corresponding pixel. It is a solid “cube”: there is no missing information along any of the dimensions. Therefore, it is essential to demonstrate what is clinically relevant in order to develop the appropriate analytic technique for measuring it.
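As one example of what the data “cube” makes measurable, the sketch below derives a glottal area waveform by counting dark pixels in every frame. It is a deliberately crude illustration, not a segmentation method from the literature: it assumes the cube is indexed as cube[t][y][x], that the open glottis is darker than the surrounding tissue, and that a single global threshold separates the two. The function name and threshold are hypothetical.

```python
def glottal_area_waveform(cube, threshold):
    """Number of 'glottal' (dark) pixels in each frame of an HSV recording.

    cube: nested lists indexed as cube[t][y][x], holding pixel intensities.
    threshold: intensity below which a pixel is counted as open glottis
        (an assumed, fixed value; real recordings need proper segmentation).
    Returns one area value, in pixels, per frame.
    """
    return [
        sum(1 for row in frame for pixel in row if pixel < threshold)
        for frame in cube
    ]


if __name__ == "__main__":
    # Three tiny 2x2 frames: closed glottis, then progressively more open.
    cube = [
        [[200, 200], [200, 200]],
        [[200, 10], [200, 20]],
        [[5, 10], [200, 20]],
    ]
    print(glottal_area_waveform(cube, threshold=50))
```

Because no frames are missing along t, the resulting waveform has one value for every sampled instant, which is what allows intracycle measures to be computed from it.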

Several automatic and semiautomatic HSV-derived measurements have been reported in the literature.26 They have been classified as follows:

◆ Measures related to frequency of vibration: fundamental frequency; period perturbation quotient; coefficient of variation of F0; voice breaks; vocal tremor (F0-modulation) frequency and magnitude.

◆ Measures associated with glottal symmetry: left-to-right phase, amplitude, and frequency symmetry quotients; axis shifts; posterior-to-anterior symmetry concurrence (showing whether some parts along the vocal fold have different symmetry parameters than others).

◆ Measures related to glottal width and area characteristics: open and closed quotient; glottal area perturbation quotient; coefficient of variation of glottal area; soft phonation index; vocal tremor (glottal area modulation) frequency and magnitude.

◆ Measures reflecting unilateral dynamic characteristics: activity/displacement of the left or the right vocal fold, and ratio of left versus right vocal fold.

◆ Measures related to mucosal wave properties: mucosal wave presence; symmetry quotient; relative area to glottal open area; sharpness pattern.

◆ Measures related to vertical movement during phonation: left-to-right vertical symmetry, computed through the image intensity.

◆ Measures assessing modal types: vibration-based voice typing (similar to types 1, 2, and 3 per Titze); automatic classification of bifurcation patterns (eg, periodic, biphonia, diplophonia, vocal fry, aphonia, vocal onset, vocal offset, etc.); subharmonic level (ie, which is the most active subharmonic of F0 [first, second, etc.]).14

◆ Semiautomatic measures reported objectively: manually placed posterior and anterior commissure markers; manually tagged transient events; visually classified patterns of vibration.17
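To give a feel for how measures of this kind might be computed, here is a minimal sketch of two of them: the coefficient of variation of F0 and the open quotient. The chapter does not give formulas, so the definitions used here are assumptions: the coefficient of variation is taken as standard deviation over mean, and the open quotient as the open fraction of one cycle. The function names and the toy inputs are hypothetical.

```python
from statistics import mean, pstdev


def f0_statistics(cycle_lengths_frames, frame_rate_fps):
    """Mean F0 (Hz) and coefficient of variation of F0 (%).

    cycle_lengths_frames: length of each glottal cycle, counted in HSV frames.
    """
    f0_values = [frame_rate_fps / n for n in cycle_lengths_frames]
    average = mean(f0_values)
    return average, 100.0 * pstdev(f0_values) / average


def open_quotient(glottal_area_cycle, threshold=0.0):
    """Fraction of one cycle in which the glottal area exceeds the threshold."""
    open_frames = sum(1 for area in glottal_area_cycle if area > threshold)
    return open_frames / len(glottal_area_cycle)


# Toy input: four cycles of 40 frames each, recorded at 8000 fps.
print(f0_statistics([40, 40, 40, 40], 8000))
# Toy glottal area waveform for one cycle (arbitrary units, 0 = closed).
print(open_quotient([0, 0, 3, 7, 9, 7, 3, 0]))
```

In practice the cycle boundaries and the glottal area would come from image segmentation of the HSV frames, which this sketch does not attempt.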

Some of these objective measures have been compared with visual perceptual ratings: period and glottal width irregularity, left-right phase and amplitude asymmetry, axis shifts during closure, and open quotient.15,18,21 The findings suggest that usually objective measures differ from visual subjective ratings, underscoring the limits of human perception and the importance of developing robust automated measurement techniques. New objective HSV measures are being developed through the phonovibrography method.27

Objective measures that have been reported using VKG (amplitude symmetry, speed quotient, and phase symmetry index) are highly applicable to HSV.28 Whereas several research teams are actively developing HSV-based objective measures, the clinical efficacy of these measures is still under investigation. There are no established standards or commercially available software products at this time.

Relationships between Vibration and Acoustics

Due to the high temporal resolution of HSV, it is possible for the first time to precisely align the HSV images with acoustics and other voice signals (Fig. 28.3), such as EGG, transglottal airflow, intraoral and subglottal pressure, and accelerometry. This is exciting for two reasons. First, voice science can better understand the relationships between vocal fold vibration and the resulting voice, leading to important refinements of the models of voice production. Second, combining HSV measures of vocal fold vibration with concurrent acoustic and EGG measures may provide complementary, high-precision measures that can improve clinical practice. Scientific investigations of these relationships are under way. Several examples are presented later in the section “High-Speed Videoendoscopy as a Research Tool for Voice Science.”
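The alignment can be illustrated with simple index arithmetic. In the system shown in Fig. 28.3, the camera is clocked from the data acquisition card's sampling rate after 1:6 frequency division, so each frame corresponds to six acoustic or EGG samples. The sketch below assumes the 96,000 Hz per-channel rate and 16,000 fps frame rate quoted elsewhere in this chapter. The function name and the `first_sample` offset parameter are hypothetical.

```python
AUDIO_RATE_HZ = 96_000   # per-channel sampling rate quoted in the chapter
DIVISION = 6             # 1:6 frequency division (Fig. 28.3)
FRAME_RATE_FPS = AUDIO_RATE_HZ // DIVISION  # camera clock after division


def samples_for_frame(frame_index, first_sample=0):
    """Indices of the audio/EGG samples acquired during one HSV frame.

    first_sample is a hypothetical offset: the index of the sample at which
    frame 0 starts, assumed known from the synchronization signal.
    """
    start = first_sample + frame_index * DIVISION
    return list(range(start, start + DIVISION))


print(FRAME_RATE_FPS)
print(samples_for_frame(0))
print(samples_for_frame(2))
```

Because the camera clock is derived from the audio clock, this mapping stays exact over the whole recording rather than drifting as two independent clocks would.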

◆ Technical and Methodological Considerations Using High-Speed Videoendoscopy

This section is intended to provide a practical understanding of the HSV technology. There are two aspects, technical and methodological. The technical part is concerned with acquiring high-quality HSV data. That is, making sure that all vibratory information of interest was recorded correctly, with sufficient spatial and temporal image quality. The methodological aspect is concerned with the efficacy of presenting the relevant information to the clinician or researcher. That is, finding ways of complementing the playback of the HSV movie by other, more intuitive facilitative playbacks and objective measures that can reveal the relevant content, which is often hidden to the human eye through the HSV movie playback.

Important Technical Characteristics of High-Speed Videoendoscopy

An HSV system typically consists of the following elements: a digital high-speed camera (monochrome or color); a 70-degree or 90-degree rigid laryngeal endoscope; an endoscopic lens adapter; a powerful light source (usually 300 W constant xenon); a trigger button; a computer controlling the camera via specialized software for image acquisition and real-time video feedback; a computer monitor; and a wheeled equipment cart. The camera may be connected to the computer either via a specialized hardware card or via a standard Ethernet or FireWire interface. In some cameras, the digital processing circuitry is in a separate box installed on the cart, which allows for a lighter camera head attached to the endoscope. Heavier cameras, 2 lb and above, may be weight-balanced using a camera crane. The synchronous recording of additional signals is available with more advanced configurations. Such systems include additional hardware and software. Our HSV system, designed at the Voice and Speech Laboratory, University of South Carolina (Columbia, SC) (Fig. 28.3), includes the following additional elements: an 8-channel data acquisition card; a head-mount condenser microphone; a microphone preamplifier; an EGG device; a frequency divider; data acquisition software; and a second monitor to separate the HSV image from the channel data feedback.

Sensitivity

The digital high-speed cameras are photon-integrating devices. The complementary metal-oxide semiconductor (CMOS) photo sensor of the camera is divided into pixels (individual photo cells), each usually 10 µm × 10 µm in size, or larger, up to 22 µm by 22 µm. For the duration of one frame of the recording, each pixel “counts” the number of photons being reflected from the surface of the anatomic structures that “fall” on the surface of that sensor. The stronger the intensity of light reflection and/or the longer the integration time, the higher is the number recorded for that pixel (ie, the brighter that pixel will be in the recorded movie). The amount of light that the tissue can absorb safely is limited. Thus, the sensitivity of the sensor’s pixels and the duration of each frame’s integration time are important parameters for eliciting an image. The most sensitive high-speed cameras today have monochrome sensitivity of 6400 ISO per 1280 × 800 pixels sensor (Vision Research, Inc., Wayne, NJ) and 6400 ISO per 1000 × 1000 pixels sensor (Photron Inc., San Diego, CA), which provides similar sensitivity per pixel. The sensitivity of the color versions of these cameras is 1600 ISO.

Integration Time

The high-speed camera integrates the light reflected from the tissue surface for a given time corresponding with one frame. Each recorded sample is an image, termed frame, constructed of many pixels. For example, if the HSV frame rate is 2000 fps, each second of time is divided into 2000 recording sessions following every 500 µsec. The integration of each frame takes most of the 500-µsec period, given that the only time the integration is not active is during the “reset” time for each frame, which is negligible (approximately 2 µsec). This is the most fundamental difference between stroboscopy and HSV. In videostroboscopy, the strobe flashes are intermittent. If they cannot be precisely timed, the resulting reconstructed glottal cycle image sequence is incorrect. HSV is recording everything and no information is missing between the frames (Fig. 28.2).

Fig. 28.3 A block-circuit of the HSV system designed at the Voice and Speech Laboratory, University of South Carolina, which allows aligning precisely the HSV images with acoustic, EGG, and other voice signals. A Phantom v7.3 high-speed camera (Vision Research, Inc.) is clocked by the sampling rate of an 8-channel M-Audio Delta 1010LT data acquisition card (Avid Technology, Inc.) after 1:6 frequency division. The camera “Ready” signal makes it possible to achieve accuracy of synchronization of 11 µsec. This architecture permits exactly attributing the six acoustic or EGG samples corresponding with each frame.
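The frame-period arithmetic in the integration time discussion can be written out as a short sketch. The 2 µsec reset time is the approximate figure quoted in the text and is treated here as an assumed default. The function names are illustrative.

```python
def frame_period_us(frame_rate_fps):
    """Time between the starts of consecutive frames, in microseconds."""
    return 1_000_000 / frame_rate_fps


def integration_time_us(frame_rate_fps, reset_us=2.0):
    """Approximate light-integration time per frame, in microseconds.

    reset_us is the per-frame reset time during which integration is inactive;
    2 microseconds is the approximate value quoted in the text.
    """
    return frame_period_us(frame_rate_fps) - reset_us


for fps in (2000, 8000, 16000):
    period = frame_period_us(fps)
    active = integration_time_us(fps)
    print(f"{fps} fps: period {period:.1f} us, integrating {active / period:.1%} of the time")
```

At 2000 fps the sensor integrates for about 498 of every 500 µsec. The light available per frame shrinks in proportion as the frame rate rises, which is why sensitivity and lighting become limiting factors at high speeds.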

Frame Rate

Although HSV technology provides true sampling of vibration, the selection of an appropriate frame rate is very important for the accurate recording of some of the relevant vibratory features. Figure 28.2C illustrates the effect of increasing the HSV frame rate on the accuracy of representing the vibration details within the glottal cycle compared with lower frame rates (Fig. 28.2B). The frame rate determines the integration time (ie, the time between frames). If the integration time is too long and the velocity of the features being filmed is too high, the fast-moving features are averaged through the integration period, and they appear blurred, out of focus, or may even become invisible. The faster the motion, the shorter the integration time has to be, thus the frame rate has to be higher. Based on our data, the fastest vocal fold vibratory features are the mucosal wave propagation and the movement of the vocal fold edges during the closing phase. Based on visual testing, we established a rule of thumb that the frame rate has to be at least 16 times higher than the frequency of the glottal cycle (in periodic sustained phonation, the same as F0). That is, each cycle has to be represented by at least 16 images (Fig. 28.1). Therefore, in clinical settings, the optimal frame rate is approximately 8000 fps, allowing for the evaluation of voicing tasks not exceeding F0 of 500 Hz. That frame rate would cover most clinical tasks for men and women, such as habitual pitch and loudness phonation, onset and offset, high and low pitch in modal register, and breathy and pressed phonation. In some special tasks, such as falsetto register or pitch glides, even the rate of 8000 fps may be insufficient and some features may be underrepresented. Obviously, the commonly used frame rate of 2000 fps is inadequate and would misrepresent the vibratory features of persons with an F0 above 125 Hz (ie, most women would appear to lack a mucosal wave).20 Several older studies have suggested that increased pitch relates to a reduced mucosal wave, probably a conclusion partially resulting from inferior technology at the time.
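The 16-images-per-cycle rule of thumb translates directly into arithmetic. The sketch below reproduces the two figures quoted in the text, 8000 fps for an F0 of 500 Hz and a 125 Hz ceiling at 2000 fps. The function names are illustrative.

```python
IMAGES_PER_CYCLE = 16  # rule of thumb stated in the text


def min_frame_rate(f0_hz, images_per_cycle=IMAGES_PER_CYCLE):
    """Lowest frame rate (fps) that still gives the required images per glottal cycle."""
    return f0_hz * images_per_cycle


def max_f0(frame_rate_fps, images_per_cycle=IMAGES_PER_CYCLE):
    """Highest F0 (Hz) adequately sampled at the given frame rate."""
    return frame_rate_fps / images_per_cycle


print(min_frame_rate(500))   # frame rate needed for a 500 Hz task
print(max_f0(2000))          # F0 ceiling for a 2000 fps system
print(max_f0(8000))          # F0 ceiling for an 8000 fps system
```

The same two functions can be used to check whether a planned voicing task fits a given camera setting before recording.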

Color

Traditionally, HSV systems have been monochromatic (black and white). The Voice and Speech Laboratory integrated the first known color HSV system back in 2003.13 Since then, we have learned about the advantages and caveats of color. Color is clinically important for correctly identifying anatomic structures and especially for identifying lesions and structural tissue changes. However, to achieve color, the sensitivity of the camera is reduced approximately 4 times, because the light has to be channeled into three color filters (for red, green, and blue), and additional light loss is caused by filtering out the infrared and ultraviolet components, light absorption, and reflection. That translates into a 4-times reduction of the maximum frame rate of the HSV system for the same image quality relative to monochrome. Additionally, color significantly reduces the effective spatial resolution of the camera due to the Bayer mosaic color filtering used in single-chip color sensors. Consequently, for the very same camera model at the same pixel resolution, the color version has a significantly lower effective resolution than the monochrome version because the color image is obtained through interpolation. Thus, the edges of the vocal folds represented using the monochrome camera are more accurate. Color HSV systems have advantages when viewing the vocal fold, but monochrome images allow for more accurate measurement of the vibratory characteristics.

Lighting

HSV technology requires a lot of light due to the CMOS photon integration principles. Thus, increasing the amount of light can improve HSV image quality and frame rates. The type of light source used with most HSV systems today is 300 W constant xenon light. There is, however, a safety concern that further increasing the amount of light used with HSV can cause tissue damage. Additionally, it is considered possible that long exposures to a 300 W constant xenon light can cause tissue damage. No reports of such damage have been filed to date, but as a precaution it is recommended that the amount of time the vocal folds are exposed to light during an HSV exam be reduced to less than 20 seconds.

Spatial (Pixel) Resolution

Our experience shows that spatial resolution above 300 × 300 pixels is adequate for quality images and for automated analysis. The spatial resolution of modern high-speed cameras allows for much higher resolution, including high-definition resolution of 1920 × 1080 pixels, and up to 2048 × 2048 pixels (Vision Research, Inc.).

Effective Dynamic Range

Dynamic range is a measure of brightness resolution of the sensor (ie, the ratio of the largest to the smallest light intensity that the camera can register). That ratio is related to the sensitivity of the camera in combination with the quantization levels (bits per pixel). For example, 8 bits per pixel provide for a maximum dynamic range of 256 (48 dB), whereas 12-bit quantization supports a maximum dynamic range of 4096 (72 dB) per pixel. Whether this is an effective dynamic range depends on the amount of noise in the lower bits (the smaller intensity values). If noise is present, the effective dynamic range is reduced relative to the maximum dynamic range. A high effective dynamic range allows one to “brighten” a dark image or to “darken” a very bright image without causing a distortion or loss of information. That is very important for achieving high image quality and for accurate image analysis.
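The quoted figures follow from the number of quantization levels, 2 raised to the number of bits, expressed in decibels as 20·log10 of the ratio. The sketch below reproduces them. The `effective_dynamic_range` helper adds a deliberately simplified noise model in which each noisy low-order bit is treated as unusable; that model is an assumption for illustration, not something the chapter specifies.

```python
import math


def max_dynamic_range(bits_per_pixel):
    """Return (ratio, decibels) for the maximum dynamic range at a given bit depth."""
    ratio = 2 ** bits_per_pixel
    return ratio, 20 * math.log10(ratio)


def effective_dynamic_range(bits_per_pixel, noisy_bits):
    """Simplified assumption: each noisy low-order bit is treated as unusable."""
    return max_dynamic_range(bits_per_pixel - noisy_bits)


for bits in (8, 12):
    ratio, db = max_dynamic_range(bits)
    print(f"{bits}-bit: ratio {ratio}, about {db:.0f} dB")

ratio, db = effective_dynamic_range(12, noisy_bits=2)
print(f"12-bit with 2 noisy bits: ratio {ratio}, about {db:.0f} dB")
```

Under this simplified model, a 12-bit sensor with two noisy bits offers roughly the usable range of a clean 10-bit sensor.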

Weight

The achievement of ultrahigh speeds requires that all hardware, including the memory, be physically located inside the body of the camera. Thus, the fastest high-speed cameras are too heavy to be held by hand during the exam. The camera used in the Voice and Speech Laboratory system, Phantom v7.3 (Vision Research Inc.), weighs 7 lb. To compensate for the weight, we attached the device to a camera crane (model CamCrane 200; Glidecam Industries, Inc, Kingston, MA). The camera weight is balanced, to appear weightless to the operator, while allowing the most degrees of freedom for motion by using a ballhead. Other than creating weightlessness, the crane was found to reduce significantly the endoscopic motion and tilt, thus introducing a comfortable system to operate. Based on that experience, we recommend using a camera crane regardless of the weight of the camera.

Color and spatial resolution are very important factors when identifying lesions, vascularities, and tissue changes and for accurately representing the glottal edges. Spatial resolution is also important as it allows for the wide view angle necessary to examine the full anterior-posterior view of the vocal folds and their surrounding anatomic structures. The frame rate is essential for accurately displaying the mucosal wave and providing sharp glottal edges, especially when viewing high-pitched samples. Long recording duration is necessary to register multiple phonatory tasks in a continuous recording (ie, comfortable, high and low pitch, glides, loudness levels, repetitive phonation, and forced inhalation), adduction and abduction of vocal folds, and phonatory onset and offset. An increased dynamic range allows for improved viewing quality and increased accuracy of the automated image analyses. Due to insufficient clinical experiments with HSV, the necessary requirements for these factors have not yet been standardized.

Of all factors that influence HSV, the color, temporal resolution, and dynamic range are the ones limited by the sensitivity of the camera sensor. The spatial resolution, temporal resolution, and dynamic range are in a reciprocal relationship as they share the same hardware bandwidth and memory resources. The sample duration and spatial resolution depend on the memory available, while the view angle depends on the spatial resolution and the optics installed. Therefore, improvements in HSV technology depend on three factors: sensitivity of the camera sensor, hardware speed, and memory size. The most important and the most challenging factor is the sensitivity, due to the limitations of CMOS technology and the lack of demand for high sensitivity from the traditional market sectors using high-speed cameras.13

An ongoing collaborative effort between the Voice and Speech Laboratory, University of South Carolina, and the Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital (Boston, MA), led recently to the following breakthrough advances in HSV technology:

1. Color HSV allowing for high-quality 42-bit rigid videoendoscopy at the speed of 6000 fps at a spatial resolution of 400 × 480 pixels and for 24-bit rigid color HSV at 10,000 fps and 320 × 320 pixels resolution.

2. Ultrahigh-speed monochrome HSV allowing for high-quality 12-bit rigid videoendoscopy at the speed of 16,000 fps at a spatial resolution of 320 × 320 pixels and for 8-bit rigid HSV at 48,000 fps and 128 × 200 pixels resolution.

3. High-precision temporal synchronization of monochrome HSV at 16,000 fps with multiple channels of other data allowing for accuracy of synchronization around 11 µsec at a sampling rate of 96,000 Hz per data channel.

4. High-definition HSV allowing for optically zoomed, high-quality, 12-bit monochrome imaging of the vibrating vocal fold tissues at the speed of 4000 fps and spatial resolution of 600 × 800 pixels.

5. Flexible HSV allowing for the use of a regular nasal fiberscope at the speed of up to 6000 fps and spatial resolution of 320 × 320 pixels.
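The reciprocal relationship between spatial resolution, frame rate, and bit depth can be seen by estimating the raw data rate of the monochrome configurations listed above. This is a back-of-the-envelope sketch: it assumes uncompressed, tightly packed pixel data, and the 8 GB memory size in the usage line is a hypothetical value, not a specification of any camera. The function names are illustrative.

```python
def raw_data_rate_bytes_per_s(fps, width, height, bits_per_pixel):
    """Uncompressed sensor output in bytes per second (assumes tight bit packing)."""
    return fps * width * height * bits_per_pixel / 8


def recording_seconds(memory_bytes, fps, width, height, bits_per_pixel):
    """How long a recording fits in a given amount of camera memory."""
    return memory_bytes / raw_data_rate_bytes_per_s(fps, width, height, bits_per_pixel)


# Monochrome configurations from the list above: (fps, pixels, pixels, bits).
# Only the pixel count matters here, so the order of the two dimensions is irrelevant.
CONFIGS = {
    "12-bit, 16,000 fps, 320 x 320": (16_000, 320, 320, 12),
    "8-bit, 48,000 fps, 128 x 200": (48_000, 128, 200, 8),
    "12-bit, 4,000 fps, 600 x 800": (4_000, 600, 800, 12),
}

for label, cfg in CONFIGS.items():
    print(f"{label}: {raw_data_rate_bytes_per_s(*cfg) / 1e9:.2f} GB/s")

# Hypothetical 8 GB of on-camera memory at the first configuration.
print(f"{recording_seconds(8e9, 16_000, 320, 320, 12):.2f} s")
```

The three configurations land within a factor of about 2.5 of one another in raw data rate despite very different frame rates, which is consistent with frame rate, resolution, and bit depth being traded against a shared hardware bandwidth.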

These examples are presented to the reader to provide a notion of what is considered to be the state of the art in the year 2009. Although these HSV system integrations are experimental and not currently commercially packaged for clinical use, it is likely that in another 5 years from now, HSV systems with similar and better parameters will be available to the clinic at a nonprohibitive cost. More importantly, and in the meantime, the research on the clinical efficacy of HSV needs to accelerate to “catch up” with the advance of technology.

Methodology: High-Speed Videoendoscopy Offers Much More than a Slow-Motion Movie

The improvement of the HSV camera technology is essential for the accurate recording of the biomechanical information of vocal fold movement. The biggest advantage of HSV is the true visual presentation of movement of an anatomic structure that humans, especially the skilled clinicians, understand best.13 Advanced image processing techniques will complement the visual data with automated analyses and measurements.

That is, the presentation of the HSV content to the clinician can be made either visually or through measurements. Thus, the methodology for voice evaluation via HSV can be achieved via visual perceptual ratings and via automatic or manual objective measures. These are two mutually complementary approaches. In the rich HSV content, some of the vibratory information is difficult for the human eye to perceive but can be measured automatically, whereas other features are difficult to formalize as an algorithm but are intuitive to the human brain. Therefore, it is most likely that the HSV clinical voice evaluation protocol of the future will be a combination of visual ratings and objective measures.

Facilitative Playbacks

As noted earlier in the section “Advantages of High-Speed Videoendoscopy over Videostroboscopy,” HSV is a lot more than a slow-motion movie. There are many creative ways of presenting the HSV content in an intuitive form by preserving some of the spatial information so the clinician can follow the anatomy while the features of interest are emphasized for easy comprehension. This approach facilitates visual perception, improves the accuracy of quantification, and increases the reliability of visual rating. Special tools for enhanced visualization have been created, termed facilitative playbacks.13,27 The following are some of the facilitative playbacks that have been successfully used thus far for research and clinical purposes: digital kymography playback, mucosal wave playback, mucosal wave kymography playback, and phonovibrogram.

Digital Kymography Playback. The normal sequence of viewing HSV recordings, termed HSV playback, is by sequentially presenting image frames with spatial coordinates x and y along the time axis t. DKG playback corresponds to viewing DKG image frames with coordinates x and t in a sequence presented along the posterior-anterior axis y. In DKG playback, the DKG frames are viewed as a movie sequence that plays from the posterior toward the anterior.13 The DKG playback can be regarded as a step up from multiplane kymography.29 Figure 28.4 provides three snapshots from a DKG playback of sustained phonation taken in the posterior, medial, and anterior areas along the posterior-anterior axis. Figure 28.5 shows two snapshots from a DKG playback of a phonatory offset, taken in the posterior and medial areas. The DKG playback was found useful for demonstrating the change of the dynamic characteristics while viewing damaged tissues, such as lesions, scars, and discolored areas. Dynamic changes due to stiffness of the tissue, shown as a movie, may also help to reveal the nature of lesions (ie, cysts vs polyps). It is essential to point out the importance of endoscopic motion compensation in ascertaining time alignment of the anatomic structures for valid DKG representations.30
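The difference between HSV playback and DKG playback amounts to slicing the same recording along a different axis. The sketch below shows this on a toy recording stored as nested lists indexed `cube[t][y][x]`. The layout, the toy values, the function names, and the convention that y = 0 is the posterior end are all assumptions for illustration. No motion compensation is attempted.

```python
# Toy recording: cube[t][y][x] = 100*t + 10*y + x, so each value reveals its origin.
N_FRAMES, HEIGHT, WIDTH = 3, 2, 4
cube = [[[100 * t + 10 * y + x for x in range(WIDTH)]
         for y in range(HEIGHT)]
        for t in range(N_FRAMES)]


def dkg_frame(cube, y):
    """Kymographic image for scan line y: one row per time step t, columns along x."""
    return [list(frame[y]) for frame in cube]


def dkg_playback(cube):
    """Yield (y, DKG image) for every scan line, stepping along the y axis.

    Assumes y = 0 is the posterior end, so the sequence plays posterior to anterior.
    """
    for y in range(len(cube[0])):
        yield y, dkg_frame(cube, y)


for y, image in dkg_playback(cube):
    print("scan line", y)
    for row in image:
        print("  ", row)
```

Each yielded image has coordinates x and t, and stepping through them in y order produces the movie described above. In a real recording the scan lines would first need to be aligned across frames, which is the role of endoscopic motion compensation.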

Fig. 28.4 Example of a DKG playback of sustained phonation. DKG playback is a movie playing from posterior to anterior. The figure provides three snapshots taken in the posterior, medial, and anterior areas, respectively. The image on the left shows an average image of the vocal folds with the line being scanned across the glottis. On the right, the corresponding kymographic image is shown. The actual DKG playback movie used in this example is provided as Video Clip 54 in the DVD accompanying this book.

Mucosal Wave Playback. The mucosal wave (MW) playback is produced by modifying the HSV image sequence into a series of frames, in which the pixel intensity encodes the motion of the upper and lower margins of the vocal folds and the mucosal edges. As illustrated in Fig. 28.6, in the MW playback, the color indicates the direction of motion (ie, the opening edges are encoded in green and the closing edges in red).13 The frame selected for the example in Fig. 28.6 demonstrates the effectiveness of the MW playback in emphasizing the