Representing Emotions with Animated Text
By
Raisa Rashid, B.Comm (Ryerson University, 2005)
A thesis submitted in conformity with the requirements for the degree of Master of Information Studies
Faculty of Information Studies
University of Toronto
© Copyright by Raisa Rashid 2008
Representing Emotions with Animated Text
Raisa Rashid
Master of Information Studies
Faculty of Information Studies
University of Toronto
2008
Abstract
Closed captioning has not improved since the early 1970s, while film and television
technology has changed dramatically. Closed captioning only conveys verbatim dialogue
to the audience while ignoring music, sound effects and speech prosody. Thus, caption
viewers receive limited and often erroneous information. My thesis research attempts to
add some of the missing sounds and emotions back into captioning using animated text.
The study involved two animated caption styles and one conventional style:
enhanced, extreme and closed. All styles were applied to two clips with animations for
happiness, sadness, anger, fear and disgust emotions. Twenty-five hard of hearing and
hearing participants viewed and commented on the three caption styles and also identified
the character’s emotions. The study revealed that participants preferred enhanced,
animated captions. Enhanced captions appeared to improve access to the emotive
information in the content. Also, the animation for fear appeared to be most easily
understood by the participants.
Acknowledgements
I would like to express my gratitude to all those who made it possible for me to complete this thesis. It would have been impossible without the people who supported and believed in me.
I want to give heartfelt thanks to my thesis supervisors Dr. Deborah Fels and Dr.
Andrew Clement. I want to specially thank Deb for expertly guiding me and inspiring me
at every step. I also want to thank Quoc Vy and Richard Hunt for helping me with the
studies. A huge note of gratitude goes to my fellow “labbies” who were always
encouraging. I want to especially thank Bertha Konstantinidis, Daniel Lee, JP Udo, Emily
Price and Carmen Branje. Finally I want to thank my family. My husband has been a
source of strength and support throughout the last year. He has made my work on this
thesis so much easier by taking care of our lives when I was not able to. I also want to
thank my loving parents, in-laws and sister for their on-going support, love and
encouragement.
I also want to thank the University of Toronto, Ryerson University and NSERC for
supporting my research.
Table of Contents

Introduction
Chapter 1. Literature Review
   Section 1.01 Closed Captioning
      (a) Captioning Standards
      (b) Current Captioning Practices
      (c) Captioning Practices outside North America
      (d) Problem with Captioning
   Section 1.02 Missing Elements: Music, Sound Effects & Prosody
   Section 1.03 Kinetic Typography
      (a) Elements of Typography
      (b) Kinetic Typography
   Section 1.04 A Model of Basic Emotions for Captioning
Chapter 2. Model of Emotion Used in this Thesis
   Section 2.01 Framework of emotive captions
Chapter 3. System Perspective
Chapter 4. Method
   Section 4.01 Content used in this study
   Section 4.02 Data Collection
      (a) Pre-Study Questionnaire
      (b) Data collection during viewing of the content
      (c) Post-study questionnaire
      (d) Video Data
      (e) Experimental setup and time to complete study
      (f) Data Collection and Analysis
Chapter 5. Results
   Section 5.01 Questionnaire Data
   Section 5.02 Video Data
   Section 5.03 Emotion ID
Chapter 6. Discussion
   Section 6.01 Limitations
      (a) Emotion Identification Method
      (b) Communication barriers
      (c) Small Number of Participants
      (d) Limited Genre
      (e) Short and Limited Test Clips
Chapter 7. Recommendation/Conclusion
   Section 7.01 Future Research
   Section 7.02 Conclusion
Chapter 8. References
Appendix A
Appendix B Pre-Study Questionnaire
Appendix C Post-Study Questionnaire
List of Tables
Table 1: Summary of the relevant animation properties for anger, fear, sadness, happiness
Table 2: Description of the themes
Table 3: Emotion identification categories
Table 4: χ2, degrees of freedom (df), mean and standard deviation for enhanced captioning attributes for To Air
Table 5: χ2, degrees of freedom (df), mean and standard deviation for enhanced captioning attributes for Bad Vibes
Table 6: Number of positive/negative comments by groups for the video data
Table 7: Descriptive statistics for HOH group in all categories of the video data analysis
Table 8: Descriptive statistics for hearing group in all categories of the video data
Table 9: Average number of emotions identified in each category
Table 10: Emotions versus type of identification
Table 11: Caption category versus type of identification
List of Figures
Figure 1: Example of closed captioning. The music note is one of the commonly used symbols to represent music (with permission)
Figure 2: Teletext Example
Figure 3: Shark Week Titling Sequence
Figure 4: Example of font size representing volume (Lee et al., 2002)
Figure 5: Example of high intensity anger over four frames
Figure 6: Example of low intensity fear. Initial text size is default size. Text size then expands and contracts rapidly for the entire duration that the text is on the screen
Figure 7: Conventional Captioning System
Figure 8: Captioning System, Proposed
Figure 9: Captioning Process: Conventional and proposed
Figure 10: Age distribution
Figure 11: Example of Enhanced Caption and Extreme Caption
Figure 12: Comparison of the three captioning styles by number of comments
Introduction
Deaf and hard of hearing viewers have limited access to the rich media of television.
Most are unable to obtain the same quality and quantity of sound information as their
hearing peers. Hearing impaired viewers can compensate for some of the missing sound
information through the use of closed captioning, which relays a portion of aural sound
visually through text and icons.
The size of the hard of hearing population in Canada is very difficult to estimate.
The degree of hearing loss can vary from mild to profound deafness making the
classification of being hearing impaired somewhat amorphous (Gallaudet Research
Institute, 2007). In addition, data on the hearing impaired population is primarily based on
self-reported or informant-reported instances and many hearing impaired individuals do
not report their disabilities (Gallaudet Research Institute, 2007). The Canadian Hearing
Society (CHS) roughly estimates that 23% of the Canadian population has some form of
hearing loss (CHS, 2004), which translates into approximately seven million people.
However, not all caption users are hearing impaired, as many viewers turn on
captions to learn English or when sound is unavailable (such as at fitness centres, bars or
restaurants) (Lewis, 2000). The National Captioning Institute estimates that there are 100 million viewers in the US who benefit from using closed captioning (2003). This is neither a small nor an insignificant population that can take advantage of this technology.
Despite the large user group, very few industry or research resources are allocated to
develop closed captioning technology, which has remained stagnant for decades.
The North American system of closed captioning is called Line 21 captioning and it
was developed for analogue television in the 1970s (Canadian Association of
Broadcasters, 2004). This type of captioning is limited to a small set of fonts, colours and
graphics (Canadian Association of Broadcasters, 2004). Limitations in decoder
technology posed legibility problems for early captions, forcing the text to be mono-
spaced, upper case, and in white font colour displayed against a black background
(Canadian Association of Broadcasters, 2004).
Recent advancements in decoder technologies allow for legible mixed case letters,
some symbols, and a choice of fonts and colours to be used (Canadian Association of
Broadcasters, 2004). However, despite the new capabilities, the uppercase, white text on
black background is still the principal format for captions used today (National
Captioning Institute, 2003) (see Figure 1 for an example of standard closed captioning).
Figure 1 Example of closed captioning. The music note is one of the commonly used symbols to
represent music (with permission).
The lack of progress within the captioning industry is incongruent with the
advancements in television technology over the last few decades, especially the
emergence of digital television. Closed captioning has not only remained unchanged for
three decades, current guidelines actually discourage the use of the new available features
such as colours, fonts and mixed case lettering (Canadian Association of Broadcasters,
2004).
A major concern with closed captioning is that it provides only the verbatim
equivalent of the spoken dialogue of a television show and ignores most non-verbal
information such as music, sound effects, and tone of voice. Much of this missing sound
information is used to express emotions, create ambiance, and complete the television
and film viewing experience. Audience members forced to access sound information
through captions alone stand to lose vital components of their television and film viewing
experience.
The inspiration for my thesis is to investigate ways of infusing captions with the non-
verbal sound elements using a relatively new approach: animated or kinetic text. One of
the primary aims of my thesis research work is to determine whether animated text is a
viable option to pursue when representing sound information through animation. My
personal motivation for pursuing this research area comes from being a member of an
assistive technologies lab dedicated to investigating information challenges faced by
people with disabilities. Hearing impaired caption users have tremendous needs that are not currently being addressed, and as an information technology professional I am committed to using new technologies to meet them.
Animated text has been used in the entertainment industry for exciting title sequences and
it appeared to be an unexplored, yet creative approach to text with the potential for
visually representing sound information.
My research work in combining animated text with captioning is described in this
thesis. It begins with a review of the literature in the following areas: captioning, kinetic
text, music, sound effects and emotions. The literature review is followed by a
description of the process used to develop a model of emotion/animation that attempts to
map the properties of animation to the expression of emotions. The process description is
followed by a systems look at how captions are created by third-party caption houses.
The next chapter outlines the study that was undertaken to explore and refine the
emotion/animation model. The results of the data analysis obtained from the user study
are reported following the methods section. These results are then interpreted and
analyzed in a discussion section. The final chapter of the thesis includes
recommendations based on the discussion and concluding remarks.
The goal of this thesis is to explore the potential for animation to enhance captioning
and incorporate more of the non-verbal sound information, not to categorically claim that
animation is the only (or even the best) way to present non-verbal sound information.
Additionally, the animations explored in this thesis are only one possible way of using animation to represent emotions; many others exist that have not been investigated. The focus
of this thesis is on verbatim captioning in its current state only. While literacy levels of
hearing impaired viewers can impact their ability to watch television with verbatim
captioning, this topic is out of scope for this thesis.
Chapter 1. Literature Review
Section 1.01 Closed Captioning
Closed captioning is a technique for translating dialogue into text. A television set
with a built-in decoder is able to display the translated text on the screen to match the
dialogue of a show (Abrahamian, 2003). Captioning can be provided by the broadcasters
in real-time or off-line (for broadcast at a later time) format and can appear on screen as
roll-up or pop-on captions (Canadian Association of Broadcasters, 2004). Off-line
captions are produced by captioners from third party captioning houses or by in-house
captioning departments of large broadcasters. Off-line captions are created before the
program actually airs, which allows time for correcting of errors, inserting of symbols
and organizing of sentence and phrase structure so that it is easy to read. The Canadian Association of Broadcasters (2004) recommends that off-line captions be provided in the pop-on format, where the entire sentence or phrase “pops” on screen at once; however, roll-up captions are sometimes used for off-line captions because they are less costly to produce (R.I.T., n.d.).
Real-time or on-line captions are created by caption stenographers, who watch
live broadcasts (like sporting events) and transcribe them for the viewers at home
(Canadian Association of Broadcasters, 2004). Real-time captions typically appear using
a roll-up format, where the captions scroll up the screen one line at a time.
(a) Captioning Standards
The standard defining the captioning format, spacing and font for North American
analogue television is EIA-608. The EIA-608 standard specifies a restricted set of fonts,
characters and colours for use in captioning (Consumer Electronics Association, 2005).
When captions were first introduced in the early 1970s, limitations in television encoder
decoder technology prevented the use of anything other than a mono-spaced font and
white text colour. However, several options have been added since the original
specification, namely the use of mixed letter cases, more font colours, and special
characters. These new additions are rarely implemented in captions. Moreover, captioning guidelines strongly discourage their use, citing viewer expectations and bandwidth limitations as obstacles.
The reasons behind the discouraging caption guidelines are valid to some extent.
Caption viewers are accustomed to the white, uppercase captions and in general are
hesitant to accept alterations to the old standard (NCAM, n.d.). In one NCAM study, for example, a participant commented that although green captions were easy to read, she felt “funny” about a non-white caption colour (NCAM, n.d.). The
bandwidth allowed for EIA-608 captions is 980 bits per second (bps) (Robson &
Hutchins, 1998), and the quantity and quality of information that can be broadcast at this bandwidth are quite constrained, limiting the type and style of information being
transmitted (Consumer Electronics Association, 2005).
Viewer preferences will not change easily, but the bandwidth limitations will
soon be mitigated. The limited analogue closed captioning standard, EIA-608, is to be
replaced by the emerging EIA-708 (or digital television) standard in the near future. The
higher bandwidth of 9600 bps for EIA-708 means that more data per minute of video can
be transmitted (Consumer Electronics Association, 2006). As a result, EIA-708 will
provide a much richer character set that includes non-English letters,
accented letters and an array of symbols (Consumer Electronics Association, 2006). The
new standard will also allow features such as viewer-adjustable sizing of text, which will
transfer more control over to the viewers by allowing them to increase or decrease the
size of their caption display. EIA-708 will permit the use of different colours for the text,
as well as translucency in the backgrounds. The translucent background has the potential
to obstruct less of the television screen than the black, opaque background used currently
(Blanchard, 2003).
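As a rough illustration of the difference these channel rates make (arithmetic only, not part of either specification), the per-minute data budget of each standard can be computed directly:

```python
# Illustrative arithmetic: compare the per-minute data budgets of the
# EIA-608 (980 bps) and EIA-708 (9600 bps) caption channels cited above.

def bytes_per_minute(bits_per_second: int) -> float:
    """Convert a channel rate in bits per second to bytes per minute of video."""
    return bits_per_second * 60 / 8

eia608 = bytes_per_minute(980)    # EIA-608: 7,350 bytes per minute
eia708 = bytes_per_minute(9600)   # EIA-708: 72,000 bytes per minute

print(f"EIA-608: {eia608:,.0f} bytes/min")
print(f"EIA-708: {eia708:,.0f} bytes/min")
print(f"Ratio:   {eia708 / eia608:.1f}x more room for text and styling")
```

The nearly tenfold increase is what makes room for the richer character sets, fonts and styling features described above.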
The digital television standard will further allow for text styles that include edged
or drop-shadowed text and a broad collection of fonts such as mono-spaced, serif, sans-
serif and cursive. The 708 standard will provide for other interesting features such as the
delay command (Blanchard, 2003). The delay command feature has been designed to
instruct the caption decoder to halt processing of the Service Input Buffer data for a
designated period of time. When a delay command is received by the decoder, incoming
data is kept in the Service Input Buffer until the defined delay time has expired
(Blanchard, 2003). Blanchard (2003) describes this feature of the 708 caption standard as “time-release captions”. The delay command can potentially be used to reduce the speech-to-caption synchronization errors that exist in captioning today.
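The buffering behaviour described above can be sketched in a few lines. This is a simplified Python illustration, not the EIA-708 wire protocol: the class and method names and the simulated clock are invented for this sketch, and only the hold-and-release behaviour of the Service Input Buffer follows Blanchard's (2003) description.

```python
from collections import deque

class CaptionDecoder:
    """Toy model of a decoder's delay-command handling (names are illustrative)."""

    def __init__(self):
        self.buffer = deque()   # stands in for the Service Input Buffer
        self.hold_until = 0.0   # simulated clock time when any delay expires
        self.displayed = []

    def delay(self, now: float, seconds: float):
        """Handle a delay command: halt processing until now + seconds."""
        self.hold_until = now + seconds

    def receive(self, now: float, text: str):
        """Buffer incoming caption data, then flush whatever is released."""
        self.buffer.append(text)
        self.tick(now)

    def tick(self, now: float):
        """Display buffered captions once any pending delay has expired."""
        if now >= self.hold_until:
            while self.buffer:
                self.displayed.append(self.buffer.popleft())

decoder = CaptionDecoder()
decoder.receive(0.0, "HELLO.")        # no delay pending: displayed at once
decoder.delay(0.5, seconds=2.0)       # hold the buffer until t = 2.5
decoder.receive(1.0, "HOW ARE YOU?")  # arrives during the delay: held in buffer
decoder.tick(3.0)                     # delay expired: buffered text released
print(decoder.displayed)              # ['HELLO.', 'HOW ARE YOU?']
```

The point of the sketch is the timing: data arriving during a delay is held rather than dropped, which is what would let a captioner re-time captions against speech.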
(b) Current Captioning Practices
The existing captioning guidelines, referenced by most captioners, are heavily
influenced by the research of Harkins, Korres, Singer & Virvan (1995). The Harkins et al. (1995) study presented 106 deaf and 83 hard of hearing viewers with 19 television
clips and asked them to complete a questionnaire to indicate their captioning preferences.
The study sought to garner user preferences about what the researchers called “non-
speech information” (e.g. speaker identification, music, manner of speaking). Based on
the study results, Harkins et al. (1995) drafted recommendations for use by various
captioning institutions.
The Harkins et al. (1995) study brought forth some interesting insights about caption user preferences and identified possible improvements for captioning. However, many of the issues uncovered in that study were insufficiently addressed and still require in-depth investigation. Harkins et al. (1995) concluded that colour is ineffective in distinguishing between various speakers compared to explicit text descriptions; however, little research has been conducted since then to uncover possible benefits of coloured text to caption viewers. Harkins et al. (1995) also concluded that animations like flashing text are poorly received by audiences, but no research has been carried out to determine whether and how animation, graphics or symbols can be beneficial.
Harkins et al. (1995) used a survey instrument that did not compare different
styles of captions but only asked people to imagine them when answering questions. Use
of animation, symbols, and graphics in captioning is so uncommon that most people will never have seen them. People are generally poor at imagining features and functionality that
they have never experienced before, and are unable to rate the effectiveness or
desirability of them (Jacko & Sears, 2003). Controlled experiments with different styles
and use of sensory enhancements are warranted to determine the effect of these
alternatives on people’s attitudes and levels of understanding of the content.
As mentioned before, very little research has been conducted to build on the findings of Harkins et al. (1995), and captioning guidelines have remained virtually unchanged as a result. This lack of advancement in closed captioning can be
attributed to several factors. Firstly, caption lobbyists are still struggling to ensure better
captioning legislation and increase the number of captioned programs (Downey, 2007).
As a result, the social pressure is not on advancing caption research but rather on having captions present at all. Secondly, captions are created by people unrelated to the
production of the content (e.g., third-party caption houses) without any input from the
producers/creators of a particular show. Therefore, all captioning decisions are made
post-production, by outside captioners. It is easy to speculate that these captioning houses have few compelling reasons to change their current way of producing captions.
Finally, the broadcasters are mandated by government regulations to provide a
specific number of hours of captioning a week (Downey, 2007). As the broadcasters are
naturally concerned with the bottom line, the easiest and least expensive captioning
solution is the one they tend to opt for. Quality control and assurance then is undertaken
by the captioning house rather than the broadcasters. However, as mentioned before, the
captioning houses have little incentive for improving the state of captions.
(c) Captioning Practices outside North America
While closed captioning in North America has remained colourless and motionless, European and Australian caption viewers have been benefiting from colours,
multiple fonts, and mixed case lettering for many years. The European and Australian
equivalent of captioning, called teletext subtitling (NCAM, n.d.b), emerged out of an
advertising model. Teletext, shown in Figure 2, was created by the British Broadcasting
Company to provide accessibility through subtitles to deaf and hard of hearing viewers,
but was subsequently adapted to provide other information such as weather, advertising and sports to all viewers (NCAM, n.d.b). Teletext’s higher data transmission rate (12 kilobytes per second) allows for the display of more features (such as animation) and
information (NCAM, n.d.b). However, complaints from European viewers about their
subtitling services are surprisingly similar to those from North Americans: lack of
synchronization, poor spelling, too fast onset, delays in update rates, insufficient amount
of text and undesirable position of text on screen (Ofcom, 2005).
Figure 2 Teletext Example
(d) Problem with Captioning
It appears that there is much research and development required to improve
captioning to even a minimally satisfactory level. Many issues such as spelling,
synchronization of text to dialogue, and positioning of text on screen have potential
technical solutions. However, none of the captioning techniques described thus far have
adequately managed to present information beyond verbatim dialogue. The capabilities of
the captioning technologies allow for improvements in this area, but little research has
been done to gauge audience reaction to alternative means of sound expression beyond
static text. If the viewers of a show are only receiving the dialogue portion of television
sound, then they are missing, as identified by Harkins et al. (1995) and Ofcom (2005),
important non-verbal elements such as music, sound effects, and speech prosody.
Section 1.02 Missing Elements: Music, Sound Effects &
Prosody
Speech prosody, music and sound effects are almost as integral to the television
viewing experience as the dialogue and the visuals. Speech prosody profoundly
influences emotional context of words (Cruttenden, 1997). Music has the ability to
impact the way viewers recognize, understand and remember a television series or a
movie (Boltz, 2004; Bruner, 1990) and sound effects have the ability to provide
information as well as emotions.
Tone of voice heavily influences emotional context of speech by enhancing
perception of words (Cook, 2002). Speech prosody can be divided into three areas:
rhythm, loudness of voice and intonation (Cook, 2002). All three elements of voice can
convey emotions, in particular changes in loudness and pitch (Cook, 2002). Caption
users who are unable to access these important elements of voice risk losing crucial
dramatic cues, especially when irony and puns are concerned.
As Boltz (2004) suggests, music can be used in parallel with a scene to increase or
decrease the emotional impact of the visual elements. The combination of pitch, timing
and loudness properties has the ability to create eerie suspense or heartbreaking sorrow
(Boltz, 2004). Rising and falling pitch, for example, can represent growing or declining
intensity within a specific emotional context. Complex melodies can convey more
sadness or frustration than simple melodies (Bruner, 1990). Up-tempo music is generally
associated with happy or pleasant feelings, while slow tempo suggests sentimental or
solemn feelings (Gundlach, 1935; Hevner, 1937 as cited in Bruner, 1990). Increasing
loudness or crescendo can express an increase in force, while a decreasing loudness or
diminuendo can convey a decrease in energy (Cooke, 1962; Zetti, 1973 as cited in
Bruner, 1990). All of these critical musical properties are manipulated by the filmmakers
to present the audiences with another dimension of information beyond the visuals. If the
viewers are missing the added music dimension because they cannot hear (or cannot hear
fully), then they are missing tremendous amounts of crucial information the filmmakers
intended to deliver and that is not replicated in the visual information.
Chion (1994) claims that music can have two basic types of emotional effects on a
scene: empathetic and anempathetic. Music that creates empathetic effects directly
expresses the feelings of a scene, by emulating the rhythm, tone and phrasing of that
scene (Chion, 1994). Music that produces anempathetic results uses indifferent music
alongside an emotional scene to exaggerate its impact (Chion, 1994). Boltz (2004)
suggests a similar concept, where the ironic contrast technique is used to combine
dissimilar visuals and music to enhance or diminish the emotional intensity of a particular
scene. An example of the ironic contrast technique would be playing cheery music
together with a violent scene to make the actions dramatically more disturbing.
With empathetic music, where visuals are reinforced by music, the risk to viewers
who cannot hear is that they lose some of the emotional impact of the scene. However,
with anempathetic music or where the ironic contrast is used, music plays a much larger
role. In this case, the risk to the viewers is much higher because losing those critical,
ironic musical cues can result in misunderstanding an entire scene (or an entire film).
Sound effects, like music, can add much information and affect to a television
program or a film. Sound effects supplement visual elements and dialogue to provide
information and establish mood (Marshall, n.d.). According to Kerner (1989), sound
effects are used to accomplish three objectives: simulate reality, create illusions and
establish mood. Firstly, sound effects simulate reality by bridging the gap between a
staged, artificial scene and the audience’s perception of reality. For instance, when a fake
bottle is broken over a cowboy’s head in a Western drama, only the sound effects of a
glass bottle crashing make it real to the audience (Kerner, 1989). Secondly, sound effects
create illusions by assisting audiences in imagining scenes that were never filmed or
shown. For instance, off-screen crowd chatter convinces the audience that the actors are
in a crowded arena instead of an empty studio (Kerner, 1989). Finally, sound effects
establish the mood of a scene simply through the information they provide (Kerner, 1989).
For example, the sound of a door clicking shut can provide the audience with the
information that a door has been shut but the same door clicking during a burglary scene
can communicate a sense of dread (Marshall, n.d.). Viewers who are unable to obtain
these critical sound effect prompts stand to misinterpret (or simply not enjoy) television
shows.
Since closed captioning mostly provides a translation of the dialogue elements,
the richness of music, sound effects and intonation is frequently missed by hearing
impaired viewers (or viewers without access to sound). At this time, closed captioning
either ignores non-verbal sound information or conveys it through italics and bracketed
descriptions such as (woman crying), (door slamming) or (soft music). These descriptive
phrases are, at best, poor replacements for the depth of music, sound effects and voice.
Moreover, time limitations and viewer reading speeds prevent most captioned film and
television from carrying much text description of the non-dialogue information.
Jensema (1997) found in a series of studies that the average viewer reads captions
at a speed of 145 words per minute. He also found that the caption reading speed of
hearing impaired viewers was somewhat higher than that of hearing viewers and that the
reading speed of all caption viewers varied greatly (Jensema, 1997). He suggested that
hearing impaired viewers have a higher caption reading speed because they likely have
more experience using captions than hearing viewers. Harkins et al. (1995) reported in
their study that 53% of the deaf and hard of hearing participants expressed an interest in
having more of the background sound information presented, in addition to the dialogue.
Since, as reported by Jensema (1997), the average caption speed on television is
already 141 words per minute, any further addition of information beyond dialogue may
render the captions too fast for people to enjoy or understand. Also, Jensema, Danturthi
& Burch (2000) found that when viewers watched captioned television programs, they
looked at the captions 84% of the time and the actual video only 14% of the time. These
findings indicate that any additional sound information conveyed to the viewers via the
television screen must be contained in the area where captions are being displayed.
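The presentation-rate constraint can be made concrete with a small calculation (my own illustration, not part of the cited studies), using Jensema's (1997) reported average caption speed of 141 words per minute:

```python
# Sketch: how a caption presentation rate in words per minute (wpm)
# bounds the number of words a single caption can hold.
# The 141 wpm default reflects Jensema's (1997) reported average.

def max_words(display_seconds: float, wpm: float = 141.0) -> int:
    """Largest whole number of words readable in the given display time."""
    return int(display_seconds * wpm / 60.0)

# A caption shown for three seconds leaves room for only about seven
# words, which is why adding non-dialogue descriptions quickly runs
# up against viewer reading speed.
print(max_words(3.0))
```

At typical two- to three-second display times, even a single bracketed description such as (door slamming) consumes a substantial share of that word budget.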
In order to inject more information into the closed captioning without increasing the
number of words, it may be possible to supplement text information with other forms of
information representation such as icons, animation, and colours. Fels, Lee, Branje &
Hornburg (2005) have explored the use of colours and icons in captioning. The main
objectives of this research were to communicate emotional elements missing from regular
captioning and to provide access to more sound information. They experimented with
using icons such as emoticons, music notes and symbols in conjunction with colour to
express emotions such as happiness, anger, and surprise.
Fels et al. (2005) found that participant reactions differed dramatically between the
hard of hearing and deaf groups. The hard of hearing participants found the use of colour
and icons very useful and expressive. However, many of the deaf participants thought the
use of colour was juvenile and preferred to rely on their own faculties to interpret the
emotional content (Fels et al., 2005). All of the participants found the use of icons to
provide sound information useful (Fels et al., 2005). The authors drew two main
conclusions from the findings: first, colours and icons can be used successfully to convey
sound and emotional information, and second, enhanced captioning may be perceived as
more beneficial to hard of hearing viewers than deaf viewers (Fels et al., 2005).
Silverman & Fels (2002) also investigated the use of emotive captions, but in the
form of comic book style graphics. The researchers used speech bubbles and colourful
letters to portray emotions such as sadness, happiness, anger and fear as well as
background sounds. Silverman & Fels (2002) also used multiple icons and a number of
text styles to manipulate the dialogue and present non-dialogue sound information. The
emotive captioned content was viewed by deaf and hearing participants. The researchers
found that most of the participants (ten of eleven) thought that the emotive captions
increased their understanding of the content. The primary complaint received from the
viewers was that the comic book style was somewhat juvenile and ill-suited to more
serious content (Silverman & Fels, 2002).
Beyond using static elements such as graphics, icons and colours, animating the
caption text could be a new way of infusing text with an added dimension of information
without adding more text descriptions. Animated or kinetic text has long been used in the
entertainment industry to entertain, inform and evoke emotions. The emotive power of
kinetic text could be applied to captions to potentially create animated captions that
convey emotions, music and sound effects.
Section 1.03 Kinetic Typography
(a) Elements of Typography
Typography encompasses all the design choices embedded in selecting type for a
webpage, printed page or television screen (Ernst, 1984). Type is characterized by
elements such as font (or typeface), size, spacing (between lines of type) and contrast.
Common typefaces are categorized as serif and sans serif: a serif typeface has small lines
projecting from the ends of the letters, while a sans serif typeface does not have these
lines (Ernst, 1984). Most fonts are proportionally spaced, where larger characters occupy
more space than smaller ones (Ernst, 1984). With a mono-spaced font (the type used in
closed captioning), each character takes up the same amount of space. The size of type is
measured in vertical height, or points (72 points to one inch), from a common baseline.
The weight of type refers to the density of the letters (e.g. bold, light, and heavy)
(Ernst, 1984).
According to Ambrose & Harris (2006), manipulating these type elements can
ensure that a piece of text is readable and legible. Readability refers to the ease with
which readers can comprehend a body of text and takes into account the overall layout
of the text (Ambrose & Harris, 2006). In captioning, pop-on captions are considered more
readable than roll-on captions, as viewers read collections of words as opposed to one
word at a time (RIT, 2000). Legibility, unlike readability, concerns the finer, operational
detail of a typeface (e.g. size, weight, and typeface). A legible typeface, for instance, can
be rendered unreadable by making it too wide (Ambrose & Harris, 2006). Kerning, or the
spacing between two letters, is another factor that affects readability (Ambrose & Harris,
2006): a lack of white space between letters makes text less readable.
(b) Kinetic Typography
Kinetic typography, also known as animated text, is essentially text that moves
over time. Kinetic typography has recently emerged as a powerful tool for expressing
emotion, mood, personal characteristics, and tone of voice in the creative media industry
(Forlizzi, Lee & Hudson, 2003). According to Woolman (2005), kinetic type has
intrinsic, embedded meaning and the ability to inform, entertain and emotionally affect
the audiences. One of the first examples of kinetic typography can be found in the title
sequence of the Alfred Hitchcock film Psycho (Lee, Forlizzi & Hudson, 2002), where
erratic lettering and movement of text on screen successfully communicated the
unsettling nature of the classic horror film.
In recent years, the film and television industry has invested heavily in animated
text for title or credit sequences (perhaps to emotionally prepare the audience for the
impending viewing experience) (Geffner, 1997). Another widely acclaimed use of kinetic
typography is found in the opening sequence of the 1995 horror film Se7en, where
trembling, high-contrast letters in a scratchy typeface convey a sense of terror that
emotionally prepares audiences for this disturbing film (Geffner, 1997).
Other popular examples of kinetic text are found in the title sequences of television
shows like 24 and Shark Week, illustrating the power of kinetic text to influence
audience emotions (Woolman, 2005). The title sequence for Shark Week (Figure 3)
conveys the dangers associated with sharks using only floating, eerie letters, without
showing the sharks themselves (Woolman, 2005).
Figure 3 Shark Week Titling Sequence
In contrast to the film/artistic industry, where much work is being done to exploit
the potential of kinetic typography, little formal research is being conducted in the
academic community to understand the impact of kinetic type on audiences/users. The
few kinetic typography researchers include Wang, Prendinger & Igarashi (2004), who
have explored the impact of kinetic typography in expressing the affective or emotional
state in online communication using a library of pre-made animations. Wang et al.
(2004) used a library that included around twenty different animations, some of which
represented emotions (happy and sad) while others signified emphasis.
Using galvanic skin response sensors to detect emotional arousal among the participants,
Wang et al. (2004) established that the degree of emotional response increased when
animated text was used, implying that animated text can communicate emotions.
However, Wang et al. (2004) did not try to evaluate the relationship between specific
emotions and animations or test whether different animation properties had different
effects on users.
Bodine & Pignol (2003) conducted a study similar to that of Wang et al. (2004),
evaluating the emotional impact of kinetic typography in instant messaging
communication. The researchers concluded that “kinetic typography has the capacity to
dramatically add to the way people convey emotions” (Bodine & Pignol, 2003, p. 2).
Forlizzi et al. (2003) agreed with the findings of Bodine & Pignol that kinetic typography
has the capacity to express affective content. Like Wang et al. (2004), Bodine &
Pignol (2003) and Forlizzi et al. (2003) did not analyze the relationship between specific
animation properties and emotions. As a result, all three research groups allowed the
kinetic properties of the text to be driven by the interpretation of the animations’ creators
rather than by the emotions themselves.
In addition to affective content, kinetic typography can create characters and
direct the attention of audiences (Forlizzi et al., 2003). Ford, Forlizzi & Ishizaki
(1997) discovered that readers of kinetic typography messages attached a tone of voice to
the messages they observed. Ford et al. (1997) define tone of voice as “variations in
pronunciation when segments of speech such as syllables, words and phrases are
articulated” (1997, p. 269). The researchers also suggest that kinetic typography can take
on the “personality” of the people composing the animated messages, with message
readers beginning to assign distinct personas to the writers of the kinetic dialogue.
The research initiatives mentioned thus far have demonstrated that kinetic
typography can be used to enhance the emotional interpretation of written words;
however, only a few of the studies attempted to determine which properties of the
animation evoked particular emotions. For captioners to apply animations to captions
effectively, the animations need to convey specific emotions in a consistent and
accurate way.
Lee, Forlizzi & Hudson (2002) are among the few researchers in the academic
community to look at the relationship between properties of text animation and emotion.
They are also among the few to attempt to create reusable animations. In their study
using kinetic text with instant messaging, Lee et al. (2002) show that kinetic text can be
used consistently with pre-defined patterns to communicate specific emotions. They
argue that kinetic typography parameters correspond to prosodic features of voice that
express emotions, such as rate of speech and volume of voice. They suggest that
animation properties of text, such as increases in size or upward and downward
movement, relate to voice characteristics such as pitch, volume, and speed of delivery.
They report that sweeping upward or downward motions of text can suggest rising and
falling pitch, while increases and decreases in text size can express loudness of voice:
larger font sizes communicate loudness and smaller font sizes express quietness (Lee
et al., 2002). They also discovered that short up and down movements can mimic fear
and that vibration can express anger in animated text. As shown in Figure 4,
a) represents exuberance with large letters while b) represents lowness in feeling. The
findings of Lee et al. (2002) suggest that emotions that are expressed using the prosodic
elements of voice may possibly be represented using
animation properties that correspond to those elements. If in anger, for example, the voice
increases in volume and vibrates, then the animation representing that anger would
increase in size and vibrate.
Lee et al. (2002) make some intriguing observations; however, their techniques
are geared towards generating hundreds of different types of animations. These
animations are contained within a behaviour library, where some animations represent
affective states while others represent nothing more than the typographer’s creative state
of mind.
Figure 4 Example of font size representing volume (Lee et al., 2002)
Minakuchi & Tanaka (2005) agree with Lee et al. (2002) that the motion of the
kinetic text itself has meaning. They classify this motion into three sub-classes: physical
motion, physiological motion, and body language. Kinetic type that follows physical
motion copies natural patterns of movement such as bouncing or falling. Kinetic type
mimicking physiological motion mirrors human reactions such as turning red when
angry. Finally, animations that follow body language replicate motions such as
shrugging. Minakuchi & Tanaka’s findings further support the close relationship between
animation and emotions, and suggest that specific motion of text can be related to
specific emotions. The discovery of these relationships means that particular animation
patterns can be applied consistently to text to produce repeatable expression of emotions.
However, Minakuchi & Tanaka (2005) are focused on developing a kinetic typography
engine that automatically animates a variety of words, whether emotional or otherwise.
Also, the researchers have done little work in the area of validating how
accurately/effectively the animations they have produced represent the meaning of the
words/emotions. Therefore, the application of Minakuchi & Tanaka’s research to closed
captioning is quite limited considering the broad range of caption users.
Overall, kinetic typography needs to be explored further in the context of captioning,
especially to visually express the emotional aspects of sound information such as music,
tone of voice and sound effects. Unless the animations correctly characterize the
emotions, sound effects, or music, their application will be a detriment to users because
it will add clutter, confusion and distraction to the captions.
The examples from the film industry reveal that emotional impact can be achieved using
kinetic typography, but individual viewer reactions, and the exact nature of viewer
reactions to kinetic text, have not been formally studied. Furthermore, applying kinetic
text in the captioning industry faces many limitations. A film titling sequence (lasting only
minutes) can have a budget that exceeds $30,000 (Geffner, 1997) and can involve an
entire artistic team. Comparatively, the budget for captioning is around $1000 for a one-
hour show, which typically involves one captioner (RIT, n.d.). Also, a captioner typically
lacks the graphic design skills needed to manipulate animation properties into effective
animations. As a result, a simplified model of animation based on a limited number of
emotions is required. The simplified model will allow a captioner to identify emotions of
words and phrases and easily apply the appropriate animations to them in a cost-effective
way. Eventually an information system needs to be built that uses kinetic typography to
create animated captions that convey the missing sound and emotional elements.
Section 1.04 A Model of Basic Emotions for Captioning
What constitutes emotions is constantly being debated by psychologists and
psychology theorists, as the human emotional system is complex and difficult to study.
While very few concrete conclusions have been drawn about emotions, one of them is
that emotions are crucial to interpersonal communication. In fact, Ekman (1999) found
that those who are unable to convey emotions through facial expressions and speech
prosody have tremendous difficulty communicating and maintaining personal
relationships.
Theories about emotions can be loosely separated into two main streams of
thought. One maintains that all emotions are the same in essence and differ only in
intensity and pleasantness (Ortony & Turner, 1990). The second claims the existence
of a set of discrete, basic emotions, which are fundamentally dissimilar to each other.
Evaluating the basic emotion models proposed by a range of researchers, Ortony and
Turner (1990) claim that in most proposed models the basic emotions can be combined
to form all other emotions. Even within the group that agrees on the existence of basic
emotions, few concur on the number of basic emotions and their identity/definition.
Examining the research on basic emotions, it appears that the number of basic
emotions falls somewhere between two and several dozen. Mowrer, for example,
proposes only two basic emotional states: pleasure and pain (1960). Frijda identifies 18
different emotions in his model, which includes interest, wonder, humility, indifference
and desire, in addition to more commonplace ones like anger and happiness (1986).
Many of these researchers have their own distinctive reasons or criteria to determine
basic emotions. Mowrer, for example, only includes pleasure and pain as basic emotions
because they are unlearned emotional states (1960).
Among the plethora of conflicting theories and vast number of suggested basic
emotions, Ortony and Turner (1990) observe some congruency. They claim that the most
commonly identified basic emotions include anger, happiness, sadness, and fear (Ortony
& Turner, 1990). Ortony and Turner (1990) also point out that often the basic emotion
theorists are not proposing emotions that are different; they are merely using different
words to describe the same emotion. Words like frustration, irritation, aggression, and
resentment, for example, are more or less intense forms of anger (Ortony and Turner,
1990).
The most well-established psychological models of basic emotion include those
proposed by Plutchik (1980) and Ekman (1999), which suggest that all emotions can be
reduced to a set of five to eight primitive emotions. Based on cross-cultural facial
expressions, Ekman and Friesen (1986) derived a five-emotion model that includes the
four overlapping, common emotions plus surprise. Their justification suggests that basic
emotions are episodic and brief in nature, felt as a direct result of an antecedent, and that
each unique emotion results in a common set of facial expressions shared across
cultures. Plutchik (1980), like Ekman & Friesen (1986), acknowledges the existence of a
set of emotions that combine to derive other emotions. Plutchik’s (1980) emotion model
includes acceptance, anger, anticipation, disgust, joy, fear, sadness, and surprise.
Due to resource constraints, such as lack of money and artistic talent, and the fast
turn-around time required of captioners, the process of adding emotive animations to
captions needs to be simple and easy to use. As a result, for my research, a simplified
model of emotion to characterize the emotional elements present in television and film
dialogue is used. A four-emotion model using the four common emotions suggested by
Ekman was considered initially but a disgust category (suggested as a primitive emotion
by Plutchik, 1980) was added to the emotion model because it appeared once (but
prominently) in the example video content and required a unique animation style.
Chapter 2. Model of Emotion Used in this Thesis
The literature on the relationship between specific animation and emotions is very
limited, especially when applied to captioning. Thus, a design-oriented, intuitive
approach was chosen to uncover the linkage between animation properties and emotion.
The intuitive approach was spearheaded by an experienced graphic designer and a design
team. The design process consisted of generating a set of alternatives, selecting one
concept based on group consensus, applying that concept to captioning content,
refining the concept and then creating a final production. There were natural limitations
to this creative process; however, it seemed to be an appropriate method at this initial
stage.
The experienced graphic designer specialized in animated text and its ability to
evoke emotions. His interest in this particular project was partly due to his own hearing
loss. My role in the research at this initial stage was to utilize captioning, animated text,
psychology and typography literature to support the creative process. I was also
responsible for the more technical aspects of the initiative, such as using multimedia
software to actually construct the animations and applying them to video.
Some animation movements have been found to suggest specific emotions,
especially those emotions that can be easily conveyed by voice intonation. For instance,
Juslin & Laukka (2003) found that rising vocal contours were associated with anger, fear
and joy. Lee et al. (2002) also found that emotions expressed using the prosodic
elements of voice can be represented using animation properties that correspond to those
elements. According to the limited existing literature, animation motions that follow
these voice patterns appear to consistently suggest the same emotions. These shared
properties created a good starting point for the generation of animation ideas for the
graphic designer and the research team.
At first, the research team focused on generating an array of alternative example
concepts. Each concept was roughly sketched to produce many simple “test” animations.
Some of the “animations” were fully rendered using software, while other concepts
remained in mock-up form. During the idea generation period, my role was to take the
designer’s mock-ups and create the test animations. I also used literature to ensure that
the animation examples were readable, legible and comprehensible as captions. The
period of idea generation lasted three weeks.
Next, the example concepts were critically examined and only the strongest were
chosen for further development. My role at this critical period was to evaluate and
reject test concepts based on caption-user preferences reported in the literature. Several
iterations later, the chosen examples were applied to video samples and the animated
clips were assessed by the full research team consisting of me and two other individuals:
an experienced caption researcher and a deaf consultant from the Canadian Hearing
Society. Critical input from the research team guided the selection of the final two
concept styles.
Once agreement was reached on which alternatives to adopt, they were applied to the
text captions of two one- to two-minute examples of video content. At this point, my job
was to create the animations using software and ensure that the animations worked as
captions. During the animation creation process, it appeared that some emotions did not
lend themselves easily to animation. For instance, the motion representing sadness was
not consistently rendered and accurately interpreted by the group. In one example,
motion was used to move type “up” on the page to complement the dialogue “we’re all
going to die!”, delivered with rising pitch. The motion was meant to convey that sense of
“dying” but the link was too tenuous to be understood. To correct this, we relied on
linguistic cues instead, resulting in a sadness animation that moved downward, mimicking
sadness and falling intonation. These and other ideas were then developed and applied by
the designer and me and evaluated by the research team until there was agreement on the
final animations.
Section 2.01 Framework of emotive captions
Once two content pieces were captioned using the expert’s approach, a framework
was created, relating animation properties to the set of basic emotions as described by
Ekman (1999) and Plutchik (1980). At this point I was involved in extracting the
designer’s reasons behind the animations and matching them with the emotions. The
emotions surprise, anticipation and acceptance did not produce unique animation
properties but were expressed within the other five. As a result, only four emotions,
anger, fear, sadness, and happiness, comprised the framework. There was one instance
of disgust identified, but the animation for that instance followed the physical motion of
the character. Disgust was left out of the framework because its animation appeared to
be unique to a specific situation and unrepeatable.
The animation properties consist of a set of standard kinetic text properties. These
are: 1) text size; 2) horizontal and vertical position on screen; 3) text opacity or how see-
through the text is; 4) how fast the animation moves/appears/disappears (speed/rate); 5)
how long it stays on the screen (duration); 6) a vibration effect; and 7) movement from
baseline. Vibration is defined as the repeated and rapid cycling of minor variations in size
of letters and appears as “shaking” of the letters.
All of the properties are defined for three stages of onscreen appearance: 1) onset,
when the text first appears or begins to animate; 2) the effect, occurring between the
onset and offset stages; and 3) offset, when the animated text stops animating or
disappears from the screen. In addition, the intensity of the emotion can affect all of the
properties. For example, high intensity anger has a faster appearance on the screen and
grows to a larger text size in the effect stage than text showing a lower intensity anger
condition (as suggested by Lee et al., 2002). For high intensity anger, the text will also
appear considerably larger than the surrounding text. Most of these properties are
defined relative to the original size of the non-animated text, with default values set by
the captioning standard that is applied.
Table 1 summarizes the property descriptions for each primitive emotion. Where
an animation property is not defined, a default value would be used. For example, where
onset speed is not specifically defined, this would be the default onset speed defined by
the captioning software. The framework also allows for other emotions to be added and
the related animation properties defined accordingly.
Table 1: Summary of the relevant animation properties for fear, anger, sadness, happiness

Fear
  Relevant Properties (Effect):
    Size: Repeated expansion and contraction of the animated text for the duration of the caption.
    Rate: Expansion and contraction occur at a constant rate (e.g. 5 times per second).
    Vibration: Constant throughout the effect.
  Intensity of Effect:
    Low: Size of animated text is the same as non-animated text at onset. Low level of vibration.
    High: Size of animated text is larger than non-animated text at onset; the higher the intensity, the larger the text size, up to 150% of original size. High level of vibration.

Anger
  Relevant Properties (Effect):
    Size: Contract to smallest size (e.g. 10%), then expand to largest size (e.g. 150%), then expand and contract until original size is reached.
    Vibration: Occurs at largest size in cycle.
    Onset: Faster onset than default onset rate.
    Duration: Pause at largest size in cycle, where vibration occurs.
  Intensity of Effect:
    Low: Size of effect is smaller. Slower onset.
    High: Size of effect is larger. Faster onset.

Sad
  Relevant Properties (Effect):
    Size: Vertical scale of text decreases to 50%.
    Position: Downward vertical movement from baseline.
    Opacity: 50% decrease in opacity.
    Offset: Slower offset than default offset rate.
  Intensity of Effect:
    Low: Faster offset; faster downward vertical movement.
    High: Slower offset; slower downward vertical movement.

Happy
  Relevant Properties (Effect):
    Size: Vertical scale of text increases.
    Position: Upward vertical movement from below baseline, following a fountain-like curve.
  Intensity of Effect:
    Low: Slower onset. Offset text size is smaller.
    High: Faster onset. Offset text size is larger.
Figure 5 and Figure 6 illustrate these concepts for a high intensity anger caption
and a low intensity fear caption with the most salient properties identified. The
animations would appear during the time that the captions were present in the video
material.
Figure 5: Example of high intensity anger over four frames. Frame 1 (onset): starting
size; frame 2: vertical scale of text decreases; frame 3: vertical scale of text increases;
frame 4: vertical scale of text decreases.
Figure 6: Example of low intensity fear. Initial text size is default size; the text then
expands and contracts rapidly, with constant vibration, for the entire duration that the
text is on the screen.
Animations created using this framework could be applied to words, sentences,
phrases or even single letters (grain size) in a caption to elicit the desired emotive effect.
The intended emotive effects of the animations seem to be consistently applicable for a
range of different typefaces (e.g., Frutiger, Century Schoolbook).
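To illustrate how this framework might be operationalized in software, Table 1's entries can be encoded as a lookup structure queried by emotion and intensity. This is a minimal sketch of my own; the class and function names are hypothetical and do not describe the software actually used in this thesis:

```python
# Hypothetical encoding of the Table 1 framework; not the thesis software.
from dataclasses import dataclass

@dataclass
class AnimationSpec:
    """One framework entry: effect-stage properties for an emotion."""
    size_effect: str        # how text size changes during the effect stage
    vibration: bool         # "shaking": rapid minor variations in letter size
    onset: str = "default"  # defaults come from the applied captioning standard
    offset: str = "default"

FRAMEWORK = {
    "fear":  AnimationSpec("repeated expansion and contraction", vibration=True),
    "anger": AnimationSpec("contract to ~10%, expand to ~150%, settle at original",
                           vibration=True, onset="faster than default"),
    "sad":   AnimationSpec("vertical scale decreases to 50%, moves below baseline",
                           vibration=False, offset="slower than default"),
    "happy": AnimationSpec("vertical scale increases, fountain-like upward curve",
                           vibration=False),
}

def peak_scale(emotion: str, intensity: float) -> float:
    """Peak text size relative to non-animated text, scaled by intensity (0..1).
    Capped at 150% of original size, as the framework specifies."""
    if emotion in ("fear", "anger", "happy"):
        return min(1.0 + 0.5 * intensity, 1.5)
    return 1.0  # sadness shrinks rather than grows

print(peak_scale("anger", 1.0))  # full-intensity anger peaks at 150%
```

Encoding the framework as data rather than hand-crafting each animation is what would allow a captioner to apply repeatable emotive effects without graphic design expertise.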
Chapter 3. System Perspective
The system for creating captions is a manual process driven by the choices of the
captioners and broadcasters. As can be seen from Figure 7, the conventional captioning
model has visuals and audio as inputs and visuals, audio and text captions as outputs. The
audio input includes all the components discussed in the literature review that combine to
make the television viewing experience complete: speech prosody, sound effects,
dialogue and music. In order to convert inputs to outputs, the captioner transforms the
inputs by transcribing, interpreting, describing and summarizing the dialogue and some
of the other sound elements into text captions (DCMP, n.d.). The resulting output
includes the text equivalent of the dialogue and does not include most of the other sound
components. The dashed lines outline the components of audio that may be missed by
hearing impaired viewers or viewers otherwise unable to access sound (because, for
example, they are at a gym or in a bar).
My hypothesis is that caption users unable to fully access the non-dialogue
audio components of the output can compensate for the missing information with the
introduction of animated captions into the conventional system (as shown in Figure 8). In
the proposed system, the inputs are expanded to include the input of the director/film-
maker/content-creator and the emotion model outlined in Chapter 2. The captioner can
then use these two added inputs to create animations that supplement the missing
audio elements. The emotion model allows the captioner to create meaningful
animations, reducing the amount of descriptive work.
Figure 7 Conventional Captioning System
Figure 8 Captioning System, Proposed
Figure 9 a) schematically represents the tasks the captioner carries out to convert
the inputs into the outputs, based on the captioning process described in DCMP (n.d.).
The captioner first creates a working copy from the master copy of the show, then
watches the program and transcribes the video. Following that, the captioner prepares
the transcript for broadcast by:
a) Paraphrasing or truncating the dialogue/text,
b) Adding text descriptions, and
c) Eliminating redundant/unnecessary text.
Once the captioner has completed the process, the captions are timed with the
dialogue based on the time code and set at a pre-determined presentation rate. Once the
captions and video are matched, it is also the captioner’s job to review his work, merge
the video and the captions and produce a copy for broadcast.
Figure 9 a) also indicates the steps where the captioner makes crucial editorial
decisions during the captioning process. Firstly, the captioner decides how the transcript
for a show will be broken up into phrases and presented to the viewers. He then
determines whether dialogue needs to be paraphrased or “unnecessary” words must be
removed to fit into a short timeframe. The captioner decides which sound effects, music
or prosodic elements are vital enough to be described and the words that should be used
to describe them (DCMP, n.d.). These decisions can often be very important ones and
have the power to change the meaning of a particular show. Incorrectly paraphrasing
fast-paced dialogue can dramatically alter the tone of a show. Similarly, omitting a
particular sound description, because it is deemed unnecessary by the caption creator, can
significantly impact the story.
The directors of captioned shows and movies have failed to realize that captions
are not merely tools that aid understanding of dialogue. Captions are representations of
the director’s work. Those viewers accessing sound via captions interpret a large portion
of the message and meaning of the show or film based on the decisions made by the
caption houses. It is possible for audiences to dislike or misinterpret a movie based on
substandard caption work.
Because closed captioning is currently relatively simple, the decision making required of
caption houses is not as extensive. If the animated caption model were implemented in the
current captioning system, the responsibility of the captioner would increase considerably.
In addition to all the decisions the captioner currently makes, he would have to determine
the emotions felt by the characters. The only effective and accurate way of incorporating
emotive animations into captioning would be to involve the show creators in the
process.
Figure 9 b) describes the proposed, or potential future, process in which the
animated captions are incorporated into the output. Unlike the current process, the
proposed process requires much more input from the director/creator of the show. The
show creator (or a subject matter expert for the show, such as the script writer) would
advise the captioner on how to break up the text into phrases, provide notes on the
emotions and intensity ratings of dialogue, sound effects and music, and indicate how
sound effects and music are to be described. The captioners in the proposed system would
use the director’s notes and the emotion model to create the appropriate animations. The rest of
the proposed process is similar to the current process except that at one point the director
or a representative of the director will review the captions with the captioner before
broadcast.
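The director's notes envisioned in the proposed system could take the form of timecoded annotations that the captioner walks through alongside the transcript. The record structure below is a hypothetical sketch; none of the field names come from the thesis:

```python
from dataclasses import dataclass

# Hypothetical structure for a director's annotation in the proposed system:
# a timecoded note giving the emotion and its intensity for a span of
# dialogue, a sound effect, or music. Field names are illustrative only.
@dataclass
class DirectorNote:
    start_s: float          # start time within the clip, in seconds
    end_s: float            # end time, in seconds
    kind: str               # "dialogue", "sound_effect", or "music"
    emotion: str            # e.g. "fear", "anger"
    intensity: int          # e.g. 1 (mild) to 5 (extreme)
    description: str = ""   # how a sound effect or music should be described

def notes_for_captioner(notes):
    """Sort notes by start time so the captioner can walk them in order
    alongside the transcript when creating the animations."""
    return sorted(notes, key=lambda note: note.start_s)
```

A captioner's tool could then pair each note with the corresponding transcript phrase and look up the matching animation in the emotion model.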
Broadcasters are legally required to provide captioning in order to renew their
licenses (Downey, 2007), thus their focus is on providing captions as cheaply as possible.
If show creators can appreciate that approximately 100 million caption users (National
Captioning Institute, 2003) are accessing and interpreting their works through inferior
caption work, then they might be more willing to participate in the captioning process.
Figure 9 Captioning Process: Conventional and proposed
9 a) Current captioning process
9 b) Proposed captioning process
Chapter 4. Method
In order to begin to understand the impact of kinetic typography in captioning and
to verify and refine the framework outlined in Chapter 2, the following research questions
are proposed:
1. Can viewers interpret/understand/appreciate the animated text captions?
2. Can emotions be represented using animated text in captions?
3. What properties of animation can be used to create emotive animated text?
The first question seeks to determine if typical viewers of captioning, either hearing
or hearing impaired, are able to watch and understand a show captioned with animated
text. Before implementing a novel form of captioning, it is imperative to know if
audiences actually understand, accept and enjoy watching the new style.
The purpose of this research is to improve access to the intended emotional
information contained in the non-dialogue sound of television by expressing those
emotions through movement of text. As such, I attempt to measure the level of
understanding of those emotions within the context of an actual narrative or show. It
should be noted that measuring understanding of emotions is a difficult and complex
process. The animations for a show do not exist by themselves; they are supplemented
by facial expressions and residual sounds that may interfere with people’s
understanding. Furthermore, the study attempts not to discover how the participants
feel while watching a show, but to measure their understanding of the characters’
emotions, which is a difficult task to accomplish.
It is still important to attempt to measure the participants’ understanding of the
characters’ emotions, because the main limitation of kinetic typography research thus
far has been that reactions to animated text were studied out of the context of
television and film. My research endeavours to place the participants’ understanding of
emotions conveyed through animation in the context of watching television or film.
The second research question relates to whether the same properties of animation
applied to different words and phrases can produce a consistent emotional effect on
viewers. For instance, I tried to determine if “angry words” with the same angry
animation properties are consistently identified as angry by the participants. I also tried
to determine whether variations in the intensity of those properties still produce
identification of the same emotion.
Finally, the last question examines which animation properties (and with what values)
can be used to create the desired animations. The focus of this question is on the
relationship between animation properties such as size, duration and vibration rate (Lee
et al., 2002) and specific emotions. The intent of this question is to construct and refine an
emotion model that can be used to generate animations for captioning.
In order to evaluate and refine the proposed emotion model and to analyze
users’ opinions of animated captions, a 2x2x3 factor empirical study was carried out
involving hearing impaired and hearing participants, two different pieces of content and
three different caption styles. One of the caption styles was conventional closed
captioning, which served as the control for the study.
This study applied a mix of quantitative and qualitative methods, as suggested in
Creswell (2003). The study used questionnaires, verbal protocol and probing interviews
to determine user opinions, understanding and interpretation of the emotions expressed
through the captions, and attitudes and preferences regarding the animated captioning
techniques. Data analysis consisted of thematic analysis of verbal protocol and interview
data, and statistical analysis of the written questionnaire responses. Ethics approval was
sought and received prior to commencing the study and the approval letter can be found
in Appendix A. The deaf and hard of hearing (HOH) test participants were recruited via
the hard of hearing association listserv and they were paid a convenience fee of $30 for
their participation.
Since the study was exploratory and the population of people who are HOH is small
and relatively difficult to recruit, the number of HOH subjects was limited to 15. Ten
hearing subjects were recruited for comparison. Ten of the participants were male and 15
were female. The ages of the participants ranged from children to seniors with two under
the age of 18, three between 19 and 24, five between 25 and 34, four between 35 and 44,
seven between 45 and 54 and four participants between 55 and 64. The distribution of
ages is shown in Figure 10. Nineteen of the participants had college or university
training, two (the under 18s) completed elementary school, and four completed high
school.
Figure 10: Age distribution
Although five of the participants were deaf, only one was oral deaf. Fourteen of
the 15 deaf and HOH participants were able to communicate through speaking. Since all
of the participants (except the two children under the age of 18) had completed high school,
their literacy level was considered high enough to understand verbatim captioning.
Section 4.01 Content used in this study
marblemedia (sic), a production company partner, provided a children’s television
program called Deaf Planet to be used for the captioning study. Deaf Planet is about an
average man from earth named Max, who crash lands on a planet inhabited by a
population of deaf individuals. On this deaf planet, Max befriends a deaf female named
Kendra. Each episode of Deaf Planet shows Max and Kendra having different adventures
where they discover facts about science. Since Kendra is deaf, she uses American Sign
Language (ASL) to communicate with the other characters in the show. Max’s robot,
named Wilma, interprets between the hearing and deaf characters. Deaf Planet was
chosen as the test material for two primary reasons. Firstly, since marblemedia creates
shows for hard of hearing and deaf audiences, they were more willing to incorporate the
new caption model outlined in Chapter 3. They were willing to work with the research
team to provide guidance about the characters’ emotions and review the finished
animations. Secondly, copyright law restricted the type of content that could be used in
the study: the permission of the content creator was required before showing the clips to
the participants. marblemedia, as a partner in previous research endeavours, permitted the
use of Deaf Planet in the study and in publications. In addition to copyright issues, I
needed a high resolution copy of the content in an uncompressed format, because
each time a clip was rendered with layers of text, the quality would deteriorate. The
research team had a strong enough relationship with marblemedia to procure such high
quality video.
The sign language content of the show necessitated the exclusion of all
participants trained in ASL, since they would likely rely on the signing instead of the
captions for comprehension. As a result, only non-signing deaf and HOH participants
were invited to take part in the study.
Two different clips of Deaf Planet were chosen for animated captioning. The
clips were chosen based on having a representative range of emotional content and sound
effects as determined by a committee (including the artist who subsequently created the
animations). The content needed to contain at least five of the basic emotions identified
by Ekman (1999) and Plutchik (1980), with a minimum of one example each. Two
seasons of Deaf Planet episodes (26 in total) were viewed before two short clips from
two different episodes (To Air is Human and Bad Vibrations) were selected.
In “To Air” (TA) the main characters are swallowed by a giant fish and
experience varying degrees of fear, anger, sadness, happiness and disgust. In “Bad
Vibrations” (BV) two male characters compete for the attentions of a female character
and experience emotions such as fear and anger. BV also contains three sound effects:
whirring (the start of a space-craft engine), rumble (the earth shaking) and plop (snow
falling on a character). As an informative, learning-oriented show, geared towards young
children, Deaf Planet uses fairly simple and exaggerated expressions of emotion and
sound effects. As a result, it was relatively simple to identify and categorize the emotions
in the test clips. Before proceeding with the animations, the categorized emotions were
verified by marblemedia.
The clips were captioned using two kinetic typography techniques, labelled
enhanced captioning and extreme captioning. Combined with conventional closed
captioning, this produced a total of six videos (two clips x three styles). Each of the test
clips was short, lasting between 1 minute 30 seconds and 2 minutes. Enhanced and
extreme captioning are described in more detail later in this section.
A short, 30-second training clip was also produced using closed captioning to
orient the participants to the study. The training clip introduced the characters and the
premise of the show and allowed the test participants to adjust their environment
(volume, proximity to the television) according to their needs. Comments made by the
participants during the training clip segment were not used during data analysis.
The caption styles chosen for testing were divided into three categories: enhanced
captions (EC), extreme captions (XC) and conventional closed captions (CC).
Animations were applied to the captions to emphasize the emotional content of words or
phrases, to indicate background noise or sound effects and to denote emphatic words. The
emotive animations were developed through an intuitive approach (described in Chapter
2), where an experienced graphic designer and artist created animations that he thought
(in his artistic experience) would convey the five basic emotions: anger, fear, sadness,
happiness and disgust. Many of the artist’s choices corresponded with literature that
connected prosodic features (e.g. pitch and volume) to animation properties (e.g. size and
vibration). The animations were iteratively reviewed and revised by a committee until
consensus was reached regarding their effectiveness in portraying the emotions. The steps
in developing the emotive captions are outlined in detail in Section 2.01 in Chapter 2.
Figure 11 shows examples of each type of caption style used in this study. In the
enhanced captioned version, the animations were intended to be unobtrusive and all
words were placed at the same location on the screen at the bottom. The static placement
of the captions represents the typical location of conventional captions.
The extreme captions included the same animations as the enhanced captions but
the animated words and phrases were placed dynamically on the screen. In addition, the
extreme caption animations were much larger than the enhanced version. The graphic
designer chose this stylistic interpretation to convey his own artistic expression and
emulate a new style of text display found in advertising.
Conventional closed captions were also produced for the videos as a control,
to compare the emotive styles of captioning to what is currently used in the
industry. The conventional captions were produced using the services of an online
captioning company and are shown in Figure 11c.
a. Enhanced Caption b. Extreme Caption c. Conventional Closed Captions
Figure 11: Examples of enhanced, extreme and conventional closed captions
Section 4.02 Data Collection
(a) Pre-Study Questionnaire
Before viewing the videos, the participants were asked to complete a pre-study
questionnaire. The questionnaire was divided into three basic parts. The first section
contained five questions and documented demographic information: hearing level
(hearing, HOH or deaf), gender, age, educational level, and occupation. The demographic
information was collected to assess whether the age, education level and occupation of the
participants affected their reaction to the animated captions.
The second part of the questionnaire had four questions and sought to ascertain
the participants’ television viewing habits. The first question asked how much
television and film they watched in a week, with six choices ranging from less
than 1 hour to more than 20 hours. Questions two and three asked
whether the participants watched television by themselves or with others, using a five
point Likert scale with categories of Always, Frequently, Sometimes, Seldom, and Never.
Question four asked how often the participants discussed the television program they
were watching with others, using the same Likert scale.
The third section of the questionnaire focused on how the participants used
conventional closed captioning and asked their opinions on the current state of
captioning. Section three contained six questions in total. The first question asked how
often the participants watched television with captioning. The second question asked the
participants to check off a list of features they liked about captioning: rate of speed of
display, verbatim translation, placement on screen, use of text, size of text and colour.
The third question provided the same features as question two but asked the participants
to check off which ones they disliked. The fourth question provided the participants with
a list of potential elements missing from captioning and asked them to check those ones
they believed were important and wanted to see in the future. The elements for question
four were a) emotion in speech, b) background music/notes, c) speaker identification, d)
text displayed at the same time as the words being spoken, and e) adequate information in
order to understand jokes/puns correctly. The fifth question in section three asked the
participants to select five characteristics of closed captioning that needed to be changed
from a list of fourteen including a) faster speed of displaying captions, b) text description
for background noise, c) use of colours and d) use of different font and text sizes. The
final question in the questionnaire was open-ended and asked the participants to add any
comments not captured by the previous question.
The questionnaire was evaluated prior to the study by two people in order to
eliminate ambiguity and confusion. The pre-study questionnaire can be found in
Appendix B.
(b) Data collection during viewing of the content
The study was designed as a within-subjects 3 (caption styles) x 2 (clips of
Deaf Planet) factor design. The order in which the clips were viewed was randomized for
each participant in order to reduce learning biases resulting from watching the same clip
multiple times. Randomizing the order ensured that each captioned version was the first
one seen by at least some participants.
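As a sketch, this per-participant randomization could be implemented by shuffling the three caption-style versions of each clip independently (the function and the seed handling are illustrative assumptions, not the study's actual procedure):

```python
import random

STYLES = ["enhanced", "extreme", "closed"]       # the three caption styles
CLIPS = ["To Air is Human", "Bad Vibrations"]    # the two test clips

def viewing_order(participant_seed: int):
    """Return a randomized presentation order for one participant.

    The three caption-style versions of each clip are shuffled
    independently, so across participants each version ends up being
    someone's first viewing. The seed is per participant, so an order
    can be reproduced later.
    """
    rng = random.Random(participant_seed)
    order = []
    for clip in CLIPS:
        styles = STYLES.copy()
        rng.shuffle(styles)                      # independent shuffle per clip
        order.extend((clip, style) for style in styles)
    return order
```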
After the first viewing of the first version of each episode, participants were given
a checklist and asked to identify which emotions were present in the clip. Next, they were
asked to identify the checked emotions during a second viewing of the same version of
the show. At this point, they could also add new emotions recognized in the show but not
recalled during the checklist task. Since people usually watch a program only once, and
to avoid a learning effect, I sought to capture participants’ initial reactions to and
understanding of the emotive content without much time for reflection.
Participants then watched the remaining versions of the episode and commented
on the captioning style, the information presented and the show itself. The participants
were encouraged to voice their thoughts aloud while watching the clips. All sessions were
videotaped and transcribed. After each set of three viewings of an episode,
participants were asked to complete a post-viewing questionnaire to provide reactions to
and opinions of the three different captioning styles.
(c) Post-study questionnaire
The post-study questionnaire contained ten questions relating to the participants’
viewing experience. The first question asked participants to summarize their
understanding of the test clip. Next, on a Likert scale ranging from 1 to 5, the participants
were asked to rate the likability of various caption attributes for the enhanced captions
(color, size of text, speed of display, placement on screen, movement of the text, how
well the animations portrayed the character’s emotions, and text descriptions) where 1
was “Liked very much” and 5 was “Disliked very much”. They were then asked to rate
their affinity for the size, speed, screen placement and text descriptions of the
conventional captions using the same Likert scale. Participants were asked to rate
whether the enhanced captions contributed to their understanding of the show and the
emotions, their level of confusion about the show, and their level of distraction. Finally,
participants were asked for any concluding commentary or to expand on specific
comments made during the show that they thought were particularly noteworthy. The
post-study questionnaire was evaluated by two people before commencing the study in
order to reduce ambiguous or confusing wording. The post-questionnaire can be found in
Appendix C Post-Study Questionnaire.
(d) Video Data
During the viewing of the various clips, the participants were encouraged to think
out loud and provide commentary and opinions about the content and the captioning.
After watching each clip, participants were given a further opportunity to comment on the
captioning styles and the content. All sessions were videotaped to capture participants’
comments.
(e) Experimental setup and time to complete study
The pre- and post-study questionnaires required five to ten minutes to complete
and the total viewing time for the clips, including comments, was 15 to 20 minutes. In
total, all studies were completed within 45 to 60 minutes.
The test environment simulated a home television viewing experience. As a
result, all the captions were viewed on a television set instead of a computer screen. The
participants were allowed to adjust their proximity to the television and the volume level
prior to starting the study.
(f) Data Collection and Analysis
To determine user opinions of, and to compare, the different caption styles, the video
data was subjected to the thematic analysis approach proposed by Miles and Huberman
(1994). First,
themes were identified and defined by three independent reviewers through viewing a
representative sample of the video data. These themes were then distilled into five main
themes (see Table 2) by consensus of the three reviewers. Each theme was further
divided into positive and negative sub-categories.
Two independent raters then evaluated two sample videos using the themes and sub-
categories and the results were analyzed using an intraclass correlation (ICC) to
determine the reliability of the theme definitions. The single measures ICC between the
two raters was 0.6 or higher for all categories. The remaining video data was then
analyzed by a single rater.
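For illustration, a single-measures ICC can be computed from a one-way ANOVA decomposition. The sketch below uses the common ICC(1,1) form; this is an assumption, since the thesis does not specify which ICC variant was applied:

```python
def icc_single_measures(ratings):
    """One-way random, single-measures ICC (the ICC(1,1) form).

    `ratings` is a list of per-segment tuples, one rating per rater.
    Illustrative sketch only; the thesis does not state which ICC
    variant was used.
    """
    n = len(ratings)                  # number of rated video segments
    k = len(ratings[0])               # number of raters
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    segment_means = [sum(r) / k for r in ratings]
    # One-way ANOVA mean squares: between segments and within segments.
    ms_between = k * sum((m - grand_mean) ** 2 for m in segment_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, segment_means) for x in r
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

Perfect agreement between the raters yields an ICC of 1.0, so the study's threshold of 0.6 indicates substantial, though not complete, agreement on the theme definitions.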
Table 2: Description of the themes

Like/dislike enhanced captioning: Specific direct or implied comments about
preferring/not preferring enhanced captions.
  Positive: “I like this”; “this helps me understand better”
  Negative: “I don’t like this”; “this is confusing”

Like/dislike extreme captioning: Specific direct or implied comments about
preferring/not preferring extreme captions.
  Positive: “this helps me know who is speaking”
  Negative: “this is too imposing”

Comments about closed captioning: Any comments about closed captioning.
  Positive: “I like closed captioning better than the other two”
  Negative: “This doesn’t provide me with all the information”

Technical comments or comments unrelated to caption style: Captures all
comments unrelated to the captions.
  Positive: “This is an interesting show”
  Negative: “Captions block off too much of the screen”

Impact of captioning: Impact of the captions as a whole.
  Positive: “Captions are useful”
  Negative: “Captions are distracting”

Impact on genre: Any comments related to the appropriateness of animated
captions to a specific genre.
  Positive: “This would be great for comedies”
  Negative: “This would never work for dramas”

Impact on people: Comments about how captions would impact other people.
  Positive: “Kids would love this”
  Negative: “Hearing people will find it distracting”

Sound effects: All comments related to the sound effects.
  Positive: “I like the rumbling noise sound effect”
  Negative: “The word music should be replaced with notes”
The emotion identification portion of the video data was categorized separately
using signal detection theory as a basis for the design of the categories (Abdi, 2007). The
categories were designed to take into account correctly detected stimuli (“hit”),
undetected stimuli (“miss”), incorrectly detected absent stimuli (“false alarm”) and
correctly rejected absent stimuli (“correct rejection”) (Abdi, 2007).
As mentioned in Chapter 2, the artist who created the animations for each of the
emotions also categorized the emotions in the content with the help of a small committee.
The classified emotions were then approved by Deaf Planet’s production company,
marblemedia, to verify the validity of the classification. A “hit” or correct identification
was recorded when the user identified the correct emotion (as classified by the artist and
approved by the production company) experienced by the characters within a two second
timeframe of the occurrence of that emotion. The two-second window was selected to
account for reaction time. The model human processor proposed by Card,
Moran and Newell (1983) suggests that reaction time comprises the time it takes for an
individual to see (70-700 msec), perceptually process (50-100 msec), cognitively process
(25-170 msec) and respond with a physical action (30-100 msec) to a visual stimulus (a
total of 175 msec to approximately 1.1 sec). Unlike the model human processor, where the task was to
press a button when a correct stimulus appeared on a screen, the participants in this study
completed a more complex series of tasks: watch a program, process the emotions and
verbalize the emotions aloud. As a result, I increased the window of acceptable reaction
time to two seconds.
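The arithmetic behind that window can be verified directly: summing the minimum and maximum stage times from the cited model gives a range of 175 msec to 1,070 msec, which the two-second window comfortably exceeds.

```python
# Stage times in msec as (min, max), for the model human processor
# stages cited from Card, Moran & Newell (1983) in the text.
STAGES_MS = {
    "see": (70, 700),
    "perceptual_process": (50, 100),
    "cognitive_process": (25, 170),
    "physical_response": (30, 100),
}

min_total_ms = sum(lo for lo, hi in STAGES_MS.values())  # fastest modeled reaction
max_total_ms = sum(hi for lo, hi in STAGES_MS.values())  # slowest modeled reaction

print(min_total_ms, max_total_ms)  # 175 1070, i.e. roughly 0.175 s to 1.1 s
```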
Incorrect identification was noted when the user identified the “wrong” emotion
(compared to what was classified) but within the two second timeframe (+ or – 2
seconds). No identification or “miss” was noted when the viewer did not indicate that
there was emotion when it was present in the captioning. Extra emotion identification or
“false alarm” was noted when the viewer identified an emotion outside of the classified
emotion window. Correct rejection was not applicable in this case because the
participants were not asked to identify instances where there was no emotion. As a
result, d-prime, the index of the discriminability of a signal, could not be calculated;
future studies might consider a mechanism to incorporate this element. The d-prime
statistic could then be used to compare hit, miss, false alarm and correct rejection rates
between the enhanced and conventional caption styles as perceived by the users.
Table 3: Emotion identification categories

Exact: Correctly identified emotion within the correct time frame.
Incorrectly identified: Incorrectly identified emotion within the correct time frame.
No identification/Miss: Did not identify the emotion at all within the correct time frame.
Extra emotion identification: Identified emotion outside of the designated time frame.
Chapter 5. Results
Section 5.01 Questionnaire Data
Twenty-five participants completed one pre-study and two post-study
questionnaires during the session. The pre-study questionnaire results revealed that 11 of
25 participants (44%) spent one to five hours watching television per week, five of 25
(20%) watched six to ten hours, six of 25 (24%) watched 11 to 15
hours, one of 25 (4%) watched 16 to 20 hours and two of 25 (8%) watched
more than 20 hours of TV a week.
Thirteen of 25 participants (52%) watched television alone either always or
frequently. Eight of 25 participants (32%) said they sometimes watched television alone
and four of 25 (16%) said they seldom did. Fifteen of 25 participants (60%) had
conversations with friends and family while watching television, while 40% seldom or
never did. Twenty of 25 (80%) reported using closed captioning either always or
frequently and five of 25 (20%) said they never used captioning when watching
television. When asked what they liked most about captioning (participants could choose
more than one option), 40% said the rate of display, 40% chose verbatim translation of
the dialogue, 12% chose the use of text, eight percent chose the use of black and white
colours and four percent chose the “other” category.
When reporting what elements the participants thought were missing from closed
captioning, the most popular answer, with 44%, was the lack of emotion in speech,
followed by inadequate information for jokes and puns with 32%, speaker identification
with 16%, lack of synchronization with 16% and lack of background noise/music
with 12%.
Finally, the participants were asked to choose five characteristics of existing
closed captions that they would like to see changed. Most participants (84%) chose the
use of different fonts and text sizes, and 68% chose speaker identification. Less popular
were speed of captions with 44%, floating captions and use of colours with 36% and text
description for emotions with 32%.
A chi-square (χ2) analysis of response categories comparing responses with an
expected frequency of equal distribution among all answer categories for all questions in
the post-study questionnaire for both episodes, To Air is Human (TA) and Bad Vibrations
(BV), showed some interesting similarities. There was a significant difference between
response categories at the p < 0.05 level for most questions regarding the “likability”
ratings for the enhanced captions attributes for both episodes. The “likability” ratings, as
mentioned in Section 4.02, ranged from “Liked very much” (score of 1) to “Disliked very
much” (score of 5) on a 5-point Likert scale and were applied to colour, size, speed,
placement, movement and emotiveness of the captions. Table 4 and Table 5 show the
χ2 results and the means and standard deviations (SD) for each attribute
(“likability” rating) that showed significance, for episodes TA and BV respectively.
Table 4: χ2, degrees of freedom (df), mean and standard deviation for enhanced
captioning attributes for To Air.

Attribute                    χ2     df   Mean   SD
Colour of text               12.7   4    2.0    1.1
Size of text                 21.6   3    1.8    0.7
Speed of caption display     14.8   3    1.8    0.8
Placement on screen          14.8   3    1.7    0.8
Movement of text on screen   15.8   3    1.9    1.0
Portrayal of emotions        17.2   4    2.0    1.0
For To Air, 17 of 24 (71%) rated the colour as positive or very positive, five of 24
(21%) rated the colour as neither liked nor disliked and two of 24 (8%) reported disliking
the colour. Twenty-three of 25 (92%) reported the size as liked or very much liked but
eight percent said they were neutral or they disliked the size. Twenty-two of 25 (88%)
liked or very much liked the speed while 12% reported being neutral or negative towards
the speed. Twenty-two of 25 (88%) reported placement of text as positive or very
positive but 12% reported placement to be neutral or negative. Twenty-one of 25 (84%)
reported the movement of the text on the screen as positive or very positive while 16%
rated movement as neutral or negative. Finally, 20 of 25 (80%) reported the portrayal of
the character’s emotion using movement of the text on the screen as positive or very
positive, 12% remained neutral in this category while 8% said they either disliked or
disliked very much the portrayal of emotions.
Table 5: χ2, degrees of freedom (df), mean and standard deviation for enhanced
captioning attributes for Bad Vibes.

Attribute                    χ2     df   Mean   SD
Colour of text               10.5   2    2.0    0.6
Size of text                 15.5   3    2.0    0.8
Speed of caption display     19.6   3    1.9    0.8
Placement on screen          12.2   3    1.9    0.9
Portrayal of emotions        14.8   4    2.0    1.0
Text description             16.8   3    2.2    0.8
For Bad Vibes, 19 of 23 responses (83%) rated the colour as positive or very
positive and 17% rated the colour as neutral or negative. Twenty-one of 25 (84%)
reported the size as positive or very positive, while 16% reported being neutral or
negative towards the size. Twenty of 25 (80%) liked or very much liked the speed while
20% were neutral or negative towards the speed. Twenty-one of 25 (84%) reported
placement as positive or very positive while 16% were either neutral or negative towards
the placement. Nineteen of 25 (76%) reported portrayal of the character’s emotion using
movement of the text on the screen as positive or very positive, while 16% reported being
neutral and 8% reported being negative or very negative. Finally, 19 of 25 (76%) reported
the text descriptions as positive or very positive while 24% reported being either neutral
or negative towards text descriptions.
Also, the χ2 results between response categories for the question regarding
whether participants would be willing to discuss the show with hearing friends or family
showed a significant difference for both episodes [χ2(4, 25) = 13.9 for TA and
χ2(4, 25) = 12.8 for BV, p < 0.05]. The mean and SD for TA were 4.0 and 1.1 respectively on the
Likert scale of 1 to 5, where 1 was “not willing to discuss show” and 5 was “very
willing”. Eighteen of 25 (72%) respondents said that they were either willing or very
willing to discuss the show with hearing friends. The mean and SD for BV were 3.9 and
1.1 respectively; again, 18 of 25 (72%) respondents said that they would be willing or
very willing to discuss the show with hearing friends.
For Bad Vibes, a χ2 analysis between response categories showed a significant
difference for the question regarding the level of distraction caused by the enhanced
captions [χ2(4, 25) = 14.8, p < 0.05] (mean = 3.0, SD = 1.3). Thirteen of 25 (52%)
respondents said that they were very or somewhat distracted by the enhanced
captions, while 10 of 25 (40%) said that they were not distracted or only slightly
distracted by the enhanced captions.
When the descriptive statistics for the extreme captions were examined, there were
some conflicting results. First, 19 of 28 (68%) respondents very much liked or liked how
the emotions were portrayed with the captions. However, a majority of participants,
16 of 28 (57%), disliked the movement of the captions.
A t-test analysis comparing two independent samples was carried out between the
hard-of-hearing and hearing subject groups for each caption style of each different
episode for the questionnaire variables related to captioning attributes. For To Air, a
significant difference was found in the enhanced caption text descriptions attribute [t(23)
= -3.43, p < 0.05], where the average response for the hard of hearing (HOH) group was
1.64 (SD = 0.63) and for the hearing group was 2.45 (SD = 0.52). A significant difference was
also found in the extreme caption colour attribute [t(12) = -2.31, p < 0.05], where the
average response of the HOH group was 1.67 (SD = 0.52) and of the hearing group was 2.50
(SD = 0.76).
For Bad Vibes there was one significant difference, in the enhanced caption
placement attribute [t(23) = -4.29, p < 0.05] between the two groups. The hard of hearing
group very much liked/liked the placement of the enhanced captions (m = 1.57, SD = 0.85),
while the hearing group liked/felt neutral about the enhanced captions placement (m =
2.27, SD = 0.79).
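The group comparisons above are independent-samples t-tests. A sketch with hypothetical Likert ratings (1 = liked very much, 5 = disliked very much), assuming the pooled-variance form implied by the reported degrees of freedom (n1 + n2 - 2 = 23 for 15 HOH and 10 hearing participants):

```python
from scipy.stats import ttest_ind

# Hypothetical "likability" ratings: lower = liked more.
hoh     = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]  # 15 hard-of-hearing ratings
hearing = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3]                 # 10 hearing ratings

# Two-sided independent-samples t-test with pooled variance
# (ttest_ind's default, equal_var=True), giving df = 15 + 10 - 2 = 23.
res = ttest_ind(hoh, hearing)
print(res.statistic < 0, res.pvalue < 0.05)  # HOH mean is lower; difference significant
```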
A second t-test analysis was carried out to determine differences in the caption
attributes between the two episodes, "Bad Vibes" (BV) and "To Air Is Human" (TA) for
all participants. The movement of the text for the enhanced captions between these
episodes was found to be significant [t (48) = -2.2, p < 0.05] where mean = 1.7, SD = 1.0
for TA and mean = 2.3, SD = 1.1 for BV. The text description for conventional
captioning was also found to be significantly different [t(48) = -2.2, p < 0.05] for TA
(mean = 2.3, SD = 0.9) and for BV (mean = 2.8, SD = 0.9).
A t-test analysis between both episodes for hearing subjects only was also carried
out. There were no significant differences between any of the enhanced or conventional
caption attributes for the two episodes.
A final t-test was carried out between both episodes for only the hard-of-hearing
subjects. The movement of text attribute for the enhanced captions was significantly
different between the two episodes [t (26) = -2.5, p < 0.05] (TA mean = 1.4, SD = 0.8 and
BV mean = 2.3, SD = 1.1). The text description of enhanced captioning was found to be
significantly different between the episodes [t (26) = -2.3, p < 0.05] (for TA mean = 1.6,
SD = 0.6 and for BV mean = 2.3, SD = 0.8). The text description of conventional
captioning was also found to be significantly different between the episodes [t (26) = -
2.3, p < 0.05] (for TA mean = 2.0, SD = 0.88 and for BV mean = 2.9, SD = 1.1).
Crosstab analyses are applied to nominal data in order to uncover correlations
between two variables. Crosstab analyses were carried out on all of the participant data to
examine whether there was any correlation between the ratings of understanding
(provided on a 5-point Likert scale) and the various attributes of the enhanced captions.
For this analysis, data were separated into the HOH and hearing participant groups and
aggregated over both episodes to ensure enough data points.
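The crosstab-and-chi-square procedure can be sketched as follows; the two rating columns below are hypothetical stand-ins for the actual questionnaire variables:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical paired ratings for 10 participants (real data not reproduced):
# understanding: 1 = greatly increased ... 5 = greatly decreased
# size_liking:   1 = liked very much   ... 5 = disliked very much
df = pd.DataFrame({
    "understanding": [1, 1, 2, 2, 1, 3, 2, 1, 2, 4],
    "size_liking":   [1, 2, 2, 1, 1, 3, 2, 1, 2, 4],
})

# Cross-tabulate the two ordinal variables, then test for association.
table = pd.crosstab(df["understanding"], df["size_liking"])
chi2, p, dof, expected = chi2_contingency(table)
print(int(table.values.sum()), chi2 >= 0)  # every participant falls in exactly one cell
```

With samples this small, many expected cell counts fall below five, so the chi-square approximation is rough; the thesis's larger aggregated samples mitigate this somewhat.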
For the hearing group, there were six significant crosstab Pearson correlations found
for the enhanced caption attributes of:
1) Size of the text and overall understanding of the show [χ2(12, 22) = 24.8, p < 0.05],
where 13 of 22 respondents rated their level of confusion as low and liked the size of
the text;
2) Size of the text and level of distraction [χ2(12, 22) = 27.7, p < 0.05], where 8 of 22
respondents rated the enhanced captions as not distracting or slightly distracting and
liked or very much liked the size. However, 7 of 22 respondents reported an inverse
relationship: the enhanced captions somewhat or greatly distracted them, but they
still liked or very much liked the size of the captions;
3) Movement of the text on screen and understanding of the show [χ2(16, 22) = 34.3,
p < 0.05], where 12 of 22 respondents found their level of understanding increased or
greatly increased and liked or very much liked the animated text;
4) Movement of the text on screen and distraction [χ2(16, 22) = 36.2, p < 0.05], where 8 of
22 respondents found they were only slightly or not distracted and liked or very much
liked the animated text;
5) Portrayal of the emotions using the animated text and understanding of the show
[χ2(12, 22) = 30.5, p < 0.05], where 14 of 22 respondents found their level of
understanding increased or greatly increased and liked or very much liked the
animated text; and
6) Portrayal of the emotions using the animated text and distraction [χ2(12, 22) = 27.9,
p < 0.05], where 9 of 22 respondents found they were slightly or not distracted and
liked or very much liked the animated text. However, 8 of 22 respondents reported an
inverse relationship: they were somewhat or greatly distracted by the enhanced
captions but still liked or very much liked the emotions portrayed in the captions.
For the HOH group, there were seven significant crosstab Pearson correlations found
for the enhanced caption attributes of:
1) Size and overall understanding of the show [χ2(8, 28) = 18.9, p < 0.05], where 15 of
28 respondents rated their level of understanding of the video as increased or greatly
increased and liked the size of the text;
2) Size and level of understanding of the character’s emotions [χ2(8, 28) = 21.0, p < 0.05],
where 17 of 28 respondents rated their level of understanding of the emotions as
greatly increased or increased and liked or very much liked the size of the text;
3) Speed of the captions appearing on the screen and confusion with the show [χ2(8, 28)
= 21.6, p < 0.05], where 10 of 28 respondents rated their level of confusion with the
show as less or much less and liked or very much liked the speed of the animated text;
4) Speed of the captions appearing on the screen and understanding of the character’s
emotions [χ2(8, 28) = 21.0, p < 0.05], where 15 of 28 respondents found their level of
understanding increased or greatly increased and liked or very much liked the speed
of the animated text;
5) Speed of the captions appearing on the screen and distraction [χ2(8, 28) = 20.5,
p < 0.05], where 17 of 28 respondents found they were not distracted or only slightly
distracted and liked or very much liked the speed of the animated text;
6) Placement on the screen and understanding of the show [χ2(12, 25) = 24.9, p < 0.05],
where 16 of 28 respondents found their level of understanding increased or greatly
increased and liked or very much liked the placement of the animated text on the
screen; and
7) Placement on the screen and understanding of the character’s emotions [χ2(12, 28) =
23.0, p < 0.05], where 18 of 28 respondents found their level of understanding increased
or greatly increased and liked or very much liked the placement of the animated text.
Section 5.02 Video Data
As described in Chapter 4, the video data was coded using a thematic analysis
approach according to the themes and sub-categories in Table 2. The comments made by
participants were recorded and then analyzed using the pre-defined themes. Many
participants commented negatively and positively at various times regarding the same
theme. Repeated measures ANOVA and t-tests were applied to the video data to
compare the mean number of comments among the various groups. ANOVA and t-tests
were appropriate for this data set because it met the homogeneity of variance and
independence of cases criteria. However, because the data set was not normally
distributed, non-parametric tests such as Kruskal-Wallis and Mann-Whitney were also
applied. Since the results of the parametric and non-parametric tests were identical, the
more common parametric results are reported. Due to the number of tests, a Bonferroni
adjustment was applied to the 0.05 alpha level to derive an adjusted alpha of 0.02,
reducing Type I errors (incorrectly detecting a significant difference).
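This testing strategy can be sketched as follows. The comment counts are hypothetical, and the divisor of three tests is an assumption for illustration (the thesis reports only the resulting adjusted alpha of 0.02):

```python
from scipy.stats import ttest_ind, mannwhitneyu

# Hypothetical per-participant comment counts for two groups
# (the thesis's raw counts are not reproduced here).
group_a = [1, 2, 2, 3, 1, 2, 2, 1, 2, 3]
group_b = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4]

# Bonferroni: divide the family-wise alpha by the number of tests.
# Three tests is an assumption made for this sketch.
alpha_adjusted = 0.05 / 3

# Run the parametric test and its non-parametric counterpart and
# check that they reach the same conclusion at the adjusted alpha,
# mirroring the cross-check described above.
t_res = ttest_ind(group_a, group_b)
u_res = mannwhitneyu(group_a, group_b)
print((t_res.pvalue < alpha_adjusted) == (u_res.pvalue < alpha_adjusted))
```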
A t-test was carried out to determine whether there was a difference between the
hearing and HOH participants for each theme. A significant difference was found
between the hearing and HOH groups in the Impact on People Positive category [t(7) =
-3.0, p < 0.02]. The results of the video comment analyses somewhat mirrored the
questionnaire data, in which no significant differences were found between the hearing
and the HOH groups in most categories.
In order to determine whether the participants had a preference regarding the style
of captioning, the number of positive and negative comments for the three captioning
styles was analyzed using Repeated Measures Analysis of Variance (ANOVA). There
was a significant difference among the three caption styles in the positive [F (2, 39) =
9.32, p< 0.02] and negative categories [F (2, 41) = 4.11, p<0.05]. A Tukey post-hoc test
revealed significance for:
1) Positive comments between enhanced captioning and extreme captioning (p =
0.002).
2) Positive comments between enhanced captioning and closed captioning (p =
0.005).
3) Negative comments between extreme captions and closed captions (p = 0.018).
The Repeated Measures ANOVA revealed no significant differences for:
1) Positive comments between extreme captioning and closed captioning
2) Negative comments between enhanced captions and extreme captions.
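As a simplified sketch of this omnibus-test-then-post-hoc pattern, a one-way ANOVA with Tukey's HSD on hypothetical per-participant positive-comment counts is shown below (the thesis used a repeated measures design, which this independent-samples sketch does not capture):

```python
from scipy.stats import f_oneway, tukey_hsd

# Hypothetical positive-comment counts per participant for each style.
enhanced = [3, 4, 2, 5, 3, 4, 3, 2]
extreme  = [0, 1, 0, 1, 1, 0, 0, 1]
closed   = [1, 0, 1, 2, 0, 1, 1, 0]

# Omnibus test: is there any difference among the three styles?
f_stat, p = f_oneway(enhanced, extreme, closed)

# Post-hoc: which pairs differ?  tukey_hsd (SciPy >= 1.9) returns a
# matrix of pairwise p-values in the order the samples were passed.
post_hoc = tukey_hsd(enhanced, extreme, closed)
print(p < 0.05, post_hoc.pvalue.shape)
```

Running the omnibus test first guards against inflating the error rate with unplanned pairwise comparisons.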
Table 6 and Figure 12 compare the number of positive and negative comments
made by the HOH and hearing groups for each category. The greatest number of total
comments was in the enhanced captions positive category (75 comments by HOH and
hearing participants combined). This was closely followed by the total number of
negative comments for the extreme captions (68 negative comments by HOH and
hearing participants combined). The fewest total comments were in the extreme
captions positive sub-category, with only 11 comments between the HOH and hearing
participants, followed by 12 total comments for the Impact on People negative
sub-category.
As shown in Table 6, Table 7 and Table 8, 12 of 15 (80%) HOH participants (m = 2.8,
SD = 1.71) and 10 of 10 (100%) hearing participants (m = 4.2, SD = 1.93) made at
least one positive comment about the enhanced captions. The HOH group made a total
of 33 positive comments about the enhanced captions and the hearing group made 42.
Four of 15 (27%) HOH participants (m = 2.8, SD = 1.26) commented negatively on the
enhanced captions and one of 10 (10%) hearing participants (m = 4.0) commented
negatively about them. The total number of negative comments related to enhanced
captioning was 11 for the HOH group and four for the hearing group.
Seven of 15 (47%) HOH participants (m = 1.4, SD = 1.13) and four of 10
(40%) hearing participants (m = 1.8, SD = 0.96) made positive comments about
closed captioning, with total positive comments of 10 and seven respectively. Ten of 15
(67%) HOH participants (m = 2.2, SD = 1.14) and nine of 10 (90%) hearing
participants (m = 2.0, SD = 1.12) made negative comments about closed captioning,
with a total of 22 and 18 negative comments respectively.
Extreme captioning received the fewest positive comments of the three styles,
with only four of 15 (27%) HOH participants (m = 1.3, SD = 0.50) and five of 10
(50%) hearing participants (m = 1.2, SD = 0.45) commenting positively. The total
number of positive comments for extreme captioning was five for the HOH group and
six for the hearing group. Conversely, extreme captioning had the most negative
comments of the three styles. Eleven of 15 (73%) HOH participants (m = 3.6, SD = 1.69)
and nine of 10 (90%) hearing participants (m = 3.1, SD = 1.76) commented negatively
about the extreme captions. In total, the HOH group made 40 negative comments and
the hearing group made 28 negative comments about the extreme captions.
The technical category amalgamated comments about a number of elements: the
content (Deaf Planet), the use of the translucent background for the captions, comments
regarding the colour, size or speed of the captions, and the television screen. Eight of 15
(53%) HOH participants (m = 1.8, SD = 1.04) made positive comments in the
technical category, with 14 positive comments in total. Only one of 10 (10%) hearing
participants (m = 1.0) made a positive comment in the technical category.
The HOH group made 13 negative comments in the technical category with six of 15
(40%) of the participants commenting (m = 2.2, SD = 1.33). The hearing group made 17
negative comments in the technical category with 7 of 10 (70%) of participants
commenting (m = 2.4, SD = 1.40).
Comments regarding the sound effects were divided almost equally between
positive and negative. Six of 15 (40%) HOH participants made 16 positive comments
in total about the sound effects (m = 2.7, SD = 1.63). Five of 10 (50%) hearing
participants made seven positive comments about the sound effects (m = 1.4, SD = 0.55).
Eight of 15 (53%) HOH participants commented negatively about the sound effects,
with a total of 16 negative comments (m = 2.0, SD = 1.07). Four of 10 (40%) hearing
participants made eight negative comments about the sound effects (m = 2.0, SD = 1.41).
Comments in the Impact of Captioning category concerned participants’ opinions
about how the captions contributed to their television viewing experience. Eight of
the 15 (53%) HOH participants commented favourably about the overall impact of the
captioning on their entertainment (m = 2.3, SD = 2.05) and seven of 10 (70%) hearing
participants commented positively (m = 1.6, SD = 1.13) on the impact of captioning.
Conversely, nine of 15 (60%) HOH participants commented unfavourably about the
impact of captioning (m = 1.4, SD = 0.73), while five of 10 (50%) hearing participants
commented negatively on the impact of captioning (m = 1.4, SD = 0.55).
The Impact on People category represented data on how the participants thought the
animated (enhanced and extreme) captions would affect other people. Six of 15 (40%)
HOH participants commented that the animated captions would positively affect other
viewers (m = 1.2, SD = 0.41), with a total of seven positive comments. Three of 10 (30%)
hearing participants said that the animated captioning would affect others positively
(m = 2.7, SD = 1.15), with a total of eight positive comments. Six of 15 (40%) HOH
participants said the animated captioning would affect others negatively, with 11
negative comments in total (m = 1.8, SD = 1.33). Only one of 10 (10%) hearing
participants commented that the animated captions would affect others negatively,
with one comment in total (m = 1.0).
The final category measured was Impact on Genres, designed to determine
whether the participants thought that the emotive captions would be suitable for all
genres. Eleven of 15 (73%) HOH participants suggested that the emotive captions
could be applied to all genres (m = 1.6, SD = 0.92) and five of 10 (50%) hearing
participants suggested that these captions would be appropriate for all genres (m = 1.8,
SD = 1.10). Six of 15 (40%) HOH participants thought emotive captions would not be
appropriate for all genres (m = 1.8, SD = 1.17) and four of 10 (40%) hearing
participants thought emotive captions were not appropriate for all genres (m = 1.5,
SD = 1.00).
Table 6: Number of positive/negative comments by groups for the video data

Themes                        HOH People  HOH Comments  Hearing People  Hearing Comments  Total Comments
Enhanced Captions Positive    12          33            10              42                75
Enhanced Captions Negative    4           11            1               4                 15
Closed Captions Positive      7           10            4               7                 17
Closed Captions Negative      10          22            9               18                40
Extreme Captions Positive     4           5             5               6                 11
Extreme Captions Negative     11          40            9               28                68
Technical Positive            8           14            1               1                 15
Technical Negative            6           13            7               17                30
Sound Effects Positive        6           16            5               7                 23
Sound Effects Negative        8           16            4               8                 24
Impact of Captions Positive   8           18            7               11                29
Impact of Captions Negative   9           13            5               7                 20
Impact on People Positive     6           7             3               8                 15
Impact on People Negative     6           11            1               1                 12
Impact on Genre Positive      11          18            5               9                 27
Impact on Genre Negative      6           11            4               6                 17
Table 7: Descriptive statistics for HOH group in all categories of the video data analysis

Category                      N    Mean   SD
Enhanced Captions Positive    12   2.75   1.71
Enhanced Captions Negative    4    2.75   1.26
Closed Captions Positive      7    1.43   1.13
Closed Captions Negative      10   2.20   1.14
Extreme Captions Positive     4    1.25   0.50
Extreme Captions Negative     11   3.64   1.69
Technical Positive            8    1.75   1.04
Technical Negative            6    2.17   1.33
Sound Effects Positive        6    2.67   1.63
Sound Effects Negative        8    2.00   1.07
Impact of Captions Positive   8    2.25   2.05
Impact of Captions Negative   9    1.44   0.73
Impact on People Positive     6    1.17   0.41
Impact on People Negative     6    1.83   1.33
Impact on Genre Positive      11   1.64   0.92
Impact on Genre Negative      6    1.83   1.17
Table 8: Descriptive statistics for hearing group in all categories of the video data

Category                      N    Mean   SD
Enhanced Captions Positive    10   4.20   1.93
Enhanced Captions Negative    1    4.00   .
Extreme Captions Positive     5    1.20   0.45
Extreme Captions Negative     9    3.11   1.76
Closed Captions Positive      4    1.75   0.96
Closed Captions Negative      9    2.00   1.12
Technical Positive            1    1.00   .
Technical Negative            7    2.43   1.40
Sound Effects Positive        5    1.40   0.55
Sound Effects Negative        4    2.00   1.41
Impact of Captions Positive   7    1.57   1.13
Impact of Captions Negative   5    1.40   0.55
Impact on People Positive     3    2.67   1.15
Impact on People Negative     1    1.00   .
Impact on Genre Positive      5    1.80   1.10
Impact on Genre Negative      4    1.50   1.00
[Bar chart “Comparing Captioning Styles”: number of positive and negative comments
for each caption style. Enhanced: 75 positive, 15 negative; Extreme: 11 positive, 68
negative; Closed: 17 positive, 40 negative.]
Figure 12: Comparison of the three captioning styles by number of comments
Section 5.03 Emotion ID
Participants were asked to identify the emotions present in the first clip of the six
that they watched. These identifications were then analyzed according to five different
categories (four for the conventional captions): 1) correct identification within a
two-second window from the time the emotion occurred (to account for response
processing time); 2) identification of a different/incorrect emotion within the two-second
window; 3) missed identification; 4) emotion identified outside of the two-second
timeframe of when the animation occurred; and 5) emotions identified for non-emotion
animation (e.g., for sound effects).
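The scoring scheme above can be sketched as a small classification function. The event and response representations are hypothetical; the sketch covers only categories 1, 2 and 4, since categories 3 and 5 require the full log of animations and unanswered events:

```python
# Each emotion event: (onset time in seconds, emotion label).
# Each response:      (timestamp in seconds, emotion label).
WINDOW = 2.0  # seconds allowed for response-time processing

def classify(response, events):
    """Bin one response against the list of emotion events."""
    time, emotion = response
    for onset, label in events:
        # Category 1 or 2: response falls inside an event's 2-second window.
        if onset <= time <= onset + WINDOW:
            return "correct" if emotion == label else "incorrect"
    # Category 4: no event window contains this response.
    return "outside_window"

events = [(10.0, "fear"), (25.0, "happiness")]
assert classify((11.5, "fear"), events) == "correct"        # within 2 s, right emotion
assert classify((11.5, "anger"), events) == "incorrect"     # within 2 s, wrong emotion
assert classify((20.0, "fear"), events) == "outside_window" # no event within 2 s
```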
A t-test was used to assess differences in emotion ID between the hearing and HOH
groups. A significant difference was found between hearing and HOH participants for the
emotion ID of fear [t(23) = -2.11, p < 0.05], where the HOH group made an average of 1.7
identifications of fear (SD = 0.82) and the hearing group a mean of 2.9 identifications
of fear (SD = 1.7). No other significant differences were detected.
As shown in Table 9, the HOH group identified an emotion correctly 55% of the
time and the hearing group 50% of the time. The HOH group incorrectly identified an
emotion 6% of the time, the same as the hearing group. The HOH group missed an
identification 13% of the time, while the hearing group missed an identification 23% of
the time. The HOH group identified an emotion outside the animation boundary 12% of
the time, while the hearing group did so 13% of the time. Finally, the HOH group
identified an emotion during other animations 13% of the time, while the hearing group
did so 7% of the time.
Table 9: Average number of emotions identified in each category (% of total in brackets)

Group            Total       Correct emotion/  Incorrect emotion/  Missed emotion  Identified outside  Identified during
                 identified  correct time      correct time        completely      animation window    other animations
Hard of Hearing  67          37 (55)           4 (6)               9 (13)          8 (12)              9 (13)
Hearing          125         63 (50)           8 (6)               29 (23)         16 (13)             9 (7)
A Repeated Measures ANOVA was applied to detect any differences between the
type of emotion and its identification. A Bonferroni adjustment was applied to the alpha
level to derive an adjusted alpha of 0.02. Significance was detected in the correct emotion
identification category [F(4, 58) = 12.23, p < 0.02]. A post-hoc Tukey analysis revealed
the following significant differences:
1) Correct identification of fear versus anger (p = 0.005).
2) Correct identification of fear versus sadness (p < 0.001).
3) Correct identification of fear versus happiness (p < 0.001).
4) Correct identification of fear versus disgust (p < 0.001).
As can be seen from Table 10, anger was correctly identified 50% of the time,
mistakenly identified 4% of the time, missed completely 13% of the time, identified
outside of the animation window from non-caption-related cues (such as facial
expressions) 23% of the time, and identified during other animations (such as sound
effects) 10% of the time.
Fear was correctly identified 67% of the time, mistakenly identified 7% of the time,
completely missed 15% of the time, identified from non-caption-related cues 7% of the
time and from other animations 5% of the time. Sadness was detected correctly 36% of
the time, mistaken for another emotion 14% of the time, missed 32% of the time,
perceived based on non-caption cues 4% of the time and identified during other
animations 4% of the time.
Happiness was correctly identified 37% of the time, mistakenly identified 7% of the time,
missed entirely 41% of the time, never identified outside of the animation window, and
identified during other animations 15% of the time. Finally, disgust was identified 42%
of the time, never mistaken for another emotion, missed 8% of the time, detected outside
of the animation window 29% of the time and identified during other animations 21% of
the time.
Of all the emotions, fear was correctly identified more often than the others (67%
of the time when it occurred), while happiness was missed more often than the others
(41% of the time it occurred). Disgust was more often identified from non-caption-related
cues (e.g., facial expressions, gestures or voice) than any other emotion (21% of the time).
Table 10: Emotions versus type of identification (% of total in brackets)

Emotion    Total       Correct emotion/  Incorrect emotion/  Missed emotion  Identified outside  Identified during
           identified  correct time      correct time        completely      animation window    other animations
Anger      52          26 (50)           2 (4)               7 (13)          12 (23)             5 (10)
Fear       61          41 (67)           4 (7)               9 (15)          4 (7)               3 (5)
Sadness    28          13 (36)           4 (14)              9 (32)          1 (4)               1 (4)
Happiness  27          10 (37)           2 (7)               11 (41)         0 (0)               4 (15)
Disgust    24          10 (42)           0 (0)               2 (8)           7 (29)              5 (21)
A Repeated Measures ANOVA was applied to determine possible
differences in emotion understanding between the caption styles. No significant
differences were found. The conventional captioning clips did not have animations, so
the fifth identification category (emotions identified during other animations) shows
no counts. Table 11 shows the total percentage for each type of emotion identification
listed by captioning category (percentages shown in brackets). Emotions were correctly
identified most often with conventional captions (57%) but also most often incorrectly
identified (9% of the time). Emotion identification was missed most often with the
extreme captions (26% of the time).
As shown in Table 11, with conventional captions, emotions were correctly identified
57% of the time, incorrectly identified 9% of the time, not identified 16% of the time,
identified outside of the animation window 20% of the time, and never identified during
other animations (since conventional captioning did not contain any animations). With
enhanced captions, emotions were correctly identified 50% of the time, incorrectly
identified 2% of the time, not identified 19% of the time, identified outside of the
animation window 11% of the time and identified during other animations 19% of the
time. Finally, with extreme captions, emotions were correctly identified 47% of the
time, incorrectly identified 7% of the time, not identified 26% of the time, identified as a
result of non-animation-related cues 5% of the time and identified during other
animations 14% of the time.
Table 11: Caption category versus type of identification (% of total in brackets)

Caption style          Total       Correct emotion/  Incorrect emotion/  Missed emotion  Identified outside  Identified during
                       identified  correct time      correct time        completely      animation window    other animations
Conventional Captions  81          46 (57)           7 (9)               13 (16)         15 (20)             0
Enhanced Captions      54          27 (50)           1 (2)               10 (19)         6 (11)              10 (19)
Extreme Captions       57          27 (47)           4 (7)               15 (26)         3 (5)               8 (14)
Chapter 6. Discussion
Apart from animated titling in film and television, animated captions were
completely novel to all the participants, so the results of this study were difficult to
anticipate. In addition, since the method used to generate the captions was artistic,
audience reactions could not be predicted in advance. One desired
outcome of the study was to establish some future research directions to pursue on how
non-dialogue sound information can be better represented to users of closed captioning.
The audiences’ ability to understand the intention of the animated captions and their level
of acceptance of this approach were critical factors in achieving this outcome.
The first objective of the user study was to gain an understanding of the level of
acceptance of animated captions by the hearing and hearing impaired audience members.
The videotaped comments and questionnaire data were examined closely to gain insight
into this objective. The hard of hearing (HOH) and hearing viewers seemed to prefer the
enhanced captions to the conventional or extreme captions for both episodes. As seen in
Table 6, 88% (22 of 25) of participants made at least one positive comment about the
enhanced captions, with a combined total of 75 positive comments. Comparatively, only
44% (11 of 25) commented positively about conventional captioning and 36% (9 of 25)
commented positively about extreme captioning, with totals of 17 and 11 comments
respectively. The Repeated Measures ANOVA of the number of positive comments made
about each of the captioning styles was statistically significant. A Tukey post-hoc showed
a significant difference in user preference for enhanced captions over conventional
captioning (p = 0.005) and extreme captioning (p = 0.002). In the interviews
following completion of the questionnaire, participants made comments such as enhanced
captions “added a new dimension” and “I want to see more captioning like
this…someone should market it” that reinforced the results from the video and
questionnaire data. Participants liked that enhanced captions showed the inflection of
voice and emotions and that it “wasn’t flat” because it “added excitement.”
In contrast to the enhanced captions, the extreme captions and to a lesser extent
conventional captions were criticised far more. Nineteen of 25 participants commented
negatively about the conventional captions and 20 of 25 participants commented
negatively about the extreme captions, generating a total of 40 and 68 negative comments
respectively. Comparatively, only five people made negative comments about enhanced
captions, with a total of 15 negative comments. The Repeated Measures ANOVA
uncovered significance in the number of negative comments among the three captioning
style categories. A Tukey post-hoc revealed a significant difference
between the extreme captions and the conventional captions (p = 0.018). Participants
commented in the open discussion session that the extreme captions “does not add value”
and exclaimed “what was the point of that” while watching. They went as far as to make
comments such as “Deep 6 aka bury this one.” Comments about the conventional
captions were more moderate but showed noticeable preference for enhanced captions
through complaints like “it won’t be easy going back to the regular [conventional]
captions”. The extreme captions were described as overwhelming, distracting and too
bouncy by a majority of participants. The conventional captions were found to be less
exciting and less informative than the enhanced captions.
For the HOH group, the ratio of positive to negative comments for the enhanced
captions was 3 to 1 and for the hearing group, it was 10.5 to 1, providing additional
evidence of the preference trend. For the conventional and, particularly, the extreme
captions, the ratio of positive to negative was far less than 1, meaning many more
negative than positive comments were made. Given that 73% of the HOH participants
reported using conventional captions on a regular basis, it is surprising to find such strong
support for the enhanced captioning style.
The questionnaire results support the findings from the video data and tell a
similar story. Enhanced captions received a very positive response in all of the “caption”
attributes (Colour, Size of Text, Speed of Display, Placement on Screen, Text
Description, Movement of Text, and How Moving Text Portrayed Emotion), with ratings
of positive or very positive from more than 70% of all
participants. The questionnaire data also revealed that many of the participants in the
HOH group reported being less confused (54%), less distracted (61%) and better able to
perceive the characters’ emotions in the show (64%) while watching enhanced
captions. The open-ended comments and discussion after each show also support these
findings. Participants made comments such as “liked emphasis on words and bouncing to
match actions” and “dramatic improvement over traditional captioning, specifically
when trying to convey emotions in a show.”
Sixty-seven percent of the people reported liking the extreme captioning style
overall in the questionnaires; however, 57% of people reported disliking the movement of
extreme captioning text. The extreme captioning was bold, colourful and blended in well
with the children’s program. It is possible that the participants enjoyed the aesthetics of
the captions as an artistic piece but found the shaking letters difficult to read.
On the whole, video data, questionnaire data and free form comments indicate
that enhanced captions are perceived to be well-understood and accepted by the
participants. Enhanced captions are also identified by the participants as an improvement
over the traditional captions (at least in short examples). Additionally, people described
themselves as willing or very willing to discuss the show with hearing friends. The
willingness of the HOH participants indicates enough confidence in their level of
understanding of the show’s content to discuss it with hearing friends who can access the
sound through the audio channel. Communication difficulties can often exist between
caption-user and non-caption-user groups regarding show content. Hearing-impaired
caption users often believe that they do not have sufficient information about
the show, and thus they are reluctant to discuss it with hearing viewers (Fels et al.,
2005). Captions that seem to increase user comfort level with the content are an
important improvement towards more equal access to television and film content.
Another objective of my study was to uncover possible differences between the
HOH and hearing group in their reactions to the animated captions. A large portion of
caption users are not hearing impaired. Also, many hearing impaired viewers use captions
in the presence of hearing friends and family. Thus, analyzing and comparing the
reactions of the two groups was important to consider. Surprisingly, only the category of
“Impact on People” showed a significant difference between the two groups.
The data revealed that six of 15 HOH participants thought that hearing people
considered all captions (not just animated captions) to be disruptive to their television
viewing experience. The six HOH users had particular reservations about the acceptance
of the animated (enhanced and extreme) captions by their hearing peers. One HOH
participant said, for example, that hearing people would be greatly bothered and
distracted by the animated captions. Contrary to what the HOH participants believed,
100% of the hearing users reacted very positively to the enhanced captions (but not to the
extreme captions). The questionnaire data supported the video data and showed that in
the hearing group 61% of the people found that their level of understanding of the show
increased with the addition of the animated text. In other words, the moving text did not
impede the hearing participants’ understanding of the show but rather enhanced it.
Such mistaken assumptions can colour people’s ability to empathize with others, in this
case hearing television viewers with hard of hearing viewers and vice versa. This
discrepancy in perception may be worthwhile exploring in future research.
The questionnaire data showed another significant difference between the hearing
and HOH groups in the placement of the text on the screen for the enhanced captions.
The placement of the enhanced captions was near the bottom of the television screen,
which closely resembled that of closed captions. The HOH group seemed to like it very
much (mean = 1.5, SD = 0.75, where 1 is liked very much and 5 is disliked very much),
while the hearing group liked it less, although their mean still fell close to the liked
category (mean = 2.1, SD = 0.83). This difference in preference can be explained by
examining the caption viewing habits of the two groups. None of the hearing participants
reported always watching television with captions; they watched with captions only
“occasionally” (60%) or “never” (40%). In contrast, 73% of the HOH group reported
always watching television with captions. Since the HOH participants are more accustomed to this position than the
hearing participants, their preference for the familiar placement would naturally be
higher.
All the questionnaire responses were analyzed separately using crosstab analyses
to understand possible correlations between participant ratings of caption attributes and
comprehension attributes. These crosstabs were aggregated over two episodes and
grouped by hearing status to ensure enough data for each category and to detect any
differences between the two groups.
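The crosstab procedure described above can be sketched in a few lines of Python. The rating labels and responses below are hypothetical stand-ins for the actual questionnaire categories, used only to illustrate how co-occurrences of caption ratings and comprehension reports are tallied.

```python
from collections import Counter

def crosstab(pairs):
    """Count how often each (caption rating, comprehension rating) pair co-occurs."""
    return Counter(pairs)

# Hypothetical responses pairing a caption-speed rating with a comprehension report.
responses = [
    ("liked speed", "less confused"),
    ("liked speed", "less confused"),
    ("liked speed", "more confused"),
    ("disliked speed", "more confused"),
]

table = crosstab(responses)
print(table[("liked speed", "less confused")])  # prints 2
```

Grouping such tables by hearing status, as was done in the study, simply means building one table per group before comparing cell counts.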
In the HOH group, those who liked the speed of the enhanced caption
presentation also found that they were less confused (36%), less distracted (61%) and had
a better understanding of the characters’ emotions in the show (54%). Standard captioning
speed is recommended at 145 words per minute (Jensema, 1997), and in this study this
standard was used as the rate at which enhanced captions appeared. It appears that this
reading speed was suitable for this participant demographic. For the hearing group, the
crosstab analysis showed that 63% of hearing participants liked the portrayal of emotions
using enhanced captions and that this portrayal also increased their level of understanding
of the show. This positive correlation indicates that hearing viewers appeared to derive value
from viewing television shows and films with enhanced captions.
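As a rough illustration of the 145 words-per-minute standard used in this study, the display duration of a caption can be derived directly from its word count. The helper below is a hypothetical sketch, not the tool used to prepare the study captions.

```python
STANDARD_WPM = 145  # recommended captioning speed (Jensema, 1997)

def display_seconds(caption, wpm=STANDARD_WPM):
    """Return how long a caption should remain on screen at the given reading rate."""
    words = len(caption.split())
    return words * 60.0 / wpm

# A 12-word caption at the 145 wpm standard stays on screen for roughly 5 seconds.
caption = "it will not be easy going back to the regular conventional captions"
print(round(display_seconds(caption), 2))  # prints 4.97
```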
Having captions at the bottom of the screen is a conventional display style. It is
expected that regular users of captions would show a preference for this style. The
crosstab results showed that the placement of captions at the bottom of screen was
correlated with the participants’ understanding of the show and of the character’s
emotions. Participant comments showed that the familiar placement prevented the
viewers from hunting around the screen for the caption. Participants made strong
suggestions for placement by saying “leave the captions at the bottom”. In addition,
animated text captions that are placed in different locations on the screen were found to
be generally undesirable. Participants commented that “a little word here and there is
distracting” and that the unusual placement of words created “competition for attention.”
Fels et al. (2005) reported similar preferences expressed by participants in a study with
graphic captions: deaf and HOH users strongly disliked having captions located near the
speaker, because captions placed close to the face of a speaking person were interpreted
as an instruction to lip-read that person.
A more likely cause of this dislike of captions in different screen locations is
that more extensive eye movement, and thus more work, is required to read the dynamic
captions. In my study, not only were extreme captions placed in different locations on the
screen but also only some of the words from the captions were placed in those different
locations. In order to read the whole caption, the viewer had to focus on the full text at
the bottom of the screen and move their eyes to the moving text that was placed in the
different screen locations. Because the text was animated, it easily captured the
viewer’s attention, a common reaction to moving objects (Peterson & Dugas, 1972).
Having only some words animated at different locations on the screen diverted the user’s
attention from reading the caption sentences. Consequently, the user’s eyes
moved back and forth between the moving words and the static captions, increasing both
the visual and the cognitive work required. Therefore, I would recommend that
moving text in captions be contained within the full caption text and located at
the bottom of the screen.
The crosstab analysis also showed a positive relationship between the size of the
captions and participants’ level of understanding of the show. In their study on people’s
preferences for captions, the National Center for Accessible Media found that people
preferred “medium” sized captions, where medium was defined as “roughly equivalent to
the 26 lines of the TV picture” or 26 pixels (NCAM, n.d.c, section Test Element: Caption
Size, paragraph 1). The size of the enhanced captions in this study began at 25 pixels and
generally increased with the various emotions, closely following the recommended font
sizes. For emotions such as anger, fear and happiness the animation became larger; only
sadness became smaller in vertical size. It is not surprising, then, that this size was
positively correlated with participants’ levels of understanding. No comments were made on
the impact of changing the caption sizes. Further research is required with more examples
of animated captions to determine what size attributes are causing the positive
relationship between understanding emotions and caption size.
A final objective of my research was to measure the participants’ understanding
of the emotions experienced by the characters of the show as a result of the animated text.
In my attempt to attain this objective, I realized that reliably measuring people’s
understanding levels about a piece of content is tremendously difficult, particularly when
one group has a communication disability. No standard measures exist for establishing
equivalency between HOH and hearing groups because most tests of understanding
involve questions developed by hearing people, which are often inappropriate for
comparison. Thus, the design for measuring the understanding of emotions had
drawbacks, which will be discussed in the limitations section of this thesis.
The emotion analysis portion of the study showed that hearing participants
accurately identified more emotions than the HOH participants. However, the only
significant difference between the groups was for fear, where the HOH group made on
average 2.0 correct identifications while the hearing group made 2.7 correct
identifications.
The fear emotion occurred more often than any of the other emotions and fear was
identified by more people than the other emotions. Thirty-two percent of all emotion
identifications were for fear compared with 27% for anger, 15% for sadness, 14% for
happiness and 13% for disgust. Fear was also identified correctly more than all other
emotions (67%), and had a low percentage of incorrect (7%) and missed identifications
(15%). Fear was identified significantly more accurately than the other emotions
[F(4, 58) = 12.23, p < 0.02] for both groups. The results then
indicate that the animation for fear represented the emotion to the participants with a high
degree of clarity.
Anger also showed a similar trend to that of fear, although the results were not
significant. Fifty percent of the time the anger animation was correctly identified, only
4% of the time it was incorrectly identified, and 13% of the time it was missed completely. The
high number of correct identifications (and low number of incorrect identifications)
indicates that the animation for anger could correctly represent the emotion. However,
further research will need to be carried out in order to validate the anger animation,
especially using content that contains a range of anger emotions.
As seen in Table 10, happiness, disgust and sadness seem less understandable and
more difficult to interpret, although sadness had a 46% correct identification rate.
Possible explanations for this result are: 1) the animation mapped somewhat
unsuccessfully to the emotion; or 2) too few instances of these three emotions were
present compared to anger and fear in the content. Instances of anger and fear occurred
three times each in the clips while happiness, disgust and sadness occurred only once.
The mixed results in the case of happy, sad and disgust indicate that more research is
definitely required in order to create more reliable animations. However, it is important to
note that the main purpose of this exploratory study was to establish if animated captions
are a worthwhile avenue to pursue, not to establish categorically consistent and accurate
emotion animations.
One surprising result of the study was that the emotions were correctly identified
most often with the conventional captions (57%), more often than with the
enhanced (50%) or extreme captions (48%), although these differences were not
significant. Also, emotions were missed 16% of the time for conventional captions, 19%
for enhanced and 26% for extreme. Emotions were incorrectly identified for the
conventional captions more than the other types (9% compared with 2% for enhanced
captions and 7% for extreme captions). Given how much more highly enhanced captions
were rated than extreme and conventional captions, the identification data seem difficult
to rationalize. It is possible that enhanced and extreme captions increase the intrusive
nature of captioning, thus making it difficult for the participants to understand what is
going on. However, this seems unlikely, since 64% of participants found that their
understanding of the characters’ emotions was increased by enhanced captioning. Thus, the
more plausible explanation for the contradictory results can be attributed to methodology
and the difficult task of measuring understanding of television content. Many events were
occurring simultaneously when participants were asked to view a new piece of content
with a new style of captioning and then verbalize the emotions of the characters. These
events included watching the content, understanding and interpreting the emotions,
looking at the checklist to keep track of the available options and speaking out loud the
appropriate emotion. Since participants were already familiar with the closed captioning,
the task of identifying and verbalizing the emotions may have been less difficult than
performing this task with enhanced captioning. Animated text is inherently more
distracting than static text; as a result, enhanced captioning, with its added level of
distraction, may have hindered the viewers’ ability to correctly verbalize, rather than
correctly understand, the emotion.
Other categories of interest measured in the study were the technical category and
the sound effects category, although neither showed any statistical significance between
the HOH and hearing groups or positive versus negative comments. The technical
category amalgamated several factors related to captioning, such as the translucent bar
against which the animated captions were displayed, and the content (Deaf Planet).
Viewer opinions about the translucent bar were somewhat divided. Three viewers
appreciated that the show was more visible because of the translucency, which indicates
that the translucency feature of the EIA-708 standard may be useful in enhancing viewer
experience. The negative comments about the bar (by one participant) related to the fact
that it stayed on even when the captions were not being displayed. For this particular
study, it was too time consuming to synchronize the translucent bar with the appearance
of captions. In the future this problem should be rectified so that the translucent bar
disappears if there are no captions. The comments about the content, Deaf Planet, were
also divided (six participants liked it, four disliked it). Six participants appreciated Deaf
Planet’s inclusive theme and enjoyed watching a program geared towards sign language
users. One of the hearing viewers was not accustomed to signing and was distracted by
signing characters that did not speak. Other participants (hearing and HOH) disliked the
qualities intrinsic to children’s programming such as extreme colours, over expressive
acting, and slower, simpler plot lines. It is imperative that further research with animated
captions be evaluated using different genres and different lengths of content, in order to
eliminate the biases (positive and negative) that are inherent with a sign language based
children’s show.
There were almost an equal number of positive (23) and negative (24) comments
made by the users in the category of sound effects. Eleven of 25 users found that
animating the sound effects was beneficial in providing more information not readily
available from the dialogue and visual cues. However, the animation of the sound effects,
particularly in the Bad Vibes segment, was thought to be too overpowering and
distracting. In particular, the seemingly innocuous use of dancing, colourful letters to
depict music was criticised by most HOH viewers as being unnecessary and offensive.
These users suggested that the word “Music” be replaced by musical notes, mentioning
that the word unwittingly drew attention to their disability.
The questionnaire data mirrored what was found in the video data regarding the
animated sound effects. Seven people reported not liking the word “Music” animated to
represent the emotion in the music and suggested animating music notes instead of
letters. People also reported not liking the animation for the word “rumble” because it
was unreadable due to rapid motion and a faded appearance.
When the questionnaire data was grouped by episodes and analyzed, a significant
difference was found between the two episodes (To Air and Bad Vibes) for the text
description (or sound effects) category. A closer look revealed that the significance was
being caused by the HOH group. For To Air, the HOH group reported liking the text
description very much (mean = 1.6, SD = 0.63) compared with Bad Vibes where they
reported only liking the text descriptions (mean = 2.3, SD = 0.83). As Bad Vibes had
more sound effects than To Air, more text descriptions were used for them (e.g.,
“rumble,” “whirling” and “music”). These data seem to indicate that overusing words to
represent sound effects may not be a suitable direction to pursue.
The voiced opinions about the sound effects were decidedly split. The sound
effects were reported to be helpful but also distracting. It appeared that participants
preferred the sound effects when dialogue text was not present on screen, as opposed to
having dialogue text and description text be simultaneously displayed. Participants
complained that having dialogue and sound description at the same time is “hard to
watch,” because there is “too much going on”. Also, the participants wanted the effects to
stop showing as soon as they had a chance to read them, instead of having the description
continue as long as the sound was present. For instance, in one scene in the test clips, a
rumbling noise can be heard continuously for about 20 seconds. When the sound effect
“Rumble” was displayed throughout those twenty seconds, the participants commented
that they wanted the words to disappear as soon as they had a chance to read it. One
participant commented, for example, that “… the word rumble went on for too long”.
This is very surprising considering that one of the most desirable but missing elements of
captioning is synchronization. When it comes to sound effects, however, users do not
seem to want the description to be fully synchronized with the sound, that is, displayed
for as long as the sound is present. Since not all participants commented on the sound
effects, it is difficult to generalize a trend. However, it seems that
viewers had difficulty associating the descriptive text with the effect itself.
Furthermore, animating descriptive words to mimic a particular sound effect by
manipulating the speed and amplitude of the word may render it unreadable and
ineffective. As a result, animating descriptive text to exactly match the sound effect (as
was done in this case) may not be the most appropriate technique. Animated graphics
may be one possible alternative to descriptive text, as suggested by some of the
participants, especially in the case of music. However, effective graphics can be difficult
and costly to create, and the number of possible sound effects can be overwhelming. It
would be impractical to create a library of graphics for a captioner to use containing all
possible sound effects. Caption users can be engaged in defining specific sound
categories and creating meaningful graphics through a participatory design approach
(Clement & Van den Besselaar, 2004). While the number of possible sound effects is
limitless, some commonality can be established using group consensus. Caption users can
then create sound effect animations that are most understandable to them collectively.
Another solution may be to use an animated, graphical bar situated at an
inconspicuous part of the screen. This bar would move according to the sound effect
(similar to the equalizer bars found on media players or sound display bars on a stereo).
In the future, tactile elements could also be introduced in conjunction with visuals to
present sound effects to users. Hot and cold sensations could be used to emulate angry or
tension-building sounds, while vibrations could mimic effects such as rumbling.
Ultimately, much more research is needed in this area before any concrete conclusions can
be drawn.
Overall, animated text captions appeared to be accepted and understood by the
participating viewers. The fact that enhanced captions were particularly well liked by the
participants indicates that viewers are able to interpret meaning from these animated
captions. It also appears that, at least based on the results of this study, the fear and anger
emotions are being accurately represented by animated text. The animations for happy,
sad and disgust emotions still need to be revised. It appears that animation properties
such as variation in size, direction of movement, change in opacity and manipulation of
fonts and colours have the potential to be mapped to specific emotions and applied to
captioning.
Section 6.01 Limitations
(a) Emotion Identification Method
The most difficult part of the study was designing an accurate way of gauging the
relationship between the various animations and the emotions as understood by viewers.
Watching a completely new piece of content with animated captions (a style never
seen before) was difficult for the participants in the first place. The HOH viewers were
accustomed to watching conventional captioning, and the results of the study may
have shown a novelty effect. Conversely, as the hearing viewers were not regular users of
any captioning, having to look at captioning (especially in a study environment) could
have also been somewhat disconcerting. None of the participants had seen the show
“Deaf Planet” prior to the study, so they were completely unfamiliar with the characters,
the setting, and the plot. The test clips were very short (approximately one and a half
minutes long) and neither one started at the beginning of the episode, causing further
confusion. In future studies, an entire episode of a show that is familiar to the participants
is recommended for captioning to reduce the confusion. The captioned episode should
also be viewed by the participant from the beginning to ensure maximum comprehension.
However, watching a half-hour show multiple times in a study environment would be
difficult to arrange with test participants, as it would likely last for several hours. I would
suggest, then, that study participants be allowed to view longer episodes at home and
possibly over several hours. Data capture and analysis for such a long study would,
however, also be difficult to coordinate.
The emotion identification process followed by the participants was a complex
one: watching a show, identifying the emotions the characters were feeling on a checklist
and then verbalizing those emotions using words from the checklist during a second
viewing. It should be noted here that only one of the hard of hearing participants was oral
deaf; as a result, the verbal protocol method was appropriate for this study. All of the
participants had the option of writing down their responses but none (except the one who
was oral deaf) chose to do so. The addition of new and distracting factors (a new show
and a new style of captioning) to the identification process only made an already difficult
task harder. In addition, it was obvious from the video data that participants seemed to want
to speak but stopped as the scene changed. This occurred particularly during enhanced and
extreme captioning. The normal television viewing experience does not require the
viewer to decompose content into parts such as emotions, and asking people to do so
may have been too complex a task to achieve successfully.
A physiological measure such as galvanic skin response may be more appropriate,
since it would eliminate the need for the participants to articulate the character’s
emotions. However, these techniques are used to measure arousal of emotion in the
participant rather than their understanding of it (Vetrugno et al., 2003). Sliders or
dials could also have been used to identify emotions, but the number of emotions was
quite high, and asking the participants to remember which dial or slider corresponded to
which emotion would have had its own limitations.
To accurately measure understanding of emotion in the context of watching
television, the participants need to watch more than just a few minutes of a show. They
also need to watch the show from the beginning to fully understand what is going on. In
addition, verbalizing emotions while watching content and concentrating on it is very
difficult and highly unnatural. One interesting and possibly more accurate way to
measure understanding would be to allow the participants to communicate the story of
the show after watching it through a focus group. In this way, a facilitator could
encourage all participants to make story contributions to develop a rich and thorough
discussion of the content.
(b) Communication barriers
In addition to the difficulty of designing the study, relaying the instructions
to HOH users was somewhat difficult as well. Many of the participants had severe hearing
loss and were not expert lip readers, even though they themselves were able to
communicate articulately. Communicating the complex instructions of identifying the
emotions to some of the HOH participants while they watched the clips was at times very
difficult. The hearing participants were much more communicative during the emotion
identification than the HOH participants. In fact, several HOH emotion identification
portions had to be excluded because the instructions were not properly conveyed. One HOH
participant did not understand that he was supposed to verbalize his identifications while
watching the videos; instead he continued to mark them down on the emotion checklist
during the second viewing. Another HOH participant, who was oral deaf, was only able
to communicate through writing, and it was difficult for her to start and stop the video
each time she wanted to identify an emotion. Three of the HOH users were also visually
impaired, adding further communication obstacles.
These barriers to communication are very difficult to mitigate in a study setting.
For future studies involving HOH users, an online chat session through which
participants and the facilitator can communicate by typing may be helpful. Running
through a short but complete practice study for training would also be helpful,
especially when complex directions are involved. Future studies should also be set up in a
usability lab specifically designed to accommodate HOH users. The lab should have
minimal background noise and high lighting levels so that lip readers can see the study
facilitator clearly. If any of the HOH users in future studies are also sign language users,
then interpreters should be employed to ease the communication difficulties.
(c) Small Number of Participants
The animated caption study was conducted with a relatively small number of
participants, with only fifteen hard of hearing and ten hearing individuals. While the
smaller participant groups are sufficient to gauge the soundness and legitimacy of the
animated text captions, larger control and experimental groups are needed before
categorical conclusions can be drawn.
(d) Limited Genre
Deaf Planet is not only a children’s show, it is also an atypical one. The premise of the
show is based on science fiction, and it mixes animation and live action.
Furthermore, all but two characters in the show are deaf and communicate using sign
language, so all communication between the main characters is conducted
through an interpreter (who is female but imitates male voices when necessary).
Finally, the actions, colours and acting of Deaf Planet are highly exaggerated to appeal to
a younger audience. While animated captions worked well with this show, its unusual
nature makes it very difficult to predict the extent to which animated captions would be
applicable to other genres such as drama, horror or comedy. It should be noted that all
content has built-in biases that can influence the effectiveness of a study of this nature.
Thus, it is imperative that animated captions are tested with as many genres and as many
different shows as possible.
(e) Short and Limited Test Clips
The test clips were too short and did not give the participants enough time
to fully understand the story or the characters. Furthermore, since Deaf Planet is a
children’s show, the range of emotions was very narrow, with only a few instances of
exaggerated anger and fear and even fewer instances of happiness, sadness and disgust.
Since I am trying to establish an emotion model based on a few basic emotions combined
with an intensity rating, a show with a much broader range of emotions (e.g., from
sarcasm to rage) is necessary for future research.
Chapter 7. Recommendation/Conclusion
Closed captioning has remained essentially unchanged since its inception and
little has been done to improve or alter it over the past few decades. One of the issues that
caption users have identified as problematic is the lack of access to non-dialogue sound
information such as music and speech prosody. The purpose of this thesis work has been
to investigate one alternate method of presenting that non-dialogue sound information, in
the form of animated text. A framework was constructed that mapped a model of
emotions, consisting of a set of basic emotions, onto properties of animation that
represent those emotions. Example video content that contained implementations of this
framework with text captions was then evaluated with hard of hearing and hearing users.
The first research question of my thesis asked whether the participants are able to
understand, appreciate and interpret the animated text captions; that is, whether
caption viewers can watch and understand a show with animated text. Based on
participant reactions, I conclude that animated text shows promise as an effective avenue
for representing non-dialogue sound information because most of the participants liked
and understood this new method. Enhanced captioning was very well received by all of
the participants and was thought to be an improvement over conventional captioning.
However, extreme captioning, where animated text was located in different spots around
the screen, was shown to be distracting to the participants. Therefore, my
recommendation is that the extreme captioning style (with dynamic placement of text) be
abandoned and that only enhanced captioning be used for continued
investigation.
The second research question asked whether the emotions could be accurately and
consistently represented by the animated text. Of all the emotive animations, the
animation used to express fear was more obvious to the participants than any of the
others. Thus, I recommend that the animation for fear be used consistently to
represent that emotion. Next, the animation for anger, although the result was not statistically
significant, was understandable to most people (based on the descriptive statistics). I
therefore suggest that the animation used to represent anger is also acceptable. The
animations for the remaining three emotions (happiness, sadness and disgust) were identified
correctly on fewer occasions than those for fear or anger. However, the two test clips contained many
more instances of anger and fear than of happiness, sadness or disgust. As a result, I recommend
that the animation styles for happiness, sadness and disgust be re-evaluated with content
containing more of these emotions.
The final research question asked which properties of animation relate to which
properties of emotion. To answer it, I refer to the animation/emotion
model outlined in Chapter 2. As the model shows, fear can be represented
by a constant expansion and contraction of the text size, combined with a vibration effect, for
the duration of the animation. Anger can be conveyed by an abrupt, single-cycle
expansion and contraction of the text, with the vibration occurring at the largest size.
Sadness is represented by text decreasing in height, moving downward and
decreasing in opacity. The happiness animation is nearly the opposite of the sadness
animation, with the text increasing in height and moving upward from the baseline.
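These mappings can be summarized in a small data structure. The sketch below is illustrative only: the parameter names and values are my own assumptions paraphrasing the model described above, not the exact settings used in the study clips.

```python
# Illustrative mapping from basic emotions to animation properties.
# All parameter names and values are hypothetical examples.
EMOTION_ANIMATIONS = {
    "fear": {"scale_cycles": "continuous", "vibrate": "throughout",
             "direction": None, "opacity_change": 0.0},
    "anger": {"scale_cycles": 1, "vibrate": "at_peak_size",
              "direction": None, "opacity_change": 0.0},
    "sadness": {"scale_cycles": 0, "vibrate": None,
                "direction": "down", "opacity_change": -0.5},
    "happiness": {"scale_cycles": 0, "vibrate": None,
                  "direction": "up", "opacity_change": 0.0},
}

def describe(emotion):
    """Return a one-line summary of the animation for an emotion."""
    p = EMOTION_ANIMATIONS[emotion]
    parts = []
    if p["scale_cycles"]:
        parts.append(f"scale: {p['scale_cycles']}")
    if p["vibrate"]:
        parts.append(f"vibrate: {p['vibrate']}")
    if p["direction"]:
        parts.append(f"move {p['direction']}")
    if p["opacity_change"]:
        parts.append(f"opacity {p['opacity_change']:+}")
    return ", ".join(parts)
```

A tabular encoding of this kind would also make it straightforward for a caption renderer to look up the animation for each emotion tag in a caption file.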
Animating descriptions of sound effects is definitely an area to be investigated
further; however, text descriptions may need to be replaced with meaningful symbols or
graphics. As long as the animated text descriptions are tested for readability
and comprehension, they may still be used sparingly to represent sound effects effectively.
Based on this study, it is apparent that all text suggesting musical elements should be
replaced with musical notes.
Section 7.01 Future Research
The participants in my study were only able to watch two short clips (less than
two minutes each) containing animated captions for a specialty children's program.
In order to better establish the effectiveness of this new style of captioning, full-length
(30 to 60 minute) shows need to be captioned using this method and tested with a larger
number of participants. Furthermore, the subject matter of Deaf Planet is unusual and
directed towards young children. A more adult-oriented, conventional show needs to be
captioned in order to gauge the reactions of a wider audience, including those who are sign
language users. A typical comment from participants was that animated captions appear
more appropriate for comedies and dramas. Further research is needed to examine the
appropriateness and acceptability of animated text for full-length television and film
shows in different genres, and to determine the limitations of text captions within those
genres.
The animations for sadness, happiness and disgust must be re-evaluated with a
show that contains more of these emotions. It is possible that one or all of these
animations must be either abandoned or fine-tuned for accuracy.
The hearing participants in this study appeared to be overwhelmed by the
simultaneous presence of sound and captions. Future studies need to recreate a typical
environment for hearing caption viewers, one in which sound is muffled or reduced, such as
an exercise setting. An improved method for measuring audience understanding of emotion,
one that does not rely on participants verbalizing what they think the characters are
feeling, is also required for future evaluation.
If the same approach is used for the emotion identification portion of the study,
where participants verbalize the emotions present on screen, a complete signal detection
methodology can be applied. Using the signal detection approach, participant responses
can be categorized as hits, misses, false alarms and correct rejections. The data can then be
analyzed using the d′ (d prime) statistic to see the relationship among the four categories
for each captioning type.
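As a minimal sketch of how such an analysis might be computed, d′ is the difference between the z-transformed hit rate and false alarm rate (Stanislaw & Todorov, 1999). The counts below are invented purely for illustration:

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Compute d' from signal detection counts.

    Applies a common 1/(2N) correction so that hit or false alarm
    rates of exactly 0 or 1 remain computable.
    """
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    hit_rate = min(max(hits / n_signal, 1 / (2 * n_signal)),
                   1 - 1 / (2 * n_signal))
    fa_rate = min(max(false_alarms / n_noise, 1 / (2 * n_noise)),
                  1 - 1 / (2 * n_noise))
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts for one captioning style: 20 emotions correctly
# identified, 5 missed, 4 false reports, 16 correct rejections.
print(round(d_prime(20, 5, 4, 16), 3))  # larger d' = better discrimination
```

A d′ of zero would indicate that participants could not distinguish the animated emotions from non-emotive text at all; comparing d′ across the captioning types would quantify which style best supports emotion identification.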
The primary purpose of this exploratory study was to establish the viability of
animated text as captions. Since it appears that animated captions are acceptable,
expressive and effective, more research time and effort can be committed to this method.
An iterative, participatory design approach (Clement & Van den Besselaar, 2004) may be
used to refine or create the emotive animations. End users of captioning can then
work collectively to arrive at animations that best convey emotions from their
perspective. The participatory design approach could be most helpful in creating
meaningful animations for sound effects.
Section 7.02 Conclusion
The results of this first study are encouraging and suggest that animating captions
is one way of capturing more of the sound information contained in television and film. People
who are hard of hearing indicate that they want more sound information, and animated
text appears to offer progress toward that goal. The emergence of digital
television should be leveraged to improve the quality of captions, and it is apparent from
this research that most users are ready and willing to accept new forms of captioning.
Chapter 8. References
Abdi, H. (2007). Signal detection theory. Encyclopedia of Measurement and Statistics.
Thousand Oaks (CA): Sage.
Abrahamian, S. (2003). EIA-608 and EIA-708 Captioning. Retrieved September 19, 2006,
from http://www.evertz.com/resources/eia_608_708_cc.pdf
Ambrose, G., & Harris, P. (2006). The fundamentals of typography. New York: Watson-Guptill
Publications.
Barnett, S. (2001). Clinical and cultural issues in caring for deaf people. Retrieved April 3,
2007, from http://www.chs.ca/info/vibes/2001/spring/clinicaldeaf.html
Blanchard, R. N. (2003). EIA-708-B closed captioning implementation. Paper presented at the
IEEE International Conference on Consumer Electronics, 80-1.
Bodine, K., & Pignol, M. (2003). Kinetic typography-based instant messaging. [Electronic
version]. CHI '03 Extended Abstracts on Human Factors in Computing Systems, 914-
1. Retrieved September 7, 2006.
Boltz, M. G. (2004). The cognitive processing of film and musical soundtracks. [Electronic
version]. Memory & Cognition, 32(7), 1194-15.
Bruner, G. C. (1990). Music, mood and marketing. [Electronic version]. Journal of Marketing,
54(4), 94-11.
Canadian Association of Broadcasters. (2004). Closed captioning standards and protocol for
Canadian English language broadcasters. Retrieved February 6, 2007, from
http://www.cfv.org/caai/nadh20.pdf
Canadian Hearing Society. (2004). Status report on deaf, deafened and hard of hearing
Ontario students. Retrieved January 19, 2007, from
http://www.chs.ca/pdf/statusreport.pdf
Canadian Radio-television and Telecommunications Commission. (2000). Decision CRTC 2000-60. Retrieved April
7, 2007, from http://www.crtc.gc.ca/archive/eng/Decisions/2000/DB2000-60.htm
Card, S. K., Moran, T., & Newell, A. (1983). The psychology of human-computer interaction.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Chion, M. (1994). Audio-vision: Sound on screen. New York: Columbia University Press.
Clement, A., & Van den Besselaar, P. (Eds.). (2004). Proceedings of the Eighth Conference on
Participatory Design: Artful Integration: Interweaving Media, Materials and Practices, PDC
2004, Toronto, Ontario, Canada, July 27-31.
Consumer Electronics Association. (2005). CEA-608-C: Line 21 data services. Retrieved
September 20, 2006, from
http://www.ce.org/Standards/StandardDetails.aspx?Id=1506&number=CEA-608-C
Consumer Electronics Association. (2006). CEA-708-C: Digital television (DTV) closed
captioning. Retrieved January 13, 2007, from
http://www.ce.org/Standards/StandardDetails.aspx?Id=1782&number=CEA-708-C
Cook, N. D. (2002). Tone of voice and mind: The connections between intonation, emotion,
cognition and consciousness. John Benjamins Publishing Company.
Creswell, J. (2003). Qualitative, quantitative, and mixed methods approaches. Thousand Oaks,
California: Sage.
Cruttenden, A. (1997). Intonation. Cambridge: Cambridge University Press.
DCMP. (n.d.). Steps in the captioning process. Retrieved March 17, 2007, from
http://www.dcmp.org/caai/nadh29.pdf
Downey, G. (2007). Constructing closed-captioning in the public interest: From minority media
accessibility to mainstream educational technology. [Electronic version]. Info, 9(2/3), 69-
13.
Ekman, P., & Friesen, W. V. (1986). A new pan-cultural facial expression of emotion.
[Electronic version]. Motivation & Emotion, 10, 159-19.
Ekman, P. (1999). Basic emotions. In T. Dalgleish, & M. J. Power (Eds.), Handbook of
cognition & emotion (pp. 301-19). New York: John Wiley.
Ernst, S. B. (1984). ABC's of typography (Rev. ed.). Art Direction Book Co.
Fels, D. I., Lee, D. G., Branje, C., & Hornburg, M. (2005). Emotive captioning and access to
television. AMCIS 2005, Omaha.
Fels, D. I., & Silverman, C. (2002). Emotive captioning in a digital world. ICCHP 2002, 284(7).
Ford, S., Forlizzi, J., & Ishizaki, S. (1997). Kinetic typography: Issues in time-based
presentation of text. Paper presented at the CHI '97 Extended Abstracts on Human
Factors in Computing Systems: Looking to the Future , Atlanta, Georgia. 269-1.
Forlizzi, J., Lee, J. C., & Hudson, S. E. (2003). The kinedit system: Affective messages using
dynamic texts. Paper presented at the SIGCHI Conference on Human Factors in
Computing Systems, Ft. Lauderdale, Florida. 377-7. Retrieved from the Conference on
Human Factors in Computing Systems database.
Frijda, N. (1986). The emotions. New York: Cambridge University Press.
Gallaudet Research Institute. (2007). How many deaf people are there in the United States?
Retrieved June 19, 2007, from http://gri.gallaudet.edu/Demographics/deaf-US.php
Geffner, D. (1997). First things first. Retrieved March 28, 2006, from
http://www.filmmakermagazine.com/fall1997/firstthingsfirst.php
Harkins, J., Korres, E., Singer, M. S. & Virvan, B. M. (1995). Non-speech information in
captioned video: A consumer opinion study with guidelines for the captioning industry.
Retrieved March 11, 2006, from http://www.cfv.org/caai/nadh126.pdf
Jacko, J., & Sears, A. (2007). The human computer interaction handbook: Fundamentals,
evolving technologies and emerging applications. ONLINE: CRC Press.
Jensema, C. J. (1997). Final report for presentation rate and readability of closed captioned
television. Retrieved June 15, 2007, from
http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/1
5/75/be.pdf
Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music
performance: Different channels, same code? [Electronic version]. Psychological Bulletin,
129(5), 770-14.
Lee, J. C., Forlizzi, J., & Hudson, S. E. (2002). The kinetic typography engine: An extensible
system for animating expressive text. Paper presented at the 15th Annual ACM
Symposium on User Interface Software and Technology Paris, France. 81-9.
Lewis, M. S. J. (2000). Television captioning: A vehicle for accessibility and literacy. Paper
presented at the CSUN, Los Angeles, California.
Marshal, J. K. (n.d.). An introduction to film sound. Retrieved June 20, 2007, from
http://www.filmsound.org/marshall/index.htm
Miles, M. B., & Huberman, M. (1994). Qualitative data analysis: An expanded sourcebook (2nd
ed.). Newbury Park, CA: Sage.
Minakuchi, M. & Tanaka, K. (2005). Automatic kinetic typography composer. ACM International
Conference Proceeding Series; Vol. 265. 221(3)
Mowrer, O. H. (1960). Learning theory and behavior. New York: Wiley.
National Captioning Institute. (2003). New analytical study of closed captioning
finds audiences think it's important but improvements are needed. Retrieved March 12,
2006, from http://www.ncicap.org/AnnenbergStudy.asp
NCAM. (n.d.a). International captioning project description of line-21 closed captions.
Retrieved March 15, 2006, from http://ncam.wgbh.org/resources/icr/line21desc.html
NCAM. (n.d.b.). International captioning project description of subtitles. Retrieved March 15,
2006, from http://ncam.wgbh.org/resources/icr/subtitledesc.html
NCAM. (n.d.c.). ATV closed captioning project. Retrieved June 20, 2007, from
http://ncam.wgbh.org/projects/atv/atvccpart2size.html
Ortony, A., & Turner, T. J. (1990). What's basic about basic emotions? [Electronic version].
Psychological Review, 97, 315-16.
Peterson, H. & Dugas, D. (1972). The relative importance of contrast and motion in visual
perception. Human Factors, 14, 207 (9)
Plutchik, R. (1980). Emotion, a psychoevolutionary synthesis. New York: Harper & Row.
RIT. (2000). Captioning. Retrieved August 13, 2007, from
http://www.netac.rit.edu/publication/tipsheet/captioning.html
Stanislaw, H., & Todorov, N. (1999). Calculation of signal detection theory measures. Behavior Research Methods, Instruments, and Computers, 31, 137-149.
Vetrugno, R., Liguori, R., Cortelli, P., & Montagna, P. (2003). Sympathetic skin response: Basic
mechanisms and clinical applications. Clinical Autonomic Research, 13, 256(4).
Woolman, M. (2005). Type in motion 2. New York: Thames & Hudson.
Appendix A

To: Deborah Fels
Re: REB 2004-092-1: Burnt Toast
Date: July 20, 2006

Dear Deborah Fels,

The review of your protocol REB File REB 2004-092-1 is now complete. This is a renewal for REB File REB 2004-092. The project has been approved for a one year period. Please note that before proceeding with your project, compliance with other required University approvals/certifications, institutional requirements, or governmental authorizations may be required.

This approval may be extended after one year upon request. Please be advised that if the project is not renewed, approval will expire and no more research involving humans may take place. If this is a funded project, access to research funds may also be affected.

Please note that REB approval policies require that you adhere strictly to the protocol as last reviewed by the REB and that any modifications must be approved by the Board before they can be implemented. Adverse or unexpected events must be reported to the REB as soon as possible with an indication from the Principal Investigator as to how, in the view of the Principal Investigator, these events affect the continuation of the protocol.

Finally, if research subjects are in the care of a health facility, at a school, or other institution or community organization, it is the responsibility of the Principal Investigator to ensure that the ethical guidelines and approvals of those facilities or institutions are obtained and filed with the REB prior to the initiation of any research.

Please quote your REB file number (REB 2004-092-1) on future correspondence. Congratulations and best of luck in conducting your research.
Nancy Walton, Ph.D.
Chair, Research Ethics Board
Appendix B Pre-Study Questionnaire

The purpose of this questionnaire is to gather information about participants' TV viewing habits and preferences. The information gathered here will be used to analyze the results of this study and the effectiveness of enhanced closed-captions for interactive television, and to improve the effectiveness of emotional enhancements to closed-captions. This questionnaire will take approximately 5 minutes to complete. Please read the questions carefully. To record your answer, please check the box or write your answer in the space provided. Thank you for participating in the Burnt Toast study.

Part I – Demographics
1. Do you identify yourself as (check one):
☐ Hearing ☐ Hard of hearing ☐ Deaf
2. Are you:
☐ Male ☐ Female
3. Please indicate your age range:
☐ Under 18 ☐ 19 – 24 ☐ 25 – 34 ☐ 35 – 44 ☐ 45 – 54 ☐ 55 – 64 ☐ over 65
4. What is your highest level of education completed?
☐ No formal education ☐ Elementary school ☐ High school ☐ Technical/College ☐ University ☐ Graduate school
5. What is your occupation? ___________________________________________
Part II – Current Television Patterns
6. How many hours of television do you watch in an average week?
☐ Less than 1 hour ☐ 1 to 5 hours ☐ 6 to 10 hours ☐ 11 to 15 hours ☐ 15 to 20 hours ☐ more than 20 hours
7. Do you watch TV alone? Please circle your response.
Always Frequently Sometimes Seldom Never
8. Do you watch TV with friends or family?
Always Frequently Sometimes Seldom Never
9. If you watch TV with friends or family, do you engage in conversation about the program:
☐ While you are watching it ☐ After the program is finished ☐ Not at all ☐ Do not watch TV with friends or family
Part III – Closed Captioning
10. Do you use closed captions when watching television?
☐ Always ☐ Occasionally ☐ Never
11. What do you like about the closed captions on television (check all that apply)
☐ rate or speed of display ☐ verbatim translation ☐ placement on screen ☐ use of text ☐ size of text ☐ colour (black and white) ☐ other, please specify __________________________________
12. What do you NOT like about the closed captions on television (check all that apply)
☐ rate or speed of display ☐ verbatim translation ☐ placement on screen ☐ use of text ☐ size of text ☐ colour (black and white) ☐ other, please specify __________________________________
13. What elements of closed captions do you believe are not presently available for television (check all that apply):
☐ Emotions in speech ☐ Background music/noises ☐ Speaker identification ☐ Text at the same speed as the words being spoken ☐ Adequate information in order to time jokes/puns correctly ☐ Other, please specify
_____________________________________________
14. The following is a list of characteristics that could be added to, or changed about, closed captions. Please check five characteristics that are most important to you.
____ Faster speed of displaying captions
____ Text descriptions of background noise
____ Text descriptions for background music
____ Use of graphics to represent emotion
____ Use of overlay/floating captions
____ Use of graphics to represent music
____ Use of colour in captions for emphasis, emotion or tone
____ Use of graphics to represent background noise
____ Use of text descriptions for emotional information
____ Use of different fonts or text size
____ Flashing captions for emphasis
____ Use of moving text to represent emotion
____ Use of graphics or symbols to denote background elements such as applause or musical inserts
____ Use graphics or text to identify speaker
15. Is there anything else that has not been included in the above list that you would like to see included in closed captions that is not presently available?
_____________________________________________________________________
_____________________________________________________________________
Appendix C Post-Study Questionnaire

The purpose of this questionnaire is to gather information about your impressions of, and likes and dislikes of, the six short videos that you viewed. The information gathered here will be used to analyze the effectiveness of enhanced closed-captions for interactive television, and to gather suggestions for improvement. This questionnaire will take approximately 5 minutes to complete. Please read the questions carefully. To record your answer, please check the box or write your answer in the space provided. Thank you for participating in the enhanced captioning study.

16. In a few sentences (or less), please describe what To Air is Human was about.
________________________________________________________________________
________________________________________________________________________
17. Please rate each of the following aspects of the enhanced captioning version of "To Air is Human". Check the box that best fits your rating.
Aspects (rate each: Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much):
Colour
Size of text
Speed of display
Placement on screen
Movement of the text on the screen
How the moving text portrayed the character's emotions
Text descriptions
18. Please rate each of the following aspects of the extreme captioning version of "Bad Vibes". Check the box that best fits your rating.
Aspects (rate each: Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much):
Colour
Size of text
Speed of display
Placement on screen
Movement of the text on the screen
How the moving text portrayed the character's emotions
Text descriptions at bottom
19. Please rate each of the following aspects of the conventional captioning version of "To Air is Human". Check the box that best fits your rating.
Aspects (rate each: Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much):
Size of text
Speed of display
Placement on screen
Text descriptions
20. Compared with the conventional closed captioning of "To Air is Human", rate your level of confusion with the enhanced captioning for "To Air is Human". Please circle your rating.
Much more confusing / Somewhat more confusing / Not different / Less confusing / Much less confusing
21. How much do you think the enhanced captions increased your overall understanding of the video?
Did not increase my level of understanding at all / Slightly increased my level of understanding / No difference / Somewhat increased my level of understanding / Greatly increased my level of understanding
22. How much do you think the enhanced captions increased your overall understanding of the emotions the characters were feeling in the video?
Did not increase my level of understanding at all / Slightly increased my level of understanding / No difference / Somewhat increased my level of understanding / Greatly increased my level of understanding
23. How much do you think the enhanced captions distracted you from the video?
Did not distract me / Slightly distracted me / Did not notice / Somewhat distracted me / Greatly distracted me
24. If you had just watched To Air is Human with friends who were hearing, rate your willingness to engage in conversation with them about To Air is Human.
Not willing to discuss To Air is Human at all / Not really willing to discuss To Air is Human / Don't care / Somewhat willing to discuss To Air is Human / Very willing to discuss To Air is Human
Please indicate why you selected your particular rating.
________________________________________________________________________
25. What suggestions do you have for further improvements to captioning?
________________________________________________________________________
________________________________________________________________________ ________________________________________________________________________
26. Please add any additional comments.
________________________________________________________________________
________________________________________________________________________