Representing Emotions with Animated Text
By
Raisa Rashid, B.Comm (Ryerson University, 2005)
A thesis submitted in conformity with the requirements for the degree of Master of Information Studies
Faculty of Information Studies
University of Toronto
© Copyright by Raisa Rashid 2008
Representing Emotions with Animated Text
Raisa Rashid
Master of Information Studies
Faculty of Information Studies
University of Toronto
2008
Abstract
Closed captioning has not improved since the early 1970s, while film and television
technology has changed dramatically. Closed captioning only conveys verbatim dialogue
to the audience while ignoring music, sound effects and speech prosody. Thus, caption
viewers receive limited and often erroneous information. My thesis research attempts to
add some of the missing sounds and emotions back into captioning using animated text.
The study involved two animated caption styles and one conventional style:
enhanced, extreme and closed. All styles were applied to two clips with animations for
happiness, sadness, anger, fear and disgust emotions. Twenty-five hard of hearing and
hearing participants viewed and commented on the three caption styles and also identified
the character’s emotions. The study revealed that participants preferred enhanced,
animated captions. Enhanced captions appeared to improve access to the emotive
information in the content. Also, the animation for fear appeared to be most easily
understood by the participants.
Acknowledgements
I would like to express my gratitude to all those who made it possible for me to complete this thesis. It would have been impossible without the people who supported and believed in me.
I want to give heartfelt thanks to my thesis supervisors Dr. Deborah Fels and Dr.
Andrew Clement. I want to specially thank Deb for expertly guiding me and inspiring me
at every step. I also want to thank Quoc Vy and Richard Hunt for helping me with the
studies. A huge note of gratitude goes to my fellow “labbies” who were always
encouraging. I want to especially thank Bertha Konstantinidis, Daniel Lee, JP Udo, Emily
Price and Carmen Branje. Finally I want to thank my family. My husband has been a
source of strength and support throughout the last year. He has made my work on this
thesis so much easier by taking care of our lives when I was not able to. I also want to
thank my loving parents, in-laws and sister for their on-going support, love and
encouragement.
I also want to thank the University of Toronto, Ryerson University and NSERC for
supporting my research.
Table of Contents

Introduction
Chapter 1. Literature Review
   Section 1.01 Closed Captioning
      (a) Captioning Standards
      (b) Current Captioning Practices
      (c) Captioning Practices outside North America
      (d) Problem with Captioning
   Section 1.02 Missing Elements: Music, Sound Effects & Prosody
   Section 1.03 Kinetic Typography
      (a) Elements of Typography
      (b) Kinetic Typography
   Section 1.04 A Model of Basic Emotions for Captioning
Chapter 2. Model of Emotion Used in this Thesis
   Section 2.01 Framework of emotive captions
Chapter 3. System Perspective
Chapter 4. Method
   Section 4.01 Content used in this study
   Section 4.02 Data Collection
      (a) Pre-Study Questionnaire
      (b) Data collection during viewing of the content
      (c) Post-study questionnaire
      (d) Video Data
      (e) Experimental setup and time to complete study
      (f) Data Collection and Analysis
Chapter 5. Results
   Section 5.01 Questionnaire Data
   Section 5.02 Video Data
   Section 5.03 Emotion ID
Chapter 6. Discussion
   Section 6.01 Limitations
      (a) Emotion Identification Method
      (b) Communication barriers
      (c) Small Number of Participants
      (d) Limited Genre
      (e) Short and Limited Test Clips
Chapter 7. Recommendation/Conclusion
   Section 7.01 Future Research
   Section 7.02 Conclusion
Chapter 8. References
Appendix A
Appendix B Pre-Study Questionnaire
Appendix C Post-Study Questionnaire
List of Tables
Table 1: Summary of the relevant animation properties for anger, fear, sadness, happiness
Table 2: Description of the themes
Table 3: Emotion identification categories
Table 4: χ2, degrees of freedom (df), mean and standard deviation for enhanced captioning attributes for To Air
Table 5: χ2, degrees of freedom (df), mean and standard deviation for enhanced captioning attributes for Bad Vibes
Table 6: Number of positive/negative comments by groups for the video data
Table 7: Descriptive statistics for HOH group in all categories of the video data analysis
Table 8: Descriptive statistics for hearing group in all categories of the video data
Table 9: Average number of emotions identified in each category
Table 10: Emotions versus type of identification
Table 11: Caption category versus type of identification
List of Figures
Figure 1: Example of closed captioning. The music note is one of the commonly used symbols to represent music (with permission)
Figure 2: Teletext Example
Figure 3: Shark Week Titling Sequence
Figure 4: Example of font size representing volume (Lee et al., 2002)
Figure 5: Example of high intensity anger over four frames
Figure 6: Example of low intensity fear. Initial text size is default size. Text size then expands and contracts rapidly for the entire duration that the text is on the screen
Figure 7: Conventional Captioning System
Figure 8: Captioning System, Proposed
Figure 9: Captioning Process: Conventional and proposed
Figure 10: Age distribution
Figure 11: Example of Enhanced Caption and Extreme Caption
Figure 12: Comparison of the three captioning styles by number of comments
Introduction
Deaf and hard of hearing viewers have limited access to the rich media of television.
Most are unable to obtain the same quality and quantity of sound information as their
hearing peers. Hearing impaired viewers can compensate for some of the missing sound
information through the use of closed captioning, which relays a portion of aural sound
visually through text and icons.
The size of the hard of hearing population in Canada is very difficult to estimate.
The degree of hearing loss can vary from mild to profound deafness making the
classification of being hearing impaired somewhat amorphous (Gallaudet Research
Institute, 2007). In addition, data on the hearing impaired population is primarily based on
self-reported or informant-reported instances and many hearing impaired individuals do
not report their disabilities (Gallaudet Research Institute, 2007). The Canadian Hearing
Society (CHS) roughly estimates that 23% of the Canadian population has some form of
hearing loss (CHS, 2004), which translates into approximately seven million people.
However, not all caption users are hearing impaired, as many viewers turn on
captions to learn English or when sound is unavailable (such as at fitness centres, bars or
restaurants) (Lewis, 2000). The National Captioning Institute estimates that there are 100 million viewers in the US who benefit from using closed captioning (2003). This is neither a small nor an insignificant population that can take advantage of this technology.
Despite the large user group, very few industry or research resources are allocated to
develop closed captioning technology, which has remained stagnant for decades.
The North American system of closed captioning is called Line 21 captioning and it
was developed for analogue television in the 1970s (Canadian Association of
Broadcasters, 2004). This type of captioning is limited to a small set of fonts, colours and
graphics (Canadian Association of Broadcasters, 2004). Limitations in decoder
technology posed legibility problems for early captions, forcing the text to be mono-
spaced, upper case, and in white font colour displayed against a black background
(Canadian Association of Broadcasters, 2004).
Recent advancements in decoder technologies allow for legible mixed case letters,
some symbols, and a choice of fonts and colours to be used (Canadian Association of
Broadcasters, 2004). However, despite the new capabilities, the uppercase, white text on
black background is still the principal format for captions used today (National
Captioning Institute, 2003) (see Figure 1 for an example of standard closed captioning).
Figure 1 Example of closed captioning. The music note is one of the commonly used symbols to
represent music (with permission).
The lack of progress within the captioning industry is incongruent with the
advancements in television technology over the last few decades, especially the
emergence of digital television. Closed captioning has not only remained unchanged for
three decades, current guidelines actually discourage the use of the new available features
such as colours, fonts and mixed case lettering (Canadian Association of Broadcasters,
2004).
A major concern with closed captioning is that it provides only the verbatim
equivalent of the spoken dialogue of a television show and ignores most non-verbal
information such as music, sound effects, and tone of voice. Much of this missing sound
information is used to express emotions, create ambiance, and complete the television
and film viewing experience. Audience members forced to access sound information
through captions alone stand to lose vital components of their television and film viewing
experience.
The inspiration for my thesis is to investigate ways of infusing captions with the non-
verbal sound elements using a relatively new approach: animated or kinetic text. One of
the primary aims of my thesis research work is to determine whether animated text is a
viable option to pursue when representing sound information through animation. My
personal motivation for pursuing this research area comes from being a member of an
assistive technologies lab dedicated to investigating information challenges faced by
people with disabilities. Hearing impaired caption users have tremendous needs that are not currently being addressed, and as an information technology professional I am committed to using new technologies to meet them.
Animated text has been used in the entertainment industry for exciting title sequences and
it appeared to be an unexplored, yet creative approach to text with the potential for
visually representing sound information.
My research work in combining animated text with captioning is described in this
thesis. It begins with a review of the literature in the following areas: captioning, kinetic
text, music, sound effects and emotions. The literature review is followed by a
description of the process used to develop a model of emotion/animation that attempts to
map the properties of animation to the expression of emotions. The process description is
followed by a systems look at how captions are created by third-party caption houses.
The next chapter outlines the study that was undertaken to explore and refine the
emotion/animation model. The results of the data analysis obtained from the user study
are reported following the methods section. These results are then interpreted and
analyzed in a discussion section. The final chapter of the thesis includes
recommendations based on the discussion and concluding remarks.
The goal of this thesis is to explore the potential for animation to enhance captioning
and incorporate more of the non-verbal sound information, not to categorically claim that
animation is the only (or even the best) way to present non-verbal sound information.
Additionally, the animations explored in this thesis are only one possible way of using animation to represent emotions; many others exist that have not been investigated. The focus
of this thesis is on verbatim captioning in its current state only. While literacy levels of
hearing impaired viewers can impact their ability to watch television with verbatim
captioning, this topic is out of scope for this thesis.
Chapter 1. Literature Review
Section 1.01 Closed Captioning
Closed captioning is a technique for translating dialogue into text. A television set
with a built-in decoder is able to display the translated text on the screen to match the
dialogue of a show (Abrahamian, 2003). Captioning can be provided by the broadcasters
in real-time or off-line (for broadcast at a later time) format and can appear on screen as
roll-up or pop-on captions (Canadian Association of Broadcasters, 2004). Off-line
captions are produced by captioners from third party captioning houses or by in-house
captioning departments of large broadcasters. Off-line captions are created before the
program actually airs, which allows time for correcting of errors, inserting of symbols
and organizing of sentence and phrase structure so that it is easy to read. The Canadian Association of Broadcasters (2004) recommends that off-line captions be provided in the pop-on format, where the entire sentence or phrase “pops” on screen at once; however, roll-up captions are sometimes used for off-line captions because they are less costly to produce (R.I.T., n.d.).
Real-time or on-line captions are created by caption stenographers, who watch
live broadcasts (like sporting events) and transcribe them for the viewers at home
(Canadian Association of Broadcasters, 2004). Real-time captions typically appear using
a roll-up format, where the captions scroll up the screen one line at a time.
(a) Captioning Standards
The standard defining the captioning format, spacing and font for North American
analogue television is EIA-608. The EIA-608 standard specifies a restricted set of fonts,
characters and colours for use in captioning (Consumer Electronics Association, 2005).
When captions were first introduced in the early 1970s, limitations in television encoder
decoder technology prevented the use of anything other than a mono-spaced font and
white text colour. However, several options have been added since the original
specification, namely the use of mixed letter cases, more font colours, and special
characters. These new additions are rarely implemented in captions. Moreover, captioning guidelines strongly discourage their use, citing viewer expectations and bandwidth limitations as obstacles.
The reasons behind the discouraging caption guidelines are valid to some extent.
Caption viewers are accustomed to the white, uppercase captions and in general are
hesitant to accept alterations to the old standard (NCAM, n.d.). In one NCAM study, for example, a participant commented that although green captions were easy to read, she felt “funny” about a non-white caption colour (NCAM, n.d.). The
bandwidth allowed for EIA-608 captions is 980 bits per second (bps) (Robson &
Hutchins, 1998), and the quantity and quality of information that can be broadcast at this bandwidth are quite constrained, limiting the type and style of information being
transmitted (Consumer Electronics Association, 2005).
Viewer preferences will not change easily, but the bandwidth limitations will
soon be mitigated. The limited analogue closed captioning standard, EIA-608, is to be
replaced by the emerging EIA-708 (or digital television) standard in the near future. The
higher bandwidth of 9600 bps for EIA-708 means that more data per minute of video can
be transmitted (Consumer Electronics Association, 2006). As a result, EIA-708 will
provide a much richer character set that includes non-English letters,
accented letters and an array of symbols (Consumer Electronics Association, 2006). The
new standard will also allow features such as viewer-adjustable sizing of text, which will
transfer more control over to the viewers by allowing them to increase or decrease the
size of their caption display. EIA-708 will permit the use of different colours for the text,
as well as translucency in the backgrounds. The translucent background has the potential
to obstruct less of the television screen than the black, opaque background used currently
(Blanchard, 2003).
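As a rough illustration of the difference these channel rates make (arithmetic only, not part of either specification), the per-minute data budget of each standard can be computed directly:

```python
# Illustrative arithmetic: compare the per-minute data budgets of the
# EIA-608 (980 bps) and EIA-708 (9600 bps) caption channels cited above.

def bytes_per_minute(bits_per_second: int) -> float:
    """Convert a channel rate in bits per second to bytes per minute of video."""
    return bits_per_second * 60 / 8

eia608 = bytes_per_minute(980)    # EIA-608: 7,350 bytes per minute
eia708 = bytes_per_minute(9600)   # EIA-708: 72,000 bytes per minute

print(f"EIA-608: {eia608:,.0f} bytes/min")
print(f"EIA-708: {eia708:,.0f} bytes/min")
print(f"Ratio:   {eia708 / eia608:.1f}x more room for text and styling")
```

The nearly tenfold increase is what makes room for the richer character sets, fonts and styling features described above.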
The digital television standard will further allow for text styles that include edged
or drop-shadowed text and a broad collection of fonts such as mono-spaced, serif, sans-
serif and cursive. The 708 standard will provide for other interesting features such as the
delay command (Blanchard, 2003). The delay command feature has been designed to
instruct the caption decoder to halt processing of the Service Input Buffer data for a
designated period of time. When a delay command is received by the decoder, incoming
data is kept in the Service Input Buffer until the defined delay time has expired
(Blanchard, 2003). Blanchard (2003) describes this feature of the 708 caption standard as “time-release captions”. The delay command can potentially be used to reduce the speech-to-caption synchronization errors that exist in captioning today.
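The buffering behaviour described above can be sketched in a few lines. This is a simplified Python illustration, not the EIA-708 wire protocol: the class and method names and the simulated clock are invented for this sketch, and only the hold-and-release behaviour of the Service Input Buffer follows Blanchard's (2003) description.

```python
from collections import deque

class CaptionDecoder:
    """Toy model of a decoder's delay-command handling (names are illustrative)."""

    def __init__(self):
        self.buffer = deque()   # stands in for the Service Input Buffer
        self.hold_until = 0.0   # simulated clock time when any delay expires
        self.displayed = []

    def delay(self, now: float, seconds: float):
        """Handle a delay command: halt processing until now + seconds."""
        self.hold_until = now + seconds

    def receive(self, now: float, text: str):
        """Buffer incoming caption data, then flush whatever is released."""
        self.buffer.append(text)
        self.tick(now)

    def tick(self, now: float):
        """Display buffered captions once any pending delay has expired."""
        if now >= self.hold_until:
            while self.buffer:
                self.displayed.append(self.buffer.popleft())

decoder = CaptionDecoder()
decoder.receive(0.0, "HELLO.")        # no delay pending: displayed at once
decoder.delay(0.5, seconds=2.0)       # hold the buffer until t = 2.5
decoder.receive(1.0, "HOW ARE YOU?")  # arrives during the delay: held in buffer
decoder.tick(3.0)                     # delay expired: buffered text released
print(decoder.displayed)              # ['HELLO.', 'HOW ARE YOU?']
```

The point of the sketch is the timing: data arriving during a delay is held rather than dropped, which is what would let a captioner re-time captions against speech.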
(b) Current Captioning Practices
The existing captioning guidelines, referenced by most captioners, are heavily
influenced by the research of Harkins, Korres, Singer & Virvan (1995). The Harkins et al. (1995) study presented 106 deaf and 83 hard of hearing viewers with 19 television
clips and asked them to complete a questionnaire to indicate their captioning preferences.
The study sought to garner user preferences about what the researchers called “non-
speech information” (e.g. speaker identification, music, manner of speaking). Based on
the study results, Harkins et al. (1995) drafted recommendations for use by various
captioning institutions.
The Harkins et al. (1995) study brought forth some interesting insights about caption user preferences and identified possible improvements for captioning. However, many of the issues uncovered in that study were insufficiently addressed and still require in-depth investigation. Harkins et al. (1995) concluded that colour is ineffective in distinguishing between various speakers compared to explicit text descriptions; however, little research has been conducted since then to uncover possible benefits of coloured text to caption viewers. Harkins et al. (1995) also concluded that animations like flashing text are poorly received by audiences, but no research has been carried out to determine whether and how animation, graphics or symbols can be beneficial.
Harkins et al. (1995) used a survey instrument that did not compare different
styles of captions but only asked people to imagine them when answering questions. Use
of animation, symbols, and graphics in captioning is so uncommon that most people will never have seen them. People are generally poor at imagining features and functionality that
they have never experienced before, and are unable to rate the effectiveness or
desirability of them (Jacko & Sears, 2003). Controlled experiments with different styles
and use of sensory enhancements are warranted to determine the effect of these
alternatives on people’s attitudes and levels of understanding of the content.
As mentioned before, very little research has been conducted to build on the findings of Harkins et al. (1995), and captioning guidelines have remained virtually unchanged as a result. This lack of advancement in closed captioning can be
attributed to several factors. Firstly, caption lobbyists are still struggling to ensure better
captioning legislation and increase the number of captioned programs (Downey, 2007).
As a result, the social pressure is not on advancing caption research but rather on having captions present at all. Secondly, captions are created by people unrelated to the
production of the content (e.g., third-party caption houses) without any input from the
producers/creators of a particular show. Therefore, all captioning decisions are made
post-production, by outside captioners. It is easy to speculate that these captioning houses have few compelling reasons to change their current way of producing captions.
Finally, the broadcasters are mandated by government regulations to provide a
specific number of hours of captioning a week (Downey, 2007). As the broadcasters are
naturally concerned with the bottom line, the easiest and least expensive captioning
solution is the one they tend to opt for. Quality control and assurance then is undertaken
by the captioning house rather than the broadcasters. However, as mentioned before, the
captioning houses have little incentive for improving the state of captions.
(c) Captioning Practices outside North America
While closed captioning in North America has remained colourless and motionless, European and Australian caption viewers have been benefiting from colours,
multiple fonts, and mixed case lettering for many years. The European and Australian
equivalent of captioning, called teletext subtitling (NCAM, n.d.b), emerged out of an
advertising model. Teletext, shown in Figure 2, was created by the British Broadcasting
Company to provide accessibility through subtitles to deaf and hard of hearing viewers,
but was subsequently adapted to provide other information such as weather, advertising and sports to all viewers (NCAM, n.d.b). Teletext’s higher data transmission rate (12 kilobytes per second) allows for the display of more features (such as animation) and
information (NCAM, n.d.b). However, complaints from European viewers about their
subtitling services are surprisingly similar to those from North Americans: lack of
synchronization, poor spelling, too fast onset, delays in update rates, insufficient amount
of text and undesirable position of text on screen (Ofcom, 2005).
Figure 2 Teletext Example
(d) Problem with Captioning
It appears that there is much research and development required to improve
captioning to even a minimally satisfactory level. Many issues such as spelling,
synchronization of text to dialogue, and positioning of text on screen have potential
technical solutions. However, none of the captioning techniques described thus far have
adequately managed to present information beyond verbatim dialogue. The capabilities of
the captioning technologies allow for improvements in this area, but little research has
been done to gauge audience reaction to alternative means of sound expression beyond
static text. If the viewers of a show are only receiving the dialogue portion of television
sound, then they are missing, as identified by Harkins et al. (1995) and Ofcom (2005),
important non-verbal elements such as music, sound effects, and speech prosody.
Section 1.02 Missing Elements: Music, Sound Effects &
Prosody
Speech prosody, music and sound effects are almost as integral to the television
viewing experience as the dialogue and the visuals. Speech prosody profoundly
influences emotional context of words (Cruttenden, 1997). Music has the ability to
impact the way viewers recognize, understand and remember a television series or a
movie (Boltz, 2004; Bruner, 1990) and sound effects have the ability to provide
information as well as emotions.
Tone of voice heavily influences emotional context of speech by enhancing
perception of words (Cook, 2002). Speech prosody can be divided into three areas:
rhythm, loudness of voice and intonation (Cook, 2002). All three elements of voice can
convey emotions, in particular changes in loudness and pitch (Cook, 2002). Caption
users who are unable to access these important elements of voice risk losing crucial
dramatic cues, especially when irony and puns are concerned.
As Boltz (2004) suggests, music can be used in parallel with a scene to increase or
decrease the emotional impact of the visual elements. The combination of pitch, timing
and loudness properties has the ability to create eerie suspense or heartbreaking sorrow
(Boltz, 2004). Rising and falling pitch, for example, can represent growing or declining
intensity within a specific emotional context. Complex melodies can convey more
sadness or frustration than simple melodies (Bruner, 1990). Up-tempo music is generally
associated with happy or pleasant feelings, while slow tempo suggests sentimental or
solemn feelings (Gundlach, 1935; Hevner, 1937 as cited in Bruner, 1990). Increasing
loudness or crescendo can express an increase in force, while a decreasing loudness or
diminuendo can convey a decrease in energy (Cooke, 1962; Zetti, 1973 as cited in
Bruner, 1990). All of these critical musical properties are manipulated by the filmmakers
to present the audiences with another dimension of information beyond the visuals. If the
viewers are missing the added music dimension because they cannot hear (or cannot hear
fully), then they are missing tremendous amounts of crucial information the filmmakers
intended to deliver and that is not replicated in the visual information.
Chion (1994) claims that music can have two basic types of emotional effects on a
scene: empathetic and anempathetic. Music that creates empathetic effects directly
expresses the feelings of a scene, by emulating the rhythm, tone and phrasing of that
scene (Chion, 1994). Music that produces anempathetic results uses indifferent music
alongside an emotional scene to exaggerate its impact (Chion, 1994). Boltz (2004)
suggests a similar concept, where the ironic contrast technique is used to combine
dissimilar visuals and music to enhance or diminish the emotional intensity of a particular
scene. An example of the ironic contrast technique would be playing cheery music
together with a violent scene to make the actions dramatically more disturbing.
With empathetic music, where visuals are reinforced by music, the risk to viewers
who cannot hear is that they lose some of the emotional impact of the scene. However,
with anempathetic music or where the ironic contrast is used, music plays a much larger
role. In this case, the risk to the viewers is much higher because losing those critical,
ironic musical cues can result in misunderstanding an entire scene (or an entire film).
Sound effects, like music, can add much information and affect to a television
program or a film. Sound effects supplement visual elements and dialogue to provide
information and establish mood (Marshall, n.d.). According to Kerner (1989), sound
effects are used to accomplish three objectives: simulate reality, create illusions and
establish mood. Firstly, sound effects simulate reality by bridging the gap between a
staged, artificial scene and the audience’s perception of reality. For instance, when a fake
bottle is broken over a cowboy’s head in a Western drama, only the sound effects of a
glass bottle crashing make it real to the audience (Kerner, 1989). Secondly, sound effects
create illusions by assisting audiences in imagining scenes that were never filmed or
shown. For instance, off-screen crowd chatter convinces the audience that the actors are
in a crowded arena instead of an empty studio (Kerner, 1989). Finally, sound effects
establish the mood of a scene simply through the information they provide (Kerner, 1989).
For example, the sound of a door clicking shut can provide the audience with the
information that a door has been shut but the same door clicking during a burglary scene
can communicate a sense of dread (Marshall, n.d.). Viewers who are unable to obtain
these critical sound effect prompts stand to misinterpret (or simply not enjoy) television
shows.
Since closed captioning mostly provides a translation of the dialogue elements,
the richness of music, sound effects and intonation is frequently missed by hearing
impaired viewers (or viewers without access to sound). At this time, closed captioning
either ignores non-verbal sound information or conveys it through italics and bracketed
descriptions such as (woman crying), (door slamming) or (soft music). These descriptive
phrases are, at best, poor replacements for the depth of music, sound effects and voice.
Moreover, time limitations and viewer reading speeds prevent most captioned film and
television from carrying much text description of the non-dialogue information.
Jensema (1997) found in a series of studies that the average viewer reads captions
at a speed of 145 words per minute. He also found that the caption reading speed of
hearing impaired viewers was somewhat higher than that of hearing viewers and that the
reading speed of all caption viewers varied greatly (Jensema, 1997). He suggested that
hearing impaired viewers have a higher caption reading speed because they likely have
more experience using captions than hearing viewers. Harkins et al. (1995) reported in
their study that 53% of the deaf and hard of hearing participants expressed an interest in
having more of the background sound information presented, in addition to the dialogue.
Since, as reported by Jensema (1997), the average caption speed on television is
already 141 words per minute, any further addition of information beyond dialogue may
render the captions too fast for people to enjoy or understand. Also, Jensema, Danturthi
& Burch (2000) found that when viewers watched captioned television programs, they
looked at the captions 84% of the time and the actual video only 14% of the time. These
findings indicate that any additional sound information conveyed to the viewers via the
television screen must be contained in the area where captions are being displayed.
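The presentation-rate constraint can be made concrete with a small calculation (my own illustration, not part of the cited studies), using Jensema's (1997) reported average caption speed of 141 words per minute:

```python
# Sketch: how a caption presentation rate in words per minute (wpm)
# bounds the number of words a single caption can hold.
# The 141 wpm default reflects Jensema's (1997) reported average.

def max_words(display_seconds: float, wpm: float = 141.0) -> int:
    """Largest whole number of words readable in the given display time."""
    return int(display_seconds * wpm / 60.0)

# A caption shown for three seconds leaves room for only about seven
# words, which is why adding non-dialogue descriptions quickly runs
# up against viewer reading speed.
print(max_words(3.0))
```

At typical two- to three-second display times, even a single bracketed description such as (door slamming) consumes a substantial share of that word budget.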
In order to inject more information into the closed captioning without increasing the
number of words, it may be possible to supplement text information with other forms of
information representation such as icons, animation, and colours. Fels, Lee, Branje &
Hornburg (2005) have explored the use of colours and icons in captioning. The main
objectives of this research were to communicate emotional elements missing from regular
captioning and to provide access to more sound information. They experimented with
using icons such as emoticons, music notes and symbols in conjunction with colour to
express emotions such as happiness, anger, and surprise.
Fels et al. (2005) found that participant reactions differed dramatically between the
hard of hearing and deaf groups. The hard of hearing participants found the use of colour
and icons very useful and expressive. However, many of the deaf participants thought the
use of colour was juvenile and preferred to rely on their own faculties to interpret the
emotional content (Fels et al., 2005). All of the participants found the use of icons to
provide sound information useful (Fels et al., 2005). The authors drew two main
conclusions from the findings: first, colours and icons can be used successfully to convey
sound and emotional information, and second, enhanced captioning may be perceived as
more beneficial to hard of hearing viewers than deaf viewers (Fels et al., 2005).
Silverman & Fels (2002) also investigated the use of emotive captions, but in the
form of comic book style graphics. The researchers used speech bubbles and colourful
letters to portray emotions such as sadness, happiness, anger and fear as well as
background sounds. Silverman & Fels (2002) also used multiple icons and a number of
text styles to manipulate the dialogue and present non-dialogue sound information. The
emotive captioned content was viewed by deaf and hearing participants. The researchers
found that most of the participants (ten of eleven) thought that the emotive captions
increased their understanding of the content. The primary complaint received from the
viewers was that the comic book style was somewhat juvenile and ill-suited to more
serious content (Silverman & Fels, 2002).
Beyond using static elements such as graphics, icons and colours, animating the
caption text could be a new way of infusing text with an added dimension of information
without adding more text descriptions. Animated or kinetic text has long been used in the
entertainment industry to entertain, inform and evoke emotions. The emotive power of
kinetic text could be applied to captions to potentially create animated captions that
convey emotions, music and sound effects.
Section 1.03 Kinetic Typography
(a) Elements of Typography
Typography encompasses all the design choices embedded in selecting type for a
webpage, printed page or television screen (Ernst, 1984). Type is characterized by
elements such as font (or typeface), size, spacing (between lines of type) and contrast.
Common typefaces are categorized as serif and sans serif: a serif typeface has small lines
projecting from the ends of the letters, while a sans serif typeface does not have these
lines (Ernst, 1984). Most fonts are proportionally spaced, where larger characters occupy
more space than smaller ones (Ernst, 1984). With a mono-spaced font (the type used in
closed captioning), each character takes up the same amount of space. The size of type is
measured in vertical height, or points (72 points to one inch), from a common baseline.
The weight of type refers to the density of the letters (e.g. bold, light, and heavy)
(Ernst, 1984).
According to Ambrose & Harris (2006), manipulating these type elements can
ensure that a piece of text is readable and legible. Readability refers to the ease with
which readers can comprehend a body of text and takes into account the overall layout
of the text (Ambrose & Harris, 2006). In captioning, pop-on captions are considered more
readable than roll-on captions, as viewers read collections of words as opposed to one
word at a time (RIT, 2000). Legibility, unlike readability, concerns the finer, operational
detail of a typeface (e.g. size, weight, and typeface). A legible typeface, for instance, can
be rendered unreadable by making it too wide (Ambrose & Harris, 2006). Kerning, or the
spacing between two letters, is another factor that affects readability (Ambrose & Harris,
2006): a lack of white space between letters makes text less readable.
(b) Kinetic Typography
Kinetic typography, also known as animated text, is essentially text that moves
over time. Kinetic typography has recently emerged as a powerful tool for expressing
emotion, mood, personal characteristics, and tone of voice in the creative media industry
(Forlizzi, Lee & Hudson, 2003). According to Woolman (2005), kinetic type has
intrinsic, embedded meaning and the ability to inform, entertain and emotionally affect
the audiences. One of the first examples of kinetic typography can be found in the title
sequence of the Alfred Hitchcock film Psycho (Lee, Forlizzi & Hudson, 2002), where
erratic lettering and movement of text on screen successfully communicated the
unsettling nature of the classic horror film.
In recent years, the film and television industry has invested heavily in animated
text for title or credit sequences (perhaps to emotionally prepare the audience for the
impending viewing experience) (Geffner, 1997). Another widely acclaimed use of kinetic
typography is found in the opening sequence of the 1995 horror film Se7en, where
trembling, high-contrast letters in a scratchy typeface convey a sense of terror that
emotionally prepares audiences for this disturbing film (Geffner, 1997).
Other popular examples of kinetic text are found in the title sequences of television
shows like 24 and Shark Week, illustrating the power of kinetic text to influence
audience emotions (Woolman, 2005). The title sequence for Shark Week (Figure 3)
conveys the dangers associated with sharks using only floating, eerie letters, without
showing the sharks themselves (Woolman, 2005).
Figure 3 Shark Week Titling Sequence
In contrast to the film/artistic industry, where much work is being done to exploit
the potential of kinetic typography, little formal research is being conducted in the
academic community to understand the impact of kinetic type on audiences/users. The
few kinetic typography researchers include Wang, Prendinger & Igarashi (2004), who
have explored the impact of kinetic typography in expressing the affective or emotional
state in online communication using a library of pre-made animations. Wang et al.
(2004) used a library that included around twenty different animations, some of which
represented emotions (happy and sad) while others signified emphasis.
Using galvanic skin response sensors to detect emotional arousal among the participants,
Wang et al. (2004) established that the degree of emotional response increased when
animated text was used, implying that animated text can communicate emotions.
However, Wang et al. (2004) did not try to evaluate the relationship between specific
emotions and animations or test whether different animation properties had different
effects on users.
Bodine & Pignol (2003) conducted a study similar to that of Wang et al. (2004),
evaluating the emotional impact of kinetic typography in instant messaging
communication. The researchers concluded that “kinetic typography has the capacity to
dramatically add to the way people convey emotions” (Bodine & Pignol, 2003, p. 2).
Forlizzi et al. (2003) agreed with the findings of Bodine & Pignol that kinetic typography
has the capacity to express affective content. Like Wang et al. (2004), Bodine &
Pignol (2003) and Forlizzi et al. (2003) did not analyze the relationship between specific
animation properties and emotions. As a result, all three research groups allowed the
kinetic properties of the text to be driven by the interpretation of the animations’ creators
rather than by the emotions themselves.
In addition to affective content, kinetic typography can create characters and
direct the attention of audiences (Forlizzi et al., 2003). Ford, Forlizzi & Ishizaki
(1997) discovered that readers of kinetic typography messages attached a tone of voice to
the messages they observed. Ford et al. (1997) define tone of voice as “variations in
pronunciation when segments of speech such as syllables, words and phrases are
articulated” (1997, p. 269). The researchers also suggest that kinetic typography can take
on the “personality” of the people composing the animated messages, with message
readers beginning to assign distinct personas to the writers of the kinetic dialogue.
The research initiatives mentioned thus far have demonstrated that kinetic
typography can be used to enhance the emotional interpretation of written words;
however, only a few of the studies attempted to determine which properties of the
animation evoked particular emotions. For captioners to apply animations to captions
effectively, the animations need to convey specific emotions in a consistent and
accurate way.
Lee, Forlizzi & Hudson (2002) are among the few researchers in the academic
community to look at the relationship between properties of text animation and emotion.
They are also among the few to attempt to create reusable animations. In their study
using kinetic text with instant messaging, Lee et al. (2002) show that kinetic text can be
used consistently with pre-defined patterns to communicate specific emotions. They
argue that kinetic typography parameters correspond to prosodic features of voice that
express emotions, such as rate of speech and volume of voice. They suggest that
animation properties of text, such as increases in size or upward and downward
movement, relate to voice characteristics such as pitch, volume, and speed of delivery.
They report that sweeping upward or downward motions of text can suggest rising and
falling pitch, while increases and decreases in text size can express loudness of voice:
larger font sizes communicate loudness and smaller font sizes express quietness (Lee
et al., 2002). They also discovered that short up and down movements can mimic fear
and that vibration can express anger in animated text. As shown in Figure 4,
a) represents exuberance with large letters while b) represents lowness in feeling. The
findings of Lee et al. (2002) suggest that emotions that are expressed using the prosodic
elements of voice may possibly be represented using
animation properties that correspond to those elements. If in anger, for example, the voice
increases in volume and vibrates, then the animation representing that anger would
increase in size and vibrate.
Lee et al. (2002) make some intriguing observations; however, their techniques
are geared towards generating hundreds of different types of animations. These
animations are contained within a behaviour library, where some animations represent
affective states while others represent nothing more than the typographer’s creative state
of mind.
Figure 4 Example of font size representing volume (Lee et al., 2002)
Minakuchi & Tanaka (2005) agree with Lee et al. (2002) that the motion of the
kinetic text itself has meaning. They classify this motion into three sub-classes: physical
motion, physiological motion, and body language. Kinetic type that follows physical
motion copies natural patterns of movement such as bouncing or falling. Kinetic type
mimicking physiological motion mirrors human reactions such as turning red when
angry. Finally, animations that follow body language replicate motions such as
shrugging. Minakuchi & Tanaka’s findings further support the close relationship between
animation and emotions, and suggest that specific motion of text can be related to
specific emotions. The discovery of these relationships means that particular animation
patterns can be applied consistently to text to produce repeatable expression of emotions.
However, Minakuchi & Tanaka (2005) are focused on developing a kinetic typography
engine that automatically animates a variety of words, whether emotional or otherwise.
Also, the researchers have done little work in the area of validating how
accurately/effectively the animations they have produced represent the meaning of the
words/emotions. Therefore, the application of Minakuchi & Tanaka’s research to closed
captioning is quite limited considering the broad range of caption users.
Overall, kinetic typography needs to be explored further in the context of captioning,
especially to visually express the emotional aspects of sound information such as music,
tone of voice and sound effects. Unless the animations correctly characterize the
emotions, sound effects, or music, their application will be a detriment to users because
it will add clutter, confusion and distraction to the captions.
The examples from the film industry reveal that emotional impact can be achieved using
kinetic typography, but individual viewer reactions, and the exact nature of viewer
reactions to kinetic text, have not been formally studied. Furthermore, applying kinetic
text in the captioning industry faces many limitations. A film titling sequence (lasting only
minutes) can have a budget that exceeds $30,000 (Geffner, 1997) and can involve an
entire artistic team. Comparatively, the budget for captioning is around $1000 for a one-
hour show, which typically involves one captioner (RIT, n.d.). Also, a captioner typically
lacks the graphic design skills needed to manipulate animation properties into effective
animations. As a result, a simplified model of animation based on a limited number of
emotions is required. The simplified model will allow a captioner to identify emotions of
words and phrases and easily apply the appropriate animations to them in a cost-effective
way. Eventually an information system needs to be built that uses kinetic typography to
create animated captions that convey the missing sound and emotional elements.
Section 1.04 A Model of Basic Emotions for Captioning
What constitutes emotions is constantly being debated by psychologists and
psychology theorists, as the human emotional system is complex and difficult to study.
While very few concrete conclusions have been drawn about emotions, one of them is
that emotions are crucial to interpersonal communication. In fact, Ekman (1999) found
that those who are unable to convey emotions through facial expressions and speech
prosody have tremendous difficulty communicating and maintaining personal
relationships.
Theories about emotions can be loosely separated into two main streams of
thought. One maintains that all emotions are the same in essence and differ only in
intensity and pleasantness (Ortony & Turner, 1990). The second claims the existence
of a set of discrete, basic emotions, which are fundamentally dissimilar to each other.
Evaluating the basic emotion models proposed by a range of researchers, Ortony and
Turner (1990) claim that in most proposed models the basic emotions can be combined
to form all other emotions. Even within the group that agrees on the existence of basic
emotions, few concur on the number of basic emotions and their identity/definition.
Examining the research on basic emotions, it appears that the number of basic
emotions falls somewhere between two and several dozen. Mowrer, for example,
proposes only two basic emotional states: pleasure and pain (1960). Frijda identifies 18
different emotions in his model, which includes interest, wonder, humility, indifference
and desire, in addition to more commonplace ones like anger and happiness (1986).
Many of these researchers have their own distinctive reasons or criteria to determine
basic emotions. Mowrer, for example, only includes pleasure and pain as basic emotions
because they are unlearned emotional states (1960).
Among the plethora of conflicting theories and vast number of suggested basic
emotions, Ortony and Turner (1990) observe some congruency. They claim that the most
commonly identified basic emotions include anger, happiness, sadness, and fear (Ortony
& Turner, 1990). Ortony and Turner (1990) also point out that often the basic emotion
theorists are not proposing emotions that are different; they are merely using different
words to describe the same emotion. Words like frustration, irritation, aggression, and
resentment, for example, are more or less intense forms of anger (Ortony and Turner,
1990).
The most well-established psychological models of basic emotion include those
proposed by Plutchik (1980) and Ekman (1999), which suggest that all emotions can be
reduced to a set of five to eight primitive emotions. Based on cross-cultural facial
expressions, Ekman and Friesen (1986) derived a five-emotion model that includes the
four overlapping, common emotions plus surprise. Their justification suggests that basic
emotions are episodic and brief in nature, felt as a direct result of an antecedent, and that
each unique emotion results in a common set of facial expressions shared across
cultures. Plutchik (1980), like Ekman & Friesen (1986), acknowledges the existence of a
set of emotions that combine to derive other emotions. Plutchik’s (1980) emotion model
includes acceptance, anger, anticipation, disgust, joy, fear, sadness, and surprise.
Due to resource constraints, such as lack of money and artistic talent, and the fast
turn-around time required of captioners, the process of adding emotive animations to
captions needs to be simple and easy to use. As a result, for my research, a simplified
model of emotion to characterize the emotional elements present in television and film
dialogue is used. A four-emotion model using the four common emotions suggested by
Ekman was considered initially but a disgust category (suggested as a primitive emotion
by Plutchik, 1980) was added to the emotion model because it appeared once (but
prominently) in the example video content and required a unique animation style.
Chapter 2. Model of Emotion Used in this Thesis
The literature on the relationship between specific animation and emotions is very
limited, especially when applied to captioning. Thus, a design-oriented, intuitive
approach was chosen to uncover the linkage between animation properties and emotion.
The intuitive approach was spearheaded by an experienced graphic designer and a design
team. The design process consisted of generating a set of alternatives, selecting one
concept based on group consensus, applying that concept to captioning content,
refining the concept and then creating a final production. There were natural limitations
to this creative process; however, it seemed to be an appropriate method at this initial
stage.
The experienced graphic designer specialized in animated text and its ability to
evoke emotions. His interest in this particular project was partly due to his own hearing
loss. My role in the research at this initial stage was to utilize captioning, animated text,
psychology and typography literature to support the creative process. I was also
responsible for the more technical aspects of the initiative, such as using multimedia
software to actually construct the animations and applying them to video.
Some animation movements have been found to suggest specific emotions,
especially those emotions that can be easily conveyed by voice intonation. For instance,
Juslin & Laukka (2003) found that rising vocal contours were associated with anger, fear
and joy. Lee et al. (2002) also found that emotions expressed using the prosodic
elements of voice can be represented using animation properties that correspond to those
elements. According to the limited existing literature, animation motions that follow
these voice patterns appear to consistently suggest the same emotions. These shared
properties created a good starting point for the generation of animation ideas for the
graphic designer and the research team.
At first, the research team focused on generating an array of alternative example
concepts. Each concept was roughly sketched to produce many simple “test” animations.
Some of the “animations” were fully rendered using software, while other concepts
remained in mock-up form. During the idea generation period, my role was to take the
designer’s mock-ups and create the test animations. I also used literature to ensure that
the animation examples were readable, legible and comprehensible as captions. The
period of idea generation lasted three weeks.
Next, the example concepts were critically examined and only the strongest were
chosen for further development. My role at this critical period was to evaluate and
reject test concepts based on caption-user preferences reported in the literature. Several
iterations later, the chosen examples were applied to video samples and the animated
clips were assessed by the full research team consisting of me and two other individuals:
an experienced caption researcher and a deaf consultant from the Canadian Hearing
Society. Critical input from the research team guided the selection of the final two
concept styles.
Once agreement was reached on which alternatives to adopt, they were applied to the
text captions of two one- to two-minute examples of video content. At this point, my job
was to create the animations using software and ensure that the animations worked as
captions. During the animation creation process, it appeared that some emotions did not
lend themselves easily to animation. For instance, the motion representing sadness was
not consistently rendered and accurately interpreted by the group. In one example,
motion was used to move type “up” on the page to complement the dialogue “we’re all
going to die!”, delivered with rising pitch. The motion was meant to convey that sense of
“dying” but the link was too tenuous to be understood. To correct this, we relied on
linguistic cues instead, resulting in a sadness animation that moved downward, mimicking
sadness and falling intonation. These and other ideas were then developed and applied by
the designer and me and evaluated by the research team until there was agreement on the
final animations.
Section 2.01 Framework of emotive captions
Once two content pieces were captioned using the expert’s approach, a framework
was created, relating animation properties to the set of basic emotions as described by
Ekman (1999) and Plutchik (1980). At this point I was involved in extracting the
designer’s reasons behind the animations and matching them with the emotions. The
emotions surprise, anticipation and acceptance did not produce unique animation
properties but were expressed within the other five. As a result, only four emotions,
anger, fear, sadness, and happiness, comprised the framework. There was one instance
of disgust identified, but the animation for that instance followed the physical motion of
the character. Disgust was left out of the framework because its animation appeared to
be unique to a specific situation and unrepeatable.
The animation properties consist of a set of standard kinetic text properties. These
are: 1) text size; 2) horizontal and vertical position on screen; 3) text opacity or how see-
through the text is; 4) how fast the animation moves/appears/disappears (speed/rate); 5)
how long it stays on the screen (duration); 6) a vibration effect; and 7) movement from
baseline. Vibration is defined as the repeated and rapid cycling of minor variations in size
of letters and appears as “shaking” of the letters.
All of the properties are defined for three stages of onscreen appearance: 1) onset,
when the text first appears or begins to animate; 2) the effect, occurring between the
onset and offset stages; and 3) offset, when the animated text stops animating or
disappears from the screen. In addition, the intensity of the emotion can affect all of the
properties. For example, high intensity anger has a faster appearance on the screen and
grows to a larger text size in the effect stage than text showing a lower intensity anger
condition (as suggested by Lee et al., 2002). For high intensity anger, the text will also
appear considerably larger than the surrounding text. Most of these properties are
defined relative to the original size of the non-animated text, with default values set by
the captioning standard that is applied.
Table 1 summarizes the property descriptions for each primitive emotion. Where
an animation property is not defined, a default value would be used. For example, where
onset speed is not specifically defined, this would be the default onset speed defined by
the captioning software. The framework also allows for other emotions to be added and
the related animation properties defined accordingly.
Table 1: Summary of the relevant animation properties for fear, anger, sadness, happiness

Fear
  Relevant Properties (Effect):
    Size: Repeated expansion and contraction of the animated text for the duration of the caption.
    Rate: Expansion and contraction occur at a constant rate (e.g. 5 times per second).
    Vibration: Constant throughout the effect.
  Intensity of Effect:
    Low: Size of animated text is the same as non-animated text at onset. Low level of vibration.
    High: Size of animated text is larger than non-animated text at onset; the higher the intensity, the larger the text size, up to 150% of original size. High level of vibration.

Anger
  Relevant Properties (Effect):
    Size: Contract to smallest size (e.g. 10%), then expand to largest size (e.g. 150%), then expand and contract until original size is reached.
    Vibration: Occurs at largest size in cycle.
    Onset: Faster onset than default onset rate.
    Duration: Pause at largest size in cycle, where vibration occurs.
  Intensity of Effect:
    Low: Size of effect is smaller. Slower onset.
    High: Size of effect is larger. Faster onset.

Sad
  Relevant Properties (Effect):
    Size: Vertical scale of text decreases to 50%.
    Position: Downward vertical movement from baseline.
    Opacity: 50% decrease in opacity.
    Offset: Slower offset than default offset rate.
  Intensity of Effect:
    Low: Faster offset; faster downward vertical movement.
    High: Slower offset; slower downward vertical movement.

Happy
  Relevant Properties (Effect):
    Size: Vertical scale of text increases.
    Position: Upward vertical movement from below baseline, following a fountain-like curve.
  Intensity of Effect:
    Low: Slower onset. Offset text size is smaller.
    High: Faster onset. Offset text size is larger.
Figure 5 and Figure 6 illustrate these concepts for a high intensity anger caption
and a low intensity fear caption with the most salient properties identified. The
animations would appear during the time that the captions were present in the video
material.
Figure 5: Example of high intensity anger over four frames. Frame 1 (onset): starting
size; frame 2: vertical scale of text decreases; frame 3: vertical scale of text increases;
frame 4: vertical scale of text decreases.
Figure 6: Example of low intensity fear. Initial text size is default size; the text then
expands and contracts rapidly, with constant vibration, for the entire duration that the
text is on the screen.
Animations created using this framework could be applied to words, sentences,
phrases or even single letters (grain size) in a caption to elicit the desired emotive effect.
The intended emotive effects of the animations seem to be consistently applicable for a
range of different typefaces (e.g., Frutiger, Century Schoolbook).
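To illustrate how this framework might be operationalized in software, Table 1's entries can be encoded as a lookup structure queried by emotion and intensity. This is a minimal sketch of my own; the class and function names are hypothetical and do not describe the software actually used in this thesis:

```python
# Hypothetical encoding of the Table 1 framework; not the thesis software.
from dataclasses import dataclass

@dataclass
class AnimationSpec:
    """One framework entry: effect-stage properties for an emotion."""
    size_effect: str        # how text size changes during the effect stage
    vibration: bool         # "shaking": rapid minor variations in letter size
    onset: str = "default"  # defaults come from the applied captioning standard
    offset: str = "default"

FRAMEWORK = {
    "fear":  AnimationSpec("repeated expansion and contraction", vibration=True),
    "anger": AnimationSpec("contract to ~10%, expand to ~150%, settle at original",
                           vibration=True, onset="faster than default"),
    "sad":   AnimationSpec("vertical scale decreases to 50%, moves below baseline",
                           vibration=False, offset="slower than default"),
    "happy": AnimationSpec("vertical scale increases, fountain-like upward curve",
                           vibration=False),
}

def peak_scale(emotion: str, intensity: float) -> float:
    """Peak text size relative to non-animated text, scaled by intensity (0..1).
    Capped at 150% of original size, as the framework specifies."""
    if emotion in ("fear", "anger", "happy"):
        return min(1.0 + 0.5 * intensity, 1.5)
    return 1.0  # sadness shrinks rather than grows

print(peak_scale("anger", 1.0))  # full-intensity anger peaks at 150%
```

Encoding the framework as data rather than hand-crafting each animation is what would allow a captioner to apply repeatable emotive effects without graphic design expertise.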
Chapter 3. System Perspective
The system for creating captions is a manual process driven by the choices of the
captioners and broadcasters. As can be seen from Figure 7, the conventional captioning
model has visuals and audio as inputs and visuals, audio and text captions as outputs. The
audio input includes all the components discussed in the literature review that combine to
make the television viewing experience complete: speech prosody, sound effects,
dialogue and music. In order to convert inputs to outputs, the captioner transforms the
inputs by transcribing, interpreting, describing and summarizing the dialogue and some
of the other sound elements into text captions (DCMP, n.d.). The resulting output
includes the text equivalent of the dialogue and does not include most of the other sound
components. The dashed lines outline the components of audio that may be missed by
hearing impaired viewers or viewers otherwise unable to access sound (because, for
example, they are at a gym or in a bar).
My hypothesis is that caption users unable to fully access the non-dialogue
audio components of the output can compensate for the missing information with the
introduction of animated captions into the conventional system (as shown in Figure 8). In
the proposed system, the inputs are expanded to include the input of the director/film-
maker/content-creator and the emotion model outlined in Chapter 2. The captioner can
then use these two added inputs to create animations that supplement the missing
audio elements. The emotion model allows the captioner to create meaningful
animations, reducing the amount of descriptive work.
Figure 7 Conventional Captioning System
Figure 8 Captioning System, Proposed
Figure 9 a) schematically represents the tasks the captioner carries out to convert
the inputs into the outputs, based on the captioning process described in DCMP (n.d.).
The captioner first creates a working copy from the master copy of the show, then
watches the program and transcribes the video. Following that, the captioner prepares
the transcript for broadcast by:
a) Paraphrasing or truncating the dialogue/text,
b) Adding text descriptions, and
c) Eliminating redundant/unnecessary text.
Once the captioner has completed the process, the captions are timed with the
dialogue based on the time code and set at a pre-determined presentation rate. Once the
captions and video are matched, it is also the captioner’s job to review his work, merge
the video and the captions and produce a copy for broadcast.
Figure 9 a) also indicates the steps where the captioner makes crucial editorial
decisions during the captioning process. Firstly, the captioner decides how the transcript
for a show will be broken up into phrases and presented to the viewers. He then
determines whether dialogue needs to be paraphrased or “unnecessary” words must be
removed to fit into a short timeframe. The captioner decides which sound effects, music
or prosodic elements are vital enough to be described and the words that should be used
to describe them (DCMP, n.d.). These decisions can often be very important ones and
have the power to change the meaning of a particular show. Incorrectly paraphrasing
fast-paced dialogue can dramatically alter the tone of a show. Similarly, omitting a
particular sound description, because it is deemed unnecessary by the caption creator, can
significantly impact the story.
The directors of captioned shows and movies have failed to realize that captions
are not merely tools that aid understanding of dialogue. Captions are representations of
the director’s work. Those viewers accessing sound via captions interpret a large portion
of the message and meaning of the show or film based on the decisions made by the
caption houses. It is possible for audiences to dislike or misinterpret a movie based on
substandard caption work.
Because closed captioning is currently relatively simple, the decision making required of
caption houses is not as extensive. If the animated caption model were implemented in the
current captioning system, the responsibility of the captioner would increase considerably.
In addition to all the decisions the captioner currently makes, he would have to determine
the emotions felt by the characters. The only effective and accurate way of incorporating
emotive animations into captioning would be to involve the show creators in the
process.
Figure 9 b) describes the proposed, or potential future, process in which the
animated captions are incorporated into the output. Unlike the current process, the
proposed process requires much more input from the director/creator of the show. The
show creator (or a subject matter expert for the show, such as the script writer) would
advise the captioner on how to break up the text into phrases, provide notes on the
emotions and intensity ratings of dialogue, sound effects and music, and indicate how
sound effects and music are to be described. The captioners in the proposed system would
use the director’s notes and the emotion model to create the appropriate animations. The rest of
the proposed process is similar to the current process except that at one point the director
or a representative of the director will review the captions with the captioner before
broadcast.
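The director's notes envisioned in the proposed system could take the form of timecoded annotations that the captioner walks through alongside the transcript. The record structure below is a hypothetical sketch; none of the field names come from the thesis:

```python
from dataclasses import dataclass

# Hypothetical structure for a director's annotation in the proposed system:
# a timecoded note giving the emotion and its intensity for a span of
# dialogue, a sound effect, or music. Field names are illustrative only.
@dataclass
class DirectorNote:
    start_s: float          # start time within the clip, in seconds
    end_s: float            # end time, in seconds
    kind: str               # "dialogue", "sound_effect", or "music"
    emotion: str            # e.g. "fear", "anger"
    intensity: int          # e.g. 1 (mild) to 5 (extreme)
    description: str = ""   # how a sound effect or music should be described

def notes_for_captioner(notes):
    """Sort notes by start time so the captioner can walk them in order
    alongside the transcript when creating the animations."""
    return sorted(notes, key=lambda note: note.start_s)
```

A captioner's tool could then pair each note with the corresponding transcript phrase and look up the matching animation in the emotion model.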
Broadcasters are legally required to provide captioning in order to renew their
licenses (Downey, 2007), thus their focus is on providing captions as cheaply as possible.
If show creators can appreciate that approximately 100 million caption users (National
Captioning Institute, 2003) are accessing and interpreting their works through inferior
caption work, then they might be more willing to participate in the captioning process.
Figure 9 Captioning Process: Conventional and proposed
9 a) Current captioning process
9 b) Proposed captioning process
Chapter 4. Method
In order to begin to understand the impact of kinetic typography in captioning and
to verify and refine the framework outlined in Chapter 2, the following research questions
are proposed:
1. Can viewers interpret/understand/appreciate the animated text captions?
2. Can emotions be represented using animated text in captions?
3. What properties of animation can be used to create emotive animated text?
The first question seeks to determine if typical viewers of captioning, either hearing
or hearing impaired, are able to watch and understand a show captioned with animated
text. Before implementing a novel form of captioning, it is imperative to know if
audiences actually understand, accept and enjoy watching the new style.
The purpose of this research is to improve access to the intended emotional
information contained in the non-dialogue sound of television by expressing those
emotions through movement of text. As such, I attempt to measure the level of
understanding of those emotions within the context of an actual narrative or show. It
should be noted that measuring understanding of emotions is a difficult and complex
process. The animations for a show do not exist by themselves; they are supplemented
by facial expressions and residual sounds that may interfere with people’s
understanding. Furthermore, the study attempts not to discover how the participants
feel while watching a show, but to measure their understanding of the characters’
emotions, which is a difficult task to accomplish.
It is still important to attempt to measure the participants’ understanding of the
characters’ emotions, because the main limitation of kinetic typography research thus
far has been that reactions to animated text were studied out of the context of
television and film. My research endeavours to place the participants’ understanding of
emotions conveyed through animation in the context of watching television or film.
The second research question relates to whether the same properties of animation
applied to different words and phrases can produce a consistent emotional effect on
viewers. For instance, I tried to determine if “angry words” with the same angry
animation properties are consistently identified as angry by the participants. I also tried
to determine whether variations in the intensity of those properties still produce
identification of the same emotion.
Finally, the last question examines which animation properties (and with what values)
can be used to create the desired animations. The focus of this question is on the
relationship between animation properties such as size, duration and vibration rate (Lee
et al., 2002) and specific emotions. The intent of this question is to construct and refine an
emotion model that can be used to generate animations for captioning.
In order to evaluate and refine the proposed emotion model and to analyze
users’ opinions of animated captions, a 2x2x3 factor empirical study was carried out
involving hearing impaired and hearing participants, two different pieces of content and
three different caption styles. One of the caption styles was conventional closed
captioning, which served as the control for the study.
This study applied a mix of quantitative and qualitative methods, as suggested in
Creswell (2003). The study used questionnaires, verbal protocol and probing interviews
to determine user opinions, understanding and interpretation of the emotions expressed
through the captions, and attitudes and preferences regarding the animated captioning
techniques. Data analysis consisted of thematic analysis of verbal protocol and interview
data, and statistical analysis of the written questionnaire responses. Ethics approval was
sought and received prior to commencing the study and the approval letter can be found
in Appendix A. The deaf and hard of hearing (HOH) test participants were recruited via
the hard of hearing association listserv and they were paid a convenience fee of $30 for
their participation.
Since the study was exploratory and the population of people who are HOH is small
and relatively difficult to recruit, the number of HOH subjects was limited to 15. Ten
hearing subjects were recruited for comparison. Ten of the participants were male and 15
were female. The ages of the participants ranged from children to seniors with two under
the age of 18, three between 19 and 24, five between 25 and 34, four between 35 and 44,
seven between 45 and 54 and four participants between 55 and 64. The distribution of
ages is shown in Figure 10. Nineteen of the participants had college or university
training, two (the under 18s) completed elementary school, and four completed high
school.
Figure 10: Age distribution
Although five of the participants were deaf, only one was oral deaf. Fourteen of
the 15 deaf and HOH participants were able to communicate through speaking. Since all
of the participants (except the two children under the age of 18) had completed high school,
their literacy level was considered high enough to understand verbatim captioning.
Section 4.01 Content used in this study
marblemedia (sic), a production company partner, provided a children’s television
program called Deaf Planet to be used for the captioning study. Deaf Planet is about an
average man from earth named Max, who crash lands on a planet inhabited by a
population of deaf individuals. On this deaf planet, Max befriends a deaf female named
Kendra. Each episode of Deaf Planet shows Max and Kendra having different adventures
where they discover facts about science. Since Kendra is deaf, she uses American Sign
Language (ASL) to communicate with the other characters in the show. Max’s robot,
named Wilma, interprets between the hearing and deaf characters. Deaf Planet was
chosen as the test material for two primary reasons. Firstly, since marblemedia creates
shows for hard of hearing and deaf audiences, they were more willing to incorporate the
new caption model outlined in Chapter 3. They were willing to work with the research
team to provide guidance about the characters’ emotions and review the finished
animations. Secondly, copyright law restricted the type of content that could be used in
the study: the permission of the content creator was required before showing the clips to
the participants. marblemedia, as a partner in previous research endeavours, permitted the
use of Deaf Planet in the study and in publications. In addition to copyright issues, I
needed a high resolution copy of the content in an uncompressed format, because
each time a clip was rendered with layers of text, the quality would deteriorate. The
research team had a strong enough relationship with marblemedia to procure such high
quality video.
The sign language content of the show necessitated the exclusion of all
participants trained in ASL, since they would likely rely on the signing instead of the
captions for comprehension. As a result, only non-signing deaf and HOH participants
were invited to take part in the study.
Two different clips of Deaf Planet were chosen for animated captioning. The
clips were chosen based on having a representative range of emotional content and sound
effects as determined by a committee (including the artist who subsequently created the
animations). The content needed to contain at least five of the basic emotions identified
by Ekman (1999) and Plutchik (1980), with a minimum of one example each. Two
seasons of Deaf Planet episodes (26 in total) were viewed before two short clips from
two different episodes (To Air is Human and Bad Vibrations) were selected.
In “To Air” (TA) the main characters are swallowed by a giant fish and
experience varying degrees of fear, anger, sadness, happiness and disgust. In “Bad
Vibrations” (BV) two male characters compete for the attentions of a female character
and experience emotions such as fear and anger. BV also contains three sound effects:
whirring (the start of a space-craft engine), rumble (the earth shaking) and plop (snow
falling on a character). As an informative, learning-oriented show, geared towards young
children, Deaf Planet uses fairly simple and exaggerated expressions of emotion and
sound effects. As a result, it was relatively simple to identify and categorize the emotions
in the test clips. Before proceeding with the animations, the categorized emotions were
verified by marblemedia.
The clips were captioned using two kinetic typography techniques, labelled
enhanced captioning and extreme captioning. Combined with conventional closed
captioning, this produced a total of six videos (two clips x three styles). Each of the test
clips was short, lasting between 1 minute 30 seconds and 2 minutes. Enhanced and
extreme captioning are described in more detail later in this section.
A short, 30-second training clip was also produced using closed captioning to
orient the participants to the study. The training clip introduced the characters and the
premise of the show and allowed the test participants to adjust their environment
(volume, proximity to the television) according to their needs. Comments made by the
participants during the training clip segment were not used during data analysis.
The caption styles chosen for testing were divided into three categories: enhanced
captions (EC), extreme captions (XC) and conventional closed captions (CC).
Animations were applied to the captions to emphasize the emotional content of words or
phrases, to indicate background noise or sound effects and to denote emphatic words. The
emotive animations were developed through an intuitive approach (described in Chapter
2), where an experienced graphic designer and artist created animations that he thought
(in his artistic experience) would convey the five basic emotions: anger, fear, sadness,
happiness and disgust. Many of the artist’s choices corresponded with literature that
connected prosodic features (e.g. pitch and volume) to animation properties (e.g. size and
vibration). The animations were iteratively reviewed and revised by a committee until
consensus was reached regarding their effectiveness in portraying the emotions. The steps
in developing the emotive captions are outlined in detail in Section 2.01 in Chapter 2.
Figure 11 shows examples of each type of caption style used in this study. In the
enhanced captioned version, the animations were intended to be unobtrusive and all
words were placed at the same location on the screen at the bottom. The static placement
of the captions represents the typical location of conventional captions.
The extreme captions included the same animations as the enhanced captions but
the animated words and phrases were placed dynamically on the screen. In addition, the
extreme caption animations were much larger than the enhanced version. The graphic
designer chose this stylistic interpretation to convey his own artistic expression and
emulate a new style of text display found in advertising.
Conventional closed captions were also produced for the videos as a control,
to compare the emotive styles of captioning to what is currently used in the
industry. The conventional captions were produced using the services of an online
captioning company and are shown in Figure 11c.
a. Enhanced Caption b. Extreme Caption c. Conventional Closed Captions
Figure 11: Examples of enhanced, extreme and conventional closed captions
Section 4.02 Data Collection
(a) Pre-Study Questionnaire
Before viewing the videos, the participants were asked to complete a pre-study
questionnaire. The questionnaire was divided into three basic parts. The first section
contained five questions and documented demographic information: hearing level
(hearing, HOH or deaf), gender, age, educational level, and occupation. The demographic
information was collected to assess whether the age, education level and occupation of the
participants affected their reaction to the animated captions.
The second part of the questionnaire had four questions and sought to ascertain
the participants’ television viewing habits. The first question asked how much
television and film they watched in a week, with six choices ranging from less
than 1 hour to more than 20 hours. Questions two and three asked
whether the participants watched television by themselves or with others, using a five
point Likert scale with categories of Always, Frequently, Sometimes, Seldom, and Never.
Question four asked how often the participants discussed the television program they
were watching with others, using the same Likert scale.
The third section of the questionnaire focused on how the participants used
conventional closed captioning and asked their opinions on the current state of
captioning. Section three contained six questions in total. The first question asked how
often the participants watched television with captioning. The second question asked the
participants to check off a list of features they liked about captioning: rate of speed of
display, verbatim translation, placement on screen, use of text, size of text and colour.
The third question provided the same features as question two but asked the participants
to check off which ones they disliked. The fourth question provided the participants with
a list of potential elements missing from captioning and asked them to check those ones
they believed were important and wanted to see in the future. The elements for question
four were a) emotion in speech, b) background music/notes, c) speaker identification, d)
text displayed at the same time as the words being spoken, and e) adequate information in
order to understand jokes/puns correctly. The fifth question in section three asked the
participants to select five characteristics of closed captioning that needed to be changed
from a list of fourteen including a) faster speed of displaying captions, b) text description
for background noise, c) use of colours and d) use of different font and text sizes. The
final question in the questionnaire was open-ended and asked the participants to add any
comments not captured by the previous question.
The questionnaire was evaluated prior to the study by two people in order to
eliminate ambiguity and confusion. The pre-study questionnaire can be found in
Appendix B.
(b) Data collection during viewing of the content
The study was designed as a within-subjects 3 (caption styles) x 2 (clips of
Deaf Planet) factor design. The order in which the clips were viewed was randomized for
each participant in order to reduce learning biases resulting from watching the same clip
multiple times. Randomizing the order ensured that each captioned version was the first
one seen by at least some participants.
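As a sketch, this per-participant randomization could be implemented by shuffling the three caption-style versions of each clip independently (the function and the seed handling are illustrative assumptions, not the study's actual procedure):

```python
import random

STYLES = ["enhanced", "extreme", "closed"]       # the three caption styles
CLIPS = ["To Air is Human", "Bad Vibrations"]    # the two test clips

def viewing_order(participant_seed: int):
    """Return a randomized presentation order for one participant.

    The three caption-style versions of each clip are shuffled
    independently, so across participants each version ends up being
    someone's first viewing. The seed is per participant, so an order
    can be reproduced later.
    """
    rng = random.Random(participant_seed)
    order = []
    for clip in CLIPS:
        styles = STYLES.copy()
        rng.shuffle(styles)                      # independent shuffle per clip
        order.extend((clip, style) for style in styles)
    return order
```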
After the first viewing of the first version of each episode, participants were given
a checklist and asked to identify which emotions were present in the clip. Next, they were
asked to identify the checked emotions during a second viewing of the same version of
the show. At this point, they could also add new emotions recognized in the show but not
recalled during the checklist task. Since people usually watch a program only once, and
to avoid a learning effect, I sought to capture participants’ initial reactions to and
understanding of the emotive content without much time for reflection.
Participants then watched the remaining versions of the episode and commented
on the captioning style, the information presented and the show itself. The participants
were encouraged to voice their thoughts aloud while watching the clips. All sessions were
videotaped and transcribed. After each set of three viewings of an episode,
participants were asked to complete a post-viewing questionnaire to provide reactions to
and opinions of the three different captioning styles.
(c) Post-study questionnaire
The post-study questionnaire contained ten questions relating to the participants’
viewing experience. The first question asked participants to summarize their
understanding of the test clip. Next, on a Likert scale ranging from 1 to 5, the participants
were asked to rate the likability of various caption attributes for the enhanced captions
(color, size of text, speed of display, placement on screen, movement of the text, how
well the animations portrayed the character’s emotions, and text descriptions) where 1
was “Liked very much” and 5 was “Disliked very much”. They were then asked to rate
their affinity for the size, speed, screen placement and text descriptions of the
conventional captions using the same Likert scale. Participants were asked to rate
whether the enhanced captions contributed to their understanding of the show and the
emotions, their level of confusion about the show, and their level of distraction. Finally,
participants were asked for any concluding commentary or to expand on specific
comments made during the show that they thought were particularly noteworthy. The
post-study questionnaire was evaluated by two people before commencing the study in
order to reduce ambiguous or confusing wording. The post-questionnaire can be found in
Appendix C Post-Study Questionnaire.
(d) Video Data
During the viewing of the various clips, the participants were encouraged to think
out loud and provide commentary and opinions about the content and the captioning.
After watching each clip, participants were given a further opportunity to comment on the
captioning styles and the content. All sessions were videotaped to capture participants’
comments.
(e) Experimental setup and time to complete study
The pre- and post-study questionnaires required five to ten minutes to complete
and the total viewing time for the clips, including comments, was 15 to 20 minutes. In
total, all studies were completed within 45 to 60 minutes.
The test environment simulated a home television viewing experience. As a
result, all the captions were viewed on a television set instead of a computer screen. The
participants were allowed to adjust their proximity to the television and the volume level
prior to starting the study.
(f) Data Collection and Analysis
To determine user opinions of, and to compare, the different caption styles, the video
data was subjected to the thematic analysis approach proposed by Miles and Huberman
(1994). First,
themes were identified and defined by three independent reviewers through viewing a
representative sample of the video data. These themes were then distilled into five main
themes (see Table 2) by consensus of the three reviewers. Each theme was further
divided into positive and negative sub-categories.
Two independent raters then evaluated two sample videos using the themes and sub-
categories and the results were analyzed using an intraclass correlation (ICC) to
determine the reliability of the theme definitions. The single measures ICC between the
two raters was 0.6 or higher for all categories. The remaining video data was then
analyzed by a single rater.
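For illustration, a single-measures ICC can be computed from a one-way ANOVA decomposition. The sketch below uses the common ICC(1,1) form; this is an assumption, since the thesis does not specify which ICC variant was applied:

```python
def icc_single_measures(ratings):
    """One-way random, single-measures ICC (the ICC(1,1) form).

    `ratings` is a list of per-segment tuples, one rating per rater.
    Illustrative sketch only; the thesis does not state which ICC
    variant was used.
    """
    n = len(ratings)                  # number of rated video segments
    k = len(ratings[0])               # number of raters
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    segment_means = [sum(r) / k for r in ratings]
    # One-way ANOVA mean squares: between segments and within segments.
    ms_between = k * sum((m - grand_mean) ** 2 for m in segment_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, segment_means) for x in r
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

Perfect agreement between the raters yields an ICC of 1.0, so the study's threshold of 0.6 indicates substantial, though not complete, agreement on the theme definitions.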
Table 2: Description of the themes

Like/dislike enhanced captioning: Specific direct or implied comments about
preferring/not preferring enhanced captions.
  Positive: “I like this”; “this helps me understand better”
  Negative: “I don’t like this”; “this is confusing”

Like/dislike extreme captioning: Specific direct or implied comments about
preferring/not preferring extreme captions.
  Positive: “this helps me know who is speaking”
  Negative: “this is too imposing”

Comments about closed captioning: Any comments about closed captioning.
  Positive: “I like closed captioning better than the other two”
  Negative: “This doesn’t provide me with all the information”

Technical comments or comments unrelated to caption style: Captures all
comments unrelated to the captions.
  Positive: “This is an interesting show”
  Negative: “Captions block off too much of the screen”

Impact of captioning: Impact of the captions as a whole.
  Positive: “Captions are useful”
  Negative: “Captions are distracting”

Impact on genre: Any comments related to the appropriateness of animated
captions to a specific genre.
  Positive: “This would be great for comedies”
  Negative: “This would never work for dramas”

Impact on people: Comments about how captions would impact other people.
  Positive: “Kids would love this”
  Negative: “Hearing people will find it distracting”

Sound effects: All comments related to the sound effects.
  Positive: “I like the rumbling noise sound effect”
  Negative: “The word music should be replaced with notes”
The emotion identification portion of the video data was categorized separately
using signal detection theory as a basis for the design of the categories (Abdi, 2007). The
categories were designed to take into account correctly detected stimuli (“hit”),
undetected stimuli (“miss”), incorrectly detected absent stimuli (“false alarm”) and
correctly rejected absent stimuli (“correct rejection”) (Abdi, 2007).
As mentioned in Chapter 2, the artist who created the animations for each of the
emotions also categorized the emotions in the content with the help of a small committee.
The classified emotions were then approved by Deaf Planet’s production company,
marblemedia, to verify the validity of the classification. A “hit” or correct identification
was recorded when the user identified the correct emotion (as classified by the artist and
approved by the production company) experienced by the characters within a two second
timeframe of the occurrence of that emotion. The two-second window was selected to
account for reaction time. The model human processor proposed by Card,
Moran and Newell (1983) suggests that reaction time comprises the time it takes for an
individual to see (70-700 msec), perceptually process (50-100 msec), cognitively process
(25-170 msec) and respond with a physical action (30-100 msec) to a visual stimulus (a
total of 175 msec to approximately 1.1 sec). Unlike the model human processor, where the task was to
press a button when a correct stimulus appeared on a screen, the participants in this study
completed a more complex series of tasks: watch a program, process the emotions and
verbalize the emotions aloud. As a result, I increased the window of acceptable reaction
time to two seconds.
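The arithmetic behind that window can be verified directly: summing the minimum and maximum stage times from the cited model gives a range of 175 msec to 1,070 msec, which the two-second window comfortably exceeds.

```python
# Stage times in msec as (min, max), for the model human processor
# stages cited from Card, Moran & Newell (1983) in the text.
STAGES_MS = {
    "see": (70, 700),
    "perceptual_process": (50, 100),
    "cognitive_process": (25, 170),
    "physical_response": (30, 100),
}

min_total_ms = sum(lo for lo, hi in STAGES_MS.values())  # fastest modeled reaction
max_total_ms = sum(hi for lo, hi in STAGES_MS.values())  # slowest modeled reaction

print(min_total_ms, max_total_ms)  # 175 1070, i.e. roughly 0.175 s to 1.1 s
```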
Incorrect identification was noted when the user identified the “wrong” emotion
(compared to what was classified) but within the two second timeframe (+ or – 2
seconds). No identification or “miss” was noted when the viewer did not indicate that
there was emotion when it was present in the captioning. Extra emotion identification or
“false alarm” was noted when the viewer identified an emotion outside of the classified
emotion window. Correct rejection was not applicable in this case because the
participants were not asked to identify instances where there was no emotion. As a
result, d-prime, the index of the discriminability of a signal, could not be calculated;
future studies might consider a mechanism to incorporate this element. The d-prime
statistic could then be used to compare hit, miss, false alarm and correct rejection rates
between the enhanced and conventional caption styles as perceived by the users.
Table 3: Emotion identification categories

Exact: Correctly identified emotion within the correct time frame.
Incorrectly identified: Incorrectly identified emotion within the correct time frame.
No identification/Miss: Did not identify the emotion at all within the correct time frame.
Extra emotion identification: Identified emotion outside of the designated time frame.
Chapter 5. Results
Section 5.01 Questionnaire Data
Twenty-five participants completed one pre-study and two post-study
questionnaires during the session. The pre-study questionnaire results revealed that 11 of
25 participants (44%) spent one to five hours watching television per week, five of 25
(20%) watched six to ten hours, six of 25 (24%) watched 11 to 15
hours, one of 25 (4%) watched 16 to 20 hours and two of 25 (8%) watched
more than 20 hours of TV a week.
Thirteen of 25 participants (52%) watched television alone either always or
frequently. Eight of 25 participants (32%) said they sometimes watched television alone
and four of 25 (16%) said they seldom did. Fifteen of 25 participants (60%) had
conversations with friends and family while watching television, while 40% seldom or
never did. Twenty of 25 (80%) reported using closed captioning either always or
frequently and five of 25 (20%) said they never used captioning when watching
television. When asked what they liked most about captioning (participants could choose
more than one option), 40% said the rate of display, 40% chose verbatim translation of
the dialogue, 12% chose the use of text, eight percent chose the use of black and white
colours and four percent chose the “other” category.
When reporting what elements the participants thought were missing from closed
captioning, the most popular answer, with 44%, was the lack of emotion in speech,
followed by inadequate information for jokes and puns with 32%, speaker identification
with 16%, lack of synchronization with 16% and lack of background noise/music
with 12%.
Finally, the participants were asked to choose five characteristics of existing
closed captions that they would like to see changed. Most participants (84%) chose the
use of different fonts and text sizes, and 68% chose speaker identification. Less popular
were speed of captions with 44%, floating captions and use of colours with 36% and text
description for emotions with 32%.
A chi-square (χ2) analysis of response categories comparing responses with an
expected frequency of equal distribution among all answer categories for all questions in
the post-study questionnaire for both episodes, To Air is Human (TA) and Bad Vibrations
(BV), showed some interesting similarities. There was a significant difference between
response categories at the p < 0.05 level for most questions regarding the “likability”
ratings for the enhanced captions attributes for both episodes. The “likability” ratings, as
mentioned in Section 4.02, ranged from “Liked very much” (score of 1) to “Disliked very
much” (score of 5) on a 5-point Likert scale and were applied to colour, size, speed,
placement, movement and emotiveness of the captions. Table 4 and Table 5 show the
χ2 results and the means and standard deviations (SD) for each attribute
(“likability” rating) that showed significance, for episodes TA and BV respectively.
Table 4: χ2, degrees of freedom (df), mean and standard deviation for enhanced
captioning attributes for To Air.

Attribute                    χ2     df   Mean   SD
Colour of text               12.7   4    2.0    1.1
Size of text                 21.6   3    1.8    0.7
Speed of caption display     14.8   3    1.8    0.8
Placement on screen          14.8   3    1.7    0.8
Movement of text on screen   15.8   3    1.9    1.0
Portrayal of emotions        17.2   4    2.0    1.0
For To Air, 17 of 24 (71%) rated the colour as positive or very positive, five of 24
(21%) rated the colour as neither liked nor disliked and two of 24 (8%) reported disliking
the colour. Twenty-three of 25 (92%) reported the size as liked or very much liked but
eight percent said they were neutral or they disliked the size. Twenty-two of 25 (88%)
liked or very much liked the speed while 12% reported being neutral or negative towards
the speed. Twenty-two of 25 (88%) reported placement of text as positive or very
positive but 12% reported placement to be neutral or negative. Twenty-one of 25 (84%)
reported the movement of the text on the screen as positive or very positive while 16%
rated movement as neutral or negative. Finally, 20 of 25 (80%) reported the portrayal of
the character’s emotion using movement of the text on the screen as positive or very
positive, 12% remained neutral in this category while 8% said they either disliked or
disliked very much the portrayal of emotions.
Table 5: χ2, degrees of freedom (df), mean and standard deviation for enhanced
captioning attributes for Bad Vibes.

Attribute                    χ2     df   Mean   SD
Colour of text               10.5   2    2.0    0.6
Size of text                 15.5   3    2.0    0.8
Speed of caption display     19.6   3    1.9    0.8
Placement on screen          12.2   3    1.9    0.9
Portrayal of emotions        14.8   4    2.0    1.0
Text description             16.8   3    2.2    0.8
For Bad Vibes, 19 of 23 responses (83%) rated the colour as positive or very
positive and 17% rated the colour as neutral or negative. Twenty-one of 25 (84%)
reported the size as positive or very positive, while 16% reported being neutral or
negative towards the size. Twenty of 25 (80%) liked or very much liked the speed while
20% were neutral or negative towards the speed. Twenty-one of 25 (84%) reported
placement as positive or very positive while 16% were either neutral or negative towards
the placement. Nineteen of 25 (76%) reported portrayal of the character’s emotion using
movement of the text on the screen as positive or very positive, while 16% reported being
neutral and 8% reported being negative or very negative. Finally, 19 of 25 (76%) reported
the text descriptions as positive or very positive while 24% reported being either neutral
or negative towards text descriptions.
Also, the χ2 results between response categories for the question regarding
whether participants would be willing to discuss the show with hearing friends or family
showed a significant difference for both episodes [χ2(4, 25) = 13.9 for TA and
χ2(4, 25) = 12.8 for BV, p < 0.05]. The mean and SD for TA were 4.0 and 1.1 respectively on the
Likert scale of 1 to 5, where 1 was “not willing to discuss show” and 5 was “very
willing”. Eighteen of 25 (72%) respondents said that they were either willing or very
willing to discuss the show with hearing friends. The mean and SD for BV were 3.9 and
1.1 respectively; again, 18 of 25 (72%) respondents said that they would be willing or
very willing to discuss the show with hearing friends.
For Bad Vibes, a χ2 analysis between response categories showed a significant
difference for the question regarding the level of distraction caused by the enhanced
captions [χ2(4, 25) = 14.8, p < 0.05] (mean = 3.0, SD = 1.3). Thirteen of 25 (52%)
respondents said that they were very or somewhat distracted by the enhanced
captions, while 10 of 25 (40%) said that they were not distracted or only slightly
distracted by the enhanced captions.
When the descriptive statistics for the extreme captions were examined, there were
some conflicting results. First, 19 of 28 (68%) respondents very much liked or liked how
the emotions were portrayed with the captions. However, a majority of participants,
16 of 28 (57%), disliked the movement of the captions.
A t-test analysis comparing two independent samples was carried out between the
hard-of-hearing and hearing subject groups for each caption style of each different
episode for the questionnaire variables related to captioning attributes. For To Air, a
significant difference was found in the enhanced caption text descriptions attribute [t(23)
= -3.43, p < 0.05], where the average response for the hard of hearing (HOH) group was
1.64 (SD = 0.63) and for the hearing group was 2.45 (SD = 0.52). A significant difference was
also found in the extreme caption colour attribute [t(12) = -2.31, p < 0.05], where the
average response of the HOH group was 1.67 (SD = 0.52) and of the hearing group was 2.50
(SD = 0.76).
For Bad Vibes there was one significant difference, in the enhanced caption
placement attribute [t(23) = -4.29, p < 0.05] between the two groups. The hard of hearing
group very much liked/liked the placement of the enhanced captions (m = 1.57, SD = 0.85),
while the hearing group liked/felt neutral about the enhanced captions placement (m =
2.27, SD = 0.79).
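The group comparisons above are independent-samples t-tests. A sketch with hypothetical Likert ratings (1 = liked very much, 5 = disliked very much), assuming the pooled-variance form implied by the reported degrees of freedom (n1 + n2 - 2 = 23 for 15 HOH and 10 hearing participants):

```python
from scipy.stats import ttest_ind

# Hypothetical "likability" ratings: lower = liked more.
hoh     = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]  # 15 hard-of-hearing ratings
hearing = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3]                 # 10 hearing ratings

# Two-sided independent-samples t-test with pooled variance
# (ttest_ind's default, equal_var=True), giving df = 15 + 10 - 2 = 23.
res = ttest_ind(hoh, hearing)
print(res.statistic < 0, res.pvalue < 0.05)  # HOH mean is lower; difference significant
```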
A second t-test analysis was carried out to determine differences in the caption
attributes between the two episodes, "Bad Vibes" (BV) and "To Air Is Human" (TA) for
all participants. The movement of the text for the enhanced captions between these
episodes was found to be significant [t (48) = -2.2, p < 0.05] where mean = 1.7, SD = 1.0
for TA and mean = 2.3, SD = 1.1 for BV. The text description for conventional
captioning was also found to be significantly different [t(48) = -2.2, p < 0.05] for TA
(mean = 2.3, SD = 0.9) and for BV (mean = 2.8, SD = 0.9).
A t-test analysis between both episodes for hearing subjects only was also carried
out. There were no significant differences between any of the enhanced or conventional
caption attributes for the two episodes.
A final t-test was carried out between both episodes for only the hard-of-hearing
subjects. The movement of text attribute for the enhanced captions was significantly
different between the two episodes [t (26) = -2.5, p < 0.05] (TA mean = 1.4, SD = 0.8 and
BV mean = 2.3, SD = 1.1). The text description of enhanced captioning was found to be
significantly different between the episodes [t (26) = -2.3, p < 0.05] (for TA mean = 1.6,
SD = 0.6 and for BV mean = 2.3, SD = 0.8). The text description of conventional
captioning was also found to be significantly different between the episodes [t (26) = -
2.3, p < 0.05] (for TA mean = 2.0, SD = 0.88 and for BV mean = 2.9, SD = 1.1).
Crosstab analyses are applied to nominal data in order to uncover correlations
between two variables. Crosstab analyses were carried out on all of the participant data to
examine whether there was any correlation between the ratings of understanding
(provided on a 5-point Likert scale) and the various attributes of the enhanced captions.
For this analysis, data were separated into the HOH and hearing participant groups and
aggregated over both episodes to ensure enough data points.
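The crosstab-and-chi-square procedure can be sketched as follows; the two rating columns below are hypothetical stand-ins for the actual questionnaire variables:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical paired ratings for 10 participants (real data not reproduced):
# understanding: 1 = greatly increased ... 5 = greatly decreased
# size_liking:   1 = liked very much   ... 5 = disliked very much
df = pd.DataFrame({
    "understanding": [1, 1, 2, 2, 1, 3, 2, 1, 2, 4],
    "size_liking":   [1, 2, 2, 1, 1, 3, 2, 1, 2, 4],
})

# Cross-tabulate the two ordinal variables, then test for association.
table = pd.crosstab(df["understanding"], df["size_liking"])
chi2, p, dof, expected = chi2_contingency(table)
print(int(table.values.sum()), chi2 >= 0)  # every participant falls in exactly one cell
```

With samples this small, many expected cell counts fall below five, so the chi-square approximation is rough; the thesis's larger aggregated samples mitigate this somewhat.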
For the hearing group, there were six significant crosstab Pearson correlations found
for the enhanced caption attributes of:
1) Size of the text and overall understanding of the show [χ2(12, 22) = 24.8, p < 0.05],
where 13 of 22 respondents rated their level of confusion as low and liked the size of
the text;
2) Size of the text and level of distraction [χ2(12, 22) = 27.7, p < 0.05], where 8 of 22
respondents rated the enhanced captions as not distracting or slightly distracting and
liked or very much liked the size. However, 7 of 22 respondents reported an inverse
relationship: the enhanced captions somewhat or greatly distracted them, but they
still liked or very much liked the size of the captions;
3) Movement of the text on screen and understanding of the show [χ2(16, 22) = 34.3,
p < 0.05], where 12 of 22 respondents found their level of understanding increased or
greatly increased and liked or very much liked the animated text;
4) Movement of the text on screen and distraction [χ2(16, 22) = 36.2, p < 0.05], where 8 of
22 respondents found they were only slightly or not distracted and liked or very much
liked the animated text;
5) Portrayal of the emotions using the animated text and understanding of the show
[χ2(12, 22) = 30.5, p < 0.05], where 14 of 22 respondents found their level of
understanding increased or greatly increased and liked or very much liked the
animated text; and
6) Portrayal of the emotions using the animated text and distraction [χ2(12, 22) = 27.9,
p < 0.05], where 9 of 22 respondents found they were slightly or not distracted and
liked or very much liked the animated text. However, 8 of 22 respondents reported an
inverse relationship: they were somewhat or greatly distracted by the enhanced
captions but still liked or very much liked the emotions portrayed in the captions.
For the HOH group, there were seven significant crosstab Pearson correlations found
for the enhanced caption attributes of:
1) Size and overall understanding of the show [χ2(8, 28) = 18.9, p < 0.05], where 15 of
28 respondents rated their level of understanding of the video as increased or greatly
increased and liked the size of the text;
2) Size and level of understanding of the character’s emotions [χ2(8, 28) = 21.0, p < 0.05],
where 17 of 28 respondents rated their level of understanding of the emotions as
greatly increased or increased and liked or very much liked the size of the text;
3) Speed of the captions appearing on the screen and confusion with the show [χ2(8, 28)
= 21.6, p < 0.05], where 10 of 28 respondents rated their level of confusion with the
show as less or much less and liked or very much liked the speed of the animated text;
4) Speed of the captions appearing on the screen and understanding of the character’s
emotions [χ2(8, 28) = 21.0, p < 0.05], where 15 of 28 respondents found their level of
understanding increased or greatly increased and liked or very much liked the speed
of the animated text;
5) Speed of the captions appearing on the screen and distraction [χ2(8, 28) = 20.5,
p < 0.05], where 17 of 28 respondents found they were not distracted or only slightly
distracted and liked or very much liked the speed of the animated text;
6) Placement on the screen and understanding of the show [χ2(12, 25) = 24.9, p < 0.05],
where 16 of 28 respondents found their level of understanding increased or greatly
increased and liked or very much liked the placement of the animated text on the
screen; and
7) Placement on the screen and understanding of the character’s emotions [χ2(12, 28) =
23.0, p < 0.05], where 18 of 28 respondents found their level of understanding increased
or greatly increased and liked or very much liked the placement of the animated text.
Section 5.02 Video Data
As described in Chapter 4, the video data was coded using a thematic analysis
approach according to the themes and sub-categories in Table 2. The comments made by
participants were recorded and then analyzed using the pre-defined themes. Many
participants commented negatively and positively at various times regarding the same
theme. Repeated measures ANOVA and t-tests were applied to the video data to
compare the mean number of comments among the various groups. ANOVA and t-tests
were appropriate for this data set because it met the homogeneity of variance and
independence of cases criteria. However, because the data set was not normally
distributed, non-parametric tests such as Kruskal-Wallis and Mann-Whitney were also
applied. Since the results of the parametric and non-parametric tests were identical, the
more common parametric results are reported. Due to the number of tests, a Bonferroni
adjustment was applied to the 0.05 alpha level to derive an adjusted alpha of 0.02,
reducing Type I errors (incorrectly detecting a significant difference).
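This testing strategy can be sketched as follows. The comment counts are hypothetical, and the divisor of three tests is an assumption for illustration (the thesis reports only the resulting adjusted alpha of 0.02):

```python
from scipy.stats import ttest_ind, mannwhitneyu

# Hypothetical per-participant comment counts for two groups
# (the thesis's raw counts are not reproduced here).
group_a = [1, 2, 2, 3, 1, 2, 2, 1, 2, 3]
group_b = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4]

# Bonferroni: divide the family-wise alpha by the number of tests.
# Three tests is an assumption made for this sketch.
alpha_adjusted = 0.05 / 3

# Run the parametric test and its non-parametric counterpart and
# check that they reach the same conclusion at the adjusted alpha,
# mirroring the cross-check described above.
t_res = ttest_ind(group_a, group_b)
u_res = mannwhitneyu(group_a, group_b)
print((t_res.pvalue < alpha_adjusted) == (u_res.pvalue < alpha_adjusted))
```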
A t-test was carried out to determine whether there was a difference between the
hearing and HOH participants for each theme. A significant difference was found
between the hearing and HOH groups in the Impact on People Positive category [t(7) =
-3.0, p < 0.02]. The results of the video comment analyses somewhat mirrored the
questionnaire data, in which no significant differences were found between the hearing
and the HOH groups in most categories.
In order to determine whether the participants had a preference regarding the style
of captioning, the number of positive and negative comments for the three captioning
styles was analyzed using Repeated Measures Analysis of Variance (ANOVA). There
was a significant difference among the three caption styles in the positive [F (2, 39) =
9.32, p< 0.02] and negative categories [F (2, 41) = 4.11, p<0.05]. A Tukey post-hoc test
revealed significance for:
1) Positive comments between enhanced captioning and extreme captioning (p =
0.002).
2) Positive comments between enhanced captioning and closed captioning (p =
0.005).
3) Negative comments between extreme captions and closed captions (p = 0.018).
The Repeated Measures ANOVA revealed no significant differences for:
1) Positive comments between extreme captioning and closed captioning
2) Negative comments between enhanced captions and extreme captions.
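As a simplified sketch of this omnibus-test-then-post-hoc pattern, a one-way ANOVA with Tukey's HSD on hypothetical per-participant positive-comment counts is shown below (the thesis used a repeated measures design, which this independent-samples sketch does not capture):

```python
from scipy.stats import f_oneway, tukey_hsd

# Hypothetical positive-comment counts per participant for each style.
enhanced = [3, 4, 2, 5, 3, 4, 3, 2]
extreme  = [0, 1, 0, 1, 1, 0, 0, 1]
closed   = [1, 0, 1, 2, 0, 1, 1, 0]

# Omnibus test: is there any difference among the three styles?
f_stat, p = f_oneway(enhanced, extreme, closed)

# Post-hoc: which pairs differ?  tukey_hsd (SciPy >= 1.9) returns a
# matrix of pairwise p-values in the order the samples were passed.
post_hoc = tukey_hsd(enhanced, extreme, closed)
print(p < 0.05, post_hoc.pvalue.shape)
```

Running the omnibus test first guards against inflating the error rate with unplanned pairwise comparisons.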
Table 6 and Figure 12 compare the number of positive and negative comments
made by the HOH and hearing groups for each category. The greatest number of total
comments was in the enhanced captions positive category (75 comments by HOH and
hearing participants combined). This was closely followed by the total number of
negative comments for the extreme captions (68 negative comments by HOH and
hearing participants combined). The fewest total comments were in the extreme
captions positive sub-category, with only 11 comments between the HOH and hearing
participants, followed by 12 total comments for the Impact on People negative
sub-category.
As shown in Table 6, Table 7 and Table 8, 12 of 15 (80%) HOH participants (m = 2.8,
SD = 1.71) and 10 of 10 (100%) hearing participants (m = 4.2, SD = 1.93) made at
least one positive comment about the enhanced captions. The HOH group made a total
of 33 positive comments about the enhanced captions and the hearing group made 42.
Four of 15 (27%) HOH participants (m = 2.8, SD = 1.26) commented negatively on the
enhanced captions and one of 10 (10%) hearing participants (m = 4.0) commented
negatively about them. The total number of negative comments related to enhanced
captioning was 11 for the HOH group and four for the hearing group.
Seven of 15 (47%) HOH participants (m = 1.4, SD = 1.13) and four of 10
(40%) hearing participants (m = 1.8, SD = 0.96) made positive comments about
closed captioning, with total positive comments of 10 and seven respectively. Ten of 15
(67%) HOH participants (m = 2.2, SD = 1.14) and nine of 10 (90%) hearing
participants (m = 2.0, SD = 1.12) made negative comments about closed captioning,
with a total of 22 and 18 negative comments respectively.
Extreme captioning received the fewest positive comments of the three styles,
with only four of 15 (27%) HOH participants (m = 1.3, SD = 0.50) and five of 10
(50%) hearing participants (m = 1.2, SD = 0.45) commenting positively. The total
number of positive comments for extreme captioning was five for the HOH group and
six for the hearing group. Conversely, extreme captioning had the most negative
comments of the three styles. Eleven of 15 (73%) HOH participants (m = 3.6, SD = 1.69)
and nine of 10 (90%) hearing participants (m = 3.1, SD = 1.76) commented negatively
about the extreme captions. In total, the HOH group made 40 negative comments and
the hearing group made 28 negative comments about the extreme captions.
The technical category amalgamated comments about a number of elements: the
content (Deaf Planet), the use of the translucent background for the captions, comments
regarding the colour, size or speed of the captions, and the television screen. Eight of 15
(53%) HOH participants (m = 1.8, SD = 1.04) made positive comments in the
technical category, with 14 positive comments in total. Only one of 10 (10%) hearing
participants (m = 1.0) made a positive comment in the technical category.
The HOH group made 13 negative comments in the technical category with six of 15
(40%) of the participants commenting (m = 2.2, SD = 1.33). The hearing group made 17
negative comments in the technical category with 7 of 10 (70%) of participants
commenting (m = 2.4, SD = 1.40).
Comments regarding the sound effects were divided almost equally between
positive and negative. Six of 15 (40%) HOH participants made 16 positive comments
in total about the sound effects (m = 2.7, SD = 1.63). Five of 10 (50%) hearing
participants made seven positive comments about the sound effects (m = 1.4, SD = 0.55).
Eight of 15 (53%) HOH participants commented negatively about the sound effects,
with a total of 16 negative comments (m = 2.0, SD = 1.07). Four of 10 (40%) hearing
participants made eight negative comments about the sound effects (m = 2.0, SD = 1.41).
Comments in the Impact of Captioning category concerned participants’ opinions
about how the captions contributed to their television viewing experience. Eight of
the 15 (53%) HOH participants commented favourably about the overall impact of the
captioning on their entertainment (m = 2.3, SD = 2.05) and seven of 10 (70%) hearing
participants commented positively (m = 1.6, SD = 1.13) on the impact of captioning.
Conversely, nine of 15 (60%) HOH participants commented unfavourably about the
impact of captioning (m = 1.4, SD = 0.73), while five of 10 (50%) hearing participants
commented negatively on the impact of captioning (m = 1.4, SD = 0.55).
The Impact on People category represented data on how the participants thought the
animated (enhanced and extreme) captions would affect other people. Six of 15 (40%)
HOH participants commented that the animated captions would positively affect other
viewers (m = 1.2, SD = 0.41), with a total of seven positive comments. Three of 10 (30%)
hearing participants said that the animated captioning would affect others positively
(m = 2.7, SD = 1.15), with a total of eight positive comments. Six of 15 (40%) HOH
participants said the animated captioning would affect others negatively, with 11
negative comments in total (m = 1.8, SD = 1.33). Only one of 10 (10%) hearing
participants commented that the animated captions would affect others negatively,
with one comment in total (m = 1.0).
The final category measured was Impact on Genres, designed to determine
whether the participants thought that the emotive captions would be suitable for all
genres. Eleven of 15 (73%) HOH participants suggested that the emotive captions
could be applied to all genres (m = 1.6, SD = 0.92) and five of 10 (50%) hearing
participants suggested that these captions would be appropriate for all genres (m = 1.8,
SD = 1.10). Six of 15 (40%) HOH participants thought emotive captions would not be
appropriate for all genres (m = 1.8, SD = 1.17) and four of 10 (40%) hearing
participants thought emotive captions were not appropriate for all genres (m = 1.5,
SD = 1.00).
Table 6: Number of positive/negative comments by groups for the video data

Themes                        HOH People  HOH Comments  Hearing People  Hearing Comments  Total Comments
Enhanced Captions Positive    12          33            10              42                75
Enhanced Captions Negative    4           11            1               4                 15
Closed Captions Positive      7           10            4               7                 17
Closed Captions Negative      10          22            9               18                40
Extreme Captions Positive     4           5             5               6                 11
Extreme Captions Negative     11          40            9               28                68
Technical Positive            8           14            1               1                 15
Technical Negative            6           13            7               17                30
Sound Effects Positive        6           16            5               7                 23
Sound Effects Negative        8           16            4               8                 24
Impact of Captions Positive   8           18            7               11                29
Impact of Captions Negative   9           13            5               7                 20
Impact on People Positive     6           7             3               8                 15
Impact on People Negative     6           11            1               1                 12
Impact on Genre Positive      11          18            5               9                 27
Impact on Genre Negative      6           11            4               6                 17
Table 7: Descriptive statistics for HOH group in all categories of the video data analysis

Category                      N    Mean   SD
Enhanced Captions Positive    12   2.75   1.71
Enhanced Captions Negative    4    2.75   1.26
Closed Captions Positive      7    1.43   1.13
Closed Captions Negative      10   2.20   1.14
Extreme Captions Positive     4    1.25   0.50
Extreme Captions Negative     11   3.64   1.69
Technical Positive            8    1.75   1.04
Technical Negative            6    2.17   1.33
Sound Effects Positive        6    2.67   1.63
Sound Effects Negative        8    2.00   1.07
Impact of Captions Positive   8    2.25   2.05
Impact of Captions Negative   9    1.44   0.73
Impact on People Positive     6    1.17   0.41
Impact on People Negative     6    1.83   1.33
Impact on Genre Positive      11   1.64   0.92
Impact on Genre Negative      6    1.83   1.17
Table 8: Descriptive statistics for hearing group in all categories of the video data

Category                      N    Mean   SD
Enhanced Captions Positive    10   4.20   1.93
Enhanced Captions Negative    1    4.00   .
Extreme Captions Positive     5    1.20   0.45
Extreme Captions Negative     9    3.11   1.76
Closed Captions Positive      4    1.75   0.96
Closed Captions Negative      9    2.00   1.12
Technical Positive            1    1.00   .
Technical Negative            7    2.43   1.40
Sound Effects Positive        5    1.40   0.55
Sound Effects Negative        4    2.00   1.41
Impact of Captions Positive   7    1.57   1.13
Impact of Captions Negative   5    1.40   0.55
Impact on People Positive     3    2.67   1.15
Impact on People Negative     1    1.00   .
Impact on Genre Positive      5    1.80   1.10
Impact on Genre Negative      4    1.50   1.00
[Bar chart “Comparing Captioning Styles”: number of positive and negative comments
for each caption style. Enhanced: 75 positive, 15 negative; Extreme: 11 positive, 68
negative; Closed: 17 positive, 40 negative.]
Figure 12: Comparison of the three captioning styles by number of comments
Section 5.03 Emotion ID
Participants were asked to identify the emotions present in the first clip of the six
that they watched. These identifications were then analyzed according to five different
categories (four for the conventional captions): 1) correct identification within a
two-second window from the time the emotion occurred (to account for response
processing time); 2) identification of a different/incorrect emotion within the two-second
window; 3) missed identification; 4) emotion identified outside of the two-second
timeframe of when the animation occurred; and 5) emotions identified for non-emotion
animation (e.g., for sound effects).
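The scoring scheme above can be sketched as a small classification function. The event and response representations are hypothetical; the sketch covers only categories 1, 2 and 4, since categories 3 and 5 require the full log of animations and unanswered events:

```python
# Each emotion event: (onset time in seconds, emotion label).
# Each response:      (timestamp in seconds, emotion label).
WINDOW = 2.0  # seconds allowed for response-time processing

def classify(response, events):
    """Bin one response against the list of emotion events."""
    time, emotion = response
    for onset, label in events:
        # Category 1 or 2: response falls inside an event's 2-second window.
        if onset <= time <= onset + WINDOW:
            return "correct" if emotion == label else "incorrect"
    # Category 4: no event window contains this response.
    return "outside_window"

events = [(10.0, "fear"), (25.0, "happiness")]
assert classify((11.5, "fear"), events) == "correct"        # within 2 s, right emotion
assert classify((11.5, "anger"), events) == "incorrect"     # within 2 s, wrong emotion
assert classify((20.0, "fear"), events) == "outside_window" # no event within 2 s
```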
A t-test was used to assess differences in emotion ID between the hearing and HOH
groups. A significant difference was found between hearing and HOH participants for the
emotion ID of fear [t(23) = -2.11, p < 0.05], where the HOH group made an average of 1.7
identifications of fear (SD = 0.82) and the hearing group a mean of 2.9 identifications
of fear (SD = 1.7). No other significant differences were detected.
As shown in Table 9, the HOH group identified an emotion correctly 55% of the
time and the hearing group 50% of the time. The HOH group incorrectly identified an
emotion 6% of the time, the same as the hearing group. The HOH group missed an
identification 13% of the time, while the hearing group missed an identification 23% of
the time. The HOH group identified an emotion outside the animation boundary 12% of
the time, while the hearing group did so 13% of the time. Finally, the HOH group
identified an emotion during other animations 13% of the time, while the hearing group
did so 7% of the time.
Table 9: Average number of emotions identified in each category (% of total in brackets)

Group            Total       Correct emotion/  Incorrect emotion/  Missed emotion  Identified outside  Identified during
                 identified  correct time      correct time        completely      animation window    other animations
Hard of Hearing  67          37 (55)           4 (6)               9 (13)          8 (12)              9 (13)
Hearing          125         63 (50)           8 (6)               29 (23)         16 (13)             9 (7)
A Repeated Measures ANOVA was applied to detect any differences between the
type of emotion and its identification. A Bonferroni adjustment was applied to the alpha
level to derive an adjusted alpha of 0.02. Significance was detected in the correct emotion
identification category [F(4, 58) = 12.23, p < 0.02]. A post-hoc Tukey analysis revealed
the following significant differences:
1) Correct identification of fear versus anger (p = 0.005).
2) Correct identification of fear versus sadness (p < 0.001).
3) Correct identification of fear versus happiness (p < 0.001).
4) Correct identification of fear versus disgust (p < 0.001).
As can be seen from Table 10, anger was correctly identified 50% of the time,
mistakenly identified 4% of the time, missed completely 13% of the time, identified
outside of the animation window from non-caption-related cues (such as facial
expressions) 23% of the time, and identified during other animations (such as sound
effects) 10% of the time.
Fear was correctly identified 67% of the time, mistakenly identified 7% of the time,
completely missed 15% of the time, identified from non-caption-related cues 7% of the
time and from other animations 5% of the time. Sadness was detected correctly 36% of
the time, mistaken for another emotion 14% of the time, missed 32% of the time,
perceived based on non-caption cues 4% of the time and identified during other
animations 4% of the time.
Happiness was correctly identified 37% of the time, mistakenly identified 7% of the time,
missed entirely 41% of the time, never identified outside of the animation window, and
identified during other animations 15% of the time. Finally, disgust was identified 42%
of the time, never mistaken for another emotion, missed 8% of the time, detected outside
of the animation window 29% of the time and identified during other animations 21% of
the time.
Of all the emotions, fear was correctly identified more often than the others (67%
of the time when it occurred), while happiness was missed more often than the others
(41% of the time it occurred). Disgust was more often identified from non-caption-related
cues (e.g., facial expressions, gestures or voice) than any other emotion (21% of the time).
Table 10: Emotions versus type of identification (% of total in brackets)

Emotion    Total       Correct emotion/  Incorrect emotion/  Missed emotion  Identified outside  Identified during
           identified  correct time      correct time        completely      animation window    other animations
Anger      52          26 (50)           2 (4)               7 (13)          12 (23)             5 (10)
Fear       61          41 (67)           4 (7)               9 (15)          4 (7)               3 (5)
Sadness    28          13 (36)           4 (14)              9 (32)          1 (4)               1 (4)
Happiness  27          10 (37)           2 (7)               11 (41)         0 (0)               4 (15)
Disgust    24          10 (42)           0 (0)               2 (8)           7 (29)              5 (21)
A Repeated Measures ANOVA was applied to determine possible
differences in emotion understanding between the caption styles. No significant
differences were found. The conventional captioning clips did not have animations, so
the fifth identification category (emotions identified during other animations) shows
no counts. Table 11 shows the total percentage for each type of emotion identification
listed by captioning category (percentages shown in brackets). Emotions were correctly
identified most often with conventional captions (57%) but also most often incorrectly
identified (9% of the time). Emotion identification was missed most often with the
extreme captions (26% of the time).
As shown in Table 11, with conventional captions, emotions were correctly identified
57% of the time, incorrectly identified 9% of the time, not identified 16% of the time,
identified outside of the animation window 20% of the time, and never identified during
other animations (since conventional captioning did not contain any animations). With
enhanced captions, emotions were correctly identified 50% of the time, incorrectly
identified 2% of the time, not identified 19% of the time, identified outside of the
animation window 11% of the time and identified during other animations 19% of the
time. Finally, with extreme captions, emotions were correctly identified 47% of the
time, incorrectly identified 7% of the time, not identified 26% of the time, identified as a
result of non-animation-related cues 5% of the time and identified during other
animations 14% of the time.
Table 11: Caption category versus type of identification (% of total in brackets)

Caption style          Total       Correct emotion/  Incorrect emotion/  Missed emotion  Identified outside  Identified during
                       identified  correct time      correct time        completely      animation window    other animations
Conventional Captions  81          46 (57)           7 (9)               13 (16)         15 (20)             0
Enhanced Captions      54          27 (50)           1 (2)               10 (19)         6 (11)              10 (19)
Extreme Captions       57          27 (47)           4 (7)               15 (26)         3 (5)               8 (14)
Chapter 6. Discussion
Apart from animated titling in film and television, animated captions were
completely novel to all the participants, so the results of this study were difficult to
anticipate. In addition, since the method used to generate the captions was artistic,
audience reactions could not be predicted in advance. One desired
outcome of the study was to establish some future research directions to pursue on how
non-dialogue sound information can be better represented to users of closed captioning.
The audiences’ ability to understand the intention of the animated captions and their level
of acceptance of this approach were critical factors in achieving this outcome.
The first objective of the user study was to gain an understanding of the level of
acceptance of animated captions by the hearing and hearing impaired audience members.
The videotaped comments and questionnaire data were examined closely to gain insight
into this objective. The hard of hearing (HOH) and hearing viewers seemed to prefer the
enhanced captions to the conventional or extreme captions for both episodes. As seen in
Table 6, 88% (22 of 25) of participants made at least one positive comment about the
enhanced captions, with a combined total of 75 positive comments. Comparatively, only
44% (11 of 25) commented positively about conventional captioning and 36% (9 of 25)
commented positively about extreme captioning, with totals of 17 and 11 comments
respectively. The Repeated Measures ANOVA of the number of positive comments made
about each of the captioning styles was statistically significant. A Tukey post-hoc showed
a significant difference in user preference for enhanced captions over conventional
captioning (p = 0.005) and extreme captioning (p = 0.002). In the interviews
following completion of the questionnaire, participants made comments such as enhanced
captions “added a new dimension” and “I want to see more captioning like
this…someone should market it” that reinforced the results from the video and
questionnaire data. Participants liked that enhanced captions showed the inflection of
voice and emotions and that it “wasn’t flat” because it “added excitement.”
In contrast to the enhanced captions, the extreme captions and to a lesser extent
conventional captions were criticised far more. Nineteen of 25 participants commented
negatively about the conventional captions and 20 of 25 participants commented
negatively about the extreme captions, generating a total of 40 and 68 negative comments
respectively. Comparatively, only five people made negative comments about enhanced
captions, with a total of 15 negative comments. The Repeated Measures ANOVA
uncovered significance in the number of negative comments among the three captioning
style categories. A Tukey post-hoc revealed a significant difference
between the extreme captions and the conventional captions (p = 0.018). Participants
commented in the open discussion session that the extreme captions “does not add value”
and exclaimed “what was the point of that” while watching. They went as far as to make
comments such as “Deep 6 aka bury this one.” Comments about the conventional
captions were more moderate but showed noticeable preference for enhanced captions
through complaints like “it won’t be easy going back to the regular [conventional]
captions”. The extreme captions were described as overwhelming, distracting and too
bouncy by a majority of participants. The conventional captions were found to be less
exciting and less informative than the enhanced captions.
For the HOH group, the ratio of positive to negative comments for the enhanced
captions was 3 to 1 and for the hearing group, it was 10.5 to 1, providing additional
evidence of the preference trend. For the conventional and, particularly, the extreme
captions, the ratio of positive to negative was far less than 1, meaning many more
negative than positive comments were made. Given that 73% of the HOH participants
reported using conventional captions on a regular basis, it is surprising to find such strong
support for the enhanced captioning style.
The questionnaire results support the findings from the video data and tell a
similar story. Enhanced captions received a very positive response in all of the “caption”
attributes (Colour, Size of Text, Speed of Display, Placement on Screen, Text
Description, Movement of Text, and How Moving Text Portrayed Emotion), with ratings
of positive or very positive from more than 70% of all
participants. The questionnaire data also revealed that many of the participants in the
HOH group reported being less confused (54%), less distracted (61%) and better able to
perceive the characters’ emotions in the show (64%) while watching enhanced
captions. The open-ended comments and discussion after each show also support these
findings. Participants made comments such as “liked emphasis on words and bouncing to
match actions” and “dramatic improvement over traditional captioning, specifically
when trying to convey emotions in a show.”
Sixty-seven percent of the people reported liking the extreme captioning style
overall in the questionnaires; however, 57% of people reported disliking the movement of
extreme captioning text. The extreme captioning was bold, colourful and blended in well
with the children’s program. It is possible that the participants enjoyed the aesthetics of
the captions as an artistic piece but found the shaking letters difficult to read.
On the whole, video data, questionnaire data and free form comments indicate
that enhanced captions are perceived to be well-understood and accepted by the
participants. Enhanced captions are also identified by the participants as an improvement
over the traditional captions (at least in short examples). Additionally, people described
themselves as willing or very willing to discuss the show with hearing friends. The
willingness of the HOH participants indicates enough confidence in their level of
understanding of the show’s content to discuss it with hearing friends who can access the
sound through the audio channel. Communication difficulties can often exist between
caption-user and non-caption-user groups regarding show content. Hearing-impaired
caption users often believe that they do not have sufficient information about
the show, and thus they are reluctant to discuss it with hearing viewers (Fels et al.,
2005). Captions that seem to increase user comfort level with the content are an
important improvement towards more equal access to television and film content.
Another objective of my study was to uncover possible differences between the
HOH and hearing group in their reactions to the animated captions. A large portion of
caption users are not hearing impaired. Also, many hearing impaired viewers use captions
in the presence of hearing friends and family. Thus, analyzing and comparing the
reactions of the two groups was important to consider. Surprisingly, only the category of
“Impact on People” showed a significant difference between the two groups.
The data revealed that six of 15 HOH participants thought that hearing people
considered all captions (not just animated captions) to be disruptive to their television
viewing experience. The six HOH users had particular reservations about the acceptance
of the animated (enhanced and extreme) captions by their hearing peers. One HOH
participant said, for example, that hearing people would be greatly bothered and
distracted by the animated captions. Contrary to what the HOH participants believed,
100% of the hearing users reacted very positively to the enhanced captions (but not to the
extreme captions). The questionnaire data supported the video data and showed that in
the hearing group 61% of the people found that their level of understanding of the show
increased with the addition of the animated text. In other words, the moving text did not
impede the hearing participants’ understanding of the show but rather enhanced it.
Such mistaken assumptions can colour people’s ability to empathize with others, in this
case hearing television viewers with hard of hearing viewers and vice versa. This
discrepancy in perception may be worthwhile exploring in future research.
The questionnaire data showed another significant difference between the hearing
and HOH groups in the placement of the text on the screen for the enhanced captions.
The placement of the enhanced captions was near the bottom of the television screen,
which closely resembled that of closed captions. The HOH group seemed to like it very
much (mean = 1.5, SD = 0.75, where 1 is liked very much and 5 is disliked very much),
while the hearing group liked it less, although their mean still fell close to the liked
category (mean = 2.1, SD = 0.83). This difference in preference can be explained by
examining the caption viewing habits of the two groups. None of the hearing participants
reported always watching television with captions; they watched with captions only
“occasionally” (60%) or “never” (40%). In contrast, 73% of the HOH group reported
always watching television with captions. Since the HOH participants are more accustomed to this position than the
hearing participants, their preference for the familiar placement would naturally be
higher.
All the questionnaire responses were analyzed separately using crosstab analyses
to understand possible correlations between participant ratings of caption attributes and
comprehension attributes. These crosstabs were aggregated over two episodes and
grouped by hearing status to ensure enough data for each category and to detect any
differences between the two groups.
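The crosstab procedure described above can be sketched in a few lines of Python. The rating labels and responses below are hypothetical stand-ins for the actual questionnaire categories, used only to illustrate how co-occurrences of caption ratings and comprehension reports are tallied.

```python
from collections import Counter

def crosstab(pairs):
    """Count how often each (caption rating, comprehension rating) pair co-occurs."""
    return Counter(pairs)

# Hypothetical responses pairing a caption-speed rating with a comprehension report.
responses = [
    ("liked speed", "less confused"),
    ("liked speed", "less confused"),
    ("liked speed", "more confused"),
    ("disliked speed", "more confused"),
]

table = crosstab(responses)
print(table[("liked speed", "less confused")])  # prints 2
```

Grouping such tables by hearing status, as was done in the study, simply means building one table per group before comparing cell counts.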
In the HOH group, those who liked the speed of the enhanced caption
presentation also found that they were less confused (36%), less distracted (61%) and had
a better understanding of the characters’ emotions in the show (54%). Standard captioning
speed is recommended at 145 words per minute (Jensema, 1997), and in this study this
standard was used as the rate at which enhanced captions appeared. It appears that this
reading speed was suitable for this participant demographic. For the hearing group, the
crosstab analysis showed that 63% of hearing participants liked the portrayal of emotions
using enhanced captions and that this portrayal also increased their level of understanding
of the show. This positive correlation indicates that hearing viewers appeared to derive value
from viewing television shows and films with enhanced captions.
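As a rough illustration of the 145 words-per-minute standard used in this study, the display duration of a caption can be derived directly from its word count. The helper below is a hypothetical sketch, not the tool used to prepare the study captions.

```python
STANDARD_WPM = 145  # recommended captioning speed (Jensema, 1997)

def display_seconds(caption, wpm=STANDARD_WPM):
    """Return how long a caption should remain on screen at the given reading rate."""
    words = len(caption.split())
    return words * 60.0 / wpm

# A 12-word caption at the 145 wpm standard stays on screen for roughly 5 seconds.
caption = "it will not be easy going back to the regular conventional captions"
print(round(display_seconds(caption), 2))  # prints 4.97
```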
Having captions at the bottom of the screen is a conventional display style. It is
expected that regular users of captions would show a preference for this style. The
crosstab results showed that the placement of captions at the bottom of screen was
correlated with the participants’ understanding of the show and of the character’s
emotions. Participant comments showed that the familiar placement prevented the
viewers from hunting around the screen for the caption. Participants made strong
suggestions for placement by saying “leave the captions at the bottom”. In addition,
animated text captions that are placed in different locations on the screen were found to
be generally undesirable. Participants commented that “a little word here and there is
distracting” and that the unusual placement of words created “competition for attention.”
Fels et al. (2005) reported similar preferences expressed by participants in a study with
graphic captions: deaf and HOH users strongly disliked having captions located near the
speaker, because captions placed close to the face of a speaking person were interpreted
as an instruction to lip-read that person.
A more likely cause of this dislike of captions in different screen locations is
that more extensive eye movement, and thus more work, is required to read the dynamic
captions. In my study, not only were extreme captions placed in different locations on the
screen but also only some of the words from the captions were placed in those different
locations. In order to read the whole caption, the viewer had to focus on the full text at
the bottom of the screen and move their eyes to the moving text that was placed in the
different screen locations. Because the text was animated, it easily captured the
viewer’s attention, a common reaction to moving objects (Peterson & Dugas, 1972).
Having only some words animated at different locations on the screen diverted the user’s
attention from reading the caption sentences. Consequently, the user’s eyes
moved back and forth between the moving words and the static captions, increasing both
the visual and the cognitive work required. Therefore, I would recommend that
moving text in captions be contained within the full caption text and located at
the bottom of the screen.
The crosstab analysis also showed a positive relationship between the size of the
captions and participants’ level of understanding of the show. In their study on people’s
preferences for captions, the National Center for Accessible Media found that people
preferred “medium” sized captions, where medium was defined as “roughly equivalent to
the 26 lines of the TV picture” or 26 pixels (NCAM, n.d.c, section Test Element: Caption
Size, paragraph 1). The size of the enhanced captions in this study began at 25 pixels and
generally increased with the various emotions, closely following the recommended font
sizes. For emotions such as anger, fear and happiness the animation became larger; only
sadness became smaller in vertical size. It is not surprising, then, that this size was
positively correlated with participants’ levels of understanding. No comments were made on
the impact of changing the caption sizes. Further research is required with more examples
of animated captions to determine what size attributes are causing the positive
relationship between understanding emotions and caption size.
A final objective of my research was to measure the participants’ understanding
of the emotions experienced by the characters of the show as a result of the animated text.
In my attempt to attain this objective, I realized that reliably measuring people’s
understanding levels about a piece of content is tremendously difficult, particularly when
one group has a communication disability. No standard measures exist for establishing
equivalency between HOH and hearing groups because most tests of understanding
involve questions developed by hearing people, which are often inappropriate for
comparison. Thus, the design for measuring the understanding of emotions had
drawbacks, which will be discussed in the limitations section of this thesis.
The emotion analysis portion of the study showed that hearing participants
accurately identified more emotions than the HOH participants. However, the only
significant difference between the groups was for fear, where the HOH group made on
average 2.0 correct identifications while the hearing group made 2.7 correct
identifications.
The fear emotion occurred more often than any of the other emotions and fear was
identified by more people than the other emotions. Thirty-two percent of all emotion
identifications were for fear compared with 27% for anger, 15% for sadness, 14% for
happiness and 13% for disgust. Fear was also identified correctly more than all other
emotions (67%), and had a low percentage of incorrect (7%) and missed identifications
(15%). Fear was identified significantly more accurately than the other emotions
[F(4, 58) = 12.23, p < 0.02] for both groups. The results then
indicate that the animation for fear represented the emotion to the participants with a high
degree of clarity.
Anger also showed a similar trend to that of fear, although the results were not
significant. Fifty percent of the time the anger animation was correctly identified, only
4% of the time it was incorrectly identified, and 13% of the time it was missed completely. The
high number of correct identifications (and low number of incorrect identifications)
indicates that the animation for anger could correctly represent the emotion. However,
further research will need to be carried out in order to validate the anger animation,
especially using content that contains a range of anger emotions.
As seen in Table 10, happiness, disgust and sadness seem less understandable and
more difficult to interpret, although sadness had a 46% correct identification rate.
Possible explanations for this result are: 1) the animation mapped somewhat
unsuccessfully to the emotion; or 2) too few instances of these three emotions were
present compared to anger and fear in the content. Instances of anger and fear occurred
three times each in the clips while happiness, disgust and sadness occurred only once.
The mixed results in the case of happy, sad and disgust indicate that more research is
definitely required in order to create more reliable animations. However, it is important to
note that the main purpose of this exploratory study was to establish if animated captions
are a worthwhile avenue to pursue, not to establish categorically consistent and accurate
emotion animations.
One surprising result of the study was that the emotions were correctly identified
most often with the conventional captions (57%), more often than with the
enhanced (50%) or extreme captions (48%), although these differences were not
significant. Also, emotions were missed 16% of the time for conventional captions, 19%
for enhanced and 26% for extreme. Emotions were incorrectly identified for the
conventional captions more than the other types (9% compared with 2% for enhanced
captions and 7% for extreme captions). Given how much more highly enhanced captions
were rated than extreme and conventional captions, the identification data seem difficult
to rationalize. It is possible that enhanced and extreme captions increase the intrusive
nature of captioning, thus making it difficult for the participants to understand what is
going on. However, this seems unlikely, since 64% of participants found that their
understanding of the characters’ emotions was increased by enhanced captioning. Thus, the
more plausible explanation for the contradictory results can be attributed to methodology
and the difficult task of measuring understanding of television content. Many events were
occurring simultaneously when participants were asked to view a new piece of content
with a new style of captioning and then verbalize the emotions of the characters. These
events included watching the content, understanding and interpreting the emotions,
looking at the checklist to keep track of the available options and speaking out loud the
appropriate emotion. Since participants were already familiar with the closed captioning,
the task of identifying and verbalizing the emotions may have been less difficult than
performing this task with enhanced captioning. Animated text is inherently more
distracting than static text; as a result, enhanced captioning, with its added level of
distraction, may have hindered the viewers’ ability to correctly verbalize, rather than
correctly understand, the emotion.
Other categories of interest measured in the study were the technical category and
the sound effects category, although neither showed any statistical significance between
the HOH and hearing groups or positive versus negative comments. The technical
category amalgamated several factors related to captioning, such as the translucent bar
against which the animated captions were displayed, and the content (Deaf Planet).
Viewer opinions about the translucent bar were somewhat divided. Three viewers
appreciated that the show was more visible because of the translucency, which indicates
that the translucency feature of the EIA-708 standard may be useful in enhancing viewer
experience. The negative comments about the bar (by one participant) related to the fact
that it stayed on even when the captions were not being displayed. For this particular
study, it was too time consuming to synchronize the translucent bar with the appearance
of captions. In the future this problem should be rectified so that the translucent bar
disappears if there are no captions. The comments about the content, Deaf Planet, were
also divided (six participants liked it, four disliked it). Six participants appreciated Deaf
Planet’s inclusive theme and enjoyed watching a program geared towards sign language
users. One of the hearing viewers was not accustomed to signing and was distracted by
signing characters that did not speak. Other participants (hearing and HOH) disliked the
qualities intrinsic to children’s programming such as extreme colours, over expressive
acting, and slower, simpler plot lines. It is imperative that further research with animated
captions be evaluated using different genres and different lengths of content, in order to
eliminate the biases (positive and negative) that are inherent with a sign language based
children’s show.
There were almost an equal number of positive (23) and negative (24) comments
made by the users in the category of sound effects. Eleven of 25 users found that
animating the sound effects was beneficial in providing more information not readily
available from the dialogue and visual cues. However, the animation of the sound effects,
particularly in the Bad Vibes segment, was thought to be too overpowering and
distracting. In particular, the seemingly innocuous use of dancing, colourful letters to
depict music was criticised by most HOH viewers as being unnecessary and offensive.
These users suggested that the word “Music” be replaced by musical notes, mentioning
that the word unwittingly drew attention to their disability.
The questionnaire data mirrored what was found in the video data regarding the
animated sound effects. Seven people reported not liking the word “Music” animated to
represent the emotion in the music and suggested animating music notes instead of
letters. People also reported not liking the animation for the word “rumble” because it
was unreadable due to rapid motion and a faded appearance.
When the questionnaire data was grouped by episodes and analyzed, a significant
difference was found between the two episodes (To Air and Bad Vibes) for the text
description (or sound effects) category. A closer look revealed that the significance was
being caused by the HOH group. For To Air, the HOH group reported liking the text
description very much (mean = 1.6, SD = 0.63) compared with Bad Vibes where they
reported only liking the text descriptions (mean = 2.3, SD = 0.83). As Bad Vibes had
more sound effects than To Air, more text descriptions were used for them (e.g.,
“rumble,” “whirling” and “music”). These data seem to indicate that overusing words to
represent sound effects may not be a suitable direction to pursue.
The voiced opinions about the sound effects were decidedly split. The sound
effects were reported to be helpful but also distracting. It appeared that participants
preferred the sound effects when dialogue text was not present on screen, as opposed to
having dialogue text and description text be simultaneously displayed. Participants
complained that having dialogue and sound description at the same time is “hard to
watch,” because there is “too much going on”. Also, the participants wanted the effects to
stop showing as soon as they had a chance to read them, instead of having the description
continue as long as the sound was present. For instance, in one scene in the test clips, a
rumbling noise can be heard continuously for about 20 seconds. When the sound effect
“Rumble” was displayed throughout those twenty seconds, the participants commented
that they wanted the words to disappear as soon as they had a chance to read it. One
participant commented, for example, that “… the word rumble went on for too long”.
This is very surprising considering that one of the most desirable but missing elements of
captioning is synchronization. When it comes to sound effects, however, users do not
seem to want the description to be fully synchronized with the sound, that is, displayed
for as long as the sound is present. Since not all participants commented on the sound
effects, it is difficult to generalize a trend. However, it seems that
viewers had difficulty associating the descriptive text with the effect itself.
Furthermore, animating descriptive words to mimic a particular sound effect by
manipulating the speed and amplitude of the word may render it unreadable and
ineffective. As a result, animating descriptive text to exactly match the sound effect (as
was done in this case) may not be the most appropriate technique. Animated graphics
may be one possible alternative to descriptive text, as suggested by some of the
participants, especially in the case of music. However, effective graphics can be difficult
and costly to create, and the number of possible sound effects can be overwhelming. It
would be impractical to create a library of graphics for a captioner to use containing all
possible sound effects. Caption users can be engaged in defining specific sound
categories and creating meaningful graphics through a participatory design approach
(Clement & Van den Besselaar, 2004). While the number of possible sound effects is
limitless, some commonality can be established using group consensus. Caption users can
then create sound effect animations that are most understandable to them collectively.
Another solution may be to use an animated, graphical bar situated at an
inconspicuous part of the screen. This bar would move according to the sound effect
(similar to the equalizer bars found on media players or sound display bars on a stereo).
In the future, tactile elements could also be introduced in conjunction with visuals to
present sound effects to users. Hot and cold sensations could be used to emulate angry or
tension-building sounds, while vibrations could mimic effects such as rumbling.
Ultimately, much more research is needed in this area before any concrete conclusions can
be drawn.
Overall, animated text captions appeared to be accepted and understood by the
participating viewers. The fact that enhanced captions were particularly well liked by the
participants indicates that viewers are able to interpret meaning from these animated
captions. It also appears that, at least based on the results of this study, the fear and anger
emotions are being accurately represented by animated text. The animations for happy,
sad and disgust emotions still need to be revised. It appears that animation properties
such as variation in size, direction of movement, change in opacity and manipulation of
fonts and colours have the potential to be mapped to specific emotions and applied to
captioning.
Section 6.01 Limitations
(a) Emotion Identification Method
The most difficult part of the study was designing an accurate way of gauging the
relationship between the various animations and the emotions as understood by viewers.
Watching a completely new piece of content with animated captions (a style never
seen before) was difficult for the participants in the first place. The HOH viewers were
accustomed to watching conventional captioning, and the results of the study may
have shown a novelty effect. Conversely, as the hearing viewers were not regular users of
any captioning, having to look at captioning (especially in a study environment) could
have also been somewhat disconcerting. None of the participants had seen the show
“Deaf Planet” prior to the study, so they were completely unfamiliar with the characters,
the setting, and the plot. The test clips were very short (approximately one and a half
minutes long) and neither one started at the beginning of the episode, causing further
confusion. In future studies, an entire episode of a show that is familiar to the participants
is recommended for captioning to reduce the confusion. The captioned episode should
also be viewed by the participant from the beginning to ensure maximum comprehension.
However, watching a half-hour show multiple times in a study environment would be
difficult to arrange with test participants, as it would likely last for several hours. I would
suggest, then, that study participants be allowed to view longer episodes at home and
possibly over several hours. Data capture and analysis for such a long study would,
however, also be difficult to coordinate.
The emotion identification process followed by the participants was a complex
one: watching a show, identifying the emotions the characters were feeling on a checklist
and then verbalizing those emotions using words from the checklist during a second
viewing. It should be noted here that only one of the hard of hearing participants was oral
deaf; as a result, the verbal protocol method was appropriate for this study. All of the
participants had the option of writing down their responses but none (except the one who
was oral deaf) chose to do so. The addition of new and distracting factors (a new show
and a new style of captioning) to the identification process only made an already difficult
task harder. In addition, it was obvious from the video data that participants seemed to want
to speak but stopped as the scene changed. This occurred particularly during enhanced and
extreme captioning. The normal television viewing experience does not require the
viewer to decompose content into parts such as emotions, and asking people to do so
may have been too complex a task to achieve successfully.
A physiological measure such as galvanic skin response may be more appropriate,
since it would eliminate the need for the participants to articulate the character’s
emotions. However, these techniques are used to measure arousal of emotion in the
participant rather than their understanding of it (Vetrugno et al., 2003). Sliders or
dials could also have been used to identify emotions, but the number of emotions was
quite high, and asking the participants to remember which dial or slider corresponded to
which emotion would have had its own limitations.
To accurately measure understanding of emotion in the context of watching
television, the participants need to watch more than just a few minutes of a show. They
also need to watch the show from the beginning to fully understand what is going on. In
addition, verbalizing emotions while watching content and concentrating on it is very
difficult and highly unnatural. One interesting and possibly more accurate way to
measure understanding would be to allow the participants to communicate the story of
the show after watching it through a focus group. In this way, a facilitator could
encourage all participants to make story contributions to develop a rich and thorough
discussion of the content.
(b) Communication barriers
In addition to the difficulty of designing the study, relaying the instructions
to HOH users was somewhat difficult as well. Many of the participants had severe hearing
loss and were not expert lip readers, even though they themselves were able to
communicate articulately. Communicating the complex instructions of identifying the
emotions to some of the HOH participants while they watched the clips was at times very
difficult. The hearing participants were much more communicative during the emotion
identification than the HOH participants. In fact, several HOH emotion identification
portions had to be excluded because the instructions were not properly conveyed. One HOH
participant did not understand that he was supposed to verbalize his identifications while
watching the videos; instead he continued to mark them down on the emotion checklist
during the second viewing. Another HOH participant, who was oral deaf, was only able
to communicate through writing, and it was difficult for her to start and stop the video
each time she wanted to identify an emotion. Three of the HOH users were also visually
impaired, adding further communication obstacles.
These barriers to communication are very difficult to mitigate in a study setting.
For future studies involving HOH users, an online chat session through which
participants and the facilitator can communicate by typing may be helpful. Running
through a short but complete practice study for training would also be helpful,
especially when complex directions are involved. Future studies should also be set up in a
usability lab specifically designed to accommodate HOH users. The lab should have
minimal background noise and high lighting levels so that lip readers can see the study
facilitator clearly. If any of the HOH users in future studies are also sign language users,
then interpreters should be employed to ease the communication difficulties.
(c) Small Number of Participants
The animated caption study was conducted with a relatively small number of
participants, with only fifteen hard of hearing and ten hearing individuals. While the
smaller participant groups are sufficient to gauge the soundness and legitimacy of the
animated text captions, larger control and experimental groups are needed before
categorical conclusions can be drawn.
(d) Limited Genre
Deaf Planet is not only a children’s show, it is also an atypical one. The premise of the
show is based on science fiction, and it mixes animation and live action.
Furthermore, all but two characters in the show are deaf and communicate using sign
language, so all communication between the main characters is conducted
through an interpreter (who is female but imitates male voices when necessary).
Finally, the actions, colours and acting of Deaf Planet are highly exaggerated to appeal to
a younger audience. While animated captions worked well with this show, its unusual
nature makes it very difficult to predict the extent to which animated captions would be
applicable to other genres such as drama, horror or comedy. It should be noted that all
content has built-in biases that can influence the effectiveness of a study of this nature.
Thus, it is imperative that animated captions are tested with as many genres and as many
different shows as possible.
(e) Short and Limited Test Clips
The test clips were too short and did not give the participants enough time
to fully understand the story or the characters. Furthermore, since Deaf Planet is a
children’s show, the range of emotions was very narrow, with only a few instances of
exaggerated anger and fear and even fewer instances of happiness, sadness and disgust.
Since I am trying to establish an emotion model based on a few basic emotions combined
with an intensity rating, a show with a much broader range of emotions (e.g., from
sarcasm to rage) is necessary for future research.
Chapter 7. Recommendation/Conclusion
Closed captioning has remained essentially unchanged since its inception and
little has been done to improve or alter it over the past few decades. One of the issues that
caption users have identified as problematic is the lack of access to non-dialogue sound
information such as music and speech prosody. The purpose of this thesis work has been
to investigate one alternate method of presenting that non-dialogue sound information, in
the form of animated text. A framework was constructed that mapped a model of
emotions, consisting of a set of basic emotions, onto properties of animation that
represent those emotions. Example video content that contained implementations of this
framework with text captions was then evaluated with hard of hearing and hearing users.
The first research question of my thesis asked whether the participants are able to
understand, appreciate and interpret the animated text captions; that is, whether
caption viewers can watch and understand a show with animated text. Based on
participant reactions, I conclude that animated text shows promise as an effective avenue
for representing non-dialogue sound information because most of the participants liked
and understood this new method. Enhanced captioning was very well received by all of
the participants and was thought to be an improvement over conventional captioning.
However, extreme captioning, where animated text was located in different spots around
the screen, was shown to be distracting to the participants. Therefore, my
recommendation is that the extreme captioning style (with dynamic placement of text) be
abandoned and that only enhanced captioning be used for continued
investigation.
The second research question asked whether the emotions could be accurately and
consistently represented by the animated text. Of all the emotive animations, the
animation used to express fear was more obvious to the participants than any of the
others. Thus, I recommend that the animation for fear be used consistently to
represent that emotion. Next, the animation for anger, although the result was not statistically
significant, was understandable to most people (based on the descriptive statistics). I
therefore suggest that the animation used to represent anger is also acceptable. The
animations for the remaining three emotions (happiness, sadness and disgust) were identified
correctly on fewer occasions than those for fear or anger. However, the two test clips contained many
more instances of anger and fear than of happiness, sadness or disgust. As a result, I recommend
that the animation styles for happiness, sadness and disgust be re-evaluated with content
containing more of these emotions.
The final research question asked which properties of animation relate to which
properties of emotion. To answer it, I refer to the animation/emotion
model outlined in Chapter 2. As the model shows, fear can be represented
by a constant expansion and contraction of the text size, combined with a vibration effect, for
the duration of the animation. Anger can be conveyed by an abrupt, single-cycle
expansion and contraction of the text, with the vibration occurring at the largest size.
Sadness is represented by text decreasing in height, moving downward and
decreasing in opacity. The happiness animation is nearly the opposite of the sadness
animation, with the text increasing in height and moving upward from the baseline.
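These mappings can be summarized in a small data structure. The sketch below is illustrative only: the parameter names and values are my own assumptions paraphrasing the model described above, not the exact settings used in the study clips.

```python
# Illustrative mapping from basic emotions to animation properties.
# All parameter names and values are hypothetical examples.
EMOTION_ANIMATIONS = {
    "fear": {"scale_cycles": "continuous", "vibrate": "throughout",
             "direction": None, "opacity_change": 0.0},
    "anger": {"scale_cycles": 1, "vibrate": "at_peak_size",
              "direction": None, "opacity_change": 0.0},
    "sadness": {"scale_cycles": 0, "vibrate": None,
                "direction": "down", "opacity_change": -0.5},
    "happiness": {"scale_cycles": 0, "vibrate": None,
                  "direction": "up", "opacity_change": 0.0},
}

def describe(emotion):
    """Return a one-line summary of the animation for an emotion."""
    p = EMOTION_ANIMATIONS[emotion]
    parts = []
    if p["scale_cycles"]:
        parts.append(f"scale: {p['scale_cycles']}")
    if p["vibrate"]:
        parts.append(f"vibrate: {p['vibrate']}")
    if p["direction"]:
        parts.append(f"move {p['direction']}")
    if p["opacity_change"]:
        parts.append(f"opacity {p['opacity_change']:+}")
    return ", ".join(parts)
```

A tabular encoding of this kind would also make it straightforward for a caption renderer to look up the animation for each emotion tag in a caption file.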
Animating descriptions of sound effects is definitely an area to be investigated
further; however, text descriptions may need to be replaced with meaningful symbols or
graphics. As long as the animated text descriptions are tested for readability
and comprehension, they may still be used sparingly to represent sound effects effectively.
Based on this study, it is apparent that all text suggesting musical elements should be
replaced with musical notes.
Section 7.01 Future Research
The participants in my study were only able to watch two short clips (less than
two minutes each) containing animated captions for a specialty children's program.
In order to better establish the effectiveness of this new style of captioning, full-length
(30 to 60 minute) shows need to be captioned using this method and tested with a larger
number of participants. Furthermore, the subject matter of Deaf Planet is unusual and
directed towards young children. A more adult-oriented, conventional show needs to be
captioned in order to gauge the reactions of a wider audience, including those who are sign
language users. A typical comment from participants was that animated captions appear
more appropriate for comedies and dramas. Further research is needed to examine the
appropriateness and acceptability of animated text for full-length television and film
shows in different genres, and to determine the limitations of text captions within those
genres.
The animations for sadness, happiness and disgust must be re-evaluated with a
show that contains more of these emotions. It is possible that one or all of these
animations must be either abandoned or fine-tuned for accuracy.
The hearing participants in this study appeared to be overwhelmed by the
simultaneous presence of sound and captions. Future studies need to recreate a typical
environment for hearing caption viewers, one in which sound is muffled or reduced, such as
an exercise setting. An improved method for measuring audience understanding of emotion,
one that does not rely on participants verbalizing what they think the characters are
feeling, is also required for future evaluation.
If the same approach is used for the emotion identification portion of the study,
where participants verbalize the emotions present on screen, a complete signal detection
methodology can be applied. Using the signal detection approach, participant responses
can be categorized as hits, misses, false alarms and correct rejections. The data can then be
analyzed using the d′ (d prime) statistic to see the relationship among the four categories
for each captioning type.
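As a minimal sketch of how such an analysis might be computed, d′ is the difference between the z-transformed hit rate and false alarm rate (Stanislaw & Todorov, 1999). The counts below are invented purely for illustration:

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Compute d' from signal detection counts.

    Applies a common 1/(2N) correction so that hit or false alarm
    rates of exactly 0 or 1 remain computable.
    """
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    hit_rate = min(max(hits / n_signal, 1 / (2 * n_signal)),
                   1 - 1 / (2 * n_signal))
    fa_rate = min(max(false_alarms / n_noise, 1 / (2 * n_noise)),
                  1 - 1 / (2 * n_noise))
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts for one captioning style: 20 emotions correctly
# identified, 5 missed, 4 false reports, 16 correct rejections.
print(round(d_prime(20, 5, 4, 16), 3))  # larger d' = better discrimination
```

A d′ of zero would indicate that participants could not distinguish the animated emotions from non-emotive text at all; comparing d′ across the captioning types would quantify which style best supports emotion identification.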
The primary purpose of this exploratory study was to establish the viability of
animated text as captions. Since it appears that animated captions are acceptable,
expressive and effective, more research time and effort can be committed to this method.
An iterative, participatory design approach (Clement & Van den Besselaar, 2004) may be
used to refine or create the emotive animations. End users of captioning can then
work collectively to arrive at animations that best convey emotions from their
perspective. The participatory design approach could be most helpful in creating
meaningful animations for sound effects.
Section 7.02 Conclusion
The results of this first study are encouraging and suggest that animating captions
is one way of capturing more of the sound information contained in television and film. People
who are hard of hearing indicate that they want more sound information, and animated
text appears to offer progress toward that goal. The emergence of digital
television should be leveraged to improve the quality of captions, and it is apparent from
this research that most users are ready and willing to accept new forms of captioning.
Chapter 8. References
Abdi, H. (2007). Signal detection theory. Encyclopedia of Measurement and Statistics.
Thousand Oaks (CA): Sage.
Abrahamian, S. (2003). EIA-608 and EIA-708 Captioning. Retrieved September 19, 2006,
from http://www.evertz.com/resources/eia_608_708_cc.pdf
Ambrose, G., & Harris, P. (2006). The fundamentals of typography. New York: Watson-Guptill
Publications.
Barnett, S. (2001). Clinical and cultural issues in caring for deaf people. Retrieved April 3,
2007, from http://www.chs.ca/info/vibes/2001/spring/clinicaldeaf.html
Blanchard, R. N. (2003). EIA-708-B closed captioning implementation. Paper presented at the
IEEE International Conference on Consumer Electronics, 80-1.
Bodine, K., & Pignol, M. (2003). Kinetic typography-based instant messaging. [Electronic
version]. CHI '03 Extended Abstracts on Human Factors in Computing Systems, 914-
1. Retrieved September 7, 2006.
Boltz, M. G. (2004). The cognitive processing of film and musical soundtracks. [Electronic
version]. Memory & Cognition, 32(7), 1194-15.
Bruner, G. C. (1990). Music, mood and marketing. [Electronic version]. Journal of Marketing,
54(4), 94-11.
Canadian Association of Broadcasters. (2004). Closed captioning standards and protocol for
Canadian English language broadcasters. Retrieved February 6, 2007, from
http://www.cfv.org/caai/nadh20.pdf
Canadian Hearing Society. (2004). Status report on deaf, deafened and hard of hearing
Ontario students. Retrieved January 19, 2007, from
http://www.chs.ca/pdf/statusreport.pdf
Canadian Radio-television and Telecommunications Commission. (2000). Decision CRTC 2000-60. Retrieved April
7, 2007, from http://www.crtc.gc.ca/archive/eng/Decisions/2000/DB2000-60.htm
Card, S. K., Moran, T., & Newell, A. (1983). The psychology of human-computer interaction.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Chion, M. (1994). Audio-vision: Sound on screen. New York: Columbia University Press.
Clement, A., & Van den Besselaar, P. (Eds.). (2004). Proceedings of the Eighth Conference on
Participatory Design: Artful Integration: Interweaving Media, Materials and Practices, PDC
2004, Toronto, Ontario, Canada, July 27-31.
Consumer Electronics Association. (2005). CEA-608-C: Line 21 data services. Retrieved
September 20, 2006, from
http://www.ce.org/Standards/StandardDetails.aspx?Id=1506&number=CEA-608-C
Consumer Electronics Association. (2006). CEA-708-C: Digital television (DTV) closed
captioning. Retrieved January 13, 2007, from
http://www.ce.org/Standards/StandardDetails.aspx?Id=1782&number=CEA-708-C
Cook, N. D. (2002). Tone of voice and mind: The connections between intonation, emotion,
cognition and consciousness. John Benjamins Publishing Company.
Creswell, J. (2003). Qualitative, quantitative, and mixed methods approaches. Thousand Oaks,
California: Sage.
Cruttenden, A. (1997). Intonation. Cambridge: Cambridge University Press.
DCMP. (n.d.). Steps in the captioning process. Retrieved March 17, 2007, from
http://www.dcmp.org/caai/nadh29.pdf
Downey, G. (2007). Constructing closed-captioning in the public interest: From minority media
accessibility to mainstream educational technology. [Electronic version]. Info, 9(2/3), 69-
13.
Ekman, P., & Friesen, W. V. (1986). A new pan-cultural facial expression of emotion.
[Electronic version]. Motivation & Emotion, 10, 159-19.
Ekman, P. (1999). Basic emotions. In T. Dalgleish, & M. J. Power (Eds.), Handbook of
cognition & emotion (pp. 301-19). New York: John Wiley.
Ernst, S. B. (1984). ABC's of typography (Rev. ed.). Art Direction Book Co.
Fels, D. I., Lee, D. G., Branje, C., & Hornburg, M. (2005). Emotive captioning and access to
television. AMCIS 2005, Omaha.
Fels, D. I., & Silverman, C. (2002). Emotive captioning in a digital world. ICCHP 2002, 284(7).
Ford, S., Forlizzi, J., & Ishizaki, S. (1997). Kinetic typography: Issues in time-based
presentation of text. Paper presented at the CHI '97 Extended Abstracts on Human
Factors in Computing Systems: Looking to the Future , Atlanta, Georgia. 269-1.
Forlizzi, J., Lee, J. C., & Hudson, S. E. (2003). The kinedit system: Affective messages using
dynamic texts. Paper presented at the SIGCHI Conference on Human Factors in
Computing Systems, Ft. Lauderdale, Florida. 377-7. Retrieved from the Conference on
Human Factors in Computing Systems database.
Frijda, N. (1986). The emotions. New York: Cambridge University Press.
Gallaudet Research Institute. (2007). How many deaf people are there in the United States?
Retrieved June 19, 2007, from http://gri.gallaudet.edu/Demographics/deaf-US.php
Geffner, D. (1997). First things first. Retrieved March 28, 2006, from
http://www.filmmakermagazine.com/fall1997/firstthingsfirst.php
Harkins, J., Korres, E., Singer, M. S. & Virvan, B. M. (1995). Non-speech information in
captioned video: A consumer opinion study with guidelines for the captioning industry.
Retrieved March 11, 2006, from http://www.cfv.org/caai/nadh126.pdf
Jacko, J., & Sears, A. (2007). The human computer interaction handbook: Fundamentals,
evolving technologies and emerging applications. ONLINE: CRC Press.
Jensema, C. J. (1997). Final report for presentation rate and readability of closed captioned
television. Retrieved June 15, 2007, from
http://www.eric.ed.gov/ERICDocs/data/ericdocs2sql/content_storage_01/0000019b/80/1
5/75/be.pdf
Juslin, P. N., & Laukka, P. (2003). Communication of emotions in vocal expression and music
performance: Different channels, same code? [Electronic version]. Psychological Bulletin,
129(5), 770-14.
Lee, J. C., Forlizzi, J., & Hudson, S. E. (2002). The kinetic typography engine: An extensible
system for animating expressive text. Paper presented at the 15th Annual ACM
Symposium on User Interface Software and Technology Paris, France. 81-9.
Lewis, M. S. J. (2000). Television captioning: A vehicle for accessibility and literacy. Paper
presented at the CSUN, Los Angeles, California.
Marshal, J. K. (n.d.). An introduction to film sound. Retrieved June 20, 2007, from
http://www.filmsound.org/marshall/index.htm
Miles, M. B., & Huberman, M. (1994). Qualitative data analysis: An expanded sourcebook (2nd
ed.). Newbury Park, CA: Sage.
Minakuchi, M. & Tanaka, K. (2005). Automatic kinetic typography composer. ACM International
Conference Proceeding Series; Vol. 265. 221(3)
Mowrer, O. H. (1960). Learning theory and behavior. New York: Wiley.
National Captioning Institute. (2003). New analytical study of closed captioning
finds audiences think it's important but improvements are needed. Retrieved March 12,
2006, from http://www.ncicap.org/AnnenbergStudy.asp
NCAM. (n.d.a). International captioning project description of line-21 closed captions.
Retrieved March 15, 2006, from http://ncam.wgbh.org/resources/icr/line21desc.html
NCAM. (n.d.b.). International captioning project description of subtitles. Retrieved March 15,
2006, from http://ncam.wgbh.org/resources/icr/subtitledesc.html
NCAM. (n.d.c.). ATV closed captioning project. Retrieved June 20, 2007, from
http://ncam.wgbh.org/projects/atv/atvccpart2size.html
Ortony, A., & Turner, T. J. (1990). What's basic about basic emotions? [Electronic version].
Psychological Review, 97, 315-16.
Peterson, H. & Dugas, D. (1972). The relative importance of contrast and motion in visual
perception. Human Factors, 14, 207 (9)
Plutchik, R. (1980). Emotion, a psychoevolutionary synthesis. New York: Harper & Row.
RIT. (2000). Captioning. Retrieved August 13, 2007, from
http://www.netac.rit.edu/publication/tipsheet/captioning.html
Stanislaw, H., & Todorov, N. (1999). Calculation of signal detection theory measures. Behavior Research Methods, Instruments, and Computers, 31, 137-149.
Vetrugno, R., Liguori, R., Cortelli, P., & Montagna, P. (2003). Sympathetic skin response: Basic
mechanisms and clinical applications. Clinical Autonomic Research, 13, 256(4).
Woolman, M. (2005). Type in motion 2. New York: Thames & Hudson.
Appendix A

To: Deborah Fels
Re: REB 2004-092-1: Burnt Toast
Date: July 20, 2006

Dear Deborah Fels,

The review of your protocol REB File REB 2004-092-1 is now complete. This is a renewal for REB File REB 2004-092. The project has been approved for a one year period. Please note that before proceeding with your project, compliance with other required University approvals/certifications, institutional requirements, or governmental authorizations may be required.

This approval may be extended after one year upon request. Please be advised that if the project is not renewed, approval will expire and no more research involving humans may take place. If this is a funded project, access to research funds may also be affected.

Please note that REB approval policies require that you adhere strictly to the protocol as last reviewed by the REB and that any modifications must be approved by the Board before they can be implemented. Adverse or unexpected events must be reported to the REB as soon as possible with an indication from the Principal Investigator as to how, in the view of the Principal Investigator, these events affect the continuation of the protocol.

Finally, if research subjects are in the care of a health facility, at a school, or other institution or community organization, it is the responsibility of the Principal Investigator to ensure that the ethical guidelines and approvals of those facilities or institutions are obtained and filed with the REB prior to the initiation of any research.

Please quote your REB file number (REB 2004-092-1) on future correspondence. Congratulations and best of luck in conducting your research.
Nancy Walton, Ph.D.
Chair, Research Ethics Board
Appendix B Pre-Study Questionnaire

The purpose of this questionnaire is to gather information about participants' TV viewing habits and preferences. The information gathered here will be used to analyze the results of this study and the effectiveness of enhanced closed-captions for interactive television, and to improve the effectiveness of emotional enhancements to closed-captions. This questionnaire will take approximately 5 minutes to complete. Please read the questions carefully. To record your answer, please check the box or write your answer in the space provided. Thank you for participating in the Burnt Toast study.

Part I – Demographics
1. Do you identify yourself as (check one):
☐ Hearing ☐ Hard of hearing ☐ Deaf
2. Are you:
☐ Male ☐ Female
3. Please indicate your age range:
☐ Under 18 ☐ 19 – 24 ☐ 25 – 34 ☐ 35 – 44 ☐ 45 – 54 ☐ 55 – 64 ☐ over 65
4. What is your highest level of education completed?
☐ No formal education ☐ Elementary school ☐ High school ☐ Technical/College ☐ University ☐ Graduate school
5. What is your occupation? ___________________________________________
Part II – Current Television Patterns
6. How many hours of television do you watch in an average week?
☐ Less than 1 hour ☐ 1 to 5 hours ☐ 6 to 10 hours ☐ 11 to 15 hours ☐ 15 to 20 hours ☐ more than 20 hours
7. Do you watch TV alone? Please circle your response.
Always Frequently Sometimes Seldom Never
8. Do you watch TV with friends or family?
Always Frequently Sometimes Seldom Never
9. If you watch TV with friends or family, do you engage in conversation about the program:
☐ While you are watching it ☐ After the program is finished ☐ Not at all ☐ Do not watch TV with friends or family
Part III – Closed Captioning
10. Do you use closed captions when watching television?
☐ Always ☐ Occasionally ☐ Never
11. What do you like about the closed captions on television (check all that apply)
☐ rate or speed of display ☐ verbatim translation ☐ placement on screen ☐ use of text ☐ size of text ☐ colour (black and white) ☐ other, please specify __________________________________
12. What do you NOT like about the closed captions on television (check all that apply)
☐ rate or speed of display ☐ verbatim translation ☐ placement on screen ☐ use of text ☐ size of text ☐ colour (black and white) ☐ other, please specify __________________________________
13. What elements of closed captions do you believe are not presently available for television (check all that apply):
☐ Emotions in speech ☐ Background music/noises ☐ Speaker identification ☐ Text at the same speed as the words being spoken ☐ Adequate information in order to time jokes/puns correctly ☐ Other, please specify
_____________________________________________
14. The following is a list of characteristics that could be added to, or changed about, closed captions. Please check five characteristics that are most important to you.
____ Faster speed of displaying captions
____ Text descriptions of background noise
____ Text descriptions for background music
____ Use of graphics to represent emotion
____ Use of overlay/floating captions
____ Use of graphics to represent music
____ Use of colour in captions for emphasis, emotion or tone
____ Use of graphics to represent background noise
____ Use of text descriptions for emotional information
____ Use of different fonts or text size
____ Flashing captions for emphasis
____ Use of moving text to represent emotion
____ Use of graphics or symbols to denote background elements such as applause or musical inserts
____ Use graphics or text to identify speaker
15. Is there anything else that has not been included in the above list that you would like to see included in closed captions that is not presently available?
_____________________________________________________________________
_____________________________________________________________________
Appendix C Post-Study Questionnaire

The purpose of this questionnaire is to gather information about your impressions of, and likes and dislikes of, the six short videos that you viewed. The information gathered here will be used to analyze the effectiveness of enhanced closed-captions for interactive television, and to gather suggestions for improvement. This questionnaire will take approximately 5 minutes to complete. Please read the questions carefully. To record your answer, please check the box or write your answer in the space provided. Thank you for participating in the enhanced captioning study.

16. In a few sentences (or less), please describe what To Air is Human was about.
________________________________________________________________________
________________________________________________________________________
17. Please rate each of the following aspects of the enhanced captioning version of "To Air is Human". Check the box that best fits your rating.
Aspects (rate each: Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much):
Colour
Size of text
Speed of display
Placement on screen
Movement of the text on the screen
How the moving text portrayed the character's emotions
Text descriptions
18. Please rate each of the following aspects of the extreme captioning version of "Bad Vibes". Check the box that best fits your rating.
Aspects (rate each: Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much):
Colour
Size of text
Speed of display
Placement on screen
Movement of the text on the screen
How the moving text portrayed the character's emotions
Text descriptions at bottom
19. Please rate each of the following aspects of the conventional captioning version of "To Air is Human". Check the box that best fits your rating.
Aspects (rate each: Liked very much / Liked / Neither liked nor disliked / Disliked / Disliked very much):
Size of text
Speed of display
Placement on screen
Text descriptions
20. Compared with the conventional closed captioning of "To Air is Human", rate your level of confusion with the enhanced captioning for "To Air is Human". Please circle your rating.
Much more confusing / Somewhat more confusing / Not different / Less confusing / Much less confusing
21. How much do you think the enhanced captions increased your overall understanding of the video?
Did not increase my level of understanding at all / Slightly increased my level of understanding / No difference / Somewhat increased my level of understanding / Greatly increased my level of understanding
22. How much do you think the enhanced captions increased your overall understanding of the emotions the characters were feeling in the video?
Did not increase my level of understanding at all / Slightly increased my level of understanding / No difference / Somewhat increased my level of understanding / Greatly increased my level of understanding
23. How much do you think the enhanced captions distracted you from the video?
Did not distract me / Slightly distracted me / Did not notice / Somewhat distracted me / Greatly distracted me
24. If you had just watched To Air is Human with friends who were hearing, rate your willingness to engage in conversation with them about To Air is Human.
Not willing to discuss To Air is Human at all / Not really willing to discuss To Air is Human / Don't care / Somewhat willing to discuss To Air is Human / Very willing to discuss To Air is Human
Please indicate why you selected your particular rating.
________________________________________________________________________
25. What suggestions do you have for further improvements to captioning?
________________________________________________________________________
________________________________________________________________________ ________________________________________________________________________
26. Please add any additional comments.
________________________________________________________________________
________________________________________________________________________