J. EDUCATIONAL TECHNOLOGY SYSTEMS, Vol. 35(1) 61-87, 2006-2007
SIGN LANGUAGE SUBTITLING BY HIGHLY
COMPREHENSIBLE “SEMANTROIDS”*
NICOLETTA ADAMO-VILLANI
Purdue University
GERARDO BENI
University of California Riverside
ABSTRACT
We introduce a new method of sign language subtitling aimed at young deaf
children who have not acquired reading skills yet, and can communicate only
via signs. The method is based on: 1) the recently developed concept of
“semantroid™” (an animated 3D avatar limited to head and hands); 2) the
design, development, and psychophysical evaluation of a highly compre-
hensible model of the semantroid; and 3) the implementation of a new
multi-window, scrolling captioning technique. Based on “semantic intensity”
estimates, we have enhanced the comprehensibility of the semantroid by:
i) the use of non-photorealistic rendering (NPR); and ii) the creation of a
3D face model with distinctive features. We have then validated the com-
prehensibility of the semantroid through a series of tests on human subjects
which assessed accuracy and speed of recognition of facial stimuli and hand
gestures as a function of mode of representation and facial geometry. Test
results show that, in the context of sign language subtitling (i.e., in limited
space), the most comprehensible semantroid model is a toon-rendered model
with distinctive facial features. Because of its enhanced comprehensibility,
this type of semantroid can be scaled to fit in a very small area, and thus it
is possible to display multiple captioning windows simultaneously. The
concurrent display of several progressive animated signed sentences allows
for review of information, a feature not present in any sign language subtitling
method presented so far. As an example of application, we have applied
the multi-window, scrolling captioning technique to a children’s video of a
chemistry experiment.

*This research is partially supported by the School of Technology at Purdue University (I3 grant,
Proposal #00006585, http://www.tech.purdue.edu/cgt/I3/), by the Envision Center for Data
Perceptualization, and by the PHS-NIH grant “Modeling the non-manuals of American Sign Language”
(Award No. 5R01 DC005241-02).

© 2006, Baywood Publishing Co., Inc.
INTRODUCTION
According to the 2004 Annual Survey of Deaf and Hard of Hearing Children and
Youth (Gallaudet Research Institute, 2005), there are about 45,000 deaf school
age (K-12) children in the United States. Deaf children who have not yet
learned to read have no access to visual information (TV, DVDs, interactive
media, etc.) accompanied by linguistic explanation. Linguistic explanation is
given as speech for hearing children, or as subtitles for non-hearing children
who can read (Captioning Web, 2005). Considering that reading comprehension
is significantly delayed in deaf youngsters (the median reading comprehension
of 17- and 18-year-old Deaf students is at a fourth-grade level) (Holt, Traxler,
& Allen, 1997), many young deaf children
are deprived of the opportunities for independent learning provided by visual
media. One solution to the problem is to use sign language as subtitles; but
traditional subtitling methods present the following difficulties.
First, we consider methods which are easily scalable so that the subtitles can fit
in a small portion of the screen. The requirement of fitting in a small area limits
the subtitles to the use of static symbols. The most advanced of such systems is
SignWriting, developed by Valerie Sutton (1974) and used worldwide for writing,
reading, and researching signed languages. SignWriting has many uses but as a
subtitling method has the disadvantage of being: 1) static, as written English; and
2) highly abstracted so that it requires significant amount of learning. Because
of these two factors, it is not easy to see the advantage over written English for
subtitling. If effort has to be invested in teaching a system of abstract symbols,
it might be more efficient to teach the Deaf how to read English and use English
for subtitles.
A more intuitive alternative to SignWriting could be the use of static images of
signers as represented, e.g., in ASL dictionaries (Flodin, 1994); however, this
method would require too many static images to follow, in real time, the messages
communicated by voice. Even in using English subtitles, the speed is often not
enough to keep up with the spoken word; and since using images for words
requires a much larger screen space, it would be impossible to follow the spoken
message by presenting such images as in comic strips at the bottom of the screen.
Turning to methods that use dynamic (i.e., moving) signing, we consider both
human signers and avatars. For a human signer, the aesthetic/emotional appeal
is not easily controlled or changed by a simple menu in software. Human signers’
appeal to different ages, genders, and ethnic groups varies and cannot be manipu-
lated. An avatar’s appearance, instead, can be easily modified in the user interface.
Moreover, human signers, unlike avatars, cannot easily be made artificially
emphatic without appearing ridiculous. Features that can be emphasized for
enhanced communication include: nails, color and size of eyes, eyebrows, lips,
etc. These types of emphasis can be realized easily in an avatar but not in a human
signer. A second difficulty with a human signer is that the background inter-
feres unless the signer is clothed in black against a black background, as in a
pantomime. But the dark background confuses the shadows at the edges of the
hands and makes the gesture less clear than if the background were light. It is
also in practice difficult to realize a very neutral darkly clothed signer on a dark
background. Some details always remain and tend to stand up and be distracting.
The third, and most challenging, problem with both human signers and full
body avatars is size. The full body avatar, like the human signer, must be cut at the
waist in order to fit in the restricted space at the bottom of the display. In doing so
two problems arise: first the trunk is still visible and remains a distractive factor;
second, the hands are positioned at a natural distance from the head and thus, in
order to be included, they require a significantly high vertical size (see Figure 1).
The objective of this research is the development of a new, improved method
of sign language subtitling which solves the majority of the above mentioned
problems. The method is based on: 1) the concept of “semantroid” (Adamo-
Villani & Beni, 2005); 2) the design, creation, and evaluation of a “highly
comprehensible” semantroid model; and 3) the development of a new scrolling
Figure 1. Size comparison between a full-body 3D avatar (DePaul University, 2005), left frame; and a semantroid, right frame.
technique that allows for simultaneous display of four animated signed sentences
at the bottom of the screen.
In the remainder of this article we discuss the development of a highly com-
prehensible rendering of the semantroid and its evaluation through a user study.
We describe the design and creation of a “semphace,” and we evaluate the
comprehensibility of semphace facial expressions through psychophysical
studies. The new multi-window, scrolling subtitling technique is discussed in
section 6; conclusive remarks and future work are presented in section 7.
RENDERING OF A
“HIGHLY COMPREHENSIBLE” SEMANTROID
A semantroid (Adamo-Villani & Beni, 2004) (from “semantic” and “android”) is
a reduced avatar (limited to head and hands) which maximizes the semantic
content conveyed while minimizing the perceptual effort required to perceive
it. The concept has been quantified by the notion of “semantic intensity”
(Adamo-Villani & Beni, 2005). There are several advantages to using a seman-
troid versus using a human signer or a full-body avatar: 1) A semantroid, like
a full body avatar or human signer, represents naturally the signs and thus
requires no learning abstraction (in contrast with SignWriting) (Sutton, 1974).
2) A semantroid fits in much smaller space for the same meaning expressed by
either a human signer or an avatar. The semantroid, in fact, can position the hands
as close as possible to the head without significant loss of the meaning of the
gesture. This would be tiring for a human signer and is not realized in a full body
avatar since its purpose is to look as much like a human signer as possible. A
comparison is shown in Figure 1 where the semantroid image requires a 16.7%
shorter vertical dimension. It is also clear from the figure that, if necessary, the
semantroid vertical length can be reduced further by shifting the hands toward
the head without major loss of meaning. 3) A semantroid can be “optimized”
to improve semantic intensity and hence comprehensibility. “Optimization” of
semantroid comprehensibility is one of the main objectives of this research and
will be discussed next.
The rationale for the use of the semantroid instead of a full avatar has been given
by Adamo-Villani and Beni (2005). The justification is based on comparing the
semantic intensity of the semantroid with the semantic intensity of the avatar.
Broadly speaking, semantic intensity is a measure of the ratio of the quantity of
“meaning” conveyed to the quantity of “effort” required to perceive such a
meaning. “Meaning” is closely related to information but is constrained by the
requirement that the information must be represented directly (visually) without
inference/abstraction analysis on the part of the perceiver. “Effort” is related to the
perceptual effort of the observer in perceiving the meaning, and it is affected by
two main factors:
1. The spatial distribution of the image. Intuitively it takes more effort to
perceive widely scattered elements than more compactly located elements.
2. The distribution of the meaning-conveying property. A measure (Adamo-
Villani & Beni, 2005) of this effort is the ratio of the variance of the possible
meanings to the variance of the nuances of meaning—i.e., of the different
values of the meaning-conveying property for a given meaning.
In the 3D to 2D (toon) shading transformation, factor (1) plays no role since the
overall spatial distribution does not change; factor (2) is the critical one.
Similarly to what we did when comparing the semantroid with a full avatar, we can
determine which semantroid representation (or rendering) is the most compre-
hensible one by comparing semantic intensities. We consider three types of
rendering: i) 3D (or photorealistic rendering); ii) 2D (or toon-rendering); and
iii) hybrid (a combination of 3D and 2D renderings). A priori it is not clear
which rendering would be most effective. But we can use the following con-
siderations to form a plausible hypothesis.
Consider the rendered model as consisting of two parts: a) face and hands; and
b) hair, neck, and ears. In case (i) the rendering is 3D for both (a) and (b); in case (ii)
for neither (a) nor (b); and in case (iii) the rendering is 3D only for (a). Since all the
meaning is conveyed by part (a), it is clear that the semantic intensity of part (b)
increases in case (ii) and (iii)—i.e., when the rendering is 2D. This happens
because the number of variations of hues of the “same color” is reduced and the
“distance” in hue is increased. Both factors increase the semantic intensity by
reducing the perceptual effort.
For part (a) it is difficult to determine whether or not a decrease in information
offsets the reduction in perceptual effort. If it does, the semantic intensity of (a) is
reduced; hence, the hybrid case (iii) will have the largest overall semantic intensity
[hypothesis H1]. If it does not, then the 2D case (ii) will have the largest overall
semantic intensity [hypothesis H2]. A quantitative calculation of the semantic
intensity is not trivial and it is the subject of future research. In this article, we
test empirically the two hypotheses [H1], [H2].
2D and Hybrid Renderings
Non-photorealistic rendering (NPR) is any rendering technique that produces
images of simulated 3D worlds in a style other than realism. Often these styles
are reminiscent of paintings (painterly rendering), or of other techniques of
artistic illustration (sketch, pen and ink, etching, lithograph, etc.). In many
applications, such as visualization and design of effective diagrams, a non-
photorealistic rendering has advantages over a photorealistic image (Agrawala
& Stolte, 2001; DeCarlo & Santella, 2002; Gooch & Gooch, 2001; Laidlaw,
2001; Wilson & Ma, 2004). NPR images may convey information better by:
omitting extraneous detail; focusing attention on relevant features; and by clari-
fying, simplifying, and disambiguating shape (Gooch, Reinhard, & Gooch, 2004).
In an entertainment or artistic context, the control of image detail combined with
stylization increases the effectiveness in communicating the intent of its creator
(“amplification (of information) through simplification”) (McCloud, 1993).
The specific NPR technique that outputs line-drawing style renderings of 3D
models is called toon-rendering. Toon rendering, also known as cartoon, comic, or
2D rendering, outputs imagery with meaningful abstraction (directed removal of
detail). The result is hand-drawn style images with bold edges and large regions
of constant color. In contrast with 3D shading models such as Gouraud’s or
Phong’s, toon rendering converts color to discrete levels and thus eliminates
smooth color shading and highlights.
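The banding that distinguishes toon shading from smooth shading can be captured in a few lines. The sketch below is illustrative only (the renders in this article were produced in Maya with Mental Ray, not with this code): a Lambert diffuse term is snapped to a small number of flat levels, eliminating smooth gradients and highlights.

```python
# Minimal illustration of toon (cel) shading vs. smooth diffuse shading.
import numpy as np

def lambert(normal, light_dir):
    """Smooth diffuse term in [0, 1], as in Gouraud/Phong-style shading."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    return max(float(np.dot(n, l)), 0.0)

def toon_shade(normal, light_dir, levels=3):
    """Snap the diffuse term to one of `levels` flat bands."""
    d = lambert(normal, light_dir)
    band = min(int(d * levels), levels - 1)  # index of the flat band
    return band / (levels - 1)               # large regions of constant color

light = np.array([0.0, 0.0, 1.0])
# Normals sweeping across a curved surface, from facing the light to edge-on.
normals = [np.array([np.sin(t), 0.0, np.cos(t)])
           for t in np.linspace(0.0, np.pi / 2, 9)]
print("smooth:", [round(lambert(n, light), 2) for n in normals])  # gradient
print("toon:  ", [toon_shade(n, light) for n in normals])         # 3 levels only
```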
Toon rendering has been used for several years in the animation industry, and
many commercial 3D software packages allow for rendering of 3D models with
flat shading and contour lines. For our work we have used Maya 6.5 and the
Mental Ray renderer. The semantroid model contours were rendered
by property difference and sample contrast. In particular, we rendered contours
around coverage (i.e., based on pixel coverage—where rendering samples detect
objects present in the scene), and between different materials. This allowed us to
clearly define the silhouette of the 3D surfaces with continuous bold edges.
Contour lines were also drawn based on normal contrast—i.e., between pixels
whose normal difference was larger than a set value. This enabled us to reveal the
curvature of the eyebrows and lips. Different flat surface shaders were applied to
different surface regions; shadows were not included. Figure 2 shows a 3D
rendering of the semantroid’s head model (b); a toon rendering produced with the
techniques described above (c); and a “hybrid” rendering (d)—hair and neck are
toon-shaded, face is 3D shaded.
“Optimization” via Color
In order to maximize the comprehensibility of all three representations, we have
chosen a particular color scheme. The choice of colors has been guided also by
semantic intensity considerations. Returning to factor (2) in the effort of meaning,
it is clear that the effort of distinguishing among the various meanings conveyed
by the hands and face is reduced by increasing the differences in color/shape of
the possible configurations of the hands and face. Because of the need to maintain
a minimum level of realism, the differences cannot be emphasized by modifying
the shapes of fingers and facial features. On the other hand, color can be used
to enhance contrast without loss of realism. This has determined our choice of a
female model so that the nails can be colored without loss of realism. Bright
colored nails make it easier to discern the finger configurations, especially on a
reduced scale. Similar considerations apply to brightly colored lips which help
discern different mouth configurations.
For the eyes, besides the brightness of the color, the choice of the hue is also
a factor that affects the perception effort. For the hue factor, we have followed
Figure 2. From the left: polygonal head model (a); model rendered with 3D shading (b); model rendered with 2D shading (c); model rendered with “hybrid” shading (d).
research results in color perception (Hill & Scharff, 1997). These results were
obtained for comprehensibility of Websites with various foreground/background
color combinations, using different fonts and text sizes. The measure of com-
prehensibility was reaction time of the reader. No combination of font, size, or
color was found to be optimal, but for foreground/background the color com-
bination with fastest reaction time was green on yellow out of six possible
combinations which included yellow/blue. Based on these results, we have
selected green rather than blue eyes, since the general hue of the face approaches a
shade of yellow. Of course, this choice is very tentative and based on results
that may not be applicable; further studies should be done to determine the
highest comprehensibility of facial expression using different color combinations.
Figure 3 shows the hybrid shaded semantroid optimized for color.
COMPREHENSIBILITY EVALUATION OF 3D, 2D,
AND HYBRID RENDERINGS
The goal of this user study is to determine the comprehensibility of 3D, 2D,
and “hybrid” representations of facial expressions and hand gestures; a measure
of this comprehensibility is given by viewer’s accuracy and speed of recognition.
Figure 3. Hybrid shaded semantroid “optimized” for color.
Our hypothesis [H1] is that the hybrid representation is the most comprehensible
one (i.e., accurate and requiring the lowest recognition time). The experiment is
described in detail in Appendix A; below, we summarize and discuss the results.
Results
Results (reported in Appendix A) show that speed of recognition is highest for
the full 2D representation. Subjects were faster at recognizing 2D representations
(M = 1.60s) compared to 3D representations (M = 2.12s, p = 0.012), and compared
to hybrid representations (M = 1.91s, p = 0.035). Results also show that there is
a marginal time advantage in recognition of hybrid representations versus 3D
representations (p = 0.068). Recognition accuracy is high for all three
representations; differences are negligible.
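The article does not state which statistical test produced these p-values; given the within-subject design, one plausible reading is a paired comparison of per-subject mean times. The sketch below illustrates such an analysis on made-up placeholder data (the per-subject values are not the study’s).

```python
# Hypothetical paired comparison of recognition times across the same
# 14 subjects; the test choice and all data are our assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
t_2d = rng.normal(1.60, 0.25, 14)           # per-subject mean times (s)
t_3d = t_2d + rng.normal(0.52, 0.30, 14)    # 3D slower on average
t_hyb = t_2d + rng.normal(0.31, 0.30, 14)   # hybrid in between

for name, other in (("3D", t_3d), ("hybrid", t_hyb)):
    t_stat, p = stats.ttest_rel(t_2d, other)
    print(f"2D vs {name}: mean diff = {np.mean(other - t_2d):+.2f}s, p = {p:.3f}")
```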
A preliminary conclusion is that facial and hand configuration recognition
appears to be quickest when face and hands are presented as illustrations, fol-
lowed by a combination of illustration and photorealistic renderings, and then
photorealistic (3D shaded) images. Thus, from the experiments we conclude that:
1) the full 3D case, as predicted, is the worst; 2) hypothesis H2 prevails over H1.
Discussion
First, let’s consider the result which shows that recognition time of the hybrid
representation is slightly lower than recognition time of the photorealistic repre-
sentation. This result can be explained with the semantic intensity considerations
presented above—i.e., part (b) of the hybrid rendering requires less perceptual
effort than the photorealistic one. The more interesting result is the superiority
of the 2D rendering. In fact, hypothesis H1, that the hybrid representation is the
most comprehensible, is not supported by the experiment. Why? And, how general
is this conclusion?
Part (a) of the 2D rendering has a higher semantic intensity because there is no
loss of meaning to offset the advantages of reduced perceptual effort. This no loss
of meaning in going from 3D to 2D cannot be general. Intuitively, it is apparent
that in many cases photorealistic rendering contains non-negligible details. But
the question may be raised as to whether or not the additional details present
in photorealistic rendering tend to disappear at small scale. In fact when the size
of the image is significantly reduced—this is our case for sign language subtitles—
variations in color and highlights in face and hands do not represent any changes
in meaning because they cannot be distinguished as meaningful facial signals or
meaningful hand deformations (for example, variations in shading produced by
deformations of the face geometry between the eyebrows and on the forehead
during frowning cannot be read as wrinkles (see Figure 4); a variation of shading
produced by reduction of the wrinkles on the knuckles during finger flexion
cannot be read as such). When the size of the image is very small, facial signals
are conveyed almost entirely by shape and position of eyes, eyebrows, and mouth,
Figure 4. Loss of meaning of color variation in small size images. Frame 1 contains a 3D model of Tom Hanks’ head (Blanz & Vetter, 1999); frame 2 shows a photorealistic rendering of the model frowning; frame 3 contains a reduced scale image of the same rendering. In frame 3 variations in shading in the forehead and eyebrows areas cannot be read as wrinkles.
while hand configurations are mainly expressed by the outline of fingers and
nails. Therefore variations in face and hand shading represent variations in
nuances of meaning which require an additional perceptual effort, but do not
convey more meaning.
ENHANCED COMPREHENSIBILITY
VIA “SEMPHACE”
Results from the previous tests show that the mean recognition time for the
2D representation = 1.60s, the maximum recognition time = 2.53s for image
4 (I_AU15), and the minimum recognition time = 1.06s for image 2 (see
Appendix A). Our goal is to achieve a recognition time ≤ 1s considering that,
during animation playback, many facial signals are often held on the screen for
less than a second.
In order to increase recognition speed we: 1) developed a face model with
emphasized features (semphace—defined below); 2) measured speed and
accuracy of facial expression recognition of 2D, 3D, and hybrid representations
of semphace using the procedure followed in experiment 1; and 3) compared
speed and accuracy of facial expression recognition of 2D, 3D, and hybrid
representations of semphace vs. speed and accuracy of facial expression recog-
nition of 2D, 3D, and hybrid representations of the norm face.
The “Semphace” Model
A well-known finding in the literature on face recognition is that faces judged
as distinctive are more easily recognized than faces judged as typical (Benson
& Perrett, 1994). The recognition efficiency of emphatic faces (faces with dis-
tinctive features) is related to a psychological phenomenon called the peak shift
effect: if a rat is rewarded for discriminating a rectangle from a square, it will
respond even more vigorously to a rectangle that is longer and skinnier than the
prototype, i.e., a rectangle that deviates more from a norm rectangle (Hanson,
1959; Ramachandran & Hirstein, 1999). Similarly, it has been postulated that
humans recognize faces better and faster if the facial features deviate significantly
from an average face (Tversky & Baratz, 1985).
Though there haven’t been specific studies on recognition of facial expressions
of distinctive faces, both the documented ability of caricatures to augment the
communication content of images of human faces and semantic intensity con-
siderations have motivated our development of the semphace model in order to
increase recognition speed of facial stimuli. We call “emphace” (= emphatic face)
a representation of a face with some essential features emphasized for a specific
effect. Examples of emphaces are: 1) facial caricatures: representations of faces
whose essential (for individuality) features are emphasized for comic effect;
2) cartoon faces (e.g., manga) (Gravett, 2004): representations of faces whose
essential (for expressing feelings) features are emphasized for increased
emotional impact; and 3) semantic emphaces = “semphaces”: representations
of faces whose essential (for conveying meaning) features are emphasized for
increased understanding.
A semphace directly increases semantic intensity by reducing type (1) of
perceptual effort, i.e., by increasing the compactness of the geometric distribution
of the meaningful features—eyes and mouth. Such increase in compactness
(as measured by the ratio of the area occupied by significant features to the total
area of the face) can be significant. The challenge is to increase the area of the
significant features without compromising the perception of the face beyond a
minimum level of credibility.
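As a concrete illustration of this compactness argument (our own arithmetic, not a measurement reported here): if the significant features occupy a fraction r of a fixed face area and their linear dimensions are scaled by a factor k, their share of the face grows to

    r' = k^2 r    (valid while k^2 r < 1),

so features covering, e.g., r = 0.15 of the face, enlarged by k = 2, would cover r' = 0.60, a fourfold increase in the compactness ratio defined above.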
In developing a distinctive face, two types of facial information can be altered:
i) featural information, which pertains to face elements (i.e., shape and size of
eyes, nose, mouth) referred to in isolation; and ii) configural information, which
relates broadly to spatial relationships among these face elements (i.e., distances
between eyes, nose, mouth) (Freire & Lee, 2000; Searcy & Bartlett, 1996). Some
research studies show that altering the spatial location of essential face elements
can often impair subjects’ recognition of faces and facial expressions (Tanaka &
Sengco, 1997) while, as mentioned previously, altering size and shape of essential
facial features can augment facial communication.
Based on these findings, when developing the semphace model we have
altered featural information only, while maintaining a normal spatial relationship
between facial features. Considering that in ASL the most essential (for con-
veying meaning) facial features are eyes, mouth, and eyebrows (Wilbur, 2004),
we have modeled a semphace with distinctive eyes and mouth, and well-defined
eyebrows (see Figure 5). The semphace model was created as a morph target
derived from a face model with anatomically proportioned features. The extremely
enlarged eyes were produced by multiplying the scale parameter values of
the original model’s eyes by a factor of 2.5, and the enlarged mouth was pro-
duced by multiplying the scale parameter values of the original model’s
mouth by a factor of 1.8. The size of the nose was scaled by a factor of 0.5
since the nose is not essential in conveying meaning during signed com-
munication. Figure 5 shows the semphace model (a), a 3D rendering (b), a 2D
rendering (c), and a “hybrid” rendering (d). Figure 6 shows different semphace
expressions.
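A minimal sketch of the feature-scaling operation just described, using the stated factors (2.5 for the eyes, 1.8 for the mouth, 0.5 for the nose). The mesh, region indices, and helper function are placeholders for illustration; the actual model was built as a morph target in a 3D animation package.

```python
# Sketch of building a semphace-style morph target: vertices in each
# feature region are scaled about the region's own centroid. The mesh
# and region indices below are toy placeholders, not the actual model.
import numpy as np

def scale_region(vertices, region_idx, factor):
    """Scale a subset of mesh vertices about that subset's centroid."""
    v = vertices.copy()
    centroid = v[region_idx].mean(axis=0)
    v[region_idx] = centroid + factor * (v[region_idx] - centroid)
    return v

verts = np.random.default_rng(2).normal(size=(100, 3))   # toy N x 3 mesh
eyes, mouth, nose = np.arange(0, 20), np.arange(20, 35), np.arange(35, 45)

semphace = scale_region(verts, eyes, 2.5)    # enlarged eyes
semphace = scale_region(semphace, mouth, 1.8)  # enlarged mouth
semphace = scale_region(semphace, nose, 0.5)   # reduced nose

# The morph target is the per-vertex offset from the norm face; an
# animation control can then blend between norm face and semphace.
morph_offsets = semphace - verts
```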
COMPREHENSIBILITY EVALUATION OF
2D, 3D, AND HYBRID REPRESENTATIONS
OF SEMPHACE
By semantic intensity considerations similar to those of section 3 it is expected
that the fully 3D semphace will be the least comprehensible; but it is not a priori
obvious whether the hybrid semphace will be more effective than the 2D rendered
Figure 6. Examples of semphace’s facial expressions: 3D renderings, left; 2D renderings, right.

Figure 5. Semphace polygonal mesh (a); 3D shaded semphace (b); 2D shaded semphace (c); hybrid shaded semphace (d).
one. Experimentally we test two hypotheses: [H3] the 2D semphace representation
is the most comprehensible (readable) of the three semphace representations; and
[H4] the 2D semphace representation is more comprehensible than the 2D repre-
sentation of the norm face. The experiment is described in Appendix B; in this
section we summarize and discuss the results.
Results
Results show that speed of recognition is higher for the 2D semphace repre-
sentation (M = 1.06s) compared to 3D (M = 1.38s) and hybrid (M = 1.29s)
semphace representations, thus [H3] is confirmed. In addition they show that
subjects were significantly faster at recognizing 2D representations of the
semphace (M = 1.06s) compared to 2D representations of the norm face (M =
1.60s, p = 0.009), thus [H4] is confirmed as well. There is also a time advantage
in recognition of hybrid representations of the semphace (M = 1.29s) versus
hybrid representations of the norm face (M = 1.91s, p = 0.020), and in 3D
representations of the semphace (M = 1.38s) versus 3D representations of the
norm face (M = 2.12s, p = 0.009). Recognition accuracy was 100% for all three
semphace representations.
Discussion
In summary, the experiments show that: 1) facial recognition is quickest
when the face has distinctive features such as enlarged eyes and mouth and
reduced nose (semphace); and 2) the most comprehensible representation of
a face with such distinctive features is the 2D representation. An important
parameter not tested in this experiment is the size of the image. As discussed
above, it is expected that 2D rendering has an advantage at small scales; it is
not clear whether the hybrid rendering would be more effective at large scale
but we have argued that a priori it is certainly possible. What is more difficult
to argue is whether the advantage of the semphace over the norm would exhibit
the same type of scale dependence. Actually it is not clear what advantage the
norm face would have at larger scale since no more significant details are added.
Thus, we may argue that the semphace will remain advantageous over the norm
face regardless of scale.
This is left to future research on semphaces since, besides this scale issue,
many other aspects of the concept need further exploration. Examples are: the
effect of varying the relative positions of the semantic features; the role of details
in the semantic features, e.g., eyelids, eyelashes, teeth, and tongue; the advantage
or cost of including other semantic features such as wrinkles; the effectiveness
of different gender, age, or race of the model; and most importantly, the quan-
tification of some parameter of realism to set boundaries for the size of the
significant features.
THE MULTI-WINDOW SCROLLING METHOD
The enhanced comprehensibility, and thus scalability, of the highly compre-
hensible semantroid has led to development of an efficient subtitling scrolling
method that allows for simultaneous display of multiple subtitling windows at
the bottom of the screen, throughout the entire video playback (see Figure 7). The
concurrent display of several animated sentences allows for review of infor-
mation—a feature non-existent in any sign language presentation method so far.
Traditional methods of captioning in sign language use one human signer
occupying a corner of the screen (usually at the bottom right). This method has
several drawbacks, as discussed previously, but, in addition, whether a human
signer is used or an avatar, the traditional method displays the message as a
sequence of signs so that only the last sign is visible. This is basically the method
that must, by necessity, be used by a human signer in direct sign language
conversation. But, in displaying subtitles, there is no reason to limit the display to
only the last sign which is being communicated. In fact, if the message is broken
down into signed sentences, more than one sentence could be playing on the screen
simultaneously. This method has both advantages and disadvantages.
It has the advantage that, similarly to written language, more than one sentence
is readable at a given time. This allows for review of information. Without
such review, it is very likely that, due to a lapse of attention, some meaning is
lost. The same happens in listening to a message in comparison with reading
it. Since the probability of missing one word or sentence is generally non-zero,
in a long and complex message, listening is most likely to lead to some loss
of meaning. In reading, such a loss of meaning due to lapse of attention is
prevented by the possibility of re-reading the sentence which had not been
attended to properly.
In captioned sign language, even more so than in listening to a message, it is
likely that attention will lapse. This is because the viewer must pay attention
simultaneously to both the main scene on the screen and to the captioned signs.
This situation is particularly severe when the viewer must pay attention to complex
explanations, as, e.g., in learning about science.
Thus, the possibility of re-reading signed sentences is very useful in general
and especially in cases of complex messages. And this is the basic advantage of
displaying simultaneously more than one signed sentence. On the other hand, the
method has the disadvantage of taking up a larger portion of the screen and thus
interfering with the main scene. This is certainly the case if full body avatars or
human signers were used. But even if the signers were shown only from the waist
up, as discussed previously, showing multiple signed sentences still can disrupt the
main scene significantly unless the signing display is considerably simplified.
This is achieved in our method by using the highly comprehensible semantroid
discussed above, which minimizes the screen area occupied, while maintaining a
high level of signing clarity.
Even with the highly comprehensible semantroid, ultimately the number of
signed sentences shown must be limited by the display area available. We have
found that a reasonable compromise is to display four signed sentences simul-
taneously. This allows for recovering from lapses of attention while at the same
time not occluding the main scene significantly. The method of displaying the
sentences is chronological from right to left so that the method of reading the
sentences is from left to right as in reading English. The scrolling method can
easily be adapted to reading from right to left as in Hebrew or from top to bottom
as in (some) Japanese writing. The latter case may, due to the screen aspect
ratio, require that the sentences displayed be reduced to three.
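The window bookkeeping implied by this scheme can be sketched as a fixed-length queue; the class below is our illustration of one plausible implementation, not a specification from the method. New sentences enter at the right, earlier ones shift left, and the oldest is dropped once all four windows are full.

```python
# Sketch of the four-window scrolling caption row described above.
from collections import deque

class CaptionRow:
    def __init__(self, n_windows=4):
        # Windows stored left-to-right, oldest sentence first, so reading
        # left to right follows chronological order as in English.
        self.windows = deque(maxlen=n_windows)

    def push_sentence(self, clip_id):
        """Start playing a new signed sentence in the rightmost window."""
        self.windows.append(clip_id)  # oldest sentence falls off the left

    def layout(self):
        """Current left-to-right window contents."""
        return list(self.windows)

row = CaptionRow()
for sentence in ["S1", "S2", "S3", "S4", "S5"]:
    row.push_sentence(sentence)
    print(row.layout())
# Final layout: ['S2', 'S3', 'S4', 'S5'] -- S1 has scrolled off after
# remaining available for review while three newer sentences played.
```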
As a practical example, we have applied the highly comprehensible scrolling
semantroid as a subtitling tool for an educational video of a chemistry experi-
ment: the determination of pH of two common household products. A frame
extracted from the video is represented in Figure 7; the entire video is available
upon request.
Figure 7. Frame extracted from the video of the pH experiment with simultaneous display of four animated signing sequences. The arrow indicates the scrolling direction. The numbers indicate the order of appearance on the screen.
The video is currently being used to assess the efficiency of the subtitling
method with a group of deaf children in grades K-3. The evaluation is carried out in
collaboration with the Indiana School for the Deaf (ISD), one of the leading
institutions in Deaf Education. We are presently evaluating the method’s success
based on: 1) deaf children’s reactions (emotional appeal of the model and
willingness to use); and 2) teachers’/parents’ feedback on the degree to which
subtitling with the highly comprehensible semantroid improves deaf children’s
understanding of science concepts presented on video. In addition, the highly
comprehensible subtitling semantroid has been evaluated throughout its development
by deaf adults, Purdue faculty, and students knowledgeable in sign language and
deaf-related issues who have provided positive feedback on the comprehensibility
of the signs and the effectiveness of the method.
CONCLUSIONS AND FUTURE WORK
In this article, we have introduced a new method of subtitling motion pictures
for deaf children who cannot read and can only communicate via sign language.
The method is based on: 1) the use of highly comprehensible semantroids
(animated 3D avatars reduced to hands and head only) and “optimized” for
maximum comprehensibility of facial expressions and hand gestures; and 2) a
new multi-window, scrolling captioning technique. In order to produce a highly
comprehensible semantroid model, we have considered three variables: mode of
representation (or rendering); color; and facial geometry. In regard to mode of
representation, we have focused on three rendering techniques: 3D shading; 2D
shading; and hybrid shading (combination of 3D and 2D). We have estimated and
compared semantic intensity (and thus comprehensibility) of each representation
and then we have validated the hypotheses through a series of tests on human
subjects which measured the comprehensibility of each representation. Results
showed that, in the context of sign language subtitling (i.e., when the size of the
image is drastically reduced), the most comprehensible representation is obtained
by 2D rendering.
As for the color scheme, we have followed recent research results on color
contrast comprehensibility (Hill & Scharff, 1997) and we have produced a seman-
troid model that, while maintaining realism, has salient features with colors
that make them easily detectable on a small scale.
In regard to facial geometry, based on semantic intensity considerations, as
well as research findings on the ability of “distinctive” faces to augment com-
munication of facial expressions, we have developed a 3D model with emphasized
facial features (semphace). We have then evaluated the comprehensibility of
semphace through psychophysical studies which showed that, in restricted space,
a semphace is more comprehensible than a norm face.
Because of its enhanced comprehensibility, the highly comprehensible seman-
troid (i.e., toon-shaded, optimized for color, and with distinctive facial features)
can be scaled down to fit in a very small frame, thus it is possible to fit four
windows at the bottom of the screen where subtitles normally appear. Each
window displays a movie of one signed sentence; hence the subtitling windows
simultaneously display four sentences. The windows are scrolled from right to left
so that reading proceeds from left to right as in English (the order can be reversed
for other languages).
This being a first step in proposing the “highly comprehensible semantroid
subtitling method,” many aspects need to be tested and improved. For instance,
limits of scalability to smaller images need to be quantified. The semphace model
has been developed for comprehensibility testing purposes, but its aesthetic/
emotional appeal and its effects on child comprehension need to be further
investigated. Optimal number of simultaneous signed sentences is also a factor
to analyze. Could it be better to have only three sentences, but in larger size?
Because of the aspect ratio of the screen, the three sentences could be displayed
vertically. Would this affect the comprehensibility? Would this be an advantage
for persons who eventually will learn to read East Asian languages which are
written from top to bottom?
Currently we are carrying out further experiments to confirm the validity of
the method. Preliminary results of a qualitative evaluation of the new captioning
method with a group of hearing and non-hearing signers confirm the effectiveness
of the new subtitling technique.
Another interesting issue, to be addressed in a future publication, is the
quantification (via semantic intensity calculations as well as empirical studies)
of the minimum image size for which the 3D representation becomes more
comprehensible than the 2D representation. For example, if only two cap-
tioning windows are displayed simultaneously, the size of the windows can
be larger and a 3D rendering of the semantroid could be more readable than
a 2D one.
Although many questions remain to be answered, this method of subtitling
in sign language helps bridge the gap between two rather different views on
presenting information to the Deaf. The two views are at opposite ends in
the scale of abstraction. On one hand (more concrete) we have human signers,
on the other (more abstract) we have symbolic representations of signs. In
between we have the method of subtitling by highly comprehensible seman-
troids, which may be regarded, in fact, as intermediate between subtitling
realized via a human signer (or a signing avatar), and subtitling realized with
an abstract system of sign representation, such as SignWriting. Table 1 shows
the relation of the present method to the other two. Thus, Table 1 shows that
the highly comprehensible subtitling semantroid fills a niche left by the two
extreme cases used so far to help the Deaf understand explanations of visual
information.
APPENDIX A
Experiment 1
Subjects
Fourteen students aged 21-26 years; 8 of the 14 subjects had American Sign
Language skills.
Stimuli
Six sets of 10 images rendered from two different semantroid models; each
image has a resolution of 320 × 120 pixels. Two sets of images (one set per
semantroid model) contain 3D shaded (photorealistic) representations of the
semantroid, two sets contain fully toon rendered (2D) representations of the
semantroid, and two sets contain “hybrid” representations of the semantroid.
Every image of each set shows the semantroid in the neutral position on the
left, and the semantroid with a change in facial expression and/or hand
configuration on the right. For changes in facial expressions we considered
facial signals that are commonly produced during ASL communication based on
[A1] and [A2]: Action Units (AUs) 1, 2, and 15 from the Facial Action Coding
System (FACS) [A3], blinks, changes in pupil size, and changes in gaze direction.
Hand configurations were selected from the ASL finger-spelling alphabet
[A4]. Figure A1 shows a sample test image from one of the 3D shaded semantroid
sets on the left; a sample test image from one of the 2D semantroid sets in the
middle; and a sample test image from one of the “hybrid” shaded semantroid
sets on the right.
Procedure
Subjects were presented with images showing different facial expressions and
hand configurations. The images, randomly selected from the three sets (2D, 3D,
Table 1. Comparison between Subtitling Methods

Method                             Level of abstraction                    Motion   Size
Human signer or avatar             Low (concrete)                          Yes      Large
Highly comprehensible semantroid   Medium (parts removed + toon shading)   Yes      Small
SignWriting                        Very high (symbolic)                    No       Small
Figure A1. Frame 1 contains test image 9 (R_AU1) from the 3D shaded semantroid set; frame 2 contains test image 3 (E_look_left_down) from the toon shaded semantroid set; frame 3 contains test image 2 (C_blink) from the hybrid semantroid set.
Table A1. Recognition Speed Results Showing Mean Time Over Average Subject Data for Each Image of Each Representation

                 Image 1          Image 2          Image 3            Image 4          Image 5
                 A_AU2*           C_Blink          E_Look_left_down   I_AU15           N_Dilated_pupil
Representation   Mean   Std.Err   Mean   Std.Err   Mean   Std.Err     Mean   Std.Err   Mean   Std.Err
2D               1.12s  0.080     1.06s  0.094     1.54s  0.069       2.53s  0.091     2.42s  0.089
3D               1.64s  0.078     1.54s  0.089     2.05s  0.080       2.97s  0.098     2.94s  0.078
Hybrid           1.45s  0.092     1.30s  0.098     1.89s  0.078       2.75s  0.099     2.72s  0.098

                 Image 6          Image 7               Image 8                  Image 9          Image 10
                 B_No change      No change_No change   No change_Closed_smile   R_AU1            V_Look_right
Representation   Mean   Std.Err   Mean   Std.Err        Mean   Std.Err           Mean   Std.Err   Mean   Std.Err
2D               1.43s  0.096     1.27s  0.088          1.96s  0.080             1.23s  0.094     1.41s  0.096
3D               1.95s  0.098     1.55s  0.088          2.52s  0.097             1.97s  0.089     2.09s  0.079
Hybrid           1.76s  0.069     1.41s  0.089          2.34s  0.098             1.69s  0.072     1.82s  0.077
Table A2. Mean Time Over Average Subject Data for Each Representation

Representation   Mean
2D               1.60s
3D               2.12s
Hybrid           1.91s
Table A3. Recognition Accuracy

                 Image 1   Image 2   Image 3            Image 4   Image 5
Representation   A_AU2*    C_Blink   E_Look_left_down   I_AU15    N_Dilated_pupil
2D               100%      100%      100%               99%       99%
3D               100%      100%      99%                99%       99%
Hybrid           100%      100%      99%                99%       99%

                 Image 6       Image 7               Image 8                  Image 9   Image 10
Representation   B_No change   No change_No change   No change_Closed_smile   R_AU1     V_Look_right
2D               100%          100%                  100%                     100%      100%
3D               100%          100%                  100%                     100%      100%
Hybrid           100%          100%                  100%                     100%      100%
and hybrid) were displayed on a 21-inch computer monitor with a resolution of
800 × 600 pixels. Each subject was asked to say “yes” upon detection of a change
in facial expression and/or hand pose between the semantroid on the left and
the one on the right. The response time for each image was recorded by an
experimenter who pressed a key at the end of the response. This stopped the timer
(and the video screen capture) that was started automatically upon display of
the image. After the key press, the image was automatically removed from the
screen and the subject was asked to fill out a feedback form in which she/he
identified the type of change in facial configuration and/or hand pose. The form
was handed to a second experimenter. The procedure was repeated for each image
of each set.
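A minimal sketch of the timing loop implied by this procedure, with console I/O standing in for the actual stimulus display, screen capture, and experimenter key press; names and prompts are placeholders.

```python
# Illustrative reaction-time loop: the timer starts when the stimulus
# is displayed and stops at the experimenter's key press following the
# subject's spoken "yes". This is a stand-in, not the study's software.
import time

def show_stimulus_and_time(image_name):
    start = time.monotonic()  # timer starts with the display of the image
    input(f"[showing {image_name}] press Enter at subject's response ")
    return time.monotonic() - start  # recognition time in seconds

rt = show_stimulus_and_time("I_AU15")
print(f"response time: {rt:.2f}s")
```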
Results
The results obtained during Experiment 1 are reflected in Tables A1, A2,
and A3.
References
A1. Wilbur, R. B. (1987). American sign language: Linguistic and applied dimensions.
San Diego, CA: College-Hill.
A2. Wilbur, R. B. (2004). After 40 years of sign language research, what do we know? In
S. Bradaric-Joncic and V. Ivasovic (Eds.), Sign language, deaf culture, and bilingual
education (pp. 9-38). Zagreb: ERF.
A3. Ekman, P., & Friesen, W. V. (1978). Facial action coding system: A technique for the
measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press.
A4. Flodin, M. (1994). Signing illustrated. New York: Perigee Reference.
APPENDIX B
Experiment 2
Stimuli
For Experiment 2 we followed the same procedure as in Experiment 1 (described
in Appendix A) and we utilized the same subjects. The stimuli consisted of three
sets of new images rendered from the “semphace” 3D model and representing the
same changes in facial expressions and hand configurations as in Experiment 1.
Sample images used in Experiment 2 are represented in Figure B1.
Results
The results obtained during Experiment 2 are reflected in Tables B1 and B2
(recognition accuracy was 100% for all images).
Figure B1. Frame 1 contains test image 2 (C_blink) from the 3D shaded semphace set; frame 2 contains test image 8 (No change_closed smile) from the hybrid semphace set; frame 3 contains test image 4 (I_AU15) from the toon shaded semphace set.
Table B1. Recognition Speed Results Showing Mean Time Over Average Subject Data for Each Image of Each Representation

                  Image 1          Image 2          Image 3            Image 4          Image 5
                  A_AU2*           C_Blink          E_Look_left_down   I_AU15           N_Dilated_pupil
Representation    Mean   Std.Err   Mean   Std.Err   Mean   Std.Err     Mean   Std.Err   Mean   Std.Err
2D Semphace       1.04s  0.078     1.02s  0.091     1.05s  0.071       1.02s  0.086     1.07s  0.085
3D Semphace       1.39s  0.081     1.33s  0.083     1.26s  0.078       1.36s  0.091     1.39s  0.078
Hybrid Semphace   1.28s  0.084     1.20s  0.095     1.19s  0.070       1.24s  0.087     1.30s  0.090

                  Image 6          Image 7               Image 8                  Image 9          Image 10
                  B_No change      No change_No change   No change_Closed_smile   R_AU1            V_Look_right
Representation    Mean   Std.Err   Mean   Std.Err        Mean   Std.Err           Mean   Std.Err   Mean   Std.Err
2D Semphace       1.03s  0.091     1.12s  0.082          1.00s  0.080             1.10s  0.082     1.12s  0.085
3D Semphace       1.45s  0.087     1.35s  0.078          1.42s  0.091             1.47s  0.081     1.41s  0.072
Hybrid Semphace   1.36s  0.078     1.29s  0.092          1.35s  0.099             1.35s  0.078     1.32s  0.083

Table B2. Mean Time Over Average Subject Data for Each Representation

Representation    Mean
2D Semphace       1.06s
3D Semphace       1.38s
Hybrid Semphace   1.29s
ACKNOWLEDGMENTS
We thank Professor Ronnie Wilbur from the Department of Audiology and
Speech Sciences at Purdue University for her valuable contributions to the project.
We also thank Marie Nadolske for performing the signs and for her continuous
help and feedback on the project. In addition, we are grateful to all the signers who
have participated in the evaluation of the method and to the Indiana School for
the Deaf (ISD) for providing the testing ground for a thorough evaluation of the
ASL subtitling technique.
REFERENCES
Adamo-Villani, N., & Beni, G. (2004). Keyboard controlled signing semantroid. IEEE
Proceedings of ICARCV—8th International Conference on Control, Automation,
Robotics and Vision, 741-746.
Adamo-Villani, N., & Beni, G. (2005). Semantic intensity: A measure of visual relevance.
Journal of Information, 9(3), 437-468.
Agrawala, M., & Stolte, C. (2001). Rendering effective route maps: Improving usability
through generalization. Proceedings of ACM Siggraph 2001, 241-249.
Benson, P. J., & Perrett, D. I. (1994). Visual processing of facial distinctiveness. Perception,
23, 75-93.
Blanz, V., & Vetter, T. (1999). A morphable model for the synthesis of 3D faces.
Proceedings of ACM Siggraph, 1999, 187-194.
Captioning Web. (2005). Retrieved online at: http://www.captions.org/.
DeCarlo, D., & Santella, A. (2002). Stylization and abstraction of photographs. Pro-
ceedings of ACM Siggraph 2002, 769-776.
DePaul University. (2005). Sign language project. Retrieved online at:
http://asl.cs.depaul.edu/project.info.html.
Flodin, M. (1994). Signing illustrated. New York: Perigee Reference.
Freire, A., & Lee, K. (2000). The face-inversion effect as a deficit in the encoding of
configural information: Direct evidence. Perception, 29, 159-170.
Gallaudet Research Institute. (2005). Regional and National Summary Report of Data
from the 2003-2004 Annual Survey of Deaf and Hard of Hearing Children and
Youth. Washington, DC: GRI, Gallaudet University [WWW document]. URL:
http://gri.gallaudet.edu/Demographics/2004_National_Summary.pdf
86 / ADAMO-VILLANI AND BENI
Gooch, B., & Gooch, A. (2001). Non-Photorealistic Rendering. Wellesley, MA: AK
Peters, Ltd.
Gooch, B., Reinhard, E., & Gooch, A. (2004). Human facial illustrations: Creation and
psychophysical evaluation. ACM Transactions on Graphics, 23(1), 27-44.
Gravett, P. (2004). Manga: 60 years of Japanese comics (paperback). New York: Collins
Designs.
Hanson, H. M. (1959). Effects of discrimination training on stimulus generalization.
Journal of Experimental Psychology, 58, 321-334.
Hill, A., & Scharff, L. V. (1997). Readability of Websites with various foreground/
background color combinations, font types and word styles. Proceedings of the
Eleventh National Conference in Undergraduate Research, 2, 742-746.
Holt, J. A., Traxler, C. B., & Allen, T. E. (1997). Interpreting the scores: A user’s guide to
the 9th edition Stanford Achievement Test for Educators of Deaf and Hard-of-Hearing
Students. Orlando, FL: Harcourt Publishers.
Laidlaw, D. H. (2001). Loose, artistic “textures” for visualization. IEEE Computer
Graphics and Applications, 21(2), 6-9.
McCloud, S. (1993). Understanding comics. New York: HarperCollins.
Ramachandran, V. S., & Hirstein, W. (1999). The science of art: A neurological theory
of aesthetic experience. Journal of Consciousness Studies, 6(6-7), 15-51.
Searcy, J. H., & Bartlett, J. C. (1996). Inversion and processing of component and
spatial-relational information in faces. Journal of Experimental Psychology: Human
Perception and Performance, 22(4), 904-915.
Sutton, V. (1974). Sign writing. Retrieved online at:
http://www.signwriting.org/labout/what/what02.html.
Tanaka, J. W., & Sengco, J. A. (1997). Features and their configuration in face recognition.
Memory and Cognition, 25(5), 583-592.
Tversky, B., & Baratz, D. (1985). Memory of faces: Are caricatures better than photo-
graphs? Memory and Cognition, 13, 45-49.
Wilbur, R. B. (2004). After 40 years of sign language research, what do we know? In
S. Bradaric-Joncic & V. Ivasovic (Eds.), Sign language, deaf culture, and bilingual
education (pp. 9-38). Zagreb: ERF.
Wilson, B., & Ma, K. L. (2004). Rendering complexity in computer-generated pen-and-ink
illustrations. Proceedings of ACM Siggraph 2004, 103-111.
Direct reprint requests to:
Nicoletta Adamo-Villani
Dept. of Computer Graphics Technology
Purdue University
1419 Knoy Hall
401 N. Grant St.
West Lafayette, IN 47907
e-mail: [email protected]