
Exploration and Implementation of a Next Generation Telepresence System

Ramachandra Budihal, Navaneeth Mohanan, Sahil A. Anand and Saish Satish Kamat

Abstract—Human communication includes not only spoken language but also non-verbal cues, such as hand and body gestures and facial expressions, to communicate our thoughts and feelings and to gather feedback. Telepresence systems of today use 2-way audio and video transmission to carry this non-verbal information. In this paper, we introduce a novel Experiential Telepresence System, which possesses cognitive intelligence and is also context-aware, i.e., it is aware of the multiple components of communication and the ambience in which it communicates, both verbal and non-verbal, making the telepresence experience far more immersive than its peers. This is achieved using a 3-tier architecture comprising a Humanoid Robot, a Cognitive Collective Intelligence Platform on Cloud and an Experience Centre. Towards the end, a performance analysis coupled with a qualitative analysis of user perception, which in other words measures the Quality of Experience of the system, shows that the acceptability and user experience of our system are far higher than with traditional telepresence and video conferencing.

Index Terms—Experiential Telepresence, cognition, augmented reality, context-awareness, humanoid robot, Affective Interfaces, Tele-operation, Collective Intelligence on Cloud, SLAM, Cloud Robotics, Quality of Experience (QoE, QoX), Quality of Service (QoS), User Experience (UX)

I. INTRODUCTION

INTRINSICALLY, human communication can be broken down into verbal and non-verbal components. Face-to-face communication (Fig. 1) is considered one of the most effective forms of communication, as it propagates both components without restriction [1]. When it comes to long-distance communication, traditional channels like letters and telephones lack the latter component.

This gave birth to telepresence systems, which ensure that non-verbal communication between individuals is not hindered by the limitations of the channel between them (Fig. 1). Different implementations of telepresence systems have approached this problem in multiple ways.

Companies like Cisco Systems have tackled this problem by launching products like Cisco TelePresence in 2006 [2]. Many more companies, such as Anybots [3], VGo Communications [4] and Gostai [5], have ventured down the path of tele-operated robots in order to add an element of user interactivity to telepresence.

Ramachandra Budihal is with Wipro Technologies, Bangalore, India, e-mail: [email protected].
Navaneeth Mohanan is with India Innovation Labs, Bangalore, India, e-mail: [email protected].
Sahil A. Anand is with India Innovation Labs, Bangalore, India, e-mail: [email protected].
Saish Satish Kamat is with India Innovation Labs, Bangalore, India, e-mail: [email protected].

Figure 1. Three tools for communication

However, in most current implementations, the channel of communication is a means of transmitting mostly audio and visual data to a recipient, who interprets it himself. Our research focuses on making the channel intelligent, so that it is aware of the multiple components of communication it is transmitting and receiving. This brings us to the concept of Experiential Telepresence.

In an Experiential Telepresence System, extra knowledge gathered from diverse sensing systems (sensors + smart apps) is available to the intelligent channel. This extra knowledge is augmented on top of a standard video and audio feed to convey more information than the previously mentioned telepresence systems (Fig. 1). Currently, our channel is able to interpret emotions, detect faces, recognize speakers and gather environmental information.

The idea behind this Experiential Telepresence System originated from a talk presented by Budihal and a team consisting of other authors of this paper at a TED conference in Mysore in 2009. The talk introduced a new model for heritage tourism called E3iT: Engage, Entertain, Educate, immerse and Transform [6]. The model stresses the need for an immersive experience in order to convey the story and history behind a heritage site.

In this paper, we discuss the overall architecture and implementation of our Experiential Telepresence System, along with a comparison against a few of its commercial counterparts. Towards the end of the paper, we briefly mention the application areas of our telepresence system.


II. THE EXPERIENTIAL TELEPRESENCE SYSTEM

A. Overview

Our Experiential Telepresence System is a 3-tier architecture consisting of PRATHAM (a humanoid robot), a Collective Intelligence Platform on Cloud and an Experience Centre. All three components are connected via the Internet (Fig. 2). The information and knowledge gathered by multiple intelligent agents/systems, which include a humanoid robot, are the primary knowledge-generating sources. This shared knowledge is made available as crowd intelligence by the Collective Intelligence Platform.

The Collective Intelligence Platform is a knowledge portal responsible for assimilating and disseminating knowledge from multiple robots on a real-time basis. The knowledge generated by the robot is transmitted to the Experience Centre and is responsible for creating context-awareness in the information delivered. This forms the basis of Cloud Robotics.

Cognition and context-awareness are among the key differentiating features of our Experiential Telepresence System, and they are built in at various levels. At the lowest level, the system is aware of the available network bandwidth and is thus able to scale the level of immersiveness up or down in order to maintain optimal performance. The system is also able to recognize people in its environment using facial recognition, gather specific information such as age and profession through social networking sites, and deliver content in a view most suitable to that person. This gives the user, who has created his/her avatar in the humanoid robot and is connected to the Experience Centre, a more immersive experience.

The following sections describe PRATHAM and the Experience Centre.

Figure 2. The Experiential Telepresence System

B. PRATHAM - a Humanoid Robot:

PRATHAM stands for “Personal Robot And Telepresence Humanoid with Autonomous Mobility”. Taking a cue from the popular hypothesis of the Uncanny Valley [7], [8], [9] (Fig. 4), we decided to make PRATHAM a humanoid robot, thus maintaining a social and emotional connection with the people it interacts with.

Figure 3. Anatomy of PRATHAM

Figure 4. Hypothesized emotional response of human subjects following Mori’s statements

The humanoid robot itself consists of a three-layered architecture (Fig. 5) that generates all the necessary knowledge primitives before transmitting them to the Experience Centre.

1) Hardware Layer: At the lowest level, the robot consists of a system of sensors and actuators. Sensors are broadly divided into four types: position, navigation, visual and auditory. Position sensors include a GPS and a compass. Navigation sensors comprise a laser SLAM (Simultaneous Localization and Mapping) module and ultrasound sensors. Visual sensors include a combination of a high-resolution camera and a depth-sensing camera. Finally, auditory sensors include a 6-channel microphone system useful for sound analysis.

The robot also has two actuators: a mobility platform and a 6DOF head motor system. The mobility platform is a 3-wheeled system comprising two feedback-enabled DC motors, which provide the differential drive, and a caster. The 6DOF head motor system is a combination of 3 servo motors connected orthogonally to each other. Together, the two actuators allow a remote user to move the base and the head of the robot.
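For concreteness, the sketch below shows the standard differential-drive kinematics such a mobility platform implies: a desired forward velocity and turn rate are split across the two driven wheels. The paper does not give PRATHAM's wheel geometry or control code, so the function and its parameter values are purely illustrative.

```python
# Standard differential-drive kinematics: split a desired body velocity
# into per-wheel angular speeds. Parameter values are illustrative only.
def diff_drive_wheel_speeds(v, omega, wheel_base=0.4, wheel_radius=0.1):
    """v: forward speed (m/s); omega: turn rate (rad/s).
    Returns (left, right) wheel angular speeds in rad/s."""
    v_left = v - omega * wheel_base / 2.0   # left wheel linear speed (m/s)
    v_right = v + omega * wheel_base / 2.0  # right wheel linear speed (m/s)
    return v_left / wheel_radius, v_right / wheel_radius

# Example: drive forward at 0.5 m/s while turning left at 0.3 rad/s.
print(diff_drive_wheel_speeds(0.5, 0.3))
```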

Figure 5. PRATHAM’s Architecture

2) Middleware Layer: The middleware forms the basic software platform, consisting of ROS (Robot Operating System), a hardware abstraction layer, and Ubuntu Linux as our OS. ROS Diamondback is the subsystem used by all our higher-level software modules and by our hardware abstraction layer. ROS allows the development of modules in a graph architecture, where each module forms a node of the graph and communication between these nodes takes place through a publish-subscribe or a service (request-reply) methodology. The hardware abstraction layer is a set of drivers written for each hardware module. The driver, written in ROS, is the entry point for the hardware into the ROS subsystem. The driver also performs the necessary semantic conversion of data to and from the hardware, depending on the type and make of the hardware. Ubuntu Linux 10.10 was chosen as the OS, keeping in mind its compatibility with ROS Diamondback.
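To illustrate the publish-subscribe methodology just described, here is a minimal rospy sketch of a node that both publishes and consumes a topic; the node name, topic name, message type and rate are invented for the example and are not taken from PRATHAM's code (the Diamondback-era rospy.Publisher signature, without a queue_size argument, is used).

```python
#!/usr/bin/env python
# Minimal rospy node illustrating the publish-subscribe graph pattern.
import rospy
from std_msgs.msg import String

def on_face_event(msg):
    # Any node subscribed to the topic receives the published primitive.
    rospy.loginfo("received knowledge primitive: %s", msg.data)

if __name__ == '__main__':
    rospy.init_node('primitive_demo')
    pub = rospy.Publisher('face_events', String)
    rospy.Subscriber('face_events', String, on_face_event)
    rate = rospy.Rate(1)  # publish once per second
    while not rospy.is_shutdown():
        pub.publish(String(data='person_detected'))
        rate.sleep()
```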

3) Application Layer: The application layer implements the high-level logic of Experiential Telepresence on the robot. It consists of three subsystems: a video encoder, the Experiential Telepresence Stack and a navigation stack.

a) Video Encoder: Video streaming on the robot is a point-to-point transmission. We have used an open-source H.264 encoder called x264 for streaming video at a resolution of 640x480, which ensures high-quality video streaming over the Internet. The high-resolution camera is used to capture the scene the robot is able to see. This camera is placed exactly in the center between the robot's two emotion eyes, giving what is perhaps the first eye-to-eye contact between the user who has created an avatar in this robot and the person interacting with it; this eye contact is a critical and important part of the QoE measure of the communication/interaction.
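As a rough illustration of such a pipeline, the snippet below drives a libx264 encode at the stated resolution and, per Section IV, a constant 250 kbps, via ffmpeg; the capture device, transport, destination host and port are assumptions, since the paper does not describe the actual invocation.

```python
# Hypothetical capture-and-stream pipeline around the x264 encoder.
# Device path, destination host and port are invented for illustration.
import subprocess

subprocess.run([
    "ffmpeg",
    "-f", "v4l2", "-video_size", "640x480", "-i", "/dev/video0",
    "-c:v", "libx264",                                # the x264 encoder
    "-b:v", "250k", "-maxrate", "250k", "-bufsize", "500k",
    "-preset", "ultrafast", "-tune", "zerolatency",   # low-latency encode
    "-f", "mpegts", "udp://experience-centre.example:5000",
], check=True)
```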

b) Experiential Telepresence Stack: The Experiential Telepresence Stack is the source of several knowledge primitives which are fused together at the Experience Centre. In this version of the stack we have implemented facial recognition, emotion recognition and synthesis, sound localization and gesture recognition. The facial recognition primitive recognizes multiple faces through the robot's camera. Emotion recognition can recognize the six basic emotions [10], while emotion synthesis uses expression LEDs on the face to show emotions. Sound localization uses the six-microphone array to localize the source of a speaker, and gesture recognition uses the depth camera to interpret basic human gestures.
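As a flavour of one such primitive, the sketch below detects faces per frame with OpenCV's stock Haar cascade. The paper does not name the library or algorithm behind PRATHAM's facial primitive, so treat this only as an analogous, minimal stand-in.

```python
# Minimal face-detection stand-in using OpenCV's bundled Haar cascade.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)  # camera index assumed

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Each (x, y, w, h) box is a per-frame knowledge primitive that a
    # recognition stage could label and forward to the Experience Centre.
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.2, 5):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```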

c) Navigation Stack: The navigation stack handles the control aspects of the humanoid robot. There are two modes of Experiential Telepresence, manual and autonomous, which are explained in detail in the following section.

C. Experience Centre:

As mentioned earlier, a user of this system logs on to the Experience Centre in order to experience a remote location. Fig. 6 shows a user at our Experience Centre. The Experience Centre fuses the data from the various perception primitives of the robot and currently displays it using augmented reality [11]. It consists of specific external aids and a neat visual user interface.

Figure 6. A user at our Experience Centre.

1) External Aids: To build this fully immersive experience, we found the use of just a desktop monitor and a mouse to be insufficient. In order to make the user oblivious to his current environment and to immerse him in the remote location, we used a head gear (Fig. 6) that displays the perception of the robot. The head-tracking sensors on the head gear detect the user's head orientation, which is then mimicked using the robot's 6DOF head motor system.
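A minimal sketch of that mimicry is given below, assuming the tracker reports yaw, pitch and roll in degrees; the axis names, servo limits and the way commands reach the robot are assumptions, as the paper only states that head orientation is mirrored.

```python
# Hypothetical mapping from tracked head orientation to head servos.
def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def head_pose_to_servo_angles(yaw_deg, pitch_deg, roll_deg):
    """Clamp tracked angles (degrees) to assumed safe servo ranges."""
    return {
        "pan":  clamp(yaw_deg,   -90.0, 90.0),   # look left/right
        "tilt": clamp(pitch_deg, -45.0, 45.0),   # look up/down
        "roll": clamp(roll_deg,  -30.0, 30.0),   # lean head sideways
    }

# Example: a user looking 120 degrees left is clamped to the servo limit.
print(head_pose_to_servo_angles(120.0, -10.0, 5.0))
```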

In manual navigation mode, a joystick (Fig. 7) is used to navigate the humanoid robot from the Experience Centre. In addition, the robot is fitted with obstacle-detection sensors that provide navigation assistance by overriding the user's control in case of an emergency.
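One plausible shape for that override is sketched below: the joystick command is passed through, slowed, or zeroed depending on the nearest obstacle distance. The thresholds and command format are illustrative assumptions, not values from our system.

```python
# Sketch of the emergency override: scale or zero the joystick command
# as obstacles get close. Thresholds are assumed for illustration.
STOP_DISTANCE_M = 0.4   # assumed emergency-stop radius
SLOW_DISTANCE_M = 1.0   # assumed distance at which slowing begins

def safe_drive_command(v, omega, min_obstacle_distance):
    """Return (v, omega) after applying navigation assistance."""
    if min_obstacle_distance < STOP_DISTANCE_M:
        return 0.0, 0.0  # emergency: override the user's control entirely
    if min_obstacle_distance < SLOW_DISTANCE_M:
        scale = ((min_obstacle_distance - STOP_DISTANCE_M)
                 / (SLOW_DISTANCE_M - STOP_DISTANCE_M))
        return v * scale, omega  # ramp speed down as the obstacle nears
    return v, omega  # pass the joystick command through unchanged
```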

The PRATHAM system also has a feature that provides a guided tour to its user [12]. When the user clicks a location on the map provided, the robot autonomously navigates to that location using either the laser range-finder or the depth camera.
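Such click-to-goal navigation is conventionally done in ROS by sending the clicked map coordinate to the navigation stack as an action goal. The sketch below uses the standard move_base interface; the paper does not confirm that PRATHAM implements it this way, and the frame name and coordinates are illustrative.

```python
# Sending a clicked map location as a goal via the standard ROS
# move_base action interface.
import actionlib
import rospy
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

def go_to(x, y):
    client = actionlib.SimpleActionClient('move_base', MoveBaseAction)
    client.wait_for_server()
    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = 'map'
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x
    goal.target_pose.pose.position.y = y
    goal.target_pose.pose.orientation.w = 1.0  # keep heading unrotated
    client.send_goal(goal)
    client.wait_for_result()  # blocks until the robot reaches the goal

if __name__ == '__main__':
    rospy.init_node('guided_tour_demo')
    go_to(3.5, 1.2)  # metres, in the map frame
```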

Figure 7. Joystick (left), Vuzix iWear VR920 (right)

2) Visual User Interface: The visual user interface (Fig. 8) performs the task of fusing the additional knowledge originating from the robot. It augments this new knowledge on top of the video feed from the robot's camera to help the user perceive the environment better. At the lower right corner of our UI we show GPS information, which gives the current position and bearing of the robot on a map. At the lower left corner we have navigation assistance controls: as the user nears an obstacle, the navigation assistance warns the user of the direction of the obstacle so that the necessary actions may be taken to avoid it. The user interface also shows information regarding the temperature, wind speed and wind direction at PRATHAM's location. In addition, PRATHAM's facial recognition system augments information regarding the people it sees through the camera, and PRATHAM also identifies buildings and structures based on its GPS location and augments information regarding them.

Figure 8. User Interface Design

III. RESULTS

After 8 months of development, PRATHAM was successfully demonstrated at several locations (Fig. 9).

Figure 9. PRATHAM in an outdoor environment

A. Methodology of Measurement:

Conventionally, most system benchmarks and measurements were done by subject-matter experts from engineering, and they were invariably concerned with network performance and Quality of Service (system uptime, MTBF, jitter, packet loss, bit error rate, etc. were some of the key measurements). Business executives then started talking about average revenue per user and customer addition and attrition parameters, implemented through service mechanisms such as SLAs in the communication information management systems they managed. Today, more analysis is sought from the user's perspective. The famous saying by D. R. Scoggin, "The only way to know how customers see your business is to look at it through their eyes", prepares the ground for involving psychologists and human-behaviour experts to add value through a measure called Quality of Experience (QoE, QoX).

Our evaluation therefore has two parts. The first takes the normal engineering perspective of measuring system performance and benchmarking, which serves mostly as an objective measure; this alone, however, does not suffice for customer satisfaction. Customer satisfaction derives largely from the user's perception, which in turn is shaped by the overall experience users perceive after being exposed to the system. This is a purely subjective measure, expressed on the basis of their feelings: the overall value people perceive in a product or concept. (A classical example is the success of Apple's iPod against the various similar products that existed even before the iPod entered the market, where the UX and design elements of the product delivered major subjective gains over and above other system innovations.) This forms the second part of the evaluation.

Quality of Experience: There have been many definitions of this term. Wikipedia states it as "a subjective measure of a customer's experiences with a vendor" [13]. K. Kilkki defines it as the "basic character or nature of direct personal participation or observation" [14]; he further breaks it into multiple measures from different user perspectives and relates it to Quality of Service. Fig. 10, taken from [14], defines it in terms of the components of a communication ecosystem.

Figure 10. Key components of a measurement in a communication ecosystem

B. Performance Analysis:

1) PRATHAM's Benchmark Specifications: Table I shows the benchmark specifications that emerged after approximately 200 hours of testing.

Table I. PRATHAM'S BENCHMARK SPECIFICATIONS

2) Comparison Against Peers: In order to help position our system with respect to similar systems, Table II shows a comparison of PRATHAM against three popular commercial telepresence systems: QB by Anybots [3], VGo by VGo Communications [4] and Jazz by Gostai [5].

Table II. COMPARISON OF PRATHAM AGAINST QB, VGO, JAZZ ROBOTS


C. Qualitative Analysis of User Perception:

A qualitative survey was performed during a workshop and presentation of the Experiential Telepresence System. A total of 40 people participated in the study, rating their experience of using the Experiential Telepresence System in indoor as well as outdoor environments. Table III gives the details of the study participants.

Table III. SURVEY SAMPLE DISTRIBUTION

The questions were answered on a seven-point Likert scale (from 1 to 7, where a low score depicts a lower level of engagement, quality, or whatever the measure is) and were analyzed using Analysis of Variance (ANOVA) [15], which highlights statistically significant differences in the means between samples from groups (Table IV). One of the outcomes of this analysis is the p value, which tells us the likelihood of this particular pattern of differences in the group means arising at random: if p is close to 1, there is a high likelihood that the difference would show up at random; if p < 0.05, there is less than a 5% probability that the difference was caused by chance.

Table IV. SURVEY RESULTS
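To make the procedure concrete, the sketch below runs a one-way ANOVA with scipy on invented Likert-style ratings; the survey's actual data is in Table IV and is not reproduced here, so the numbers are illustrative only.

```python
# One-way ANOVA on invented seven-point Likert ratings.
from scipy import stats

group_a = [6, 7, 6, 5, 7, 6]  # hypothetical ratings, group A
group_b = [4, 5, 5, 4, 6, 5]  # hypothetical ratings, group B
group_c = [5, 6, 5, 6, 6, 7]  # hypothetical ratings, group C

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print("F = %.2f, p = %.3f" % (f_stat, p_value))
# p < 0.05 would mean the difference in group means is unlikely
# to have arisen by chance.
```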

Users were asked to rate the Experiential Telepresence System, along with existing telepresence systems, on a 10-point scale (higher score = better quality of experience). This data was taken only from users who currently use or have used such systems in the past (Table V).


Table V

The above data indicates that user perception of the experience and acceptance of an Experiential Telepresence System is far higher than for conventional telepresence systems. A few users felt that greater focus was required on the driving task initially, but they were able to adjust to the controls in a short amount of time. The UI provided information on the obstacles present in the scene, and the navigation assistance meant for obstacle avoidance proved to be of great help when steering in indoor environments. Users were happy with the overall experience and found the activity quite engaging.

Please note that, since the number of people surveyed is small, the results are only indicative and do not necessarily prove that our system is better than the other systems mentioned. As we receive feedback from more users, the statistics will become more accurate.

IV. CONCLUSION

In this paper, we have described the overall architecture and implementation of our Experiential Telepresence System. Our research was aimed at improving the user's experience of a remote location through the addition of context-aware visual data over a standard telepresence system.

Our immediate focus now is on the implementation of variable-bit-rate video transmission. The current system uses H.264 compression at a constant bitrate (250 kbps). Seamless streaming of video requires high available bandwidth and low traffic on the network; under scenarios where network quality has been poor, frame loss and temporary freezing of the video feed have been observed. Such issues are not desirable in a good telepresence system. They can be addressed by changing the compression from constant to variable bitrate, which alters the compression ratio of the video stream subject to a maximum bit rate (limited by the available bandwidth) so that the video plays smoothly over the network without frame loss or freezes.
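The core of that adaptation can be stated in a few lines: pick the next encoding interval's target bitrate from a bandwidth estimate, capped at a maximum. The floor and headroom factor below are illustrative assumptions, not values from our system.

```python
# Sketch of variable-bitrate selection for the H.264 encoder.
MAX_BITRATE_KBPS = 250   # ceiling, matching the current constant rate
MIN_BITRATE_KBPS = 64    # assumed floor to keep the stream alive
HEADROOM = 0.8           # assumed: use only 80% of measured bandwidth

def target_bitrate_kbps(measured_bandwidth_kbps):
    """Target bitrate for the next encoding interval (kbps)."""
    candidate = measured_bandwidth_kbps * HEADROOM
    return int(max(MIN_BITRATE_KBPS, min(MAX_BITRATE_KBPS, candidate)))

for bw in (500, 260, 120, 40):           # example bandwidth samples
    print(bw, "kbps link ->", target_bitrate_kbps(bw), "kbps target")
```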

In our current implementation of the Experiential Telepresence System, we have only touched upon the auditory and visual senses of the user. In order to immerse the user further, we need to tap into other senses, such as the olfactory, tactile and gustatory senses, as well. Thus, concepts like haptics [16] and mixed reality are some of the future additions to our Experiential Telepresence Stack. These shall allow the user to feel the real physical environment at the remote location through touch and, at the same time, interact with virtual elements perceived by the robot.

Experiential Telepresence Systems have a wide variety of application areas. At India Innovation Labs, our primary study is in the area of digital tourism. The concept allows tourists to remotely experience a tourism site through our experiential system. In addition to digital tourism, Experiential Telepresence may be used for distance education, for hospitality at large office campuses [17] and at retail outlets.

We thus believe that our architecture will serve as a platform for next-generation telepresence systems and continue to improve the user's experience.

ACKNOWLEDGMENT

We would like to thank the Board of Trustees of India Innovation Labs for their support, especially Mr. NAPS Rao and Prof. Prahladacharya. We would like to acknowledge our core team, including Mr. Viswanath Buravalla, I. Vijay Kumar, V. R. Venkatesh and B. D. Vijaya, for their unfailing encouragement. A special acknowledgement also goes to our well-wishers from Wipro Technologies, especially Mr. Anant C. D. and Dr. Anurag Srivastava, CTO. Finally, we would like to acknowledge our colleagues Ms. Aarushi Khanna, Mr. Maruthi R. and the students of R.V. College of Engineering and the National Institute of Technology, Karnataka, who have been associated with the development of PRATHAM over the course of the last year.

REFERENCES

[1] A. Chapanis, "Interactive human communication," Scientific American, vol. 232(2), March 1975.
[2] H. S. Lichtman, "A brief history of telepresence," February 2007. [Online]. Available: http://www.telepresenceoptions.com/
[3] Anybots, "Introducing anybots, qb, telepresence robot!!" [Online]. Available: https://www.anybots.com/
[4] VGo Communications, "Introducing vgo secure, simple, affordable," 2010. [Online]. Available: http://www.vgocom.com/
[5] Gostai, "Robotic telepresence." [Online]. Available: http://www.gostai.com/
[6] "The buzz: Ramachandra budihal augments reality," November 2009. [Online]. Available: http://blog.ted.com/2009/11/06/the_buzz_ramach
[7] "The truth about robotic's uncanny valley - human-like robots and the uncanny valley," Popular Mechanics, January 2010. [Online]. Available: http://www.popularmechanics.com/technology/engineering/robots/4343054
[8] A. P. Saygin, T. Chaminade, and H. Ishiguro, "The perception of humans and robots: Uncanny hills in parietal cortex," CogSci, 2010.
[9] M. Mori, "Bukimi no tani / the uncanny valley," Energy, 1970, pp. 33-35.
[10] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. M. Lee, A. Kazemzadeh, S. Lee, U. Neumann, and S. Narayanan, "Analysis of emotion recognition using facial expressions, speech and multimodal information," in Proceedings of the 6th International Conference on Multimodal Interfaces (ICMI '04). New York, NY, USA: ACM, 2004, pp. 205-211. [Online]. Available: http://doi.acm.org/10.1145/1027933.1027968
[11] R. T. Azuma, "The challenge of making augmented reality work outdoors," in Mixed Reality: Merging Real and Virtual. Springer-Verlag, 1999, pp. 379-390.
[12] K. M. Tsui, M. Desai, H. A. Yanco, and C. Uhlik, "Telepresence robots roam the halls of my office building," HRI Workshop, 2011.
[13] "Quality of experience." [Online]. Available: http://en.wikipedia.org/wiki/Quality_of_experience
[14] K. Kilkki, "Quality of experience in communication systems," Journal of Universal Computer Science, vol. 14, pp. 615-624, 2008.
[15] D. J. Weiss, Analysis of Variance and Functional Measurement. Oxford University Press, October 2005.
[16] A. Ansar, D. Rodrigues, J. P. Desai, K. Daniilidis, V. Kumar, and M. F. M. Campos, "Visual and haptic collaborative tele-presence," Computers & Graphics, vol. 25, no. 5, pp. 789-798, 2001. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0097849301001212
[17] K. M. Tsui, M. Desai, H. A. Yanco, and C. Uhlik, "Exploring use cases for telepresence robots," in Proceedings of the 6th International Conference on Human-Robot Interaction (HRI '11). New York, NY, USA: ACM, 2011, pp. 11-18. [Online]. Available: http://doi.acm.org/10.1145/1957656.1957664