Modeling YouTube QoE based on Crowdsourcing and Laboratory User Studies
Dr. Tobias Hoßfeld University of Würzburg, Institute of Computer Science, Chair of Communication Networks
Research Group on “Future Internet Applications & Overlay” Am Hubland, 97074 Würzburg, Germany
tobias.hossfeld@uni‐wuerzburg.de http://www3.informatik.uni‐wuerzburg.de/research/fia/
COST Action IC 1003 “European Network on Quality of Experience in Multimedia Systems and Services
(QUALINET)”
STSM title Modeling YouTube QoE based on Crowdsourcing and Laboratory User Studies
Short name YouTubeCrowd
Reference COST‐STSM‐IC1003‐8227
Period from 2011‐08‐15 to 2011‐09‐30
Beneficiary Tobias Hoßfeld, University of Würzburg
Location FTW Forschungszentrum Telekommunikation Wien GmbH, Wien, Austria
Host Dr. Raimund Schatz, FTW Forschungszentrum Telekommunikation Wien GmbH
Table of Contents
1 Motivation of the STSM
2 Purpose of the STSM
3 Description of Work
4 Main Scientific Results
  4.1 Crowdsourcing Task and Experiment Design
  4.2 YouTube QoE
  4.3 Comparison of Crowdsourcing and Laboratory Results
    4.3.1 Qualitative Comparison Crowdsourcing vs. Laboratory Studies
    4.3.2 Quantitative Comparison regarding YouTube QoE
  4.4 Impact of Unreliable Users on QoE
    4.4.1 Rating Behavior of Fake Users
    4.4.2 Statistical Measures and Fake User Ratings
5 Summary of Relevance for QUALINET
6 Outcome of the STSM
7 References
1 Motivation of the STSM

Video streaming dominates global Internet traffic and is expected to account for 57% of all consumer Internet traffic in 2014, generating over 23 Exabytes per month [1]. The most prominent video streaming portal is YouTube, which accounts for more than two billion video streams daily. YouTube videos are streamed over the Internet by means of the HTTP protocol, which is responsible for the major part of residential broadband Internet traffic [6]. However, many users face volatile performance of the service, e.g. due to bad network conditions, temporary overutilization of shared network resources, or congested media streaming servers. Such adverse conditions are the main causes of bad online video QoE. Network and service providers need to be able to observe and react to quality problems, at best before the customer takes notice of them. Therefore, appropriate QoE models and metrics are required, like those provided by this STSM on YouTube video streaming.
In the domain of video streaming, traditional UDP‐based services like IPTV or Real Media streaming typically do not guarantee packet delivery. Thus, congestion in the network or at the multimedia servers leads to lost packets causing visual artifacts, jerky motion or jumps in the stream, forms of degraded media quality which have been extensively studied in previous video quality research. In contrast, delivery of YouTube video to the end user is realized as progressive download using TCP as transport protocol. The usage of TCP guarantees the delivery of undisturbed video content, since the protocol itself handles the retransmission of corrupted or lost packets. Further, it adapts the transport rate to network congestion, effectively minimizing packet loss. However, if the available bandwidth is lower than the video bit rate, video transmission becomes too slow, gradually emptying the playback buffer until an underrun occurs. If rebuffering happens, the user notices interrupted video playback, commonly referred to as stalling. In this respect, YouTube QoE is different from traditional UDP‐based video streaming, since with TCP only the video playback itself is disturbed while the transmitted audiovisual content remains unaltered. Consequently, this STSM focuses on YouTube video streaming, setting out to develop a YouTube QoE model that takes into account characteristic effects like stalling. This in‐depth investigation of temporal impairments in the context of the most popular TCP‐based online video service (YouTube) is novel in the field of QoE modeling for video streaming.
Due to the lack of QoE models explicitly addressing stalling effects in the context of TCP‐based online video, subjective user studies need to be performed. Such studies are typically carried out by a test panel of real users in a laboratory environment. While many and possibly even diverging views on the quality of the media consumption can be taken into account – entailing accurate results and a good understanding of the QoE and its sensitivity – lab‐based user studies can be time‐consuming and costly, since the tests have to be conducted by a large number of users for statistically relevant results. Costs and time demands further increase if the design and the execution of the tests as well as the analysis of the user ratings are performed in an iterative way. This means that the YouTube QoE model is developed through repeated cycles of design, implementation, and statistical analysis of the tests. This iterative approach is unavoidable when touching new QoE aspects like stalling effects. For this purpose, crowdsourcing seems to be an appropriate alternative approach. Crowdsourcing means to outsource a job (like video quality testing) to a large, anonymous crowd of users in the form of an open call. Crowdsourcing platforms like Amazon Mechanical Turk or Microworkers offer access to a large number of internationally widespread Internet users and distribute the work submitted by an employer among these users. The work is organized at a finer granularity and jobs are split into cheap micro‐tasks that can be performed quickly by the crowd.
With crowdsourcing, subjective user studies can be efficiently conducted at low cost with adequate user numbers for obtaining statistically significant QoE scores [8]. In addition, the desktop‐PC based setting of crowdsourcing provides a highly realistic context for scenarios like online video. However, reliability of results cannot be assumed due to the anonymity and remoteness of participants. Some subjects may submit incorrect results in order to maximize their income by completing as many jobs as possible; others just may not work correctly due to lack of supervision. Therefore, it is necessary to develop an appropriate methodology that addresses these issues, ensures consistent behavior of the test subjects throughout a test session, and thus yields reliable QoE results.
2 Purpose of the STSM

Research interests. The STSM applicant's research interests mainly concern the following four aspects: 1) statistical analysis and characterization of QoE measurements; 2) identification of key influence factors of video streaming services with stalling and appropriate QoE models; 3) subjective test design with respect to stalling as well as with respect to crowdsourcing; and 4) automated detection of unreliable user ratings. Beyond this, the research interests cover QoE models for web and cloud applications and services as well as the application of these QoE models for improving existing and emerging multimedia systems and services.

Collaboration. The goal of this STSM is to foster collaboration between both institutions in the context of modeling the influence of stalling on video streaming as well as in the development of appropriate subjective testing methodologies. Thereby, the applicant would benefit from FTW's experience in QoE measurements [15][16][17][18][19][20], while FTW would take advantage of the applicant's expertise in crowdsourcing [12][13] and in modeling QoE for different applications [8][21][23][29][30][31]. First, a comparison and combination of the different QoE assessment methodologies, i.e. crowdsourcing and laboratory user studies, will improve the quality of the QoE testing procedures. Second, both partners have already performed user studies in the context of YouTube. To this end, the measurement results will be exchanged and provide the fundamentals for a QoE model for YouTube. Third, different modeling approaches have been applied so far, since both partners have a different background and approach to describing QoE. Based on psychophysics, QoE can be described via stimuli resulting in a logarithmic relationship [17], while alternatively the relationship between QoE and quality degradation is derived as an exponential interdependency [11][22][23]. Thus, the STSM will lead to interesting discussions and exchange of knowledge.
Exchanging expertise. Currently, Tobias Hossfeld is leading the FIA group on “Future Internet Applications and Overlays” at the Chair of Communication Networks led by Prof. Phuoc Tran‐Gia at the University of Würzburg. One of the major research topics of the FIA group is related to QoE, which is reflected in several research projects like the BMBF funded G‐Lab [7], the FP7 Network of Excellence Euro‐NF [3] and the FP7 STREP SmoothIT [5]. Therefore, the STSM will foster collaboration between different research projects, as the host institution is involved in the COMET projects ACE and ACE2.0 [1] (headed by Dr. Raimund Schatz) and the FP7 STREP project Optiband [4]. The collaboration is also of interest for the User‐centered Interaction and Communication Economics group at the Telecommunications Research Center (FTW). In FTW's laboratory, several studies on QoE have already been conducted and are of interest for this joint activity. The STSM applicant, Tobias Hossfeld, has profound experience and knowledge in different aspects of this research proposal. The identification of influence factors on the QoE for video streaming suffering from imperfect network conditions like packet loss was discussed in [25][30][31]. Subjective user studies and their statistical evaluation were conducted for web traffic and cloud gaming [14][27]. The anatomy of a crowdsourcing platform, the demographics of its users, as well as cheat detection mechanisms to obtain reliable results were investigated in [12][13][26]. A measurement‐based characterization of YouTube videos and YouTube traffic patterns was outlined in [24].
3 Description of Work

This STSM was designed to fulfill three research objectives (S1‐S3) that address the goals of the COST QUALINET action in the following ways:
(S1) The development of a YouTube QoE model that quantifies key influence factors, i.e. temporal impairments like stalling frequency and duration, leading to a robust metric for mapping these factors to QoE. This outcome is related to the QUALINET objectives Q2¹ and Q3².
¹ Q2: Identification of measurable features which are relevant for the subjectively perceived QoE.
² Q3: Development of effective and robust metrics to objectively measure the quality of a multimedia presentation, as perceived by a human observer, with emphasis on multimedia communication applications.
(S2) Application of the QoE model for optimized media delivery by the service provider. In particular, the dimensioning of the server capacity and the deployment of the server scheduling strategy are investigated, related to QUALINET objective Q4³.
(S3) Comparison of crowdsourcing and laboratory user studies in order to develop an improved methodology for subjective testing based on crowdsourcing. This methodology includes a proper test design and filtering mechanisms to obtain reliable results and is directly related to objective Q1⁴.
However, the development of an appropriate methodology for user studies with crowdsourcing (S3) requires a comparison with similar user studies conducted in laboratory environments, such as the studies performed at FTW. The applicant has designed and implemented a framework for YouTube QoE tests on the Microworkers crowdsourcing platform [12]. In parallel, user studies for assessing YouTube QoE have been conducted in FTW's i:Lab. Thus, the user ratings of both tests were exchanged and statistically analyzed in order to identify the key influence factors of YouTube QoE. In order to fulfill the three research objectives, the following tasks (T0)‐(T5) were conducted.
(T0) Design and execution of subjective user studies. The influence factors on YouTube QoE were identified and representative parameters (like the length of stalling or the frequency of stalling) and stalling patterns derived that are suitable for subjective testing. Additional candidate parameters included the resolution of video contents, the type of video contents, and video motion patterns. Furthermore, the reliability of user ratings had to be secured for the crowdsourcing tests, due to the anonymity of participants. This necessitates additional measures such as consistency‐checking questions as well as monitoring of the user’s activity during the test, e.g. by monitoring the focus time of the web browser displaying the YouTube video. Accordingly, the test framework was implemented for usage in a crowdsourcing platform.
(T1) Statistical analysis of YouTube QoE based on subjective studies. The measurement results of the YouTube user studies being independently performed by FTW and University of Würzburg were exchanged and statistically analyzed. The statistical analysis included an analysis of variance (ANOVA), higher‐order moment analysis, and investigations on user diversity for both test studies.
(T2) Identification of key influence factors. For the purpose of YouTube QoE modeling, the most relevant/dominating influence factors on QoE were identified. This includes a) network level parameters like delivery bandwidth or packet loss causing certain stalling patterns, b) psychological influence factors like type of content or memory [27], or c) influence factors on application level like video resolution. The identification of these influence factors was performed by means of machine learning algorithms, like support vector machines or decision trees, and by statistical tools like Design of Experiments (DoE) plots to visualize main effects and interactions between parameters.
(T3) Comparison of different subjective user study approaches. The results from the crowdsourcing tests and the laboratory tests were compared with respect to reliability of user ratings and the influence of test method on user ratings. Causes for differences in terms of user rating behavior were analyzed in depth.
(T4) QoE Modeling for YouTube. The identification of the key influence factors and the comparison of the different subjective user studies allowed formulating a robust QoE metric. This metric maps the relevant parameters like the stalling pattern to a certain QoE value in terms of Mean Opinion Scores on a certain rating scale. Since the crowdsourcing user studies have to be simple, as we learned while conducting the tests (T0), we used a 5‐point absolute category scale [28].
(T5) Application of the YouTube QoE model. Finally, the derived YouTube QoE model was applied to optimize the media delivery by the service provider. To compensate any QoE degradation during playout due to insufficient network resources, we analyzed the video buffer by means of event‐discrete simulations. If the video is buffered long enough in the beginning, no stalling will occur, at the cost of an increased initial delay. The trade‐off between initial delays and networking costs in terms of bandwidth was quantified.

³ Q4: Development of tools for optimization of existing and emerging multimedia systems, products and services, based on the subjective perception of quality as induced by cross‐modality, saliency, and masking effects, prevalent in multi‐modal content presentation.
⁴ Q1: Development of methodologies to systematically assess the subjectively perceived QoE for multimedia presentations, resulting in reproducible and reliable scores.
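As an illustration of the kind of event‐discrete playout buffer analysis mentioned in (T5), the following minimal sketch simulates a playback buffer behind a constant‐bandwidth bottleneck. All names, the fixed re‐buffering threshold, and the example numbers are illustrative assumptions and are not taken from the actual simulation framework used in the STSM.

```python
# Minimal sketch of a step-based playout buffer simulation (illustrative assumptions:
# constant download bandwidth, constant video bit rate, playback stalls when the
# buffer runs empty and resumes after a fixed re-buffering threshold).

def simulate_playout(video_len_s, bitrate_kbps, bandwidth_kbps,
                     initial_buffer_s, rebuffer_s=1.0, dt=0.01):
    """Return (initial_delay_s, stalling_events, total_stalling_s)."""
    speed = bandwidth_kbps / bitrate_kbps   # seconds of video downloaded per real second
    downloaded, played, t = 0.0, 0.0, 0.0   # video seconds downloaded / played, wall clock
    stalls, stall_time = 0, 0.0
    # initial buffering phase before playback starts
    while downloaded < min(initial_buffer_s, video_len_s):
        downloaded += speed * dt
        t += dt
    initial_delay = t
    playing = True
    while played < video_len_s:
        if downloaded < video_len_s:
            downloaded += speed * dt
        if playing:
            if downloaded - played >= dt:        # enough buffered data for this step
                played += dt
            else:                                # buffer underrun: stalling starts
                playing, stalls = False, stalls + 1
                stall_time += dt
        else:
            stall_time += dt
            # resume once rebuffer_s seconds of video are buffered or the download finished
            if downloaded - played >= rebuffer_s or downloaded >= video_len_s:
                playing = True
        t += dt
    return initial_delay, stalls, stall_time

# Example: 60 s clip at 500 kbit/s over a 400 kbit/s bottleneck with 2 s of pre-buffering.
print(simulate_playout(60, 500, 400, 2))
```

Sweeping the initial buffer size in such a simulation exposes the trade‐off described above: a larger initial delay reduces or removes stalling at the price of a longer waiting time before playback starts.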
In addition to these main steps of the STSM, some other topics and issues emerged in discussions with Dr. Raimund Schatz, Alexander Platzer, and Sebastian Egger during the research stay:
(a) Intensive literature study on
existing HTTP video streaming and YouTube QoE models,
popularity and characteristics of YouTube videos,
traffic characteristics of YouTube streams, its implemented application flow control, as well as server selection strategies.
(b) Subjective test design for modeling YouTube QoE for web applications with remote participants, e.g. from social networks.
(c) Quality of experience and quality acceptance studies.
(d) Influence of transport protocols on YouTube QoE.
These topics and issues will be investigated in an ongoing collaboration between the host and the beneficiary. In particular, related publications are envisioned addressing these topics.
4 Main Scientific Results

The main scientific results are briefly summarized in this section. More detailed results and figures can be found in the additional slide set attached to this report. First, we take a closer look at how to design subjective tests in the context of crowdsourcing for identifying reliable users. After that, numerical results on YouTube QoE and its key influence factors are presented. Then, the results from the crowdsourcing tests and the laboratory tests conducted at FTW's i:Lab are compared qualitatively and quantitatively. Finally, the impact of fake user ratings, i.e. ratings from unreliable users, on statistical measures like MOS, confidence intervals, standard deviation of opinion scores, etc. is analyzed.
4.1 Crowdsourcing Task and Experiment Design

The test methodology developed throughout this work allows experimenters to conduct subjective user tests about the user‐perceived quality of Internet applications like YouTube by means of crowdsourcing and to evaluate the impact of network impairments on QoE. For the necessary quality assurance of the QoE test results, including the identification of unreliable user ratings, we apply different task design methods before the subjective user tests are conducted by the crowd. Different user study campaigns are designed according to the influence factors under investigation. Only a single stalling parameter is varied, while the others are kept constant. This strict separation supports a proper QoE analysis and the derivation of adequate QoE models. The application‐layer measurements used to obtain realistic stalling patterns can be found in the attached slide set.
The task design methods described in the following paragraphs can be used for different crowdsourcing tasks. Nonetheless, we describe their applicability in the context of evaluating the QoE for YouTube video streaming. For the subjective crowd tests, we recommend combining all mentioned task designs, i.e. gold standard data, consistency checks, content questions, mixed questions and application monitoring.
Gold Standard Data. The most common mechanism to detect unreliable workers and to estimate the quality of the results is to use questions for which the correct answers are already known. These gold standard questions are interspersed among the normal tasks the worker has to process. After the worker submits the results, the answers are compared to the gold standard data. If the worker did not process the gold standard questions correctly, the non‐gold‐standard results should be assumed to be incorrect too. Since subjective quality testing asks for personal opinions, the gold standard data approach has to be applied with care, since user opinions must be allowed to diverge. Still, in our tests we included videos without any stalling and additionally asked participants: ``Did you notice any stops to the video you just watched?''. If a user then noticed stops, we disregarded that user's ratings for the quantification of QoE. We additionally monitored the stalling events on the application layer to exclude any unwanted stops.
Consistency Tests. In this approach, the worker is asked the same question multiple times in a slightly different manner. For example, at the beginning of the survey the worker is asked how often she visits the YouTube web page, and at the end of the survey she is asked how often she watches videos on YouTube. The answers can differ slightly but should lie within the same order of magnitude. Another example is to ask the user about his country of origin at the beginning and about his continent of origin at the end. The ratings of the participant are disregarded if not all answers to the test questions are consistent. An unresolved problem concerns subjects that are not willing to provide correct personal data and therefore provide inconsistent data. In that case, the user ratings are rejected, although they could be valid quality ratings.
Content Questions. After watching a video, the users were asked to answer simple questions about the video clip. For example, ``Which sport was shown in the clip? A) Tennis. B) Soccer. C) Skiing.'' or ``The scene was from the TV series... A) Star Trek Enterprise. B) Sex and the City. C) The Simpsons.'' Only correct answers allow the user's ratings to be considered in the QoE analysis.
Mixed Answers. This method is an extension of the consistency tests to detect workers using fixed click schemes in surveys. Usually, the rating scales in surveys are always structured in the same way, e.g. from good to bad. Consequently, workers using a fixed click scheme might bypass automated consistency tests, as always selecting the first or the middle answer results in a consistent survey. An easy way to avoid this is to vary the structure of the rating scales. For example, the options of the first quality question "Did you notice any stops while the video was playing?" have the order ``No'', ``Yes'', whereas in the following question ``Did you experience these stops as annoying?'' the order is ``Extremely'', ``Fairly'', ..., ``Not at all''. Now, following a fixed clicking scheme causes inconsistencies that identify unreliable participants; in this example, it would result in "no stops noticed" but "these stops were extremely annoying".
Application Usage Monitoring. Monitoring users during task completion can also be used to detect cheating workers. The most common approach here is measuring the time the worker spends on the task. If the worker completes a task very quickly, this might indicate that she did the work sloppily. However, it has to be noted that the reaction times of different subjects may differ significantly depending on the actual person.
A more robust method is to monitor browser events in order to measure the focus time, which is the time interval during which the browser focus is on the website belonging to the user test. In order to increase the number of valid results from crowdsourcing, we displayed a warning message if the worker did not watch more than 70 % of the video. The users could decide to watch the video again or to continue the test. When workers became aware of this control mechanism, the percentage of completely watched videos doubled and almost three times more workers could be considered reliable than without the system warning.
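To illustrate how such a focus‐time rule could be evaluated on the logged monitoring data, the following minimal sketch flags ratings where less than 70% of the video was in browser focus. The field names and log layout are illustrative assumptions, not taken from the actual test framework.

```python
# Sketch of the 70% focus-time check; log layout is illustrative.
FOCUS_THRESHOLD = 0.7

def focus_ok(focus_time_s: float, video_duration_s: float) -> bool:
    """True if the browser focus covered at least 70% of the video duration."""
    return focus_time_s >= FOCUS_THRESHOLD * video_duration_s

# Example log entries: (worker_id, focus_time_s, video_duration_s)
log = [("w1", 28.0, 30.0), ("w2", 12.5, 30.0)]
reliable = [w for (w, f, d) in log if focus_ok(f, d)]
print(reliable)   # -> ['w1']
```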
Implementation and Execution of Experiments. The aim of the experiments is to quantify the impact of network impairments on QoE. For YouTube video streaming, network impairments result in corresponding stalling patterns. As the video experience should be as similar as possible to a visit of the real YouTube website, the application should run in the users' default web browser. To this end, an instance of the YouTube Chromeless Player was embedded into dynamically generated web pages. With JavaScript commands the video stream can be paused, a feature we used to simulate stalling. YouTube's standard animated icon was used as visual indicator that the video is being buffered. In addition, the JavaScript API allows monitoring the player and the buffer status, i.e. monitoring stalling on the application layer. In order to avoid additional stalling caused by the test users' Internet connection, the videos had to be downloaded completely to the browser cache before playing. This enables us to specify fixed, unique stalling patterns which are evaluated by several users. Users had to rate the impact of stalling during video playback on a 5‐point absolute category rating (ACR) scale with the following values: (1) bad; (2) poor; (3) fair; (4) good; (5) excellent.
During the initial download of the videos, a personal data questionnaire was completed by the participant, which also included the consistency questions described above. The user then sequentially viewed three different YouTube video clips with a predefined stalling pattern. After each video, the user was asked to rate his personal satisfaction during the video streaming. In addition, we included gold
standard, consistency, content and mixed questions to identify reliable subjective ratings. The workers were not aware of these checks and were not informed about the results of their reliability evaluation.
Relevance for QUALINET WG 5 “Standardization and dissemination”: The developed methodology to detect unreliable subjects should be included in appropriate standards when using remote user panels, e.g. via crowdsourcing platforms or users of social networks. In an analogous way, reliability task design methods should be considered for laboratory tests, too, since rejecting unreliable users should not be done on the basis of the submitted user ratings.
4.2 YouTube QoE

In the crowdsourcing campaigns, we focused on quantifying the impact of stalling on YouTube QoE and varied 1) the number of stalling events as well as 2) the length of a single stalling event, resulting in 3) different total stalling times. We also considered the influence of 4) the different crowdsourcing campaigns and 5) the test video id, in order to take into account the type of video as well as the resolution, used codec settings, etc. Further, we asked the users to additionally rate 6) whether they liked the content (using a 5‐point ACR scale). We collected additional data concerning the background of the user by integrating demographic questions including 7) age, 8) gender, etc. (9‐13) as well as questions regarding their Internet application usage habits (16‐17) into the survey. Furthermore, we additionally collected data such as access network speed (14) and the browser used (15) in order to identify potential influence factors on YouTube QoE. Finally, the key influence factors on YouTube QoE are identified by means of (a) correlation coefficients and (b) support vector machine (SVM) weights. We compute the Spearman rank‐order correlation coefficient between the subjective user rating and the above mentioned variables. In addition, we utilize SVMs as a machine learning approach to build a classification model; every variable gets a weight from the model indicating its importance. However, SVMs act on two‐class problems only; therefore we map the categories 1 to 3 of the ACR scale to the class “bad quality” and the categories 4 to 5 to the class “good quality”. We use the SMO (Sequential Minimal Optimization) implementation in WEKA for the analysis.
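The key influence analysis itself can be reproduced in a few lines. The report uses WEKA's SMO implementation; the sketch below uses scikit-learn's LinearSVC merely as an illustrative substitute, together with the Spearman rank correlation, and the binarization of the ACR scale described above. Variable names and the function signature are assumptions for illustration.

```python
# Sketch: Spearman correlation per influence factor plus linear-SVM weights on the
# binarized ratings (1-3 -> "bad", 4-5 -> "good"). LinearSVC stands in for WEKA's SMO.
import numpy as np
from scipy.stats import spearmanr
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def key_influence(X, ratings, factor_names):
    """X: (n_samples, n_factors) numeric matrix of influence factors, ratings: ACR scores 1..5."""
    X = np.asarray(X, dtype=float)
    ratings = np.asarray(ratings)
    y = (ratings >= 4).astype(int)                       # two-class problem: good vs. bad
    svm = LinearSVC(C=1.0, max_iter=10000).fit(StandardScaler().fit_transform(X), y)
    for j, name in enumerate(factor_names):
        rho, _ = spearmanr(X[:, j], ratings)             # rank correlation with the rating
        print(f"{name:25s} rho={rho:+.2f}  svm_weight={svm.coef_[0][j]:+.2f}")
```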
Figure 1 shows the results from the key influence analysis. On the x‐axis, the different influence factors are listed, while the y‐axis depicts the correlation coefficient as well as the SVM weights, which are normalized to the largest correlation coefficient for the sake of readability. We can clearly observe from both measures that the stalling parameters dominate and are the key influence factors. Surprisingly, the user ratings are statistically independent of the video parameters (like resolution, video motion, type of content like news or music clip, etc.), the usage pattern of the user, as well as the user's access speed (which may reflect the user's expectations).
The analysis has shown that YouTube QoE is mainly determined by stalling and both stalling parameters, i.e. frequency and length. For quantifying YouTube QoE, concrete mapping functions depending on these two stalling parameters have to be derived. To this end, Figure 2 depicts the MOS values for one and three seconds stalling length for a varying number of stalling events, together with exponential fitting curves. The goodness of fit is quantified by the coefficient of determination R², which is close to a perfect match (R² = 1). The x‐axis denotes the number of stalling events, whereas the y‐axis denotes the MOS rating. The results show that users tend to be highly dissatisfied with two or more stalling events per clip. However, for the case of a stalling length of one second, the user ratings are substantially better for the same number of stalling events. Nonetheless, users are likely to be dissatisfied in case of four or more stalling events, independent of the stalling duration.
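The exponential mapping functions of Figure 2 have the form f(N) = a·e^(−b·N) + c. A minimal sketch of such a fit is shown below; the MOS values are placeholders for illustration only, not the study data.

```python
# Sketch of the exponential MOS fit f(N) = a*exp(-b*N) + c with scipy;
# the MOS values below are placeholders, not the measured study data.
import numpy as np
from scipy.optimize import curve_fit

def f(N, a, b, c):
    return a * np.exp(-b * N) + c

N = np.array([0, 1, 2, 3, 4, 6])                  # number of stalling events
mos = np.array([4.9, 3.9, 3.3, 2.9, 2.6, 2.3])    # placeholder MOS values
(a, b, c), _ = curve_fit(f, N, mos, p0=(3.0, 0.5, 1.5))
ss_res = np.sum((mos - f(N, a, b, c)) ** 2)
ss_tot = np.sum((mos - mos.mean()) ** 2)
print(f"f(N) = {a:.2f}*exp(-{b:.2f}*N) + {c:.2f}, R^2 = {1 - ss_res / ss_tot:.3f}")
```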
Figure 1. Key Influence Factors on YouTube QoE.
It has to be noted that we focused in this study on a specific scenario, namely a bottleneck scenario, where the videos are streamed over a dedicated bottleneck with limited but constant network capacity. We conducted a large‐scale measurement campaign to obtain realistic stalling patterns in such bottleneck scenarios, which we then used in our subjective user studies. In particular, uniformly distributed stalling patterns were observed. In practice, however, more complex network scenarios, e.g. considering mobile YouTube users, may occur; these remain an issue for future work. Other influence parameters comprise the duration of the online videos and initial delays, which are to be analyzed by means of subjective user studies for sophisticated quality metrics for online video streaming. The relation between the stalling pattern and the video duration, especially for videos longer than 5 minutes up to a typical movie duration of 90 minutes, has to be quantified with respect to QoE. Furthermore, the impact of initial delays, i.e. waiting times in the beginning before service consumption, on the overall user‐perceived quality is subject of future research.
Beyond QoE studies, the users were additionally asked to rate whether they accept the service quality under such conditions, i.e. stalling impairments. As a result, we found roughly a linear relationship between user acceptance and user perception, until a certain point is reached and user acceptance saturates. Supporting the derived QoE model, the investigation of user acceptance revealed that almost no stalling or only very short stalling is accepted by the users. However, it has to be noted that in the real world, many users like to fully download/buffer a video and then start watching it to get a more fluent experience. Thus, beyond QoE, user acceptance studies and investigations of real‐world user behavior should be considered and, if possible, related to QoE.
Relevance for QUALINET WG 2 “Mechanisms and models of human perception”: The impact of service interruptions and waiting times, similar to stalling, has a strong impact on QoE. Thus, models of human perception for various Internet and cloud applications should be extended in this dimension, since the delivery of services and applications over the Internet may experience such stalling effects.
Relevance for QUALINET WG 3 “Quality metrics”: There are further influence parameters to be analyzed by means of subjective user studies in order to provide sophisticated quality metrics for online video streaming, a) video duration, b) initial delays, and c) complex stalling patterns.
4.3 Comparison of Crowdsourcing and Laboratory Results

The main differences between crowdsourcing and lab studies are compared qualitatively, i.e. considering various effects emerging in subjective studies, and quantitatively, i.e. considering the impact on YouTube QoE.
4.3.1 Qualitative Comparison Crowdsourcing vs. Laboratory Studies

Crowdsourcing study: With crowdsourcing, subjective user studies can be efficiently conducted in a short time at low costs with adequate user numbers for obtaining statistically significant QoE scores. In addition, the desktop‐PC based setting of crowdsourcing provides a highly realistic context for usage scenarios like online video consumption. However, reliability of results cannot be assumed because of the anonymity and remoteness of participants. Some subjects may submit incorrect results in order to maximize their income by completing as many tasks as possible; others just may not work correctly due to lack of supervision. Only unsupervised training phases can be conducted. Crowdsourcing tasks are typically of short nature and thus the user test should be in the order of a few minutes only. The users of crowdsourcing platforms are international.
Figure 2. Mean opinion score of user ratings depending on stalling conditions. [Plot: MOS over the number N of stalling events, measurements and exponential fits: f1(N) = 3.26 e^(-0.37 N) + 1.65 for event length L = 1 s (R² = 0.941); f3(N) = 2.99 e^(-0.96 N) + 2.01 for event length L = 3 s (R² = 0.997).]
However, some platforms like Microworkers.com allow specifying or restricting countries, such that only workers from these countries can conduct a certain campaign.
Laboratory study: In contrast, laboratory studies yield reliable user ratings at the cost of higher study completion times and higher payments to the participating users. The users typically complete a training phase, supervised by an experienced test moderator. The tests can last up to a few hours with breaks in between. The test environment – especially the viewing environment – is the same for all test users and under control. This includes factors like brightness or contrast of the display, viewing distance and viewing angle, environmental lighting, physical size and resolution of the display, etc.
Differences in test design. In general, we distinguish several effects in subjective testing which have to be considered in the test design to reduce biases in subjects' ratings. These effects, their causes and possible countermeasures are briefly explained here and compared with respect to crowdsourcing and laboratory tests.
Subject effect. Human subjects are asked about their perceived quality, and the subject effect occurs due to differences among subjects regarding expectations, the importance they attach to quality features, sensitivity to impairments, etc. As a consequence, the variance of user ratings increases while the reliability of the test decreases. A common countermeasure in the test design is a training phase including an explanation of the test itself and/or some warm‐up sequences. Typically, the subjects are presented the highest and lowest qualities. From a statistical point of view, the number of test users has to be large enough to get significant results. However, social and cultural differences and different user groups will still lead to different perceptions and user ratings. Therefore, in‐depth statistical analysis beyond MOS is of importance.
Uncertainty effect. Subjects are often uncertain how to assess a certain quality absolutely for a certain test condition, especially in the beginning of a test. Training phases are typically conducted to overcome this. However, training phases influence the user ratings, as the subjects are told what good quality is and what bad quality is. While training phases can be conducted in crowdsourcing and laboratory studies, the uncertainty of users can be assessed directly in lab studies when observing the test person.
Get‐tired/fatigue and get‐bored effects. Subjects get tired or bored during long tests. Therefore, breaks within the tests are recommended, e.g. every 20 minutes, and the total test duration should be limited. While the get‐tired effect is not present in (typically short) crowdsourcing tests, the get‐bored effect cannot be directly identified. By means of application monitoring, e.g. the clicking behavior, effects like get‐bored or also uncertainty may be identified only for crowdsourcing. In lab tests, an experienced test supervisor may easily identify both effects.
Scaling effect. A measurement scale like the 5‐point absolute category rating (ACR) scale is used as an interface with the subject. While scales with more items lead to more sensitive results, they are more difficult to use for subjects and increase the uncertainty effect. Inequalities of intervals between two categories on a rating scale also affect the quality assessment. Due to the remoteness of crowdsourcing tests, it is recommended to design the tests as simply as possible; thus a 5‐point rating scale seems appropriate. In contrast, lab tests may use more complex scales explained to the test persons.
Language effects. The language used in the subjective test, i.e. for the test questions, the rating scales, and the considered contents, may influence the quality assessment. Since crowdsourcing platforms give access to users world‐wide, the common language is English, which is not necessarily the native language of the test person. For that reason, the tests and questions have to be designed as simply as possible, so that non‐expert and non‐native users understand them. However, since some crowdsourcing platforms allow restricting the country for a certain campaign, it is possible to conduct the test in native languages.
Context effect. The context effect is caused by the actual assessment situation, i.e. laboratory environment vs. more realistic settings in a crowdsourcing experiment, resulting in differences between both types of studies that cannot be overcome by test design. The context effect includes learning effects, memory effects like the corpus and order effect, as well as environment conditions, which we differentiate in the following.
Corpus effect. The range of quality levels or degradations throughout a test influences the subjects' ratings. The introduction of bad quality conditions in a test leads to higher ratings of all conditions in the test, and vice versa. This is referred to as the “positive” and ”negative” corpus effect. In lab studies, this can be taken into account by presenting the entire range of quality levels to the subject. For crowdsourcing studies, short tests in the order of a few minutes are typical. Hence, it is difficult to test, for example, ten test conditions for a video of 5 minutes. Although a higher payment may encourage users to participate, the reliability of the user is not guaranteed and the advantages of crowdsourcing (mainly many diverse users and low costs) are diminished again; a trade‐off between costs and data quality exists for crowdsourcing studies.
Environment effect. The actually used devices, the implementation of the test, e.g. a video streaming web site within a certain web browser, and the technical conditions (including unexpected failures or impairments of the video quality) will strongly influence the user ratings. Since in lab studies the test environment is the same for all users, the influence of different environment settings (e.g. tests on different mobile devices) can be tested separately. In crowdsourcing tests, the environment can be controlled only partly, if at all. It is, however, possible to restrict campaigns to users who fulfill certain requirements, e.g. that the test has to be conducted on an iPhone device. In general, a large number of test conditions has to be tested (e.g. in terms of different web browsers or end user devices) and the technical environment has to be monitored permanently during a subjective crowdsourcing test.
The following effects are present for crowdsourcing as well as lab studies and can be addressed in the same way.
Content effect. The contents themselves are typically sensitive to certain impairments, which will directly influence the user ratings. In addition, user preferences for the content, i.e. whether they like the content or not, may influence the user rating. For this reason, different contents have to be tested and the users have to be asked about their personal opinion of the content.
Order effect. The previous test condition influences the current user rating. Hence, typically random orders of conditions are presented to the individual subjects or group of subjects.
Learning effect. The subjects learn over time the possible quality levels and adjust their ratings. For that reason, warm‐up phases may be introduced to overcome this or user ratings in the learning phase are neglected.
Next, we illustrate the differences between crowdsourcing studies and laboratory studies considering how the tests themselves are conducted. To this end, we analyze two similar subjective studies conducted via crowdsourcing and at FTW's i:Lab in terms of costs, time, number of users, etc. In addition, we have conducted a third study via Facebook. Since the crowdsourcing platform is only used for acquiring the users while the test user survey is implemented on our own web server, the same user survey can be done with social network users. Thus, the crowdsourcing study was additionally conducted by acquiring users from a social network. These users are mainly students from our university and their friends, such that about 90% of the social network users conducting the test were from Germany. The results are outlined in Table 1.
Table 1. Differences between crowdsourcing and laboratory studies regarding various effects.
| | Crowdsourcing (C) | Social networks (S) | Laboratory test (L) |
|---|---|---|---|
| Implementation time | about 2 weeks; test implemented via dynamic web pages, application monitoring | same as for (C) | – |
| Setup time / instrumentation | 5 minutes to switch on the actual test setting at the web server | same as for (C) | – |
| Time for acquiring people | 5 minutes | 2 hours, as users (groups) were asked individually | – |
| Time for analyzing the data afterwards | 1 day to filter out unreliable users (by means of automatic filtering and manual checking) | same as for (C) | – |
| Campaign submission cost | 16 Euro | 0 Euro | – |
| Subject's reward | 0.15 Euro | 0.00 Euro | – |
| Subject's reward for a single video rating | 0.05 Euro | 0.00 Euro | – |
| Advertised people | 100 | 350 | – |
| Campaign completion time | 31 hours | 26 days; strongly depends on the advertised user groups however | – |
| Participating users | 100 | 95 | – |
| Reliable users (very strict filtering of users) | 30 | 58 | 95% accor… |
| Average bandwidth for downloading the test (indicator for user expectations) | 226 kBps; most of the participating users are from Asia for this campaign | 1184 kBps; most of the participating users are from Germany | N/A |
| Number of different countries of subjects | 30 | 3; strongly depends on the user groups however | – |
It can be seen that crowdsourcing studies are indeed cheaper and completed faster than laboratory studies. However, reliability of users is a big issue. Thus, it is important to include the task design methods in crowdsourcing studies. These methods allow a three‐level filtering of the users, sketched below. The first level identifies crowdsourcing users that gave wrong answers to content questions, that provided different answers to the same rephrased consistency questions, or that often selected the same option during the test; thus, the first level applies consistency tests, content questions and mixed answers. The second level additionally checks whether participants who watched a video with stops noticed the stalling and vice versa, i.e., gold standard data is included in the test. The third level extends the previous filter level by additionally monitoring the application usage: all users that did not watch all three videos completely are removed.
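The following minimal sketch expresses the three filter levels as a single predicate per worker. The dictionary fields are illustrative assumptions about how the per-worker check results might be stored; they are not the actual data model of the test framework.

```python
# Sketch of the three-level filtering described above; field names are illustrative.
def filter_level(worker, level):
    """Return True if the worker's ratings pass the given filter level (1..3)."""
    ok = (worker["content_answers_correct"]          # level 1: content questions,
          and worker["consistency_ok"]               # consistency tests,
          and not worker["fixed_click_scheme"])      # and mixed answers
    if level >= 2:
        ok = ok and worker["stalling_perception_matches_gold"]   # level 2: gold standard data
    if level >= 3:
        ok = ok and all(worker["video_watched_completely"])      # level 3: application monitoring
    return ok

worker = {"content_answers_correct": True, "consistency_ok": True,
          "fixed_click_scheme": False, "stalling_perception_matches_gold": True,
          "video_watched_completely": [True, True, False]}
print([filter_level(worker, lvl) for lvl in (1, 2, 3)])   # -> [True, True, False]
```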
Interestingly, each filter technique reduces the number of valid crowdsourcing workers by approx. 25% on average over all campaigns. This indicates that the consistency tests are quite useful for identifying spammers clicking random answers, while the video content questions and the monitoring of task‐specific parameters (like the focus time) help to identify sloppy workers who do not watch the video carefully enough. Due to our restrictive filtering, only about one fourth of the subjective ratings were finally considered for the YouTube QoE results, as presented in the previous section. However, as we will see later in Section 4.4.2, the last filtering level was too strict and unnecessarily rejected user ratings, which means that the application‐layer monitoring was not working dependably. Thus, it seems that about 50% of the user results can be used. It has to be noted that this is ongoing work (in the domain of crowdsourcing platforms and mechanisms) to develop ‘reliability mechanisms’ for ensuring high data quality.
Table 1 further shows that social networks are also interesting for conducting subjective user studies. However, the same concerns arise as with crowdsourcing in terms of reliability of test users. It is more difficult to motivate people to participate in such surveys, since no monetary incentives are given, and the
demographics of the user cannot be restricted (as is possible for crowdsourcing). However, social networks typically give access to a lot of demographic data which can be used for data analysis.
ACR rejection method. Instead of utilizing additional task design mechanisms to identify unreliable user ratings, ITU‐Recommendation ITU‐R BT.500‐11 proposes a rejection method, based on the actual user ratings. In particular, the distribution of user ratings of a certain test condition, i.e. a YouTube stalling pattern in this case, is assured to be normal by rejecting ratings from users not passing the β2 test. However, our test results showed that this rejection method does not filter out fake users (which were clearly identified by the proposed filtering mechanisms).
Intra‐rater reliability. This measure indicates the reliability of an individual subject across several test conditions. Spearman’s rank correlation coefficient can be utilized to quantify the intra‐rater reliability. However, some people might simply perceive different stalling conditions to be equally annoying. For example, three or four stalling events lead to bad user experience rated with a “1” on 5‐point scale. As a result, intra‐rater reliability decreases. Hence, it might be critical to reject such user ratings.
Relevance for QUALINET WG 3 “Quality metrics”: Psychological aspects like expectations or social context are often neglected in QoE modeling. Crowdsourcing gives the possibility to access different user groups (in terms of expectations/social background) and to quantify possible impact. Expectations for example may be measured in terms of Internet connection speed. Subjective tests could be repeated for different countries (and therewith social context) to quantify a possible influence.
Relevance for QUALINET WG 5 “Standardization and dissemination”: Currently, many researchers are conducting subjective user studies without knowing how to design them and which effects, like corpus, order, or scaling effects, may occur. It would be helpful to have a web site on the QUALINET homepage as a rough guideline for subjective testing, summarizing such effects, typical countermeasures and statistical measures/tools for analysis.
Rejection methods in standards should be updated taking into account additional task‐design based filtering mechanisms.
4.3.2 Quantitative Comparison regarding YouTube QoE

In this section, the YouTube QoE results from the (filtered) crowdsourcing user ratings and from the laboratory study are compared. Figure 3 shows the mean opinion score depending on the number of stalling events, where each stalling event has a duration of 4 seconds. It can be seen that both curves agree and lead to the same mapping functions (within the bounds of confidence intervals at a significance level of 95%).
Figure 3. Comparison of YouTube MOS obtained via crowdsourcing and laboratory studies depending on the number of stalling events (of two seconds length each).
Figure 4. Unbalanced number of ratings due to current crowdsourcing platform lead to significant differences, e.g. for stalling events of 4 seconds length.
[Plots for Figures 3 and 4: MOS over the number of stalling events, comparing crowdsourcing and laboratory results.]
Figure 4 shows the MOS depending on the number of stalling events, which have a length of 2 seconds each. Now, we see a discrepancy between the two curves. A closer look at the figure shows that mainly two measurement points of the crowdsourcing study (for N=2 and N=4 stalling events) are responsible for the deviation. In fact, the other measurement points lie within the confidence interval of the MOS values for the lab study. Thus, except for these two measurement points, similar MOS values are observed.
An analysis of the two ‘outliers’ showed that the number of user ratings contributing to the MOS is unbalanced. In particular, for N=4 stalling events only 16 ratings were obtained, since the other user ratings were rejected according to the filtering process after the test study. The small number of user ratings also explains the larger confidence interval. Thus, in unreliable environments like crowdsourcing, the test conditions should be repeatedly tested until a fixed number of (reliable) user ratings is available. As a further consequence, automated filtering mechanisms are required which are implemented and processed at the web server right after a user finishes a test. It would be desirable if the crowdsourcing platform itself provided mechanisms for identifying reliable users or reliable user results.
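For the per-condition comparison of the two curves, the MOS and its 95% confidence interval have to be computed per stalling condition. The report does not state which interval construction was used; the sketch below uses the common Student-t interval, and the ratings are placeholders.

```python
# Sketch: MOS and 95% confidence interval for one test condition (placeholder ratings).
import numpy as np
from scipy import stats

def mos_ci(ratings, confidence=0.95):
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    half = stats.t.ppf(0.5 + confidence / 2, len(r) - 1) * stats.sem(r)
    return mos, (mos - half, mos + half)

print(mos_ci([4, 5, 4, 3, 4, 5, 4]))   # e.g. ratings for N=0 stalling events
```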
In addition to the quality metrics, we investigated the acceptance of users. To this end, we additionally asked the users, besides rating the user‐perceived quality on a 5‐point ACR scale, to answer an acceptance question. Figure 5 shows the relation between acceptance and user‐perceived quality for the three different studies, i.e. conducted in the i:Lab, via Facebook and via crowdsourcing, respectively. In particular, we considered for each study all users who rated the quality to be x (on the rating scale). The y‐axis then depicts the ratio of these users who also accept the quality.
It can be seen that there are strong differences across the different studies. One reason is that the acceptance questions were different, i.e. the language effect comes into play. In the i:Lab study, the users were asked (in German): “Gesamtqualität akzeptabel? Ja / Nein, ich würde die Sitzung abbrechen und später erneut probieren.” (“Overall quality acceptable? Yes / No, I would abort the session and try again later.”). The crowdsourcing users were asked: “Would you watch a video clip on YouTube if it had these stops? Yes / No.“ However, in the Facebook study and the crowdsourcing study, the users were asked the same question. Thus, the differences between these two studies can only be explained by differences in the demographics and related psychological influence factors like expectations.
A further observation is a linear relation between acceptance and user perception until a certain saturation point is reached. In all three studies, a linear approximation seems to be quite reasonable, which is an interesting result and should be studied in more detail.
Relevance for QUALINET WG 3 “Quality metrics”: Generic relationships between user acceptance and user perception are an important extension for further analysis and subjective user studies.
Relevance for QUALINET WG 4 “Databases and validation”: Databases considering user perception as well as user expectations are of high interest.
Relevance for QUALINET WG 5 “Standardization and dissemination”: The results from the YouTube QoE study and the comparison of different lab environments are very promising to additionally take into account crowdsourcing as appropriate vehicle for conducting subjective user studies, especially for web applications and cloud applications typically also running within a web browser.
Figure 5. Relation between acceptance and user perceived quality.
[Plot: ratio of users accepting the quality at a given rating value over the user rating (1-5), for the i:Lab, Facebook, and crowdsourcing studies.]
4.4 Impact of Unreliable Users on QoE

In crowdsourcing studies, unreliable users may participate in the subjective tests. The question arises to what extent unreliable users influence QoE scores and statistical measures. In the following, we denote the ratings from unreliable users as ‘fake user ratings’.
4.4.1 Rating Behavior of Fake Users

Since the fake user ratings are not correlated with the stalling conditions, these ratings are assumed to be a sequence of independent and identically distributed (i.i.d.) random variables. Figure 6 shows the distribution of the fake user ratings in comparison to a discrete uniform distribution. In addition, the Clopper‐Pearson interval is plotted, which is a common method for calculating binomial confidence intervals for probabilities. It can be seen that the fake user ratings are randomly distributed over the entire rating scale, close to the discrete uniform distribution. Thus, the fake user ratings may be approximated by a discrete uniform distribution. This seems quite reasonable, since the intention of the fake users is not to manipulate the QoE ratings; rather, they complete the user survey randomly, e.g. to quickly earn more money in paid crowdsourcing campaigns or due to disinterest in the current study.
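A minimal sketch of the Clopper-Pearson interval used in Figure 6 is given below, based on the standard beta-quantile formulation; the example counts are placeholders, not the observed fake-user data.

```python
# Sketch: Clopper-Pearson interval for the probability of a single rating value,
# given k occurrences in n fake ratings (compare against the uniform value 1/5).
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# Example: score "3" was selected 45 times out of 200 fake ratings.
print(clopper_pearson(45, 200))   # interval should contain 0.2 if the ratings are uniform
```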
Relevance for QUALINET WG 4 “Databases and validation”: It would be of interest to have a database with ‘failed’ subjective studies for further analysis of reliability issues, fake user ratings, and the impact of ‘random’ user ratings on QoE.
4.4.2 Statistical Measures and Fake User Ratings

For the analysis of subjective user studies, different statistical measures are typically applied, e.g. mean opinion scores, confidence intervals, correlation coefficients, as well as inter‐ and intra‐rater reliability measures. We consider some common measures and show the impact of fake user ratings on these measures.
Pearson’s correlation coefficient is a measure of the strength of the linear dependence between the observed user rating and the stalling pattern. In seven different crowdsourcing campaigns, we varied either a) the number of stalling events (with fixed stalling duration) or b) the duration of an individual stalling event (with a fixed number of stalling events). Since the relationship between user rating and stalling is not linear but rather an exponential decay, the correlation coefficient is only around ‐0.6, even when considering reliable user ratings only, see Figure 7. The ratings from the fake users, however, are uncorrelated with the stalling conditions (correlation coefficient around ‐0.2). As a consequence, the results from all users, i.e. reliable and fake user ratings together, are strongly influenced by the fake user ratings, and the combined user ratings appear to be almost uncorrelated with stalling.
Inter‐ and intra‐rater reliability. Next, we consider inter‐rater reliability as well as intra‐rater reliability. These measures are implemented by means of Spearman’s rank correlation coefficient as a non‐parametric measure of statistical dependence between user rating and stalling conditions. Figure 8 shows that the filter levels increase the reliability significantly. There, two exemplary crowdsourcing campaigns are considered with a fixed stalling event length of two and four seconds, respectively. In particular, the inter‐rater reliability is depicted, quantifying the degree of agreement across subjects. Similar results are observed for the other crowdsourcing campaigns as well as for the intra‐rater reliability, which quantifies the degree of agreement for different test conditions by a single rater.
Figure 6. Distribution of the fake user ratings and the confidence intervals for the corresponding probabilities.
1 2 3 4 50
0.05
0.1
0.15
0.2
0.25
0.3
0.35
user rating value
CD
F
fake user ratinguniform distribution
agreement for different test conditions by a single rater. Thus, for ‘worse test conditions’, e.g. a higher number of stalling events, the user perceived quality shall not increase. In addition to the crowdsourcing inter‐rater reliability, the same measure is depicted in Figure 8 for the i:lab study. It can be seen that the laboratory study always leads to the highest reliability, close to one.
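A possible way to compute such an inter‐rater reliability score is sketched below as the mean pairwise Spearman rank correlation over all subject pairs, given a matrix of ratings per subject and test condition. The exact aggregation used in the study may differ; function names and the example data are illustrative.

```python
# Illustrative sketch: inter-rater reliability as the mean pairwise Spearman rank
# correlation of per-condition ratings; higher values indicate stronger agreement.
import itertools
import numpy as np
from scipy.stats import spearmanr

def inter_rater_reliability(ratings):
    """ratings: array of shape (n_subjects, n_conditions), one row per subject."""
    rhos = [spearmanr(ratings[i], ratings[j]).correlation
            for i, j in itertools.combinations(range(ratings.shape[0]), 2)]
    return float(np.nanmean(rhos))

# Hypothetical example: three subjects rating four stalling conditions (0, 1, 2, 4 events)
ratings = np.array([[5, 4, 3, 1],
                    [5, 4, 2, 1],
                    [4, 5, 3, 2]])
print(inter_rater_reliability(ratings))
```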
Figure 7. Pearson’s linear correlation coefficient between user ratings and stalling, shown per campaign for reliable users, fake users, and all users.
Figure 8. Inter‐rater reliability for different filter levels (0–3) compared to the i:Lab study, plotted over the stalling event length (2 s and 4 s).
MOS and SOS values. For a certain test condition, we observe the ratings X from reliable users and the ratings Y from fake users. The reliable ratings X follow an arbitrary discrete distribution with expected value E[X], standard deviation STD[X], and variance VAR[X] = STD[X]². The fake user ratings Y are assumed to follow a discrete uniform distribution, i.e. the probability P(Y=i) that a fake user rates the quality with score i ∈ {1,2,3,4,5} is P(Y=i) = 1/5. The expected value is E[Y] = 3 and the variance is VAR[Y] = 2.

The observed user ratings Z are then the superposition of the ratings from reliable and fake users. Accordingly, the probability to observe a user rating i follows as P(Z=i) = (1−α)·P(X=i) + α·P(Y=i), where α denotes the ratio of fake users. For the mean and the variance of the observed user ratings, we obtain E[Z] = (1−α)·E[X] + α·E[Y] = (1−α)·E[X] + 3α and, using E[Y²] = VAR[Y] + E[Y]² = 11, VAR[Z] = 11α + (1−α)·E[X²] − ((1−α)·E[X] + 3α)².
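The following sketch checks these closed‐form expressions numerically for an assumed distribution of the reliable ratings X; the distribution P(X=i) chosen below is purely illustrative.

```python
# Illustrative sketch: mean and variance of the observed ratings Z when a ratio alpha
# of fake (uniform) ratings is mixed into the reliable ratings X.
import numpy as np

scores = np.arange(1, 6)
p_x = np.array([0.05, 0.10, 0.20, 0.40, 0.25])  # assumed distribution of reliable ratings
p_y = np.full(5, 0.2)                            # fake ratings: uniform, E[Y]=3, VAR[Y]=2

e_x = float(np.sum(scores * p_x))
e_x2 = float(np.sum(scores**2 * p_x))

for a in (0.0, 0.3, 0.6):
    p_z = (1 - a) * p_x + a * p_y                # mixture distribution of observed ratings
    e_z = float(np.sum(scores * p_z))
    var_z = float(np.sum(scores**2 * p_z)) - e_z**2
    # closed form from the text
    e_z_cf = (1 - a) * e_x + 3 * a
    var_z_cf = 11 * a + (1 - a) * e_x2 - ((1 - a) * e_x + 3 * a) ** 2
    assert np.isclose(e_z, e_z_cf) and np.isclose(var_z, var_z_cf)
    print(f"alpha={a:.1f}: E[Z]={e_z:.2f}, VAR[Z]={var_z:.2f}")
```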
Considering the MOS values alone does not allow drawing any conclusions about fake users and the credibility of the presented subjective test results. However, considering the standard deviation of opinion scores (SOS) in addition to the MOS values helps to identify implausible results. Since the perceived quality of reliable users is governed by the actual stalling condition, their SOS values are lower than the SOS values of random user ratings. For some test conditions, e.g. no stalling, users largely agree on the perceived quality and the SOS is close to zero. As a consequence, fake user ratings increase the overall variance of a test.
The SOS hypothesis [32] formulates a quadratic relationship between the SOS and the MOS values which depends on a single parameter only, the so‐called SOS parameter a. For an application like YouTube video streaming, this SOS parameter lies in a certain range and quantifies the diversity of the user ratings. Hence, if many fake user ratings are included in a test, the SOS parameter becomes very high, which is not typical for this kind of application. Thus, the SOS parameter additionally allows identifying implausible subjective studies.
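On a 5‐point scale, the SOS hypothesis of [32] takes the form SOS²(x) = a·(−x² + 6x − 5). A simple least‐squares estimate of the SOS parameter a from per‐condition MOS/SOS pairs could look as follows; the sample data are illustrative assumptions and the fitting procedure used in the study may differ.

```python
# Illustrative sketch: least-squares fit of the SOS parameter a in the SOS hypothesis
# SOS^2(MOS) = a * (-MOS^2 + 6*MOS - 5) on a 5-point rating scale [32].
import numpy as np

def fit_sos_parameter(mos, sos):
    """mos, sos: per-test-condition mean and standard deviation of opinion scores."""
    mos, sos = np.asarray(mos, dtype=float), np.asarray(sos, dtype=float)
    basis = -mos**2 + 6 * mos - 5        # parabola vanishing at the scale ends 1 and 5
    return float(np.sum(basis * sos**2) / np.sum(basis**2))  # slope of sos^2 = a * basis

# Hypothetical per-condition statistics of a (filtered) crowdsourcing campaign
a_hat = fit_sos_parameter(mos=[4.6, 3.9, 3.1, 2.2, 1.5],
                          sos=[0.5, 0.8, 0.9, 0.8, 0.6])
print(a_hat)  # large values would point towards random (fake) ratings
```

For purely uniform random ratings, SOS² = 2 at MOS = 3, which corresponds to a = 0.5; estimates approaching 0.5 therefore indicate a large share of random ratings.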
Figure 9. SOS hypothesis: SOS parameter a over the ratio of fake users for two crowdsourcing campaigns (stalling length L=1s and L=3s) and filter levels 1–3.
Figure 9 shows the SOS parameter a for different filter levels of two crowdsourcing campaigns. In addition to the (filtered) user ratings, we assume additional fake user ratings following a uniform distribution; the ratio of fake users is depicted on the x‐axis. It can be seen that filtering unreliable user ratings is important, i.e. filter level 1 always leads to a very high SOS parameter a. If too many fake users are involved in the test, the SOS parameter also falls outside the range typical for this application.
Confidence intervals are often misused as a measure of the reliability of subjective tests. However, additional fake user ratings in a test can even decrease the confidence intervals, namely if the test conditions yield a high variance of the user ratings or if the ratio of fake users roughly exceeds 30%. Hence, small confidence intervals are not meaningful with respect to the credibility of a user study.
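The following sketch illustrates this effect under the stated assumptions: a fixed set of high‐variance reliable ratings to which additional uniform fake ratings are appended. All numbers are illustrative.

```python
# Illustrative sketch: the 95% confidence interval of the MOS can shrink when additional
# fake (uniform) ratings are appended to a high-variance set of reliable ratings,
# because the sample size grows while the variance stays roughly the same.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reliable = np.repeat([1, 2, 3, 4, 5], 8)          # assumed high-variance reliable ratings

for extra_fake in (0, 20, 40):
    all_ratings = np.concatenate([reliable, rng.integers(1, 6, size=extra_fake)])
    n = len(all_ratings)
    half_width = stats.t.ppf(0.975, n - 1) * all_ratings.std(ddof=1) / np.sqrt(n)
    print(f"{extra_fake:2d} fake ratings: MOS {all_ratings.mean():.2f} ± {half_width:.2f}")
```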
Relevance for QUALINET WG 5 “Standardization and dissemination”: Statistical measures like reliability or SOS considerations should be included in standards to prove the credibility of subjective studies.
5 Summary of Relevance for QUALINET
A short summary of the relevance of the scientific results of this STSM is provided with respect to the different working groups in QUALINET.
QUALINET WG 1 “Application areas”:
Internet applications like YouTube video streaming as well as cloud applications are often delivered with reliable transport protocols like TCP. As a consequence, the (video) contents are not disturbed by the delivery over the Internet, but waiting times and service interruptions occur. Most of the relevant stimuli in this domain are of a temporal nature.
QUALINET WG 2 “Mechanisms and models of human perception”:
Stalling and waiting times have a strong impact on QoE. Thus, models of human perception for various Internet and cloud applications should be extended in this temporal dimension with respect to stalling, waiting times, and service interruptions.
QUALINET WG 3 “Quality metrics”:
There are further influence parameters to be analyzed by means of subjective user studies in order to provide sophisticated quality metrics for online video streaming: a) video duration, b) initial delays, and c) complex stalling patterns.
Psychological aspects like expectations or social context are often neglected in QoE modeling. Crowdsourcing offers the possibility to access different user groups (in terms of expectations and social background) and to quantify their possible impact. Expectations, for example, may be measured in terms of Internet connection speed. Subjective tests could be repeated in different countries (and therewith different social contexts) to quantify a possible influence.
Generic relationships between user acceptance and user perception are an important extension for further analysis and subjective user studies.
QUALINET WG 4 “Databases and validation”:
Databases considering user perception as well as user expectations are of high interest.
It would be of interest to have a database with ‘failed’ subjective studies for further analysis of reliability issues, fake user ratings, and the impact of ‘random’ user ratings on QoE.
QUALINET WG 5 “Standardization and dissemination”:
Currently, many researchers conduct subjective user studies without being aware of proper study design and possible effects, such as corpus, order, or scaling effects. A web page on the QUALINET homepage would be useful as a rough guideline for subjective testing, summarizing such effects, typical countermeasures, and statistical measures/tools for analysis.
The results from the YouTube QoE study and the comparison of the different test environments are very promising and suggest additionally taking crowdsourcing into account as an appropriate vehicle for conducting subjective user studies, especially for web and cloud applications, which typically also run within a web browser.
The developed methodology to detect unreliable subjects should be included in appropriate standards when remote user panels are used, e.g. crowdsourcing platforms or users of social networks. In an analogous way, reliability task design methods should also be considered for laboratory tests, since rejecting unreliable users should not be done on the basis of the submitted user ratings.
Statistical measures quantifying reliability or user diversity in terms of SOS should be included in standards to indicate the credibility of subjective studies.
6 Outcome of the STSM
As main outcome of the STSM, a fruitful and long‐term collaboration between both institutions is foreseen. Since FTW and University of Würzburg are both involved in COST TMA as well as COST QUALINET (European Network on Quality of Experience in Multimedia Systems and Services), these projects will foster joint research activities and give the opportunity of strong collaboration, e.g. by workshops and meetings or other STSMs. In addition, the outcome of the STSM is direct input for the QUALINET WG1 Subgroup “Cloud and Web Applications” led by Tobias Hoßfeld and Raimund Schatz.
The results of this STSM are intended to be published in joint conference and journal publications as well as in workshops and meetings of the COST QUALINET action. In particular, several publications are intended or already published by the beneficiary together with different researchers from FTW and UniWue:
“Quantification of YouTube QoE via Crowdsourcing” by Tobias Hoßfeld, Raimund Schatz, Michael Seufert, Matthias Hirth, Thomas Zinner, Phuoc Tran‐Gia, accepted for publication at IEEE International Workshop on Multimedia Quality of Experience ‐ Modeling, Evaluation, and Directions (MQoE 2011), Dana Point, CA, USA, December 2011.
“FoG and Clouds: On Optimizing QoE for YouTube” by Tobias Hoßfeld, Florian Liers, Thomas Volkert, Raimund Schatz, accepted at 5th KuVS GI/ITG Workshop "NG Service Delivery Platforms", at DOCOMO Euro‐Labs, Munich, Germany
“Quality of Experience of YouTube Video Streaming for Current Internet Transport Protocols” by Tobias Hoßfeld and Raimund Schatz, currently under submission at ACM Computer Communications Review; a technical report of University of Würzburg is available containing the numerical results, Technical Report No. 482: “Transport Protocol Influences on YouTube QoE”, July 2011.
" ‘Time is Bandwidth’? Narrowing the Gap between Subjective Time Perception and Quality of Experience” by Sebastian Egger, Peter Reichl, Tobias Hoßfeld, Raimund Schatz, submitted to IEEE ICC 2012 ‐ Communication QoS, Reliability and Modeling Symposium
“Challenges of QoE Management for Cloud Applications” by Tobias Hoßfeld, Raimund Schatz, Martin Varela, Christian Timmerer, submitted to IEEE Communications Magazine, Special Issues on QoE management in emerging multimedia services: YouTube is among others considered as exemplary multimedia‐intensive cloud application for personal usage. Related QoE challenges and QoE management are discussed.
“Recommendations and Comparison of Subjective User Tests via Crowdsourcing and Laboratories for Online Video Streaming”, intended for QoMEX 2012.
“Impact of Fake User Ratings on QoE”, intended for Journal submission.
The results of this STSM inspired a Specific Joint Research Project within the Network of Excellence Euro‐NF, which is currently under review. The consortium is led by Tobias Hoßfeld with FTW as external partner (outside Euro‐NF) and focuses on “Prospects for Realizing User‐centric Network Orchestration”. To this end, Quality of Experience (QoE) of different applications is considered within the project. In particular, we consider A) streaming services like YouTube video streaming with high video quality playback demands and B) interactive web services like social networks with high security and privacy demands.
Concerning standardization activities, a “Report on influence factors on web‐browsing QoE and proposal of new study item” for the ITU‐T Study Group 12 “Performance and quality of service” has been submitted in the context of Question 13 (QoE, QoS and performance requirements and assessment methods for multimedia including IPTV) and Question 11 (transmission planning, interworking and traffic management for networks supporting voice, data and multimedia services).
Beyond the ongoing work, a collaboration on crowdsourcing methodology for subjective user studies to derive sophisticated QoE models, as well as on QoE for cloud applications, has already been started; it will be discussed within the QUALINET WG1 subgroup “Cloud and Web Applications”. In addition, the application of the YouTube QoE model is of interest for the ITC community and will be investigated as a concrete research topic within the intended long‐term research collaboration beyond the STSM.
7 References
[1] Cisco Systems Inc. “Cisco Visual Networking Index: Forecast and Methodology, 2009‐2014”. June 2010.
[2] COMET project ACE: “Advancing the Customer Experience”, http://ace.ftw.at
[3] EU FP7 Network of Excellence Euro‐NF: “Anticipating the Network of the Future – From Theory to Design”, http://www.euro‐nf.org
[4] EU STREP project Optiband, http://optiband‐project.eu
[5] EU STREP project SmoothIT: “Simple Economic Management Approaches of Overlay Traffic in Heterogeneous Internet Topologies”, http://www.smoothit.org
[6] G. Maier, A. Feldmann, V. Paxson, M. Allman: “On Dominant Characteristics of Residential Broadband Internet Traffic”. IMC 2009, Rostock, Germany, 2009.
[7] G‐Lab project: “National Platform for Future Internet Studies”, funded by the Federal Ministry of Education and Research of the Federal Republic of Germany, http://www.german‐lab.de
[8] K. Chen, C. Chang, C. Wu, Y. Chang, C. Lei: “Quadrant of Euphoria: A Crowdsourcing Platform for QoE Assessment”. IEEE Network, Volume 24, Issue 2, April 2010.
[9] M. Fiedler and T. Hoßfeld: “Quality of Experience‐Related Differential Equations and Provisioning‐Delivery Hysteresis”. 21st ITC Specialist Seminar on Multimedia Applications ‐ Traffic, Performance and QoE, Phoenix Seagaia Resort, Miyazaki, Japan, March 2010.
[10] M. Fiedler, K. Kilkki, P. Reichl: „From Quality of Service to Quality of Experience”. Dagstuhl Seminar 09192, Germany, May 5‐8, 2009
[11] M. Fiedler, T. Hoßfeld, P. Tran‐Gia: “A Generic Quantitative Relationship between Quality of Experience and Quality of Service”, IEEE Network Special Issue on Improving QoE for Network Services, 2010.
[12] M. Hirth, T. Hoßfeld, P. Tran‐Gia. “Anatomy of a Crowdsourcing Platform ‐ Using the Example of Microworkers.com”. To be published at the Workshop on Future Internet and Next Generation Networks (FINGNet), Seoul, Korea, June 2011.
[13] M. Hirth, T. Hoßfeld, P. Tran‐Gia. “Cost‐Optimal Validation Mechanisms and Cheat‐Detection for Crowdsourcing Platforms”. To be published at the Workshop on Future Internet and Next Generation Networks (FINGNet), Seoul, Korea, June 2011.
[14] M. Jarschel, D. Schlosser, S. Scheuring, T. Hoßfeld. „An Evaluation of QoE in Cloud Gaming Based on Subjective Tests”. To be published at the Workshop on Future Internet and Next Generation Networks (FINGNet‐2011), Seoul, Korea, June 2011.
[15] P. Fröhlich, L. Baillie, P. Reichl, R. Schatz, F. Hammer, G. Niklfeld: “The HTI lab @ ftw: user research for telecom systems”. In CHI '06 Extended Abstracts on Human Factors in Computing Systems (Montréal, Québec, Canada, April 22 ‐ 27, 2006).
[16] P. Reichl, F. Hammer, S. Egger, M. Ries: “The Well‐Tempered Conversation: On Quality of Experience of Interactive VoIP. Extended Abstract”. Proc. EuroFGI IA.7.6 Workshop on Socio‐Economic Aspects of Future Generation Internet, Blekinge Institute of Technology, Karlskrona, Sweden, May 27‐29, 2008
[17] P. Reichl, S. Egger, R. Schatz, A. d'Alconzo: „The Logarithmic Nature of QoE and the Role of the Weber‐Fechner Law in QoE Assessment”. ICC 2010, Capetown, South Africa, May 23‐27, 2010
[18] P. Reichl: “From Charging for Quality‐of‐Service to Charging for Quality‐of‐Experience Annals of Telecommunications”. Special issue on “Quality of Experience and Socio‐Economic Issues of Network‐Based Services”, December 2009
[19] R. Schatz, L. Baillie, P. Froehlich, S. Egger, T. Grechenig: „What Are You Viewing? Exploring the Pervasive Social TV Experience”. In A.R. Roibás, A. Marcus, R. Sala (Eds.): “Mobile TV: Customizing Content and Experience“, Springer Verlag, 2009
[20] S. Egger, P. Reichl, M. Ries: „Quality‐of‐Experience Beyond MOS: Experiences with a Holistic User Test Methodology for Interactive Video Services”. 21st ITC Specialist Seminar on Multimedia Applications ‐ Traffic, Performance and QoE, Phoenix Seagaia Resort, Miyazaki, Japan, March 2‐3, 2010
[21] T. Ciszkowski, W. Mazurczyk, Z. Kotulski, T. Hoßfeld, M. Fiedler and D. Collange: “Towards Quality of Experience‐based Reputation Models for Future Web Service Provisioning”. Special Issue of the Springer Telecommunication Systems Journal: Future Internet Services and Architectures ‐ Trends and Visions, 2010.
[22] T. Hoßfeld. “Performance Evaluation of Future Internet Applications and Emerging User Behavior”. Received MMB 2010 PhD award, award for outstanding PhD theses 2010 at University of Würzburg, and best dissertation award 2010 of the Institute of Computer Science. August 2009.
[23] T. Hoßfeld, D. Hock, P. Tran‐Gia, K. Tutschku, M. Fiedler: “Testing the IQX Hypothesis for Exponential Interdependency between QoS and QoE of Voice Codecs iLBC and G.711”. 18th ITC Specialist Seminar on Quality of Experience, Karlskrona, Sweden, May 2008.
[24] T. Hoßfeld, K. Leibnitz. “A Qualitative Measurement Survey of Popular Internet‐based IPTV Systems.” Second International Conference on Communications and Electronics (HUT‐ICCE 2008), Hoi An, Vietnam, June 2008.
[25] T. Hoßfeld, M. Fiedler, T. Zinner. “The QoE Provisioning‐Delivery‐Hysteresis and Its Importance for Service Provisioning in the Future Internet”. Currently under submission.
[26] T. Hoßfeld, M. Hirth, P. Tran‐Gia. “Modeling of Crowdsourcing Platforms and Granularity of Work Organization in Future Internet”. Currently under submission.
[27] T. Hoßfeld, R. Schatz. A. Platzer, S. Egger, S. Biedermann, M. Fiedler. „The Memory Effect and Its Implications on Web QoE Modeling“. Currently under submission.
[28] T. Tominaga, T. Hayashi, J. Okamoto, A. Takahasi. “Performance Comparisons of Subjective Quality Assessment Methods for Mobile Video”. 2nd Workshop on Quality of Multimedia Experience (QoMEX), Trondheim, 2010.
[29] T. Zinner, O. Abboud, O. Hohlfeld, T. Hoßfeld, P. Tran‐Gia: “Towards QoE Management for Scalable Video Streaming”. 21st ITC Specialist Seminar, Miyazaki, Japan, March 2010.
[30] T. Zinner, O. Hohlfeld, O. Abboud, T. Hoßfeld: “Impact of Frame Rate and Resolution on Objective QoE Metrics”. International Workshop on Quality of Multimedia Experience, 2010, Trondheim, June 2010.
[31] T. Zinner, T. Hoßfeld, T. Minash, M. Fiedler: “Controlled vs. Uncontrolled Degradations of QoE – The Provisioning‐Delivery Hysteresis in Case of Video”. New Dimensions in the Assessment and Support of Quality of Experience (QoE) for Multimedia Applications, Tampere, June 2010.
[32] Tobias Hoßfeld, Raimund Schatz, Sebastian Egger: “SOS: The MOS is not enough!” QoMEX 2011, Mechelen, Belgium, September 2011.