The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity Markus Luczak-Roesch (some slides based on work of Ramine Tinati) University of Southampton (UK), Web and Internet Science Group @mluczak | http://markus-luczak.de Image source: https://en.wikipedia.org/wiki/File:Compound_Microscope_(cropped).JPG, CC BY-SA 4.0

Transcript of The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Page 1: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity


Page 2: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

The World Wide Web

Image source: screenshot taken from https://www.w3.org/History/1989/proposal.html

Page 3: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Essential part of the data science story of the WWW: Web Observatories

Data Sources Challenges:
- Who are the providers?
- Is the service reliable/stable?

Data Collection Challenges:
- API Limitations/Restrictions
- Data Schemas/Consistency
- Does it change over time?

Data Storage Challenges:
- Storage approaches (relational, flat, linked?)

Data Analysis and Modelling Challenges:
- What methods/models?
- How is the data sampled?

Data Visualisation Challenges:
- Misrepresentation of data? e.g. visualise “filtered” data

Data Querying and Transformation

Statistical and computational analysis Methods

Data Interpretation Challenges:
- Are the questions being asked relevant to the data?
- Are insights being fed back into the analysis?

Add or update initial stored data

Update current harvesting strategy (req. for real-time analysis)

Image source: https://en.wikipedia.org/wiki/File:Sphinx_Observatory.jpg, CC BY-SA 2.0

Page 4: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

What to observe? Social Machines!

“Real life is and must be full of all kinds of social constraint – the very processes from which society arises. Computers can help if we use them to create abstract social machines on the Web: processes in which the people do the creative work and the machine does the administration.”

Berners-Lee, Tim; Mark Fischetti (1999). Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by its inventor. Britain: Orion Business. ISBN 0-7528-2090-7.

Page 5: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Topic outbreaks across systems

Peak in tweets containing topic ‘x’

Peak in Wikipedia views of articles ‘x’

‘Lag diffusion’ time
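The ‘lag diffusion’ time on this slide can be read as the delay between the activity peak for a topic on one system and the peak for the same topic on another. The sketch below is one possible way to compute it, assuming hourly counts are already available as timestamp-to-count mappings; the function names and the sample numbers are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta


def peak_time(series):
    """Return the timestamp with the largest count in a {timestamp: count} mapping."""
    if not series:
        raise ValueError("series must not be empty")
    # Ties resolve to the earliest timestamp so the result is deterministic.
    return min(series, key=lambda ts: (-series[ts], ts))


def lag_diffusion_time(source_series, target_series):
    """Delay between the peak in the source system and the peak in the target system."""
    return peak_time(target_series) - peak_time(source_series)


# Hypothetical hourly counts for one topic 'x'.
start = datetime(2015, 1, 1, 12, 0)
tweets = {start + timedelta(hours=h): c for h, c in enumerate([5, 40, 12, 8, 3])}
wiki_views = {start + timedelta(hours=h): c for h, c in enumerate([2, 6, 9, 55, 20])}

lag = lag_diffusion_time(tweets, wiki_views)
print(lag)
```

A negative result would mean the second system peaked first, which is worth keeping rather than clamping to zero.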

Page 6: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Topic outbreaks across cultures

Page 7: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Participation in Citizen Science projects by communication patterns

Table 1: Chat Message Stages: Boundary Conditions

Stage                 Criteria
Before Game (Q0)      30s < Game Start
Start of Game (Q1)    Game Start < x < 1st Quartile Game Duration
During Game (Q2-3)    1st Quartile Game Duration < x < 3rd Quartile Game Duration
End of Game (Q4)      3rd Quartile Game Duration < x < Game End
After Game (Q5)       30s < Game End

Figure 3: Five stages of chat messages during the gaming process

represents player activity between 2012-01-19 and 2014-08-05. The data contains 4,409,998 game entries and 835,732 chat messages, made by 98,224 unique players. For each game, the EyeWire system records the total duration taken (in seconds) for a player to complete a task, and the time the game was completed. Each chat message contains the player’s ID, timestamp, and message text.

In order to examine the question of player chat engagement and to offer a finer level of granularity of players with similar characteristics, we extracted different sets of players related to their gaming and chatting behaviour. We initially reduced the data to include players who contributed to both games and chat. We labelled these the ‘active’ players. Based on these players, we computed several additional sub-sets of players related to specific EyeWire features; for each of these sub-sets we computed a number of statistics and aggregate counts, as described in Table 2.

In addition to computing statistics for the 10,714 ‘active’ players that participated in games and chat, we extracted the top quadrant of ‘active’ players, similar to the approach taken in other citizen science studies of community engagement [27]. We label these players as ‘highly active’. Based on an initial analysis of user retention, ‘highly active’ players contain individuals who sustained a minimum duration of 30 days with respect to writing chat messages and completing a game.

5. RESULTS

The results are organised as follows: we begin by presenting the general findings from the system-level analysis, then explore the role of chat and its relationship with a player’s gaming participation. We then report on the chat messages corresponding to different stages of the gaming process, the impact of game commands on gaming, and finally, examine the context of the chat messages by using topic modelling.

5.1 General Findings

The general analysis examined the structure and characteristics of the EyeWire platform. We divide this section up by exploring interaction between real-time chat and gaming. As Figure 4 illustrates, there is a long tail distribution of chat and gaming activity. 86.2% of games and 95.6% of chat messages are performed by 10.9% of EyeWire players. These ‘active’ players engage in both chat and gaming. We note that in comparison to non-gamified citizen science platforms the proportion of ‘active’ EyeWire players is significantly lower [27]; however, EyeWire exhibits a similar distribution of player contributions.

Extracting the ‘highly active’ players (defined as those that are active on their account for over 30 consecutive days) shows, as in Table 2, that just over 1% of EyeWire players were responsible for over 50% of the total games (roughly 2 million).

Comparing players that only participated in gaming (which accounted for 88% of EyeWire players) to those that engaged in both chat and gaming (the ‘active’ players), we found that the average number of games completed by gaming-only players was significantly lower (15 games compared to 255). In addition, the overall account length (the total time they were active on EyeWire) of ‘active’ players was nearly 4 times longer. However, with respect to the frequency with which they completed a game (the delta in minutes between games), those that only participated in the game spent on average 6 minutes between starting a new game, in comparison to 65 minutes for the ‘active’ players.
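Long-tail statements of the form "x% of players account for y% of activity" can be checked with a few lines of code. The following is a minimal sketch, assuming a list of per-player contribution counts; the function name and the sample counts are hypothetical and do not reproduce the EyeWire data.

```python
import math


def top_share(contributions, fraction):
    """Share of all contributions made by the most active `fraction` of players."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    counts = sorted(contributions, reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no contributions recorded")
    # Round the group size up so that at least one player is always included.
    k = math.ceil(fraction * len(counts))
    return sum(counts[:k]) / total


# Hypothetical games-per-player counts for ten players.
games_per_player = [50, 30, 5, 5, 4, 3, 1, 1, 1, 0]
print(top_share(games_per_player, 0.2))
```

Whether players with zero contributions are kept in the list changes the denominator of the player fraction, so that choice should be stated alongside any reported share.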

Figure 4: Distribution of games, chat messages, and account durations (games and chat) for all EyeWire players.

Figure 5: Timeline of chat and gaming activity for the EyeWire platform.

5.1.1 Player Cohorts

As shown in Figure 4, the analysis of chat and gaming account duration reveals that for gaming activity, there are many players who have a short gaming duration, whereas players chat for longer periods of time. In order to examine the retention of players within the EyeWire platform in greater depth, we used a cohort analysis method as described by [18, 19]. We apply this approach to obtain a ‘chat’ and ‘gaming’ cohort, which corresponds to the players who have had at least one recorded activity in a given month. The analysis encompasses the total lifetime of the project and assigns players to a cohort based on the month that their first activity was identified. Figure 6 illustrates the retention of players based on their activity in chat and gaming. The analysis discovered 19 chat and


Tinati, R., Luczak-Rösch, M., Simperl, E., Hall, W., & Shadbolt, N. (2015, May). ‘/Command’ and conquer: analysing discussion in a citizen science game. In ACM Web Science Conference 2015.

consensus achieved during past classifications.

Figure 1: Main Interface in EyeWire

Player communications and gamification techniques are integral to the design of the EyeWire platform. As shown in Figure 2, EyeWire contains an embedded real-time chat that allows players to talk to each other, view other players’ points and achievements, as well as use a number of game commands which provide additional functionality during gaming and talking. Game commands are issued by using a forward slash (‘/’), such as being able to mute and hide the chat interface by using the ‘/silence’ command. Player statistic commands are not shown on the public chat feed, unless a player issues a command such as group message (‘/gm’), which posts their message to a particular team, which they first have to join using the ‘/team’ command.

The formation of a team is a community-driven process which usually is a result of an ongoing competition between teams of players. Competitions are either set up by the EyeWire team (usually to encourage or refresh system activity), or led by the players who wish to compete for a specific goal or set of ‘badges’.

In addition to the internal chat, the main interface links to additional communication interfaces which are not part of the game. There is the EyeWire project blog, where the community managers promote game highlights, competitions, and challenges as well as new or notably successful players. The players can also consult the EyeWire wiki which contains information about how to play the game, and about the science behind ‘connectome’ mapping. In addition to this, players are provided with a forum that is meant to be used for more comprehensive, asynchronous discussion on various topics around the game, including error reports.

4. DATA AND METHODS

4.1 Methods

The analysis of the EyeWire platform involves a study of the system-level properties and the analysis of players’ gaming and real-time chat activity. In order to achieve this we developed a model that represents games and chat messages of a player, and extracted a number of features related to their activity. This is then used to examine system-level activity, and cross-player interaction and communication.

Figure 2: Embedded Chat Interface in EyeWire

In order to examine the activity of EyeWire, similar to previous studies of citizen science project analysis [27], we use player churn and cohort analysis [19], which involves using time window sampling techniques in order to examine the churn of players within a given time frame. The cohort analysis examines monthly cohorts of players based on their first chat and game entry, and provides a measure of sustained activity. Based on the monthly player retention values, we are able to differentiate between different sets of users, as described in the following section.
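The monthly cohort analysis described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes activity is available as (player, date) pairs, assigns each player to the month of their first recorded activity, and reports the fraction of each cohort that is active again a given number of months later. All names and sample events are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_key(day):
    """Reduce a date to a (year, month) tuple."""
    return (day.year, day.month)


def months_between(first, later):
    """Number of calendar months from `first` to `later`."""
    return (later[0] - first[0]) * 12 + (later[1] - first[1])


def cohort_retention(events):
    """Map each cohort month to {months since joining: fraction of cohort active}."""
    active_months = defaultdict(set)
    for player, day in events:
        active_months[player].add(month_key(day))

    # A player's cohort is the month of their first recorded activity.
    cohorts = defaultdict(list)
    for player, months in active_months.items():
        cohorts[min(months)].append(player)

    retention = {}
    for cohort, players in cohorts.items():
        counts = defaultdict(int)
        for player in players:
            for month in active_months[player]:
                counts[months_between(cohort, month)] += 1
        retention[cohort] = {
            offset: n / len(players) for offset, n in sorted(counts.items())
        }
    return retention


# Hypothetical chat events for three players.
events = [
    ("a", date(2013, 1, 5)), ("a", date(2013, 2, 1)), ("a", date(2013, 3, 9)),
    ("b", date(2013, 1, 20)), ("b", date(2013, 3, 2)),
    ("c", date(2013, 2, 14)),
]
retention = cohort_retention(events)
print(retention)
```

Running the same function separately on chat events and on game events would give the two cohort views ('chat' and 'gaming') that the text distinguishes.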

To examine the context and discourse within the chat messages, we perform text analysis to extract the use of EyeWire game commands, and also perform topic modelling on the content of the chat messages. To achieve this we use LDA [5] to derive topic models which contain common vocabulary used by players. We combine this with the different categories of chat messages in order to determine the context of chat during different stages of completing a game.

As we are interested in the relationship between a player’s gaming session and use of chat, we construct a model of player chat messages which classifies chat activity at different stages of when a game is being performed. As described in Table 1 and illustrated in Figure 3, we categorise the chat messages into 5 stages around the process of gaming. Stages Q1 to Q4 are relative to the time it took for the game to be completed. For example, if a game was completed in 10 seconds, then Q1 would represent 0-2 seconds, Q2-3 represents 3-7 seconds, and Q4 represents 8-10 seconds. In addition to the three stages during a game time window, we also consider 30 seconds either side of the gaming time window (Q0 and Q5). We chose 30 seconds as the lower and upper boundary. 30 seconds was calculated as a suitable duration based upon measuring the distribution of chat messages that fell outside the time window of a game, and using the value of 1 standard deviation away from the mean.
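The five-stage classification can be expressed as a small function. The sketch below is an illustration under stated assumptions: times are seconds on a shared clock, the quartile boundaries are taken at 25% and 75% of the game duration, and the choice of which boundary is inclusive is ours, since the excerpt does not specify it. The function and parameter names are hypothetical.

```python
def classify_chat_stage(msg_time, game_start, game_end, margin=30.0):
    """Return the stage label for a chat message, or None if it falls outside all stages."""
    if game_end < game_start:
        raise ValueError("game_end must not precede game_start")
    duration = game_end - game_start
    first_quartile = game_start + 0.25 * duration
    third_quartile = game_start + 0.75 * duration

    if game_start - margin <= msg_time < game_start:
        return "Q0"    # before game
    if game_start <= msg_time < first_quartile:
        return "Q1"    # start of game
    if first_quartile <= msg_time < third_quartile:
        return "Q2-3"  # during game
    if third_quartile <= msg_time <= game_end:
        return "Q4"    # end of game
    if game_end < msg_time <= game_end + margin:
        return "Q5"    # after game
    return None


# The 10-second game from the text: boundaries fall at 2.5 s and 7.5 s.
for t in (-10, 1, 5, 9, 25, 100):
    print(t, classify_chat_stage(t, game_start=0, game_end=10))
```

With whole-second timestamps this reproduces the ranges given in the text (0-2, 3-7 and 8-10 seconds for a 10-second game).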

4.2 Data

The analysis performed uses EyeWire game and chat data, which

Page 8: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Participation in Citizen Science projects by communication patterns

Luczak-Roesch, M., Tinati, R., Simperl, E., Van Kleek, M., Shadbolt, N., & Simpson, R. (2014). Why won't aliens talk to us? Content and community dynamics in online citizen science. Proceedings of the Eighth AAAI Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014.

Image source: David Miller, https://daily.zooniverse.org/2013/11/21/an-ever-expanding-zooniverse/

Page 9: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Participation in Citizen Science projects by location

Page 10: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Temporal networks of information co-occurrence for system-agnostic exploratory data analysis

Markus Luczak-Roesch, Ramine Tinati, Max van Kleek, and Nigel Shadbolt. 2015. From coincidence to purposeful flow? Properties of transcendental information cascades. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, FR.

Page 11: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Where is the MacroScope?

Data Sources Challenges:
- Who are the providers?
- Is the service reliable/stable?

Data Collection Challenges:
- API Limitations/Restrictions
- Data Schemas/Consistency
- Does it change over time?

Data Storage Challenges:
- Storage approaches (relational, flat, linked?)

Data Analysis and Modelling Challenges:
- What methods/models?
- How is the data sampled?

Data Visualisation Challenges:
- Misrepresentation of data? e.g. visualise “filtered” data

Data Querying and Transformation

Statistical and computational analysis Methods

Data Interpretation Challenges:
- Are the questions being asked relevant to the data?
- Are insights being fed back into the analysis?

Add or update initial stored data

Update current harvesting strategy (req. for real-time analysis)

Image sources: https://en.wikipedia.org/wiki/File:Compound_Microscope_(cropped).JPG, CC BY-SA 4.0 & https://en.wikipedia.org/wiki/File:Sphinx_Observatory.jpg, CC BY-SA 2.0

Page 12: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

What is the MacroScope?

Page 13: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Other data visualization capacities

Image source: screenshot from https://www.imperial.ac.uk/data-science/kpmg-data-observatory-/technical-specifications/

Page 14: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Other data visualization capacities

Image source: screenshot from http://approach.rpi.edu/2015/11/18/immersive-experience-the-campfire/

Page 15: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

What is the MacroScope?

“Wow, they don’t even know that this is happening!”

Page 16: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Do we really think this is an event to be addressed in a purely quantitative fashion?

Source: United Nations Development Programme, https://goo.gl/Z1uXdV, CC BY-NC-ND 2.0

Page 17: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

A qualitative investigation of crowdsourced disaster response

• Haiti (Ushahidi, N=298)
  – requests for help from identified local source

• Congo (Ushahidi, N=102)
  – information about the situation but not who is responsible for this information
  – more non-local sources

• Ebola (Twitter, N=298)
  – comments
    • tasteless jokes
    • racist comments
    • concern that the crisis could spread and call to governments to close the borders

Joint project with Silke Roth

Page 18: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Boundaries of crowdsourced disaster response

• Wrong things go viral
• Crowdsourcing informativeness of social media information not synchronized with crises

Chart categories: negative, neutral, positive

“When you tell a […] kid that is has got Ebola”

Page 19: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Serendipitous discoveries in Citizen Science

Hanny’s Voorwerp Galaxy Zoo [2007]

Green Pea Galaxies Galaxy Zoo [2007]

Yellow Balls Milky Way [2009]

Circumbinary Planet Ph1b Planet Hunter [2012]

Convict Worm Seafloor Explorer [2012]

Spanish Flu Operation War Diaries [2014]

Page 20: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

From information co-occurrence to the discovery of hidden structure in Wikipedia

Metric                            Trigram MF
Nodes                             18,896
Links                             17,004
Matched identifiers               1,745
Identifier roots                  1,599
Stubs                             1,645
Nodes without any links           146
Avg identifier path length        11.53
Shortest path (links)             2
Longest path (links)              1373
Average path duration (hours)     369
Longest path duration (hours)     2133 (88 days)
Shortest path duration (hours)    0
Cascades                          1,379
Largest cascade (links)           8068
Smallest cascade (links)          2
Average cascade size (links)      13.70

Table 1: Results of the experiments. The Trigram MF matches on a 3 noun-phrase sequence.

… investigated this in more detail by assessing the identifier entropy. We find two cascade types: (a) a significant proportion of cascades with an identifier entropy of 0; (b) the entropy for all captured cascades is lower than 5. While (a) again reflects the existence of a significant number of single identifier cascades, (b) lets us conclude that multi-identifier cascades tend to be dominated by some identifiers, resulting in an unequal distribution of the identifiers in those cascades. Both observations support the findings from the analysis of the Wiener index in relation to cascade size.
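The identifier entropy discussed here can be illustrated with Shannon entropy over the identifiers matched in a cascade. The sketch below is a minimal example; the logarithm base (2) and the input representation (one identifier label per matched link) are assumptions, as the excerpt does not state them.

```python
import math
from collections import Counter


def identifier_entropy(identifiers):
    """Shannon entropy (in bits) of the identifier distribution within one cascade."""
    counts = Counter(identifiers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cascade has no matched identifiers")
    # p * log2(1/p), written with total/n so single-identifier cascades give exactly 0.0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


single = ["NATO"] * 5                         # one identifier only
uniform = ["a", "b", "c", "d"]                # four identifiers, equally frequent
dominated = ["U.S. Supreme Court"] * 7 + ["NATO"]

print(identifier_entropy(single))
print(identifier_entropy(uniform))
print(identifier_entropy(dominated))
```

A single-identifier cascade has entropy 0, and a cascade dominated by one identifier scores well below the maximum for its number of identifiers, which is the pattern the paragraph above describes.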

Burstiness. We measured three kinds of burstiness: (1) the burstiness of all captured edits independent from the cascade they belong to; (2) the burstiness of all edits captured within specific fully-connected cascade networks; (3) the burstiness of all edits that match a particular identifier (identifier burstiness). As described in Section 2, burstiness refers to periods of high activity in a stream of activity, and offers a way to detect behavior that is correlated with a particular event. In the context of the Wikipedia editing stream, bursts of editing activity across a set of Wikipedia articles could be related to some external (or internal) social phenomenon such as a controversial topic, the injection of biased information, or some form of vandalism. The overall burstiness reveals only very few periods of significantly high activity. Naturally, the amount of activity increases as the TIC model will capture additional identifiers the longer the edit stream is observed. This results in an increasing likelihood to match observed edit events to older ones.

As a more fine grained indicator of bursts of related information, we computed the cascade burstiness for each structurally connected cascade network derived from the overall edit stream individually. We observe that it is possible to differentiate between cascades that show a similar burstiness pattern as the overall burstiness and others that are significantly different and become only visible on this microscopic level. TIC allows activity streams to be mapped into a three dimensional space. In Figure 1 we zoomed into a period of 1500 edits happening in about 40 minutes and highlight that within this dense global activity we can identify various local bursts ((1) and (2) mark the most prominent two local bursts). Generally, this mapping of Transcendental Information Cascades allows us to analyse (a) global bursts of high activity involving diverse information and (b) local bursts of significant occurrence of the same information.
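The excerpt does not say how burstiness was quantified. As an illustration only, the sketch below uses a commonly cited burstiness coefficient based on inter-event times, B = (σ − μ) / (σ + μ), where μ and σ are the mean and standard deviation of the gaps between consecutive events. This is a swapped-in measure chosen for simplicity, not necessarily the one used in the paper; names and sample timestamps are hypothetical.

```python
from statistics import mean, pstdev


def burstiness(timestamps):
    """Burstiness coefficient of an event stream: -1 is perfectly regular, values near 1 are bursty."""
    times = sorted(timestamps)
    if len(times) < 3:
        raise ValueError("need at least three events")
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mu, sigma = mean(gaps), pstdev(gaps)
    if mu + sigma == 0:
        raise ValueError("all events share the same timestamp")
    return (sigma - mu) / (sigma + mu)


regular_edits = [0, 10, 20, 30, 40]                         # evenly spaced
bursty_edits = [0, 1, 2, 3, 100, 101, 102, 103, 200]        # tight clusters, long pauses

print(burstiness(regular_edits))
print(burstiness(bursty_edits))
```

The same function can be applied to all edits, to the edits of one connected cascade, or to the edits matching one identifier, which mirrors the three levels listed in the text.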

Wikipedia Article Network (WAN) Comparison. We compared the difference in the link structure of the cascades, and the explicit (embedded) links in a Wikipedia article. We constructed two networks, the Cascade Article Network (CAN), and the Wikipedia Article Network (WAN) (see footnote 3). Table 2 provides an overview of the CAN and WAN. For comparative purposes, the metrics of the WAN network have been applied to the sub-set of articles which are contained within the CAN. Figure 2a provides a visual representation of the CAN structure, with three labelled strongly connected components, (A), (B), and (C).

Figure 1: Wikipedia edits in a three dimensional space. The dimensions are (1) time; (2) information diversity as the chronological order in which unique identifier sets are found; (3) information specificity as the index for each unique identifier set which is incremented with each occurrence of the respective set over time.
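The three dimensions named in the Figure 1 caption can be computed in a single pass over a time-ordered stream. The following is a minimal sketch, assuming each edit is given as a (timestamp, set of matched identifiers) pair; the function name and the sample edits are hypothetical.

```python
def map_to_three_dimensions(events):
    """Map (timestamp, identifiers) events to (time, diversity, specificity) points."""
    diversity_index = {}   # identifier set -> order in which the set was first seen
    occurrences = {}       # identifier set -> how often the set has occurred so far
    points = []
    for timestamp, identifiers in sorted(events, key=lambda event: event[0]):
        key = frozenset(identifiers)
        if key not in diversity_index:
            diversity_index[key] = len(diversity_index) + 1
        occurrences[key] = occurrences.get(key, 0) + 1
        points.append((timestamp, diversity_index[key], occurrences[key]))
    return points


# Hypothetical edits: timestamp in minutes and the identifiers matched in each edit.
edits = [
    (0, {"NATO"}),
    (5, {"U.S. Supreme Court"}),
    (7, {"NATO"}),
    (9, {"NATO", "U.S. Supreme Court"}),
    (12, {"NATO"}),
]
points = map_to_three_dimensions(edits)
print(points)
```

In this representation a local burst shows up as a run of points that share one diversity value while their specificity climbs quickly over a short time span.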

Metric                    CAN       WAN*
Total Nodes (Articles)    7,293     5,716,808
Total Edges (A-to-A)      23,560    5,705,827
Avg. Edges                3.1       142
Avg. Degree               6.46      343

Table 2: A comparison of the cascade links between articles with the Wikipedia article graph. WAN - Wikipedia Article Network. *The WAN graph metrics are based on the subset of matching Wikipedia articles, not the complete article base.

Due to the articles which reside outside the set of articles identified within the CAN, the WAN has a higher average degree and edges per article. However, in comparison to the WAN’s structure, which contained one large connected component of articles (within the given subset of articles), the CAN network featured three strongly connected components. As labelled in Figure 2a, these components related to articles containing content about (A) South Korea, (B) the United States of America (Geographic articles), and (C) Political articles.

We compared the edges between articles formed by the cascades to the edges within the WAN, and found that only 4.4% of edges in the CAN could be identified within the WAN. Only 2 articles from the CAN had a 100% overlap with the WAN. Furthermore, we found that 94.7% of articles within the CAN had an overlap of less than 1%. These findings suggest that the article links formed within the CAN network may be forming article structure which is not explicitly found within Wikipedia.
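The edge comparison between the two networks amounts to set intersection over article pairs. The sketch below shows one way to compute the overall and the per-article overlap; it treats edges as undirected, which is an assumption, and all names and sample edges are hypothetical.

```python
from collections import defaultdict


def edge_overlap(can_edges, wan_edges):
    """Fraction of CAN edges that also exist in the WAN (edges treated as undirected)."""
    can = {frozenset(edge) for edge in can_edges}
    wan = {frozenset(edge) for edge in wan_edges}
    if not can:
        raise ValueError("CAN has no edges")
    return len(can & wan) / len(can)


def per_article_overlap(can_edges, wan_edges):
    """For each CAN article, the fraction of its CAN edges that also exist in the WAN."""
    wan = {frozenset(edge) for edge in wan_edges}
    total = defaultdict(int)
    shared = defaultdict(int)
    for edge in {frozenset(edge) for edge in can_edges}:
        for article in edge:
            total[article] += 1
            if edge in wan:
                shared[article] += 1
    return {article: shared[article] / total[article] for article in total}


# Hypothetical article pairs.
can_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
wan_edges = [("B", "A"), ("C", "E")]

print(edge_overlap(can_edges, wan_edges))
print(per_article_overlap(can_edges, wan_edges))
```

If link direction matters for the analysis, replacing the frozensets with ordered tuples is the only change needed.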

Footnote 3: A node represents a Wikipedia article, and an edge represents either a matched identifier between two edits (for CAN), or an explicit link within the Wikipedia graph (for WAN).

Tinati, R., Luczak-Rösch, M., & Hall, W. (to appear). Finding Structure in Wikipedia Edit Activity: An Information Cascade Approach. In WikiWorkshop 2016, co-located with WWW 2016.

Events detected:
• Edward Snowden speech at SXSW conference
• US Supreme Court case on same-sex marriage

Matching identifier                Associated Root Article                       Edges
U.S. Supreme Court                 Hillman v. Maretta                            17,893
NATO                               Joint Jet Fighter Pilot                       13,868
U.S. District Court                BJU Press                                     5,584
Mehr News Agency                   To the Youth in Europe and North America      2,078
U.S. Religious Landscape Survey    Utah                                          1,500

Table 3: 5 highest connected cascades. Each cascade is formed by a particular identifier, and can be associated with a Wikipedia article where the identifier was first used (the root).

(a) Cascade Article Network (CAN): Nodes represent unique Wikipedia articles, edges are shared edits based on a shared identifier matched. A force directed layout has been applied, with edge path lengths determined by edge weight. The strongly connected component (A) contains articles associated with South Korean media, (B) and (C) contain articles related to the USA.

(b) Cascade-to-Cascade path network graph: Nodes are cascades, edges are the shared articles between cascades. The central strongly connected component is established by the identifiers shown in Table 3. A force directed layout has been applied, with edge path lengths determined by edge weight.

Figure 2: Article networks

Cascade Category Co-Occurrence. In order to examine the relatedness between Wikipedia content, we used DBpedia to obtain the category classification labels (dct:subject) associated with a given Wikipedia article. These labels, which are machine and human generated, provide a general classification for the subject (or topic), based on the article’s content. We then calculate the co-occurrence of categories between nodes (articles) within a cascade path [14]. Using the co-occurrence measure of a cascade provides us with a way to measure the potential similarity between the subject and content of the articles within a given cascade. Using DBpedia, our queries found that 78.2% of the total articles within the WAN were identified with at least one category. On average, an article was associated with 2 categories. From the 1,745 unique cascade pathways, 521 were found to contain at least one node (article) mapped to a set of categories, and 360 cascade pathways were identified to have two or more articles with categories associated with them. For the analysis, we removed duplicate nodes within a cascade, which were identified as nodes related to the same Wikipedia article, as their categories would be the same, thus skewing the results.

Based on the remaining cascades which had duplicate nodes removed, and two or more nodes with categories associated with them (20% of total cascades), we calculate the co-occurrence of categories between articles within a given cascade. As shown in Table 4, there was an average co-occurrence of 63.6% between article categories within a given cascade pathway. We also extracted the top 10 categories based on co-occurrence frequency. The findings suggest that the articles within a given cascade tend to relate to the same subject or share similar content. We also found that the most frequent co-occurring topics reflect the strongly connected components found in the CAN network, shown in Figure 2a.

Metric                                  CTC Network
Total Nodes (Article)                   18,896
Matched Article                         14,776
Unique Categories                       1,605
Avg. Category per article               2
Avg. Duplicate Article per Cascade      43.7%
Avg. Cascade Category co-occurrence     63.6%

Table 4: Overview of the cascade mapping to DBpedia categories. Avg. Cascade Category Overlap is calculated on cascades with two or more nodes that are associated with different Wikipedia articles.

5. DISCUSSION

RQ1: Structural Properties. The structure of Wikipedia can be considered as an explicit and static network of hyperlinks connecting articles with articles, and with external resources (e.g., hyperlinks to URLs not prefixed by wikipedia.org). We examined whether an underlying structure between Wikipedia articles occurred, and whether this complements or mimics the explicit linking structure. Our analysis of the Wiener index and identifier entropy of the resulting cascades highlights an over-representation of cascades that are long uniform paths with only one matched identifier. Such single identifier cascades can still be suited to find implicit links between articles and detect bursts around trending topics. But it means that only a small proportion of cascades is suited to find implicit relationships between matched identifier phrases.

We conducted the analysis of patterns of burstiness in order to examine the time dimension on the macro and the micro level of the captured edits. The TIC model is based on the principle of capturing elements from a stream that contain a particular informational pattern and bringing subsets of these elements together as branching and merging cascades when a pattern matches multiple pieces of information in some of the elements, so that sequences are linked together. As such it is a generalisation of Kleinberg’s approach presented in [16], which is based on flat sequences of elements from a stream in which only one particular matched piece of information occurs. While the overall burstiness does not show significant bursts from the macroscopic

Page 21: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

The MacroScope is technology

External APIs
• Twitter
• Wikipedia
• Instagram
• Google Trends
• Yahoo Trends

Unstructured Web Streams or Web Scraped Pages

Pre-processing Stage:
1. Enrich Streams
2. Unify feeds into WO JSON Format

Web Observatory JSON Data Schema

Streaming Stage:
1. Post incoming stream to RabbitMQ exchange (each source has its own exchange)

RabbitMQ JSON Stream

Hadoop Storage Stage:
1. Apache Flume for each stream

HDFS

HTTP Streaming Stage:
1. Send Stream to Web Observatory Server

Socket.IO

Daily Storage Stage:
1. MapReduce Daily Results

MongoDB

MacroScope

Socket.IO
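The pre-processing and streaming stages on this slide unify source-specific feeds into one JSON format and route each source to its own exchange. The sketch below illustrates that idea with the standard library only. The schema fields (`source`, `timestamp`, `content`), the raw record layouts and the `wo.<source>` exchange naming are all hypothetical; the actual Web Observatory JSON schema is not shown in the slides, and no message broker is contacted here.

```python
import json
from datetime import datetime, timezone


def unify_record(source, raw):
    """Map a source-specific record onto a shared (hypothetical) JSON-ready schema."""
    if source == "twitter":
        content, seconds = raw["text"], raw["timestamp"]
    elif source == "wikipedia":
        content, seconds = raw["title"], raw["timestamp"]
    else:
        raise ValueError(f"unsupported source: {source}")
    return {
        "source": source,
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat(),
        "content": content,
    }


def exchange_name(source):
    """One exchange per source, following the slide; the naming scheme is an assumption."""
    return f"wo.{source}"


def to_message(source, raw):
    """Return the (exchange, JSON body) pair a publisher would send."""
    return exchange_name(source), json.dumps(unify_record(source, raw), sort_keys=True)


exchange, body = to_message("twitter", {"text": "hello web", "timestamp": 0})
print(exchange, body)
```

Keeping the unification step separate from publishing means the same records can be handed to a message queue, a file-based store or a test harness without change.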

Page 22: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

There is more than one MacroScope

• six screens in WAIS labs
• as part of presentations
• as a mobile exhibit
• as a Web application

Page 23: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Cross-disciplinary research

Scholars from discipline A

Scholars from discipline B

Adaptive epistemological framework

Page 24: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

Engagement with the general public

Scholars

People from the general public

demonstrating the power and the danger of individuals sharing information online

developing a new “situational ethics of data”

Page 25: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

The MacroScope

Scholars The public

Page 26: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

The MacroScope

Surveys, interviews, focus groups, observations

Scholars The public

Page 27: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

A mantra for the MacroScope: “Overview first, zoom and filter, then details-on-demand”* and capture engagement.

* Shneiderman, B. (1996, September). The eyes have it: A task by data type taxonomy for information visualizations. In Visual Languages, 1996. Proceedings., IEEE Symposium on (pp. 336-343). IEEE.

Image source: screenshot taken from http://data.shopsavvy.mobi/globe

Page 28: The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity

The Web Science MacroScope: Mixed-methods Approach for Understanding Web Activity Markus Luczak-Roesch @mluczak | http://markus-luczak.de Image source: https://en.wikipedia.org/wiki/File:Compound_Microscope_(cropped).JPG, CC BY-SA 4.0

Discover, Describe, Directly engage