Exploring Article Networks on Wikipedia with NodeXL

Transcript of Exploring Article Networks on Wikipedia with NodeXL

Page 1: Exploring Article Networks on Wikipedia with NodeXL

EXPLORING ARTICLE NETWORKS ON WIKIPEDIA WITH NODEXL

Page 2: Exploring Article Networks on Wikipedia with NodeXL

PRESENTATION DESCRIPTION

• With 4.8 million articles in the English version of Wikipedia, this crowd-sourced online encyclopedia is regularly one of the ten most-visited sites online. For many, it is the go-to source for a first read on a topic. The free and open-source Network Overview, Discovery and Exploration for Excel (NodeXL), an add-on to Microsoft Excel, enables the capture of “article networks” from Wikipedia. Data visualizations based on this kind of content network analysis support the development of research leads; some understanding of public conceptualizations of related concepts, peoples, events, and phenomena; the profiling of Wikipedia editors (both humans and bots); and other research insights. This presentation will showcase this affordance of NodeXL and offer some ideas for practical applications of this channel of research and knowing.

Page 3: Exploring Article Networks on Wikipedia with NodeXL

OVERVIEW

• Wikipedia ethos and practices

• Wikipedia

• The many Wikipedias; the English Wikipedia

• The Wikimedia Foundation

• MediaWiki and basic functionalities

• Basic article network analysis

• NodeXL and basic functionalities; automation

Page 4: Exploring Article Networks on Wikipedia with NodeXL

OVERVIEW (CONT.)

• HTTP page networks on Wikipedia:

• article networks

• human author / editor networks

• robot networks

• Live demos

• Other (future) networks from Wikipedia

Page 5: Exploring Article Networks on Wikipedia with NodeXL

WIKIPEDIA ETHOS AND PRACTICES

• Objective, fact-based, and research-focused

• Full research citations

• Isolating of opinions into Talk pages

• Open

• Open-access

• Open-source, public domain-released

• Crowd-sourced knowledge co-creation; curated public data

• Crowd-funded 501(c)(3); transparent finances ($58.5 million goal for FY 2015)

• Editing via email-verified accounts or Internet Protocol (IP) capture

Page 6: Exploring Article Networks on Wikipedia with NodeXL

WIKIPEDIA

THE MANY WIKIPEDIAS

• 288 Wikipedias (with 277 active)

• In order of articles: English (13.9%), Swedish (5.6%), Dutch (5.2%), German (5.25%), French (4.6%), Waray-Waray (3.6%), Russian (3.5%), Cebuano (3.4%), Italian (3.4%), Spanish (3.4%), and Other (48.2%)

• (“List of Wikipedias” on Wikipedia)

THE ENGLISH WIKIPEDIA

• Founded on Jan. 15, 2001

• 4.8 million articles

• 25 million user accounts

• 1,347 administrators (“English Wikipedia” on Wikipedia)

Page 7: Exploring Article Networks on Wikipedia with NodeXL

THE WIKIMEDIA FOUNDATION

• Objective: to encourage “the growth, development and distribution of free, multilingual, educational content,” and to provide “the full content of these wiki-based projects to the public free of charge”

• A range of projects: Wikipedia, Wikibooks, Wikiversity, Wikimedia Commons, Wiktionary, Wikiquote, Wikivoyage, Wikidata, Wikinews, Wikisource, Wikispecies, and MediaWiki (Wikimedia Foundation)

Page 8: Exploring Article Networks on Wikipedia with NodeXL

MEDIAWIKI AND BASIC FUNCTIONALITIES

• “wiki wiki”: “quick” or “fast” in Hawaiian

• Ward Cunningham developed the first wiki software (WikiWikiWeb) in 1994 to enable online collaborations with history versioning and rollback capabilities

• MediaWiki first released in 2002 for Wikipedia; now maintained by the Wikimedia Foundation (founded in 2003)

• Magnus Manske and Lee Daniel Crocker were the initial developers of this tool using PHP (MediaWiki)

Page 9: Exploring Article Networks on Wikipedia with NodeXL

A WIKIMEDIA ARTICLE INTERFACE

Page 10: Exploring Article Networks on Wikipedia with NodeXL

A VIEW OF THE REVISION HISTORY

Page 11: Exploring Article Networks on Wikipedia with NodeXL

BASIC ARTICLE NETWORK ANALYSIS

• Basics of network graphs: nodes-links, entities-relationships, vertices-edges; undirected or directed (digraphs) graphs; networks and meta-networks; subgraphs and clusters, motifs; network centrality

• Direct ties represented in ego neighborhoods (with a maximum geodesic distance or graph diameter of 2); also 1.5 degree ties for transitivity (with a maximum geodesic distance or graph diameter of 3) and 2 degree ties to include networks of the respective “alters” (with much larger maximum geodesic distances possible)
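
A minimal sketch of the "maximum geodesic distance" idea above, using only the Python standard library. The vertex names (`ego`, `a`, `a1`, and so on) and the function name are illustrative assumptions, not anything exported by NodeXL; the graph is treated as undirected and connected.

```python
from collections import deque


def max_geodesic_distance(edges):
    """Return the longest shortest path (diameter) of a connected undirected graph."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    longest = 0
    for start in adjacency:
        # Breadth-first search gives shortest hop counts from `start`.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        longest = max(longest, max(distance.values()))
    return longest


# Hypothetical 1-degree ego neighborhood: the ego plus its direct ties.
one_degree = [("ego", "a"), ("ego", "b"), ("ego", "c")]
# Hypothetical 2-degree network: some alters bring in their own ties.
two_degree = one_degree + [("a", "a1"), ("b", "b1")]

print(max_geodesic_distance(one_degree))
print(max_geodesic_distance(two_degree))
```

In the 1-degree toy network any two alters are two hops apart (through the ego), while adding the alters' own ties lengthens the longest path.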

Page 12: Exploring Article Networks on Wikipedia with NodeXL

BASIC ARTICLE NETWORK ANALYSIS (CONT.)

• Entities may be individuals or groups, contents, and other elements

• Relatedness: Article networks created based on inlinks and outlinks; node “degree”

• Other types of relatedness are possible, such as those based on word co-occurrences, title relatedness (same synset or “synonym set”), shared categories, and others

• Relations are conceptualized as enabling paths
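
As a rough illustration of node "degree" from inlinks and outlinks, the sketch below counts both for each vertex in a directed edge list. The page titles in the sample edges are made up for the example and do not reflect actual Wikipedia link data.

```python
from collections import Counter


def degree_table(edges):
    """Count inlinks, outlinks, and total degree for each vertex of a directed edge list."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    vertices = set(out_degree) | set(in_degree)
    return {
        vertex: {
            "in": in_degree[vertex],
            "out": out_degree[vertex],
            "degree": in_degree[vertex] + out_degree[vertex],
        }
        for vertex in vertices
    }


# Hypothetical (source page, linked page) pairs.
sample_edges = [
    ("MediaWiki", "Wiki"),
    ("MediaWiki", "PHP"),
    ("Wiki", "MediaWiki"),
    ("Wikipedia", "MediaWiki"),
]

table = degree_table(sample_edges)
for title in sorted(table):
    print(title, table[title])
```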

Page 13: Exploring Article Networks on Wikipedia with NodeXL

NODEXL AND BASIC FUNCTIONALITIES; AUTOMATION

• A free and open-source add-on to Microsoft Excel available on the Microsoft CodePlex platform

• Enables…

• Graph visualization (with datasets from UCINET, GraphML, and other types)

• Data extraction from a number of social media platform APIs; refreshed runs based on the same parameters (macros)

• Large number of tools of graph analysis

• A number of layout algorithms and selections to represent the data visually

Page 14: Exploring Article Networks on Wikipedia with NodeXL

HTTP PAGE NETWORKS ON WIKIPEDIA (IN THIS CASE)

• HTTP page links within Wikipedia, not connecting out to the Surface Web

• One-directional (outlink) directed graph of the target Wikipedia page

• May include article page networks, human page networks, robot page networks, and others

• Networks seeded by one target title or name (as long as the string appears as a page in Wikipedia)

• No need for an application programming interface (API) on the MediaWiki platform
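
The seeded, outlink-only crawl described above can be sketched as a depth-limited breadth-first walk. This is only an assumption about the general technique, not the NodeXL importer's actual code: the `links` table is a hypothetical in-memory stand-in for the pages' outlinks, and `depth` uses whole hops (1 or 2) rather than the tool's 1.5 setting.

```python
from collections import deque


def outlink_network(links, seed, depth=1):
    """Collect vertices and directed edges within `depth` outlink hops of `seed`.

    `links` maps a page title to the titles it links to (hypothetical data).
    """
    if seed not in links:
        # Mirrors the requirement that the seed string exists as a page.
        raise KeyError(f"no page titled {seed!r}")

    vertices = {seed}
    edges = set()
    queue = deque([(seed, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand pages at the outer boundary
        for target in links.get(page, ()):
            edges.add((page, target))
            if target not in vertices:
                vertices.add(target)
                queue.append((target, hops + 1))
    return vertices, edges


# Hypothetical outlink table for a few pages.
toy_links = {
    "MediaWiki": ["Wiki", "PHP"],
    "Wiki": ["Ward Cunningham", "MediaWiki"],
    "PHP": [],
}

v1, e1 = outlink_network(toy_links, "MediaWiki", depth=1)
v2, e2 = outlink_network(toy_links, "MediaWiki", depth=2)
print(len(v1), len(e1))
print(len(v2), len(e2))
```

Even in this toy table the second hop doubles the edge count, which hints at why the 2-degree extractions in the later slides grow so quickly.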

Page 15: Exploring Article Networks on Wikipedia with NodeXL

MEDIAWIKI ARTICLE NETWORK ON WIKIPEDIA (1 DEG., 237 VERTICES, 237 EDGES)

Page 16: Exploring Article Networks on Wikipedia with NodeXL

MEDIAWIKI ARTICLE NETWORK ON WIKIPEDIA (1.5 DEG., 12,368 VERTICES AND 17,686 UNIQUE EDGES)

Page 17: Exploring Article Networks on Wikipedia with NodeXL

MEDIAWIKI ARTICLE NETWORK ON WIKIPEDIA (2 DEG., 923,006 VERTICES)

In the first run, the software threw an “out of memory” exception and crashed. Another run was conducted on a different machine with more processing capability. The screenshots are from that data extraction. The data itself included some edge pairs (over half a dozen) in which one of the vertices was missing.
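
One way to deal with edge pairs that lack a vertex is to separate them out before analysis. The following is a minimal sketch under the assumption that the edge list has been exported as (source, target) pairs in which a missing vertex appears as `None` or an empty string; the function name and the sample rows are hypothetical.

```python
def split_incomplete_edges(rows):
    """Separate edge pairs with both vertices from pairs where one vertex is missing."""
    complete, incomplete = [], []
    for source, target in rows:
        # Treat None and whitespace-only cells as missing.
        source = (source or "").strip()
        target = (target or "").strip()
        if source and target:
            complete.append((source, target))
        else:
            incomplete.append((source, target))
    return complete, incomplete


# Hypothetical exported rows; the last two each lack one vertex.
rows = [
    ("MediaWiki", "Wiki"),
    ("MediaWiki", "PHP"),
    ("MediaWiki", None),
    ("", "Wikipedia"),
]

complete, incomplete = split_incomplete_edges(rows)
print(len(complete), "usable edges;", len(incomplete), "set aside for review")
```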

Page 18: Exploring Article Networks on Wikipedia with NodeXL

EXAMPLE: ARTICLE NETWORK

• Who are the individuals related to a topic? Events? Years? Topics? Which of these may be useful leads to learn more about the basic seed topic?

• Based on a real-world individual, what is he or she known for? Who are the people that this person is connected with?

• Based on a technology, when did it originate? Who originated it? What were precursor inventions? What inventions were linked to the particular technology?

Page 19: Exploring Article Networks on Wikipedia with NodeXL

EXAMPLE: ARTICLE NETWORK (CONT.)

• Based on collected lists, who is on a target list, and for what?

• Based on a particular topic, are there gaps in the information based on “missing” article links?

• Based on a particular phenomenon, event, phrase, or individual, in a foreign context and foreign language, what may be learned?

Page 20: Exploring Article Networks on Wikipedia with NodeXL

WIKI ARTICLE NETWORK ON WIKIPEDIA (1 DEG., 162 VERTICES)

Page 21: Exploring Article Networks on Wikipedia with NodeXL

WEB_LOG_ANALYSIS_SOFTWARE ARTICLE NETWORK ON WIKIPEDIA (1 DEG., 13 VERTICES)

Page 22: Exploring Article Networks on Wikipedia with NodeXL

EXAMPLE: HUMAN (AUTHOR / EDITOR) USER NETWORK

• Based on the human user’s network on Wikipedia, what articles does he or she tend to edit? In total, what does this network suggest about the person behind the edits?

• (This requires the existence of a user page though.)

Page 23: Exploring Article Networks on Wikipedia with NodeXL

USER:LWEDEKIND NETWORK ON WIKIPEDIA (1 DEG., 9 VERTICES)

Page 24: Exploring Article Networks on Wikipedia with NodeXL

USER:THIS_LOUSY_T-SHIRT ARTICLE NETWORK ON WIKIPEDIA (1 DEG., 30 VERTICES)

Page 25: Exploring Article Networks on Wikipedia with NodeXL

EXAMPLE: ROBOT NETWORK

• Based on the approved robot user’s network, what are the interests of the maker of the robot? What other accounts is the robot connected to?

Page 26: Exploring Article Networks on Wikipedia with NodeXL

USER:OGREBOT NETWORK ON WIKIPEDIA (1 DEG., 5 VERTICES)

Page 27: Exploring Article Networks on Wikipedia with NodeXL

USER:EMAUSBOT NETWORK ON WIKIPEDIA (1 DEG., 2 VERTICES)

Page 28: Exploring Article Networks on Wikipedia with NodeXL

ADDITIONAL APPROACHES

• Chaining from one target account to related others

• Cross-comparing information on the Wikipedia site with the extracted networks

• Connecting the Wikipedia information with related sites on the Surface Web / World Wide Web (WWW) and Internet

Page 29: Exploring Article Networks on Wikipedia with NodeXL

OTHER (FUTURE) NETWORKS FROM WIKIPEDIA

• The third-party tool for NodeXL has placeholders for user-content (two-mode) network extractions and the mapping of co-editing networks, but those functions are apparently not currently enabled

Page 30: Exploring Article Networks on Wikipedia with NodeXL

DISCUSSIONS

• Questions?

• Ideas for research?
