Repository for data crawled from multiple social networks

CREATING A REPOSITORY FOR DATA CRAWLED FROM MULTIPLE SOCIAL NETWORKS. ATHENS UNIVERSITY OF ECONOMICS AND BUSINESS. Konstantinos Christofilos

Transcript of Repository for data crawled from multiple social networks

Page 1: Repository for data crawled from multiple social networks

CREATING A REPOSITORY FOR DATA CRAWLED FROM MULTIPLE SOCIAL NETWORKS

ATHENS UNIVERSITY OF ECONOMICS AND BUSINESS

Konstantinos Christofilos

Page 2: Repository for data crawled from multiple social networks

OK. What is the problem?

In order to get data from each service, you have to speak its language (API).

Page 3: Repository for data crawled from multiple social networks

What can we do about that?

We can create a repository that mixes data from several services and query it to produce more complex results.

Page 4: Repository for data crawled from multiple social networks

How are we going to do that?

Page 5: Repository for data crawled from multiple social networks

How are we going to do that?

Step 1 – Generate endpoints

Page 6: Repository for data crawled from multiple social networks

How are we going to do that?

Step 2 – Load data

Page 7: Repository for data crawled from multiple social networks

Step 1 – Generate endpoints

Importer interface

Example (Facebook page):

The command that generates endpoints takes as input a text file with one name per line.
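The input format described above can be sketched as follows. This is a minimal illustration, not the application's actual command: the function names `read_names` and `build_search_queries`, the file name `names.txt`, and the service identifiers are all assumptions made for the example.

```python
from pathlib import Path
import tempfile


def read_names(path):
    """Return the names in a text file, one per line, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_search_queries(names, services=("facebook", "twitter", "instagram")):
    """Pair every name with every service (hypothetical step before the
    real API lookups that would turn each pair into endpoints)."""
    return [(service, name) for name in names for service in services]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical input file: one name per line.
        names_file = Path(tmp) / "names.txt"
        names_file.write_text("Jeff Bezos\nAngela Merkel\n\nTim Cook\n", encoding="utf-8")
        for service, name in build_search_queries(read_names(names_file)):
            print(service, name)
```

In a real run, each (service, name) pair would be sent to that service's search API to discover the matching accounts.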

Page 8: Repository for data crawled from multiple social networks

Step 2 – Load data

Input

Output

Page 9: Repository for data crawled from multiple social networks

Step 2 – Load data (Input)

Example (Facebook page)

Page 10: Repository for data crawled from multiple social networks

Step 2 – Load data (Output)

Neo4j

Apache Jena

Page 11: Repository for data crawled from multiple social networks

Example

A simple example running three names: Kostas Christofilos, Iannis Kotidis, Vasilis Spiropoulos

Page 12: Repository for data crawled from multiple social networks

Example

Neo4j database size: 240 KB

Apache Jena database size: 75.7 KB

Page 13: Repository for data crawled from multiple social networks

Scaling?

To scale easily, the application is designed to handle Inputs and Outputs as APIs. This approach allows it to scale horizontally.
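One way to picture Inputs and Outputs as interchangeable APIs is a pair of small abstract interfaces. The sketch below is an assumption about how such a design could look, not the application's actual code: the class names `Input`, `Output`, `StaticInput`, `MemoryOutput` and the function `run` are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class Input(ABC):
    """Hypothetical source interface: one implementation per social network API."""

    @abstractmethod
    def fetch(self, endpoint: str) -> List[dict]:
        """Return the records available at an endpoint."""


class Output(ABC):
    """Hypothetical sink interface: one implementation per storage engine."""

    @abstractmethod
    def save(self, record: dict) -> None:
        """Persist a single record."""


class StaticInput(Input):
    """Stand-in for a real API client; serves canned records."""

    def __init__(self, data: Dict[str, List[dict]]):
        self._data = data

    def fetch(self, endpoint: str) -> List[dict]:
        return list(self._data.get(endpoint, []))


class MemoryOutput(Output):
    """Stand-in for a real database; keeps records in a list."""

    def __init__(self):
        self.records: List[dict] = []

    def save(self, record: dict) -> None:
        self.records.append(record)


def run(endpoints: Iterable[str], source: Input, sinks: Iterable[Output]) -> int:
    """Fetch every endpoint and save each record to every sink; return the record count."""
    sinks = list(sinks)
    count = 0
    for endpoint in endpoints:
        for record in source.fetch(endpoint):
            for sink in sinks:
                sink.save(record)
            count += 1
    return count
```

Because each worker only depends on these interfaces, the list of endpoints could be split across several machines that all write to the same storage engines.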

Page 14: Repository for data crawled from multiple social networks

Process

Gather Data

• We took the names of the world's greatest leaders from Fortune magazine: http://fortune.com/worlds-greatest-leaders/
• The application queried the APIs for accounts related to these names
• It built a list of endpoints for those names
• Data was fetched from those endpoints
• Entities were recognized
• Data was saved into two different databases (Apache Jena, Neo4j)

Page 15: Repository for data crawled from multiple social networks

Process

The names that were used are the following:

Jeff Bezos, Angela Merkel, Aung San Suu Kyi, Pope Francis, Tim Cook, John Legend, Christina Figueres, Paul Ryan, Ruth Bader Ginsburg, Sheikh Hasina, Nick Saban, Huateng "Pony" Ma, Sergio Moro, Bono, Stephen Curry, Steve Kerr, Bryan Stevenson, Nikki Haley, Lin-Manuel Miranda, Marvin Ellison, Reshma Saujani, Larry Fink, Scott Kelly, Mikhail Kornienko, David Miliband, Anna Maria Chavez, Carla Hayden, Maurizio Macri, Alicia Garza, Patrisse Cullors, Opal Tometi, Chai Jing, Moncef Slaoui, John Oliver, Marc Edwards, Arthur Brooks, Rosie Batty, Kristen Griest, Shaye Haver, Denis Mukwege, Christine Lagarde, Marc Benioff, Gina Raimondo, Amina Mohammed, Domenico Lucano, Melinda Gates, Susan Desmond-Hellman, Arvind Kejriwal, Jorge Ramos, Michael Froman, Mina Guli, Ramon Mendez, Bright Simons, Justin Trudeau, Clare Rewcastle Brown, Tshering Tobgay

Page 16: Repository for data crawled from multiple social networks

Process

After the application parsed the previous list of names, it produced more than 11,000 endpoints on Facebook, Instagram and Twitter, with the following distribution.

Names   Facebook Page Endpoints   Twitter Endpoints   Instagram Endpoints   Total Endpoints
56      437                       866                 9903                  11206
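The distribution can be checked with a few lines of arithmetic. The snippet below only uses the figures from the slide; the dictionary keys are labels chosen for this example.

```python
# Endpoint counts as reported on the slide.
endpoints = {"facebook_page": 437, "twitter": 866, "instagram": 9903}

total = sum(endpoints.values())
print("total endpoints:", total)  # 11206, matching the reported total

# Share of each service, as a percentage of all endpoints.
shares = {service: round(100 * count / total, 1) for service, count in endpoints.items()}
for service, share in shares.items():
    print(f"{service}: {share}%")

# Average number of endpoints discovered per name (56 names were used).
print("endpoints per name:", round(total / 56, 1))
```

Instagram accounts for the large majority of the endpoints, roughly 88% of the total.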

Page 17: Repository for data crawled from multiple social networks

Process

Parsing those endpoints on a single workstation* took about 14 days, from 2016-06-04 to 2016-06-18. Most of the time was consumed by the entity recognition process.

*Workstation specs: AMD FX 2-Core CPU, 4GB RAM, 120GB SSD, Linux OS with 5 concurrent processes of the application running

Page 18: Repository for data crawled from multiple social networks

Results

The application ran a single pass over the generated endpoints, which took about 14 days on a single workstation and generated 126,395 nodes.

Endpoints   Machines   Generated Nodes   Time      Average nodes/day/machine
11206       1          126395            14 days   9028
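The average in the last column follows directly from the other figures. The sketch below reproduces it and adds a rough projection for more machines; the projection assumes the work splits evenly with no coordination overhead, which is an idealisation and not a measured result.

```python
# Figures reported on the slide.
endpoints = 11206
generated_nodes = 126395
days = 14
machines = 1

nodes_per_day_per_machine = generated_nodes / days / machines
nodes_per_endpoint = generated_nodes / endpoints

print(round(nodes_per_day_per_machine))  # 9028
print(round(nodes_per_endpoint, 1))


def estimated_days(num_machines, single_machine_days=days):
    """Idealised estimate: assumes perfectly even splitting and no overhead."""
    return single_machine_days / num_machines


for n in (1, 2, 7, 14):
    print(n, "machine(s):", estimated_days(n), "days")
```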

Page 19: Repository for data crawled from multiple social networks

Results

[Figure: Data import time distribution (Empty database). Bar chart comparing Neo4j and Apache Jena by share of time spent on Import (%) and NER (%); vertical axis from 60 to 100.]

Page 20: Repository for data crawled from multiple social networks

Results

[Figure: Data import time distribution (100,000+ nodes). Bar chart comparing Neo4j and Apache Jena by share of time spent on Import (%) and NER (%); vertical axis from 80 to 100.]

Page 21: Repository for data crawled from multiple social networks

Results

[Figure: Person Mentions. Horizontal bar chart with bars for Jesus, Lebron, Stephen, Steph and Curry; horizontal axis from 0 to 25,000 mentions.]

Page 22: Repository for data crawled from multiple social networks

Results

[Figure: Organization Mentions. Horizontal bar chart with bars for Padre, GSW, Santo, Cavs and NBA; horizontal axis from 0 to 8,000 mentions.]

Page 23: Repository for data crawled from multiple social networks

Results

[Figure: Location Mentions. Horizontal bar chart with bars for Jordan, America, Hermosa, Venezuela and Cleveland; horizontal axis from 0 to 4,000 mentions.]

Page 24: Repository for data crawled from multiple social networks

Conclusion

• Data is generated in vast amounts every moment.
• We created an approach for linking heterogeneous data in a single repository.
• The generated data can be analyzed to produce combined results.
• Patterns can be identified from that repository.

Page 25: Repository for data crawled from multiple social networks

Future extensions

• New Inputs can be implemented (new APIs)
• New Outputs can be implemented (new storage engines)
• The name list can be saved in a database and accessed from all nodes. Currently it is a local file
• A queue for the entity recognizer can be implemented to remove the blocking code
• Centralized logging for cluster monitoring. Currently the log goes to STDOUT
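The queue idea for the entity recognizer could look roughly like the sketch below. It is only an illustration using Python's standard `queue` and `threading` modules: the function names are hypothetical, and `recognize` is a trivial placeholder standing in for the real, slow recognition step.

```python
import queue
import threading


def recognize(text):
    """Placeholder for the slow entity recognition step (not a real NER)."""
    return [word for word in text.split() if word[:1].isupper()]


def worker(tasks, results):
    """Consume texts from the queue until a None sentinel arrives."""
    while True:
        text = tasks.get()
        if text is None:
            tasks.task_done()
            break
        results.append((text, recognize(text)))
        tasks.task_done()


def process(texts, num_workers=2):
    """Hand texts to background workers so the fetching code does not block on NER."""
    tasks = queue.Queue()
    results = []
    threads = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for text in texts:
        tasks.put(text)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results  # order depends on thread scheduling
```

In a cluster, the in-process queue would likely be replaced by a shared message queue so that recognition can run on separate machines; that choice is outside what the slides specify.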

Page 26: Repository for data crawled from multiple social networks

REST APIs

• REST stands for Representational State Transfer
• REST ignores the details of component implementation
• REST is an architecture that enables applications to communicate without knowing the underlying technology
• REST APIs promote easy reusability

Page 27: Repository for data crawled from multiple social networks

REST APIS

Page 28: Repository for data crawled from multiple social networks

Graph Databases

• A graph database is a type of database that uses graph structures with nodes, edges and properties to represent and store data
• Graph databases are based on graph theory and employ nodes, edges and properties
• Nodes represent entities such as people, businesses, accounts, or any other item you might want to keep track of
• Edges, also known as relationships, are the lines that connect nodes to other nodes and represent the relationship between them

Page 29: Repository for data crawled from multiple social networks

Graph Databases

Page 30: Repository for data crawled from multiple social networks

Graph Databases

Resource Description Framework (RDF)

• RDF is a standard model for data interchange on the Web, specified by the W3C
• The Web is a graph made up of nodes, edges and relations
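RDF expresses a graph as subject, predicate, object triples. The sketch below models that idea with plain Python tuples rather than a real RDF library; the `http://example.org/` namespace, the property names and the helper functions are all made up for illustration, and the serializer does not escape special characters in literals.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str      # IRI of the resource being described
    predicate: str    # IRI of the property
    obj: str          # IRI or literal value
    is_literal: bool  # True when obj is a plain string literal


EX = "http://example.org/"  # hypothetical namespace for this example

triples: List[Triple] = [
    Triple(EX + "account/1", EX + "ownerName", "Jeff Bezos", True),
    Triple(EX + "account/1", EX + "hostedOn", EX + "service/twitter", False),
    Triple(EX + "post/10", EX + "postedBy", EX + "account/1", False),
]


def objects(data, subject, predicate):
    """Return every object attached to a subject through a given predicate."""
    return [t.obj for t in data if t.subject == subject and t.predicate == predicate]


def to_ntriples(t):
    """Serialize one triple in an N-Triples-like form (no literal escaping)."""
    obj = f'"{t.obj}"' if t.is_literal else f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


if __name__ == "__main__":
    for t in triples:
        print(to_ntriples(t))
```

A triple store such as Apache Jena holds data in this shape and answers pattern queries over it.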

Page 31: Repository for data crawled from multiple social networks

Graph Databases

Property Graphs

• Property graph databases are graph databases that contain connected entities, which can hold any number of attributes
• Nodes can be tagged with labels representing their different roles
• Labels may also serve to attach metadata to certain nodes
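The property graph model described in these bullets can be sketched as a small in-memory data structure. This is an illustration only, not how Neo4j stores data; the class and method names, labels and properties are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    node_id: int
    labels: Set[str] = field(default_factory=set)                # roles, e.g. "Person"
    properties: Dict[str, object] = field(default_factory=dict)  # any number of attributes


@dataclass
class Relationship:
    start: int     # id of the source node
    end: int       # id of the target node
    rel_type: str  # e.g. "MENTIONS"
    properties: Dict[str, object] = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes: Dict[int, Node] = {}
        self.relationships: List[Relationship] = []

    def add_node(self, labels, **properties) -> Node:
        node = Node(len(self.nodes) + 1, set(labels), dict(properties))
        self.nodes[node.node_id] = node
        return node

    def relate(self, start: Node, rel_type: str, end: Node, **properties) -> Relationship:
        rel = Relationship(start.node_id, end.node_id, rel_type, dict(properties))
        self.relationships.append(rel)
        return rel

    def nodes_with_label(self, label: str) -> List[Node]:
        return [n for n in self.nodes.values() if label in n.labels]

    def neighbours(self, node: Node, rel_type: str) -> List[Node]:
        """Nodes reached from `node` by following outgoing relationships of one type."""
        return [self.nodes[r.end] for r in self.relationships
                if r.start == node.node_id and r.rel_type == rel_type]
```

For example, a post node labelled `Post` could be linked to a node labelled `Person` through a `MENTIONS` relationship, and `nodes_with_label("Person")` would then list every recognized person.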

Page 32: Repository for data crawled from multiple social networks

Graph DatabasesProperty Graphs

Page 33: Repository for data crawled from multiple social networks

Natural Language Processing (NLP)

• Natural language processing (NLP) is an area of research and application that explores how computers can be used to understand and manipulate natural language text or speech to do useful things
• The foundations of NLP lie in a number of disciplines: information sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, psychology, etc.
• Applications of NLP span a number of fields of study, such as machine translation, natural language text processing and summarization, user interfaces, multilingual and cross-language information retrieval, speech recognition, artificial intelligence and expert systems

Page 34: Repository for data crawled from multiple social networks

Natural Language Processing (NLP)

Named Entity Recognition (NER)

Named entity recognition (NER) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of time, etc. NER systems use linguistic grammar-based techniques as well as statistical models, i.e. machine learning.

We used the Stanford Named Entity Recognizer, created by the Stanford Natural Language Processing Group: http://nlp.stanford.edu/software/CRF-NER.shtml
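To make the input and output of an NER step concrete, the sketch below tags tokens with a tiny dictionary (gazetteer) lookup and counts mentions per category. This is a deliberately simple stand-in and not the Stanford recognizer, which uses a statistical CRF model; the dictionary entries and function names are chosen only for illustration.

```python
import re
from collections import Counter

# Toy gazetteer: lowercase token -> category. A real NER model learns this
# from data and uses context instead of a fixed lookup.
GAZETTEER = {
    "curry": "PERSON",
    "lebron": "PERSON",
    "nba": "ORGANIZATION",
    "cavs": "ORGANIZATION",
    "cleveland": "LOCATION",
    "venezuela": "LOCATION",
}


def tag(text):
    """Return (token, label) pairs; tokens not in the gazetteer get the label 'O'."""
    tokens = re.findall(r"\w+", text)
    return [(token, GAZETTEER.get(token.lower(), "O")) for token in tokens]


def count_mentions(texts):
    """Count tagged tokens per category, keyed by the token as written."""
    counts = {}
    for text in texts:
        for token, label in tag(text):
            if label != "O":
                counts.setdefault(label, Counter())[token] += 1
    return counts


if __name__ == "__main__":
    posts = ["Curry and Lebron meet in Cleveland", "The NBA finals: Curry vs the Cavs"]
    for label, counter in count_mentions(posts).items():
        print(label, dict(counter))
```

Aggregating such counts over all fetched posts is the kind of computation behind per-category mention charts.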

Page 35: Repository for data crawled from multiple social networks

Natural Language Processing (NLP)Named Entity Recognition (NER)