Social Networks and Data Mining

SUSHIL KULKARNI, JAI-HIND COLLEGE, [email protected]

description

This is a lecture on social networks and an introduction to data mining.

Transcript of Social Networks and Data Mining

Page 1: Social Networks and Data Mining

SUSHIL KULKARNI, JAI-HIND COLLEGE

[email protected]@yahoo.co.in

Page 2: Social Networks and Data Mining

- Social Networks: Example
- Technology used
- What is Data Mining?
- DM Process & Example
- DM Queries
- DM Tasks and Methods
- Relation & Data Warehouse
- What is ETL?
- Data Preprocessing

Page 3: Social Networks and Data Mining
Page 4: Social Networks and Data Mining

What is a Network?

[Figure: a network drawn as nodes joined by links]

Web definition: A set of nodes, points, or locations connected by means of data, voice, and video communications for the purpose of exchange.

Page 5: Social Networks and Data Mining

Social Networks

A social network is a social structure of people who are related (directly or indirectly) to each other through a common relation or interest.

Page 6: Social Networks and Data Mining

Social Network Analysis

Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.

The nodes in the network are the people and groups, while the links show relationships or flows between the nodes.
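The node-and-link view can be sketched in a few lines of Python. This is only an illustration: the people, the links and the helper names `build_network` and `degree` are made up for the example and do not come from the lecture.

```python
from collections import defaultdict

# Hypothetical links (relationships); each pair is one undirected link.
links = [("Asha", "Bina"), ("Asha", "Chetan"), ("Bina", "Chetan"), ("Chetan", "Dev")]

def build_network(pairs):
    """Map each node to the set of nodes it is directly linked to."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return dict(network)

def degree(network, node):
    """Number of direct links of a node, one of the simplest SNA measures."""
    return len(network.get(node, ()))

network = build_network(links)
degrees = {node: degree(network, node) for node in network}
print(degrees)
```

Here the node with the most links (Chetan) is the best connected person in this toy network.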

Page 7: Social Networks and Data Mining

A shift in approach: from ‘synthesis’ to ‘analysis’

Problems
- High cost of manual surveys
- Survey bias: perceptions of individuals may be incorrect
- Logistics: organizations are now spread across several countries.

[Figure: Shift in approach. Synthesis: employee surveys, cognitive networks for A, B and C, social network. Analysis: electronic communication (email, web logs), social network, cognitive network.]

Page 8: Social Networks and Data Mining
Page 9: Social Networks and Data Mining

Technology

Various technologies that help in creating social networks are:
- Email
- Blogs
- Social networking software like Orkut, Facebook, Flickr etc.

Page 10: Social Networks and Data Mining

SOCIAL NETWORK: Profile & Platforms

USENET

Page 11: Social Networks and Data Mining

SOCIAL NETWORK: Profile & Platforms

Social Community

Page 12: Social Networks and Data Mining

SOCIAL NETWORK: Growth

Page 13: Social Networks and Data Mining

SOCIAL NETWORK : Growth Rate

Page 14: Social Networks and Data Mining

SOCIAL NETWORK : Growth Rate

Page 15: Social Networks and Data Mining

Technology:

What is your network?
- When your connections invite their connections, your network starts to grow.
- Your network is your connections, their connections, and so on out from you at the center.

How do you classify users?
- Your network contains professionals out to "three degrees", that is, friends-of-friends-of-friends. If each person had 10 connections (and some have many more), then your network would contain up to 10 + 100 + 1,000 = 1,110 professionals (fewer if connections overlap).

How do you see who is in your network?
- Facebook lets you see your network as one large group of searchable professional profiles.
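One way to find everybody within three degrees is a breadth-first search from yourself. The sketch below assumes the network is stored as a dictionary of friend sets; the function name `within_degrees` and the sample chain of people are hypothetical.

```python
from collections import deque

def within_degrees(network, start, max_degree=3):
    """Breadth-first search: return {person: degree} for everyone reachable
    from `start` in at most `max_degree` hops (excluding `start`)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_degree:
            continue  # do not expand beyond the outermost degree
        for friend in network.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]
    return distance

# Hypothetical chain of connections: me - a - b - c - d
network = {
    "me": {"a"}, "a": {"me", "b"}, "b": {"a", "c"},
    "c": {"b", "d"}, "d": {"c"},
}
reach = within_degrees(network, "me")
print(reach)
```

In this chain, `d` is four hops away and therefore falls outside the three-degree network.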

Page 16: Social Networks and Data Mining

SOCIAL NETWORK: Visualization

[Figure: ME at the center, linked to several FRIEND nodes]

Page 17: Social Networks and Data Mining

ON ANY SOCIAL NETWORK

Profile of YOU and of each FRIEND:
- Name
- Gender
- Age
- Birth date / Home town
- School attended
- Interests / Hobbies
- Photos
- Friends
- Activities
- Audio clips
- Video clips

Page 18: Social Networks and Data Mining

ON ANY SOCIAL NETWORK

YOU and each FRIEND have a profile (name, gender, age, birth date / home town, school attended, interests / hobbies, photos, friends, activities, audio clips, video clips).

After making a friend, I am able to access his/her friends, audio clips and video clips, and to share information. A friend may be from any remote site.

Page 19: Social Networks and Data Mining

SOCIAL NETWORK : Growth Rate

Page 20: Social Networks and Data Mining

SOCIAL NETWORK: Visualization

Between friends: How many of them?
- Male vs. Female
- Young vs. Old
- Thin vs. Fat

Page 21: Social Networks and Data Mining

SOCIAL NETWORK: Visualization

Between friends: Relationships
- Thick Friends
- Just Friends

Page 22: Social Networks and Data Mining

SOCIAL NETWORK: Visualization

Between friends: Likes

[Figure: two groups of friends, labeled Coffee and Chocolate]

HOW MANY OF MADHURI DIXIT’S FRIENDS LIKE [item shown in figure]?
HOW MANY OF PRASHANT DAMLE’S FRIENDS LIKE [item shown in figure]?

Page 23: Social Networks and Data Mining

FRIENDS OF A FRIENDS OF A FRIEND SHOULD KNOW

- How many friends use a social network regularly?
- How many friends send messages frequently?
- What is the mood of your friend list?
- How many friends are vegetarian?
- How many friends are close to or far from you?
- How many friends studied or are studying in your school?

Page 24: Social Networks and Data Mining

FRIENDS OF A FRIENDS OF A FRIEND SHOULD KNOW

INTERESTING PATTERNS FROM UNKNOWN DATA

Page 25: Social Networks and Data Mining
Page 26: Social Networks and Data Mining

DEFINE DATA MINING

Data Mining is:

The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

Page 27: Social Networks and Data Mining

THUS: DATA MINING

- Methods for exploring and modeling relationships in large amounts of data
- Finding hidden information in a database
- Fitting data to a model

Page 28: Social Networks and Data Mining
Page 29: Social Networks and Data Mining

Data Mining Process

Understand the Domain
- Understand particulars of the business or scientific problems

Create a Data set
- Understand structure, size, and format of data
- Select the interesting attributes
- Data cleaning and preprocessing

Page 30: Social Networks and Data Mining

Data Mining Process

Choose the data mining task and the specific algorithm
- Understand capabilities and limitations of algorithms that may be relevant to the problem

Interpret the results, and possibly return to bullet 2 (Create a Data set)

Page 31: Social Networks and Data Mining

EXAMPLE

- Understand social networks.
- Grow connections.
- Choose appropriate built-in methods to find hidden information.

Page 32: Social Networks and Data Mining

Example: E-mail Communication

- A sends an e-mail to B, with Cc to C and Bcc to D
- C forwards this e-mail to E

From analyzing the header, we can infer:
- A and D know that A, B, C and D know about this e-mail
- B and C know that A, B and C know about this e-mail
- C also knows that E knows about this e-mail
- D also knows that B and C do not know that it knows about this e-mail; and that A knows this fact
- E knows that A, B and C exchanged this e-mail; and that neither A nor B knows that it knows about it
- and so on and so forth …

[Figure: nodes A, B, C, D and E]
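The first-level inferences above (who can tell that whom received the e-mail) can be sketched as follows. The function `header_views` is hypothetical, and the sketch assumes that To and Cc are visible to every recipient, that a Bcc recipient is visible only to the sender and to itself, and that a forwarded message carries the original To/Cc header.

```python
def header_views(sender, to, cc, bcc):
    """For each party, the set of parties it can tell received the e-mail."""
    public = {sender, *to, *cc}
    views = {party: set(public) for party in public}
    views[sender] = public | set(bcc)      # the sender knows every recipient
    for hidden in bcc:
        views[hidden] = public | {hidden}  # a Bcc recipient sees To/Cc and itself
    return views

views = header_views(sender="A", to=["B"], cc=["C"], bcc=["D"])

# C forwards the message to E (assumed to include the original header).
views["E"] = views["C"] | {"E"}
views["C"] = views["C"] | {"E"}  # C also knows that E received it

for party in sorted(views):
    print(party, sorted(views[party]))
```

Higher-order statements such as "D knows that B does not know" need a model of what each party knows about the others, which this sketch does not attempt.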

Page 33: Social Networks and Data Mining
Page 34: Social Networks and Data Mining

DB VS DM PROCESSING

          Database                       Data Mining
Query     Well defined; SQL              Poorly defined; no precise query language
Data      Operational data               Not operational data
Output    Precise; subset of database    Fuzzy; not a subset of database

Page 35: Social Networks and Data Mining

QUERY EXAMPLES

Database
- Find all customers who have purchased milk.
- Find all credit applicants with first name of Sane.
- Identify customers who have purchased more than Rs.10,000 in the last month.

Data Mining
- Find all items which are frequently purchased with milk. (association rules)
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
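The difference between the two kinds of query can be illustrated on the milk example. The sketch below uses made-up baskets and an assumed threshold `min_support`; counting co-occurrences is only the first step of association rule mining, not a full algorithm.

```python
from collections import Counter

# Hypothetical purchase records: customer -> items bought together.
baskets = {
    "c1": {"milk", "bread", "butter"},
    "c2": {"milk", "bread"},
    "c3": {"tea", "sugar"},
    "c4": {"milk", "sugar", "bread"},
}

# Database-style query: well defined, and the answer is a subset of the stored data.
milk_buyers = sorted(c for c, items in baskets.items() if "milk" in items)

# Mining-style question: which items occur together with milk often enough?
co_counts = Counter(
    item
    for items in baskets.values() if "milk" in items
    for item in items if item != "milk"
)
min_support = 2  # assumed threshold for "frequently"
frequent_with_milk = sorted(i for i, n in co_counts.items() if n >= min_support)

print(milk_buyers)
print(frequent_with_milk)
```

The first answer is a list of stored customers; the second is a pattern that is not stored anywhere in the data and depends on the chosen threshold.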

Page 36: Social Networks and Data Mining
Page 37: Social Networks and Data Mining

ARE ALL THE ‘DISCOVERED’ PATTERNS INTERESTING?

Interestingness measures:

A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of purity, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.

Page 38: Social Networks and Data Mining

DATA MINING DEVELOPMENT

- Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
- Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
- Neural Networks, Decision Tree Algorithms
- Algorithm Design Techniques, Algorithm Analysis, Data Structures
- Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques

Page 39: Social Networks and Data Mining
Page 40: Social Networks and Data Mining

RELATION (r)

$D_1, D_2, \ldots, D_n$ are domains.

Relation $r$ is a subset of the Cartesian product $D_1 \times D_2 \times \cdots \times D_n$:

$r \subseteq D_1 \times D_2 \times \cdots \times D_n$

Page 41: Social Networks and Data Mining

EXAMPLE: r

$D_1 = \{\text{Ram}, \text{Shyam}\}$, $D_2 = \{24, 34\}$

$D_1 \times D_2 = \{(\text{Ram}, 24), (\text{Ram}, 34), (\text{Shyam}, 24), (\text{Shyam}, 34)\}$

$r$ is a subset of $D_1 \times D_2$:

$r = \{(\text{Ram}, 24), (\text{Shyam}, 34)\}$
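The example can be checked in Python with `itertools.product`. The variable names are chosen for this illustration only.

```python
from itertools import product

D1 = {"Ram", "Shyam"}
D2 = {24, 34}

# Cartesian product D1 x D2: every (name, age) combination.
cartesian = set(product(D1, D2))

# The relation r keeps only some of those combinations.
r = {("Ram", 24), ("Shyam", 34)}

print(len(cartesian))   # number of pairs in D1 x D2
print(r <= cartesian)   # True when r is a subset of D1 x D2
```

Each element of `r` is one tuple (row) of the relation.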

Page 42: Social Networks and Data Mining

RELATION is TABLE

Employee

NAME    AGE
Ram     24
Shyam   34

Page 43: Social Networks and Data Mining

TUPLES OR ROWS: t

An instance of the relation is a tuple or row.

Notation:

$t = \langle (a(1), a(2), a(3), \ldots, a(n)) : a(i) \in A(i);\ i \in N \rangle$

Example: $t = \langle (\text{Ram}, 24) \rangle$

Page 44: Social Networks and Data Mining

RELATION (r)

R:
A_1    A_2    A_3    …    A_k    …    A_n
a_11   a_21   a_31   …    a_k1   …    a_n1
a_12   a_22   a_32   …    a_k2   …    a_n2
…      …      …           …           …
a_1i   a_2i   a_3i   …    a_ki   …    a_ni    (tuple t)
…      …      …           …           …
a_1m   a_2m   a_3m   …    a_km   …    a_nm

a_ki is the value of the k-th attribute of R in the i-th tuple t.

Page 45: Social Networks and Data Mining

WHAT IS DATA WAREHOUSE?

Subject-oriented: customers, patients, students, products, time.

Integrated: gathered CENTRALLY from
1. several internal systems of records
2. sources external to the organization

Page 46: Social Networks and Data Mining

WHAT IS DATA WAREHOUSE?

Time-variant: used to study trends and changes.

Non-updatable: cannot be updated by end users.

Page 47: Social Networks and Data Mining

BIG PICTURE

Page 48: Social Networks and Data Mining
Page 49: Social Networks and Data Mining

The ETL Process

ETL = Extract, Transform, and Load

- Capture
- Scrub or data cleansing
- Transform
- Load and Index

Page 50: Social Networks and Data Mining

Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

- Static extract = capturing a snapshot of the source data at a point in time
- Incremental extract = capturing changes that have occurred since the last static extract

Page 51: Social Networks and Data Mining

Steps in data reconciliation

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

- Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
- Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Page 52: Social Networks and Data Mining

Steps in data reconciliation

Transform = convert data from format of operational system to format of data warehouse

Record-level:
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization

Field-level:
- Single-field: from one field to one field
- Multi-field: from many fields to one, or one field to many
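The record-level and field-level transformations can be sketched on a toy example. All records, field names and the helper `full_name` below are invented for illustration; a real ETL tool would do the same work with its own operators.

```python
from collections import defaultdict

# Hypothetical operational records.
orders = [
    {"order_id": 1, "cust_id": "c1", "amount": 400, "status": "paid"},
    {"order_id": 2, "cust_id": "c2", "amount": 250, "status": "cancelled"},
    {"order_id": 3, "cust_id": "c1", "amount": 600, "status": "paid"},
]
customers = {
    "c1": {"first": "Ram", "last": "Sane"},
    "c2": {"first": "Shyam", "last": "Rao"},
}

def full_name(customer):
    """Multi-field transform: two source fields become one target field."""
    return f"{customer['first']} {customer['last']}"

# Selection: keep only the records of interest.
paid = [o for o in orders if o["status"] == "paid"]

# Joining: combine each order with its customer record.
joined = [{**o, "customer": full_name(customers[o["cust_id"]])} for o in paid]

# Aggregation: summarize the amounts per customer.
totals = defaultdict(int)
for row in joined:
    totals[row["customer"]] += row["amount"]
totals = dict(totals)

print(totals)
```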

Page 53: Social Networks and Data Mining

Steps in data reconciliation

Load/Index = place transformed data into the warehouse and create indexes

- Refresh mode: bulk rewriting of target data at periodic intervals
- Update mode: only changes in source data are written to data warehouse
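The two load modes differ in how much is written. A minimal sketch, assuming the warehouse table is a dictionary keyed by `id`; the functions `refresh_load` and `update_load` and the sample rows are hypothetical.

```python
def refresh_load(source_rows):
    """Refresh mode: rebuild the whole target from the full source."""
    return {row["id"]: row for row in source_rows}

def update_load(warehouse, changed_rows):
    """Update mode: write only the changed source rows into the target."""
    for row in changed_rows:
        warehouse[row["id"]] = row
    return warehouse

source = [{"id": 1, "city": "Mumbai"}, {"id": 2, "city": "Pune"}]
warehouse = refresh_load(source)

changes = [{"id": 2, "city": "Nagpur"}, {"id": 3, "city": "Nashik"}]
warehouse = update_load(warehouse, changes)

# Index: built after loading, here a lookup from city to row ids.
city_index = {}
for key, row in warehouse.items():
    city_index.setdefault(row["city"], []).append(key)

print(city_index)
```

Refresh mode is simpler but rewrites everything; update mode touches only what changed and so depends on a reliable way to detect changes in the source.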

Page 54: Social Networks and Data Mining
Page 55: Social Networks and Data Mining

DIRTY DATA

Data in the real world is dirty:

– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

– noisy: containing errors or outliers

– inconsistent: containing discrepancies in codes or names

Page 56: Social Networks and Data Mining

WHY DATA PREPROCESSING?

No quality data, no quality mining results!

Quality decisions must be based on quality data

Data warehouse needs consistent integration of quality data

Required for Data Mining!

Page 57: Social Networks and Data Mining

Why can Data be Incomplete?

Attributes of interest are not available (e.g., customer information for sales transaction data)

Data were not considered important at the time of transactions, so they were not recorded!

Page 58: Social Networks and Data Mining

Why can Data be Incomplete?

Data not recorded because of misunderstanding or malfunctions

Data may have been recorded and later deleted!

Missing/unknown values for some data

Page 59: Social Networks and Data Mining

Why can Data be Noisy / Inconsistent ?

Faulty instruments for data collection

Human or computer errors

Errors in data transmission

Technology limitations (e.g., sensor data come at a faster rate than they can be processed)

Page 60: Social Networks and Data Mining

Why can Data be Noisy / Inconsistent ?

Inconsistencies in naming conventions or data codes (e.g., 2/5/2002 could be 2 May 2002 or 5 Feb 2002)

Duplicate tuples (records that were received twice) should also be removed

Page 61: Social Networks and Data Mining
Page 62: Social Networks and Data Mining

Major Tasks in Data Preprocessing

Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
- Integration of multiple databases or files

Data transformation
- Normalization and aggregation

outliers = exceptions!

Page 63: Social Networks and Data Mining

Major Tasks in Data Preprocessing

Data reduction
- Obtains a representation that is reduced in volume but produces the same or similar analytical results

Data discretization
- Part of data reduction, but of particular importance, especially for numerical data

Page 64: Social Networks and Data Mining

Forms of data preprocessing

Page 65: Social Networks and Data Mining
Page 66: Social Networks and Data Mining

DATA CLEANING

Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

Page 67: Social Networks and Data Mining

HOW TO HANDLE MISSING DATA?

- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious + infeasible?

Page 68: Social Networks and Data Mining

HOW TO HANDLE MISSING DATA?

- Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
- Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree
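The first three filling strategies can be sketched as below. The sample rows, the `label` field and the function names are assumptions made for this illustration; the inference-based strategy (Bayesian formula or decision tree) is not shown.

```python
from statistics import mean

# Hypothetical samples; None marks a missing income, "label" is the class.
rows = [
    {"income": 20000, "label": "low"},
    {"income": None,  "label": "low"},
    {"income": 30000, "label": "low"},
    {"income": 80000, "label": "high"},
    {"income": None,  "label": "high"},
]

def fill_constant(rows, value="unknown"):
    """Replace every missing income by a global constant."""
    return [r["income"] if r["income"] is not None else value for r in rows]

def fill_mean(rows):
    """Replace every missing income by the mean of all known incomes."""
    m = mean(r["income"] for r in rows if r["income"] is not None)
    return [r["income"] if r["income"] is not None else m for r in rows]

def fill_class_mean(rows):
    """Replace every missing income by the mean income of the same class."""
    by_class = {}
    for r in rows:
        if r["income"] is not None:
            by_class.setdefault(r["label"], []).append(r["income"])
    means = {label: mean(values) for label, values in by_class.items()}
    return [r["income"] if r["income"] is not None else means[r["label"]] for r in rows]

print(fill_constant(rows))
print(fill_mean(rows))
print(fill_class_mean(rows))
```

The class mean gives different fills for the "low" and "high" groups, which is why the slide calls it the smarter choice.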

Page 69: Social Networks and Data Mining

HOW TO HANDLE MISSING DATA?

Age   Income   Team      Gender
23    24,200   Red Sox   M
39    ?        Yankees   F
45    45,390   ?         F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on the global value distribution.
- E.g., put the average income here, or put the most probable income based on the fact that the person is 39 years old
- E.g., put the most frequent team here

Page 70: Social Networks and Data Mining
Page 71: Social Networks and Data Mining

HOW TO HANDLE NOISY DATA? Discretization

The process of partitioning continuous variables into categories is called discretization.

Page 72: Social Networks and Data Mining

HOW TO HANDLE NOISY DATA? Discretization: Smoothing techniques

Binning method:
- first sort data and partition into (equi-depth) bins
- then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Clustering
- detect and remove outliers

Page 73: Social Networks and Data Mining

HOW TO HANDLE NOISY DATA? Discretization: Smoothing techniques

Combined computer and human inspection
- computer detects suspicious values, which are then checked by humans

Regression
- smooth by fitting the data into regression functions

Page 74: Social Networks and Data Mining

SIMPLE DISCRETISATION METHODS: BINNING

Equal-width (distance) partitioning:
- It divides the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of intervals will be W = (B-A)/N.
- The most straightforward
- But outliers may dominate presentation
- Skewed data is not handled well.
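The width formula W = (B-A)/N can be turned into a bin lookup. This is a sketch only: the function name `equal_width_bin` is hypothetical, the age values are the ones used in the binning examples of this lecture, and N = 3 is chosen arbitrarily. It assumes B > A.

```python
def equal_width_bin(value, low, high, n_bins):
    """0-based index of the equal-width bin that contains `value`,
    using W = (B - A) / N with A = low and B = high."""
    width = (high - low) / n_bins
    index = int((value - low) // width)
    return min(index, n_bins - 1)  # the highest value belongs to the last bin

ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
A, B, N = min(ages), max(ages), 3
bins = [equal_width_bin(v, A, B, N) for v in ages]

print((B - A) / N)
print(bins)
```

A single very large age would stretch B and therefore W, pushing most values into the first bin; this is the outlier problem noted above.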

Page 75: Social Networks and Data Mining

SIMPLE DISCRETISATION METHODS: BINNING

Equal-depth (frequency) partitioning:
- It divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling; good handling of skewed data

Page 76: Social Networks and Data Mining

BINNING: EXAMPLE

Binning is applied to each individual feature (attribute).

A set of values can then be discretized by replacing each value in the bin by the bin mean, bin median, or bin boundaries.

Example: Set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28

Page 77: Social Networks and Data Mining

EXAMPLE: EQUI-WIDTH BINNING

Example: Set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28
Take bin width = 10

Bin #   Bin Elements        Bin Boundaries
1       {0, 4}              (-∞, 10)
2       {12, 16, 16, 18}    [10, 20)
3       {23, 26, 28}        [20, +∞)
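The table can be reproduced with a few lines of Python. The function name `equi_width_bins` is hypothetical, and the sketch assumes bins aligned at multiples of the width, starting from 0.

```python
from collections import defaultdict

def equi_width_bins(values, width):
    """Group values into bins [k*width, (k+1)*width), keyed by lower boundary."""
    bins = defaultdict(list)
    for v in sorted(values):
        bins[(v // width) * width].append(v)
    return dict(bins)

ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
width = 10
bins = equi_width_bins(ages, width)

for lower, members in bins.items():
    print(f"[{lower}, {lower + width}): {members}")
```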

Page 78: Social Networks and Data Mining

EXAMPLE: EQUI-DEPTH BINNING

Example: Set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28
Take bin depth = 3

Bin #   Bin Elements     Bin Boundaries
1       {0, 4, 12}       (-∞, 14)
2       {16, 16, 18}     [14, 21)
3       {23, 26, 28}     [21, +∞)
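Equi-depth binning can be sketched by slicing the sorted values into groups of equal size. The function names are hypothetical, and placing each boundary halfway between neighbouring bins is an assumption; it gives 14 and 20.5, which is consistent with the 14 and (rounded) 21 shown in the table.

```python
def equi_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` elements."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def cut_points(bins):
    """Assumed convention: boundary halfway between the last value of one bin
    and the first value of the next bin."""
    return [(left[-1] + right[0]) / 2 for left, right in zip(bins, bins[1:])]

ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
bins = equi_depth_bins(ages, depth=3)
cuts = cut_points(bins)

print(bins)
print(cuts)
```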

Page 79: Social Networks and Data Mining

SMOOTHING USING BINNING METHODS

Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries: [4,15], [21,25], [26,34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
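The two smoothing rules can be checked with a short Python sketch. The function names are hypothetical; bin means are rounded to the nearest integer, and a value exactly halfway between the two boundaries is assigned to the lower one (an assumption, since no such tie occurs in this data).

```python
def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth, depth 4

by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
print(by_means)
print(by_bounds)
```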

Page 80: Social Networks and Data Mining

SIMPLE DISCRETISATION METHODS: BINNING

Example: customer ages

[Figure: histograms of the number of values per bin. Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80. Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80.]

Page 81: Social Networks and Data Mining

THANK YOU!

Any Questions?

SUSHIL KULKARNI

[email protected]@yahoo.co.in