Social Networks and Data Mining

SUSHIL KULKARNI, JAI-HIND COLLEGE


- Social Networks: Example
- Technology used
- What is Data Mining?
- DM Process & Example
- DM Queries
- DM Tasks and Methods
- Relation & Data Warehouse
- What is ETL?
- Data Preprocessing


What is a Network?

[Figure: a network of nodes connected by links]

Web Definition: A set of nodes, points, or locations connected by means of data, voice, and video communications for the purpose of exchange.

Social Networks

A social network is a social structure of people, related (directly or indirectly) to each other through a common relation or interest.

Social Network Analysis

Social network analysis (SNA) is the mapping and measuring of relationships and flows between people, groups, organizations, computers, or other information/knowledge processing entities.

The nodes in the network are the people and groups, while the links show relationships or flows between the nodes.
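A minimal Python sketch of the node-and-link view described above. The names and relationships are made up for illustration and are not taken from the slides; counting links per node (degree) is just one simple SNA measure.

```python
from collections import defaultdict

# Hypothetical relationships (links) between people (nodes).
links = [("Asha", "Bina"), ("Asha", "Chetan"), ("Bina", "Chetan"), ("Chetan", "Dev")]

# Build an undirected adjacency map: node -> set of directly linked nodes.
network = defaultdict(set)
for a, b in links:
    network[a].add(b)
    network[b].add(a)

# Degree = number of direct relationships a node has.
degree = {node: len(neighbours) for node, neighbours in network.items()}
most_connected = max(degree, key=degree.get)

print(degree)
print("Most connected:", most_connected)
```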

A shift in approach: from ‘synthesis’ to ‘analysis’

Problems
- High cost of manual surveys
- Survey bias: perceptions of individuals may be incorrect
- Logistics: organizations are now spread across several countries

[Figure: Shift in approach. Labels: Synthesis (employee surveys, social network, cognitive networks for A, B, and C); Analysis (electronic communication such as email and web logs, social network, cognitive network)]

Technology

Various technologies that help in creating social networks are:
- Email
- Blogs
- Social networking software such as Orkut, Facebook, Flickr, etc.

SOCIAL NETWORK: Profile & Platforms

[Figure labels: USENET; Social Community]

SOCIAL NETWORK: Growth

SOCIAL NETWORK : Growth Rate

Technology :

What is Your Network?
- When your connections invite their connections, your Network starts to grow.
- Your Network is your connections, their connections, and so on out from you at the center.

How do you classify users?
- Your Network contains professionals out to “three degrees”, that is, friends-of-friends-of-friends. If each person had 10 connections (and some have many more), then the third degree alone would contain up to 10 × 10 × 10 = 1,000 professionals.

How do you see who is in your Network?
- Facebook lets you see your network as one large group of searchable professional profiles.
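The degree-of-separation idea can be sketched with a breadth-first search. This is an illustrative sketch only: the function name `network_within` and the tiny graph are assumptions, and the "10 connections each" figure is an upper bound that assumes no two people share a friend.

```python
from collections import deque

def network_within(graph, start, max_degree):
    """Return {person: degree of separation} for everyone within max_degree links of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if seen[person] == max_degree:
            continue  # do not expand beyond the requested degree
        for friend in graph.get(person, ()):
            if friend not in seen:
                seen[friend] = seen[person] + 1
                queue.append(friend)
    del seen[start]
    return seen

# Upper bound per degree when everyone has 10 connections and nobody overlaps.
per_degree = [10 ** d for d in (1, 2, 3)]
print(per_degree, sum(per_degree))

# Hypothetical chain of acquaintances: "e" is four links away from "me".
graph = {"me": ["a", "b"], "a": ["me", "c"], "b": ["me"],
         "c": ["a", "d"], "d": ["c", "e"], "e": ["d"]}
print(network_within(graph, "me", 3))
```

In a real network the counts are smaller than the upper bound, because friends of friends overlap heavily.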

SOCIAL NETWORK: Visualization

[Figure: ME at the center, connected to several FRIEND nodes]

On any social network, a profile (YOU or a FRIEND) holds:
- Name
- Gender
- Age
- Birth date / Home town
- School attended
- Interests / Hobbies
- Photos
- Friends
- Activities
- Audio clips
- Video clips

After adding a friend, you can access his or her friends, audio clips, and videos, and share information. A friend may be at any remote site.


SOCIAL NETWORK: Visualization between friends: How many of them?

Male vs. Female
Young vs. Old
Thin vs. Fat

SOCIAL NETWORK: Visualization between friends: Relationships

Thick Friends
Just Friends

SOCIAL NETWORK: Visualization between friends: Likes

Coffee Friends
Chocolate Friends

HOW MANY OF MADHURI DIXIT’S FRIENDS LIKE ?
HOW MANY OF PRASHANT DAMLE’S FRIENDS LIKE ?

FRIENDS OF FRIENDS OF A FRIEND SHOULD KNOW

- How many friends use a social network regularly?
- How many friends send messages frequently?
- What is the mood of your friend list?
- How many friends are vegetarian?
- How many friends are closest to or farthest from you?
- How many friends studied or are studying in your school?

INTERESTING PATTERNS FROM UNKNOWN DATA

DEFINE DATA MINING

Data Mining is:

The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

Methods for exploring and modeling relationships in large amounts of data

Finding hidden information in a database

Fitting data to a model

THUS: DATA MINING

Data Mining Process

1. Understand the Domain
- Understand the particulars of the business or scientific problem

2. Create a Data Set
- Understand the structure, size, and format of the data
- Select the interesting attributes
- Clean and preprocess the data

3. Choose the data mining task and the specific algorithm
- Understand the capabilities and limitations of the algorithms that may be relevant to the problem

4. Interpret the results, and possibly return to step 2

EXAMPLE

Understand social networks.

Grow connections.

Choose appropriate built-in methods to find hidden information.

Example: E-mail Communication

A sends an e-mail to B, with Cc to C and Bcc to D.
C forwards this e-mail to E.

From analyzing the header, we can infer:
- A and D know that A, B, C and D know about this e-mail
- B and C know that A, B and C know about this e-mail
- C also knows that E knows about this e-mail
- D also knows that B and C do not know that it knows about this e-mail; and that A knows this fact
- E knows that A, B and C exchanged this e-mail; and that neither A nor B knows that it knows about it
- and so on and so forth …

[Figure: nodes A, B, C, D and E]
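The first two inferences above can be sketched in Python. The `Email` class and the helper names are hypothetical, chosen only for illustration; the sketch models just who can see whom in a header, assuming a Bcc recipient sees the public header plus itself.

```python
from dataclasses import dataclass, field

@dataclass
class Email:
    sender: str
    to: list
    cc: list = field(default_factory=list)
    bcc: list = field(default_factory=list)

def edges(msg):
    """Links of the communication network: (sender, recipient, header field)."""
    fields = (("to", msg.to), ("cc", msg.cc), ("bcc", msg.bcc))
    return [(msg.sender, r, kind) for kind, recipients in fields for r in recipients]

def visible_participants(msg, viewer):
    """People that `viewer` knows to be aware of the message."""
    public = {msg.sender, *msg.to, *msg.cc}
    if viewer == msg.sender:
        return public | set(msg.bcc)      # the sender sees every recipient
    if viewer in msg.bcc:
        return public | {viewer}          # a Bcc recipient sees the public header
    if viewer in public:
        return public                     # To/Cc recipients cannot see Bcc
    return set()

original = Email(sender="A", to=["B"], cc=["C"], bcc=["D"])
forward = Email(sender="C", to=["E"])

print(edges(original) + edges(forward))
for person in "ABCD":
    print(person, sorted(visible_participants(original, person)))
```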

DB VS DM PROCESSING

Database processing
- Query: well defined; SQL
- Data: operational data
- Output: precise; subset of database

Data mining processing
- Query: poorly defined; no precise query language
- Data: not operational data
- Output: fuzzy; not a subset of database

QUERY EXAMPLES

Database
- Find all customers who have purchased milk.
- Find all credit applicants with first name of Sane.
- Identify customers who have purchased more than Rs. 10,000 in the last month.

Data Mining
- Find all items which are frequently purchased with milk (association rules).
- Find all credit applicants who are poor credit risks (classification).
- Identify customers with similar buying habits (clustering).
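To make the contrast concrete, here is a small Python sketch of the two milk queries. The basket data and the 0.6 confidence threshold are assumptions for illustration; the co-occurrence count is a simplified stand-in for a real association rule algorithm.

```python
from collections import Counter

# Hypothetical market-basket data: customer -> items bought in one transaction.
transactions = {
    "c1": {"milk", "bread", "eggs"},
    "c2": {"milk", "bread"},
    "c3": {"tea", "sugar"},
    "c4": {"milk", "bread", "tea"},
}

# Database-style query: a precise condition; the answer is a subset of the stored data.
milk_buyers = sorted(c for c, items in transactions.items() if "milk" in items)

# Data-mining-style question: which items co-occur with milk often enough to matter?
milk_baskets = [items for items in transactions.values() if "milk" in items]
co_counts = Counter(item for items in milk_baskets for item in items if item != "milk")

min_confidence = 0.6  # assumed threshold, not from the slides
# confidence(milk -> item) = baskets with milk and item / baskets with milk
rules = {item: n / len(milk_baskets)
         for item, n in co_counts.items()
         if n / len(milk_baskets) >= min_confidence}

print(milk_buyers)
print(rules)
```

The first result is a list of stored customers; the second is a derived pattern that is not itself a row in the data.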

ARE ALL THE ‘DISCOVERED’ PATTERNS INTERESTING?

Interestingness measures:

A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.

DATA MINING DEVELOPMENT

- Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
- Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
- Neural Networks, Decision Tree Algorithms
- Algorithm Design Techniques, Algorithm Analysis, Data Structures
- Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques

RELATION (r)

D1, D2, …, Dn are domains.

A relation r is a subset of the Cartesian product D1 × D2 × … × Dn:

r ⊆ D1 × D2 × … × Dn

EXAMPLE: r

D1 = {Ram, Shyam}, D2 = {24, 34}

D1 × D2 = {(Ram, 24), (Ram, 34), (Shyam, 24), (Shyam, 34)}

r is a subset of D1 × D2:

r = {(Ram, 24), (Shyam, 34)}
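The same example can be checked in Python, using `itertools.product` for the Cartesian product and set inclusion for the subset test. This is only an illustrative sketch of the definition.

```python
from itertools import product

D1 = ["Ram", "Shyam"]
D2 = [24, 34]

# Cartesian product D1 x D2: every (name, age) combination.
cartesian = set(product(D1, D2))

# A relation is any subset of the Cartesian product.
r = {("Ram", 24), ("Shyam", 34)}

print(sorted(cartesian))
print("r is a relation over D1 x D2:", r <= cartesian)
```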

RELATION is TABLE

Employee

NAME    AGE
Ram     24
Shyam   34

TUPLES OR ROWS: t

An instance of the relation is a tuple or row.

Notation:

t = < (a(1), a(2), a(3), …, a(n)) : a(i) ∈ A(i), i ∈ N >

Example: t = < (Ram, 24) >

RELATION (r)

R:   A1    A2    A3    …   Ak    …   An
     a11   a21   a31   …   ak1   …   an1
     a12   a22   a32   …   ak2   …   an2
     …     …     …         …         …
t:   a1i   a2i   a3i   …   aki   …   ani
     …     …     …         …         …
     a1m   a2m   a3m   …   akm   …   anm

aki is the value of the k-th attribute of R in the i-th tuple t.

WHAT IS DATA WAREHOUSE?

Subject-oriented: customers, patients, students, products, time.

Integrated: gathered centrally from
1. several internal systems of record
2. sources external to the organization

Time-variant: used to study trends and changes.

Non-updatable: cannot be updated by end users.

BIG PICTURE

The ETL Process

Capture

Scrub or data cleansing

Transform

Load and Index

ETL = Extract, Transform, and Load

Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
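A deliberately simple Python sketch of the scrub step. It uses plain rules (trimming, date validation, duplicate detection on `id`) rather than the pattern recognition and AI techniques mentioned above; the record layout and the `scrub` function are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical raw records captured from a source system.
raw = [
    {"id": 1, "name": " ram ", "joined": "2002-05-02"},
    {"id": 1, "name": "Ram", "joined": "2002-05-02"},     # duplicate of id 1
    {"id": 2, "name": "SHYAM", "joined": "2002-13-45"},   # erroneous date
    {"id": 3, "name": "", "joined": "2003-01-15"},        # missing name
]

def scrub(records):
    """Return (clean records, error log) after simple rule-based cleansing."""
    clean, errors, seen_ids = [], [], set()
    for rec in records:
        if rec["id"] in seen_ids:
            errors.append((rec["id"], "duplicate"))
            continue
        seen_ids.add(rec["id"])
        name = rec["name"].strip().title() or None    # reformat; empty -> missing
        try:
            joined = datetime.strptime(rec["joined"], "%Y-%m-%d").date()
        except ValueError:
            joined = None
            errors.append((rec["id"], "bad date"))
        if name is None:
            errors.append((rec["id"], "missing name"))
        clean.append({"id": rec["id"], "name": name, "joined": joined})
    return clean, errors

clean, errors = scrub(raw)
print(clean)
print(errors)
```

Logging the problems instead of silently dropping rows keeps the error detection/logging step visible to whoever maintains the source system.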


Transform = convert data from format of operational system to format of data warehouse

Record-level:
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization

Field-level:
- Single-field: from one field to one field
- Multi-field: from many fields to one, or one field to many
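The record-level and field-level functions can be illustrated with plain Python collections. The sales and customer records below are invented for the example.

```python
from collections import defaultdict

# Hypothetical operational data.
sales = [
    {"cust_id": 1, "region": "West", "amount": 1200},
    {"cust_id": 2, "region": "East", "amount": 800},
    {"cust_id": 1, "region": "West", "amount": 300},
]
customers = {1: {"first": "Ram", "last": "Sane"},
             2: {"first": "Shyam", "last": "Rao"}}

# Selection (data partitioning): keep only the rows of interest.
west_sales = [s for s in sales if s["region"] == "West"]

# Joining (data combining): attach the customer record to each sale.
joined = [{**s, **customers[s["cust_id"]]} for s in sales]

# Aggregation (data summarization): total amount per customer.
totals = defaultdict(int)
for s in sales:
    totals[s["cust_id"]] += s["amount"]

# Multi-field transformation: many fields (first, last) to one (full name).
full_names = {cid: f'{c["first"]} {c["last"]}' for cid, c in customers.items()}

print(west_sales, dict(totals), full_names, sep="\n")
```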


Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse
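The difference between the two load modes can be sketched with a dictionary standing in for the warehouse table. The function names and data are hypothetical; a real warehouse would use bulk-load utilities and database indexes.

```python
def refresh_load(warehouse, source):
    """Refresh mode: rewrite the whole target from the current source snapshot."""
    warehouse.clear()
    warehouse.update(source)

def update_load(warehouse, changes):
    """Update mode: write only the changed or new rows into the target."""
    warehouse.update(changes)

def build_index(warehouse):
    """A toy index: look up the key by name instead of scanning every row."""
    return {name: key for key, name in warehouse.items()}

source = {1: "Ram", 2: "Shyam"}
warehouse = {}
refresh_load(warehouse, source)

changes = {2: "Shyam S.", 3: "Sita"}   # rows changed or added since the last load
update_load(warehouse, changes)
name_index = build_index(warehouse)

print(warehouse)
print(name_index)
```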


DIRTY DATA

Data in the real world is dirty:

– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

– noisy: containing errors or outliers

– inconsistent: containing discrepancies in codes or names

WHY DATA PREPROCESSING?

No quality data, no quality mining results!

Quality decisions must be based on quality data

Data warehouse needs consistent integration of quality data

Required for Data Mining!

Why can Data be Incomplete?

Attributes of interest are not available (e.g., customer information for sales transaction data)

Data were not considered important at the time of transactions, so they were not recorded!

Data not recorded because of misunderstanding or malfunctions

Data may have been recorded and later deleted!

Missing/unknown values for some data

Why can Data be Noisy / Inconsistent ?

Faulty instruments for data collection

Human or computer errors

Errors in data transmission

Technology limitations (e.g., sensor data come at a faster rate than they can be processed)

Inconsistencies in naming conventions or data codes (e.g., 2/5/2002 could be 2 May 2002 or 5 Feb 2002)

Duplicate tuples, which were received twice, should also be removed
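Both issues are easy to reproduce in Python. The sketch below parses the same string under two date conventions and removes duplicate tuples while keeping the original order; the sample rows are illustrative.

```python
from datetime import datetime

raw = "2/5/2002"
day_first = datetime.strptime(raw, "%d/%m/%Y").date()     # read as day/month/year
month_first = datetime.strptime(raw, "%m/%d/%Y").date()   # read as month/day/year
print(day_first, month_first)

# Duplicate tuples: dict.fromkeys keeps the first occurrence of each tuple.
rows = [("Ram", 24), ("Shyam", 34), ("Ram", 24)]
unique_rows = list(dict.fromkeys(rows))
print(unique_rows)
```

The string alone cannot tell which reading is right; the convention has to come from knowledge of the source system.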

Major Tasks in Data Preprocessing

Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers (outliers = exceptions!), and resolve inconsistencies

Data integration
- Integration of multiple databases or files

Data transformation
- Normalization and aggregation

Data reduction
- Obtains a reduced representation in volume but produces the same or similar analytical results

Data discretization
- Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing

DATA CLEANING

Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

HOW TO HANDLE MISSING DATA?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.

Fill in the missing value manually: tedious and often infeasible.

Use a global constant to fill in the missing value: e.g., “unknown” (a new class?!)

Use the attribute mean to fill in the missing value

Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter

Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree



Age   Income   Team      Gender
23    24,200   Red Sox   M
39    ?        Yankees   F
45    45,390   ?         F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on the global value distribution.
- E.g., put the average income here, or put the most probable income based on the fact that the person is 39 years old
- E.g., put the most frequent team here
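A small Python sketch of the aggregate-function approach on the table above: the missing income is filled with the mean of the known incomes and the missing team with the most frequent known team. With only three rows the two known teams tie, so the "most frequent" team here is simply the first one counted.

```python
from collections import Counter
from statistics import mean

rows = [
    {"age": 23, "income": 24200, "team": "Red Sox", "gender": "M"},
    {"age": 39, "income": None,  "team": "Yankees", "gender": "F"},
    {"age": 45, "income": 45390, "team": None,      "gender": "F"},
]

# Aggregates computed from the known values only.
mean_income = mean(r["income"] for r in rows if r["income"] is not None)
known_teams = [r["team"] for r in rows if r["team"] is not None]
most_frequent_team = Counter(known_teams).most_common(1)[0][0]  # ties: first seen wins

filled = [
    {**r,
     "income": r["income"] if r["income"] is not None else mean_income,
     "team": r["team"] if r["team"] is not None else most_frequent_team}
    for r in rows
]

for r in filled:
    print(r)
```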

HOW TO HANDLE NOISY DATA? Discretization: Smoothing techniques

The process of partitioning continuous variables into categories is called discretization.

Binning method
- first sort the data and partition it into (equi-depth) bins
- then smooth by bin means, bin medians, bin boundaries, etc.

Clustering
- detect and remove outliers

Combined computer and human inspection
- the computer detects suspicious values, which are then checked by humans

Regression
- smooth by fitting the data to regression functions

SIMPLE DISCRETISATION METHODS: BINNING

Equal-width (distance) partitioning:
- It divides the range into N intervals of equal size (uniform grid)
- If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
- The most straightforward method
- But outliers may dominate the presentation
- Skewed data is not handled well

Equal-depth (frequency) partitioning:
- It divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling; good handling of skewed data
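The formula W = (B - A)/N and the outlier problem can be seen in a few lines of Python. The function name `equal_width_bins` is an assumption, the sample ages are illustrative, and the sketch assumes the values are not all equal (otherwise W would be zero).

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins

ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
print(equal_width_bins(ages, 3))

# A single outlier stretches the range, so almost everything lands in one bin.
print(equal_width_bins(ages + [280], 3))
```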

BINNING: EXAMPLE

Binning is applied to each individual feature (attribute).

A set of values can then be discretized by replacing each value in a bin by the bin mean, bin median, or bin boundaries.

Example: Set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28

EXAMPLE: EQUI-WIDTH BINNING

Set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28. Take bin width = 10.

Bin #   Bin Elements        Bin Boundaries
1       {0, 4}              (-∞, 10)
2       {12, 16, 16, 18}    [10, 20)
3       {23, 26, 28}        [20, +∞)

EXAMPLE: EQUI-DEPTH BINNING

Set of values of attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28. Take bin depth = 3.

Bin #   Bin Elements     Bin Boundaries
1       {0, 4, 12}       (-∞, 14)
2       {16, 16, 18}     [14, 21)
3       {23, 26, 28}     [21, +∞)
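The two binning examples above can be reproduced with a short Python sketch. The function names are hypothetical; `fixed_width_bins` takes the bin width directly (10 in the example) and only returns non-empty bins.

```python
ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]

def fixed_width_bins(values, width):
    """Equi-width binning: value v goes to the bin numbered v // width."""
    bins = {}
    for v in sorted(values):
        bins.setdefault(v // width, []).append(v)
    return [bins[k] for k in sorted(bins)]

def equal_depth_bins(values, depth):
    """Equi-depth binning: sort, then cut into runs of `depth` values."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

print(fixed_width_bins(ages, 10))
print(equal_depth_bins(ages, 3))
```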

SMOOTHING USING BINNING METHODS

Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries: [4, 15], [21, 25], [26, 34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
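The smoothing steps can be verified with a few lines of Python. The helper names are hypothetical; bin means are rounded to whole rupees as in the slide, and a value exactly halfway between the two boundaries is assigned to the lower one (an assumption, since no such tie occurs in this data).

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

def smooth_by_means(bin_values):
    """Replace every value by the (rounded) mean of its bin."""
    m = round(sum(bin_values) / len(bin_values))
    return [m] * len(bin_values)

def smooth_by_boundaries(bin_values):
    """Replace every value by the closer of the bin's lowest and highest value."""
    low, high = bin_values[0], bin_values[-1]
    return [low if v - low <= high - v else high for v in bin_values]

by_means = [smooth_by_means(b) for b in bins]
by_boundaries = [smooth_by_boundaries(b) for b in bins]
print(by_means)
print(by_boundaries)
```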

SIMPLE DISCRETISATION METHODS: BINNING

Example: customer ages

[Figure: histograms of the number of values per bin.
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80.
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80.]

THANK YOU!

Any Questions?

SUSHIL KULKARNI

