SUSHIL KULKARNI
JAI-HIND COLLEGE
[email protected]@yahoo.co.in
Social Networks : Example
Technology used
What is Data Mining?
DM Process & Example
DM Queries
DM Tasks and Methods
Relation & Data Warehouse
What is ETL ?
Data Preprocessing
What is a Network?
Web Definition : A set of nodes, points, or locations connected by means of data, voice, and video communications for the purpose of exchange.
[Figure: a network drawn as nodes joined by links]
Social Networks
A social network is a social structure of people, related (directly or indirectly) to each other through a common relation or interest
Social Network Analysis
Social network analysis [SNA] is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.
The nodes in the network are the people and groups, while the links show relationships or flows between the nodes.
A shift in approach: from ‘synthesis’ to ‘analysis’
Problems
• High cost of manual surveys
• Survey bias - Perceptions of individuals may be incorrect
• Logistics - Organizations are now spread across several countries.
[Figure: Shift in approach. Synthesis: employee surveys give the cognitive networks for A, B and C, which are combined into a social network. Analysis: electronic communication (email, web logs) is analysed to obtain the social network and the cognitive network.]
Technology
Various technologies that help in creating social networks are:
- Email
- Blogs
- Social networking software such as Orkut, Facebook, Flickr, etc.
SOCIAL NETWORK: Profile & Platforms
USENET
Social Community
SOCIAL NETWORK: Growth
SOCIAL NETWORK : Growth Rate
Technology :
What is your network?
- When your connections invite their connections, your network starts to grow.
- Your network is your connections, their connections, and so on, out from you at the center.
How do you classify users?
- Your network contains professionals out to "three degrees", that is, friends-of-friends-of-friends. If each person had 10 connections (and some have many more), the third degree alone would contain about 1,000 professionals (10 × 10 × 10), ignoring overlaps.
How do you see who is in your network?
- Facebook lets you see your network as one large group of searchable professional profiles.
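The "three degrees" arithmetic can be checked with a short sketch. It assumes every person has the same number of connections and that nobody shares a connection, so it gives an upper bound rather than a real count. The function name `network_size` is hypothetical.

```python
def network_size(connections_per_person: int, degrees: int) -> list[int]:
    """People reachable at each degree, assuming no shared connections."""
    return [connections_per_person ** d for d in range(1, degrees + 1)]


# 10 connections each, out to friends-of-friends-of-friends
per_degree = network_size(10, 3)
total = sum(per_degree)
print(per_degree, total)
```

With 10 connections per person, the third degree contributes 1,000 people and the three degrees together 1,110.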
SOCIAL NETWORK: Visualization
[Figure: ME at the center, linked to several FRIEND nodes]
ON ANY SOCIAL NETWORK
YOU and your FRIEND each have a profile with:
Name, Gender, Age, Birth date / Home town, School attended, Interests / Hobbies, Photos, Friends, Activities, Audio clips, Video clips
After making a friend, I am able to access his or her friends, audio clips and videos, and to share information. A friend may be at any remote site.
SOCIAL NETWORK : Visualization
Between friends: How many of them ?
Male vs. Female
Young vs. Old
Thin vs. Fat
SOCIAL NETWORK : Visualization
Between friends: Relationships
Thick Friends
Just Friends
SOCIAL NETWORK : Visualization
Between friends: Likes
[Figure: two groups of friends, labelled Coffee and Chocolate]
HOW MANY OF MADHURI DIXIT’S FRIENDS LIKE ?
HOW MANY OF PRASHANT DAMLE’S FRIENDS LIKE ?
FRIENDS OF A FRIENDS OF A FRIEND SHOULD KNOW
- How many friends use a social network regularly?
- How many friends send messages frequently?
- What is the mood of your friend list?
- How many friends are vegetarian?
- How many friends are closest to or far from you?
- How many friends studied or are studying in your school?
INTERESTING PATTERNS FROM UNKNOWN DATA
DEFINE DATA MINING
Data Mining is:
The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
Methods for exploring and modeling relationships in large amounts of data
Finding hidden information in a database
Fitting data to a model
THUS : DATA MINING
Data Mining Process
1. Understand the domain
- Understand the particulars of the business or scientific problem
2. Create a data set
- Understand the structure, size, and format of the data
- Select the interesting attributes
- Clean and preprocess the data
3. Choose the data mining task and the specific algorithm
- Understand the capabilities and limitations of the algorithms that may be relevant to the problem
4. Interpret the results, and possibly return to step 2
EXAMPLE
Understand social networks.
Grow connections.
Choose appropriate built-in methods to find hidden information.
Example : E-mail Communication
A sends an e-mail to B, with Cc to C and Bcc to D.
C forwards this e-mail to E.
From analyzing the header, we can infer:
- A and D know that A, B, C and D know about this e-mail
- B and C know that A, B and C know about this e-mail
- C also knows that E knows about this e-mail
- D also knows that B and C do not know that it knows about this e-mail, and that A knows this fact
- E knows that A, B and C exchanged this e-mail, and that neither A nor B knows that it knows about it
- and so on and so forth …
[Figure: communication graph over the nodes A, B, C, D and E]
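The first level of these inferences (who can see whom in the header) can be sketched as follows. This is a minimal illustration, not a full model of "who knows that who knows". The `Email` class and `visible_parties` function are hypothetical names, and the sketch assumes that a forwarded message quotes the original header as the forwarder saw it.

```python
from dataclasses import dataclass, field


@dataclass
class Email:
    sender: str
    to: list[str]
    cc: list[str] = field(default_factory=list)
    bcc: list[str] = field(default_factory=list)


def visible_parties(mail: Email, reader: str) -> set[str]:
    """People that `reader` can see in the header of `mail`."""
    public = {mail.sender, *mail.to, *mail.cc}
    if reader == mail.sender:
        return public | set(mail.bcc)   # the sender sees every Bcc recipient
    if reader in mail.bcc:
        return public | {reader}        # a Bcc recipient sees only itself extra
    return public                       # To and Cc recipients see no Bcc


original = Email(sender="A", to=["B"], cc=["C"], bcc=["D"])
forwarded = Email(sender="C", to=["E"])

# E sees the forward itself plus the original header as C saw it (assumption)
e_view = visible_parties(forwarded, "E") | visible_parties(original, "C")

for person in "ABCD":
    print(person, sorted(visible_parties(original, person)))
print("E", sorted(e_view))
```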
DB VS DM PROCESSING
Database
- Query: well defined, SQL
- Data: operational data
- Output: precise, a subset of the database
Data Mining
- Query: poorly defined, no precise query language
- Data: not operational data
- Output: fuzzy, not a subset of the database
QUERY EXAMPLES
Database
- Find all customers who have purchased milk.
- Find all credit applicants with first name of Sane.
- Identify customers who have purchased more than Rs.10,000 in the last month.
Data Mining
- Find all items which are frequently purchased with milk. (association rules)
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
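The contrast between the two kinds of query can be sketched on a toy data set. The baskets below and the 50% support threshold are made-up assumptions for illustration; the second part only counts co-occurrences and is not a full association rule algorithm.

```python
from collections import Counter

# Hypothetical transactions: customer id -> items purchased
baskets = {
    "c1": {"milk", "bread", "butter"},
    "c2": {"milk", "bread"},
    "c3": {"tea", "sugar"},
    "c4": {"milk", "bread", "sugar"},
}

# Database-style query: precise, and the answer is a subset of the stored rows
milk_buyers = sorted(c for c, items in baskets.items() if "milk" in items)

# Mining-style question: which items are frequently bought together with milk?
co_counts = Counter(
    item
    for items in baskets.values()
    if "milk" in items
    for item in items - {"milk"}
)
min_support = 0.5  # assumed threshold: fraction of all baskets
frequent_with_milk = {
    item: count / len(baskets)
    for item, count in co_counts.items()
    if count / len(baskets) >= min_support
}

print(milk_buyers)
print(frequent_with_milk)
```

The first answer is a list of existing customers; the second is a derived pattern (an item with its support) that is not stored anywhere in the data.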
ARE ALL THE ‘DISCOVERED’ PATTERNS INTERESTING?
Interestingness measures:
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
DATA MINING DEVELOPMENT
- Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
- Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
- Neural Networks, Decision Tree Algorithms
- Algorithm Design Techniques, Algorithm Analysis, Data Structures
- Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques
RELATION (r)
D1, D2, …, Dn are domains.
A relation r is a subset of the Cartesian product D1 × D2 × … × Dn:
r ⊆ D1 × D2 × … × Dn
EXAMPLE : r
D1 = {Ram, Shyam}, D2 = {24, 34}
D1 × D2 = {(Ram, 24), (Ram, 34), (Shyam, 24), (Shyam, 34)}
r is a subset of D1 × D2
r = {(Ram, 24), (Shyam, 34)}
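The same example can be reproduced with sets as a quick check. This is a small sketch using the standard `itertools.product`; the variable names are chosen for illustration.

```python
from itertools import product

D1 = {"Ram", "Shyam"}
D2 = {24, 34}

# Cartesian product D1 x D2: every (name, age) pair
cartesian = set(product(D1, D2))

# A relation is any subset of the Cartesian product
r = {("Ram", 24), ("Shyam", 34)}
is_relation = r <= cartesian

print(sorted(cartesian))
print(is_relation)
```

A set such as `{("Ram", 44)}` would not be a relation over these domains, because 44 is not in D2.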
RELATION is TABLE
Employee
NAME    AGE
Ram     24
Shyam   34
TUPLES OR ROWS : t
An instance of the relation is a tuple or row.
Notation: t = <(a(1), a(2), a(3), …, a(n)) : a(i) ∈ A(i); i ∈ N>
Example: t = <(Ram, 24)>
RELATION (r)
R:
A1    A2    A3    …   Ak    …   An
a11   a21   a31   …   ak1   …   an1
a12   a22   a32   …   ak2   …   an2
…     …     …     …   …     …   …
a1i   a2i   a3i   …   aki   …   ani     (tuple t)
…     …     …     …   …     …   …
a1m   a2m   a3m   …   akm   …   anm
aki is the value of the k-th attribute of the i-th tuple t of R.
WHAT IS A DATA WAREHOUSE ?
Subject-oriented: customers, patients, students, products, time.
Integrated: gathered CENTRALLY from
1. several internal systems of records
2. sources external to the organization
Time-variant: used to study trends and changes.
Non-updatable: cannot be updated by end users.
BIG PICTURE
The ETL Process
Capture
Scrub or data cleansing
Transform
Load and Index
ETL = Extract, Transform, and Load
Steps in data reconciliation
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
Capture = extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Scrub = cleanse…uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level:
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization
Field-level:
- Single-field: from one field to one field
- Multi-field: from many fields to one, or from one field to many
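The three record-level transformations can be sketched on a few in-memory records. The `orders` and `customers` data, the field names and the 200 threshold are all hypothetical; a real ETL tool would do the same steps against the source systems.

```python
from collections import defaultdict

# Hypothetical operational records
orders = [
    {"order_id": 1, "cust_id": 10, "amount": 250},
    {"order_id": 2, "cust_id": 11, "amount": 900},
    {"order_id": 3, "cust_id": 10, "amount": 150},
]
customers = {10: "Ram", 11: "Shyam"}

# Selection (data partitioning): keep only the rows of interest
large_orders = [o for o in orders if o["amount"] >= 200]

# Joining (data combining): attach the customer name to each order
joined = [{**o, "name": customers[o["cust_id"]]} for o in orders]

# Aggregation (data summarization): total amount per customer
totals = defaultdict(int)
for row in joined:
    totals[row["name"]] += row["amount"]

print(large_orders)
print(dict(totals))
```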
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
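The difference between the two load modes can be sketched with dictionaries standing in for the source and the warehouse table. The function names `refresh_load` and `update_load` are hypothetical, and using `None` to mark a deleted row is an assumption of this sketch.

```python
def refresh_load(source: dict) -> dict:
    """Refresh mode: rewrite the whole target from a full copy of the source."""
    return dict(source)


def update_load(target: dict, changes: dict) -> dict:
    """Update mode: apply only the changed rows; None marks a deleted row."""
    result = dict(target)
    for key, row in changes.items():
        if row is None:
            result.pop(key, None)
        else:
            result[key] = row
    return result


# Initial bulk load, then a later incremental update
warehouse = refresh_load({1: "Ram", 2: "Shyam"})
warehouse = update_load(warehouse, {2: "Shyam K.", 3: "Sita"})
print(warehouse)
```

Refresh mode touches every row on each run, while update mode only writes the rows captured by an incremental extract.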
DIRTY DATA
Data in the real world is dirty:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
WHY DATA PREPROCESSING?
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data
Required for Data Mining!
Why can Data be Incomplete?
Attributes of interest are not available (e.g., customer information for sales transaction data)
Data were not considered important at the time of transactions, so they were not recorded!
Data not recorded because of misunderstanding or malfunctions
Data may have been recorded and later deleted!
Missing/unknown values for some data
Why can Data be Noisy / Inconsistent ?
Faulty instruments for data collection
Human or computer errors
Errors in data transmission
Technology limitations (e.g., sensor data come at a faster rate than they can be processed)
Inconsistencies in naming conventions or data codes (e.g., 2/5/2002 could be 2 May 2002 or 5 Feb 2002)
Duplicate tuples (records that were received twice), which should also be removed
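Both problems are easy to demonstrate. The sketch below parses the same date string under two conventions with the standard `datetime.strptime`, then removes duplicate tuples while keeping the original order. The sample rows are made up.

```python
from datetime import datetime

raw = "2/5/2002"

# The same string means different dates under different conventions
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
print(as_day_first, as_month_first)

# Duplicate tuples: keep the first occurrence of each row, preserve order
rows = [("Ram", 24), ("Shyam", 34), ("Ram", 24)]
deduplicated = list(dict.fromkeys(rows))
print(deduplicated)
```

Without knowing which convention the source system used, neither date can be trusted, which is why the format has to be fixed during cleaning.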
Major Tasks in Data Preprocessing
Data cleaning
- Fill in missing values, smooth noisy data, identify or remove outliers (outliers = exceptions!), and resolve inconsistencies
Data integration
- Integration of multiple databases or files
Data transformation
- Normalization and aggregation
Data reduction
- Obtains a representation that is reduced in volume but produces the same or similar analytical results
Data discretization
- Part of data reduction but with particular importance, especially for numerical data
Forms of data preprocessing
DATA CLEANING
Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
HOW TO HANDLE MISSING DATA?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible.
Use a global constant to fill in the missing value: e.g., “unknown” (effectively a new class).
Use the attribute mean to fill in the missing value.
Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter.
Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree.
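The difference between filling with the overall attribute mean and with the mean of the same class can be sketched as follows. The samples, the class labels "low" and "high", and the use of `None` for a missing income are assumptions made for the illustration.

```python
from statistics import mean

# Hypothetical samples: (class label, income); None marks a missing value
samples = [
    ("low", 20000), ("low", None), ("low", 24000),
    ("high", 60000), ("high", None), ("high", 64000),
]

# Attribute mean over all known values
overall_mean = mean(v for _, v in samples if v is not None)

# Attribute mean per class
labels = {label for label, _ in samples}
class_means = {
    label: mean(v for l, v in samples if l == label and v is not None)
    for label in labels
}

filled_overall = [(l, overall_mean if v is None else v) for l, v in samples]
filled_by_class = [(l, class_means[l] if v is None else v) for l, v in samples]

print(filled_overall)
print(filled_by_class)
```

The overall mean (42,000) fits neither class well, while the class means (22,000 and 62,000) stay close to the known values of each group.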
Age   Income   Team      Gender
23    24,200   Red Sox   M
39    ?        Yankees   F
45    45,390   ?         F
Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on the global value distribution.
E.g., for the missing income, put the average income, or the most probable income given that the person is 39 years old.
E.g., for the missing team, put the most frequent team.
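Applied to the three rows of the table, the aggregate approach looks like the sketch below. Note that with only two known teams the "most frequent team" is a tie, so this sketch falls back to the global constant "unknown" in that case; that fallback is an assumption, not something the table prescribes.

```python
from collections import Counter
from statistics import mean

rows = [
    {"age": 23, "income": 24200, "team": "Red Sox", "gender": "M"},
    {"age": 39, "income": None, "team": "Yankees", "gender": "F"},
    {"age": 45, "income": 45390, "team": None, "gender": "F"},
]

# Average of the known incomes
mean_income = mean(r["income"] for r in rows if r["income"] is not None)

# Most frequent known team, or "unknown" when the top two counts are tied
ranked = Counter(r["team"] for r in rows if r["team"] is not None).most_common()
tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
team_fill = "unknown" if tied else ranked[0][0]

filled = [
    {
        **r,
        "income": mean_income if r["income"] is None else r["income"],
        "team": team_fill if r["team"] is None else r["team"],
    }
    for r in rows
]
print(filled)
```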
The process of partitioning continuous variables into categories is called discretization.
HOW TO HANDLE NOISY DATA? Discretization : Smoothing techniques
Binning method:
- first sort the data and partition it into (equi-depth) bins
- then smooth by bin means, bin medians, bin boundaries, etc.
Clustering
- detect and remove outliers
Combined computer and human inspection
- the computer detects suspicious values, which are then checked by humans
Regression
- smooth by fitting the data to regression functions
Equal-width (distance) partitioning:
- It divides the range into N intervals of equal size: uniform grid
- If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
- The most straightforward method
- But outliers may dominate the presentation
- Skewed data is not handled well
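The formula W = (B - A)/N translates directly into code. The sketch below uses the Age values from the binning example in these slides with N = 3, which is an assumed choice; the function name `equal_width_bins` is hypothetical, and the code assumes the values are not all equal (otherwise W would be zero).

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins          # assumes hi > lo
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The highest value would fall just past the last bin, so clamp it
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins


ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
width, bins = equal_width_bins(ages, 3)
print(width)
print(bins)
```

A single extreme value (say an age of 280) would stretch the range so much that almost all values land in the first bin, which is the outlier problem noted above.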
SIMPLE DISCRETISATION METHODS: BINNING
Equal-depth (frequency) partitioning:
- It divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling and good handling of skewed data
BINNING : EXAMPLE
Binning is applied to each individual feature (attribute).
A set of values can then be discretized by replacing each value in a bin by the bin mean, the bin median, or the nearest bin boundary.
Example: set of values of the attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28
EXAMPLE: EQUI-WIDTH BINNING
Set of values of the attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28. Take bin width = 10.
Bin #   Bin Elements        Bin Boundaries
1       {0, 4}              [-∞, 10)
2       {12, 16, 16, 18}    [10, 20)
3       {23, 26, 28}        [20, +∞)
EXAMPLE: EQUI-DEPTH BINNING
Set of values of the attribute Age: 0, 4, 12, 16, 16, 18, 23, 26, 28. Take bin depth = 3.
Bin #   Bin Elements      Bin Boundaries
1       {0, 4, 12}        [-∞, 14)
2       {16, 16, 18}      [14, 21)
3       {23, 26, 28}      [21, +∞)
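The equi-depth table can be reproduced with a few lines. The sketch sorts the values, cuts them into consecutive groups of the given depth, and places each boundary halfway between neighbouring bins; the midpoint rule is an assumption that matches the table's 14 and, after rounding 20.5 up, its 21.

```python
def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


ages = [0, 4, 12, 16, 16, 18, 23, 26, 28]
bins = equal_depth_bins(ages, 3)

# Boundary between two neighbouring bins: midpoint of the adjacent values
cuts = [(left[-1] + right[0]) / 2 for left, right in zip(bins, bins[1:])]

print(bins)
print(cuts)
```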
SMOOTHING USING BINNING METHODS
Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries: [4, 15], [21, 25], [26, 34]
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
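Both smoothing rules can be checked against the price data. In this sketch the bin means are rounded to whole rupees, as in the figures above, and a value exactly halfway between the two boundaries goes to the lower one; the tie rule is an assumption, since no such value occurs in this data.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
# Equi-depth partition: 3 bins of 4 sorted values each
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```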
SIMPLE DISCRETISATION METHODS: BINNING
Example: customer ages
[Figure: histograms of the number of values per bin]
Equi-width binning: 0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80
Equi-depth binning: 0-22, 22-31, 32-38, 38-44, 44-48, 48-55, 55-62, 62-80
THANK YOU !
Any Questions?
SUSHIL KULKARNI
[email protected]@yahoo.co.in