Introduction to Big Data
Chapter 5 & 6 (Week 3): Data Structure & Data Types
DCCS208(02) Korea University 2019 Fall
Asst. Prof. Minseok [email protected]
Contents
1. Data Structure
   • Discrete & Continuous Attribute
   • Tabulation
   • Structured & unstructured data
2. Types of Data
   • Record-based data
   • Graph-based data
   • Ordered data
01 Data Structure
copyrightⓒ 2018 All rights reserved by Korea University 4
Data Structure: What is a dataset?
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Columns: attributes (also called characteristics, features, variables, etc.)
Rows: objects (also called instances, individuals, samples, subjects, etc.)
• A dataset is a set of objects, each described by its attributes.
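As a rough illustration of "objects described by attributes", the tax table above can be sketched in Python as a list of records. The variable names (dataset, attributes, n_objects, cheat_values) are assumptions made for this sketch, and incomes are stored as plain numbers in thousands.

```python
# Each dict is one object (row); each key is one attribute (column).
# Values are copied from the tax table above; incomes are in thousands (K).
dataset = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": "Single", "Taxable Income": 125, "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital Status": "Married", "Taxable Income": 100, "Cheat": "No"},
    {"Tid": 3, "Refund": "No", "Marital Status": "Single", "Taxable Income": 70, "Cheat": "No"},
    {"Tid": 4, "Refund": "Yes", "Marital Status": "Married", "Taxable Income": 120, "Cheat": "No"},
    {"Tid": 5, "Refund": "No", "Marital Status": "Divorced", "Taxable Income": 95, "Cheat": "Yes"},
    {"Tid": 6, "Refund": "No", "Marital Status": "Married", "Taxable Income": 60, "Cheat": "No"},
    {"Tid": 7, "Refund": "Yes", "Marital Status": "Divorced", "Taxable Income": 220, "Cheat": "No"},
    {"Tid": 8, "Refund": "No", "Marital Status": "Single", "Taxable Income": 85, "Cheat": "Yes"},
    {"Tid": 9, "Refund": "No", "Marital Status": "Married", "Taxable Income": 75, "Cheat": "No"},
    {"Tid": 10, "Refund": "No", "Marital Status": "Single", "Taxable Income": 90, "Cheat": "Yes"},
]

# The attributes are the keys shared by every object.
attributes = list(dataset[0].keys())
# The number of objects is the number of rows.
n_objects = len(dataset)

# All values of a single attribute, one per object.
cheat_values = [row["Cheat"] for row in dataset]

print(attributes)
print(n_objects)
print(cheat_values)
```

A list of dicts is only one possible representation; a table library would hold the same rows and columns in a single object.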
Data Structure: What is a variable?
Also called attributes, characteristics, features, etc.
• Properties or features of a specific object.
• e.g. a human:
  • Eye size (mm)
  • Eye color
  • Skin color
  • Height (cm)
  • Wears glasses or not
  • Gender
  • Age
  • Length of finger (cm)
  • ...
Discrete & Continuous Attribute: Types of variable
Continuous Attribute
• It takes a real number as its value.
• A continuous variable is one that can take on infinitely many, uncountable values.
• Continuous attributes are usually represented as floating-point variables.
In Korean...?
Discrete & Continuous Attribute: Types of variable
Discrete Attribute
• It takes a finite or countably infinite set of values.
• It is usually expressed as an integer variable.
• A binary attribute is a special form of discrete attribute.
• Continuous variables can also be converted to discrete variables through binning.
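A minimal sketch of binning, using the taxable incomes from the tax table earlier in the slides. The bin edges (80 and 110) and the labels are hypothetical, chosen only for illustration; they are not taken from the lecture.

```python
from bisect import bisect_right

# Taxable incomes (in thousands) from the tax table earlier in the slides.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Hypothetical bin edges and labels (assumptions for this sketch):
#   value < 80          -> "low"
#   80 <= value < 110   -> "medium"
#   value >= 110        -> "high"
edges = [80, 110]
labels = ["low", "medium", "high"]


def to_bin(value):
    """Return the label of the bin that contains value."""
    # bisect_right counts how many edges are <= value,
    # which is exactly the index of the bin.
    return labels[bisect_right(edges, value)]


binned = [to_bin(x) for x in incomes]
print(binned)
```

After binning, the continuous income attribute has become a discrete attribute with three possible values.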
Tabulation: Importance of Tabulation
It is a systematic and logical arrangement of classified data in rows and columns.
Let’s think about how we can collect structured or unstructured data.
Collection of the data: Two types of processes
Process 1 (usually the data-science process):
Experimental design -> Determination of population -> Set variables -> Get values -> Tabulation process
Process 2 (usually referred to as ‘data mining’):
Experimental design -> Determination of population -> Get values -> Prescreening -> Tabulation process
Let’s try tabulation: Practice of tabulation
AAAAA AAAAA AAAAA AAAAA
BB BB BB BB
CCC CCC CCC CCC
D D D D
Try to tabulate the unstructured data above.
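One possible tabulation of the practice data, sketched in Python. The choice of columns (symbol, length of each group, number of groups) is an assumption made for this sketch; other tabulations of the same data are equally valid.

```python
from collections import Counter

# The practice data, joined into one whitespace-separated string.
raw = "AAAAA AAAAA AAAAA AAAAA BB BB BB BB CCC CCC CCC CCC D D D D"

# Split into groups and count how often each distinct group appears.
tokens = raw.split()
counts = Counter(tokens)

# One row per distinct group: (symbol, group length, number of groups).
table = [(token[0], len(token), n) for token, n in counts.items()]

# Print the rows under column headers.
print(f"{'Symbol':<8}{'Length':<8}{'Count':<8}")
for symbol, length, count in table:
    print(f"{symbol:<8}{length:<8}{count:<8}")
```

Each row is now an object and each column an attribute, which is the arrangement in rows and columns that tabulation asks for.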
Features of ‘unstructured’ data: Importance of tabulation
It does not reside in traditional databases and data warehouses.
It may have an internal structure, but does not fit a relational data model.
It is generated by both humans and machines.
It is usually textual or multimedia content, or machine-to-machine communication.
Examples include
• Personal messaging – email, instant messages, tweets, chat
• Business documents – business reports, presentations, survey responses
• Web content – web pages, blogs, wikis, audio files, photos, videos
• Sensor output – satellite imagery, geolocation data, scanner transactions
Current data: Zettabytes of unstructured data
About 85% is unstructured data
Big Data: Importance of Tabulation in the Big Data era
Data sets of such size, complexity and volatility that their business value cannot be fully realised with existing data capture, storage, processing, analysis and management capabilities.
The systematic use of unstructured data is a Big Data challenge!
02 Types of Data
Types of Dataset: Data... data... data...
Record-based data
• Data matrix
• Document data
• Transaction data
• ...
Graph-based data
• World Wide Web
• Molecular structure
• Map data
• ...
Ordered data
• Spatial data
• Temporal data
• Sequential data
• Genetic/genomic sequence data
Types of Dataset: Record-based data
Data that consists of a collection of records, each of which consists of a fixed set of attributes.
Types of Dataset: Data matrix
This is a term derived from linear algebra.
When composed of a fixed number of numerical attributes, an object (record) can be considered as a point in multidimensional space.
Such data is represented by an n x p matrix, where each of the n rows represents an object and each of the p columns represents an attribute.
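A small sketch of an n x p data matrix in Python. The values (a height in cm and an age for three people) are hypothetical and serve only to show the shape. Computing a Euclidean distance between two rows is one way to see that each object is a point in p-dimensional space; it is an illustration, not a step prescribed by the slides.

```python
import math

# Hypothetical data matrix: n = 3 objects (rows), p = 2 numerical attributes
# (columns): height in cm and age in years. The numbers are made up.
X = [
    [170.0, 25.0],
    [173.0, 29.0],
    [160.0, 40.0],
]

n = len(X)      # number of objects (rows)
p = len(X[0])   # number of attributes (columns)

# One attribute across all objects is one column of the matrix.
heights = [row[0] for row in X]

# Each row is a point in p-dimensional space, so the distance between
# two objects is well defined. math.dist computes the Euclidean distance.
d01 = math.dist(X[0], X[1])

print(n, p)
print(heights)
print(d01)
```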
Types of Dataset: Document data
Each document can be represented by a term vector.
Each term corresponds to a component of the vector.
Each value corresponds to the number of times the term appeared in the document.
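The term-vector idea can be sketched as follows. The two example documents and the helper name term_vector are hypothetical; splitting on whitespace is a simplifying assumption, since real text usually needs more careful tokenization.

```python
from collections import Counter

# Two hypothetical documents, used only for illustration.
docs = [
    "big data is big",
    "data structure and data types",
]

# The vocabulary fixes which term each vector component stands for.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def term_vector(doc):
    """Count how many times each vocabulary term appears in doc."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


vectors = [term_vector(doc) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Stacking the vectors row by row gives a data matrix with one row per document and one column per term.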
Types of Dataset: Transaction data
As a special type of record, each record (transaction) is a collection of items.
It is also known as ‘market basket data’.
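A minimal sketch of transaction data in Python. The items and baskets are hypothetical. Each transaction is stored as a set because it is a collection of items rather than a fixed list of attribute values, so transactions may differ in size.

```python
from collections import Counter

# Hypothetical market baskets: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# Unlike a data matrix, the records do not all have the same length.
sizes = [len(t) for t in transactions]

# Count in how many transactions each item appears.
item_counts = Counter(item for t in transactions for item in t)

print(sizes)
print(item_counts["bread"], item_counts["beer"], item_counts["eggs"])
```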
Types of Dataset: Graph data
G = (V, E)
• V is the set of vertices (nodes)
• E is the set of edges (arcs or links)
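The definition G = (V, E) can be sketched directly in Python. The four vertices and three edges are hypothetical, and treating the graph as undirected is an assumption of this sketch; for a directed graph only the first of the two add calls would be kept.

```python
# Hypothetical graph G = (V, E).
V = {"a", "b", "c", "d"}
E = {("a", "b"), ("a", "c"), ("c", "d")}

# Build an adjacency list: for each vertex, the set of its neighbours.
adjacency = {v: set() for v in V}
for u, w in E:
    adjacency[u].add(w)
    adjacency[w].add(u)  # undirected: the edge can be followed both ways

# The degree of a vertex is the number of edges that touch it.
degree = {v: len(neighbours) for v, neighbours in adjacency.items()}

print(sorted(adjacency["a"]))
print(degree["a"], degree["d"])
```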
Types of Dataset: Graph data
Set of HTML documents
Types of Dataset: Graph data
Chemical structure data
Social network data
Types of Dataset: Ordered data
Sequences of transactions
Types of Dataset: Ordered data
Genetic / genomic sequence data
Types of Dataset: Ordered data
Time-series data
Types of Dataset: Ordered data
Spatio-temporal data
End of Slide
Topic of next class: Coming soon...
• Thursday: Introduction to R programming
• Next Wednesday: Data preprocessing, quality control, similarity & dissimilarity, distance metric, fundamental statistics