Introduction to Big Data - Harvard University… · Introduction to Big Data Chapter 5 & 6 (Week...

Introduction to Big Data Chapter 5 & 6 (Week 3) Data Structure & Data types DCCS208(02) Korea University 2019 Fall Asst. Prof. Minseok Seo [email protected]


Introduction to Big Data

Chapter 5 & 6 (Week 3): Data Structure & Data types

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seo [email protected]


Contents

1. Data Structure
• Data structure
• Discrete & Continuous Attribute
• Tabulation
• Structured & unstructured data

2. Types of data
• Record-based data
• Graph-based data
• Ordered data


01 Data Structure

Introduction to Big Data


copyrightⓒ 2018 All rights reserved by Korea University 4

Data Structure: What is a dataset?

• A dataset is a set of objects, each described by its attributes.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Columns: attributes (also called characteristics, features, variables, etc.)
• Rows: objects (also called instances, individuals, samples, subjects, etc.)
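As a minimal sketch of the idea that a dataset is a set of objects described by attributes, the example table can be held in Python as a list of dictionaries, one per object. The variable names (`ATTRIBUTES`, `ROWS`, `dataset`) are illustrative assumptions, and Taxable Income is stored in thousands (125 stands for 125K).

```python
# Column names (attributes) taken from the example table.
ATTRIBUTES = ["Tid", "Refund", "Marital Status", "Taxable Income", "Cheat"]

# One tuple per row (object); Taxable Income is in thousands (125 = 125K).
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Each object becomes a dict mapping attribute name -> value.
dataset = [dict(zip(ATTRIBUTES, row)) for row in ROWS]

n_objects = len(dataset)        # number of rows (objects)
n_attributes = len(ATTRIBUTES)  # number of columns (attributes)

# Select objects by an attribute value: Tids where Cheat is "Yes".
cheaters = [obj["Tid"] for obj in dataset if obj["Cheat"] == "Yes"]

print(n_objects, n_attributes, cheaters)
```

Each dictionary is one object, and each key is one attribute, so selecting rows by an attribute value reduces to a simple filter.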


Data Structure: What is a dataset?

What are variables? (Attributes, characteristics, features, etc.)

• Properties / features of a specific object.

• e.g. Human
  • Eye size (mm unit)
  • Eye color
  • Skin color
  • Height (cm unit)
  • Wears glasses or not
  • Gender
  • Age
  • Length of finger (cm unit)
  • ...


Discrete & Continuous Attribute: Types of variable

Continuous Attribute

• It has a real number as its property value.

• A continuous variable is one that can take on infinitely many, uncountable values.

• Continuous attributes are usually represented as floating-point variables.


In Korean...?


Discrete & Continuous Attribute: Types of variable

Discrete Attribute

• It has a finite or countably infinite set of values.
• It is usually expressed as an integer variable.
• A binary attribute is a special form of discrete attribute.
• Continuous variables can also be converted to discrete variables through binning.
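Binning can be sketched in a few lines of Python. The income values below are the Taxable Income column of the earlier example table (in thousands); the bin edges (80 and 100) and the labels are illustrative assumptions, not values given in the slides.

```python
def bin_income(income_k, edges=(80, 100)):
    """Map a continuous income (in thousands) to a discrete label.

    The bin edges are hypothetical and chosen only for illustration.
    """
    low, high = edges
    if income_k < low:
        return "low"
    if income_k < high:
        return "medium"
    return "high"


# Taxable Income values from the example table, in thousands.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]

# Continuous -> discrete with three levels.
binned = [bin_income(x) for x in incomes]

# Continuous -> binary (the special case of a discrete attribute).
is_high = [int(x >= 100) for x in incomes]

print(binned)
print(is_high)
```

The choice of edges determines how much information is lost, so in practice they would be picked from the data or from domain knowledge.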


Tabulation: Importance of Tabulation

Tabulation is a systematic and logical arrangement of classified data in rows and columns.

Let's think about how we can collect structured or unstructured data.


Collection of the data: Two types of processes

Process 1 (usually the data-science process):
Experimental design → Determination of population → Set variables → Get values → Tabulation process

Process 2 (usually referred to as 'Data Mining'):
Experimental design → Determination of population → Get values → Prescreening → Tabulation process


Let's try tabulation: Practice of tabulation

AAAAA AAAAA AAAAA AAAAA
BB BB BB BB
CCC CCC CCC CCC
D D D D

Try to tabulate the unstructured data above.
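One possible tabulation of the practice data, sketched in Python. The slide does not say which rows and columns are intended, so the chosen layout (one row per distinct token, with its symbol, length, and count) is an assumption made for illustration.

```python
from collections import Counter

# The practice data as one whitespace-separated string.
raw = "AAAAA AAAAA AAAAA AAAAA BB BB BB BB CCC CCC CCC CCC D D D D"

# Count how often each distinct token occurs.
# Counter keeps first-seen order, so rows follow the order in the data.
counts = Counter(raw.split())

# Hypothetical table layout: one row per distinct token.
table = [
    {"symbol": token[0], "length": len(token), "count": n}
    for token, n in counts.items()
]

# Print the table with aligned columns.
print(f"{'symbol':<8}{'length':<8}{'count':<8}")
for row in table:
    print(f"{row['symbol']:<8}{row['length']:<8}{row['count']:<8}")
```

Other layouts are equally valid; the point is that deciding on the variables (columns) is what turns the raw text into structured data.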


Features of 'unstructured' data: Importance of tabulation

• It does not reside in traditional databases and data warehouses.
• It may have an internal structure, but does not fit a relational data model.
• It is generated by both humans and machines.
• It is usually textual and multimedia content, or machine-to-machine communication.

Examples include:
• Personal messaging – email, instant messages, tweets, chat
• Business documents – business reports, presentations, survey responses
• Web content – web pages, blogs, wikis, audio files, photos, videos
• Sensor output – satellite imagery, geolocation data, scanner transactions


Current data: Zettabytes of unstructured data

About 85% is unstructured data


Big Data: Importance of Tabulation in the Big Data era

Data sets of such size, complexity and volatility that their business value cannot be fully realised with existing data capture, storage, processing, analysis and management capabilities.

The systematic use of unstructured data is a Big Data challenge!


02 Types of Data

Introduction to Big Data


Types of Dataset: Data... data... data...

Record-based data
• Data matrix
• Document data
• Transaction data
• ...

Graph-based data
• World Wide Web
• Molecular structure
• Map data
• ...

Ordered data
• Spatial data
• Temporal data
• Sequential data
• Genetic/Genomic sequence data


Types of Dataset: Record-based data

Data that consists of a collection of records, each of which consists of a fixed set of attributes.


Types of Dataset: Data matrix

This is a term derived from linear algebra.

When composed of a fixed number of numerical attributes, an object (record) can be considered as a point in multidimensional space.

Such data is represented by an n x p matrix, where each of the n rows represents an object and each of the p columns represents an attribute.
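The n x p layout can be sketched with plain Python lists. The values and the three attributes (height in cm, weight in kg, age in years) are hypothetical and serve only to show the shape.

```python
# Hypothetical data matrix: n = 4 objects, p = 3 numerical attributes
# (height in cm, weight in kg, age in years).
X = [
    [170.0, 65.0, 23.0],
    [158.5, 52.0, 31.0],
    [181.2, 80.5, 45.0],
    [165.0, 58.3, 19.0],
]

n = len(X)     # rows = objects
p = len(X[0])  # columns = attributes

# Every row needs the same number of attributes to form a matrix.
is_rectangular = all(len(row) == p for row in X)

# Row i is one object, i.e. a point in p-dimensional space.
first_object = X[0]

# Column j collects one attribute across all objects (here: age).
ages = [row[2] for row in X]

print(n, p, is_rectangular)
print(first_object, ages)
```

A numerical library would normally hold such a matrix, but nested lists are enough to show that rows are objects and columns are attributes.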


Types of Dataset: Document data

Each document can be represented by a term vector.

Each term corresponds to a component of the vector.

Each value corresponds to the number of times the term appeared in the document.
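A minimal sketch of term vectors in Python, assuming two toy documents and simple whitespace tokenization (both are illustrative choices, not part of the slides).

```python
from collections import Counter

# Hypothetical toy documents.
docs = {
    "doc1": "big data needs big storage",
    "doc2": "data structure and data types",
}

# The vocabulary fixes which term each vector component refers to.
vocabulary = sorted({term for text in docs.values() for term in text.split()})


def term_vector(text):
    """Return the count of every vocabulary term in the given text."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]


vectors = {name: term_vector(text) for name, text in docs.items()}

print(vocabulary)
for name, vec in vectors.items():
    print(name, vec)
```

Because every document is mapped onto the same vocabulary, the vectors have equal length and can be stacked into a data matrix.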


Types of Dataset: Transaction Data

As a special type of record data, each record (transaction) is a collection of items.

It is also known as 'market basket data'.
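Transaction data can be sketched as a mapping from a transaction id to a set of items. The baskets below are hypothetical; the second half shows one common way to turn them into fixed-length binary records.

```python
# Hypothetical market-basket data: transaction id -> set of purchased items.
transactions = {
    1: {"bread", "milk"},
    2: {"bread", "diapers", "beer", "eggs"},
    3: {"milk", "diapers", "beer", "cola"},
}

# Unlike ordinary records, transactions may contain different numbers of items.
basket_sizes = {tid: len(basket) for tid, basket in transactions.items()}

# All distinct items, in a fixed (sorted) order.
items = sorted(set().union(*transactions.values()))

# Binary representation: 1 if the item is in the basket, else 0.
binary_rows = {
    tid: [int(item in basket) for item in items]
    for tid, basket in transactions.items()
}

print(items)
for tid, row in binary_rows.items():
    print(tid, row)
```

The binary form gives each transaction the same set of attributes (one per item), which is what makes it usable as record-based data.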


Types of Dataset: Graph Data

G = (V, E)
• V is the set of vertices (nodes)
• E is the set of edges (arcs or links)
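The definition G = (V, E) maps directly onto two Python sets. The four vertices and four edges below are hypothetical, and the edges are treated as undirected when building the adjacency list; a directed graph would add each edge in one direction only.

```python
# Hypothetical graph: V is the set of vertices, E the set of edges.
V = {"A", "B", "C", "D"}
E = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}

# Build an adjacency list, treating each edge as undirected.
adjacency = {v: set() for v in V}
for u, w in E:
    adjacency[u].add(w)
    adjacency[w].add(u)

# Degree of a vertex = number of neighbors.
degree = {v: len(neighbors) for v, neighbors in adjacency.items()}

for v in sorted(V):
    print(v, sorted(adjacency[v]), degree[v])
```

The adjacency list is only one representation; an adjacency matrix or the raw edge set may suit other tasks better.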


Types of Dataset: Graph Data

Set of HTML documents


Types of Dataset: Graph Data

Chemical structure data

Social network data


Types of Dataset: Ordered data

Sequences of transactions


Types of Dataset: Ordered data

Genetic / genomic sequence data


Types of Dataset: Ordered data

Time-series data


Types of Dataset: Ordered data

Spatio-temporal data


End of Slide


Topic of next class: Coming soon...

Thursday
• Introduction to R programming

Next Wednesday
• Data Preprocessing
• Quality control
• Similarity & dissimilarity
• Distance metric
• Fundamental statistics