Introduction to Big Data - Harvard University€¦ · Introduction to Big Data Chapter 5 & 6 (Week...

Post on 21-May-2020

10 views 0 download

Transcript of Introduction to Big Data - Harvard University€¦ · Introduction to Big Data Chapter 5 & 6 (Week...

Introductionto Big Data

Chapter 5 & 6 (Week 3)Data Structure & Data types

DCCS208(02) Korea University 2019 Fall

Asst. Prof. Minseok Seomins@korea.ac.kr

Contents

Record-based data

Types of data2.

Data structure

Data Structure1. Discrete & Continuous Attribute

Tabulation

Structured & unstructured data

Graph-based data

Ordered data

01Data StructureIntroduction to Big Data

copyrightⓒ 2018 All rights reserved by Korea University 4

Data StructureWhat is dataset?

What is dataset?

Tid Refund Marital Status

Taxable Income Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes 10

Attributes, Chracteristics, Features, Variables, ..., etc.

Object

Instance

Individual

Sample

Subject

...

• Set of objects consisting oftheir attributes.

copyrightⓒ 2018 All rights reserved by Korea University 5

Data StructureWhat is dataset?

What is variables?Attributes, Chracteristics, Features, ..., etc.

• Properties / features of specific object.

• i.e. Human• Eye size (mm unit)• Eye color• Skin color• Height (cm unit)• Wear glasses or not• Gender• Age• Length of finger (cm unit)• ...

copyrightⓒ 2018 All rights reserved by Korea University 6

Discrete & Continuous AttributeTypes of variable

Continuous Attribute

• It has a real number a s property value

• A continuous variable is one which can take on infinitely many,uncountable values.

• Continuous attributes are usually represented as floating pointvariables

copyrightⓒ 2018 All rights reserved by Korea University 7

In Korean...?

copyrightⓒ 2018 All rights reserved by Korea University 8

Discrete & Continuous AttributeTypes of variable

Discrete Attribute

• Finite or infinite set of countable values• Usually expressed as an integer variable• Binary attribute is a special form of discrete attribute• Continuous variables can also be convereted to discrete variables

thorugh binning

copyrightⓒ 2018 All rights reserved by Korea University 9

TabulationImportance of Tabulation

It is a systematic and logical arrangement of classified data in rows and columns.

Let’s think about how we can collect structured or unstructured data

copyrightⓒ 2018 All rights reserved by Korea University 10

Collection of the dataTwo types of processes

Usually, this is data-science process

Experimental design

Determination of population Set variables Get values

Experimental design

Determination of population Get values Prescreening

Usually, we mentioned that ‘Data Mining’

Tabulation process

Tabulation process

copyrightⓒ 2018 All rights reserved by Korea University 11

Let’s try tabulationPractice of tabulation

AAAAA AAAAA AAAAA AAAAABB BB BB BBCCC CCC CCC CCCD D D D

Try to tabulize above unstructured data

copyrightⓒ 2018 All rights reserved by Korea University 12

Features of ‘unstructured” dataImportance of tabulation

It does not reside in traditional databases and data warehouses.

It may have an internal structure, but does not fit a relational data model.

It generated by both humans and machines.

It will usually be textual and multimedia content, Machine-to-machinecommunication

Examples include• Personal messaging – email, instant messages, tweets, chat• Business documents – business reports, presentations, survey

responses• Web content – web pages, blogs, wikis, audio files, photos, videos• Sensor output – satellite imagery, geolocation data, scanner

transactions

copyrightⓒ 2018 All rights reserved by Korea University 13

Current dataZettabytes’ unstructured data

About 85% is unstructured data

copyrightⓒ 2018 All rights reserved by Korea University 14

Big DataImportance of Tabulation in Big Data era

Data sets of such size, complexity and volatility that their businessvalue cannot be fully realised with existing data capture, storage,processing, analysis and management capabilities

The systematic use ofunstructured data is aBig Data challenge!

02Types of DataIntroduction to Big Data

copyrightⓒ 2018 All rights reserved by Korea University 16

Types of DatasetData... data... data.......

Record-based data• Data matrix• Document data• Transaction data• ...

Graph-based data• World wide web• Molecular structure• Map data• ...

Ordered data• Spatial data• Temporal data• Sequential data• Genetic/Genomic sequence data

copyrightⓒ 2018 All rights reserved by Korea University 17

Types of DatasetRecord-based data

Data that consists of a collection of records, each of which consistsof a fixed set of attributes

copyrightⓒ 2018 All rights reserved by Korea University 18

Types of DatasetData matrix

This is a term derived from linear algebra.

When composed of a fixed number of numerical attributes, anobject (record) can be considered as a point in multidimensionalspace.

Such data is represented by n x p matrices, where n rows eachrepresent an object and p columns each represent an attribute.

copyrightⓒ 2018 All rights reserved by Korea University 19

Types of DatasetDocument data

Each document can be represented by a term vector.

Each term corresponds to a component of the vector.

Each value corresponds to the number of times the term appearedin the document.

copyrightⓒ 2018 All rights reserved by Korea University 20

Types of DatasetTransaction Data

As a special type of record, each record (transaction) is acollection of items.

It also known as ‘market basket data’.

copyrightⓒ 2018 All rights reserved by Korea University 21

Types of DatasetGraph Data

G = (V, E)• V is set of vertices (nodes)• E is set of edges (arcs or link)

copyrightⓒ 2018 All rights reserved by Korea University 22

Types of DatasetGraph Data

Set of HTML documents

copyrightⓒ 2018 All rights reserved by Korea University 23

Types of DatasetGraph Data

Chemical structure data

Social network data

copyrightⓒ 2018 All rights reserved by Korea University 24

Types of DatasetOrdered data

Sequences of transactions

copyrightⓒ 2018 All rights reserved by Korea University 25

Types of DatasetOrdered data

Genetic / Genomics sequence data

copyrightⓒ 2018 All rights reserved by Korea University 26

Types of DatasetOrdered data

Time-series data

copyrightⓒ 2018 All rights reserved by Korea University 27

Types of DatasetOrdered data

Spatio- Temporal data

End of Slide

copyrightⓒ 2018 All rights reserved by Korea University 29

Topic of next classComing soon...

Introduction to R programmingThursday

Data Preprocessing Quality control Similarity & dissimilarity Distance metric Fundamental statistics

Next Wednesday