CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

14
Mingon Kang, Ph.D. Department of Computer Science, University of Nevada, Las Vegas * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Transcript of CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Page 1: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Mingon Kang, Ph.D.

Department of Computer Science, University of Nevada, Las Vegas

* Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington

CS 789 ADVANCED BIG DATA ANALYTICS

DATA REPRESENTATION

Page 2: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation?

How does a computer represent data?

0 and 1 in the aspect of โ€œgeneralโ€ computer science

Vector/Matrix in the aspect of โ€œMachine Learningโ€

Page 3: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation

Scalar single number - usually write in italics

- lower-case variable names

- e.g., ๐‘  โˆˆ โ„, ๐‘› โˆˆ โ„• [1]

Vector array of numbers - arranged in order

- lower-case names written in bold typeface

- ๐ฑ =

๐‘ฅ1๐‘ฅ2โ‹ฎ๐‘ฅ๐‘›

, ๐ฑ = {๐‘ฅ1, ๐‘ฅ2, โ€ฆ , ๐‘ฅ๐‘›}

- what is ๐ฑ๐ฌ when ๐ฌ = {1, 3, 6}?- Then, ๐ฑโˆ’๐ฌ?

1 https://en.wikipedia.org/wiki/List_of_mathematical_symbols

Page 4: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation

Matrix 2-D array of

numbers

- an element is identified by two indices

- upper-case variable name with bold

typeface, e.g., ๐—- ๐— โˆˆ โ„๐’Žโˆ—๐’: matrix has a height of m and a

width of n, and elements are real numbers

- e.g., ๐— =๐‘ฅ11 ๐‘ฅ12 ๐‘ฅ13๐‘ฅ21 ๐‘ฅ22 ๐‘ฅ23

Tensor array with more than

two axes

- three indices to identify an element

Page 5: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Types of Variable

Categorical variable: discrete or qualitative variables

Nominal:

โ—ผ Have two or more categories, but which do not have an

intrinsic order

Ordinal

โ—ผ Have two or more categories, which can be ordered or

ranked.

Continuous variable

Page 6: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation

Features

An individual measurable property of a phenomenon being observed

The number of features or distinct traits that can be used to describe each item in a quantitative manner

May have implicit/explicit patterns to describe a phenomenon

Samples

Items to process (classify or cluster)

Can be a document, a picture, a sound, a video, or a patient

Reference: https://en.wikipedia.org/wiki/Feature_(machine_learning)

Page 7: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Representation

Feature vector

An N-dimensional vector of numerical features that

represent some objects

A sample consists of feature vectors

Reference: http://www.slideshare.net/rahuldausa/introduction-to-machine-learning-38791937

Page 8: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Example - Survey

Convert Data to a feature vector/sample matrix

๐‘ก๐‘–๐‘š๐‘’ = ๐‘Ž๐‘”๐‘Ÿ๐‘’๐‘’๐‘Ž๐‘ข๐‘‘๐‘–๐‘œ = ๐‘ฆ๐‘’๐‘ 

โ‹ฎ

Page 9: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Example โ€“ Structured data

Convert Data to a feature vector/sample matrix

๐น๐‘–๐‘›๐‘Ž๐‘›๐‘๐‘’๐‘€๐‘Ž๐‘Ÿ๐‘˜๐‘’๐‘ก๐‘–๐‘›๐‘”

โ‹ฎ

Page 10: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Example โ€“ Image data

Page 11: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Example โ€“ Unstructured data

Feature Extraction

Unstructured data (e.g., text data)

Structured data (e.g., Bag-of-Words Model)

Page 12: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data in Machine Learning

๐‘ฅ๐‘–: input vector, independent variable

๐‘ฆ: response variable, dependent variable

๐‘ฆ โˆˆ {โˆ’1, 1} or {0, 1}: binary classification

๐‘ฆ โˆˆ โ„ค: multi-label classification

๐‘ฆ โˆˆ โ„: regression

Predict a label when having observed some new ๐‘ฅ

Page 13: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Visualization

Vector space model

Data is a set of features, ๐๐ข = ๐‘“1, ๐‘“2, โ€ฆ , ๐‘“๐‘

All data can be represented by vector

Ref: https://www.slideshare.net/pkgosh/the-vector-space-model

Page 14: CS 789 ADVANCED BIG DATA ANALYTICS DATA REPRESENTATION

Data Visualization

Hand-written data (MNIST)

High-dimensional data

Can visualize data using Principle Component Analysis