New Concepts in Data Mining - sorry.vse.czberka/docs/4iz451/KDD-trends.pdf · New Concepts in Data...


New Concepts in Data Mining

Big Data

Ubiquitous Knowledge Discovery

Reality Mining

Data Science

BIG DATA

Big Data are data that cannot be handled using standard database systems and standard data analysis tools.

Big Data sources: sensors, surveillance systems, mobile phones, GPS devices, RFID readers, social networks, computer networks, web logs, scientific data …

Big Data characteristics (3 V’s or 4 V’s)

Volume: the size of Big Data goes beyond standard data storage and manipulation techniques

Velocity: Big Data is often available in real time

Variety: Big Data contains not only structured data (e.g. in tabular or relational form) but also texts, images, audio or video

Veracity: the quality and reliability of Big Data can vary

The Big Data pipeline

Data generation

Data acquisition: data collection, data transmission, data pre-processing (integration, cleansing, redundancy elimination)

Data storage

Data analysis
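The pre-processing step of the acquisition stage (integration, cleansing, redundancy elimination) can be illustrated with a minimal Python sketch. The record fields (`device_id`, `timestamp`, `value`) and the two sensor sources are hypothetical and chosen only for illustration; a real pipeline would apply the same ideas in a distributed setting.

```python
from typing import Iterable

Record = dict  # hypothetical record layout: device_id, timestamp, value


def integrate(*sources: Iterable[Record]) -> list[Record]:
    """Integration: merge records from several sources into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def cleanse(records: Iterable[Record]) -> list[Record]:
    """Cleansing: drop records with a missing or non-numeric value."""
    return [r for r in records if isinstance(r.get("value"), (int, float))]


def deduplicate(records: Iterable[Record]) -> list[Record]:
    """Redundancy elimination: keep the first record per (device, timestamp)."""
    seen = set()
    unique = []
    for r in records:
        key = (r["device_id"], r["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


sensor_a = [
    {"device_id": "a", "timestamp": 1, "value": 20.5},
    {"device_id": "a", "timestamp": 2, "value": None},  # missing reading
]
sensor_b = [
    {"device_id": "a", "timestamp": 1, "value": 20.5},  # duplicate of sensor_a
    {"device_id": "b", "timestamp": 1, "value": 19.0},
]

clean = deduplicate(cleanse(integrate(sensor_a, sensor_b)))
```

Here `clean` keeps one record per device and timestamp and discards the reading without a value.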

Challenges for collecting, storing and manipulating Big Data

New forms of data storage: file systems, NoSQL databases

New forms of computation: parallel computing, distributed computing, grid computing, cloud computing

Batch processing vs. stream processing
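The difference between batch and stream processing can be sketched on a very small task, computing a mean and a variance. The batch version sees the whole data set at once; the stream version, which uses Welford's one-pass algorithm, sees one value at a time and stores only three numbers. The class name `RunningStats` and the sample data are assumptions made for this illustration.

```python
import statistics


class RunningStats:
    """Streaming mean/variance via Welford's algorithm (one pass, O(1) memory)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; returned as 0.0 when fewer than two values were seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Batch: the whole data set is available at once.
batch_mean = statistics.mean(data)
batch_var = statistics.variance(data)

# Stream: values arrive one at a time and are never stored.
stats = RunningStats()
for x in data:
    stats.update(x)
```

Both approaches arrive at the same statistics, but only the streaming one works when the data never stops arriving or does not fit into memory.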

Apache Hadoop

Software platform that supports data-intensive distributed applications

Hadoop distributed file system

Map/Reduce: divide-and-conquer approach that breaks down an intractable problem into tractable sub-problems
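The Map/Reduce idea can be sketched in plain Python with the classic word-count task. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop's API; the function names and the input splits are assumptions made for the example. In a real cluster, each split would be mapped on a different node and the reducers would run in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


splits = ["big data big tools", "data streams", "big streams"]

# Each split is mapped independently of the others (the "divide" step).
mapped = chain.from_iterable(map_phase(s) for s in splits)
grouped = shuffle(mapped)
# Each key is reduced independently of the others (the "conquer" step).
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
```

Because neither the map calls nor the reduce calls depend on each other, both phases can be distributed across many machines.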

Challenges for analyzing Big Data

New forms of data: heterogeneous data, unstructured data, stream data

New properties of data: non-stationarity, concept drift

New forms of learning: real-time learning, incremental learning, sequential learning

New forms of computation: distributed computation, cloud computation
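Why non-stationarity calls for incremental learning with forgetting can be shown on a toy stream. The sketch below compares a cumulative mean, which weights all past data equally, with an exponentially weighted moving average, which gradually forgets old values. The stream, the level shift and the smoothing factor `alpha` are hypothetical values chosen for illustration.

```python
def cumulative_mean(stream):
    """Yield the mean of all values seen so far (never forgets old data)."""
    total = 0.0
    for n, x in enumerate(stream, start=1):
        total += x
        yield total / n


def ewma(stream, alpha=0.2):
    """Yield an exponentially weighted mean; alpha sets how fast old data fades."""
    estimate = None
    for x in stream:
        estimate = x if estimate is None else alpha * x + (1 - alpha) * estimate
        yield estimate


# Hypothetical stream whose underlying level shifts from 10 to 20 halfway through.
stream = [10.0] * 50 + [20.0] * 50

final_cumulative = list(cumulative_mean(stream))[-1]
final_ewma = list(ewma(stream))[-1]
```

After the shift, the cumulative mean ends up halfway between the old and the new level, while the weighted estimate has adapted to the new one. Both estimators update in constant time per value, which is what makes them usable for real-time learning.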

Areas related to Big Data Analysis

Analysis of Big Data is grounded in Knowledge Discovery in Databases and Data Mining. However, different people use new names for it:

Ubiquitous Knowledge Discovery

Reality Mining

UBIQUITOUS KNOWLEDGE DISCOVERY

data mining in mobile systems, wireless communication networks, calm technologies,

distributed architectures: distributed data mining, grid, P2P, autonomic computing, agents,

learning components: statistical learning (incl. online learning), evolutionary computing, anytime algorithms,

data types: spatio-temporal, stream, multimedia,

security and privacy: privacy preserving data mining, intrusion detection,

HCI and cognitive modelling: user interfaces of ubiquitous discovery systems.

EU-funded project KDUbiq (2005-2008, FP6 FET IST)

Knowledge discovery process in mobile, distributed, dynamic environments, in presence of massive amounts of data

REALITY MINING

Tackles some of the most challenging data mining problems:

scaling up for high-dimensional data / high-speed streams

mining sequence data and time series data

data mining in a network setting

Collection and analysis of machine-sensed environmental data pertaining to human social behavior, with the goal of identifying predictable patterns of behavior. (Pentland, 2004)

Example reality mining projects

Complex social systems

Public health and medicine

Traffic monitoring and control

Smart homes and ambient assisted living

Environmental monitoring

Complex social systems (1/3)

Data from mobile phones of MIT students and researchers used to analyze collective human behaviour

Proximity pattern (left) and inferred friendship network (right)

Eagle and Pentland, 2004

Complex social systems (2/3)

Same data as before, used to investigate how people’s social relations affect their encounters (friends or strangers)

Miklas et al., 2007

Number of encounters in two-week period (left), and number of pairs of people and number of encounters for friends and strangers (right)
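Counting encounters between pairs of people from proximity records can be sketched as follows. This is only an illustration of the kind of summarization involved, not the method used in the cited studies; the log format, the names and the threshold of three encounters are all assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical proximity log: (time slot, set of devices seen together).
proximity_log = [
    (1, {"ann", "bob", "eve"}),
    (2, {"ann", "bob"}),
    (3, {"ann", "bob"}),
    (4, {"bob", "eve"}),
]

encounters = Counter()
for _, devices in proximity_log:
    # Every unordered pair of co-located devices counts as one encounter.
    for pair in combinations(sorted(devices), 2):
        encounters[pair] += 1

# Assumed threshold: pairs that met at least 3 times are treated as frequent contacts.
frequent = {pair for pair, n in encounters.items() if n >= 3}
```

The resulting pair counts form a weighted graph over people, which is the raw material for comparing encounter frequencies of friends and strangers.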

Complex social systems (3/3)

Daily travel patterns of 14 816 521 individuals across Kenya, used to study human mobility

Wesolowski et al., 2013

Relation between mobility and expenditure (left), and between income and expenditure (right)

Public health and medicine

Data from mobile phones to study the role of human mobility in the dissemination of malaria in Kenya

Buckee et al., 2011

The parasite rate (left), and locations of mobile phone towers overlaid on settlement and parasite rate maps (right)

Traffic monitoring and control (1/2)

Analyze data from GPS-enabled mobile phones as a proof of concept of a traffic monitoring system

Herrera et al., 2010

Snapshot of Mobile Millennium Traffic in San Francisco and the Bay Area

Traffic monitoring and control (2/2)

Data on taxi locations and booking requests in a GPS-enabled taxi dispatch system in Singapore

Santani et al., 2008

Taxi Observations by Location and Booking Frequency of Zone

https://www.novinky.cz/ekonomika/412841-prvni-samoridici-taxiky-vozi-v-singapuru-zakazniky.html

Smart homes and ambient assisted living

Monitor and analyze data from mobile phones, wearable sensors and other devices integrated in the residential infrastructure

O’Grady et al., 2010

A generic scheme of Ambient Assistive Living Systems

Environmental monitoring

NoiseTube – a low-cost approach involving the general public to monitor noise pollution using their mobile phones as noise sensors

Maisonneuve et al., 2010

Collective noise map for part of Paris

https://play.google.com/store/apps/details?id=net.noisetube

Lessons from reality mining projects

Using mobile phones we can gather significantly larger and more reliable data sets than by querying the users

Mobile phones are a cheap alternative to more complex sensor systems

The data analysis often does not go beyond data description and summarization tasks

DATA SCIENCE

Set of fundamental principles that support and guide the principled extraction of information and knowledge from data.

Theoretical background for data mining, big data analysis, data-driven decision making, etc.