Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity...

Post on 07-Jun-2020

0 views 0 download

Transcript of Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity...

Future Trends in Data AnalyticsHannes Voigt

2

General Information

Lecture Notes! Further information on our website http://wwwdb.inf.tu-dresden.de

3

Dresden Database System Group

4

…, ! face detection & recogition, autobeam, Siri/Cortant/…, 3D printing, ...

Future is Now

[http://www.bostondynamics.com/][http://[https://www.google.com/selfdrivingcar/] [http://www.idsc.ethz.ch/research-dandrea/research-projects/cubli.html]

Cubli

[http://www.ibmwatson.com/]

5

… tomorrow?

Sooner then you think!

6

When have we entered the future?

What was the threshold

we crossed?

Exponential Growth

8

Exponential Growth

Outnumbers everything else quicklyAsymptotic advantage!Quickly increasing add! -on

Leads to surprising results! Black swans in economic crisis! Shooting stars in business, media, sport, etc.

[http://roadmap2retire.com/2016/06/disruptive-technologies-exponential-growth/]

[http://xkcd.com/1727/]

9

Second Half of the Chessboard

When exponential growth really kicks in According to Ray ! Kurzweil Things start to get interesting in the second !half of the chess boardBeyond 4G numbers quickly go beyond !human intuitionWhat happens in the second half can hardly !foreseen

[https://en.wikipedia.org/wiki/File:Wheat_Chessboard_with_line.svg]

10

First half of theChessboard

Welcome to the Second Half of the Chessboard

Digitization

12

Everything is Digital

12

13

Landscape has changed

From Islands of digital data … … to ponds of analog signals

(Tuamotu Archipelago, French Polynesia) (Algonquin Provincial Park, Ontario, Canada)

14

Everything is Digital

World’s capacity to store information World’s capacity to telecommunicate

Analog

Analog

DigitalDigital

[M. Hilbert and P. Lopez, The World's Technological Capacity to Store, Communicate, and Compute Information, Science, 332, April 2011, DOI: 10.1126/science.1200970]

94% digital in 2007 99% digital in 2007

15

Everything is Digital

[http://blog.acronis.com/posts/data-everything-8-noble-truths]

Everything is DigitalEverything is Digital

Big Data

17

Big Data term reflects the three drivers

Exponential growth! beyond intuition! second half of the chessboard

Digitization Everything is digital data!Analog signals are not part of big data!

Big Data

18[http://www.ibmbigdatahub.com/infographic/four-vs-big-data]

19

Data, data, everywhere… Volume

The Petabyte Age! 2008

! Eric Schmidt (in 2010): Every 2 Days We Create As Much Information As We Did Up To 2003

The Zettabyte Age2015!One ! Zettabyte = Stack of books from Earth to Pluto 20 times

The internet in 2020~26.3 ! billionnetworked devices25.1 GB ! averagetraffic per capitaper month2.3 Zettabytes !annual IP-Traffic

20

Data is produced continuously Velocity

Humane-produced data! ~300 million email sent/received per minute in 2016! >0.4 million tweets per minute in 2016! ~138 million google searches per minute in 2016! >400 hours of video was uploaded to YouTube per minute in 2016

Machine produced data! IoT will be boost data velocity greatly! Sensors become ubiquitous! Sensors for sound, images, position, motion, temperature, pressure, etc …! Resolution (in time and space) is continuously increasing

! Assume Waze like cars collecting 20 double values every second! With one million driving cars that is almost 10 TB every minute

21

Data from all kinds of digital sources! Structure vs. unstructured! Text vs. image! Curated vs. automatically collected! Raw vs. edited vs. refined

Semantic heterogenity! Decentralized content generation! Multiple perspectives (conceptualizations) of

the reality! Ambiguity, vagueness, inconsistency

Everything is data Variety

“A lot of Big Data is a lot of small “A lot of Big Data is a lot of small data data put put put togethertogether.”

22

Data is messy Veracity

Quality of captured data varies greatlySensor inaccuracy!Human mistakes!Incompleteness!Untrusted sources!Deterministic processes!

! etc.

22

“Next to the analytes, we see everything in the results, from the perfume of the lab assistant to the softener in the new machine sitting next.” —About Liquid chromatography–mass spectrometry at IPB Halle

Data Science/Data Analysis

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” –Clive Humby

Value – the fifth V of Big Data

… or how to turn raw data into something valuable?

24

Levels of Analysis

Reporting

Analytics

25

Descriptive Analytics

How have I done? (and why?)Simplest ! class of analyticsCondense ! big data into smaller, more useful bits of informationSummary ! of what happened70! -80% penetration

ExamplesDatabase aggregation queries!Business reporting (e.g. ! Sales figures)Market survey (e.g. GFK)!(classical) business intelligence, dashboards, scorecards!Google Analytics!

26

Predictive Analytics

How will I do?Next ! step up in data reductionStudies ! recent and historical dataUtilizes ! a variety of statistical, modeling, data mining and machine learning techniquesAllows (potential inaccurate) predictions ! about the futureUse data you have, to create data you do not have!15! -25% penetration

ExamplesMarket developments!Stock developments!Movie/product recommendations on ! netflix/amazonEnergy demand and supply forecasting!Preditive! policing (e.g. precobs, predpol)

Fill in the blanks

Project the curve

27

Prescriptive Analytics

What should I do?! Predicts “multiple futures” based on the potential actions! Recommends the best course of action for any

pre-specified outcome! Typically involves a feedback system to track outcome

produced by the action taken! Utilizes predictive methods + optimization techniques! 1-5% penetration

Examples! Energy load balancing by flexoffer scheduling! Inventory optimization in supply chains! Targeted marketing campaign optimization! Focus treatment of clinical obesity in health care! Waze-like car navigation

the potential actions

Annual savings in U.S. in 2016

nnual100 Annn

-

vingl

00 million s in U.S. in

gs inmiles

-- 1

10001011100

savingsav0 millio

gson milles

0 milthousand CO2 tones

11111110000010 -

-

00000000 mil0

000000 tnd

000000000000000000000 ttttttmillion fuel

iledd COOO2222 toooooooonesdddd COOO

gallons

111111110000 mmmm$300-

-

mmmmmmm0000-

27

thhhhhoooouuuuussssssaaaaaandttthhhh

iiillllion fffffffuuuuu

2dddd Ceeeeeeeeell ggggaaaaaaaaaaaaalllllllloooooooonnnns

mmiillionmmmillio

0000000-400 million costs

ORION N –OOnOROOnnn-

NNORIOnnn-Road dd Integrated OOOOOOOOnnnnnnnn RRRRRRRRRRooooooooooaaaaadddddd IInt

Optimization ratedtegr

nnnnn and Navigation

28

Roles in Big Data Projects

Data scientistData science is a systematic method !dedicated to knowledge discovery via data analysisIn business, optimize organizational !processes for efficiencyIn science, analyze !experimental/observational data to derive resultsTypical skills!

Statistics + (mathematics) background-Computer science: Programming, e.g.: R, (SAS,) -Java, Scala, Python; Machine learningSome domain knowledge for the problem to solve-

Data engineer! Data engineering is the domain that

develops and provides systems for managing and analyzing big data

! Build modular and scalable data platforms for data scientists

! Deploy big data solutions! Typical skills

- Computer science background- Databases- Software engineering- Massively parallel processing- Real-time processing- Languages: C++, Java, (Scala,) Python- Understand performance factors and limitations

of systems

Our Focus in Teaching and Research

30

Levels of Analysis

Reporting

Analytics

Methods of Data Science

M!

MethodsMeth! Lecture

of Data Scienceof Dre Datenintegrationonnnnn und nd –d –analyse

31

Big Data Landscape(s)

32

Data Systems

data systems are in the middle of all this

a data system…! …stores data…! …provides access to data…! …and (ideally) makes data analysis easy

Different data systems use different data models

big data

datasystems

33

(Big) Data System Landscape(s)

Concept and techniques of data systems

C!

Concept aConc! Lecture

ahniq

aArchitektur

ues ouesvon

of data systemsof da

Datenbanksystemen

!!!

on! Leeeeecturre! Leeeeee!!!!!!! Lecture

and techniqand te Arrrchitektu

ueur vooooon Daaaaaaaten

e Arrrc

rrrrrrreeeeeeee Big Data Platforms

34

35

Graph Building Blocks

Nodes (Dots)

! Like an entity in ER! Exist on their own! Have object identity

Edges (Lines)

Like a relationship in ER!Exist only between nodes!Identity depends on the !nodes they connect

36

Graph Data – Social Network

37

Graph Data – Social Network

38

Graph Data – Social Network

39

Graph Data

40

Social Graphs

FacebookMay 2013!

As of 12/2014: “1.39 billion active users with more than 400 billion edges”

[Ching et al. One Trillion Edges: Graph Processing at Facebook-Scale. PVLDB (12)2015]

41

Structured Search

Example: Facebook Graph SearchFinding subgraph structures!Very natural way of formulating queries!

[http://socialnewsdaily.com/15865/facebook-social-graph-search-a-great-way-to-find-working-professionals-in-your-network/]

42

Social Network Analysis

Example: Twitter communication! Users tweeting on a specific topic! Others reply of retweet! Users can be grouped based on

communication topology (-> graph clustering)

! Analysis reveals user groups and dominant communication patterns

[https://nodexlgraphgallery.org/Pages/Graph.aspx?graphID=76277]

43

Citation Networks

Example: DBLP! Open bibliographic information on major

computer science journals and proceedings>3.4 million publication!>7000 new publication per month!>1.7 million authors!

[http://well-formed.eigenfactor.org/projects/well-formed/radial.html#/][Emre Sarigöl et al. Predicting Scientific Success Based On Coauthorship Networks. EPJ Data Science, 2014]

44

Viral Marketing

Viral Marketing! spreading content to one person so that more than one person engaging with the content! Techniques

- Influence estimation - Influence maximization

45

Knowledge Graphs

Knowledge graph of a picture

[The National Library of Wales, https://www.llgc.org.uk/blog/?p=11246]

46

Knowledge Graphs

Sächsische Landesbibliothek – Staats- und Universitätsbibliothek Dresden (SLUB)! Adds semantics search to library online catalog! Utilizing multi-lingual knowledge data from Wikipedia! Significant improvements in search quality for library users

[http://www.slideshare.net/JensMittelbach/dswarm-a-library-data-management-platform-based-on-a-linked-open-data-approach]

47

sed

Supply Chain Management

Supply Chain OptimizationCustomer A: 10! -30% reduction in inventoryCustomer B! : 8% reduction in transportation costs

[http://www.logicblox.com/solutions/supply-chain-optimization/]

When to send?How much to send?From where to where?

much to send?From where to where?

?From where to where?

InventoryResponseResponseeeee

Times

48

Level of Analytics

Graph Data Management Applications

49

The Power of Networks

Watch on YouTube: https://youtu.be/nJmGrNdJ5Gw

50

(Big) Data System Landscape(s)

!! Lecture re Graph Data Management nt andnnnnnnnd Analytics

51

(Big) Data System Landscape(s)

52

Use Cases of Search

Searching Textual Contents

53

Use Cases of Search

Stack Overflow ! Programming Q & A site

54

Use Cases of Search

Image Search

55

Use Cases of Search

Last.fm ! Music Search

56

Example: Library of Congress

57

Databases versus Information Retrieval

Database Information RetrievalMatching Exact Match Partial Match, Best MatchModel Deterministic ProbabilisticQuery Language Structured / Formal NaturalQuery Specification Complete IncompleteQueried Objects Matching RelevantError Sensitivity Sensitive Insensitive

Hard to formulate Queries!

Iterative workflow base on reponses!

Tons of results, but only a few are relevant!

Ranking of results (instead of set of results)!

Representation of document content often inadequate / inacurate!

58

(Big) Data System Landscape(s)

!! Lecture re Information Retrieval

59

Summary

Big DataCrossing thresholds in ! exponential growth & digitizationTechnical challenges in volume, velocity, variety, veracity, value!

Related research and lecturesData Science ! ! Datenintegration und -analyse

Data ! Systems ! Architektur von Datenbanksystemen! Big Data Platforms

Graph ! Data ! Graph Data Management and Analytics

Search ! ! Information Retrieval