Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity...

59
Future Trends in Data Analytics Hannes Voigt

Transcript of Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity...

Page 1: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

Future Trends in Data AnalyticsHannes Voigt

Page 2: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

2

General Information

Lecture Notes! Further information on our website http://wwwdb.inf.tu-dresden.de

Page 3: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

3

Dresden Database System Group

Page 4: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

4

…, ! face detection & recogition, autobeam, Siri/Cortant/…, 3D printing, ...

Future is Now

[http://www.bostondynamics.com/][http://[https://www.google.com/selfdrivingcar/] [http://www.idsc.ethz.ch/research-dandrea/research-projects/cubli.html]

Cubli

[http://www.ibmwatson.com/]

Page 5: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

5

… tomorrow?

Sooner then you think!

Page 6: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

6

When have we entered the future?

What was the threshold

we crossed?

Page 7: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

Exponential Growth

Page 8: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

8

Exponential Growth

Outnumbers everything else quicklyAsymptotic advantage!Quickly increasing add! -on

Leads to surprising results! Black swans in economic crisis! Shooting stars in business, media, sport, etc.

[http://roadmap2retire.com/2016/06/disruptive-technologies-exponential-growth/]

[http://xkcd.com/1727/]

Page 9: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

9

Second Half of the Chessboard

When exponential growth really kicks in According to Ray ! Kurzweil Things start to get interesting in the second !half of the chess boardBeyond 4G numbers quickly go beyond !human intuitionWhat happens in the second half can hardly !foreseen

[https://en.wikipedia.org/wiki/File:Wheat_Chessboard_with_line.svg]

Page 10: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

10

First half of theChessboard

Welcome to the Second Half of the Chessboard

Page 11: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

Digitization

Page 12: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

12

Everything is Digital

12

Page 13: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

13

Landscape has changed

From Islands of digital data … … to ponds of analog signals

(Tuamotu Archipelago, French Polynesia) (Algonquin Provincial Park, Ontario, Canada)

Page 14: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

14

Everything is Digital

World’s capacity to store information World’s capacity to telecommunicate

Analog

Analog

DigitalDigital

[M. Hilbert and P. Lopez, The World's Technological Capacity to Store, Communicate, and Compute Information, Science, 332, April 2011, DOI: 10.1126/science.1200970]

94% digital in 2007 99% digital in 2007

Page 15: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

15

Everything is Digital

[http://blog.acronis.com/posts/data-everything-8-noble-truths]

Everything is DigitalEverything is Digital

Page 16: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

Big Data

Page 17: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

17

Big Data term reflects the three drivers

Exponential growth! beyond intuition! second half of the chessboard

Digitization Everything is digital data!Analog signals are not part of big data!

Big Data

Page 18: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

18[http://www.ibmbigdatahub.com/infographic/four-vs-big-data]

Page 19: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

19

Data, data, everywhere… Volume

The Petabyte Age! 2008

! Eric Schmidt (in 2010): Every 2 Days We Create As Much Information As We Did Up To 2003

The Zettabyte Age2015!One ! Zettabyte = Stack of books from Earth to Pluto 20 times

The internet in 2020~26.3 ! billionnetworked devices25.1 GB ! averagetraffic per capitaper month2.3 Zettabytes !annual IP-Traffic

Page 20: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

20

Data is produced continuously Velocity

Humane-produced data! ~300 million email sent/received per minute in 2016! >0.4 million tweets per minute in 2016! ~138 million google searches per minute in 2016! >400 hours of video was uploaded to YouTube per minute in 2016

Machine produced data! IoT will be boost data velocity greatly! Sensors become ubiquitous! Sensors for sound, images, position, motion, temperature, pressure, etc …! Resolution (in time and space) is continuously increasing

! Assume Waze like cars collecting 20 double values every second! With one million driving cars that is almost 10 TB every minute

Page 21: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

21

Data from all kinds of digital sources! Structure vs. unstructured! Text vs. image! Curated vs. automatically collected! Raw vs. edited vs. refined

Semantic heterogenity! Decentralized content generation! Multiple perspectives (conceptualizations) of

the reality! Ambiguity, vagueness, inconsistency

Everything is data Variety

“A lot of Big Data is a lot of small “A lot of Big Data is a lot of small data data put put put togethertogether.”

Page 22: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

22

Data is messy Veracity

Quality of captured data varies greatlySensor inaccuracy!Human mistakes!Incompleteness!Untrusted sources!Deterministic processes!

! etc.

22

“Next to the analytes, we see everything in the results, from the perfume of the lab assistant to the softener in the new machine sitting next.” —About Liquid chromatography–mass spectrometry at IPB Halle

Page 23: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

Data Science/Data Analysis

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” –Clive Humby

Value – the fifth V of Big Data

… or how to turn raw data into something valuable?

Page 24: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

24

Levels of Analysis

Reporting

Analytics

Page 25: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

25

Descriptive Analytics

How have I done? (and why?)Simplest ! class of analyticsCondense ! big data into smaller, more useful bits of informationSummary ! of what happened70! -80% penetration

ExamplesDatabase aggregation queries!Business reporting (e.g. ! Sales figures)Market survey (e.g. GFK)!(classical) business intelligence, dashboards, scorecards!Google Analytics!

Page 26: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

26

Predictive Analytics

How will I do?Next ! step up in data reductionStudies ! recent and historical dataUtilizes ! a variety of statistical, modeling, data mining and machine learning techniquesAllows (potential inaccurate) predictions ! about the futureUse data you have, to create data you do not have!15! -25% penetration

ExamplesMarket developments!Stock developments!Movie/product recommendations on ! netflix/amazonEnergy demand and supply forecasting!Preditive! policing (e.g. precobs, predpol)

Fill in the blanks

Project the curve

Page 27: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

27

Prescriptive Analytics

What should I do?! Predicts “multiple futures” based on the potential actions! Recommends the best course of action for any

pre-specified outcome! Typically involves a feedback system to track outcome

produced by the action taken! Utilizes predictive methods + optimization techniques! 1-5% penetration

Examples! Energy load balancing by flexoffer scheduling! Inventory optimization in supply chains! Targeted marketing campaign optimization! Focus treatment of clinical obesity in health care! Waze-like car navigation

the potential actions

Annual savings in U.S. in 2016

nnual100 Annn

-

vingl

00 million s in U.S. in

gs inmiles

-- 1

10001011100

savingsav0 millio

gson milles

0 milthousand CO2 tones

11111110000010 -

-

00000000 mil0

000000 tnd

000000000000000000000 ttttttmillion fuel

iledd COOO2222 toooooooonesdddd COOO

gallons

111111110000 mmmm$300-

-

mmmmmmm0000-

27

thhhhhoooouuuuussssssaaaaaandttthhhh

iiillllion fffffffuuuuu

2dddd Ceeeeeeeeell ggggaaaaaaaaaaaaalllllllloooooooonnnns

mmiillionmmmillio

0000000-400 million costs

ORION N –OOnOROOnnn-

NNORIOnnn-Road dd Integrated OOOOOOOOnnnnnnnn RRRRRRRRRRooooooooooaaaaadddddd IInt

Optimization ratedtegr

nnnnn and Navigation

Page 28: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

28

Roles in Big Data Projects

Data scientistData science is a systematic method !dedicated to knowledge discovery via data analysisIn business, optimize organizational !processes for efficiencyIn science, analyze !experimental/observational data to derive resultsTypical skills!

Statistics + (mathematics) background-Computer science: Programming, e.g.: R, (SAS,) -Java, Scala, Python; Machine learningSome domain knowledge for the problem to solve-

Data engineer! Data engineering is the domain that

develops and provides systems for managing and analyzing big data

! Build modular and scalable data platforms for data scientists

! Deploy big data solutions! Typical skills

- Computer science background- Databases- Software engineering- Massively parallel processing- Real-time processing- Languages: C++, Java, (Scala,) Python- Understand performance factors and limitations

of systems

Page 29: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

Our Focus in Teaching and Research

Page 30: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

30

Levels of Analysis

Reporting

Analytics

Methods of Data Science

M!

MethodsMeth! Lecture

of Data Scienceof Dre Datenintegrationonnnnn und nd –d –analyse

Page 31: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

31

Big Data Landscape(s)

Page 32: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

32

Data Systems

data systems are in the middle of all this

a data system…! …stores data…! …provides access to data…! …and (ideally) makes data analysis easy

Different data systems use different data models

big data

datasystems

Page 33: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

33

(Big) Data System Landscape(s)

Concept and techniques of data systems

C!

Concept aConc! Lecture

ahniq

aArchitektur

ues ouesvon

of data systemsof da

Datenbanksystemen

!!!

on! Leeeeecturre! Leeeeee!!!!!!! Lecture

and techniqand te Arrrchitektu

ueur vooooon Daaaaaaaten

e Arrrc

rrrrrrreeeeeeee Big Data Platforms

Page 34: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

34

Page 35: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

35

Graph Building Blocks

Nodes (Dots)

! Like an entity in ER! Exist on their own! Have object identity

Edges (Lines)

Like a relationship in ER!Exist only between nodes!Identity depends on the !nodes they connect

Page 36: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

36

Graph Data – Social Network

Page 37: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

37

Graph Data – Social Network

Page 38: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

38

Graph Data – Social Network

Page 39: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

39

Graph Data

Page 40: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

40

Social Graphs

FacebookMay 2013!

As of 12/2014: “1.39 billion active users with more than 400 billion edges”

[Ching et al. One Trillion Edges: Graph Processing at Facebook-Scale. PVLDB (12)2015]

Page 41: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

41

Structured Search

Example: Facebook Graph SearchFinding subgraph structures!Very natural way of formulating queries!

[http://socialnewsdaily.com/15865/facebook-social-graph-search-a-great-way-to-find-working-professionals-in-your-network/]

Page 42: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

42

Social Network Analysis

Example: Twitter communication! Users tweeting on a specific topic! Others reply of retweet! Users can be grouped based on

communication topology (-> graph clustering)

! Analysis reveals user groups and dominant communication patterns

[https://nodexlgraphgallery.org/Pages/Graph.aspx?graphID=76277]

Page 43: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

43

Citation Networks

Example: DBLP! Open bibliographic information on major

computer science journals and proceedings>3.4 million publication!>7000 new publication per month!>1.7 million authors!

[http://well-formed.eigenfactor.org/projects/well-formed/radial.html#/][Emre Sarigöl et al. Predicting Scientific Success Based On Coauthorship Networks. EPJ Data Science, 2014]

Page 44: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

44

Viral Marketing

Viral Marketing! spreading content to one person so that more than one person engaging with the content! Techniques

- Influence estimation - Influence maximization

Page 45: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

45

Knowledge Graphs

Knowledge graph of a picture

[The National Library of Wales, https://www.llgc.org.uk/blog/?p=11246]

Page 46: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

46

Knowledge Graphs

Sächsische Landesbibliothek – Staats- und Universitätsbibliothek Dresden (SLUB)! Adds semantics search to library online catalog! Utilizing multi-lingual knowledge data from Wikipedia! Significant improvements in search quality for library users

[http://www.slideshare.net/JensMittelbach/dswarm-a-library-data-management-platform-based-on-a-linked-open-data-approach]

Page 47: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

47

sed

Supply Chain Management

Supply Chain OptimizationCustomer A: 10! -30% reduction in inventoryCustomer B! : 8% reduction in transportation costs

[http://www.logicblox.com/solutions/supply-chain-optimization/]

When to send?How much to send?From where to where?

much to send?From where to where?

?From where to where?

InventoryResponseResponseeeee

Times

Page 48: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

48

Level of Analytics

Graph Data Management Applications

Page 49: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

49

The Power of Networks

Watch on YouTube: https://youtu.be/nJmGrNdJ5Gw

Page 50: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

50

(Big) Data System Landscape(s)

!! Lecture re Graph Data Management nt andnnnnnnnd Analytics

Page 51: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

51

(Big) Data System Landscape(s)

Page 52: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

52

Use Cases of Search

Searching Textual Contents

Page 53: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

53

Use Cases of Search

Stack Overflow ! Programming Q & A site

Page 54: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

54

Use Cases of Search

Image Search

Page 55: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

55

Use Cases of Search

Last.fm ! Music Search

Page 56: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

56

Example: Library of Congress

Page 57: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

57

Databases versus Information Retrieval

Database Information RetrievalMatching Exact Match Partial Match, Best MatchModel Deterministic ProbabilisticQuery Language Structured / Formal NaturalQuery Specification Complete IncompleteQueried Objects Matching RelevantError Sensitivity Sensitive Insensitive

Hard to formulate Queries!

Iterative workflow base on reponses!

Tons of results, but only a few are relevant!

Ranking of results (instead of set of results)!

Representation of document content often inadequate / inacurate!

Page 58: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

58

(Big) Data System Landscape(s)

!! Lecture re Information Retrieval

Page 59: Future Trends in Data Analytics - TU Dresden · 20 Data is produced continuously Velocity Humane-produced data!~300 million email sent/received per minute in 2016!>0.4 million tweets

59

Summary

Big DataCrossing thresholds in ! exponential growth & digitizationTechnical challenges in volume, velocity, variety, veracity, value!

Related research and lecturesData Science ! ! Datenintegration und -analyse

Data ! Systems ! Architektur von Datenbanksystemen! Big Data Platforms

Graph ! Data ! Graph Data Management and Analytics

Search ! ! Information Retrieval