Emc 2013 Big Data in Astronomy


description

A discussion of research activities on the management of Big Astronomy Data carried out by the DEXL laboratory at LNCC, Brazil

Transcript of Emc 2013 Big Data in Astronomy

Page 1: Emc 2013 Big Data in Astronomy

07/02/13


EMC Summer School on BIG DATA – NCE/UFRJ

Fabio Porto ([email protected]) LNCC – MCTI DEXL Lab (dexl.lncc.br)

Big Data in Astronomy The LIneA-DEXL case

Outline

•  Introduction
•  Big Data in Science
•  Hypothesis-Driven Research
•  Data management
   –  Data partitioning
   –  Parallel workflow processing
•  Final remarks

Page 2: Emc 2013 Big Data in Astronomy

Laboratório Nacional de Computação Científica (LNCC)

Petropolis, Rio de Janeiro

LNCC - MCTI

•  Graduate Course in Computational Modelling
   –  CAPES 6
•  BioInformatics Laboratory
   –  Roche 454 high-throughput sequencing
•  Coordinator of INCT-MACC
   –  Medicine Supported by Computational Science
•  Coordinator of SINAPAD
   –  HPC National System
•  Thematic laboratories
   –  ACIMA
   –  MARTIN
   –  DEXL
   –  COMCIDIS
   –  HEMOLAB
   –  LABINFO

Page 3: Emc 2013 Big Data in Astronomy

SINAPAD – National System of High-Performance Computing

•  Organized in CENAPADs:
   –  Universities
   –  Research Centers
•  Different architectures:
   –  Shared Disks
   –  Shared Memory
   –  GPUs

sinapad.lncc.br

6840 CPU Cores + 8192 GPU Cores ~106.6 TFlops / ~17.3 TBytes RAM / ~ 2.3 PBytes Storage


CENAPADS

Page 4: Emc 2013 Big Data in Astronomy

The DEXL Lab Mission

•  To support in-silico science with Big Data management techniques;
   –  To develop interdisciplinary research with contributions on data modelling, design and management;
   –  To develop tools and systems in support of in-silico science data management;

e-Astronomy

•  LNCC is a member of the LIneA Lab:
   –  Laboratório Inter-institucional de Astronomia
      •  O.N., LNCC, CBPF, RNP
      •  Development of e-Astronomy infrastructure in support of astronomy surveys
      •  Official southern-hemisphere DES node
•  Large astronomy surveys:
   –  Sloan Digital Sky Survey
      •  Currently SDSS-3
   –  Dark Energy Survey
      •  DES-Brazil managed by the LIneA laboratory
      •  5,000 square degrees of the sky
   –  Large Synoptic Survey Telescope
      •  20,000 square degrees of the sky
      •  Each patch visited 1000 times during 10 years
•  One of the scientific domains with extreme data processing and storage needs
•  Big Data today!

Page 5: Emc 2013 Big Data in Astronomy


LSST – Large Synoptic Survey Telescope

•  800 images per night during 10 years
•  3D map of the Universe
•  30 TeraBytes per night
•  100 PetaBytes in 10 years
   –  10^5 disks of 1 TB
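The storage figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes decimal units (1 PB = 1000 TB), and the function names are hypothetical, not part of any LSST tool.

```python
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1000 TB


def disks_needed(total_pb, disk_tb):
    """Number of disks of `disk_tb` terabytes needed to hold `total_pb` petabytes."""
    return int(total_pb * TB_PER_PB / disk_tb)


def nights_to_reach(total_pb, nightly_tb):
    """Observing nights needed to accumulate `total_pb` at `nightly_tb` per night."""
    return total_pb * TB_PER_PB / nightly_tb


print(disks_needed(100, 1))             # disks of 1 TB for 100 PB
print(round(nights_to_reach(100, 30)))  # nights at 30 TB per night
```

Under these assumptions, 100 PB corresponds to 10^5 one-terabyte disks, and 30 TB per night reaches 100 PB after roughly 3,300 nights, which is of the same order as a ten-year survey.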

Sloan Portal

Page 6: Emc 2013 Big Data in Astronomy

SkyServer – Sloan Project

Dark Energy Survey

•  Dark Energy Survey
   –  Astronomy project to explain:
      •  Acceleration of the universe
      •  Nature of dark energy
   –  Data production
      •  DECam takes images of 1 GB (400/night)
      •  Images are analyzed;
         –  galaxies and stars are identified and catalogued
      •  Catalogs are stored in database systems
         –  Estimates of 1 billion rows and 1 thousand attributes
•  LIneA is the official Brazilian contributor to the DES collaboration

Page 7: Emc 2013 Big Data in Astronomy


DES Science Pipelines

[Figure: DES science pipelines. Diagram labels: Stellar mass, LF, HOD fit; Addstar (MW, GC), Addqso; Identification, characterization; Global and local tests; Cluster industrialization; Point source catalog; Findsat, Sparse, fitmodel; Classifier, photo-z; Test environment & CTIO; Un-supervised process; Masks, random catalogs; Cosmological parameters]

Page 8: Emc 2013 Big Data in Astronomy


BIG DATA in Science

•  The scientific process is being remodelled to be carried out within an in-silico environment
•  Powerful instruments:
   –  Digital telescopes
   –  DNA sequencers
   –  Mass spectrometers
•  Huge simulations
   –  Weak lensing simulations
   –  Cardiovascular system simulation
•  Massive amounts of information stream in and out…
•  Hypothesis-driven research supported by in-silico infrastructure, methods, models…

Big Data needs for e-science

•  Data archival infrastructure;
•  Scientific life-cycle metadata management;
•  Distributed big data management;
   –  Parallel workflow processing;
   –  Parallel analytical algorithms;

Page 9: Emc 2013 Big Data in Astronomy


“Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it’s going to get worse”
–  Office of Science, Data-Management Challenge Report, DoE, 2004


Page 10: Emc 2013 Big Data in Astronomy


Scientific Experiment Life-cycle

[Figure: scientific experiment life cycle, centred on the experiment data; from Mattoso et al. 2010]

MODELLING - HYPOTHESIS-DRIVEN RESEARCH

Page 11: Emc 2013 Big Data in Astronomy

e-Science life cycle

[Figure: life-cycle diagram with the labels Phenomenon, Hypothesis Formulation, Modeling, Experiment Life-cycle and Publication]

Big Data Scenario in Scientific exploration life-cycle

[Figure, adapted from Mattoso et al. 2010. Stages: Hypothesis, experiment goals; Experiment, workflow design; Workflow preparation; Workflow execution; Post-execution analysis; Monitoring. Stores: Workflow repository; Data sources; Provenance store; Hypotheses database; Analysis results]

Page 12: Emc 2013 Big Data in Astronomy


Motivation

•  As experiments produce more and more data, extracting meaning from these data requires, among other things, contextualizing the data
•  Metadata about the research allows results to be shared, fostering collaborative work
•  Sharing knowledge about the scientific reasoning

Hypotheses in Astronomy - DES

•  Phenomenon:
   –  The expansion of the universe is speeding up
      •  Discovered by scientists in 1998 studying distant supernovae
      •  Supported by observations of redshift in the light of distant supernovae
•  Hypotheses
   –  A new odd behaviour named “Dark Energy” could make up 70% of the universe
   –  The universe is not homogeneous: it has regions with different densities (our location is special…)
•  Supporting evidence
   –  Weak gravitational lensing
   –  Galaxy clusters at different redshifts

Page 13: Emc 2013 Big Data in Astronomy


Hypothesis in Big Data Analytics

•  Scientific exploration is hypothesis-driven
   –  Nevertheless, hypotheses remain out of reach of in-silico exploration (big data analyses?)
•  Big Data analysis is explorative in nature
   –  Understanding what one is doing when exploring Big Data requires a scientific, hypothesis-driven approach
•  Corollary
   –  Big Data needs hypothesis management

Context

•  Scientists trying to understand some phenomenon
   –  Formulate hypotheses about the phenomenon's behaviour
•  Natural phenomena
   –  Simulated by computational models
   –  Explained by scientific hypotheses
•  Time-space varying
   –  Space represented by physical meshes
      •  1D, 3D, …
   –  Time reflected in simulation ticks

Page 14: Emc 2013 Big Data in Astronomy


Scientific Hypothesis Human Cardio-vascular System


Elements of hypothesis-driven research

•  Scientific Phenomenon – an observable event
   –  occurs in space-time;
   –  characterized by observable quantities;
•  Scientific Hypothesis – a falsifiable statement proposed to explain a phenomenon [Popper 2012]
   –  We are interested in a conceptual representation that puts forward the idea the hypothesis conveys
•  Mathematical Model – a language-specific formalization of a scientific hypothesis
•  Experiment – the set of computational artifacts put together to validate a scientific hypothesis;
•  Data – observed or experimental data used in validating hypotheses;

Page 15: Emc 2013 Big Data in Astronomy


Hypothesis modelling initiatives

•  Robot Scientist
   –  [R. D. King et al.] The automation of science, Science, 2009.
•  HyQue and HyBrow
   –  [A. Callahan, M. Dumontier, and N. H. Shah] HyQue: Evaluating hypotheses using semantic web technologies. Journal of Biomedical Semantics, 2(Suppl 2):S3, 2011.
   –  Modeling hypotheses as propositions in part of the domain language
      •  Bioinformatics
•  SWAN
   –  Y. Gao et al. Journal of Web Semantics, 2006
•  J. Sowa, Process Ontology

Sc Hypothesis Conceptual Model [Porto et al. ER 2008, ER 2012]

[Figure: class diagram of the scientific hypothesis conceptual model. Entities: Phenomenon; SC Hypothesis; Ph_Process (Continuous Ph_Process, Discrete Ph_Process); Mathematical Model; Mathematical Formulae XML; Physical Quantities; Phenomenon physical quantities; Formal Representation; Formal Language; Scientist; elements (constant, function, equation, variable); Observation Element; Simulated Element; Data View (query over Data view); Space-Time Dimension; Event; State; Computational Model View; Mesh; Mesh Data view; Domain ontology URL; Discrete Phenomenon Simulation. Relationships: explains; formulatedBy; isTheBlendOf; isBasedOn; represented_as; compared_with; represented with; modeled_as; refers-to; transforms; represents; topologically modeled by; isAuthor. The multiplicities (0..1, 1..1, 0..n, 1..n, 1..m) cannot be matched to their associations from the transcript.]

Page 16: Emc 2013 Big Data in Astronomy


Modelling Hypotheses and their interconnections

[Figure: a lattice-theoretic representation of the interconnections among hypotheses. Nodes: Dark Energy; Non-uniform universe; Weak lensing; Galaxy clustering; Earth special location; plus the two bounding elements of the lattice, both rendered as “Τ” in the transcript]

Focus on Hypothesis modeling

•  Scientific hypothesis formulation as a conceptual entity
•  Structuring of research evolution
•  Isomorphic representation of: hypothesis, scientific model and phenomenon
•  Structure amenable to data representation, association, querying and publishing

Page 17: Emc 2013 Big Data in Astronomy


Hypotheses Structuring: Lattice

The core entities of the hypothesis conceptual model

Page 18: Emc 2013 Big Data in Astronomy


Representation Isomorphism


Application: Linked Science

•  An initiative to provide machine-readable content describing scientific exploration;
•  To support the reproducibility of experiments;
•  To foster the reuse of previous results;
•  The community needs a more “open” science

Page 19: Emc 2013 Big Data in Astronomy


Linked Science (or Linked Open Science)

•  An initiative to interconnect all scientific assets;
•  It is a combination of:
   –  Linked data, semantic web;
   –  Open source;
   –  Scientific workflows and provenance (OPM);
   –  Scientific models;
   –  Cloud computing;
   –  …

Linked Science Core Vocabulary (LSC)

•  Defines a vocabulary (LSC) with “basic” terms for science;
   –  More specific terminology shall be added by individual communities (minimal ontological commitment)

Page 20: Emc 2013 Big Data in Astronomy


LSC Core Vocabulary


Extension to LSC

Page 21: Emc 2013 Big Data in Astronomy


Published Research as Linked Data (1)

Blanco et al.’s published research as an LSC instantiation.

rdfs:Class  rdf:Resource  --property-->  rdf:Literal

lsc:Researcher  authors1  --rdf:value-->  “P.J. Blanco, M.R. Pivello, S.A. Urquiza, and R.A. Feijóo.”
lsc:Research  research1  --dc:description-->  “Simulation of hemodynamic conditions in the carotid artery.”
lsc:Publication  pub1  --dc:title-->  “On the potentialities of 3D–1D coupled models in hemodynamics simulations.”
lsc:Data  dataset1  --dc:description-->  “Flow rate of 5.0 l/min as an inflow boundary condition at the aortic root, in observation of Avolio (1980) and others.”
lsc:Data  dataset2  --dc:description-->  “1D mechanical and geometric data from Avolio (1980).”
lsc:Data  dataset3  --dc:description-->  “MRI images processed for reconstructing the 3D geometry of both the left femoral and the carotid arteries.”
Phenomenon  p17  --dc:description-->  “Blood flow in the carotid artery.”
tisc:Region  region1  --dc:description-->  “The carotid artery, a part of the human CVS.”
owl:IntervalEvent  beat1  --dc:description-->  “A heart beat with period T = 0.8 s.”
Observable  ob1  --dc:description-->  “Blood flow rate.”
Observable  ob2  --dc:description-->  “Blood pressure.”
lsc:Hypothesis  h17  --rdfs:label-->  “blend(h13, h15, h16)”
Model  m17  --dc:description-->  “3D-1D coupled model with lumped windkessel terminals.”


Published Research as Linked Data (2)

Blanco et al.’s published research as an LSC instantiation.

rdfs:Class  rdf:Resource  --property-->  rdf:Literal

lsc:Data  dataset4  --dc:description-->  “Plots of hemodynamic observables in the left femoral artery produced to validate the hypothesis.”
lsc:Data  dataset5  --dc:description-->  “Plots of hemodynamic observables in the carotid artery.”
lsc:Data  dataset6  --dc:description-->  “Scientific visualization of hemodynamic observables in the left femoral artery produced to validate the hypothesis.”
lsc:Data  dataset7  --dc:description-->  “Scientific visualization of hemodynamic observables in the carotid artery both with and without aneurysm.”
lsc:Prediction  predict1  --rdf:value-->  “Sensitivity of local blood flow in the carotid artery to the heart aortic inflow condition.”
lsc:Prediction  predict2  --rdf:value-->  “Sensitivity of the cardiac pulse to the presence of an aneurysm in the carotid.”
lsc:Conclusion  conclusion1  --rdf:value-->  “3D-1D coupled models allow to perform quantitative and qualitative studies about how local and global phenomena are related, which is relevant in hemodynamics.”

Page 22: Emc 2013 Big Data in Astronomy


Find in Blanco et al.'s microtheory a hypothesis (if any) explaining phenomena of blood flow in microvascular vessels and show which model formulates it.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX lsc: <http://linkedscience.org/lsc/ns#>
SELECT ?hypothesis_name ?model_name
WHERE {
  ?h rdfs:label ?hypothesis_name .
  ?m rdfs:label ?model_name .
  ?h a lsc:Hypothesis .
  ?p a lsc:Phenomenon .
  ?m a lsc:Model .
  ?h lsc:explains ?p .
  ?m lsc:formulates ?h .
  ?p dc:description ?d .
  FILTER regex(?d, "blood flow", "i") .
  FILTER regex(?d, "microvascular", "i")
}

Remarks

•  Hypothesis modeling reflects the scientist's mental model during data analyses;
   –  supports hypothesis-driven data exploration
   –  extends current eScience infrastructure;
•  Scientific Hypotheses, Models and Phenomena are the main primitives;
•  The primitives may be represented as isomorphic lattices with semantic associations among themselves;
•  One can search, discover and mine hypotheses and related scientific artefacts;
•  ER 2012 – MODIC Workshop
•  ISWC 2012 – Linked Science workshop

Page 23: Emc 2013 Big Data in Astronomy


DATA MANAGEMENT


Dark Energy Survey

•  Dark Energy Survey
   –  Astronomy project to explain:
      •  Acceleration of the universe
      •  Nature of dark energy
   –  Data production
      •  DECam takes images of 1 GB (400/night)
      •  Images are analyzed; galaxies and stars are identified and catalogued
      •  Catalogs are stored in database systems

Page 24: Emc 2013 Big Data in Astronomy


Dark Energy Survey Project

•  Main technical (CS) issue:
   –  Managing huge catalogs
   –  Relations loaded from standard FITS files
•  Database features
   –  Single relation for each catalog
   –  Volume: 1 billion tuples x 1000 attributes (300 GB)
   –  Queries
      •  Users submit ad-hoc queries to the database
      •  Usually too many results for each query
         –  Need to choose the best results, e.g. using top-k techniques
      •  Some queries scan the whole database
         –  Looking for clusters of stars

Processing Astronomy data

[Figure: the astronomy catalogs are accessed in two ways: user access (ad-hoc queries, downloads) and scientific workflows (analysis)]

Page 25: Emc 2013 Big Data in Astronomy


Ad-Hoc Queries

•  Submitted by users through the portal;
•  For small queries (regions of the sky)
   –  Indexing based on ra, dec (e.g. Q3C)
      •  [Koposov, S., Bartunov, O., 2006] Q3C, Quad Tree Cube, Astronomical Data Analysis Software and Systems, 2006
      •  HTM, Hierarchical Triangular Mesh, MS SQL Server, Sloan
   –  Spatial functions (e.g. radial search)
   –  Queries on other attributes need more fine-grained criteria
•  For large queries (whole sky)
   –  Exploit parallelism over partitioned data
•  Data partitioning is efficient for both small and large queries
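To make the radial search mentioned above concrete, here is a minimal sketch of a cone search over (ra, dec) positions. It is a plain linear scan using the haversine formula, written only for illustration; it is not how Q3C or HTM are implemented, since their purpose is precisely to avoid scanning every row. The function names and the dictionary-based catalog layout are assumptions.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions (all angles in degrees)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def radial_search(catalog, ra0, dec0, radius_deg):
    """Return the rows whose position lies within radius_deg of (ra0, dec0)."""
    return [row for row in catalog
            if angular_separation_deg(row["ra"], row["dec"], ra0, dec0) <= radius_deg]


# Hypothetical three-object catalog
catalog = [
    {"id": 1, "ra": 10.0, "dec": 0.0},
    {"id": 2, "ra": 10.5, "dec": 0.2},
    {"id": 3, "ra": 50.0, "dec": 30.0},
]
nearby = radial_search(catalog, ra0=10.0, dec0=0.0, radius_deg=1.0)
```

A spatial index replaces the scan with a lookup of the few index cells that intersect the search circle, which is why such queries stay cheap even on very large catalogs.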

Astronomer’s coordinate system

Page 26: Emc 2013 Big Data in Astronomy

Workflow queries

•  Workflows process data retrieved from the catalog
   –  Two systems
      •  Workflow engine
      •  Database engine
   –  Lack of integration
      •  Upper bound on performance
   –  Large queries
      •  Parallelism obtained by data partitioning is jeopardized by the consolidation of results performed by the DBMS;
      •  The workflow engine receives the data and redistributes it to parallelize activities
   –  Concurrency among workflows
      •  May impose huge penalties

Need to partition data

•  Beneficial for both access patterns
   –  Ad-hoc and workflow
•  How to apply it?
•  Vertical partitioning
   –  Already applied, based on semantic clustering of attributes
      •  Ra, dec
      •  Photometry, spectrometry, astrometry
•  Horizontal partitioning
   –  Ra, Dec (the current approach)
   –  More fine-grained criteria
      •  Being developed in collaboration with INRIA Montpellier

Page 27: Emc 2013 Big Data in Astronomy


First Step: Hybrid Data Partitioning(HDP)

[Figure: hybrid data partitioning. For each horizontal partitioning criterion (Criterion 1 … Criterion k; standard criterion: a range of ra, dec), the catalog is split vertically into fragments that share the Id key: Catalog (Id, Ra, Dec), Catalog-ph (Id, photometry), Catalog_a (Id, astrometry) and Catalog_s (Id, spectrometry)]
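The hybrid scheme can be sketched in a few lines: rows are first assigned to a horizontal partition by a range criterion, and each row is then split into vertical fragments that repeat the id key. This is only an illustration under stated assumptions: the ra-range bucketing, the `ra_step` parameter, the fragment names and the single-value attribute groups are all hypothetical simplifications of the real catalog.

```python
from collections import defaultdict

# Assumed vertical fragments: fragment name -> attributes stored alongside the id
VERTICAL_FRAGMENTS = {
    "catalog": ("ra", "dec"),
    "catalog_ph": ("photometry",),
    "catalog_s": ("spectrometry",),
    "catalog_a": ("astrometry",),
}


def horizontal_bucket(row, ra_step):
    """Horizontal criterion (assumed): index of the ra range the row falls into."""
    return int(row["ra"] // ra_step)


def hybrid_partition(rows, ra_step=90.0):
    """Return {bucket: {fragment name: [fragment rows]}}."""
    parts = defaultdict(lambda: {name: [] for name in VERTICAL_FRAGMENTS})
    for row in rows:
        bucket = horizontal_bucket(row, ra_step)
        for name, columns in VERTICAL_FRAGMENTS.items():
            fragment_row = {"id": row["id"], **{c: row[c] for c in columns}}
            parts[bucket][name].append(fragment_row)
    return dict(parts)


rows = [
    {"id": 1, "ra": 10.0, "dec": -5.0, "photometry": "p1", "spectrometry": "s1", "astrometry": "a1"},
    {"id": 2, "ra": 100.0, "dec": 20.0, "photometry": "p2", "spectrometry": "s2", "astrometry": "a2"},
]
parts = hybrid_partition(rows, ra_step=90.0)
```

A query that only needs photometry for one region of the sky then touches a single fragment of a single horizontal partition.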

IMPLEMENTATION ALTERNATIVES

Page 28: Emc 2013 Big Data in Astronomy

Using PGPOOL-II

•  Pgpool-II
   –  Implemented on top of PostgreSQL 9.1
   –  A central node coordinates data/query distribution/replication
   –  Requests are distributed through the nodes
   –  Parallel query processing
      •  Data partitioning based on a table column range (e.g. id)
      •  For short queries, may reduce the amount of data accessed
   –  Load balance
      •  Concurrent requests are directed to different DB copies

Parallelism & LoadBalance

[Figure: one Pgpool-II node in parallel-query mode, two Pgpool-II nodes in replication mode, and four PostgreSQL servers]

Page 29: Emc 2013 Big Data in Astronomy


Evaluation

•  Strengths
   –  Extends PostgreSQL
   –  Load-balances queries from concurrent workflows
   –  Scales up to 128 DB nodes
•  Weaknesses
   –  Lack of support for spatial functions
   –  Partitioning based on a single column
   –  Ingestion can’t use COPY

QServ - LSST

•  Developed by the LSST DM team
•  Astronomy data management
•  Horizontal partitioning based on declination zones (nodes), with the data on each node distributed into chunks based on RA
•  Approx. 1000 partitions
•  Native support for spatio-temporal functions
•  Built on top of MySQL
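The zone-and-chunk idea can be illustrated with a simplified equal-angle grid: declination selects a zone and right ascension selects a chunk within it. This is a sketch of the general principle only and is not Qserv's actual partitioning code; the function name and the default grid sizes are assumptions chosen for the example.

```python
def chunk_id(ra_deg, dec_deg, num_zones=18, chunks_per_zone=36):
    """Map a sky position to (zone, chunk) on an equal-angle grid (illustrative only)."""
    ra = ra_deg % 360.0
    zone_height = 180.0 / num_zones          # degrees of declination per zone
    chunk_width = 360.0 / chunks_per_zone    # degrees of right ascension per chunk
    # Clamp so that dec = +90 falls into the last zone rather than a new one
    zone = min(int((dec_deg + 90.0) // zone_height), num_zones - 1)
    chunk = min(int(ra // chunk_width), chunks_per_zone - 1)
    return zone, chunk


position = chunk_id(15.0, 0.0)
```

With 18 zones of 36 chunks each this toy grid has 648 partitions, the same order of magnitude as the roughly 1000 partitions quoted on the slide. An equal-angle grid gives chunks of unequal sky area near the poles, which a production scheme would need to compensate for.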

Page 30: Emc 2013 Big Data in Astronomy


Evaluation

•  Strengths
   –  Designed to support astronomy survey data
   –  Highly scalable: ~1000 nodes
   –  First performance results are very promising
   –  Alignment with the LSST project
•  Weaknesses
   –  Current culture based on PostgreSQL

Context (3/3)

•  Requirement
   –  Efficient data storage and processing
•  Challenges
   –  Large database size
   –  High number of attributes
   –  Evolving workload
   –  Mostly scan processing
•  Questions:
   a)  How to efficiently process queries over catalogs?
   b)  How to efficiently process scientific workflows over catalogs?

Page 31: Emc 2013 Big Data in Astronomy


Current activities at DEXL

a)  Design data partitioning strategies
    –  Cooperation with INRIA Montpellier, Zenith group
    –  Partition the data into blocks
       •  such that the number of blocks accessed by queries is minimized
       •  Each block can be stored on a different machine
b)  Efficient execution of scientific workflows over partitioned data

a) Intuition

•  Without partitioning: a query Q (or a scientific workflow) takes time proportional to the total amount of data to be processed
•  With partitioning: queries Q’, Q’’, Q’’’ (and scientific workflows) take time proportional to the size of the data partitions they access

Page 32: Emc 2013 Big Data in Astronomy


Partitioning the DB into Blocks

[Figure: relation R(a1,…,a9) split into blocks B1, B2, …, Bm]

How to compute the best partitioning?

Problem statement

•  Given
   –  Single-relation database R(a1,…,an), n ~ 1000
   –  Initial workload: a set of k queries W0 = {q1,…,qk}
   –  m empty fixed-size blocks
•  Assumptions
   –  Accessing a block ≈ accessing all its tuples
   –  New tuples and queries arrive periodically
   –  No attribute is privileged
•  Goal
   –  Minimize the total number of block accesses during the execution of queries by:
      •  Optimal partitioning of R’s data into blocks
      •  Optimal query execution
   –  Adapt to the arrival of new data and queries
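The goal stated above can be written as a small cost function. In this sketch, which is an illustration under simplifying assumptions rather than the project's code, each query is reduced to the set of tuple ids it reads, a partitioning is a mapping from tuple id to block id, and a block counts once per query that touches any of its tuples.

```python
def total_block_accesses(workload, block_of):
    """Sum, over all queries, of the number of distinct blocks each query touches.

    workload: list of sets of tuple ids (one set per query)
    block_of: dict mapping tuple id -> block id
    """
    return sum(len({block_of[t] for t in query}) for query in workload)


# Hypothetical workload over seven tuples
workload = [{1, 2}, {2, 4}, {3, 5, 6}, {6, 7}]

# Two candidate partitionings into two blocks
clustered = {1: "B1", 2: "B1", 4: "B1", 3: "B2", 5: "B2", 6: "B2", 7: "B2"}
alternating = {1: "B1", 2: "B2", 3: "B1", 4: "B2", 5: "B1", 6: "B2", 7: "B1"}

cost_clustered = total_block_accesses(workload, clustered)
cost_alternating = total_block_accesses(workload, alternating)
```

Keeping tuples that are queried together in the same block lowers the cost (4 accesses versus 7 in this toy case), which is the quantity the partitioning aims to minimize.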

Page 33: Emc 2013 Big Data in Astronomy


Overview of the solution

•  Data partitioning: graph-based algorithm
   –  Nodes: each data item (e.g. tuple) is represented by a node in the graph
   –  Edges: there is an edge between two data items if they are accessed by a common query
   –  Edge weight: the number of queries that access both data items
   –  Goal: partition the graph into m equal-size sub-graphs with minimum edge cut
      •  Use a min-cut algorithm
•  Block explanation
   –  Blocks are explained in terms of queries
      •  Each block is assigned an explaining query: Bi = vi(R)
•  Query processing
   –  Queries are compared to explaining queries
   –  Matching blocks are selected (we haven’t worked on that yet)
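The graph construction and the cut objective can be sketched as follows. The exhaustive search over balanced bisections is an assumption made for the sake of a self-contained toy example: it only works for a handful of nodes, whereas real workloads call for a dedicated graph or hypergraph partitioner. All function names and the sample workload are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_access_graph(workload):
    """Edge weights: number of queries that access both endpoints of each pair."""
    weights = Counter()
    for query in workload:
        for a, b in combinations(sorted(query), 2):
            weights[(a, b)] += 1
    return weights


def cut_weight(weights, part):
    """Total weight of the edges with exactly one endpoint inside `part`."""
    return sum(w for (a, b), w in weights.items() if (a in part) != (b in part))


def best_bisection(nodes, weights):
    """Brute-force minimum-cut split into two near-equal halves (toy sizes only)."""
    nodes = sorted(nodes)
    half = len(nodes) // 2
    best = min((frozenset(c) for c in combinations(nodes, half)),
               key=lambda part: cut_weight(weights, part))
    return set(best), set(nodes) - best


# Hypothetical workload: each query is the set of row ids it accesses
workload = [{1, 2}, {1, 4}, {2, 4}, {3, 5}, {5, 6}, {6, 7}, {3, 7}, {4, 5}]
weights = co_access_graph(workload)
block1, block2 = best_bisection(range(1, 8), weights)
```

In this example only the query touching rows 4 and 5 crosses the cut, so every other query is answered from a single block.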

Partitioning strategy (Schism: VLDB2010)

[Figure sequence, slides 66 to 74, for a catalog of seven rows numbered 1 to 7. For each vertical fragment:
1.  We create a node for each row.
2.  We increment the arc weight when two rows are accessed together by a query (first q1, then q2, then the whole workload W = {q1,…,qn}).
3.  We execute a min-cut algorithm.
4.  Each partition is assigned a block: rows {1, 2, 4} and rows {3, 5, 6, 7} form the two blocks B1 and B2.]

Page 38: Emc 2013 Big Data in Astronomy


Partitioned data with queries

[Figure: rows 1, 2, 4 and rows 3, 5, 6, 7 stored in blocks B1 and B2, with the blocks annotated by the query sets {q3, q5, …, q13} and {q1, q2, …, q14}]

•  Each block is associated with the queries that access some records of the block
•  For a given query q, the number of accessed blocks is minimized

Adaptive Strategy (1/2)

•  New tuple arrival: [DEXA 2012]
   –  Select the best block
      •  i.e. the block to which the new tuple is most correlated
   –  Challenges:
      •  How to select the best block with minimum effort?
         –  Initial approach: find it based on the correlation of queries to blocks
         –  Define optimal allocation
         –  Compute actual allocation efficiency
         –  Compute block affinity
      •  What if the best block is full?
         –  Initial approach: split the block
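One way to picture the allocation of a new tuple is shown below. The affinity measure used here, the number of queries shared between the tuple and a block, is an assumption made for illustration; the slide does not define the measure, and the cited DEXA 2012 work should be consulted for the actual definitions of allocation efficiency and block affinity. The data layout and function names are hypothetical.

```python
def block_affinity(tuple_queries, block_queries):
    """Assumed affinity: how many queries access both the new tuple and the block."""
    return len(tuple_queries & block_queries)


def choose_block(tuple_queries, blocks, capacity):
    """Pick the block with the highest affinity; flag it for splitting if it is full.

    blocks: dict block id -> {"queries": set of query ids, "size": number of tuples}
    Ties are broken by block id so the choice is deterministic.
    """
    ranked = sorted(
        blocks,
        key=lambda b: (-block_affinity(tuple_queries, blocks[b]["queries"]), b),
    )
    best = ranked[0]
    needs_split = blocks[best]["size"] >= capacity
    return best, needs_split


blocks = {
    "B1": {"queries": {"q1", "q2"}, "size": 3},
    "B2": {"queries": {"q3", "q5"}, "size": 4},
}
choice = choose_block({"q3", "q5", "q9"}, blocks, capacity=4)
```

Here the new tuple is most correlated with B2, which is already at capacity, so the sketch reports that the block would have to be split before the tuple is stored.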

Page 39: Emc 2013 Big Data in Astronomy


Allocation based on affinity to blocks


Elapsed-time of incrementing the DB as the size increases

[Plot: execution time in seconds (log scale, 0.1 to 1000) versus DB size (2M to 20M), with series labelled “static”, “DynPart, |D!| = 500 k” and “DynPart, |D!| = 1M”; the symbol inside |·| is garbled in the transcript]

Experiment:
–  Sloan DR8: 350 million tuples
–  Workload: 27,000 synthetic queries
–  PaToH: hypergraph partitioner

Page 40: Emc 2013 Big Data in Astronomy


E-ASTRONOMY WORKFLOWS OVER PARTITIONED DATA


Processing Scientific Workflows

•  Analytical workflows process a large part of the catalog data
   –  Catalogs are supported by few indexes, thus most queries scan tens to hundreds of millions of tuples
•  Parallelization comes to the rescue to reduce the elapsed time of analyses, but
   –  There is a compromise between:
      •  Data partitioning and degree of parallelization;
   –  Current solutions consider:
      •  Centralized files to be distributed through nodes (MapReduce)
         –  [Alagiannis, SIGMOD, 2012] NoDB: reading raw files without data ingestion;
      •  Distributed databases (Qserv) to serve workflow engines
         –  [Wang, D. L., 2011] Qserv: A Distributed Shared-Nothing Database for the LSST catalog;
      •  Centralized databases to serve a workflow engine (Orchestration, LIneA)
      •  Partitioned databases to serve distributed queries (HadoopDB)

Page 41: Emc 2013 Big Data in Astronomy


HadoopDB - a step in between [Abouzeid09]

•  Offers parallelism and fault tolerance as Hadoop does, with SQL queries pushed down to the PostgreSQL DBMS;
•  Pushed-down queries are implemented as MapReduce functions;
•  Data are partitioned across nodes.
   –  Partitioning information is stored in the catalog
   –  Distributed through the N nodes

HadoopDB architecture

[Figure: an SQL query enters the SMS Planner, which uses the Catalog and the MapReduce framework; each of Node 1, Node 2, …, Node n hosts a Task Tracker, a Database and a DataNode]

Page 42: Emc 2013 Big Data in Astronomy


Example

Query:

    SELECT YEAR(SalesDate), SUM(revenue)
    FROM Sales
    GROUP BY YEAR(SalesDate)

a)  Table partitioned by YEAR(SalesDate): the whole query runs in the Map phase, followed by a FileSink operator.
b)  No partitioning by YEAR(SalesDate): the Map phase runs the query on each node and ends with a Reduce Sink operator; the Reduce phase applies the Group by and Sum operators, followed by a FileSink operator.
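The difference between the two plans can be mimicked in plain Python. This is an illustrative sketch, not HadoopDB code: each partition is a list of (year, revenue) pairs, the per-node SQL is replaced by a local aggregation function, and all names are hypothetical.

```python
from collections import Counter


def local_group_by(rows):
    """What each node computes locally: revenue summed per year."""
    totals = Counter()
    for year, revenue in rows:
        totals[year] += revenue
    return dict(totals)


def plan_partitioned_by_year(partitions):
    """Case a): every year lives on one node, so local results are already final."""
    result = {}
    for rows in partitions:
        result.update(local_group_by(rows))  # no key appears on two nodes
    return result


def plan_arbitrary_partitioning(partitions):
    """Case b): a year may be spread over nodes, so partial sums must be merged."""
    merged = Counter()
    for rows in partitions:
        merged.update(local_group_by(rows))  # Counter.update adds partial sums
    return dict(merged)


by_year = [[(2011, 10), (2011, 5)], [(2012, 7)]]
arbitrary = [[(2011, 10), (2012, 7)], [(2011, 5)]]
```

Both plans return the same totals, but only the second needs the extra merge step that corresponds to the Reduce phase.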


Page 43: Emc 2013 Big Data in Astronomy


Traditional WF–Database decoupled architecture

[Figure: the database partitions DBp1, DBp2 and DBp3 feed a single workflow engine running activities act1, act2 and act3; data is consolidated as input to the workflow engine]

Problems

•  Data locality
   –  Workflow activities run on nodes that are remote with respect to the partitioned data;
•  Load balance
   –  Local processes face different processing times

Page 44: Emc 2013 Big Data in Astronomy


Data locality

•  Traditional distributed query processing pushes operations through joins and unions so that they can be executed close to the data partitions;
•  Can we “localize” workflow activities?
   –  Moving activities within workflows requires operation semantics to be exposed
   –  Mapping of workflow activities to a known algebra
   –  Equivalence of algebra expressions enables pushing down operations

Algebraic transformation

[Figure: four stages of the algebraic transformation of a workflow over relations R, S and T (with intermediate relations U, Q and V) and activities Map and Filter: (i) workflow, relation perspective; (ii) decomposition; (iii) anticipation; (iv) procrastination]

Page 45: Emc 2013 Big Data in Astronomy


Workflow optimization process

[Figure: flowchart of the optimization process. Inputs: initial algebraic expressions, transformation rules, cost model. Steps: generation of the search space, producing equivalent algebraic expressions; evaluation of the search strategy; decision “Search more?” (yes: repeat; no: output the optimized algebraic expressions)]

Pushing down workflow activities

•  A first naïve attempt
   –  Push down all operations before a Reduce;
•  Use a MapReduce implementation where
   –  Mappers execute the “pushed-down” operations close to the data
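The naive push-down can be sketched as a tiny in-process MapReduce: every operation that precedes the Reduce is applied by a mapper to its own partition, and only the small partial results travel to the combining step. The code below is a hedged illustration; the operations (a magnitude filter and a count) and every name in it are hypothetical and are not taken from the LIneA workflows.

```python
from functools import reduce


def run_pushed_down(partitions, local_ops, combine, initial):
    """Apply all pre-Reduce operations per partition, then combine the partial results."""
    def mapper(rows):
        for op in local_ops:      # the pushed-down activities, in workflow order
            rows = op(rows)
        return rows

    partials = [mapper(rows) for rows in partitions]  # would run close to each partition
    return reduce(combine, partials, initial)         # the single Reduce step


# Hypothetical activities over partitions of object magnitudes
def keep_bright(magnitudes):
    return [m for m in magnitudes if m < 20.0]


def count(magnitudes):
    return len(magnitudes)


partitions = [[18.0, 21.5, 19.2], [22.0, 17.1]]
bright_total = run_pushed_down(partitions, [keep_bright, count], lambda a, b: a + b, 0)
```

Each mapper ships a single integer instead of its rows, which is the data-transfer saving the push-down is after.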

Page 46: Emc 2013 Big Data in Astronomy


Typical Implementation at LineA Portal

[Figure: spatial partitioning of the catalog DB]

Parallel workflow over partitioned data

[Figure: the partitioned catalogue is stored on PostgreSQL as DBp1, DBp2, …, DBpn; a SkyMap activity runs on each partition and a single SkyAdd activity combines their outputs]

Page 47: Emc 2013 Big Data in Astronomy


HQOOP - Parallelizing Pushed-down Scientific Workflows

•  Partition data across cluster nodes
   –  Partitioning criteria
      •  Spatial (currently used and necessary for some applications)
      •  Random (possible in SkyMap)
      •  Based on query workload (Miguel Liroz-Gistau’s work)
•  Process the workflow close to the data location
   –  Reduces data transfer
•  Use the Apache Hadoop implementation to manage parallel execution
   –  Widely used in Big Data processing
   –  Implements the MapReduce programming paradigm
   –  Fault tolerance for failed Map processes
•  Use QEF as the workflow engine
   –  Implements the Mapper interface
   –  Runs workflows in Hadoop seamlessly
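The first two partitioning criteria can be contrasted in a short Python sketch. It assumes objects are `(ra, dec)` pairs and that spatial partitioning means cutting the sky into right-ascension stripes. Both choices are illustrative and are not the scheme used at the LIneA portal.

```python
import random

def spatial_partition(objects, n_parts, ra_max=360.0):
    """Assign each (ra, dec) object to a partition by right-ascension stripe."""
    parts = [[] for _ in range(n_parts)]
    width = ra_max / n_parts
    for ra, dec in objects:
        index = min(int(ra // width), n_parts - 1)  # clamp ra == ra_max
        parts[index].append((ra, dec))
    return parts

def random_partition(objects, n_parts, seed=0):
    """Assign each object to a uniformly random partition."""
    rng = random.Random(seed)
    parts = [[] for _ in range(n_parts)]
    for obj in objects:
        parts[rng.randrange(n_parts)].append(obj)
    return parts

objects = [(0.5, 1.0), (95.0, 2.0), (185.0, 3.0), (359.9, 4.0)]
by_stripe = spatial_partition(objects, 4)
by_chance = random_partition(objects, 4)
```

Spatial partitioning keeps neighbouring objects together, which matters for activities that look at neighbourhoods. Random partitioning balances load but only suits activities that treat each object independently.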


Perspective

[Figure: systems positioned along three dimensions (data distribution, query distribution, workflow parallelization): HadoopDB+Hive; Qserv + workflow engine; orchestration layer, MapReduce; Hadoop+Kepler; HQOOP.]


Integrated architecture

[Figure: three workflow engine instances, each running activities act1, act2 and act3 over its own database (DB1, DB2, DB3); their outputs are combined into the final result.]

Experiment Set-up

•  SGI cluster
   –  Configurations: 1, 47 and 95 nodes
   –  Each node:
      •  2 Intel Xeon X5650 processors, 6 cores, 2.67 GHz
      •  24 GB RAM
      •  500 GB HD
•  Data
   –  Catalog DC6B
•  Hadoop
   –  QEF workflow engine


Preliminary Results

•  Preliminary results are encouraging:
   –  Baseline orchestration layer (234 nodes): approx. 46 min
   –  1 node HQOOP: approx. 35 min
   –  4 nodes HQOOP: approx. 12.3 min
   –  95 nodes (94 workers) HQOOP: approx. 2.10 min
   –  95 nodes (94 workers) Hadoop+Python: approx. 2.4 min
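To relate these timings to linear scale-up, the speed-up and parallel efficiency of the HQOOP runs can be worked out against the 1-node HQOOP time. The sketch below uses only the elapsed times listed above; the function names are illustrative.

```python
T_ONE_NODE = 35.0  # minutes, 1 node HQOOP

# (number of workers, elapsed minutes) for the HQOOP runs listed above
hqoop_runs = {"4 nodes": (4, 12.3), "94 workers": (94, 2.10)}

def speedup(t_one, t_n):
    """How many times faster the parallel run is than the 1-node run."""
    return t_one / t_n

def efficiency(t_one, t_n, n):
    """Fraction of linear scale-up achieved with n workers."""
    return speedup(t_one, t_n) / n

for label, (n, minutes) in hqoop_runs.items():
    s = speedup(T_ONE_NODE, minutes)
    e = efficiency(T_ONE_NODE, minutes, n)
    print(f"{label}: speed-up {s:.1f}x, efficiency {e:.0%}")
```

This gives a speed-up of about 2.8x on 4 nodes (about 71% of linear) and about 16.7x on 94 workers (about 18% of linear).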


Resulting Image


Conclusions

•  Big data users (scientists) are in big trouble
   –  Too much data, too fast, too complex
•  Different kinds of expertise must cooperate towards Big Data management
•  Adapted software development methods based on workflows
•  Complete support for the scientific exploration life-cycle
•  Efficient workflow execution on Big Data


Collaborators

•  LNCC Researchers
   –  Ana Maria de C. Moura
   –  Bruno R. Schulze
   –  Antonio Tadeu Gomes
•  PhD Students
   –  Bernardo N. Gonçalves
   –  Rocio Millagros
   –  Douglas Ericson de Oliveira
   –  Miguel Liroz-Gistau (INRIA)
   –  Vinicius Pires (UFC)


Collaborators

•  ON
   –  Angelo Fausti
   –  Luiz Nicolaci da Costa
   –  Ricardo Ogando
•  COPPE-UFRJ
   –  Marta Mattoso
   –  Jonas Dias (PhD Student)
   –  Eduardo Ogasawara (CEFET-RJ)
•  UFC
   –  Vania Vidal
   –  José Antonio F. de Macedo
•  PUC-Rio
   –  Marco Antonio Casanova
•  INRIA-Montpellier
   –  Patrick Valduriez group
•  EPFL
   –  Stefano Spaccapietra


Overall performance

[Figure: two bar charts comparing elapsed time (min) with linear scale-up for Baseline (234 nodes), 1 node HQOOP, 4 nodes HQOOP, 94 nodes HQOOP and 94 nodes Hadoop; the second chart is labelled “% Linear Scale-up”.]

[Figure: two stacked bar charts of Hadoop time and Reduce time for 47 and 94 nodes, with and without QEF. First chart: CENT configurations (axis from 0 to 1,400,000). Second chart: DIST configurations (axis from 0 to 160,000). Units are not stated.]


Execution with 4 nodes

Elapsed-time total: 11.27 min


Adaptive and Extensible Query Engine

•  Extensible to data types
•  Extensible to application algebra
•  Extensible to execution model
•  Extensible to heterogeneous data sources

Objective

•  Offer a query processing framework that can be extended to adapt to the needs of data-centric applications
•  Offer transparency in using resources to answer queries
•  Introduce query optimization transparently
•  Standardize remote communication using web services, even when dealing with large amounts of unstructured data
•  Run-time performance monitoring and decision making


Control Operators

•  Add data-flow and transformation operators
•  Isolate application-oriented operators from execution-model data-flow concerns
•  Parallel grid-based execution model:
   –  Split/Merge - controls the routing of tuples to parallel nodes and the corresponding unification of multiple routes into a single flow
   –  Send/Receive - marshalling/unmarshalling of tuples and interface with communication mechanisms
   –  B2I/I2B - blocks and unblocks tuples
   –  Orbit - implements a loop in a data-flow
   –  Fold/Unfold - logical serialization of complex structures (e.g. PointList to Points)
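Two of these operator pairs can be sketched in Python to show what they do to a tuple stream. This is not QEF code. It assumes round-robin routing for Split, and it reads I2B as the operator that groups tuples into blocks and B2I as the one that unpacks them, which the slide does not spell out.

```python
from itertools import chain, islice

def split(tuples, n_routes):
    """Split: route tuples round-robin to n_routes parallel flows (assumed policy)."""
    routes = [[] for _ in range(n_routes)]
    for i, t in enumerate(tuples):
        routes[i % n_routes].append(t)
    return routes

def merge(routes):
    """Merge: unify several routes into a single flow."""
    return list(chain.from_iterable(routes))

def i2b(tuples, block_size):
    """Group individual tuples into blocks of at most block_size."""
    it = iter(tuples)
    while block := list(islice(it, block_size)):
        yield block

def b2i(blocks):
    """Unpack blocks back into individual tuples."""
    for block in blocks:
        yield from block

routes = split(range(5), 2)
blocks = list(i2b(range(5), 2))
```

Blocking before a Send reduces the number of messages exchanged with remote nodes, which is one reason to keep these concerns out of the application operators.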


The Execution Model

[Figure: example of a simple QEF workflow. Data sources (input) feed a chain of operators, possibly distributed over a Grid environment, ending in the output operator. The integration unit (tuple) that flows between operators contains data source units.]


Iteration Model

[Figure: iteration model. A chain of operators A, B and C over a DataSource is shown in three phases, OPEN, GETNEXT and CLOSE, with each call issued to every operator in the chain; the chain delivers the results.]
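The OPEN/GETNEXT/CLOSE protocol is the classic pull-based iterator model, and a minimal Python sketch makes the call propagation concrete. The class and method names are hypothetical and not QEF's API. The sketch assumes A is the operator closest to the output and that `None` marks the end of the stream.

```python
class DataSource:
    """Leaf of the plan: hands out stored rows one at a time."""
    def __init__(self, rows):
        self.rows = rows
        self._it = None
    def open(self):
        self._it = iter(self.rows)
    def get_next(self):
        return next(self._it, None)  # None marks end of stream
    def close(self):
        self._it = None

class Operator:
    """Applies fn to each tuple pulled from its child."""
    def __init__(self, child, fn):
        self.child, self.fn = child, fn
    def open(self):
        self.child.open()            # OPEN propagates down the chain
    def get_next(self):
        t = self.child.get_next()    # GETNEXT pulls one tuple from below
        return None if t is None else self.fn(t)
    def close(self):
        self.child.close()           # CLOSE propagates down the chain

def run(root):
    """Drive a plan: open it, pull tuples until exhausted, then close it."""
    root.open()
    results = []
    while (t := root.get_next()) is not None:
        results.append(t)
    root.close()
    return results

# Chain A <- B <- C <- DataSource
c = Operator(DataSource([1, 2, 3]), lambda x: x + 1)
b = Operator(c, lambda x: x * 10)
a = Operator(b, str)
results = run(a)
```

Each `get_next` call on A pulls exactly one tuple through C and B, so no operator has to materialize its whole input.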


Distribution and Parallelization

Operator distribution

A query optimizer selects a set of operators in the QEP (query execution plan) to execute over a Grid environment.

[Figure: in the chain A, B, C over a DataSource, operator B is replicated as instances B1, B2 and B3.]


General Parallel Execution Model

Remote QEP

In order to parallelize an execution, the initial QEP is modified and sent to remote nodes to handle the distributed execution.

[Figure: initial plan and modified plan. Legend: control operator; distributed operator; user’s operator; R: Receiver; S: Sender; Sp: Split; M: Merge.]


Modifying IQEP to adapt to execution model

The query optimizer adds control operators according to the execution model and IQEP statistics.

[Figure: modified IQEP over the Particles, Geometry and Velocity sources. Logical operators (A (TCP), SJ, TJ) and control operators (Orbit, Split, Merge, Send, Receive, B2I, I2B) are placed across a control node and remote nodes; local and remote data-flows are drawn differently.]


Grid node allocation algorithm (G2N)

Grid Greedy Node scheduling algorithm (G2N)

•  Offers maximum usage of scheduled resources during query evaluation.

•  Basic idea : “an optimal parallel allocation strategy for an independent query operator … is the one in which the computed elapsed-time of its execution is as close as possible to the maximum sequential time in each node evaluating an instance of the operator”.

[Figure: operator A feeding a parallel instance $B_n$, annotated with the times $t_1$ and $t_2$ and the relation $t_1 + t_2 = t_x(B_n)$, where $t(B_n)$ is the cost of the operator on this node.]


Implementation

•  Core development in Java 1.5.

•  Globus toolkit 4.

•  Derby DBMS (catalog).

•  Tomcat, AJAX and Google Web Toolkit for user interface.

•  Runs on Windows, Unix and Linux.

•  Source code, demo and user guide available at: http://dexl.lncc.br


Summing-up

•  HadoopDB extends Hadoop with an expressive query language, supported by DBMSs
•  Keeps the Hadoop MapReduce framework
•  Queries are mapped to MapReduce tasks
•  For scientific applications, it remains an open question whether scientists will enjoy writing SQL queries
•  Algebraic-like languages may seem more natural (e.g. Pig Latin)


Pig Latin - a high-level language alternative to SQL

•  The use of high-level languages such as SQL may not please the scientific community
•  Pig Latin tries to give an answer by providing a procedural language whose primitives are relational algebra operations
•  “Pig Latin: A Not-So-Foreign Language for Data Processing”, Christopher Olston, Benjamin Reed et al., SIGMOD 2008


Example

•  urls (url, category, pagerank)
•  In SQL
   –  SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6
•  In Pig Latin
   –  good_urls = FILTER urls BY pagerank > 0.2;
   –  groups = GROUP good_urls BY category;
   –  big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
   –  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
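The same step-by-step dataflow can be mirrored in plain Python to show how each Pig Latin statement maps to one transformation. This is an illustrative sketch over in-memory dictionaries, not something Pig generates. The `min_count` parameter stands in for the 10^6 threshold so the example can run on a handful of rows.

```python
from collections import defaultdict
from statistics import mean

def big_category_avg(urls, min_rank=0.2, min_count=10**6):
    """Average pagerank per category, over categories with many good URLs."""
    good_urls = [u for u in urls if u["pagerank"] > min_rank]   # FILTER
    groups = defaultdict(list)                                   # GROUP ... BY category
    for u in good_urls:
        groups[u["category"]].append(u["pagerank"])
    big_groups = {c: ranks for c, ranks in groups.items()        # FILTER groups BY COUNT
                  if len(ranks) > min_count}
    return {c: mean(ranks) for c, ranks in big_groups.items()}   # FOREACH ... GENERATE

urls = [
    {"url": "a", "category": "news", "pagerank": 0.5},
    {"url": "b", "category": "news", "pagerank": 0.9},
    {"url": "c", "category": "news", "pagerank": 0.1},
    {"url": "d", "category": "sports", "pagerank": 0.8},
]
averages = big_category_avg(urls, min_count=1)
```

Only the "news" category survives here: it has two URLs above the pagerank cut-off, while "sports" has one.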


Pig Latin

•  A program is a sequence of steps
   –  Each step executes one data transformation
•  Optimizations among steps can be generated dynamically, for example:
   –  1) spam_urls = FILTER urls BY isSpam(url);
   –  2) highrank_urls = FILTER spam_urls BY pagerank > 0.8;
   –  The two filters can be reordered: 1, 2 → 2, 1


Data Model

•  Types:
   –  Atom - a single atomic value
   –  Tuple - a sequence of fields, e.g. (‘DB’, ‘Science’, 7)
   –  Bag - a collection of tuples with possible duplicates
   –  Map - a collection of data items where each data item has an associated key, e.g. ‘fanOf’ → {‘flamengo’, ‘music’}, ‘age’ → 20
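For readers more used to Python, the four types have rough analogues in built-in structures. The mapping below is only an analogy chosen for illustration; a `Counter` is used for the bag because it keeps duplicates as counts and, like a bag, has no order.

```python
from collections import Counter

# Atom: a single atomic value
atom = "alice"

# Tuple: a sequence of fields
record = ("DB", "Science", 7)

# Bag: a collection of tuples in which duplicates are allowed
bag = Counter([("alice", "ipod"), ("alice", "ipod"), ("bob", "nano")])

# Map: data items looked up by key; a value may itself be a bag or an atom
mapping = {
    "fanOf": Counter([("flamengo",), ("music",)]),
    "age": 20,
}

bag_size = sum(bag.values())  # counts duplicates, unlike len(bag)
```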


Operations

•  Per-tuple processing: FOREACH
   –  Allows the specification of iterations over bags
   –  Example: expanded_queries = FOREACH queries GENERATE userId, expandedQuery(queryString);
   –  Each tuple in a bag should be processed independently of all others, so parallelization is possible
•  FLATTEN
   –  Permits flattening of nested tuples
   –  Example: (alice, {(ipod, nano), (ipod, shuffle)}) flattens to (alice, ipod, nano) and (alice, ipod, shuffle)
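The effect of flattening on the alice example can be reproduced with a few lines of Python. The `flatten` helper is a hypothetical stand-in for Pig Latin's FLATTEN, written for one record whose second field is a bag of tuples.

```python
def flatten(record):
    """Turn (key, bag_of_tuples) into one flat tuple per nested tuple."""
    key, nested = record
    return [(key, *inner) for inner in nested]

nested_record = ("alice", [("ipod", "nano"), ("ipod", "shuffle")])
flat_rows = flatten(nested_record)
```

Each nested tuple becomes its own output row, with the outer field repeated in front.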


Olympic Laboratory


•  Objective
   –  To study high-performance sports as a scientific discipline
   –  To build the first sports laboratory in South America
•  US$ 10M project sponsored by FINEP (funding agency)
•  Departments:
   –  Biochemistry, physiology, genetics, nutrition, computational modeling, computer science


Our task

•  To support athletes’ follow-up data
   –  Athletes’ training
   –  Variation in biochemical elements
   –  Variation in biometric variables
•  More recently
   –  For some modalities, integrate meteorological conditions


Analyses Board


Athletes follow-up database

•  Athletes’ follow-up data modeled as trajectories
   –  Register measurements from athletes in different training states
•  Trajectory model
   –  Ordered set of measurements
   –  Division of time into training states
   –  Materialized view limited in time range
   –  Imprecise measurements
      •  Not detected = 0
      •  < x → ]0, x[
      •  y, y ≥ x
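One way to store the three kinds of imprecise measurement is as an interval, sketched below in Python. The reading of the rules is an assumption: x is taken to be the detection limit of the assay, "not detected" becomes the point 0, a reading reported as "< x" becomes the open interval ]0, x[, and an exact value y is accepted only when y ≥ x. The `Measurement` class and `parse` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    low: float
    high: float
    low_open: bool = False
    high_open: bool = False

    def contains(self, value):
        """True if value is compatible with this measurement."""
        above = value > self.low if self.low_open else value >= self.low
        below = value < self.high if self.high_open else value <= self.high
        return above and below

def parse(reading, x):
    """Convert a lab reading into an interval; x is the assumed detection limit."""
    if reading == "not detected":
        return Measurement(0.0, 0.0)
    if reading.startswith("<"):
        return Measurement(0.0, x, low_open=True, high_open=True)
    y = float(reading)
    if y < x:
        raise ValueError("exact readings are expected to be >= x")
    return Measurement(y, y)

below_limit = parse("<0.5", x=0.5)
exact = parse("0.8", x=0.5)
```

Keeping the interval instead of a single number lets later analyses decide how to treat values below the detection limit, rather than fixing that choice at load time.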

More on Athlete’s Trajectories

•  Stops - modelled as measurements
   –  Qualified according to the athlete’s training state
   –  Training states (recovery, training, rest, …)
•  Moves - extrapolation between two stops
•  Trajectory - the set of measurements, ordered in time and limited in time according to some criteria (e.g. a training program)
   –  Measurements of the same observable element
   –  Measurements of the same athlete


Metaphoric Trajectory


Challenges

•  Integrating athletes’ trajectories with weather information
•  How to efficiently store metaphoric trajectories?
   –  TrajStore [Cudre-Mauroux et al., ICDE 2010]
   –  SciDB
•  How to express and efficiently process similar trajectories


Part I: Where are they coming from ?


•  “Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it’s going to get worse”
   –  Office of Science Data Management Challenge, DoE


A petabyte seems like a lot, but

LSST – Large Synoptic Survey Telescope

•  800 images per night for 10 years!
•  3D map of the Universe
•  30 terabytes per night
•  30 petabytes in 10 years

LSST


DNA Sequences Published in GenBank (UK NCBI)

In April 2012:
•  1.5 × 10^7 sequences
   –  50% in 4 years
•  1.3 × 10^11 base pairs
   –  30% in 4 years

Communities

According to IDC, the amount of digital data available in our cyber-environment will exceed Avogadro’s number in 2023 (> 10^23 bytes, on the yottabyte scale).


In numbers:

•  12 terabytes of tweets every day (IBM, 2012)
•  10 terabytes on Facebook every day
•  Some companies produce terabytes per hour, every day of the year
   –  Events:
      •  A subway door opening
      •  Checking in at the airport
      •  Buying a song on iTunes


Scientific Communities


Government Data

•  Investments
•  Government programs
•  Taxes
•  Contracts, rendering of accounts
•  Indices: economic, social, education, health, …
•  Security and defense


Historical Data


GenBank exponential growth 1982 - 2008

Growth in nucleotide sequences submitted to GenBank between 1982 and 2005. The notes from each release of GenBank reveal the total number of nucleotides submitted to the database. This graph uses one data point from each year to show the exponential growth rate of nucleotide sequences in this international database. Data from whole genome sequencing is not included in these figures, but expressed sequence tag data and data from sequencing centers is included. Copyright 2008 Nature Education.