Perspective on Future Data Analysis
Computing in High Energy Physics 2003
La Jolla 24 March
René Brun
CERN
Perspective on Future Data Analysis in HENP
René Brun, CHEP03, Perspective on Future Data Analysis
Data Analysis ??
Data Analysis has traditionally been associated with the final stages of data processing, i.e. Physics Analysis.
In this talk, I will cover a more general aspect of Data Analysis (in the true sense).
How to interact with data at all stages of data processing (batch or interactive modes)?
Can we imagine an experiment-independent way to achieve this?
Evolution
To understand the possible directions, we must understand some messages from the past, the solid recipes!
One important message is “Make it simple”.
Heavy experiment frameworks are often perceived as a serious obstacle and push users to use more basic but universal frameworks.
Once upon a time (seventies)
With the first electronic (as opposed to bubble chamber) experiments, data analysis was experiment specific, an activity after the data taking.
The only common software was the histograming package (e.g. Hbook), the fitting package (e.g. Minuit), some plotting packages and independent routines in cernlib (linear algebra and small utilities).
Data structures = Fortran common blocks
Early Eighties
With the growing complexity of the experiments and corresponding software, we see the development of Data Structures management systems (hydra, zbook-->zebra, bos).
These systems are able to write/read complex bank collections. Zebra had a self-describing bank format with built-in support for bank evolution.
Most data is processed in batch, but many prototypes of interactive systems start to appear (htv, gep, then paw...).
PAW
Designed in 1985. Stable since 1993.
Row-Wise-Ntuples: OK for small data sets, interactive histograming with cuts.
Column-Wise-Ntuples: a major step, illustrating the advantage of structured data sets.
PAW: a success not so much because of its technical merits, but because it is perceived as a widely available tool.
Stability over many years: an important element.
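The difference between the two ntuple layouts can be illustrated with a small Python sketch. It is not PAW code: the variable names (px, py, ntrack) and the helper functions are invented for illustration. The point is only that a cut on one variable and a histogram of another touch two columns, not every field of every record.

```python
def to_columns(rows):
    """Convert a row-wise ntuple (list of records) into a column-wise one."""
    names = rows[0].keys()
    return {name: [row[name] for row in rows] for name in names}


def histogram(values, nbins, lo, hi):
    """Count values into nbins equal-width bins on [lo, hi)."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    return counts


# Row-wise: every record carries all variables (hypothetical values).
rows = [
    {"px": 0.5, "py": 1.5, "ntrack": 2},
    {"px": 1.5, "py": 0.2, "ntrack": 5},
    {"px": 2.5, "py": 0.9, "ntrack": 3},
    {"px": 1.2, "py": 2.1, "ntrack": 4},
]

# Column-wise: one list per variable.
columns = to_columns(rows)

# "Histogram px with the cut ntrack > 2" reads only two columns; py is never touched.
selected = [px for px, n in zip(columns["px"], columns["ntrack"]) if n > 2]
px_hist = histogram(selected, nbins=3, lo=0.0, hi=3.0)
print(px_hist)
```

On disk the same idea means that only the buffers of the requested columns have to be read.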
1993-->2000 (1)
Move from Fortran to OO
Took far more time than expected
  new language(s)
  new programming techniques
  basic infrastructure not available to compete with existing libraries and tools
  conflicts between projects
  ad-hoc software in experiments
1993-->2000 (2)
False hopes with OODBMS (or too early?)
  OODBMS --> Objectivity
  OO models designed for Objy
  batch oriented
  interactive use via conversion to PAW ntuples
  central data base does not fit well with GRID concepts
  licensing problems
  and more
Data Analysis Models
From the desktop to the GRID
[Diagram: Desktop; Local/remote Storage; Online/Offline Farms; GRID]
New data analysis tools must be able to use remote CPUs, storage elements and networks in parallel, in a way that is transparent for a user at a desktop.
My laptop in 200X
Using a naïve extrapolation of Moore’s law for a state of the art laptop

Year   CPU/GHz   RAM/GB   disk/GB
2003   2.4       0.5      60
2005   5         1        150
2007   10        2        300
2009   20        4        600
2011   40        8        1000

Nice! But less than 1/1000 of what I need.
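The extrapolation can be reproduced with a few lines of Python, assuming (this is an assumption, not stated on the slide) that every quantity doubles every two years starting from the 2003 values. Under that assumption the RAM column comes out exactly as on the slide; the CPU and disk figures on the slide are rounded and come out somewhat higher than strict doubling.

```python
def extrapolate(base_year, base_value, year, doubling_years=2.0):
    """Value at `year` if it doubles every `doubling_years` (assumed rate)."""
    return base_value * 2 ** ((year - base_year) / doubling_years)


# 2003 baseline taken from the table; the doubling period is an assumption.
baseline = {"cpu_ghz": 2.4, "ram_gb": 0.5, "disk_gb": 60}

for year in range(2003, 2012, 2):
    row = {name: round(extrapolate(2003, value, year), 1)
           for name, value in baseline.items()}
    print(year, row)
```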
Batch-mode Local analysis
Conventional model: the user has full control of the event loop.
The program produces histograms, ntuples or trees.
The selection is done in the user's private code.
Histograms are then added (with a tool or in the interactive session).
Ntuples/trees are combined into a chain and analyzed interactively.
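A minimal Python sketch of this conventional model follows. The Hist1D class, the event fields (mass, ntrack) and the cut are all hypothetical; the sketch only shows the three ingredients named above: a user-controlled event loop, a private selection, and the bin-by-bin addition of the histograms produced by separate jobs.

```python
class Hist1D:
    """Fixed-binning 1-D histogram (illustrative, not a real library class)."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.counts = [0] * nbins

    def fill(self, x):
        if self.lo <= x < self.hi:
            index = int((x - self.lo) * self.nbins / (self.hi - self.lo))
            self.counts[index] += 1

    def add(self, other):
        """Add another histogram bin by bin; binning must be identical."""
        if (self.nbins, self.lo, self.hi) != (other.nbins, other.lo, other.hi):
            raise ValueError("incompatible binning")
        self.counts = [a + b for a, b in zip(self.counts, other.counts)]


def run_job(events):
    """One batch job: the user owns the event loop and the selection."""
    hist = Hist1D(4, 0.0, 4.0)
    for event in events:              # user-controlled event loop
        if event["ntrack"] >= 2:      # selection in the user's private code
            hist.fill(event["mass"])
    return hist


job_a = run_job([{"mass": 0.5, "ntrack": 2}, {"mass": 3.2, "ntrack": 1}])
job_b = run_job([{"mass": 0.7, "ntrack": 4}, {"mass": 2.1, "ntrack": 3}])

# Merge step: add the partial histograms from all jobs.
total = Hist1D(4, 0.0, 4.0)
for partial in (job_a, job_b):
    total.add(partial)
print(total.counts)
```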
Batch Analysis on the GRID
From a user viewpoint, a simple extrapolation of the local batch analysis.
In practice, must involve all the GRID machinery: authentication, resource brokers, sandboxes.
Viewing the current status (histograms) must be possible.
Advantage: Stateless, can process large data volumes.
Advanced systems already exist (see talk by Andreas Wagner)
AliEnFS & Distributed Analysis
*******************************************
*                                         *
*        W E L C O M E  to  R O O T       *
*                                         *
*   Version 3.03/09     3 December 2002   *
*                                         *
*  You are welcome to visit our Web site  *
*          http://root.cern.ch            *
*                                         *
*******************************************
Compiled for linux with thread support.
CINT/ROOT C/C++ Interpreter version 5.15.61, Oct 6 2002
Type ? for help. Commands must be C++ statements.
Enclose multiple statements between { }.
root [0] newanalysis->Submit();
[Diagram, distributed analysis: Analysis Macro; Query for Input Data; five CE, each with an MSS; merged Trees + Histograms]
[Diagram, AliEnFS: Linux File System; VFS and LUFS (Kernel Space); AliEnFS and AliEn API (User Space); MSS; protocols castor://, soap://, root://, https://; namespace /alien/ with alice/, atlas/, data/, prod/, mc/, a/, b/]
Interactive Local Analysis
On a public cluster, or the user’s laptop.
Tools like PAW or its successors are used for visualization and ntuples/trees analysis.
GRID: Interactive Analysis, Case 1
Data transfer to user’s laptop
Optional Run/File catalog
Optional GRID software
[Diagram: optional Run/File catalog; remote file server (e.g. rootd); trees]
Analysis scripts are interpreted or compiled on the local machine
GRID: Interactive Analysis, Case 2
Remote data processing
Optional Run/File catalog
Optional GRID software
[Diagram: optional Run/File catalog; remote data analyzer (e.g. proofd); trees; commands, scripts; histograms]
Analysis scripts are interpreted or compiled on the remote machine
GRID: Interactive Analysis, Case 3
Remote data processing
Run/File catalog
Full GRID software
[Diagram: Run/File catalog; remote data analyzer (e.g. proofd); six slaves, each with trees; commands, scripts; histograms, trees]
Analysis scripts are interpreted or compiled on the remote master(s)
Data Analysis Projects
Tools for data analysis
PAW: started in 1985, no major developments since 1994.
HippoDraw: started in 1991
ROOT: started in 1995, continuous developments
JAS: started in 1995, continuous developments
Open Scientist: ?
LHC++/Anaphe: 1996-->2002
PI: new project in the LHC Computing Grid, just starting now
PAW
The reference for 18 years (since 1985)
Used by most collaborations
Ported to many platforms, small (3 to 15 MB)
Many criticisms during the development phase
Applauded since it is stable
Maintained by Olivier Couet (ROOT team)
Usage still growing
0.1 FTE
HippoDraw
Author: Paul Kunz
Showed the way in 1991/1992
Usage: Paul + “a 50 year-old CERN physicist”
Seems to be in a constant prototyping phase
Good to have this type of prototype to illustrate possible new interactive techniques.
1 FTE ?
ROOT
In constant development since 1995
Used by many collaborations and outside HEP
More than 10000 distributions of binary tar files in February
6 + 2 + .. FTE
JAS
Started in 1995 (Tony Johnson).
Current version 2. JAS3 presented at this CHEP.
For the Java world.
How to cooperate with C++ frameworks?
3 FTE ?
In AIDA you believe ?
The Abstract Interfaces for Data Analysis project was started by the defunct LHC++ and continued by Anaphe (now stopped).
Supported by JAS and Open Scientist
Goal: define abstract interfaces to facilitate cooperation between developers and facilitate migration of users to new products
Versions 1, 2 and 3 (version 4 for PI ?)
In AIDA I don’t believe
Abstract interfaces are fundamental in modern systems to make a system more modular and adaptable.
But common abstract interfaces are not a good idea.
  They force a lowest common denominator
  They require international agreements
  Users will be confused (what is common and what is not)
  You become the slave of a deal: against creativity
It is more important to agree on object interchange formats and data base access
  You can easily change a few hundred lines of code.
  You cannot copy Terabytes of data
The LCG PI project
Fresh from the oven
One of the projects recently launched by the Applications Area of the LCG project.
Ideas:
  promote the use of AIDA (version 4)
  Python for scripting
  interface to ROOT & CINT
in gestation
see Vincenzo
User & Developer views
Users' requests
  very rarely requests for grandiose new features
  zillions of tiny new features
  zillions of tiny improvements
  want consolidation & stability
Developers' view
  want to implement the sexy features
  target modularity (more complex installation?)
  maintenance & helpdesk: a problem or a chance?
Lessons from the past
It takes time to develop a general tool
  more than 7 years for PAW, ROOT and JAS
User feedback is essential in the development phase
People like stable systems
Efficient access to data sets is a prerequisite
24h x 7days x 12 months x N years online support is vital
Develop/Debug/maintain
In an interactive system with N basic functions, the number of combinations may be unlimited (not NxN, but N!).
10% of the time to develop the first 90% of the code.
90% of the time to develop the remaining 10%.
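To give a feeling for the gap between NxN and N!, the short Python sketch below prints both numbers for a few values of N. Reading "combinations" as the number of orderings in which N functions can be invoked is an interpretation of the slide, and the function names are invented.

```python
import math


def pairwise_combinations(n):
    """Number of ordered pairs of functions: N x N."""
    return n * n


def orderings(n):
    """Number of distinct sequences using each of the N functions once: N!."""
    return math.factorial(n)


for n in (3, 5, 10, 20):
    print(n, pairwise_combinations(n), orderings(n))
```

Already at N = 10 there are 100 pairs but 3628800 orderings, so exhaustive testing of an interactive system is out of reach.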
Time to develop
[Figure not recoverable; surviving label: LCG]
Technical aspects
Desktop
Plug-in Manager and Dictionary
GUI
Graphics 2-d, 3-d
Event Displays
Histograming & Fitting
Statistics tools
Scripting
Data/Program organization
Plug-in Manager
[Diagram: Object Dictionary; I/O manager; Interpreter; Plug-in manager; Basic Services, GUI, Math..; User Shared lib; Exp Shared libs; General Utility Shared lib]
The Object Dictionary
[Diagram: Object Dictionary (Data dictionary, Functions dictionary); Compiled code; Interpreted scripts; GUI; Command line; I/O; Inspectors; Browsers]
Scripting for data analysis
After KUIP and Tk/Tcl era
Command line interface required
Scripts
  interpreted and/or byte-code interpreted
  automatic compilation and linking
  call compiled or interpreted code
  compiled code must be able to call interpreted code (GUI and configuration scripts)
Big bonus if compiled and interpreted languages are the same
Scripting and object dictionary symbiosis
Remote execution of scripts (in parallel)
Languages & scripting
[Diagram: C++ compiled code; C++ interpreted scripts; Python/Perl scripts; GUI with signal/slots; Interactive User; Batch User]
Comparing scripts
http://sarkar.home.cern.ch/sarkar/jroot/main.html
Very interesting project from Subir Sarkar
Cooperation between Java and a C++ framework, based on the Object Dictionary
GUI(s)
Constant evolution
  1983: Vax/VMS, SMS, VT100
  1985: GKS, Tektronix
  1989: MOTIF, Unix workstations
  1997: Java/Swing, the Web
  2001: Qt, Linux/Laptops
  + Microsoft MFC, Win32 API
Signals/Slots principle: very nice. It helps designing large and modular GUI systems
Interpreters help GUI builders/editors
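The signal/slot principle mentioned on this slide can be sketched in a few lines of Python. This is a toy illustration, not the Qt or ROOT implementation; the Signal, Slider and HistogramPanel classes are hypothetical. What it shows is the decoupling: the slider does not know which objects react to its changes.

```python
class Signal:
    """Holds a list of connected callables (slots) and calls them on emit."""

    def __init__(self):
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self, *args):
        for slot in list(self._slots):
            slot(*args)


class Slider:
    """Widget that announces value changes without knowing the listeners."""

    def __init__(self):
        self.value = 0
        self.value_changed = Signal()

    def set_value(self, value):
        self.value = value
        self.value_changed.emit(value)


class HistogramPanel:
    """Widget whose number of bins can be driven by any signal."""

    def __init__(self):
        self.nbins = 10
        self.redraws = 0

    def set_nbins(self, nbins):
        self.nbins = nbins
        self.redraws += 1


slider = Slider()
panel = HistogramPanel()
slider.value_changed.connect(panel.set_nbins)  # wiring done outside both classes
slider.set_value(50)
print(panel.nbins, panel.redraws)
```

Because the connection is made from outside, it can equally be made from an interpreted script, which is one way interpreters help GUI builders.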
2-D graphics
An area where constant improvements are required.
  Better plotters, better fonts,...
  Better drivers: postscript, SVG, XML, etc
Publication quality is a must. This requirement alone explains why many proposed data analysis systems do not penetrate experiments
3-D graphics
Data structures: Objects <--> scene
Scene renderers: OpenGL, Open Inventor
Most difficult is detector geometry graphics
z-buffer algorithms OK for fast real time fancy graphics, not OK for good debugging (shape outline is important on top of z-buffer views).
Vector Postscript (or PDF/SVG) must be available (not Postscript from OpenGL triangles)
see talks about GraXML and Persint
Example with PERSINT/ATLAS
Event Displays
The most successful event displays so far were 2-D projections (see Aleph, Atlas/Atlantis)
A lot of work with 3-d graphics in many experiments (see talks about Iguana)
Client-server model
Access to framework objects, browsers
One could have expected a bigger role for Java! Mismatch with experiment C++ frameworks?
Possible directions
  standardize object exchange (SOAP/XML/Root I/O)
  standardize low level graphics exchange (HEPREP)
Histograming
This should be a stable area
Thread Safety
Binning on parallel systems
Merging on batch/parallel systems
Fitting
Minuit: the standard
Fumili: was nice and fast
Upgrade of Minuit with new algorithms including Fumili in the pipeline
several GUIs on top
a very powerful package developed by BaBar
see talk on RooFit by D.Kirkby
Statistics & Math
Many tools and algorithms exist
  GSL ?
  Gnu R-Math project
  TerraFerma Initiative
Subject of discussions at many workshops
  confidence limits workshops
  ACAT FermiLab and Moscow
  Durham
Need to be federated in a coherent framework
Lost with Complexity?
In large collaborations, users are often lost when confronted with the complexity of big simulation and reconstruction programs:
  What is the data organization?
  How are algorithms organized? The hierarchy?
The problem is amplified by the use of dynamically configurable systems, dynamic linking and polymorphism
Browsing data and algorithms is a must
Folders/ white boards
Folders help understanding complex hierarchical structures
Language independent
Could be GRID-aware
Why Folders ?
This diagram shows a system without folders. The objects have pointers to each other to access each other's data.
Pointers are an efficient way to share data between classes. However, a direct pointer creates a direct coupling between classes.
This design can become a very tangled web of dependencies in a system with a large number of classes.
Why Folders ?
In the diagram below, a reference to the data is in the folder and the consumers refer to the folder rather than each other to access the data.
A naming and search service provides an alternative. It loosely couples the classes and greatly enhances I/O operations.
In this way, folders separate the data from the algorithms and greatly improve the modularity of an application by minimizing the class dependencies.
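The folder idea can be sketched in Python as a small naming and search service. The Folder class, its methods and the path "Event/Tracks/fitted" are hypothetical and only loosely modelled on the description above. The producer registers its data under a name; the consumer knows the name, not the producer.

```python
class Folder:
    """Named container of sub-folders and data objects (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self._children = {}   # sub-folders by name
        self._objects = {}    # data objects by name

    def add_folder(self, name):
        """Return the sub-folder `name`, creating it if needed."""
        return self._children.setdefault(name, Folder(name))

    def add(self, name, obj):
        """Register a data object under `name` in this folder."""
        self._objects[name] = obj

    def find(self, path):
        """Look up an object by a path such as 'a/b/object'."""
        head, _, rest = path.partition("/")
        if not rest:
            return self._objects[head]
        return self._children[head].find(rest)


# Producer side: publish data in the folder hierarchy.
root = Folder("root")
event = root.add_folder("Event")
event.add_folder("Tracks").add("fitted", [1.2, 0.7, 3.4])


# Consumer side: depends only on the folder path, not on the producer class.
def count_tracks(folders):
    return len(folders.find("Event/Tracks/fitted"))


print(count_tracks(root))
```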
Tasks/Algorithms
In the same way that Folders can be used to organize the data, one can use Tasks to organize a hierarchy of algorithms.
Tasks can be organized into a hierarchical tree of tasks and displayed in the browser. A Task is an abstraction with standard functions to Begin, Execute, Finish.
Each Task derived class may contain other Tasks that can be executed recursively, such that a complex program can be dynamically built and executed by invoking the services of the top level task or one of its subtasks.
Tasks help understanding the organization and sequence of execution of large programs
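A task hierarchy of this kind can be sketched in Python as follows. This is not the ROOT class; the Task and LoggingTask classes, the task names and the traversal order (parent before its subtasks) are assumptions made for illustration. Invoking the top-level task runs the begin, execute and finish steps recursively over the whole tree.

```python
class Task:
    """Node in a tree of algorithms with begin/execute/finish hooks."""

    def __init__(self, name):
        self.name = name
        self.subtasks = []

    def add(self, task):
        self.subtasks.append(task)
        return task

    # Hooks to be overridden by concrete tasks.
    def begin(self):
        pass

    def execute(self, event):
        pass

    def finish(self):
        pass

    def _walk(self, hook, *args):
        """Call `hook` on this task, then recursively on its subtasks."""
        getattr(self, hook)(*args)
        for task in self.subtasks:
            task._walk(hook, *args)

    def run(self, events):
        self._walk("begin")
        for event in events:
            self._walk("execute", event)
        self._walk("finish")


log = []


class LoggingTask(Task):
    """Concrete task that records the order in which hooks are called."""

    def begin(self):
        log.append(f"begin {self.name}")

    def execute(self, event):
        log.append(f"execute {self.name} on event {event}")

    def finish(self):
        log.append(f"finish {self.name}")


reco = LoggingTask("Reconstruction")
reco.add(LoggingTask("Tracking"))
reco.add(LoggingTask("Calorimetry"))
reco.run(events=[1])
print("\n".join(log))
```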
Directions
Exchange/Compatibility
If we assume that several data analysis tools will be around (HEP-made or commercial), it is important to be able to exchange objects between these tools (drag&drop, network or files).
SOAP/XML have emerged as standards for exchanging low volumes of objects.
Several technical solutions are possible. The winning solutions will be the ones that can automate the process by exploiting all the information in the object dictionary.
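As an illustration of automating the exchange from dictionary information, the Python sketch below uses the field list of a dataclass as a stand-in for the object dictionary and derives the XML form from it, with no hand-written converter per class. The Track class and the XML layout are invented for this example and do not correspond to any agreed exchange format.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET


@dataclass
class Track:
    """Hypothetical user class; its fields play the role of the dictionary."""
    px: float
    py: float
    charge: int


def to_xml(obj):
    """Serialize any dataclass instance by walking its declared fields."""
    elem = ET.Element(type(obj).__name__)
    for f in fields(obj):
        child = ET.SubElement(elem, f.name, type=f.type.__name__)
        child.text = repr(getattr(obj, f.name))
    return ET.tostring(elem, encoding="unicode")


def from_xml(text, cls):
    """Rebuild an instance of `cls`, converting each field to its declared type."""
    elem = ET.fromstring(text)
    kwargs = {f.name: f.type(elem.findtext(f.name)) for f in fields(cls)}
    return cls(**kwargs)


xml_text = to_xml(Track(px=1.5, py=-0.25, charge=-1))
print(xml_text)
copy = from_xml(xml_text, Track)
```

Adding a field to Track changes the XML automatically, which is the kind of automation the slide argues will decide which solutions win.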
Follow Microsoft ?
SOAP/XML is one of the key components of .NET (and also of the MS competition).
MS is preparing a new OS (Longhorn ?) for 2005. This new OS will introduce a distributed object data base.
This may have a serious impact on the GRID software and on our tools.
Access Patterns
Understand data access patterns
  to objects in one file
  to subsets of objects in many collections
  relations with run/file catalogs
  persistent reference pointers
Optimize design of containers for processing in
  batch
  interactive
  parallel processing
cache management and proxies
Query processor
Extend/Develop powerful query systems that
  minimize the amount of programming
  optimize I/O (read only what is strictly necessary)
  are able to process data in parallel, hiding the complexity of parallelism from the end user
  can be executed again and again, possibly learning from the previous passes
  are robust against network failures, CTRL/C, programming errors
  can be run in GUI mode, interpreted or compiled mode
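The "hide the parallelism" requirement can be illustrated with a small Python sketch. The run_query function, the chunking and the event fields (pt, eta) are assumptions; a thread pool on one machine stands in for remote workers. The user supplies only a selection and an expression, and the query function distributes the chunks and merges the partial results in order.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(chunks, select, value, workers=4):
    """Evaluate `value` for every event passing `select`, chunk by chunk in parallel."""

    def process(chunk):
        return [value(event) for event in chunk if select(event)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(process, chunks)  # results keep chunk order
        return [x for part in partial_results for x in part]


# Three chunks standing in for three files or tree segments (made-up events).
chunks = [
    [{"pt": 12.0, "eta": 0.3}, {"pt": 3.0, "eta": 1.9}],
    [{"pt": 25.0, "eta": -1.1}],
    [{"pt": 8.0, "eta": 2.4}, {"pt": 40.0, "eta": 0.0}],
]

# The user states what to select and what to compute; no explicit event loop.
high_pt_eta = run_query(
    chunks,
    select=lambda event: event["pt"] > 10.0,
    value=lambda event: event["eta"],
)
print(high_pt_eta)
```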
Event Collections
Develop/Extend objects able to keep a summary of previous runs
Event collections with their iterators well matched to the query processor (event+run, UUID, tree entry serial number).
Special objects: masks, bit slice index to speed up searches in large collections.
The system must be able to run with and without the run/file catalog
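How a mask can speed up repeated searches is sketched below in Python. This is a plain bitmap mask (one bit per entry), a simpler relative of the bit slice index named on the slide; the function names and event fields are invented. Each selection is evaluated once and stored as a mask; later queries combine masks with bitwise operations instead of re-reading the events.

```python
def build_mask(events, predicate):
    """Return an int whose bit i is set when events[i] passes the predicate."""
    mask = 0
    for i, event in enumerate(events):
        if predicate(event):
            mask |= 1 << i
    return mask


def selected_entries(mask):
    """Yield the entry numbers whose bit is set, in increasing order."""
    i = 0
    while mask:
        if mask & 1:
            yield i
        mask >>= 1
        i += 1


events = [
    {"ntrack": 2, "trigger": True},
    {"ntrack": 7, "trigger": False},
    {"ntrack": 5, "trigger": True},
    {"ntrack": 9, "trigger": True},
]

# Evaluate each selection once and keep the result as a mask.
triggered = build_mask(events, lambda e: e["trigger"])
busy = build_mask(events, lambda e: e["ntrack"] >= 5)

# Combining selections is a bitwise AND; no event data is touched.
both = triggered & busy
entries = list(selected_entries(both))
print(bin(both), entries)
```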
Exploiting meta information
The normal data analysis mode requires access to the user classes.
However, experience shows that users also expect (as it was the case for PAW) to be able to process their data sets without the classes/shared libraries used to generate these data sets, still supporting automatic schema evolution.
The class meta information is saved in the data set. Simple queries involving only data class attributes must be possible without the code.
This requirement has consequences on the way the object dictionary is used.
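The requirement can be illustrated with a self-describing data set in Python. The JSON layout, the function names and the Track class are assumptions, and real schema evolution is considerably richer than the default-value rule used here. The writer stores the attribute names with the data; the reader answers a simple attribute query with no access to the class, and tolerates a file written before an attribute existed.

```python
import json


def write_dataset(objects, attributes):
    """Store the attribute names (meta information) together with the values."""
    schema = list(attributes)
    rows = [[getattr(obj, name) for name in schema] for obj in objects]
    return json.dumps({"schema": schema, "rows": rows})


def read_column(text, attribute, default=None):
    """Read one attribute without the original class.

    If the attribute is missing from this file's schema (for example a file
    written by an older class version), return `default` for every entry.
    """
    data = json.loads(text)
    if attribute not in data["schema"]:
        return [default] * len(data["rows"])
    index = data["schema"].index(attribute)
    return [row[index] for row in data["rows"]]


class Track:
    """User class, needed only on the writing side (hypothetical)."""

    def __init__(self, pt, charge):
        self.pt, self.charge = pt, charge


text = write_dataset([Track(1.5, 1), Track(22.0, -1)], ["pt", "charge"])
del Track  # the reader below works without the class definition

pts = read_column(text, "pt")
chi2 = read_column(text, "chi2", default=0.0)  # attribute added in a later version
print(pts, chi2)
```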
Dependencies & Simplicity
Minimize component dependencies to facilitate software distribution/portability
The winning tools will be the ones that
  are easy to port to new systems (OS/compilers)
  depend only on other systems also easy to port
  are used in real conditions to guarantee feedback
  are able to evolve very quickly to adapt to new situations and new requirements.
Integration with GRID soft
The data analysis software is an integral part of the GRID software. It drives the process, not the inverse.
This implies a close cooperation between teams working on tools for data analysis and teams working on the GRID plumbing (resource brokers, authentication, etc.) and GRID high level tools like Condor.
The Batch line and the Interactive line must be developed in a complementary way.
Trends Summary
[Diagram: Histogram/Ntuple viewers; Data Presenters; Efficient access to large and structured event collections; Interaction with user & experiment classes; Parallelism on the GRID; Batch/Interactive; Access to Catalogs; Resource Brokers; Process migration; Progress Monitors; Proxies/caches; Virtual data sets]
More and more GRID oriented data analysis
More and more experiment-independent software
Acknowledgements
For a long time, data analysis has been the last wheel of the car. Many thanks to the organizing committee for giving me the opportunity to present my views on the subject.
Enjoy this conference