The Claremont Report on Database Research
SIGMOD 2008
What is it?
• In May 2008, prominent DB researchers, architects, users, and pundits met at the Claremont Resort in Berkeley, CA
• Seventh such meeting in 20 years
• Report based on discussion of new directions in DBs
Turning point in DB Research
• New opportunities for technical advances, impact on society, etc.
1. Big Data
  – Not only traditional enterprises, but also e-science, digital entertainment, natural language processing, social network analysis
  – Design new custom data management solutions from simpler components
2. Data analysis as profit center
  – Barriers between IT dept. and business units dropping
  – Data is the business
  – Data capture, integration, etc. are keys to efficiency and profit
  – BI vendors: $10B (front-end only)
  – Also need better analytics, sophisticated analysis
  – Non-technical decision makers want data
3. Ubiquity of structured and unstructured data
  – Structured data: extracted from text, software logs, sensors, and deep web crawls
  – Semi-structured data: blogs, Web 2.0 communities, instant messaging
  – Publish and curate structured data
  – Develop techniques to extract useful data, enable deeper explorations, connect datasets
4. Expanded developer demands
  – Adoption of relational DBMS and query languages has grown
    • MySQL, PostgreSQL, Ruby on Rails
    • Less interest in SQL; DBMS viewed as too much to learn relative to other open source components
  – Need new programming models for data management
5. Architectural shifts in computing
  – Computing substrates for DM are shifting
  – Macro: rise of cloud computing
    • Democratizes access to parallel clusters
  – Micro: shift from increasing chip clock speed to increasing number of cores, threads
    • Changes in memory hierarchy
    • Power consumption
  – New DM technologies
Research Opportunities
• Impact of DB research has not evolved beyond traditional DBs
• Reformation
  – Reform data-centric ideas for new applications and architectures
• Synthesis
  – Data integration, information extraction, data privacy
• Some topics not mentioned, because they are still part of significant effort
  – Must continue with these efforts
  – Also must continue with uncertain data, data privacy and security, e-science, human-centric interactions, social networks, etc.
DB Engines
• Big-market relational DBs have well-known limitations
• Peak performance:
  – OLTP with lots of small, concurrent transactions (debit/credit workloads)
  – OLAP with few read-mostly, large join, aggregation workloads
• Bad for:
  – Text indexing, serving web pages, media delivery
• DB engine technology could be useful in sciences and Web 2.0 applications, but not in current bundled DB systems
• Petabytes of storage and 1000s of processors, but current DBs cannot scale
• Need schema evolution, versioning, etc.
• Currently, many DB engine startup companies
1. Broaden range for multi-purpose DBs
2. Design special purpose DBs
• Topics in DB engine area:
  – Systems for clusters of many processors
  – Exploit remote RAM and Flash as persistent media
  – Query optimization and data layout as a continuous task
  – Compress and encrypt data, integrated with data layout and optimization
  – Embrace non-relational DB models
  – Trade off consistency/availability for performance
  – Design power-aware DBMS
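A minimal sketch of why compression and data layout are worth treating together, assuming a simple run-length encoding over a single column. The `regions` column and the function name are hypothetical and only serve as illustration; they are not from the report.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


# Hypothetical column of region codes, stored in arrival order.
regions = ["EU", "US", "EU", "ASIA", "US", "EU", "US", "ASIA"]

# Same values, two physical layouts: arrival order vs. sorted on the column.
unsorted_runs = run_length_encode(regions)
sorted_runs = run_length_encode(sorted(regions))

print(len(unsorted_runs), "runs in arrival order")
print(len(sorted_runs), "runs after sorting the column")
```

The sorted layout yields far fewer runs for the same data, which is one reason a layout decision and a compression decision can be hard to make independently.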
Declarative Programming for Emerging Platforms
• Programmer productivity is important
  – Non-experts must be able to write robust code
  – Data-centric programming techniques
    • MapReduce: language and data parallelism
    • Declarative languages: Datalog
    • Enterprise application programming: Ruby on Rails, LINQ
• New challenges: programming across multiple machines
• Data independence is valuable: no assumptions about where data is stored
• XQuery for declarative programming?
• Also need language design, efficient compilers, and code optimization across parallel processors and vertical distribution of tiers
• Need more expressive languages
• Attractive syntax, development tools, etc.
• Data management: not only a storage service, but a programming paradigm
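The MapReduce style mentioned above can be sketched on a single machine as a word count. This is an illustration of the programming model only, not of any particular framework; the function names and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents.
docs = ["data is the business", "big data needs new data management"]
counts = word_count(docs)
print(counts["data"])
```

The programmer writes only the map and reduce functions; because neither depends on where the data lives, a real system is free to run them in parallel across many machines.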
Interplay of Structured and Unstructured Data
• Data behind forms: Deep Web
• Data items in HTML
• Data in Web 2.0 services (photo, video sites)
• Transition from traditional DBs to managing structured, semi-structured and unstructured data in enterprises and on the web
• Challenge of managing dataspaces
• On the web
  – Vertical search engines
  – Domain-independent technology for crawling
• Within the enterprise
  – Discover relationships between structured and unstructured data
• Extract structure and meaning from un- and semi-structured data
• Information extraction technology – pull entities and relationships from unstructured text
• Need: apply and manage predictions from independent extractors
  – Algorithms to determine correctness of extraction
  – Join with IR and ML communities
• Better DB technology needed to manage data in context
  – Discover implicit relationships, maintain context through storage and computation
• Query and derive insight from heterogeneous data
  – Answer keyword queries over heterogeneous data sources
  – Analysis to extract semantics
  – Cannot assume that semantic mappings exist or that the domain is known
• Develop algorithms to provide best-effort services on loosely integrated data
  – Pay as you go as semantic relationships are discovered
• Develop index structures to support querying hybrid data
• New notions of correctness and consistency
• Innovate on creating data collections
• Ad-hoc communities to collaborate
  – Schema will be dynamic
  – Consensus to guide users
  – Need easy-to-use visualization tools for creating data
    • Data created with such tools may be easier to extract information from
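A minimal sketch of the "keyword queries over heterogeneous data sources" idea from this section, assuming a single inverted index built over both structured rows and free text. The sample rows, documents, and function names are hypothetical, and a real system would need ranking and far better tokenization.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(rows, documents):
    """Index structured rows and unstructured text in one inverted index."""
    index = defaultdict(set)
    for row_id, row in rows.items():
        for value in row.values():
            for tok in tokens(str(value)):
                index[tok].add(("row", row_id))
    for doc_id, text in documents.items():
        for tok in tokens(text):
            index[tok].add(("doc", doc_id))
    return index


def keyword_search(index, query):
    """Return every item, of either kind, that contains all query keywords."""
    hits = [index.get(tok, set()) for tok in tokens(query)]
    return set.intersection(*hits) if hits else set()


# Hypothetical structured rows and unstructured documents.
rows = {1: {"name": "Acme Corp", "city": "Berkeley"},
        2: {"name": "Globex", "city": "Boston"}}
documents = {"blog-7": "Visited the Acme office in Berkeley last week.",
             "chat-3": "Globex is hiring."}

index = build_index(rows, documents)
results = keyword_search(index, "acme berkeley")
print(sorted(results))
```

The query needs no schema or semantic mapping: a row and a blog post are returned side by side, which is the best-effort, loosely integrated behavior the slides describe.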
Cloud Data Services
• Infrastructures providing software and computing facilities as a service
• Efficient for applications
  – Limit up-front capital expenses
  – Reduce cost of ownership over time
• Services hosted in a data center
  – Shared commodity hardware for computation and storage
Cloud services available today
• Application services (salesforce.com)
• Storage services (Amazon S3)
• Compute services (Google App Engine, Amazon EC2)
• Data services (Amazon SimpleDB, SQL Server Data Services, Google’s Datastore)
• Cloud data services offer APIs more restricted than traditional DBs
  – Minimalist query languages, limited consistency
  – More predictable services
    • Difficult if a full-function SQL data service had to be provided
  – Manageability important in cloud environments
    • Limited human intervention
    • High workloads
    • Variety of shared infrastructures
• No DBA or system admin
  • Management done automatically by the platform
• Large variations in workloads
  – Economical to use more resources for short bursts
  – Service tuning depends upon virtualization
    • HW virtual machines as programming interface (EC2)
    • Multi-tenant hosting of many independent schemas in a single managed DBMS (salesforce.com)
• Need for manageability
• Adaptive online techniques
• New architectures and APIs
  – Depart from SQL and transaction semantics when possible
• SQL DBs cannot scale to thousands of nodes
  – Different transactional implementation techniques or different storage semantics?
• Query processing and optimization
  – Cannot exhaustively search the plan space with 1000s of sites
• More work needed to understand scaling realities
• Data security and privacy
  – No longer physical boundaries of machines or networks
• New scenarios
  – Specialized services with pre-loaded data sets (stock prices, weather)
• Combine data from private and public domains
• Reaching across clouds (scientific grids)
  – Federated cloud architectures
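The multi-tenant hosting model mentioned in this section can be sketched with one shared database in which every row carries a tenant identifier. This is a simplifying assumption: the slides speak of many independent schemas in one managed DBMS, and a single shared table is only one way to approximate that. The table, column, and function names are hypothetical.

```python
import sqlite3

# One shared in-memory database stands in for the single managed DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " tenant_id TEXT NOT NULL,"
    " key TEXT NOT NULL,"
    " value TEXT,"
    " PRIMARY KEY (tenant_id, key))"
)


def put(tenant_id, key, value):
    """Store a value on behalf of one tenant."""
    conn.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
        (tenant_id, key, value),
    )


def get_all(tenant_id):
    """Return only the rows that belong to the given tenant."""
    rows = conn.execute(
        "SELECT key, value FROM records WHERE tenant_id = ? ORDER BY key",
        (tenant_id,),
    )
    return dict(rows.fetchall())


put("tenant_a", "plan", "gold")
put("tenant_b", "plan", "free")
put("tenant_a", "region", "us-west")

print(get_all("tenant_a"))
print(get_all("tenant_b"))
```

Both tenants use the key `plan` without colliding, and every read is scoped by `tenant_id`; enforcing that scoping in the platform rather than in each application is part of what makes such a service manageable without a per-tenant DBA.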
Mobile Applications and Virtual Worlds
• Manage massive amounts of diverse user-created data, synthesize intelligently and provide real-time services
• Mobile space
  – Large user bases
  – Emergence of mobile search and social networks
• Timely information to users depending on location, preferences, social circles, extraneous factors, and the context in which they operate
• Synthesize user input and behavior to determine location and intent
• Virtual worlds: Second Life
  – Began as simulations for multiple users
    • Blur distinction with the real world
    • Co-space, for both virtual and physical worlds
  – Events in the physical world captured by sensors, materialized in the virtual world
  – Events in the virtual world can affect the physical world
• Need to process heterogeneous data streams
• Balance privacy against sharing personal real-time information
• Virtual actors require large-scale parallel programs
  – Efficient storage, data processing, power sensitive
Moving Forward
• DB research community has doubled in size over the last decade
• Increasing technical scope makes it difficult to keep track of the field
• Review load for papers growing
  – Quality of reviews decreasing over time
• Need more technical books, blogs, wikis
• Open source software development in DB
  – Competition: system components for cloud computing
  – Large-scale information extraction