Transcript of Big Data Small Testing - IIT Hyderabad
Big Data, Small Testing?
Jayant Haritsa
Database Systems Lab
Indian Institute of Science
January 2016 Indo-Japan DST Workshop
NYT Op-ed Article [April 2014]
• Eight (No, Nine!) Problems With Big Data
• Gary Marcus, Ernest Davis (NYU faculty)
“big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions”
Who’s Bigger? Where Historical Figures Really Rank
(Book by MIT/Google: Hitler ranks higher than Aristotle!)
We need to ensure that Big Data does not wind up
becoming Huge Nonsense …
Research Landscape
• Current Focus: Architecting the “plumbing” infrastructure for Big Data environments
• programming models, stream processing and summarization, sketching and approximation algorithms, storage architectures, cloud hosting, analytics, security …
• These techniques are unlikely to work in practice without rigorous testing
• The elephant in the room is the lack of testing methodologies for such deployments
Quotes†
50% of our cost is on testing (QA)
(Bill Gates @ opening of the Gates Building)
Testing alone takes up six months of the 18-month product release cycle
(SAP executive)
Estimated damage of 60 billion dollars per year in the USA caused by software bugs
(US Department of Commerce, 2004)
† From Donald Kossmann’s Stanford talk
Big Data Disasters
1. UK Immigration [2013]
A Home Office text message campaign accusing people of being illegal immigrants has received numerous complaints after several people were contacted in error. Officials have sent messages to almost 40,000 people they suspect of not having a right to be in the UK, instructing them to contact border officials to discuss their immigration status. The government commissioned Capita, the outsourcing company, to trace people believed to have outstayed their visas.
UK Immigration (contd)
In a few months, Capita was accused of mishandling cases and getting just as mixed up as the bureaucrats it was supposed to be replacing!
In November, Capita admitted a backlog of 150,000 notifications to foreign students it hadn't been able to process and therefore determine if they should or shouldn't still be in the country.
In IT terms, it has been at the center of a botched billion-dollar "e-borders" system, which has been missing deadlines and delivery dates since the middle of the last decade and which may not even be legal under European Union legislation!
2. Obama HealthCare.gov [2013]
Severe problems were caused by unexpectedly high volume when the site drew 250,000 simultaneous users instead of the 50,000-60,000 expected. More than 8 million people visited the site from October 1 to 4. White House officials subsequently conceded that it was not just an issue of volume, but involved software and systems design issues. Also, stress tests done by the contractors one day before the launch date revealed that the site became too slow with only 1,100 simultaneous users!
HealthCare.gov problems persisted even weeks after the launch. For example, a networking error at the related data services hub killed the website's functionality. This occurred the very day after Health & Human Services head Kathleen Sebelius had highlighted the design of that data hub as a government success.
3. Flipkart → Flopkart [Oct 6, 2014]
Deccan Herald: Big Apology Day follows Flipkart's Big Billion Day
– After its Big Billion Day on Monday, which fetched Flipkart.com $100 million by way of sales and the ire of hordes of angry customers who complained of technical glitches and false promises on discounts, the Bangalore-based online giant was quick to apologise for its drawbacks on Tuesday.
– “Though we saw unprecedented interest in our products and traffic like never before, we also realised that we were not adequately prepared for the sheer scale of the event. We didn't source enough products and deals in advance to cater to your requirements. To add to this, the load on our server led to intermittent outages, further impacting your shopping experience on our site,” the Bansals said.
– Noting that it took enormous effort from everyone at Flipkart, many months of preparation and pushing its “capabilities and systems to the limit” for the big day, the Bansals said that they were looking at deals and offers painstakingly put together for months.
Flipkart → Flopkart (contd)
Price Changes
– Even as Flipkart prepared various deals and promotional pricing in the lead-up to the sale, the pricing of several products was changed to non-discounted rates for a few hours.
Out of stock
– The website ran out of stock for many products within a few minutes (and in some cases, seconds) of the sale going live. Most special deals were sold out as soon as they went live.
Cancellations
– A large number of people bought specific products simultaneously. This led to instances of orders being overbooked for products that had sold out just seconds earlier.
Website Issues
– Nearly 5,000 servers were deployed, provisioned for 20 times the normal traffic. But the volume of traffic at different times of the day was much higher than this.
Testing Times
Software Mindset†
Everybody loves writing code
Everybody hates testing it
– more emphasis on developing new models than on evaluating current setups
– solution: automate the testing
computers are cheap and do not complain
† From Donald Kossmann’s Stanford talk
Basic Question
How do you know the output delivered for the user objective is correct?
Checking is hard because of the magnitude of the data involved and the complexity of the queries
Types of Errors
English-to-SQL translation errors
– “Public demands change” could be read as:
• The public is demanding change in society
• Public demands are changing over time
• The public is demanding loose change (coins)
– Big problem (only about 40% of translations are correct!)
– Further, more than 80% are written correctly only after two to four attempts!
Types of Errors (contd)
Syntactic errors
– easy to check with automatic parser generators
Semantic errors
– Schema/type errors (easy to check from catalogs)
– Arithmetic errors (easy to check at runtime)
– Optimizer rewriting errors
e.g. infamous Count Bug [1986]
– Operator implementation errors
– Index maintenance errors
– Transaction management errors
e.g. ARIES checkpoint error
These latter categories are hard to find
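The optimizer-rewriting category can be made concrete. The Count Bug arises when a correlated COUNT subquery is rewritten as a join: groups whose count is zero vanish from the join result, so the rewritten query silently loses answers. A minimal Python sketch (toy data and function names of my own, not from the talk) mimics the two evaluation strategies:

```python
# Query intent: find departments with FEWER THAN 2 employees --
# a department with ZERO employees should qualify.

departments = ["sales", "research", "archive"]   # "archive" has no employees
employees = [("alice", "sales"), ("bob", "sales"), ("carol", "research")]

def nested(depts, emps):
    """Correct nested-query evaluation: count per department, including zero."""
    return [d for d in depts
            if sum(1 for _, dep in emps if dep == d) < 2]

def rewritten(depts, emps):
    """Buggy join-based rewrite: grouping the join output never produces a
    zero-count group, because unmatched departments yield no join rows."""
    joined = [(d, e) for d in depts for e, dep in emps if dep == d]
    counts = {}
    for d, _ in joined:
        counts[d] = counts.get(d, 0) + 1
    return [d for d, c in counts.items() if c < 2]

print(nested(departments, employees))     # ['research', 'archive']
print(rewritten(departments, employees))  # ['research']  -- 'archive' is lost
```

The rewrite is attractive because joins are cheap to optimize, which is exactly why this class of bug slipped into real engines and stayed hard to find.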
Library Approach
SQL test libraries designed by the engine developers or application specialists
Run regression tests on this workload
– Very limited coverage
Moving on to Big Data World
Test Environment
• Underlying infrastructure is a hybrid of ETL/IR/KM/DB components
• e.g. IBM InfoSphere (DataStage, QualityStage, MDM, DB2, BigInsights, Metadata repository, …)
• Need to test
• “functionality” (programs/data)
• “compilation” (query/model planning)
• “execution” (query/model processing)
Sample Scenario
• Wish to test a “yottabyte” (10^24 bytes) scale Big Data environment for InfoSphere
• Metrics: Functionality, Correctness, Performance, Scalability
• Impractical (time) or infeasible (space) to explicitly create and process test data
Pie-in-the-sky
A complete testing environment for Big Data management systems, wherein the entire data and meta-data is virtual or transient, supporting efficient evaluation of arbitrary deployment scenarios.
Metadata Testing
Our Approach
• Build metadata construction tools that “fool” the underlying information systems into thinking that the data is actually present, even though it has never been created or stored
• Developed a tool called CODD (Constructing Dataless Databases) for this purpose
• Edgar Codd, IBM, father of the RDBMS / Turing awardee
• In archaic English, “cod” means “empty shell”
CODD Metadata Processor
• Easy-to-use graphical tool for the automated creation, verification, retention, scaling and porting of database meta-data configurations
• Entirely written in Java (~50K LOC) and operational on industrial-strength db engines (DB2, Oracle, SQL Server, SQL-MX)
• Released as free software after receiving copyright from the Indian government
• In use at several industrial and academic research labs
Metadata Construction
• Users can directly input statistics on:
• Relational Tables (row cardinality, row length, disk blocks)
• Attribute Columns (column width, number of distinct values, value distribution histograms)
• Attribute Indexes (number of leaf blocks, clustering factor)
• System Parameters (cores, memory size, CPU utilization)
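To make this concrete, here is a minimal sketch (in Python; the field names are illustrative and are not CODD's actual interface) of a statistics-only table description. Only numbers like these are written into the engine's catalog; the large table itself is never materialised:

```python
# Illustrative statistics-only description of a table that does not exist.
# (Field names are hypothetical, NOT CODD's actual schema.)
table_stats = {
    "name": "ORDERS",
    "row_cardinality": 10**15,    # rows the optimizer is told exist
    "row_length": 120,            # bytes per row
    "disk_blocks": 10**13,
    "columns": {
        "o_totalprice": {
            "width": 8,
            "distinct_values": 10**6,
            # equi-depth histogram: (bucket upper bound, fraction of rows)
            "histogram": [(100.0, 0.25), (500.0, 0.25),
                          (2000.0, 0.25), (10000.0, 0.25)],
        },
    },
    "indexes": {
        "orders_pk": {"leaf_blocks": 10**12, "clustering_factor": 10**13},
    },
}

# Sanity check: histogram bucket fractions must sum to 1.
fractions = [f for _, f in table_stats["columns"]["o_totalprice"]["histogram"]]
print(sum(fractions))  # 1.0
```

Because only the catalog is populated, "creating" a petabyte-scale table takes milliseconds, which is what makes yottabyte-scale what-if testing feasible at all.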
Graphical Histogram
Metadata Validation
Need to ensure that the input information is
– Legal (valid type and range)
– Consistent (compatible with other metadata values)
Validation Approach
– Construct a directed acyclic constraint graph CG(V,E)
– V is the set of individual metadata entities, while E is the set of statistical value dependencies
– Super Nodes: used to represent a collapsed chain of nodes, for compactness
– Run a topological sort on CG to obtain CG_linear
– CODD uses this linear ordering to guide the user
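The validation ordering can be sketched in a few lines of Python (a toy graph with assumed dependencies, not the actual DB2 constraint graph): a topological sort guarantees that each metadata value is checked only after every value it depends on has been fixed.

```python
# Toy constraint graph (assumed dependencies, NOT the real DB2 graph).
# Mapping: node -> set of nodes it depends on. For example, a column's
# number of distinct values cannot exceed the table's row cardinality.
from graphlib import TopologicalSorter

constraint_graph = {
    "row_cardinality": set(),
    "distinct_values": {"row_cardinality"},
    "histogram": {"distinct_values"},
}

# The linear ordering that guides the user through metadata entry.
order = list(TopologicalSorter(constraint_graph).static_order())
print(order)  # ['row_cardinality', 'distinct_values', 'histogram']

def validate(stats, order):
    """Consistency checks run in dependency order (illustrative rules only)."""
    for entity in order:
        if entity == "distinct_values":
            assert stats["distinct_values"] <= stats["row_cardinality"]
        elif entity == "histogram":
            assert len(stats["histogram"]) <= stats["distinct_values"]

validate({"row_cardinality": 1000, "distinct_values": 100,
          "histogram": [10, 20, 70]}, order)
```

Processing in this order means a violation is always reported against the value just entered, never against one the user has yet to supply.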
Constraint Graph [DB2]
[Diagram: the constraint graph for DB2 metadata. Annotations: legality constraints; statistical dependencies, with direction chosen as per the abstraction hierarchy; super nodes; dashed edges representing missing constraints; the node processing order.]
Unique features of CODD
• Supports creation of arbitrary “what-if” scenarios
• Carries out automatic validation of user input
• Supports both space-based scaling and time-based scaling
• Provides graphical histogram operations
• Supports inter-engine metadata transfer
• Successfully simulated a yottabyte environment on a laptop
• Demonstrated a deep bug in a popular commercial DBMS that only surfaces at Big Data scale
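A rough sketch of what space-based scaling amounts to (illustrative only, not CODD's actual scaling model): multiply the size-related catalog statistics by a factor alpha to mimic a database alpha times larger, without creating any data. Real tools must also decide how non-linear statistics such as distinct counts should scale; here everything scales linearly for simplicity.

```python
# Hypothetical space-based scaling of a statistics-only table description.
def scale_space(stats, alpha):
    """Return a copy of stats with size statistics multiplied by alpha."""
    scaled = dict(stats)
    for key in ("row_cardinality", "disk_blocks"):
        if key in scaled:
            scaled[key] = int(scaled[key] * alpha)
    return scaled

base = {"row_cardinality": 10**6, "disk_blocks": 10**4, "row_length": 100}
big = scale_space(base, 10**18)   # push a toy table toward yottabyte scale

print(big["row_cardinality"])     # 10**24 -- only the catalog changes
print(big["row_length"])          # 100 -- per-row statistics are unchanged
```

This is why a yottabyte environment fits on a laptop: the cost of "scaling" is rewriting a handful of integers, not generating bytes.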
Take Away
Research on Automating Big Data Testing is great technical fun with immediate practical relevance ...
Stop Protesting, be Pro-Testing!