
Capybara Hive Integration Testing

Issues We’ve Seen at Hortonworks

• Many tests for different permutations
  – e.g. does it work with Orc, with Parquet, with Text
• Can’t run Hive tests on a cluster
  – Forces QE to rewrite tests from scratch, hard to share resources with dev
• Tests are all small, no ability to scale
• Golden files are a grievous evil
  – Test writers have to eyeball results, error prone
  – Small change in query plan forces hundreds of expected output changes
• QE and dev working in different languages and frameworks
• It’s hard to get user queries with user-like data into the framework
  – Tests built based on feature testing and bug fixing, not user experience

Proposed Requirements

• One test should run in all reasonable permutations
  – Spark/Tez, Orc/Parquet/Text, secure/non-secure, etc.
  – Tests can specify which options make no sense for them
• Same tests locally and on cluster
• Auto-generation of data and expected results
  – At varying scales
  – Expected results generated by source of truth, won’t work for all but should cover 80%
• Programmatic access to query plan
  – Add tools to make it easy to find tasks, operators, and patterns
• Java, runs in JUnit
• Ability to simulate user data and run user queries
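The first requirement (one test, every reasonable permutation, with per-test opt-outs) can be pictured as a cross product with an exclusion filter. The sketch below is only an illustration under that assumption; `PermutationSketch`, its method, and the exclusion predicate are hypothetical names, not part of Capybara.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: enumerate engine/file-format permutations and let a
// test exclude the ones that make no sense for it. Not Capybara's actual code.
class PermutationSketch {

  // Build every "engine/format" combination, skipping those the test excludes.
  static List<String> permutations(List<String> engines, List<String> formats,
                                   Predicate<String> excluded) {
    List<String> result = new ArrayList<>();
    for (String engine : engines) {
      for (String format : formats) {
        String combo = engine + "/" + format;
        if (!excluded.test(combo)) {
          result.add(combo);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> engines = Arrays.asList("spark", "tez");
    List<String> formats = Arrays.asList("orc", "parquet", "text");
    // Assumed example: this particular test cannot run on spark.
    List<String> runs = permutations(engines, formats, c -> c.startsWith("spark/"));
    System.out.println(runs);
  }
}
```

Keeping the exclusion with the test, rather than in the runner, matches the idea that tests themselves declare which options do not apply to them.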

What’s There Today

• Automated data generation (random, stats based, dev specified)
• Data loaded into Hive and benchmark
  – State remembered so that tables not created for every test
• Queries run against Hive and benchmark
• Comparison of select queries and insert statements
• Works on dev’s machine or against a cluster
  – Dev’s machine: miniclusters and Derby
  – Cluster: user provided cluster and Postgres
• A few basic tables provided for tests
  – alltypes, capysrc, capysrcpart, TPC-H like tables
• UserQueryGenerator
  – Takes in set of user queries
  – Reads user’s metastore (user has to first run analyze table on included tables)
  – Generates Java test file that builds simulated data
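To make the random data generation idea concrete, here is a minimal sketch of seeded generation at a chosen scale. It assumes a two-column row (a long and a short lowercase string) rendered as a comma-separated line; `DataGenSketch` and its parameters are hypothetical and do not reflect Capybara's generator classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of seeded random data generation. A fixed seed makes
// the generated rows reproducible, so the same data can be loaded into both
// the system under test and the benchmark.
class DataGenSketch {

  // Generate `scale` rows, each "<long>,<1-10 lowercase letters>".
  static List<String> generate(long seed, int scale) {
    Random rand = new Random(seed);
    List<String> rows = new ArrayList<>(scale);
    for (int i = 0; i < scale; i++) {
      long num = rand.nextLong();
      int len = 1 + rand.nextInt(10);
      StringBuilder str = new StringBuilder();
      for (int j = 0; j < len; j++) {
        str.append((char) ('a' + rand.nextInt(26)));
      }
      rows.add(num + "," + str);
    }
    return rows;
  }

  public static void main(String[] args) {
    // Same seed and scale produce identical rows on every call.
    System.out.println(generate(42L, 3).equals(generate(42L, 3)));
  }
}
```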

What’s There Today Continued

• SQL Ansifier
  – takes Hive query and converts to ANSI SQL to run against benchmark (incomplete)
• A given run of tests can be configured with a set of features
  – e.g. file format=orc, engine=tez
• Annotations
  – ignore a test when inappropriate with configured features (e.g. no acid when spark is the engine)
  – set configuration for features (e.g. @AcidOn)
• Scale can be set
• User can provide custom benchmark and comparator
• Programmatic access to query plan
  – very limited tools today, need more work here
• Initial patch posted to HIVE-12316
  https://issues.apache.org/jira/browse/HIVE-12316
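One way an annotation could mark a test as inappropriate for a configured feature is a runtime-retained annotation checked through reflection. The sketch below assumes that approach; `@SkipForEngine` and `AnnotationSketch` are hypothetical names, and Capybara's real annotations (such as @AcidOn) may be defined and processed differently.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Arrays;

// Hypothetical sketch: decide whether to skip a test method based on an
// annotation and the engine configured for this run.
class AnnotationSketch {

  @SkipForEngine({"spark"})
  public void acidTest() { }

  public void plainTest() { }

  // True if the named public, no-argument method is annotated to be skipped
  // for the given engine.
  static boolean shouldSkip(Class<?> testClass, String methodName, String engine) {
    try {
      Method m = testClass.getMethod(methodName);
      SkipForEngine skip = m.getAnnotation(SkipForEngine.class);
      return skip != null && Arrays.asList(skip.value()).contains(engine);
    } catch (NoSuchMethodException e) {
      throw new IllegalArgumentException("No such test method: " + methodName, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(shouldSkip(AnnotationSketch.class, "acidTest", "spark"));
    System.out.println(shouldSkip(AnnotationSketch.class, "acidTest", "tez"));
  }
}

// Hypothetical annotation; RUNTIME retention is needed for reflection to see it.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SkipForEngine {
  String[] value();
}
```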

Missing Pieces

• Limited working options
  – Need to add HBase metastore, LLAP, Spark, security, Hive Streaming, ...
  – Tez there but SUPER slow
  – JDBC in process
  – binary data, complex types don’t work
  – parallel data generation and comparison written but not yet tested
  – Not yet a way to set or switch users (for security tests)
• Limited usage testing
  – Many options haven’t been tried and I’m sure some don’t work
  – Limited qfiles converted

Example Test

    @Test
    public void simple() throws Exception {
      TableTool.createAllTypes();
      runQuery("select cvarchar from alltypes");
      sortAndCompare();
    }
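A query without an order by gives no guarantee that Hive and the benchmark return rows in the same order, which is presumably why this test sorts before comparing. The sketch below illustrates the difference between an order-sensitive and an order-insensitive comparison, assuming each row is rendered as a string; `ResultCompareSketch` is a hypothetical class, not Capybara's comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of comparing result rows from the system under test
// against rows from a benchmark.
class ResultCompareSketch {

  // Order-sensitive: rows must match position by position.
  static boolean compare(List<String> hive, List<String> bench) {
    return hive.equals(bench);
  }

  // Order-insensitive: sort copies of both sides, then compare.
  static boolean sortAndCompare(List<String> hive, List<String> bench) {
    List<String> a = new ArrayList<>(hive);
    List<String> b = new ArrayList<>(bench);
    Collections.sort(a);
    Collections.sort(b);
    return a.equals(b);
  }

  public static void main(String[] args) {
    List<String> hive = Arrays.asList("b", "a", "c");
    List<String> bench = Arrays.asList("a", "b", "c");
    System.out.println(compare(hive, bench));
    System.out.println(sortAndCompare(hive, bench));
  }
}
```

Sorting copies leaves the original result lists untouched, and duplicate rows are still counted because sorting preserves multiplicity.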

Example Test

    @Test
    public void simpleJoin() throws Exception {
      TableTool.createPseudoTpch();
      runQuery("select p_name, avg(l_price) " +
          "from ph_lineitem join ph_part " +
          "on (l_partkey = p_partkey) " +
          "group by p_name " +
          "order by p_name");
      compare();
    }

Example Test

    @Test
    public void q1() throws Exception {
      set("hive.auto.convert.join", true);
      runQuery("drop table if exists t");
      runQuery("create table t (a string, b bigint); ");
      runQuery("insert into t select c, d from u;");
      IMetaStoreClient msClient = new HiveMetaStoreClient(new HiveConf());
      Table msTable = msClient.getTable("default", "t");
      TestTable tTable = new TestTable(msTable);
      tableCompare(tTable);
    }

Example Explain

    @Test
    public void explain() throws Exception {
      TableTool.createCapySrc();
      Explain explain = explain("select k,value from capysrc order by k");

      // Expect that somewhere in the plan is a MapRedTask.
      MapRedTask mrTask = explain.expect(MapRedTask.class);

      // Find all scans in the MapRedTask.
      List<TableScanOperator> scans = explain.findAll(mrTask, TableScanOperator.class);
      Assert.assertEquals(1, scans.size());
    }
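Finding every operator of a given type under a task amounts to a typed search over a tree. The sketch below shows that idea on a stand-in node type; `PlanSearchSketch`, `Node`, `Scan`, and `Sort` are hypothetical and are not Hive's task or operator classes, and this `findAll` is not the implementation behind the `Explain` helper above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: depth-first search of a plan-like tree for all nodes
// of a requested class.
class PlanSearchSketch {

  // Minimal stand-in for a task or operator with children.
  static class Node {
    final List<Node> children = new ArrayList<>();

    Node add(Node child) {
      children.add(child);
      return this;
    }
  }

  static class Scan extends Node { }

  static class Sort extends Node { }

  // Collect every node of the given type at or below root, in pre-order.
  static <T extends Node> List<T> findAll(Node root, Class<T> type) {
    List<T> found = new ArrayList<>();
    collect(root, type, found);
    return found;
  }

  private static <T extends Node> void collect(Node node, Class<T> type, List<T> found) {
    if (type.isInstance(node)) {
      found.add(type.cast(node));
    }
    for (Node child : node.children) {
      collect(child, type, found);
    }
  }

  public static void main(String[] args) {
    Node plan = new Node().add(new Sort().add(new Scan()));
    System.out.println(findAll(plan, Scan.class).size());
  }
}
```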

Run a Test

• Locally, use default options
    mvn test -Dtest=TestSkewJoin
• Locally, specify using tez
    mvn test -Dtest=TestSkewJoin -Dhive.test.capybara.engine=tez
• On a cluster
    mvn test -Dtest=TestSkewJoin -Dhive.test.capybara.use.cluster=true -DHADOOP_HOME=your_hadoop_path -DHIVE_HOME=your_hive_path
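Options passed with -D arrive in the test JVM as Java system properties. The sketch below shows one way a test harness might read them, using the property names from the commands above. `RunConfigSketch` is hypothetical, and the `"unspecified"` fallback is a placeholder because the slides do not say which engine is the default.

```java
// Hypothetical sketch of reading run options from system properties.
class RunConfigSketch {

  // Engine requested for this run; the fallback value is a placeholder only.
  static String engine() {
    return System.getProperty("hive.test.capybara.engine", "unspecified");
  }

  // Whether to run against a user-provided cluster; a local run is assumed
  // when the property is absent.
  static boolean useCluster() {
    return Boolean.parseBoolean(
        System.getProperty("hive.test.capybara.use.cluster", "false"));
  }

  public static void main(String[] args) {
    System.out.println("engine=" + engine() + ", useCluster=" + useCluster());
  }
}
```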

Simulate User Queries

• Select the queries and create one file for each test (a file may contain more than one query)
• Run analyze table with column stats collection for each table that holds source data
• Then run the generator, which outputs TestQueries.java
    hive --service capygen -i queries/*.sql -o TestQueries

Questions
