Big Data Testing

Transcript of Big Data Testing

Page 1: Big Data Testing

Emerging trends in DW Testing

• Big data
• Mobile BI

Page 2: Big Data Testing

Big Data

3 Vs: Volume, Velocity and Variety

Silent 4th V: “Value”

Page 3: Big Data Testing

Hadoop – Big data framework

• Hadoop is an open-source framework for handling big data processing (inspired by papers published by Google on GFS and MapReduce).
• 2 main components:
• HDFS – Hadoop Distributed File System
• MapReduce – programming framework for aggregations and counts, written in Java

• Other tools supported by Hadoop:
• Apache Pig – higher-level abstraction language for processing data
• Apache Hive – SQL-like language to process data stored in Hadoop; not fully SQL compliant

Page 4: Big Data Testing

HDFS key points

• HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce.
• By distributing storage and computation across many servers, the combined storage resource can grow with demand while remaining economical at every size.
• Files are split into blocks (the single unit of storage)
– Managed by the Namenode, stored by the Datanodes
– Transparent to the user
• Replicated across machines at load time
– The same block is stored on multiple machines
– Good for fault-tolerance and access
– Default replication is 3 (see the sketch below)
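To make the block and replication points concrete, here is a minimal Java sketch that asks the HDFS API for a file's block size and replication factor (the file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        // The path below is illustrative; point it at any file in your cluster
        FileStatus status = hdfs.getFileStatus(new Path("/user/hadoop/dir1/file1.txt"));
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication()); // default is 3
    }
}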

Page 5: Big Data Testing

HDFS CLI

• mkdir – Takes path URIs as arguments and creates the directory or directories.
• hadoop fs -mkdir <paths>
• Eg- hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
      hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir

• ls – Lists the contents of a directory
• hadoop fs -ls <args>
• Eg- hadoop fs -ls /user/hadoop/dir1

• put – Copies a single src file, or multiple src files, from the local file system to the Hadoop file system
• hadoop fs -put <localsrc> ... <HDFS_dest_Path>

• get – Copies/downloads files from HDFS to the local file system
• hadoop fs -get <hdfs_src> <localdst>
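Following the Eg- convention above, illustrative put and get invocations might look like this (the local and HDFS paths are made up for the example):

hadoop fs -put /tmp/datafile1.txt /user/hadoop/dir1
hadoop fs -get /user/hadoop/dir1/datafile1.txt /tmp/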

Page 6: Big Data Testing

Sample MapReduce

Page 7: Big Data Testing

Sample MapReduce

// Word-count mapper: emits (word, 1) for every token in the input line
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // key = word, value = 1
        }
    }
}

Page 8: Big Data Testing

// Word-count reducer: sums the 1s emitted for each word
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum)); // key = word, value = total count
    }
}
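The deck stops at the Reduce class; for completeness, here is a minimal sketch of the driver that wires the two together, along with the imports both classes need. The outer class name and argument handling are illustrative, not from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // The Map and Reduce classes from the previous two pages go here

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output locations are passed on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}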

Page 9: Big Data Testing

Sample Apache Pig query

• inpt = load '/path/to/corpus' using TextLoader as (line: chararray);
  words = foreach inpt generate flatten(TOKENIZE(line)) as word;
  grpd = group words by word;
  cntd = foreach grpd generate group, COUNT(words);
  dump cntd;

• The above Pig script first splits each line into words using the TOKENIZE operator, which produces a bag of words. The FLATTEN operator then turns the bag into individual tuples, one per word. In the third statement the words are grouped together, so that the count can be computed in the fourth.
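To make the data flow concrete, here is how one hypothetical input line would move through the script (the line and the resulting counts are purely illustrative):

-- input line
hello world hello

-- after TOKENIZE: a bag of words per line
{(hello),(world),(hello)}

-- after FLATTEN: one tuple per word
(hello)
(world)
(hello)

-- after GROUP and COUNT: what dump cntd prints
(hello,2)
(world,1)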

Page 10: Big Data Testing

Sample Hive query

• SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [HAVING having_condition]
  [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
  [LIMIT number]
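The slide shows only the general syntax; a concrete query in the same shape might look like this (the table and column names are hypothetical):

SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word
HAVING COUNT(*) > 10
SORT BY cnt DESC
LIMIT 20;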

Page 11: Big Data Testing

Other Big data technologies

• NoSQL – Characteristics:
• 1. No particular schema
• 2. Highly scalable
• 3. Availability
• 4. Speed of access

• MongoDB (document-oriented), Cassandra, HBase (Hadoop-based)

• All support a Java programming API; two of the above are built in Java.

• Each has its own query language – CQL, the HBase query language (a brief CQL example follows below)

• This technology deserves an entire presentation in itself
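For a flavour of these query languages, a basic CQL lookup is deliberately SQL-like (the table and column names here are hypothetical):

SELECT name, email
FROM users
WHERE user_id = 42;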

Page 12: Big Data Testing

Reference architecture

Page 13: Big Data Testing
Page 14: Big Data Testing

Different layers

• 1. Source-to-Hadoop data ingestion (EL)
• 2. Hadoop processing using MapReduce, Pig, Hive (T)
• 3. Loading of processed data from Hadoop to the EDW
• 4. BI reports and analytics using the processed data stored in Hadoop, via Hive

Page 15: Big Data Testing
Page 16: Big Data Testing

Validation of pre-Hadoop processing

• Data from various sources like weblogs, social media, transactional data etc. is loaded into Hadoop.

• Methodology:
• Compare the input data file against source-system data to ensure data is extracted correctly.
• Validate that the files are loaded into HDFS correctly.
• Compare using a script that reads data from HDFS (Java API) and checks it against the respective source data (a count-comparison sketch follows below).
• Sample testing. But identifying the sample can be a challenge since the volume is huge; consider critical business scenarios.
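One simple way to automate part of this check is to compare record counts between the source extract and the file landed in HDFS. A minimal sketch, assuming a hypothetical expected count obtained from the source system and an illustrative landing path:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestCountCheck {
    public static void main(String[] args) throws Exception {
        long expected = 1_000_000L; // hypothetical count taken from the source system
        FileSystem hdfs = FileSystem.get(new Configuration());
        long actual = 0;
        // The path is illustrative; point it at the landed file
        try (BufferedReader bfr = new BufferedReader(
                new InputStreamReader(hdfs.open(new Path("/landing/weblog/datafile1.txt"))))) {
            while (bfr.readLine() != null) {
                actual++;
            }
        }
        System.out.println(actual == expected
                ? "PASS: counts match (" + actual + ")"
                : "FAIL: expected " + expected + " but found " + actual);
    }
}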

Page 17: Big Data Testing

Code for accessing the HDFS file system

public static void main(String[] args) throws IOException {
    FileSystem hdfs = FileSystem.get(new Configuration());
    Path homeDir = hdfs.getHomeDirectory();
    // Print the home directory
    System.out.println("Home folder - " + homeDir);
}

Code for copying a file from the local file system to HDFS

Path localFilePath = new Path("c://localdata/datafile1.txt");
Path hdfsFilePath = new Path("hdfs://localhost:9000/test/file1.txt");
hdfs.copyFromLocalFile(localFilePath, hdfsFilePath);

Copying a file from HDFS to the local file system

localFilePath = new Path("c://hdfsdata/datafile1.txt");
hdfs.copyToLocalFile(hdfsFilePath, localFilePath);

Reading data from an HDFS file

BufferedReader bfr = new BufferedReader(new InputStreamReader(hdfs.open(hdfsFilePath)));
String str = null;
while ((str = bfr.readLine()) != null) {
    System.out.println(str);
}

Page 18: Big Data Testing

Validation of Hadoop processing

• Validate the business logic in standalone mode.
• Validate the MapReduce process to verify that the key-value pairs are generated correctly.
• Validate the output data against the source files to ensure data processing is done correctly.
• Create MRUnit tests (a testing framework for MapReduce). Low-level unit testing; see the sketch below.

• Alternate approach – write Pig scripts that implement the business logic and verify the results generated from MapReduce against them.
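A minimal MRUnit sketch for the word-count mapper shown on page 7, assuming the mrunit jar is on the classpath (the input text and the outer WordCount class name are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
    @Test
    public void mapperEmitsOnePerToken() throws Exception {
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCount.Map());
        // Expected outputs must appear in the order the mapper emits them
        driver.withInput(new LongWritable(0), new Text("big data big"))
              .withOutput(new Text("big"), new IntWritable(1))
              .withOutput(new Text("data"), new IntWritable(1))
              .withOutput(new Text("big"), new IntWritable(1))
              .runTest();
    }
}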

Page 19: Big Data Testing

Validation of data extract and load into the EDW (Sqoop)

• Validate that the transformation rules are applied correctly

• Validate that there is no data corruption, by comparing target-table data against the HDFS file data (Hive can be used; see the queries below)

• Validate the aggregation of data and data integrity
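One hedged way to do the Hive-based comparison is to run the same aggregate on both sides and check that the results match (the table and column names below are hypothetical):

-- in Hive, against the HDFS-backed table
SELECT COUNT(*), SUM(order_amount) FROM hdfs_orders;

-- in the EDW, against the loaded target table
SELECT COUNT(*), SUM(order_amount) FROM edw_orders;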

Page 20: Big Data Testing

Validation of reports (Hive/EDW)

• Validate by firing queries against HDFS using Hive

• Validate by running SQL against the EDW

• Beyond that, the normal report-testing approach applies

Page 21: Big Data Testing

Challenges

• Volume – Prepare comparison scripts to validate the data.

• To reduce the time, we can run all the comparison scripts in parallel, just as the data itself is processed in MapReduce. https://community.informatica.com/solutions/file_table_compare_utility_hdfs_hive

• Sample the data, ensuring the maximum number of scenarios is covered.

• Variety – Unstructured data can be converted to structured form and then compared.

Page 22: Big Data Testing

Non-functional testing

• Performance testing
• Failover testing
• Real-time support

Page 23: Big Data Testing

Summary

• 1. Data ingestion into Hadoop via Sqoop, Flume, Kafka

• 2. Data processing within Hadoop using MapReduce, Pig, Hive (or ETL tools like Informatica Big Data Edition, Talend)

• 3. Reporting using tools like Tableau and MicroStrategy (via Hive)

• 4. Loading of data from Hadoop into an EDW (Teradata/Oracle) or an analytical database (Greenplum/Netezza)

Page 24: Big Data Testing

Why Hadoop is gaining popularity

• 1. Open source
• 2. Can run on commodity servers
• 3. Can support any type of unstructured data (which makes it win over parallel databases)
• 4. Fault tolerant

Page 25: Big Data Testing

Use cases

• 1. ETL processing moved to Hadoop to take advantage of its processing of structured/unstructured data

• 2. Machine learning over Hadoop, e.g. recommendation engines (Amazon, Flipkart)

• 3. Fraud detection in the credit card or insurance industry

• 4. Retail – understanding customers' buying patterns, market basket analysis etc.

Page 26: Big Data Testing

Emerging trends in Big data

• 1. YARN – a generic framework over Hadoop which allows applications other than MapReduce to be built on top of Hadoop, and also allows other applications to integrate with Hadoop

• 2. Stream processing using Storm (again open source)

• 3. In-memory cluster computing using Apache Spark over Hadoop clusters

Page 27: Big Data Testing

Data Mining in Hadoop – Market Basket Analysis

• Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.

• Collect a list of the pairs of items that most frequently occur together in transactions at a store

Page 28: Big Data Testing

Initial data

• Transaction 1: cracker, icecream, soda
• Transaction 2: chicken, pizza, coke, bread
• Transaction 3: baguette, soda, herring, cracker, soda
• Transaction 4: bourbon, coke, turkey, bread
• Transaction 5: sardines, soda, chicken, coke
• Transaction 6: apples, peppers, avocado, steak
• Transaction 7: sardines, apples, peppers, avocado, steak
• …

Page 29: Big Data Testing

Data setup

• < (cracker, icecream), (cracker, soda), (icecream, soda) >

• < (chicken, pizza), (chicken, coke), (chicken, bread), (pizza, coke) ….. >

• < (baguette, soda), (baguette, herring), (baguette, cracker), (baguette, soda) >
• < (bourbon, coke), (bourbon, turkey) >
• < (sardines, soda), (sardines, chicken), (sardines, coke) >

Page 30: Big Data Testing

Map phase

• < ((cracker, icecream), 1), ((cracker, soda), 1), ((icecream, soda), 1) >

• < ((chicken, pizza), 1), ((chicken, coke), 1), ((chicken, bread), 1), … ((cracker, soda), 1) >

• Output ((key), value):
((cracker, icecream), 1)
((cracker, soda), 1)
((cracker, soda), 1)
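A minimal sketch of the mapper that produces this output, assuming one comma-separated transaction per input line (the summing reducer can be the same one shown on page 8):

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One comma-separated transaction per line, e.g. "cracker, icecream, soda"
        String[] items = value.toString().split(",\\s*");
        Arrays.sort(items); // canonical order, so (soda, cracker) and (cracker, soda) collapse
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                // Emit each pair in the transaction with a count of 1
                context.write(new Text("(" + items[i] + "," + items[j] + ")"), one);
            }
        }
    }
}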

Page 31: Big Data Testing

Reduce phase

• Input:
((cracker, icecream), <1, 1, 1 ….>)
((cracker, soda), <1, 1>)

• Output:
((cracker, icecream), 540)
((cracker, soda), 240)

Result:
1. Icecream should be placed near the crackers.
2. Offer some combos on the combination to increase sales.

Page 32: Big Data Testing

How to test the above code

• Create a sample dataset covering different types of item combinations and take the counts (using Excel).

• Run the MapReduce job flow for MBA.

• Verify that the results from steps 1 and 2 match.

Page 33: Big Data Testing

What next

• 1. Learn Java and MapReduce (not mandatory, but it will definitely help)

• 2. Learn Hadoop (Hadoop: The Definitive Guide by Tom White; Hadoop in Action by Chuck Lam)

• 3. Install a free VM (Cloudera/Hortonworks)
• 4. Learn some stuff related to Pig and Hive
• 5. There are plenty of tutorials on the net

Page 34: Big Data Testing

Thank you

• Questions?