Big Data Testing

Transcript of Big Data Testing

Page 1: Big Data Testing

Emerging trends in DW Testing

• Big data
• Mobile BI

Page 2: Big Data Testing

Big Data

3 Vs: Volume, Velocity and Variety

Silent 4th V: “Value”

Page 3: Big Data Testing

Hadoop – Big data framework

• Hadoop is an open-source framework for handling big data processing (inspired by papers published by Google on GFS and MapReduce).
• 2 main components:
• HDFS – Hadoop Distributed File System
• MapReduce – programming framework for aggregations and counts, written in Java

• Other tools supported by Hadoop:
• Apache Pig – higher-level abstraction language for processing data
• Apache Hive – SQL-like language to process data stored in Hadoop; not fully SQL compliant

Page 4: Big Data Testing

HDFS key points

• HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce.
• By distributing storage and computation across many servers, the combined storage resource can grow with demand while remaining economical at every size.
• Files are split into blocks (the single unit of storage)
– Managed by the Namenode, stored by the Datanodes
– Transparent to the user
• Replicated across machines at load time
– The same block is stored on multiple machines
– Good for fault-tolerance and access
– Default replication is 3 (see the sketch below)
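To make the block and replication points concrete, here is a minimal Java sketch that asks the HDFS API for a file's block size and replication factor (the file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        // The path below is illustrative; point it at any file in your cluster
        FileStatus status = hdfs.getFileStatus(new Path("/user/hadoop/dir1/file1.txt"));
        System.out.println("Block size  : " + status.getBlockSize());
        System.out.println("Replication : " + status.getReplication()); // default is 3
    }
}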

Page 5: Big Data Testing

HDFS CLI

• mkdir – Takes path URIs as arguments and creates the directory or directories.
• hadoop fs -mkdir <paths>
• Eg- hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
      hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir

• ls – Lists the contents of a directory
• hadoop fs -ls <args>
• Eg- hadoop fs -ls /user/hadoop/dir1

• put – Copies a single src file, or multiple src files, from the local file system to the Hadoop file system
• hadoop fs -put <localsrc> ... <HDFS_dest_Path>

• get – Copies/downloads files from HDFS to the local file system
• hadoop fs -get <hdfs_src> <localdst>
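Following the Eg- convention above, illustrative put and get invocations might look like this (the local and HDFS paths are made up for the example):

hadoop fs -put /tmp/datafile1.txt /user/hadoop/dir1
hadoop fs -get /user/hadoop/dir1/datafile1.txt /tmp/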

Page 6: Big Data Testing

Sample MapReduce

Page 7: Big Data Testing

Sample MapReduce

// Word-count mapper: emits (word, 1) for every token in the input line
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one); // key = word, value = 1
        }
    }
}

Page 8: Big Data Testing

// Word-count reducer: sums the 1s emitted for each word
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum)); // key = word, value = total count
    }
}
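The deck stops at the Reduce class; for completeness, here is a minimal sketch of the driver that wires the two together, along with the imports both classes need. The outer class name and argument handling are illustrative, not from the original slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // The Map and Reduce classes from the previous two pages go here

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output locations are passed on the command line
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}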

Page 9: Big Data Testing

Sample Apache Pig query

• inpt = load '/path/to/corpus' using TextLoader as (line: chararray);
  words = foreach inpt generate flatten(TOKENIZE(line)) as word;
  grpd = group words by word;
  cntd = foreach grpd generate group, COUNT(words);
  dump cntd;

• The above Pig script first splits each line into words using the TOKENIZE operator, which produces a bag of words. The FLATTEN operator then turns the bag into individual tuples, one per word. In the third statement the words are grouped together, so that the count can be computed in the fourth.
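To make the data flow concrete, here is how one hypothetical input line would move through the script (the line and the resulting counts are purely illustrative):

-- input line
hello world hello

-- after TOKENIZE: a bag of words per line
{(hello),(world),(hello)}

-- after FLATTEN: one tuple per word
(hello)
(world)
(hello)

-- after GROUP and COUNT: what dump cntd prints
(hello,2)
(world,1)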

Page 10: Big Data Testing

Sample Hive query

• SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [HAVING having_condition]
  [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
  [LIMIT number]
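The slide shows only the general syntax; a concrete query in the same shape might look like this (the table and column names are hypothetical):

SELECT word, COUNT(*) AS cnt
FROM words
GROUP BY word
HAVING COUNT(*) > 10
SORT BY cnt DESC
LIMIT 20;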

Page 11: Big Data Testing

Other Big data technologies

• NoSQL – Characteristics:
• 1. No particular schema
• 2. Highly scalable
• 3. Availability
• 4. Speed of access

• MongoDB (document-oriented), Cassandra, HBase (Hadoop-based)

• All support a Java programming API; two of the above are built in Java.

• Each has its own query language – CQL, the HBase query language (a brief CQL example follows below)

• This technology deserves an entire presentation in itself
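For a flavour of these query languages, a basic CQL lookup is deliberately SQL-like (the table and column names here are hypothetical):

SELECT name, email
FROM users
WHERE user_id = 42;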

Page 12: Big Data Testing

Reference architecture

Page 13: Big Data Testing
Page 14: Big Data Testing

Different layers

• 1. Source-to-Hadoop data ingestion (EL)
• 2. Hadoop processing using MapReduce, Pig, Hive (T)
• 3. Loading of processed data from Hadoop to the EDW
• 4. BI reports and analytics using the processed data stored in Hadoop, via Hive

Page 15: Big Data Testing
Page 16: Big Data Testing

Validation of pre-Hadoop processing

• Data from various sources like weblogs, social media, transactional data etc. is loaded into Hadoop.

• Methodology:
• Compare the input data file against source-system data to ensure data is extracted correctly.
• Validate that the files are loaded into HDFS correctly.
• Compare using a script that reads data from HDFS (Java API) and checks it against the respective source data (a count-comparison sketch follows below).
• Sample testing. But identifying the sample can be a challenge since the volume is huge; consider critical business scenarios.
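One simple way to automate part of this check is to compare record counts between the source extract and the file landed in HDFS. A minimal sketch, assuming a hypothetical expected count obtained from the source system and an illustrative landing path:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestCountCheck {
    public static void main(String[] args) throws Exception {
        long expected = 1_000_000L; // hypothetical count taken from the source system
        FileSystem hdfs = FileSystem.get(new Configuration());
        long actual = 0;
        // The path is illustrative; point it at the landed file
        try (BufferedReader bfr = new BufferedReader(
                new InputStreamReader(hdfs.open(new Path("/landing/weblog/datafile1.txt"))))) {
            while (bfr.readLine() != null) {
                actual++;
            }
        }
        System.out.println(actual == expected
                ? "PASS: counts match (" + actual + ")"
                : "FAIL: expected " + expected + " but found " + actual);
    }
}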

Page 17: Big Data Testing

Code for accessing the HDFS file system

public static void main(String[] args) throws IOException {
    FileSystem hdfs = FileSystem.get(new Configuration());
    Path homeDir = hdfs.getHomeDirectory();
    // Print the home directory
    System.out.println("Home folder - " + homeDir);
}

Code for copying a file from the local file system to HDFS

Path localFilePath = new Path("c://localdata/datafile1.txt");
Path hdfsFilePath = new Path("hdfs://localhost:9000/test/file1.txt");
hdfs.copyFromLocalFile(localFilePath, hdfsFilePath);

Copying a file from HDFS to the local file system

localFilePath = new Path("c://hdfsdata/datafile1.txt");
hdfs.copyToLocalFile(hdfsFilePath, localFilePath);

Reading data from an HDFS file

BufferedReader bfr = new BufferedReader(new InputStreamReader(hdfs.open(hdfsFilePath)));
String str = null;
while ((str = bfr.readLine()) != null) {
    System.out.println(str);
}

Page 18: Big Data Testing

Validation of Hadoop processing

• Validate the business logic in standalone mode.
• Validate the MapReduce process to verify that the key-value pairs are generated correctly.
• Validate the output data against the source files to ensure data processing is done correctly.
• Create MRUnit tests (a testing framework for MapReduce). Low-level unit testing; see the sketch below.

• Alternate approach – write Pig scripts that implement the business logic and verify the results generated from MapReduce against them.
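A minimal MRUnit sketch for the word-count mapper shown on page 7, assuming the mrunit jar is on the classpath (the input text and the outer WordCount class name are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
    @Test
    public void mapperEmitsOnePerToken() throws Exception {
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCount.Map());
        // Expected outputs must appear in the order the mapper emits them
        driver.withInput(new LongWritable(0), new Text("big data big"))
              .withOutput(new Text("big"), new IntWritable(1))
              .withOutput(new Text("data"), new IntWritable(1))
              .withOutput(new Text("big"), new IntWritable(1))
              .runTest();
    }
}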

Page 19: Big Data Testing

Validation of data extract and load into the EDW (Sqoop)

• Validate that the transformation rules are applied correctly

• Validate that there is no data corruption, by comparing target-table data against the HDFS file data (Hive can be used; see the queries below)

• Validate the aggregation of data and data integrity
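One hedged way to do the Hive-based comparison is to run the same aggregate on both sides and check that the results match (the table and column names below are hypothetical):

-- in Hive, against the HDFS-backed table
SELECT COUNT(*), SUM(order_amount) FROM hdfs_orders;

-- in the EDW, against the loaded target table
SELECT COUNT(*), SUM(order_amount) FROM edw_orders;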

Page 20: Big Data Testing

Validation of reports (Hive/EDW)

• Validate by firing queries against HDFS using Hive

• Validate by running SQL against the EDW

• Beyond that, the normal report-testing approach applies

Page 21: Big Data Testing

Challenges

• Volume – Prepare comparison scripts to validate the data.

• To reduce the time, we can run all the comparison scripts in parallel, just as the data itself is processed in MapReduce. https://community.informatica.com/solutions/file_table_compare_utility_hdfs_hive

• Sample the data, ensuring the maximum number of scenarios is covered.

• Variety – Unstructured data can be converted to structured form and then compared.

Page 22: Big Data Testing

Non-functional testing

• Performance testing
• Failover testing
• Real-time support

Page 23: Big Data Testing

Summary

• 1. Data ingestion into Hadoop via Sqoop, Flume, Kafka

• 2. Data processing within Hadoop using MapReduce, Pig, Hive (or ETL tools like Informatica Big Data Edition, Talend)

• 3. Reporting using tools like Tableau and MicroStrategy (via Hive)

• 4. Loading of data from Hadoop into an EDW (Teradata/Oracle) or an analytical database (Greenplum/Netezza)

Page 24: Big Data Testing

Why Hadoop is gaining popularity

• 1. Open source
• 2. Can run on commodity servers
• 3. Can support any type of unstructured data (which makes it win over parallel databases)
• 4. Fault tolerant

Page 25: Big Data Testing

Use cases

• 1. ETL processing moved to Hadoop to take advantage of its processing of structured/unstructured data

• 2. Machine learning over Hadoop, e.g. recommendation engines (Amazon, Flipkart)

• 3. Fraud detection in the credit card or insurance industry

• 4. Retail – understanding customers' buying patterns, market basket analysis etc.

Page 26: Big Data Testing

Emerging trends in Big data

• 1. YARN – a generic framework over Hadoop which allows applications other than MapReduce to be built on top of Hadoop, and also allows other applications to integrate with Hadoop

• 2. Stream processing using Storm (again open source)

• 3. In-memory cluster computing using Apache Spark over Hadoop clusters

Page 27: Big Data Testing

Data Mining in Hadoop – Market Basket Analysis

• Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.

• Collect a list of the pairs of items that most frequently occur together in transactions at a store

Page 28: Big Data Testing

Initial data

• Transaction 1: cracker, icecream, soda
• Transaction 2: chicken, pizza, coke, bread
• Transaction 3: baguette, soda, herring, cracker, soda
• Transaction 4: bourbon, coke, turkey, bread
• Transaction 5: sardines, soda, chicken, coke
• Transaction 6: apples, peppers, avocado, steak
• Transaction 7: sardines, apples, peppers, avocado, steak
• …

Page 29: Big Data Testing

Data setup

• < (cracker, icecream), (cracker, soda), (icecream, soda) >

• < (chicken, pizza), (chicken, coke), (chicken, bread), (pizza, coke) ….. >

• < (baguette, soda), (baguette, herring), (baguette, cracker), (baguette, soda) >
• < (bourbon, coke), (bourbon, turkey) >
• < (sardines, soda), (sardines, chicken), (sardines, coke) >

Page 30: Big Data Testing

Map phase

• < ((cracker, icecream), 1), ((cracker, soda), 1), ((icecream, soda), 1) >

• < ((chicken, pizza), 1), ((chicken, coke), 1), ((chicken, bread), 1), … ((cracker, soda), 1) >

• Output ((key), value):
((cracker, icecream), 1)
((cracker, soda), 1)
((cracker, soda), 1)
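A minimal sketch of the mapper that produces this output, assuming one comma-separated transaction per input line (the summing reducer can be the same one shown on page 8):

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PairMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // One comma-separated transaction per line, e.g. "cracker, icecream, soda"
        String[] items = value.toString().split(",\\s*");
        Arrays.sort(items); // canonical order, so (soda, cracker) and (cracker, soda) collapse
        for (int i = 0; i < items.length; i++) {
            for (int j = i + 1; j < items.length; j++) {
                // Emit each pair in the transaction with a count of 1
                context.write(new Text("(" + items[i] + "," + items[j] + ")"), one);
            }
        }
    }
}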

Page 31: Big Data Testing

Reduce phase

• Input:
((cracker, icecream), <1, 1, 1 ….>)
((cracker, soda), <1, 1>)

• Output:
((cracker, icecream), 540)
((cracker, soda), 240)

Result:
1. Icecream should be placed near the crackers.
2. Offer some combos on the combination to increase sales.

Page 32: Big Data Testing

How to test the above code

• Create a sample dataset covering different types of item combinations and take the counts (using Excel).

• Run the MapReduce job flow for MBA.

• Verify that the results from steps 1 and 2 match.

Page 33: Big Data Testing

What next

• 1. Learn Java and MapReduce (not mandatory, but it will definitely help)

• 2. Learn Hadoop (Hadoop: The Definitive Guide by Tom White; Hadoop in Action by Chuck Lam)

• 3. Install a free VM (Cloudera/Hortonworks)
• 4. Learn some stuff related to Pig and Hive
• 5. There are plenty of tutorials on the net

Page 34: Big Data Testing

Thank you

• Questions?