Outline of Tutorial • Hadoop and Pig Overview • Hands-on
-
Upload
nguyenkhue -
Category
Documents
-
view
221 -
download
1
Transcript of Outline of Tutorial • Hadoop and Pig Overview • Hands-on
![Page 1: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/1.jpg)
Outline of Tutorial
• Hadoop and Pig Overview • Hands-on
1
![Page 2: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/2.jpg)
Hadoop and Pig Overview
Lavanya Ramakrishnan Shane Canon
Lawrence Berkeley National Lab
October 2011
![Page 3: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/3.jpg)
Overview
• Concepts & Background – MapReduce and Hadoop
• Hadoop Ecosystem – Tools on top of Hadoop
• Hadoop for Science – Examples, Challenges
• Programming in Hadoop – Building blocks, Streaming, C-HDFS API
3
![Page 4: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/4.jpg)
Processing Big Data
• Internet scale generates BigData – Terabytes of data/day – just reading 100 TB can be overwhelming
• using clusters of standard commodity computers for linear scalability
• Timeline – Nutch open source search project
(2002-2004) – MapReduce & DFS implementation and
Hadoop splits out of Nutch (2004-2006)
4
![Page 5: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/5.jpg)
MapReduce
• Computation performed on large volumes of data in parallel – divide workload across large number of
machines – need a good data management scheme to
handle scalability and consistency • Functional programming concepts
– map – reduce
5 OSDI 2004
![Page 6: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/6.jpg)
Mapping
• Map input to an output using some function
• Example – string manipulation
6
![Page 7: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/7.jpg)
Reduces
• Aggregate values together to provide summary data
• Example – addition of the list of numbers
7
![Page 8: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/8.jpg)
Google File System
• Distributed File System – accounts for component failure – multi-GB files and billions of objects
• Design – single master with multiple chunkservers
per master – file represented as fixed-sized chunks – 3-way mirrored across chunkservers
8
![Page 9: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/9.jpg)
Hadoop
• Open source reliable, scalable distributed computing platform – implementation of MapReduce – Hadoop Distributed File System (HDFS) – runs on commodity hardware
• Fault Tolerance – restarting tasks – data replication
• Speculative execution – handles stragglers
9
![Page 10: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/10.jpg)
HDFS Architecture
10
![Page 11: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/11.jpg)
HDFS and other Parallel Filesystems
HDFS GPFS and Lustre Typical Replication 3 1 Storage Location Compute Node Servers Access Model Custom (except with
Fuse) POSIX
Stripe Size 64 MB 1 MB Concurrent Writes No Yes Scales with # of Compute Nodes # of Servers Scale of Largest Systems
O(10k) Nodes O(100) Servers
User/Kernel Space User Kernel
![Page 12: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/12.jpg)
Who is using Hadoop?
• A9.com • Amazon • Adobe • AOL • Baidu • Cooliris • Facebook • NSF-Google
university initiative
• IBM • LinkedIn • Ning • PARC • Rackspace • StumbleUpon • Twitter • Yahoo!
12
![Page 13: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/13.jpg)
Hadoop Stack
Core Avro
MapReduce HDFS
Pig Chukwa Hive HBase
Source: Hadoop: The Definitive Guide
Zoo Keeper
13
Constantly evolving!
![Page 14: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/14.jpg)
Google Vs Hadoop
Google Hadoop MapReduce Hadoop MapReduce GFS HDFS Sawzall Pig, Hive BigTable Hbase Chubby Zookeeper Pregel Hama, Giraph
14
![Page 15: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/15.jpg)
Pig
• Platform for analyzing large data sets • Data-flow oriented language “Pig Latin”
– data transformation functions – datatypes include sets, associative arrays,
tuples – high-level language for marshalling data
• Developed at Yahoo!
15
![Page 16: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/16.jpg)
Hive
• SQL-based data warehousing application – features similar to Pig – more strictly SQL-type
• Supports SELECT, JOIN, GROUP BY, etc
• Analyzing very large data sets – log processing, text mining, document
indexing • Developed at Facebook
16
![Page 17: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/17.jpg)
HBase
• Persistent, distributed, sorted, multidimensional, sparse map – based on Google BigTable – provides interactive access to information
• Holds extremely large datasets (multi-TB)
• High-speed lookup of individual (row, column)
17
![Page 18: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/18.jpg)
ZooKeeper
• Distributed consensus engine – runs on a set of servers and maintains
state consistency • Concurrent access semantics
– leader election – service discovery – distributed locking/mutual exclusion – message board/mailboxes – producer/consumer queues, priority
queues and multi-phase commit operations
18
![Page 19: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/19.jpg)
Other Related Projects [1/2]
• Chukwa – Hadoop log aggregation • Scribe – more general log aggregation • Mahout – machine learning library • Cassandra – column store database on a P2P
backend • Dumbo – Python library for streaming • Spark – in memory cluster for interactive and
iterative • Hadoop on Amazon – Elastic MapReduce
19
![Page 20: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/20.jpg)
Other Related Projects [2/2]
• Sqoop – import SQL-based data to Hadoop • Jaql – JSON (JavaScript Object Notation)
based semi-structured query processing • Oozie – Hadoop workflows • Giraph – Large scale graph processing on
Hadoop • Hcatlog – relational view of HDFS • Fuse-DS – POSIX interface to HDFS
20
![Page 21: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/21.jpg)
Hadoop for Science
21
![Page 22: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/22.jpg)
Magellan and Hadoop
• DOE funded project to determine appropriate role of cloud computing for DOE/SC midrange workloads
• Co-located at Argonne Leadership Computing Facility (ALCF) and National Energy Research Scientific Center (NERSC)
• Hadoop/Magellan research questions – Are the new cloud programming models
useful for scientific computing?
– 22
![Page 23: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/23.jpg)
Data Intensive Science
• Evaluating hardware and software choices for supporting next generation data problems
• Evaluation of Hadoop – using mix of synthetic benchmarks and
scientific applications – understanding application characteristics
that can leverage the model • data operations: filter, merge, reorganization • compute-data ratio
23
(collaboration w/ Shane Canon, Nick Wright, Zacharia Fadika)
![Page 24: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/24.jpg)
MapReduce and HPC
• Applications that can benefit from MapReduce/Hadoop – Large amounts of data processing – Science that is scaling up from the
desktop – Query-type workloads
• Data from Exascale needs new technologies – Hadoop On Demand lets one run Hadoop
through a batch queue 24
![Page 25: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/25.jpg)
Hadoop for Science
• Advantages of Hadoop – transparent data replication, data locality
aware scheduling – fault tolerance capabilities
• Hadoop Streaming – allows users to plug any binary as maps
and reduces – input comes on standard input
25
![Page 26: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/26.jpg)
BioPig
• Analytics toolkit for Next-Generation Sequence Data
• User defined functions (UDF) for common bioinformatics programs – BLAST, Velvet – readers and writers for FASTA and FASTQ – pack/unpack for space conservation with
DNA sequenceså
26
![Page 27: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/27.jpg)
Application Examples
• Bioinformatics applications (BLAST) – parallel search of input sequences – Managing input data format
• Tropical storm detection – binary file formats can’t be handled in
streaming • Atmospheric River Detection
– maps are differentiated on file and parameter
27
![Page 28: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/28.jpg)
“Bring your application” Hadoop workshop
• When: TBD • Send us email if you are interested
– [email protected] – [email protected]
• Include a brief description of your application.
28
![Page 29: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/29.jpg)
HDFS vs GPFS (Time)
0
2
4
6
8
10
12
0 500 1000 1500 2000 2500 3000
Tim
e (m
inut
es)
Number of maps
Teragen (1TB) HDFS
GPFS
Linear(HDFS) Expon.(HDFS) Linear(GPFS) Expon.(GPFS)
29
![Page 30: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/30.jpg)
• Wikipedia data set • On ~ 75 nodes,
GPFS performs better with large nodes
Application Characteristic Affect Choices
• Iden%cal data loads and processing load
• Amount of wri%ng in applica%on affects performance
![Page 31: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/31.jpg)
• Deployment – all jobs run as user “hadoop” affecting file permissions – less control on how many nodes are used - affects allocation policies
• Programming: No turn-key solution – using existing code bases, managing input formats and data
• Additional benchmarking, tuning needed, Plug-ins for Science
Hadoop: Challenges
31
![Page 32: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/32.jpg)
Comparison of MapReduce Implementations
32
40 50 60 70 80 90
050
100
150
Output data size (MB)
Proc
essin
g tim
e (s
)
64 core Twister Cluster64 core Hadoop Cluster64 core LEMO−MR Cluster
Collaboration w/ Zacharia Fadika, Elif Dede, Madhusudhan Govindaraju, SUNY Binghamton
0 10 20 30 40 50 60
010
2030
4050
Cluster size (cores)
Spee
dup
!
!
!
!
!
!! 64 core Twister Cluster64 core LEMO−MR Cluster64 core Hadoop Cluster
Hadoop Twister LEMO−MR
010
20 node1node2node3
Hadoop Twister LEMO−MR
Num
ber o
f wor
ds p
roce
ssed
(Billi
on)
010
20
node1: (Under stress)node2node3
Producing random floating point numbers
Load balancing
Processing 5 million 33 x 33 matrices
![Page 33: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/33.jpg)
Programming Hadoop
33
![Page 34: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/34.jpg)
• Map and reduce as Java programs using Hadoop API • Pipes and Streaming can help with existing applications in other languages • C- HDFS API • Higher-level languages such as Pig might help with some applications
Programming with Hadoop
34
![Page 35: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/35.jpg)
Keys and Values
• Maps and reduces produce key-value pairs – arbitrary number of values can be output – may map one input to 0,1, ….100 outputs – reducer may emit one or more outputs
• Example: Temperature recordings – 94089 8:00 am, 59 – 27704 6:30 am, 70 – 94089 12:45 pm, 80 – 47401 1 pm, 90
35
![Page 36: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/36.jpg)
Keys divide the reduce space
36
![Page 37: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/37.jpg)
Data Flow
37
![Page 38: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/38.jpg)
Mechanics[1/2]
• Input files – large 10s of GB or more, typically in HDFS – line-based, binary, multi-line, etc.
• InputFormat – function defines how input files are split up and
read – TextInputFormat (default), KeyValueInputFormat,
SequenceFileInputFormat • InputSplits
– unit of work that comprises a single map task – FileInputFormat divides it into 64MB chunks
38
![Page 39: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/39.jpg)
Mechanics [2/2]
• RecordReader – loads data and converts to key value pair
• Sort & Partiton & Shuffle – intermediate data from map to reducer
• Combiner – reduce data on a single machine
• Mapper & Reducer • OutputFormat, RecordWriter
39
![Page 40: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/40.jpg)
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text();
public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } }
Word Count Mapper
40
![Page 41: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/41.jpg)
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }
Word Count Reducer
41
![Page 42: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/42.jpg)
public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs(); …. Job job = new Job(conf, "word count"); job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])); System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Word Count Example
42
![Page 43: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/43.jpg)
• Allows C++ code to be used for Mapper and Reducer • Both key and value inputs to pipes programs are provided as std::string • $ hadoop pipes
Pipes
43
![Page 44: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/44.jpg)
• Limited C API to read and write from HDFS #include "hdfs.h" int main(int argc, char **argv) { hdfsFS fs = hdfsConnect("default", 0); hdfsFile writeFile = hdfsOpenFile(fs, writePath,
O_WRONLY|O_CREAT, 0, 0, 0); tSize num_written_bytes = hdfsWrite(fs, writeFile,
(void*)buffer, strlen(buffer)+1); hdfsCloseFile(fs, writeFile); }
C-HDFS API
44
![Page 45: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/45.jpg)
• Generic API that allows programs in any language to be used as Hadoop Mapper and Reducer implementations • Inputs written to stdin as strings with tab character separating • Output to stdout as key \t value \n • $ hadoop jar contrib/streaming/hadoop-[version]-streaming.jar
Hadoop Streaming
45
![Page 46: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/46.jpg)
• Test core functionality separate • Use Job Tracker • Run “local” in Hadoop • Run job on a small data set on a single node • Hadoop can save files from failed tasks
Debugging
46
![Page 47: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/47.jpg)
Pig – Basic Operations
• LOAD – loads data into a relational form
• FOREACH..GENERATE – Adds or removes fields (columns)
• GROUP – Group data on a field • JOIN – Join two relations • DUMP/STORE – Dump query to
terminal or file There are others, but these will be used
for the exercises today
![Page 48: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/48.jpg)
Pig Example
Find the number of gene hits for each model in an hmmsearch (>100GB of output, 3 Billion Lines)
bash# cat * |cut –f 2|sort|uniq -c
> hits = LOAD ’/data/bio/*' USING PigStorage() AS (id:chararray,model:chararray, value:float);!
> amodels = FOREACH hits GENERATE model;!> models = GROUP amodels BY model;!> counts = FOREACH models GENERATE group,COUNT
(amodels) as count;!> STORE counts INTO 'tcounts' USING PigStorage();!
![Page 49: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/49.jpg)
Pig - LOAD
Example:
hits = LOAD 'load4/*' USING PigStorage() AS (id:chararray, model:chararray,value:float);!
• Pig has several built-in data types (chararray, float, integer)
• PigStorage can parse standard line oriented text files. • Pig can be extended with custom load types written in
Java. • Pig doesn’t read any data until triggered by a DUMP or
STORE
![Page 50: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/50.jpg)
Pig – FOREACH..GENERATE, GROUP
Example:
amodel = FOREACH model GENERATE hits;!models = GROUP amodels BY model;!counts = FOREACH models GENERATE group,COUNT
(amodels) as count;!
• Use FOREACH..GENERATE to pick of specific fields or generate new fields. Also referred to as a projection
• GROUP will create a new record with the group name and a “bag” of the tuples in each group
• You can reference a specific field in a bag with <bag>.field (i.e. amodels.model)
• You can use aggregate functions like COUNT, MAX, etc on a bag
![Page 51: Outline of Tutorial • Hadoop and Pig Overview • Hands-on](https://reader033.fdocuments.us/reader033/viewer/2022051716/58a1aba81a28ab2a038b75e7/html5/thumbnails/51.jpg)
Pig – Important Points
• Nothing really happens until a DUMP or STORE is performed.
• Use FILTER and FOREACH early to remove unneeded columns or rows to reduce temporary output
• Use PARALLEL keyword on GROUP operations to run more reduce tasks