Lecture IV
More on Distributed File Systems, Space Filling Curves and MapReduce
Apache Hadoop
● Initially, Hadoop was infrastructure centered around HDFS for running MapReduce jobs.
● However, it has grown into a general-purpose big data framework.
● For modern versions of Hadoop, the most important components for Hadoop/MapReduce are
– YARN
– HDFS
– MapReduce Services
Hadoop MapReduce
The aspects of a MapReduce invocation are split into
● YARN components
– One central ResourceManager
– One NodeManager per Node
● HDFS components
– One central NameNode
– One DataNode per Node
● MapReduce components
– One central JobTracker
– One TaskTracker per Node
Hadoop Architecture
Example: Single Node Setup
● Running in Local Mode: Just unpack the distribution archive, preferably on Linux or Mac.
● Windows users might need to install Cygwin.
● Follow
https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html
WordCount in Java
● WordCount consists of three components
– Mapper taking the input and creating pairs <Word, 1>
– Reducer / Combiner summing up the second value of the pairs
– Main Method setting up the infrastructure
Element 1: Map
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  // Reused writables to avoid allocating new objects per token
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one); // emit <word, 1>
    }
  }
} // Map
Element 2: Reduce
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // add up all counts for this word
    }
    output.collect(key, new IntWritable(sum));
  }
} // Reduce
Element 3: Main
public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class); // Reduce doubles as the local Combiner
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}
[Diagram: running a Job — Input feeds the Computation, which produces the Output]
Compilation & Invocation
● Classpath: Use bin/hadoop classpath
● Compile:
– In hadoop/wordcount (download from git)
– mkdir classes
– javac -cp $(../bin/hadoop classpath) WordCount.java -d classes
– jar -cvf wordcount.jar -C classes/ .
● Run:
– In hadoop:
– bin/hadoop jar wordcount/wordcount.jar de.uni_hannover.ikg.WordCount input output
Note that Hadoop refuses to overwrite an existing output directory, so delete output before re-running your code.
Output
[…]
"Hell,"      6
"Hell?       2
"Hell?"      1
"Hellas,"    1
"Hellburner  2
[...]
Job Statistics
File System Counters
FILE: Number of bytes read=11819944640
FILE: Number of bytes written=5082579924
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Job Statistics
MapReduce Framework
Map input records=1525758
Map output records=12256351
Map output bytes=119763898
Map output materialized bytes=32703470
Input split bytes=26562
Combine input records=12256351
Combine output records=2265115
Reduce input groups=359638
Reduce shuffle bytes=32703470
Reduce input records=2265115
Reduce output records=359638
[...]
Running without Combine
Communication
● With Combine: Reduce shuffle bytes = 32,703,470
● Without Combine: Reduce shuffle bytes = 144,278,025
● This is 4.4 times more communication without the Combiner.
Job Complexity
Combine input records=12256351
Combine output records=2265115
● Without Combine: Reduce is invoked 12,256,351 times
● With Combine: Reduce is invoked only 2,265,115 times
● That is 5.4 times fewer invocations.
Assignment
● Modify WordCount such that it
– Removes all non-alphabetic characters during Map
– Only keeps words that occur more often than a threshold given on the command line
● Note that this case needs either a more flexible Reducer or different Reducer and Combiner classes, as we can't reject words that occur too rarely within a single Map result!
● Run WordCount
– In Standalone Mode
– In Pseudo-Distributed Mode
File Systems and Distributed File Systems
What is a file?
● A computer file is a block of data, typically on persistent storage.
● It is usually accessed via the operating system API with operations such as
– Create a new file (POSIX: creat)
– Open a file (POSIX: open)
– Read data from a file (POSIX: read)
– Write data to a file (POSIX: write)
– Close a file (POSIX: close)
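For illustration, the same create/write/read/delete lifecycle can be sketched in Java, whose java.nio.file calls map onto these system calls on POSIX platforms (a minimal sketch; the file name and content are arbitrary):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileLifecycle {
    public static void main(String[] args) throws IOException {
        // create a new file (cf. POSIX creat)
        Path p = Files.createTempFile("demo", ".txt");
        // open + write + close in one call (cf. POSIX open/write/close)
        Files.write(p, "hello".getBytes());
        // open + read + close (cf. POSIX open/read/close)
        String back = new String(Files.readAllBytes(p));
        Files.delete(p);
        System.out.println(back); // prints "hello"
    }
}
```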
File Systems
● File Systems organize files into directories and take care of ownership and security
● Typical concepts and operations
– Directory Tree, Path, and Working Directory
– Links (Hard, Symbolic)
– Move, Delete, Rename files
– File Attributes
– Special Files (Device Nodes, Memory Mapping)
Distributed? No problem.
There are many DFS. Some remarkable examples:
● Microsoft Distributed File System (DFS)
extends MS infrastructure with consistent views of distributed directories. Low consistency.
● Andrew File System (AFS, Carnegie Mellon University)
widely used by researchers and universities.
● GlusterFS (bought by Red Hat)
collects free space across servers into a new virtual file system.
● HDFS (Hadoop Distributed File System)
used in the Hadoop Big Data Platform for distributed storage.
● XtreemFS (German Research Highlight)
POSIX-compatible, fault-tolerant, scalable, reliable.
Andrew File System
● Scales to tens of thousands of servers
● Supports replication, though not in real time
● Supports consistent and persistent caching
● Assumes that the working set of each user fits into the cache. A pre-big-data assumption!
● Assumes non-database access (the majority of writes is non-conflicting)
● Whole-File Assumption: On opening a file, the complete file is transferred to the client. To be fair: striping is under development.
Andrew FS Overview
● A program opens a file. If it is in the cache and the cache is valid, it gets served from the local drive.
● Otherwise, Venus asks Vice for a copy of the file.
● Vice remembers to notify Venus in case the file changes on the server.
● After this communication, the program again works with the local cache only, committing changes back to Vice.
● For the program, the file looks similar to a file on the local drive. In fact, the program works with a UNIX file descriptor.
[Diagram: Program ↔ Venus (client) ↔ Vice (server), both sides running on a Unix/Linux kernel]
Gluster FS
● Supported for Hadoop via plugin (just a JAR)
● Eliminates the central NameNode
● Fault-tolerant on the file system level
● Works like a full FS (FUSE-mountable, writes actual files)
● Supports Striping
● No changes to Hadoop/MapReduce code
● Allows running Hadoop over multiple namespaces
● Real data access through anything:
– FUSE (including Google Drive, Amazon Drive, AWS EBS, …)
– NFS
– SMB
– SWIFT
– and even HDFS ;-)
GlusterFS Deployment Example
[Diagram: a management server running Ambari (via SSH) and the GlusterFS console; a YARN master running the YARN ResourceManager and the Job History Server; twelve workers, each running a YARN NodeManager and glusterd]
HDFS (Recall...)
[Diagram: a file split into numbered chunks, stored with replicas across DataNodes]
● File metadata and access are managed by a dedicated NameNode.
● File contents are split into chunks smaller than a predefined constant, e.g., 128 MB.
● Chunks are stored on a distributed system of DataNodes, with replication protecting from data loss due to expected node failures.
XtreemFS
● The only DFS that handles all failure modes including network splits! A good candidate for the future...
● But still under development, try it out...
Beyond File Systems
● File systems often implicitly make some assumptions, which are valid in most cases:
– Files are read more often than they are written
– Files are written without conflicts most of the time (e.g., file locking has to be done per application)
– Files have no structure beyond their size
– Files are seldom used partially (though striping is a remarkable example; remember the city night lights in R)
Big Data is not about Files
● However, Big Data has three properties
– Volume
● Can easily be handled by files, though the whole-file assumption can become problematic
● Counter-measure: a semantically sensible splitting mechanism
– Variety
● Sometimes, only non-evenly spaced information in a file is relevant. However, this cannot be retrieved via the file read API
– Velocity
● High-speed updates to data break the typical one-writer-per-file assumption or lead to unmanageable amounts of small files
Databases
In the era before Big Data, relational databases were invented to mitigate problems arising from working with files.
RDBMS are structured in tables and typically provide
● Atomicity
Either a whole transaction happens, or nothing.
● Consistency
Data can be constrained and linked between tables; for example, deleting a user can delete all associated data. Furthermore, anyone reading from the database at any time will see the same state.
● Isolation
Concurrent transactions are guaranteed not to influence each other.
● Durability
Transactions that have happened stay persistent.
In short: RDBMS provide ACID semantics.
However...
● There is no scalable way of providing ACID semantics over distributed systems.
● The CAP theorem states that we can only have two out of
– Consistency
– Availability
– Partition Tolerance
and classical RDBMS choose Consistency over Availability. (In fact, you have to wait ;-)
NoSQL
● How can we go further?
● We want
– Availability
– Partition Tolerance
– Multi-Writer
and are willing to sacrifice consistency (at least a bit)
NoSQL by Example: Apache Cassandra
Classical Relational DBMS                | Apache Cassandra
Handles moderate incoming data velocity  | Handles high incoming data velocity
Data arriving from one/few locations     | Data arriving from many locations
Manages primarily structured data        | Manages all types of data
Supports complex/nested transactions     | Supports simple transactions
Single points of failure with failover   | No single points of failure; constant uptime
Supports moderate data volumes           | Supports very high data volumes
NoSQL by Example: Apache Cassandra
Classical Relational DBMS                            | Apache Cassandra
Centralized deployments                              | Decentralized deployments
Data written in mostly one location                  | Data written in many locations
Supports read scalability (with limited consistency) | Supports read and write scalability
Deployed in vertical scale-up fashion                | Deployed in horizontal scale-out fashion
Cassandra Basic Structure
● The nodes of a Cassandra Key-Value Store form a simple headless ring.
● All nodes have equal functionality.
● Ring communication is done via a Gossip protocol.
Cassandra Writing
● Full Data Durability with High Performance: Commit Log, MemTable and SSTables:
Cassandra Reading
● Reads are done using a structure called Bloom Filter to avoid useless communication / disk IO:
Cassandra Replication
Replication is a central element in Cassandra, allowing nodes to join or leave the cluster at any time. Replication is influenced by four components:
● Virtual Nodes
Assign data ownership to physical machines
● Partitioner
Partitions the data
● Replication Strategy
Defines to which points in the ring replicas go
● Snitch
Provides additional topology information, for example, from the cloud provider (such as Amazon Availability Zones)
Partitioner
● All data in Cassandra is a key/value pair.
● In Cassandra tables, the key is derived from the table PRIMARY KEY, which can even be a compound key of multiple columns.
● A Token represents a range of keys, and every node is responsible for certain ranges of keys.
● The Partitioner takes the key and calculates the Token where the key falls. This can be done in various ways:
– Uniformly (OrderPreservingPartitioner)
– Random (RandomPartitioner)
– Fast Random (Murmur3Partitioner)
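The token idea can be illustrated with a toy partitioner (not Cassandra's actual code — real tokens are 128-bit MD5 or Murmur3 hashes; `String.hashCode()` is a hypothetical stand-in, and node names are made up):

```java
import java.util.Map;
import java.util.TreeMap;

public class TokenRing {
    // Find the node owning the token range a key falls into: the first node
    // whose token is >= the key's token, wrapping around the ring.
    public static String ownerOf(String key, TreeMap<Integer, String> ring) {
        int token = key.hashCode(); // hypothetical stand-in for MD5/Murmur3
        Map.Entry<Integer, String> e = ring.ceilingEntry(token);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(Integer.MIN_VALUE / 2, "nodeA");
        ring.put(0, "nodeB");
        ring.put(Integer.MAX_VALUE / 2, "nodeC");
        // "a".hashCode() == 97, which falls into nodeC's token range
        System.out.println(ownerOf("a", ring)); // nodeC
    }
}
```

The same key always maps to the same token, so reads know exactly which node to ask.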
Replication
● Data replication in Cassandra is very simple, not to say minimalistic:
– Simple Strategy: Find the node holding the token to which the Partitioner mapped the current row, and replicate the row to the following nodes around the ring that don't yet have the data, until ReplicationFactor nodes have it.
– Network Topology Strategy: Find the node holding the token, then continue to place replicas on hosts around the ring in different racks.
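The Simple Strategy ring walk can be sketched like this (a toy model using node names instead of token ranges; not Cassandra's implementation):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RingReplication {
    // SimpleStrategy sketch: starting at the token owner, walk clockwise
    // around the ring and pick distinct nodes until replicationFactor
    // replicas are placed.
    public static List<String> replicas(List<String> ring, int ownerIndex,
                                        int replicationFactor) {
        List<String> out = new ArrayList<>();
        for (int i = 0; out.size() < replicationFactor && i < ring.size(); i++) {
            String node = ring.get((ownerIndex + i) % ring.size());
            if (!out.contains(node)) out.add(node);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ring = Arrays.asList("A", "B", "C", "D", "E");
        // Owner is node D; the walk wraps past the end of the ring.
        System.out.println(replicas(ring, 3, 3)); // [D, E, A]
    }
}
```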
A Cassandra Ring
Ring without Vnodes, Replication Factor 3.
● Each Token stands for a range of keys
● Each Node is responsible for a Token
● Each Row is replicated around the ring to the next nodes
Introducing Virtual Nodes (Vnodes)
● More Tokens, as each node is split into several virtual nodes
● Flexible movement of Tokens in case of failures
● Efficient and simple load balancing
Bloom Filter
Storing Sets with Limited Memory
Bloom Filter
● Cassandra data is organized using Tokens calculated from the row key, which can be random
– An MD5 hash of the key for RandomPartitioner
– A Murmur3 hash for Murmur3Partitioner (the default)
– The key itself for OrderPreservingPartitioner
● For Big Data, not all data can stay in the Memtable cache. Therefore, a query might need to visit SSTables.
● Apache Cassandra employs Bloom Filters to find out, which SSTables could contain a given key.
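The idea can be sketched in a few lines (a toy filter; Cassandra's real implementation tunes bit count and hash count to a target false-positive rate, and the seed-based hashes here are arbitrary):

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int[] seeds;

    public SimpleBloomFilter(int size, int... seeds) {
        this.bits = new BitSet(size);
        this.size = size;
        this.seeds = seeds; // one cheap hash function per seed
    }

    private int hash(String key, int seed) {
        int h = seed;
        for (int i = 0; i < key.length(); i++) h = h * 31 + key.charAt(i);
        return Math.floorMod(h, size);
    }

    public void add(String key) {
        for (int s : seeds) bits.set(hash(key, s));
    }

    // false => key is definitely absent; true => key might be present
    public boolean mightContain(String key) {
        for (int s : seeds) if (!bits.get(hash(key, s))) return false;
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter bf = new SimpleBloomFilter(1024, 17, 31, 101);
        bf.add("alice");
        System.out.println(bf.mightContain("alice")); // true - never a false negative
        // Unseen keys answer false with high probability (false positives
        // are possible, false negatives are not).
        System.out.println(bf.mightContain("carol"));
    }
}
```

A "no" answer lets Cassandra skip an SSTable entirely, saving the disk seek.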
Switch to BloomFilter.pdf
Hadoop / Cassandra / MapReduce
● As Cassandra is just a highly efficient, user-friendly, headless storage engine, it can be used as data source and data sink for Apache Hadoop MapReduce jobs.
● While the TaskTracker used to be able to exploit data locality in Cassandra, this feature is gone.
● However, you can still rely on the read and write everywhere structure of Cassandra or modify the TaskTracker to support your pattern of data locality.
Apache Hive (Overview)
● There is another large Apache project called Hive, which is optimized for data warehousing.
– In data warehouses, a lot of different data is collected into a single DBMS for running analytics on it.
– Typical workload: Extract, Transform, Load (ETL)
● Apache Hive is a data warehouse on top of Hadoop HDFS with the following key features:
– Broad SQL support
– ODBC/JDBC connections (e.g., Excel, Access)
– Queries are compiled to MapReduce jobs automatically, or are run against Spark (not yet fully optimized, but already faster than MR)
– Structure projection for unstructured data (no need to load the data in again)
● CSV / Text
● Apache Parquet
● Apache ORC
Warning
● Though Hive supports many modern SQL features, it is not meant for heavily relational queries, for example, extensive joins or foreign keys.
● In fact, Hive should always be used in a denormalized way. Materializing query results is not unwanted redundancy; it is highly efficient processing.
● Tip: Use compression wherever supported, as I/O is the big data bottleneck, not CPU.
Apache Big Data: Low-Level Projects
● Apache HDFS: Distributed File System
● Apache Ignite: In-Memory Real-Time Processing
● Apache MapReduce: MapReduce framework
● Apache Pig: Data Flow Programming (similar to functional programming)
● Apache Spark: Solves similar problems as MapReduce, but faster using a functional philosophy
● Apache Storm: Stream Computation Framework
● Apache Flink: Large Computation Graphs (DAG)
● Apache HBase: The Hadoop Database
● Apache Cassandra: The central Key-Value Store
● Apache Hive: SQL over Hadoop / MapReduce
● Apache Phoenix: SQL over HBase
● Apache Kafka: Publish-Subscribe Stream Processing
● Apache Oozie: DAG Scheduler for multiple MapReduce jobs (e.g., trigger on availability)
Apache Big Data: Application Level
● Apache Mahout: Scalable Machine Learning over MapReduce
● WEKA3: Allows for Data Mining using some Weka implementations
● Cloudera Oryx: Machine Learning with a Business Perspective
● Apache Spark MLlib: Machine Learning over Spark
– Widest support: Java, Scala, Python, R
Spatial Data Distribution
Spatial Data Distribution
● Spatial data distribution is the central question of geospatial Big Data
– How do I distribute my data between nodes in order to increase data locality?
● Two general strategies can be differentiated:
– Ring-based architectures: similar to Cassandra
– Block-based architectures: similar to HDFS
Ring vs. Block
● Ring-based architectures:
– Data is distributed between nodes according to an ordering of the data.
Example: The ordering in Cassandra is given by the key.
● Block-based architectures:
– Data is first split into meaningful blocks, which are then distributed between computing nodes.
Ordering Spatial Data
● Central idea: space-filling curves
● A space-filling curve is a curve, often defined as the limit of a sequence of curves, which provides a continuous map from the unit interval [0,1] onto the unit square [0,1]×[0,1].
● We will most often use finite curves from the sequence rather than the limit, which visit all cells of a cell decomposition of the unit square.
Peano Curve
Giuseppe Peano (1858-1932)
Hilbert Curve
David Hilbert (1862-1943)
Morton Order
Guy Macdonald Morton (1966)
Properties
● Peano
– Complex, medium locality
● Hilbert
– Very good locality
– Complex to project to and from
● Z-Curve
– Good locality
– Easy to project to and from
– Constant-time neighbors
– Spatially known: it is the same as a depth-first traversal of a quadtree
Geohash
● The Z-curve has been used to derive the Geohash, which maps cells in the world to strings using Z-order and Base32 coding of the bit sequences.
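The standard Geohash construction — bisect longitude and latitude alternatingly, emit one bit per decision, and pack five bits into one Base32 character — can be sketched as follows (the coordinates in main are the widely used Jutland test point):

```java
public class GeohashEncoder {
    static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Standard Geohash: halve the longitude interval on even bits and the
    // latitude interval on odd bits; each decision contributes one bit.
    public static String encode(double lat, double lon, int length) {
        double latLo = -90, latHi = 90, lonLo = -180, lonHi = 180;
        StringBuilder out = new StringBuilder();
        boolean lonTurn = true; // even bits encode longitude
        int bit = 0, ch = 0;
        while (out.length() < length) {
            if (lonTurn) {
                double mid = (lonLo + lonHi) / 2;
                if (lon >= mid) { ch = ch << 1 | 1; lonLo = mid; }
                else            { ch = ch << 1;     lonHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2;
                if (lat >= mid) { ch = ch << 1 | 1; latLo = mid; }
                else            { ch = ch << 1;     latHi = mid; }
            }
            lonTurn = !lonTurn;
            if (++bit == 5) { out.append(BASE32.charAt(ch)); bit = 0; ch = 0; }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Well-known test point (Jutland, Denmark)
        System.out.println(encode(57.64911, 10.40744, 11)); // u4pruydqqvj
    }
}
```

Note how a shorter hash is simply a prefix — a larger cell containing the smaller one.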
Geohash Cells
High-Level Z Curve
● Encoding: Given a point P(X,Y) in some spatial reference system
– Discretize: First, discretize X and Y into positive integer numbers of fixed length, e.g., 16 bits.
– Zip: Mix the bits of both alternatingly into a new integer.
– Encode (optional): Encode the bit string to get a concise, human-readable representation.
● Decoding: Given a string or integer
– Decode (optional): Decode Base32 to get an integer
– Dezip: Create two integers by dezipping the bits
– Reverse:
● Either return the center point of the cell represented by the given Z-curve key
● Or return the bounding box.
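The Zip and Dezip steps are plain bit interleaving; a minimal sketch (16-bit coordinates, with x on the even bit positions, are an arbitrary choice):

```java
import java.util.Arrays;

public class ZCurve {
    // Zip: interleave the low 16 bits of x and y into one integer;
    // x goes to the even bit positions, y to the odd ones.
    public static long zip(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= (long) (x >> i & 1) << (2 * i);
            z |= (long) (y >> i & 1) << (2 * i + 1);
        }
        return z;
    }

    // Dezip: extract the two coordinates again.
    public static int[] dezip(long z) {
        int x = 0, y = 0;
        for (int i = 0; i < 16; i++) {
            x |= (int) (z >> (2 * i) & 1) << i;
            y |= (int) (z >> (2 * i + 1) & 1) << i;
        }
        return new int[] { x, y };
    }

    public static void main(String[] args) {
        long z = zip(5, 3); // x=0b101, y=0b011 -> z=0b011011
        System.out.println(z); // 27
        System.out.println(Arrays.toString(dezip(z))); // [5, 3]
    }
}
```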
High-Level Z Curve
● Neighbors:
– Given a Base32 string, neighbors can be calculated by table lookup on a per-character basis. See the various implementations of Geohash for details.
– Given an integer (shown here for 8-bit keys; x in mask 0b01010101, y in mask 0b10101010), use the formulas
● top = ((z & 0b10101010) - 1 & 0b10101010) | (z & 0b01010101)
● bottom = ((z | 0b01010101) + 1 & 0b10101010) | (z & 0b01010101)
● left = ((z & 0b01010101) - 1 & 0b01010101) | (z & 0b10101010)
● right = ((z | 0b10101010) + 1 & 0b01010101) | (z & 0b10101010)
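These neighbor formulas can be checked with a small sketch (assuming 8-bit keys with x on the even bits, as the masks suggest; whether "top" means y-1 or y+1 depends on the axis orientation):

```java
public class ZNeighbors {
    // 8-bit Morton keys for a 16x16 grid: x occupies the even bits, y the odd bits.
    static final int X_MASK = 0b01010101;
    static final int Y_MASK = 0b10101010;

    // Decrement / increment one coordinate directly on the interleaved key.
    // Filling the other coordinate's bits with ones before adding (or masking
    // them out before subtracting) lets the carry/borrow skip over them.
    public static int left(int z)   { return (((z & X_MASK) - 1) & X_MASK) | (z & Y_MASK); }
    public static int right(int z)  { return (((z | Y_MASK) + 1) & X_MASK) | (z & Y_MASK); }
    public static int top(int z)    { return (((z & Y_MASK) - 1) & Y_MASK) | (z & X_MASK); }
    public static int bottom(int z) { return (((z | X_MASK) + 1) & Y_MASK) | (z & X_MASK); }

    public static void main(String[] args) {
        // Cell (0,0) has key 0; (1,0) has key 1; (0,1) has key 2.
        System.out.println(right(0));  // 1
        System.out.println(bottom(0)); // 2
        System.out.println(left(1));   // 0
        System.out.println(top(2));    // 0
        // Note: at the grid border the formulas wrap around rather than failing.
    }
}
```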
Assignment
● Implement or download
– A Z-curve encoder/decoder, possibly Geohash.
● Think about what would happen if we:
– Use the Geohash as the key for ordering points
– Use the OrderPreserving strategy of Apache Cassandra
– Calculate the number of points within 1 km distance for each point using MapReduce
● Measure
– Whether the Z-curve gains speed for this query on a real cluster
Block Distribution Strategies
Spatial Indexing Structures
● Spatial indexing structures enable fast spatial access to points, typically supporting
– Range queries: retrieve all geometries that are within a specific range, often a rectangle or sphere
– kNN queries: retrieve the k nearest neighbors to a point.
Grid Index
● Overlay the data with a grid and collect all points falling into the specific cells of the grid.
– Queries only need to inspect neighboring grid cells
– Tradeoff between
● the number of empty cells
● the number of points in each grid cell
Good baseline index for (nearly) uniform data.
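A minimal grid index sketch (the combined cell key and the cell size are arbitrary demo choices):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GridIndex {
    private final double cellSize;
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    public GridIndex(double cellSize) { this.cellSize = cellSize; }

    private long key(long cx, long cy) { return cx * 1_000_003L + cy; } // demo-grade cell key

    public void add(double x, double y) {
        long cx = (long) Math.floor(x / cellSize), cy = (long) Math.floor(y / cellSize);
        cells.computeIfAbsent(key(cx, cy), k -> new ArrayList<>()).add(new double[] { x, y });
    }

    // Range query: inspect only the cells overlapping the query circle.
    public List<double[]> within(double x, double y, double r) {
        List<double[]> result = new ArrayList<>();
        long cx0 = (long) Math.floor((x - r) / cellSize), cx1 = (long) Math.floor((x + r) / cellSize);
        long cy0 = (long) Math.floor((y - r) / cellSize), cy1 = (long) Math.floor((y + r) / cellSize);
        for (long cx = cx0; cx <= cx1; cx++)
            for (long cy = cy0; cy <= cy1; cy++)
                for (double[] p : cells.getOrDefault(key(cx, cy), Collections.emptyList()))
                    if (Math.hypot(p[0] - x, p[1] - y) <= r) result.add(p);
        return result;
    }

    public static void main(String[] args) {
        GridIndex g = new GridIndex(1.0);
        g.add(0.5, 0.5); g.add(0.6, 0.4); g.add(5.0, 5.0);
        System.out.println(g.within(0.5, 0.5, 0.5).size()); // 2
    }
}
```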
Example: BSP
● Binary Space Partitioning Tree (BSP)
– For a given tree node, split all geometries at a central hyperplane (a line in 2D) and create two new nodes containing all geometries from the two sides of the hyperplane.
● "Central" can be defined in various ways.
– Recurse until a given tree node contains only few points.
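A sketch of this recursion, choosing the median along alternating axes as the "central" hyperplane (one of the various possible definitions; the leaf size is arbitrary):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BspTree {
    static final int LEAF_SIZE = 2;

    double[][] leafPoints;   // set only for leaves
    BspTree left, right;     // set only for inner nodes
    int axis;                // splitting axis of this inner node
    double cut;              // position of the splitting hyperplane

    // Split at the median along alternating axes, recursing until a node
    // holds at most LEAF_SIZE points.
    static BspTree build(List<double[]> pts, int depth) {
        BspTree n = new BspTree();
        if (pts.size() <= LEAF_SIZE) {
            n.leafPoints = pts.toArray(new double[0][]);
            return n;
        }
        n.axis = depth % 2;
        pts.sort(Comparator.comparingDouble(p -> p[n.axis]));
        int mid = pts.size() / 2;
        n.cut = pts.get(mid)[n.axis];
        n.left = build(new ArrayList<>(pts.subList(0, mid)), depth + 1);
        n.right = build(new ArrayList<>(pts.subList(mid, pts.size())), depth + 1);
        return n;
    }

    int depth() {
        return leafPoints != null ? 1 : 1 + Math.max(left.depth(), right.depth());
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>();
        for (int i = 0; i < 8; i++) pts.add(new double[] { i, 7 - i });
        BspTree t = build(pts, 0);
        System.out.println(t.depth()); // 8 points, leaves of <= 2 -> depth 3
    }
}
```

A full implementation would additionally use the stored axis and cut for range queries.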
Example: Ball Trees
● Ball Tree
– Each node in the tree represents a ball (a circle in 2D) and all points falling into this ball.
– If there are too many points inside some ball, split it into two or more balls covering all points and build subtree nodes according to this splitting.
R trees and R* trees
● R tree
– Each node in the tree represents the Minimum Bounding Rectangle (MBR) of all points inside (or below) this node.
– If a node contains too many points, split it into two MBRs with similarly many points.
● R* tree
– An advanced insertion algorithm makes this tree one of the most efficient and widely used spatial indices for nearest neighbor queries.
Lifting those Indices to Hadoop?
● A matter of much work and open research
– Hotspots can easily arise for simple indices such as grids
– Non-uniform access patterns / varying chunk sizes can result
– Efficient spatial replication and data locality have not yet been fully thought through...
Open research topic, possibly worth some Master's theses, if you are interested.
SpatialHadoop is one approach...
Spatial Block Distribution
SpatialHadoop
● Uses spatial indices for distributing blocks in HDFS; hence, jobs can be assigned to nodes that already have the data locally.
● MapReduce components
– SpatialFileSplitter
– SpatialRecordReader
● Query components
– Range queries
– kNN queries
– Spatial joins
Distributed Index Model
Data Mining from Location
Classification and Clustering Based on Point Distance
Data Mining
● Data Mining is the process of extracting structures and information from large bodies of data.
● Data Mining can be split into
– Supervised Approaches
– Unsupervised Approaches
Supervised Data Mining
● In supervised data mining, a given dataset contains the intended result.
● This dataset can be used as a training dataset in order to extract a structural representation of the data.
● This structural representation is called a model and can be applied to unknown data, assigning what the model would predict.
Example: Classification
● Given a table with k attributes and a nominal class variable, learn to infer the class from the attributes.
● Classical example: the Iris dataset
– Question: Can we infer the species of a flower from four size measurements?
The IRIS Dataset
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. _Annals of Eugenics_, *7*, Part II, 179-188.
1NN Classification
A very simple classification scheme
● Nearest Neighbor Classification is a very simple though spatially quite useful classification algorithm.
● It takes all training data and assigns to a location the class of the nearest training instance.
● Let us look at how it works:
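The rule can be sketched directly (toy training data in the style of the iris measurements, not the R `knn` implementation; labels and values are illustrative):

```java
public class OneNN {
    // Return the label of the training instance nearest to the query point
    // (squared Euclidean distance; no square root needed for comparisons).
    public static String classify(double[][] train, String[] labels, double[] q) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < train.length; i++) {
            double d = 0;
            for (int j = 0; j < q.length; j++) {
                double diff = q[j] - train[i][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        // Hypothetical training set: petal length/width of one flower per class.
        double[][] train = { { 1.4, 0.2 }, { 4.7, 1.4 }, { 5.9, 2.1 } };
        String[] labels = { "setosa", "versicolor", "virginica" };
        System.out.println(classify(train, labels, new double[] { 1.5, 0.3 })); // setosa
        System.out.println(classify(train, labels, new double[] { 5.5, 2.0 })); // virginica
    }
}
```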
kNN in R
library(class)  # knn() lives in the 'class' package
data(iris)
# Select 10 random training rows
train = iris[sample(1:nrow(iris), 10), ]
# Apply 1nn on training set to full iris set
res = knn(train[, 1:4], iris[, 1:4], train$Species, 1)
summary(res == iris$Species)
#    Mode   FALSE    TRUE    NA's
# logical      16     134       0

plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species)
points(Petal.Width ~ Petal.Length, data = iris[which(iris$Species != res), ],
       pch = 4, cex = 2, col = "red", lwd = 2)