
Lecture IV

More on Distributed File Systems, Space Filling Curves and MapReduce


Apache Hadoop

● Initially, an infrastructure centered around HDFS for running MapReduce jobs.

● However, it has grown into a general-purpose big data framework.

● For modern versions of Hadoop, the most important components for Hadoop/MapReduce are
– YARN
– HDFS
– MapReduce Services

Hadoop MapReduce

The aspects of a MapReduce invocation are split into

● YARN components
– One central ResourceManager
– One NodeManager per Node

● HDFS components
– One central NameNode
– One DataNode per Node

● MapReduce components
– One central JobTracker
– One TaskTracker per Node

Hadoop Architecture


Example: Single Node Setup

● Running in Local Mode: simply unpack the distribution archive, preferably on Linux or Mac.

● Windows users might need to install Cygwin.

● Follow
https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html

WordCount in Java

● WordCount consists of three components:
– A Mapper taking the input and creating pairs <Word, 1>
– A Reducer / Combiner summing up the second value of the pairs
– A main method setting up the infrastructure

Element 1: Map

// Imports needed by the full WordCount class (classic org.apache.hadoop.mapred API):
// import java.io.IOException;
// import java.util.Iterator;
// import java.util.StringTokenizer;
// import org.apache.hadoop.fs.Path;
// import org.apache.hadoop.io.IntWritable;
// import org.apache.hadoop.io.LongWritable;
// import org.apache.hadoop.io.Text;
// import org.apache.hadoop.mapred.*;

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
} // Map

Element 2: Reduce

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
} // Reduce

Element 3: Main

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);              // Job
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);                       // Output
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);                           // Computation
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);               // Input
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);                                   // Run
}

Compilation & Invocation

● Classpath: use bin/hadoop classpath

● Compile:
– In hadoop/wordcount (download from git)
– mkdir classes
– javac -cp $(../bin/hadoop classpath) WordCount.java -d classes
– jar -cvf wordcount.jar -C classes/ .

● Run:
– In hadoop:
– bin/hadoop jar wordcount/wordcount.jar de.uni_hannover.ikg.WordCount input output

Note that Hadoop refuses to overwrite an existing output directory. So delete output before re-running your code.

Output

[…]
"Hell," 6
"Hell? 2
"Hell?" 1
"Hellas," 1
"Hellburner 2
[...]

Job Statistics

File System Counters

FILE: Number of bytes read=11819944640

FILE: Number of bytes written=5082579924

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0


Job Statistics

MapReduce Framework

Map input records=1525758

Map output records=12256351

Map output bytes=119763898

Map output materialized bytes=32703470

Input split bytes=26562

Combine input records=12256351

Combine output records=2265115

Reduce input groups=359638

Reduce shuffle bytes=32703470

Reduce input records=2265115

Reduce output records=359638

[...]

Page 15: Lecture IV · Compilation & Invocation Classpath: Use bin/hadoop classpath Compile: – In hadoop/wordcount (download from git) – mkdir classes – javac -cp $(../bin/hadoop classpath)

Running without Combine

Communication

● With Combine:
Reduce shuffle bytes=32,703,470

● Without Combine:
Reduce shuffle bytes=144,278,025

● This is 4.4 times more communication.

Job Complexity

Combine input records=12,256,351
Combine output records=2,265,115

● Without Combine: Reduce is invoked 12,256,351 times.
● With Combine: Reduce is invoked only 2,265,115 times.

This is 5.4 times more invocations.

Assignment

● Modify WordCount such that it
– removes all non-alphabetic characters during Map
– only keeps words that occur more often than a threshold given on the command line

● Note that this case needs either a more flexible Reducer, or different Reducer and Combiner classes, as we can't reject words that are not yet frequent enough within a single Map result!

● Run WordCount
– in Standalone Mode
– in Pseudo-Distributed Mode

File Systems and Distributed File Systems


What is a file?

● A computer file is a block of data, typically on persistent storage.

● It is usually accessed via the operating system API with operations such as
– Create a new file (POSIX: creat)
– Open a file (POSIX: open)
– Read data from a file (POSIX: read)
– Write data to a file (POSIX: write)
– Close a file (POSIX: close)

File Systems

● File systems organize files into directories and take care of ownership and security.

● Typical concepts and operations:
– Directory Tree, Path, and Working Directory
– Links (Hard, Symbolic)
– Move, Delete, Rename files
– File Attributes
– Special Files (Device Nodes, Memory Mapping)

Distributed? No problem.

There are many distributed file systems (DFS). Some remarkable examples:

● Microsoft Distributed File System (DFS)
extends Microsoft infrastructure with consistent views of distributed directories. Low consistency.

● Andrew File System (AFS, Carnegie Mellon University)
widely used by researchers and universities.

● GlusterFS (bought by Red Hat)
collects free space across servers into a new virtual file system.

● HDFS (Hadoop Distributed File System)
used in the Hadoop big data platform for distributed storage.

● XtreemFS (German research highlight)
POSIX-compatible, fault-tolerant, scalable, reliable.

Andrew File System

● Scales to tens of thousands of servers
● Supports replication, though not in real time
● Supports consistent and persistent caching
● Assumes that the working set of each user fits into the cache. A pre-big-data assumption!
● Assumes non-database access (the majority of writes are non-conflicting)
● Whole-File Assumption: on opening a file, the complete file is transferred to the client.
To be fair: striping is under development.

Andrew FS Overview

● A program opens a file. If it is in the cache and the cache is valid, it gets served from the local drive.
● Otherwise, Venus (the client-side cache manager) asks Vice (the file server) for a copy of the file.
● Vice remembers to notify Venus in case the file changes on the server.
● After this communication, the program again works with the local cache only, committing changes back to Vice.
● For the program, the file looks similar to a file on the local drive. In fact, the program works with a UNIX file descriptor.

[Diagram: Program and Venus on the client's Unix/Linux kernel; Vice on the server's Unix/Linux kernel]

Gluster FS

● Supported for Hadoop via a plugin (just a JAR)
● Eliminates the central NameNode
● Fault-tolerant on the file system level
● Works like a full FS (FUSE-mountable, writes actual files)
● Supports striping
● No changes to Hadoop/MapReduce code
● Allows running Hadoop over multiple namespaces
● Real data access through anything:
– FUSE (including Google Drive, Amazon Drive, AWS EBS, …)
– NFS
– SMB
– SWIFT
– and even HDFS ;-)

GlusterFS Deployment Example

[Diagram: a Management Server (Ambari, SSH, GlusterFS console), a YARN Master (YARN ResourceManager, Job History Server), and twelve worker nodes (1-12), each running a YARN NodeManager and glusterd]

HDFS (Recall...)

[Diagram: a file split into chunks 1-4; the NameNode holds the metadata, and the chunks are replicated across the DataNodes]

File metadata and access are managed by a dedicated NameNode.

File content is split into pieces smaller than a predefined constant, e.g., 128 MB.

Chunks are stored on a distributed system of DataNodes, with replication protecting from data loss due to expected node failures.

XtreemFS

● The only DFS that handles all failure modes including network splits! A good candidate for the future...

● But still under development, try it out...


Beyond File Systems

● File systems often implicitly make some assumptions, which are valid in most cases:
– Files are read more often than they are written.
– Files are written without conflicts most of the time (e.g., file locking has to be done per application).
– Files have no structure beyond their size.
– Files are seldom used partially (though striping is a remarkable example; remember the city night lights in R).

Big Data is not about Files

● However, Big Data has three properties:

– Volume
● Can easily be handled by files, though the whole-file assumption can become problematic.
● Counter-measure: a semantically sensible splitting mechanism.

– Variety
● Sometimes, only non-evenly spaced information in a file is relevant. However, this cannot be retrieved by the file read API.

– Velocity
● High-speed updates to data break the typical one-writer-per-file assumption, or lead to unmanageable amounts of small files.

Databases

In the era before Big Data, relational databases were invented to mitigate problems arising from working with files.

RDBMS are structured in tables and typically provide

● Atomicity
Either a whole transaction happens, or nothing.

● Consistency
Data can be constrained and linked between tables; for example, deleting a user can delete all associated data. Furthermore, anyone reading from the database at any time sees the same state.

● Isolation
Concurrent transactions are guaranteed not to influence each other.

● Durability
Transactions that have happened are persistent.

In short: RDBMS provide ACID semantics.

However...

● There is no scalable way of providing ACID semantics over distributed systems.

● The CAP theorem states that we can only have two out of
– Consistency
– Availability
– Partition Tolerance

and classical RDBMS choose Consistency over Availability. (In fact, you have to wait ;-)

NoSQL

● How can we go further?

● We want
– Availability
– Partition Tolerance
– Multi-Writer

and are willing to sacrifice consistency (at least a bit).

NoSQL by Example: Apache Cassandra

Classical Relational DBMS                 | Apache Cassandra
------------------------------------------|----------------------------------------------
Handles moderate incoming data velocity   | Handles high incoming data velocity
Data arriving from one/few locations      | Data arriving from many locations
Manages primarily structured data         | Manages all types of data
Supports complex/nested transactions      | Supports simple transactions
Single points of failure with failover    | No single points of failure; constant uptime
Supports moderate data volumes            | Supports very high data volumes

NoSQL by Example: Apache Cassandra

Classical Relational DBMS                        | Apache Cassandra
-------------------------------------------------|------------------------------------------
Centralized deployments                          | Decentralized deployments
Data written in mostly one location              | Data written in many locations
Supports read scalability (limited consistency)  | Supports read and write scalability
Deployed in vertical scale-up fashion            | Deployed in horizontal scale-out fashion

Cassandra Basic Structure

● The nodes of a Cassandra key-value store form a simple headless ring.
● All nodes have equal functionality.
● Ring communication is done via a Gossip protocol.

Cassandra Writing

● Full data durability with high performance via Commit Log, MemTable, and SSTables: writes are appended to the Commit Log (for durability) and applied to the in-memory MemTable (for speed); full MemTables are flushed to immutable SSTables on disk. A sketch follows below.
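As the original figure is not reproduced here, the following is a minimal, hedged Java sketch of this write path; the flush threshold, the file handling, and the class name WritePath are illustrative assumptions, not Cassandra's code.

import java.io.*;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of the write path: log first (durable), then memory (fast), flush later. */
public class WritePath {
    private final DataOutputStream commitLog;
    private TreeMap<String, String> memTable = new TreeMap<>();  // kept sorted by key
    private final int flushThreshold = 100_000;                  // illustrative value
    private int sstableCount = 0;

    WritePath(File logFile) throws IOException {
        commitLog = new DataOutputStream(new FileOutputStream(logFile, true));
    }

    void write(String key, String value) throws IOException {
        commitLog.writeUTF(key);   // 1. append to the commit log for durability
        commitLog.writeUTF(value);
        memTable.put(key, value);  // 2. apply to the in-memory MemTable
        if (memTable.size() >= flushThreshold) flush();
    }

    /** 3. Flush the full MemTable to a new immutable, sorted SSTable file. */
    private void flush() throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream("sstable-" + (sstableCount++) + ".db"))) {
            for (Map.Entry<String, String> e : memTable.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
        memTable = new TreeMap<>();  // start a fresh MemTable
    }
}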

Cassandra Reading

● Reads are done using a structure called a Bloom filter to avoid useless communication / disk I/O: per SSTable, a Bloom filter can say with certainty that a key is absent, so only candidate SSTables are actually read.

Cassandra Replication

Replication is a central element in Cassandra to allow nodes to join or leave the cluster at any time. Replication is influenced by four components:

● Virtual Nodes
assign data ownership to physical machines

● Partitioner
partitions the data

● Replication Strategy
defines to which points in the ring replicas go

● Snitch
defines additional topology information, for example, from the cloud provider (such as Amazon Availability Zones)

Partitioner

● All data in Cassandra is a key/value pair.
● In Cassandra tables, the key is derived from the table PRIMARY KEY, which can even be a compound key of multiple columns.
● A Token represents a range of keys, and every node is responsible for certain ranges of keys.
● The Partitioner takes the key and calculates the Token where the key falls. This can be done in various ways (a minimal ring sketch follows below):
– Order-preserving (OrderPreservingPartitioner)
– Random (RandomPartitioner)
– Fast random (Murmur3Partitioner)
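To make the token mechanics concrete, here is a minimal Java sketch of a key-to-node ring; the toy hash standing in for a real partitioner such as Murmur3, the class name TokenRing, and the use of a TreeMap are illustrative assumptions, not Cassandra's API.

import java.util.TreeMap;

/** Minimal token ring: each node owns the key range up to (and including) its token. */
public class TokenRing {
    final TreeMap<Long, String> ring = new TreeMap<>();  // token -> node

    void addNode(String node, long token) { ring.put(token, node); }

    /** Illustrative stand-in for a real partitioner such as Murmur3. */
    long token(String key) { return key.hashCode() * 2654435761L; }

    /** The first node clockwise whose token is >= the key's token owns the key. */
    String ownerOf(String key) {
        Long t = ring.ceilingKey(token(key));
        return ring.get(t != null ? t : ring.firstKey());  // wrap around the ring
    }
}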

Replication

● Data replication in Cassandra is very simple, not to say minimalistic:

– Simple Strategy: find the node holding the token to which the Partitioner mapped the current row, and replicate the row to the following nodes around the ring that do not yet have the data, until ReplicationFactor nodes hold it. (See the sketch below.)

– Network Topology Strategy: find the node holding the token, then continue to place replicas on hosts around the ring in different racks.
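A hedged sketch of the Simple Strategy, written as a method for the TokenRing sketch above: walk clockwise from the token owner until ReplicationFactor distinct nodes hold the row. The rack-awareness of the Network Topology Strategy is omitted.

import java.util.*;

/** Sketch: collect the owner and the next distinct nodes clockwise on the ring. */
List<String> replicasFor(String key, int replicationFactor) {
    Set<String> replicas = new LinkedHashSet<>();          // keeps insertion order
    int distinctNodes = new HashSet<>(ring.values()).size();
    Long start = ring.ceilingKey(token(key));
    if (start == null) start = ring.firstKey();            // wrap around
    Iterator<String> it = ring.tailMap(start).values().iterator();
    while (replicas.size() < Math.min(replicationFactor, distinctNodes)) {
        if (!it.hasNext()) it = ring.values().iterator();  // continue around the ring
        replicas.add(it.next());                           // the set skips duplicate nodes
    }
    return new ArrayList<>(replicas);
}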

A Cassandra Ring

Ring without Vnodes, Replication Factor 3.

● Each Token stands for a range of keys.
● Each node is responsible for a Token.
● Each row is replicated around the ring to the next nodes.

Introducing Virtual Nodes (Vnodes)

● More Tokens, as each node is split into several virtual nodes
● Flexible movement of Tokens in case of failures
● Efficient and simple load balancing

Bloom Filter

Storing Sets with Limited Memory


Bloom Filter

● Cassandra data is organized using Tokens calculated from the row key, which can be random:
– An MD5 hash of the key for RandomPartitioner
– A Murmur3 hash for Murmur3Partitioner (the default)
– The key itself for OrderPreservingPartitioner

● For Big Data, not all data can stay in the Memtable cache. Therefore, a query might need to visit SSTables.

● Apache Cassandra employs Bloom filters to find out which SSTables could contain a given key.

Switch to BloomFilter.pdf
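Since the details are in the external slide set, here is a minimal, hedged Bloom filter sketch in Java to bridge the gap; the double-hashing scheme and the parameters m (bits) and k (hash functions) are illustrative assumptions, not Cassandra's implementation.

import java.util.BitSet;

/** Minimal Bloom filter: k hash functions set/test k bits; no false negatives. */
public class BloomFilter {
    private final BitSet bits;
    private final int m, k;  // m bits, k hash functions

    BloomFilter(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    /** Derive the i-th bit index via double hashing: h1 + i * h2 (mod m). */
    private int index(Object o, int i) {
        int h1 = o.hashCode();
        int h2 = Integer.reverse(h1) | 1;  // force odd so the k probes differ
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(Object o) { for (int i = 0; i < k; i++) bits.set(index(o, i)); }

    /** false = key definitely absent; true = possibly present (false positives). */
    boolean mightContain(Object o) {
        for (int i = 0; i < k; i++) if (!bits.get(index(o, i))) return false;
        return true;
    }
}

Cassandra keeps one such filter per SSTable: a negative answer skips the file entirely, while a positive answer still has to be verified on disk.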


Hadoop / Cassandra / MapReduce

● As Cassandra is just a highly efficient, user-friendly, headless storage engine, it can be used as data source and data sink for Apache Hadoop MapReduce jobs.

● While the TaskTracker used to be able to exploit data locality in Cassandra, this feature is gone.

● However, you can still rely on the read-and-write-everywhere structure of Cassandra, or modify the TaskTracker to support your pattern of data locality.

Apache Hive (Overview)

● There is another large Apache project called Hive, which is optimized for data warehousing.
– In data warehouses, a lot of different data is collected into a single DBMS for running analytics on it.
– Typical workload: Extract, Transform, Load (ETL)

● Apache Hive is a data warehouse on top of Hadoop HDFS with the following key features:
– Broad support for SQL
– ODBC/JDBC connections (e.g., Excel, Access)
– Queries are compiled to MapReduce jobs automatically, or are run against Spark (not yet fully optimized, but already faster than MR)
– Structure projection for unstructured data (no need to load the data in again):
● CSV / Text
● Apache Parquet
● Apache ORC

Warning

● Though Hive supports many modern SQL features, it is not meant for heavily relational queries, for example, extensive joins or foreign keys.

● In fact, Hive should always be used in a denormalized way. Materializing query results is not unwanted redundancy; it is highly efficient processing.

● Tip: use compression wherever supported, as I/O is the big data bottleneck, not CPU.

Apache Big Data: Low-Level Projects

● Apache HDFS: Distributed File System
● Apache Ignite: In-Memory Real-Time Processing
● Apache MapReduce: MapReduce framework
● Apache Pig: Data Flow Programming (similar to functional programming)
● Apache Spark: Solves similar problems as MapReduce, but faster, using a functional philosophy
● Apache Storm: Stream Computation Framework
● Apache Flink: Large Computation Graphs (DAG)
● Apache HBase: The Hadoop Database
● Apache Cassandra: The central Key-Value Store
● Apache Hive: SQL over Hadoop / MapReduce
● Apache Phoenix: SQL over HBase
● Apache Kafka: Publish-Subscribe Stream Processing
● Apache Oozie: DAG Scheduler for multiple MapReduce jobs (e.g., trigger on availability)

Apache Big Data: Application Level

● Apache Mahout: Scalable Machine Learning over MapReduce

● WEKA3: Allows for Data Mining using some Weka implementations

● Cloudera Oryx: Machine Learning with a Business Perspective

● Apache Spark MLlib: Machine Learning over Spark
– Widest support: Java, Scala, Python, R

Spatial Data Distribution

Page 51: Lecture IV · Compilation & Invocation Classpath: Use bin/hadoop classpath Compile: – In hadoop/wordcount (download from git) – mkdir classes – javac -cp $(../bin/hadoop classpath)

Spatial Data Distribution

● Spatial Data Distribution is the central question of geospatial Big Data:
– How do I distribute my data between nodes in order to increase data locality?

● Two general strategies can be differentiated:
– Ring-based Architectures: similar to Cassandra
– Block-based Architectures: similar to HDFS

Ring vs. Block

● Ring-based Architectures:
– Data is distributed between nodes according to an ordering of the data.
Example: the ordering in Cassandra is given by the key.

● Block-based Architectures:
– Data is first split into meaningful blocks, which are then distributed between computing nodes.

Ordering Spatial Data

● Central Idea: Space-Filling Curves

● A space-filling curve is a curve, often defined as the limit of a sequence of curves, which provides a continuous map from the unit interval [0,1] to the unit square [0,1]x[0,1].

● In practice, we will most often use curves from the defining sequence rather than the limit itself; these visit all cells of a cell decomposition of the unit square.

Peano Curve

Giuseppe Peano (1858-1932)


Hilbert Curve

David Hilbert (1862-1943)


Morton Order

Guy Macdonald Morton (1966)


Properties

● Peano
– Complex, medium locality

● Hilbert
– Very good locality
– Complex to project to and from

● Z-Curve
– Good locality
– Easy to project to and from
– Constant-time neighbors
– Spatially well-known: it is the same as a depth-first traversal of a quadtree

Geohash

● The Z-curve has been used to derive Geohash, which maps cells of the world to strings using Z-order and Base32 coding of the bit sequences.

Geohash Cells


High-Level Z Curve

● Encoding: Given a point P(X,Y) in some spatial reference system
– Discretizer: First, discretize X and Y into positive integer numbers of fixed length, e.g., 16 bit.
– Zip: Mix the bits of both alternatingly into a new integer.
– Encode (Optional): Encode the bit string to get a concise, human-readable representation.

● Decoding: Given a string or integer
– Decode (Optional): Decode Base32 to get an integer.
– Dezip: Create two integers by dezipping the bits.
– Reverse: Either return the center point of the cell represented by the given Z-curve key, or return the bounding box.

A sketch of Zip and Dezip follows below.
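To make Zip and Dezip concrete, here is a minimal Java sketch, assuming 16-bit discretized coordinates with the X bits placed on the even positions; this matches the scheme above but not necessarily the exact Geohash bit layout (which starts with longitude and adds Base32 coding).

public class ZCurve {
    /** Zip: interleave the lower 16 bits of x and y into one 32-bit key. */
    static int zip(int x, int y) {
        int z = 0;
        for (int i = 0; i < 16; i++) {
            z |= ((x >> i) & 1) << (2 * i);      // X bits on even positions
            z |= ((y >> i) & 1) << (2 * i + 1);  // Y bits on odd positions
        }
        return z;
    }

    /** Dezip: recover the two coordinates from the interleaved key. */
    static int[] dezip(int z) {
        int x = 0, y = 0;
        for (int i = 0; i < 16; i++) {
            x |= ((z >> (2 * i)) & 1) << i;
            y |= ((z >> (2 * i + 1)) & 1) << i;
        }
        return new int[] { x, y };
    }
}

For example, zip(3, 1) interleaves binary 11 and 01 into binary 0111 = 7.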

High-Level Z Curve

● Neighbors:
– Given a Base32 string, neighbors can be calculated by table lookup on a per-character basis. See the various implementations of Geohash for details.
– Given an integer z (here with 4 bits per dimension), use the formulas:
● top = ((z & 0b10101010) - 1 & 0b10101010) | (z & 0b01010101)
● bottom = ((z | 0b01010101) + 1 & 0b10101010) | (z & 0b01010101)
● left = ((z & 0b01010101) - 1 & 0b01010101) | (z & 0b10101010)
● right = ((z | 0b10101010) + 1 & 0b01010101) | (z & 0b10101010)
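The same formulas carry over to the 16-bit-per-axis keys of the zip() sketch above by widening the masks. A hedged Java sketch; the naming and the wrap-around behavior at the grid boundary are assumptions.

public class ZNeighbors {
    static final int X_MASK = 0x55555555;  // even positions: X bits
    static final int Y_MASK = 0xAAAAAAAA;  // odd positions: Y bits

    // Decrement / increment one coordinate directly on the interleaved key:
    // filling the other dimension's bits with 1s (for +1) or 0s (for -1)
    // lets the carry/borrow propagate to the right bit of this dimension.
    static int top(int z)    { return (((z & Y_MASK) - 1) & Y_MASK) | (z & X_MASK); }
    static int bottom(int z) { return (((z | X_MASK) + 1) & Y_MASK) | (z & X_MASK); }
    static int left(int z)   { return (((z & X_MASK) - 1) & X_MASK) | (z & Y_MASK); }
    static int right(int z)  { return (((z | Y_MASK) + 1) & X_MASK) | (z & Y_MASK); }
}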

Assignment

● Implement or download
– a Z-curve encoder / decoder, possibly Geohash.

● Think about what would happen if we
– use the Geohash as the key for ordering points,
– use the order-preserving strategy of Apache Cassandra,
– calculate the number of points within 1 km distance for each point using MapReduce.

● Measure
– whether the Z-curve gains speed for this query on a real cluster.

Block Distribution Strategies


Spatial Indexing Structures

● Spatial indexing structures enable fast spatial access to points, typically supporting
– Range queries: retrieve all geometries that are within a specific range, often a rectangle or sphere
– kNN queries: retrieve the k nearest neighbors of a point

Grid Index

● Overlay the data with a grid and collect all points falling into specific cells of the grid.
– Spatial queries only need to access neighboring grid cells.
– Tradeoff between
● the number of empty cells
● the number of points in each grid cell

A good baseline index for (nearly) uniform data; a minimal sketch follows below.
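A minimal, hedged grid-index sketch in Java; the cell width, the bucketing into a HashMap, and the rectangle query loop are illustrative assumptions.

import java.util.*;

/** Minimal grid index: bucket points into square cells, query by cell range. */
public class GridIndex {
    private final double cell;  // cell width: the tradeoff parameter from above
    private final Map<Long, List<double[]>> cells = new HashMap<>();

    GridIndex(double cellWidth) { this.cell = cellWidth; }

    private long key(long cx, long cy) { return (cx << 32) ^ (cy & 0xffffffffL); }

    void insert(double x, double y) {
        long cx = (long) Math.floor(x / cell), cy = (long) Math.floor(y / cell);
        cells.computeIfAbsent(key(cx, cy), k -> new ArrayList<>()).add(new double[] { x, y });
    }

    /** Rectangle query: visit only the cells overlapping the query window. */
    List<double[]> range(double x1, double y1, double x2, double y2) {
        List<double[]> hits = new ArrayList<>();
        for (long cx = (long) Math.floor(x1 / cell); cx <= (long) Math.floor(x2 / cell); cx++)
            for (long cy = (long) Math.floor(y1 / cell); cy <= (long) Math.floor(y2 / cell); cy++)
                for (double[] p : cells.getOrDefault(key(cx, cy), Collections.emptyList()))
                    if (p[0] >= x1 && p[0] <= x2 && p[1] >= y1 && p[1] <= y2) hits.add(p);
        return hits;
    }
}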

Example: BSP

● Binary Space Partitioning Tree (BSP)
– For a given tree node, split all geometries at a central hyperplane (a line in 2D) and create two new nodes containing all geometries from either side of the hyperplane.
● "Central" can be defined in various ways.
– Recurse until a given tree node contains only few points.

Example: Ball Trees

● Ball Tree
– Each node in the tree represents a ball (a circle in 2D) and all points falling into this ball.
– If there are too many points inside some ball, split the ball into two or more balls covering all points and build subtree nodes according to this splitting.

R trees and R* trees

● R tree
– Each node in the tree represents the Minimum Bounding Rectangle (MBR) of all points inside (or below) this node.
– If a node contains too many points, split it into two MBRs with similarly many points.

● R* tree
– An advanced insertion algorithm makes the tree one of the most efficient and widely used spatial indices for nearest-neighbor queries.

Lifting those Indices to Hadoop?

● A matter of much work and open research:
– Hotspots can easily arise for simple indices such as grids.
– Non-uniform access patterns / varying chunk sizes can result.
– Efficient spatial replication and data locality have not yet been fully thought through...

An open research topic, possibly worth some Master's theses, if you are interested.

SpatialHadoop is one approach...

Spatial Block Distribution


SpatialHadoop

● Uses spatial indices for distributing blocks in HDFS; hence, jobs can be put on nodes that already have the data locally.

● MapReduce components
– SpatialFileSplitter
– SpatialRecordReader

● Query components
– Range queries
– kNN queries
– Spatial joins

Distributed Index Model


Data Mining from Location

Classification and Clustering Based on Point Distance


Data Mining

● Data mining is the process of extracting structures and information from large bodies of data.

● Data mining can be split into
– Supervised approaches
– Unsupervised approaches

Supervised Data Mining

● In supervised data mining, a given dataset contains the intended result.

● This dataset can be used as a training dataset in order to extract a structural representation of the data.

● This structural representation is called a model and can be applied to unknown data, assigning what the model predicts.

Example: Classification

● Given a table with k attributes and a nominal class variable, learn to infer the class from the attributes.

● Classical example: the Iris dataset
– Question: can we infer the species of a flower from four of its size measurements?

The IRIS Dataset

> head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1          5.1         3.5          1.4         0.2  setosa

2          4.9         3.0          1.4         0.2  setosa

3          4.7         3.2          1.3         0.2  setosa

4          4.6         3.1          1.5         0.2  setosa

5          5.0         3.6          1.4         0.2  setosa

6          5.4         3.9          1.7         0.4  setosa

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179-188.

1NN Classification

A very simple classification scheme

● Nearest-neighbor classification is a very simple, though spatially quite useful, classification algorithm.

● It takes all training data and assigns to a location the class of the nearest training instance.

● Let us look at how it works:

kNN in R

library(class)  # provides knn()

data(iris)
# Select 10 random training rows
train = iris[sample(1:nrow(iris), 10),]
# Apply 1NN trained on the sample to the full iris set
res = knn(train[,1:4], iris[,1:4], train$Species, 1)
summary(res == iris$Species)
   Mode   FALSE    TRUE    NA's 
logical      16     134       0 

# Plot the data and mark the misclassified points with red crosses
plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species)
points(Petal.Width ~ Petal.Length,
       data=iris[which(iris$Species != res),],
       pch=4, cex=2, col="red", lwd=2)