Data Intensive Computing Frameworks

Data Intensive Computing Frameworks. Amir H. Payberah, [email protected], Amirkabir University of Technology, 1394/2/25.

Transcript of Data Intensive Computing Frameworks

  1. Data Intensive Computing Frameworks. Amir H. Payberah, [email protected], Amirkabir University of Technology, 1394/2/25.
  2. Big Data: small data vs. big data.
  3. Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand them.
  4. The Four Dimensions of Big Data. Volume: data size. Velocity: data generation rate. Variety: data heterogeneity. The 4th V is for vacillation: Veracity / Variability / Value.
  5. Where Does Big Data Come From?
  6. Big Data Market Driving Factors. The number of web pages indexed by Google grew from around one million in 1998 to more than one trillion in 2008, and this expansion has been accelerated by the appearance of social networks. Mining big data: current status, and forecast to the future [Wei Fan et al., 2013]
  7. Big Data Market Driving Factors. The amount of mobile data traffic is expected to grow to 10.8 exabytes per month by 2016. Worldwide Big Data Technology and Services 2012-2015 Forecast [Dan Vesset et al., 2013]
  8. Big Data Market Driving Factors. More than 65 billion devices were connected to the Internet by 2010, and this number will go up to 230 billion by 2020. The Internet of Things Is Coming [John Mahoney et al., 2013]
  9. Big Data Market Driving Factors. Many companies are moving towards using Cloud services to access Big Data analytical tools.
  10. Big Data Market Driving Factors. Open source communities.
  11. How To Store and Process Big Data?
  12. But First, The History
  13. 4000 B.C.: manual recording, from tablets to papyrus, to parchment, and then to paper.
  14. 1450: Gutenberg's printing press.
  15. 1800s-1940s: punched cards (no fault tolerance); binary data. 1890: US census. 1911: IBM appeared.
  16. 1940s-1950s: magnetic tapes.
  17. 1950s-1960s: large-scale mainframe computers; batch transaction processing; file-oriented record processing model (e.g., COBOL).
  18. 1960s-1970s: hierarchical DBMS (one-to-many); network DBMS (many-to-many); VM OS by IBM: multiple VMs on a single physical node.
  19. 1970s-1980s: relational DBMS (tables) and SQL; ACID; client-server computing; parallel processing.
  20. 1990s-2000s: Virtual Private Network connections (VPN); the Internet...
  21. 2000s-now: Cloud computing; NoSQL: BASE instead of ACID; Big Data.
  22. How To Store and Process Big Data?
  23. Scale Up vs. Scale Out (1/2). Scale up, or scale vertically: adding resources to a single node in a system. Scale out, or scale horizontally: adding more nodes to a system.
  24. Scale Up vs. Scale Out (2/2). Scale up: more expensive than scaling out. Scale out: more challenging for fault tolerance and software development.
  25. Taxonomy of Parallel Architectures. DeWitt, D. and Gray, J. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6), 85-98, 1992.
  26. (figure)
  27. Two Main Types of Tools: data store and data processing.
  28. Data Store
  29. Data Store. How to store and access files? File system.
  30. What is a Filesystem? Controls how data is stored on and retrieved from disk.
  34. Distributed Filesystems. When data outgrows the storage capacity of a single machine, partition it across a number of separate machines. Distributed filesystems manage the storage across a network of machines.
  36. HDFS (1/2). Hadoop Distributed Filesystem. Appears as a single disk. Runs on top of a native filesystem, e.g., ext3.
  37. HDFS (2/2). (figure)
  38. Files and Blocks (1/2). Files are split into blocks, the single unit of storage. Transparent to the user. Typically 64 MB or 128 MB.
  39. Files and Blocks (2/2). The same block is replicated on multiple machines: the default replication factor is 3.
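To make the block and replication model concrete, here is a small Python sketch (purely illustrative; this is not HDFS code, and the datanode names are hypothetical) that splits a file size into 128 MB blocks and assigns each block three replicas round-robin across datanodes:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS block size
REPLICATION = 3                 # the HDFS default replication factor

def split_into_blocks(file_size):
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full + ([rest] if rest else [])

def place_replicas(num_blocks, datanodes):
    """Assign REPLICATION replicas per block, round-robin over the datanodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)           # a 300 MB file
layout = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
# a 300 MB file occupies two full 128 MB blocks plus one 44 MB block
```

Real HDFS placement is rack-aware rather than round-robin; the sketch only shows why a file becomes a set of replicated blocks.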
  40. HDFS Write. 1. Create a new file in the Namenode's namespace; calculate the block topology. 2, 3, 4. Stream the data to the first, second, and third nodes. 5, 6, 7. Success/failure acknowledgment.
  41. HDFS Read. 1. Retrieve the block locations. 2, 3. Read the blocks to re-assemble the file.
  42. What About Databases?
  44. Database and Database Management System. Database: an organized collection of data. Database Management System (DBMS): software that interacts with users, other applications, and the database itself to capture and analyze data.
  46. Relational Database Management Systems (RDBMSs). RDBMSs: the dominant technology for storing structured data in web and business applications. SQL is good: a rich language and toolset, easy to use and integrate, many vendors. They promise: ACID.
  47. ACID Properties: Atomicity, Consistency, Isolation, Durability.
  48. RDBMS Challenges. Web-based applications cause load spikes. Internet-scale data sizes. High read-write rates. Frequent schema changes.
  49. Scaling RDBMSs is Expensive and Inefficient. [http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
  50. (figure)
  51. NoSQL. Avoidance of unneeded complexity. High throughput. Horizontal scalability and running on commodity hardware.
  52. NoSQL Cost and Performance. [http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
  53. RDBMS vs. NoSQL. [http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
  54. NoSQL Data Models. [http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques]
  55. NoSQL Data Models: Key-Value. A collection of key/value pairs. Ordered key-value: supports processing over key ranges. Dynamo, Scalaris, Voldemort, Riak, ...
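A toy illustration of what "ordered" buys you (pure Python, not the API of any of the stores named above): keeping keys sorted is what makes range scans cheap.

```python
import bisect

class OrderedKV:
    """Toy ordered key-value store: sorted keys enable range queries."""
    def __init__(self):
        self._keys = []   # kept in sorted order
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)   # keep the key list sorted
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[i:j]]

store = OrderedKV()
for k, v in [("user:3", "carol"), ("user:1", "alice"), ("user:2", "bob")]:
    store.put(k, v)
# store.range("user:1", "user:3") walks the keys in order: alice, bob
```

An unordered hash-based store would answer `get` equally well but could only serve the range query by scanning everything.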
  56. NoSQL Data Models: Column-Oriented. Similar to a key/value store, but the value can have multiple attributes (columns). Column: a set of data values of a particular type. BigTable, HBase, Cassandra, ...
  57. NoSQL Data Models: Document-Based. Similar to a column-oriented store, but values can be complex documents, e.g., XML, YAML, JSON, or BSON. CouchDB, MongoDB, ... Example: { FirstName: "Bob", Address: "5 Oak St.", Hobby: "sailing" } and { FirstName: "Jonathan", Address: "15 Wanamassa Point Road", Children: [ {Name: "Michael", Age: 10}, {Name: "Jennifer", Age: 8} ] }
  58. NoSQL Data Models: Graph-Based. Uses graph structures with nodes, edges, and properties to represent and store data. Neo4J, InfoGrid, ...
  59. Data Processing
  60. Challenges. How to distribute computation? How can we make it easy to write distributed programs? Machine failures.
  62. Idea. Issue: copying data over a network takes time. Idea: bring the computation close to the data, and store files multiple times for reliability.
  63. MapReduce. A shared-nothing architecture for processing large data sets with a parallel/distributed algorithm on clusters.
  64. Simplicity. Don't worry about parallelization, fault tolerance, data distribution, and load balancing (MapReduce takes care of these). It hides system-level details from programmers. Simplicity!
  65. Warm-up Task (1/2). We have a huge text document. Count the number of times each distinct word appears in the file.
  68. Warm-up Task (2/2). The file is too large for memory, but all (word, count) pairs fit in memory. words(doc.txt) | sort | uniq -c, where words takes a file and outputs the words in it, one per line. This captures the essence of MapReduce; the great thing is that it is naturally parallelizable.
  74. MapReduce Overview. words(doc.txt) | sort | uniq -c. Sequentially read a lot of data. Map: extract something you care about. Group by key: sort and shuffle. Reduce: aggregate, summarize, filter, or transform. Write the result.
  75. Example: Word Count. Consider doing a word count of the following file using MapReduce: Hello World Bye World Hello Hadoop Goodbye Hadoop
  76. Example: Word Count - map. The map function reads in words one at a time and outputs (word, 1) for each parsed input word. The map function output is: (Hello, 1) (World, 1) (Bye, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)
  77. Example: Word Count - shuffle. The shuffle phase between the map and reduce phases creates a list of values associated with each key. The reduce function input is: (Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1)) (Hello, (1, 1)) (World, (1, 1))
  78. Example: Word Count - reduce. The reduce function sums the numbers in the list for each key and outputs (word, count) pairs. The output of the reduce function is the output of the MapReduce job: (Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)
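The three phases of slides 76-78 can be simulated in a few lines of plain Python (a single-process sketch, not Hadoop code): map emits (word, 1) pairs, shuffle groups values by key, and reduce sums each group.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word, like the slides' map function.
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # Group the values of each key, e.g., (Hadoop, [1, 1]).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the list of counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
pairs = [p for line in lines for p in map_phase(line)]
result = reduce_phase(shuffle_phase(pairs))
# result: {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```

In real MapReduce each call to map_phase runs in parallel on a different split of the input, and the shuffle moves pairs across the network so that all values for one key reach the same reducer.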
  79. Example: Word Count - map

      public static class MyMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);  // emit (word, 1)
          }
        }
      }
  80. Example: Word Count - reduce

      public static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values)  // sum the counts for this word
            sum += value.get();
          context.write(key, new IntWritable(sum));
        }
      }
  81. Example: Word Count - driver

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }
  82. MapReduce Execution. J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51(1), 2008.
  84. MapReduce Weaknesses. The MapReduce programming model has not been designed for complex operations, e.g., data mining. It is also very expensive: results always go to disk and HDFS.
  85. Solution?
  86. Spark. Extends MapReduce with more operators. Support for advanced dataflow graphs. In-memory and out-of-core processing.
  87. Spark vs. Hadoop. (figure)
  91. Resilient Distributed Datasets (RDD). Immutable collections of objects spread across a cluster. An RDD is divided into a number of partitions. Partitions of an RDD can be stored on different nodes of a cluster.
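A rough single-machine analogy for the idea (plain Python, not Spark's API): an immutable collection split into partitions, where a transformation builds a new partitioned collection instead of mutating the old one.

```python
class ToyRDD:
    """Illustrative stand-in for an RDD: immutable, partitioned data."""
    def __init__(self, data, num_partitions):
        n = max(1, num_partitions)
        size = max(1, -(-len(data) // n))   # ceiling division
        self.partitions = [tuple(data[i:i + size])
                           for i in range(0, len(data), size)]

    def map(self, f):
        # Transformations return a NEW ToyRDD; partitions are never mutated.
        out = ToyRDD([], 1)
        out.partitions = [tuple(f(x) for x in p) for p in self.partitions]
        return out

    def collect(self):
        # Gather all partitions back into one local list.
        return [x for p in self.partitions for x in p]

rdd = ToyRDD(list(range(8)), 4)      # 8 elements in 4 partitions
doubled = rdd.map(lambda x: 2 * x)   # new collection; the old one is untouched
```

In Spark the partitions live on different cluster nodes and `map` runs on each partition in parallel; immutability is what lets Spark recompute a lost partition from its lineage instead of replicating the data.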
  92. What About Streaming Data?
  95. Motivation. Many applications must process large streams of live data and provide results in real time, processing information as it flows, without storing it persistently. Traditional DBMSs, in contrast, store and index data before processing it, and process data only when explicitly asked by the users.
  96. DBMS vs. DSMS (1/3). DBMS: persistent data where updates are relatively infrequent. DSMS: transient data that is continuously updated.
  97. DBMS vs. DSMS (2/3). DBMS: runs queries just once to return a complete answer. DSMS: executes standing queries, which run continuously and provide updated answers as new data arrives.
  98. DBMS vs. DSMS (3/3). Despite these differences, DSMSs resemble DBMSs: both process incoming data through a sequence of transformations based on SQL operators, e.g., selections, aggregates, joins.
  99. DSMS. Source: produces the incoming information flows. Sink: consumes the results of processing. IFP engine: processes incoming flows. Processing rules: how to process the incoming flows. Rule manager: adds/removes processing rules.
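The components above can be sketched as a toy IFP engine in Python (purely illustrative; real DSMSs are far more sophisticated, and all names here are made up): processing rules are callables applied to each event as it flows by, and the sink receives continuously updated answers.

```python
class ToyIFPEngine:
    """Toy information-flow-processing engine: rules run per event, results go to a sink."""
    def __init__(self, sink):
        self.rules = {}    # rule manager: name -> callable(event) -> result or None
        self.sink = sink   # consumes the results of processing

    def add_rule(self, name, rule):
        self.rules[name] = rule

    def remove_rule(self, name):
        self.rules.pop(name, None)

    def push(self, event):
        # Process the event as it flows by, without storing it persistently.
        for name, rule in self.rules.items():
            result = rule(event)
            if result is not None:
                self.sink.append((name, result))

running_total = {"sum": 0}
def total_rule(event):
    # A standing query: a continuously updated running sum.
    running_total["sum"] += event
    return running_total["sum"]

sink = []
engine = ToyIFPEngine(sink)
engine.add_rule("total", total_rule)
for event in [3, 5, 2]:   # the source produces the incoming flow
    engine.push(event)
# the sink now holds the updated answers: [("total", 3), ("total", 8), ("total", 10)]
```

Note the contrast with a DBMS: the query is registered once and re-evaluated per event, rather than run once over stored data.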
  100. (figure)
  101. What About Graph Data?
  102. (figure)
  103. Large Graph. (figure)
  104. Large-Scale Graph Processing. Large graphs need large-scale processing. A large graph either cannot fit into the memory of a single computer, or fits only at huge cost.
  105. Data-Parallel Model for Large-Scale Graph Processing. The platforms that have worked well for developing parallel applications are not necessarily effective for large-scale graph problems. Why?
  106. Graph Algorithm Characteristics. Unstructured problems: difficult to partition the data. Data-driven computations: difficult to partition the computation. Poor data locality. High data-access-to-computation ratio.
  107. Proposed Solution: Graph-Parallel Processing. Computation typically depends on a vertex's neighbors.
  108. Graph-Parallel Processing. Restricts the types of computation. New techniques to partition and distribute graphs. Exploits graph structure. Executes graph algorithms orders of magnitude faster than more general data-parallel systems.
  109. Data-Parallel vs. Graph-Parallel Computation. (figure)
  110. Vertex-Centric Programming. Think like a vertex. Each vertex computes its value individually, in parallel. Each vertex can see its local context, and updates its value accordingly.
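"Think like a vertex" can be demonstrated with a classic toy example (plain Python, not Pregel/Giraph code): every vertex repeatedly adopts the maximum value it sees among itself and its neighbors, until no value changes.

```python
def vertex_centric_max(values, edges):
    """Each vertex updates its value from its local context (its neighbors)."""
    neighbors = {v: set() for v in values}
    for a, b in edges:           # treat the edges as undirected
        neighbors[a].add(b)
        neighbors[b].add(a)
    changed = True
    while changed:               # one "superstep" per iteration
        changed = False
        new_values = {}
        for v in values:         # conceptually, all vertices run in parallel
            best = max([values[v]] + [values[u] for u in neighbors[v]])
            new_values[v] = best
            changed = changed or best != values[v]
        values = new_values
    return values

vals = {"a": 1, "b": 7, "c": 3, "d": 2}
result = vertex_centric_max(vals, [("a", "b"), ("b", "c"), ("c", "d")])
# every vertex in the connected component converges to the maximum, 7
```

Each vertex only reads its neighbors' values and writes its own, which is exactly the restriction that lets graph-parallel systems partition and schedule the computation.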
  111. (figure)
  115. Summary. Scale-out vs. scale-up. How to store data? Distributed file systems: HDFS. NoSQL databases: HBase, Cassandra, ... How to process data? Batch data: MapReduce, Spark. Streaming data: Spark Streaming, Flink, Storm, S4. Graph data: Giraph, GraphLab, GraphX, Flink.
  116. Questions?