Data Intensive Computing Frameworks


  • date post

    07-Aug-2015

Transcript of Data Intensive Computing Frameworks

  1. Data Intensive Computing Frameworks. Amir H. Payberah, amir@sics.se, Amirkabir University of Technology, 1394/2/25. Amir H. Payberah (AUT) Data Intensive Computing 1394/2/25 1 / 95
  2. Big Data. [Figure labels: small data, big data]
  3. Big Data refers to datasets and flows large enough to have outpaced our capability to store, process, analyze, and understand them.
  4. The Four Dimensions of Big Data. Volume: data size. Velocity: data generation rate. Variety: data heterogeneity. The 4th V is for Vacillation: Veracity/Variability/Value.
  5. Where Does Big Data Come From?
  6. Big Data Market Driving Factors. The number of web pages indexed by Google, around one million in 1998, exceeded one trillion in 2008, and its growth is accelerated by the appearance of social networks. Mining big data: current status, and forecast to the future [Wei Fan et al., 2013]
  7. Big Data Market Driving Factors. The amount of mobile data traffic is expected to grow to 10.8 exabytes per month by 2016. Worldwide Big Data Technology and Services 2012-2015 Forecast [Dan Vesset et al., 2013]
  8. Big Data Market Driving Factors. More than 65 billion devices were connected to the Internet by 2010, and this number will go up to 230 billion by 2020. The Internet of Things Is Coming [John Mahoney et al., 2013]
  9. Big Data Market Driving Factors. Many companies are moving towards using Cloud services to access Big Data analytical tools.
  10. Big Data Market Driving Factors. Open source communities.
  11. How To Store and Process Big Data?
  12. But First, The History
  13. 4000 B.C.: Manual recording. From tablets to papyrus, to parchment, and then to paper.
  14. 1450: Gutenberg's printing press.
  15. 1800s - 1940s: Punched cards (no fault tolerance). Binary data. 1890: US census. 1911: IBM appeared.
  16. 1940s - 1950s: Magnetic tapes.
  17. 1950s - 1960s: Large-scale mainframe computers. Batch transaction processing. File-oriented record processing model (e.g., COBOL).
  18. 1960s - 1970s: Hierarchical DBMS (one-to-many). Network DBMS (many-to-many). VM OS by IBM: multiple VMs on a single physical node.
  19. 1970s - 1980s: Relational DBMS (tables) and SQL. ACID. Client-server computing. Parallel processing.
  20. 1990s - 2000s: Virtual Private Network (VPN) connections. The Internet...
  21. 2000s - Now: Cloud computing. NoSQL: BASE instead of ACID. Big Data.
  22. How To Store and Process Big Data?
  23. Scale Up vs. Scale Out (1/2). Scale up, or scale vertically: adding resources to a single node in a system. Scale out, or scale horizontally: adding more nodes to a system.
  24. Scale Up vs. Scale Out (2/2). Scale up: more expensive than scaling out. Scale out: more challenging for fault tolerance and software development.
  25. Taxonomy of Parallel Architectures. DeWitt, D. and Gray, J. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6), 85-98, 1992.
  27. Two Main Types of Tools. Data store. Data processing.
  28. Data Store
  29. Data Store. How to store and access files? File system.
  30. What is a Filesystem? It controls how data is stored on and retrieved from disk.
  32. Distributed Filesystems. When data outgrows the storage capacity of a single machine, partition it across a number of separate machines. Distributed filesystems manage the storage across a network of machines.
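
As a rough illustration of partitioning data across machines, the sketch below assigns each file name to one machine by hashing it. Hash partitioning is only one common scheme and is an assumption here, not something the slides prescribe (HDFS, for instance, places blocks through its Namenode instead). The names `machine_for` and `partition` are hypothetical.

```python
import hashlib


def machine_for(key: str, machines: list) -> str:
    """Pick a machine for a key using a stable hash (same key, same machine)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and wrap it onto the machine list.
    return machines[int.from_bytes(digest[:8], "big") % len(machines)]


def partition(keys, machines):
    """Group keys by the machine they are assigned to."""
    placement = {machine: [] for machine in machines}
    for key in keys:
        placement[machine_for(key, machines)].append(key)
    return placement
```

Every key lands on exactly one machine, and repeating the call gives the same placement, so any client can locate a file without a central lookup. The trade-off is that changing the number of machines moves most keys.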
  36. HDFS (1/2). Hadoop Distributed FileSystem. Appears as a single disk. Runs on top of a native filesystem, e.g., ext3.
  37. HDFS (2/2)
  38. Files and Blocks (1/2). Files are split into blocks. A block is the single unit of storage, transparent to the user: 64 MB or 128 MB.
  39. Files and Blocks (2/2). The same block is replicated on multiple machines; the default replication factor is 3.
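
The block and replication figures above can be turned into a small back-of-the-envelope calculation. This is a minimal sketch, assuming a 128 MB block size (the slides mention 64 MB or 128 MB) and assuming the last, partial block only occupies as much space as its data; the function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # assumption: one of the two sizes named on the slide
REPLICATION = 3      # default replication factor from the slide


def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a file is split into (the last block may be partial)."""
    return math.ceil(file_size_mb / block_size_mb)


def replica_count(file_size_mb: float, replication: int = REPLICATION) -> int:
    """Total number of block replicas stored across the cluster."""
    return block_count(file_size_mb) * replication


def raw_storage_mb(file_size_mb: float, replication: int = REPLICATION) -> float:
    """Raw disk space used, assuming a partial block is not padded to full size."""
    return file_size_mb * replication
```

For example, a 1000 MB file gives `block_count(1000) == 8`, `replica_count(1000) == 24`, and `raw_storage_mb(1000) == 3000`.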
  40. HDFS Write. 1. Create a new file in the Namenode's namespace; calculate block topology. 2, 3, 4. Stream data to the first, second, and third node. 5, 6, 7. Success/failure acknowledgment.
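
The write steps above can be sketched as a toy in-memory model. This is not the Hadoop API: `DataNode`, `NameNode`, and `write_file` are hypothetical classes, round-robin placement stands in for the real block topology calculation, and failures are not modeled, so every acknowledgment is a success.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> bytes

    def receive(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, and return the ack."""
        self.blocks[block_id] = data
        if not downstream:
            return True  # last node in the pipeline acknowledges first
        # The ack from the rest of the pipeline travels back through this node.
        return downstream[0].receive(block_id, data, downstream[1:])


@dataclass
class NameNode:
    datanodes: list
    namespace: dict = field(default_factory=dict)  # path -> [(block_id, [node names])]
    replication: int = 3

    def create(self, path):
        """Step 1: register the new file in the namespace."""
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = []

    def allocate_block(self, path):
        """Choose target nodes for the next block (round-robin is an assumption)."""
        index = len(self.namespace[path])
        block_id = f"{path}#blk{index}"
        count = len(self.datanodes)
        targets = [self.datanodes[(index + i) % count] for i in range(self.replication)]
        self.namespace[path].append((block_id, [node.name for node in targets]))
        return block_id, targets


def write_file(namenode, path, data, block_size):
    namenode.create(path)
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        block_id, pipeline = namenode.allocate_block(path)
        # Steps 2-4: stream to the first node, which forwards to the second and third.
        # Steps 5-7: the acknowledgment comes back along the same chain.
        if not pipeline[0].receive(block_id, chunk, pipeline[1:]):
            raise IOError(f"write of {block_id} was not acknowledged")
```

The client only talks to the first node of each pipeline, so its outgoing bandwidth is spent once per block rather than once per replica. The sketch assumes at least as many data nodes as the replication factor.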
  41. HDFS Read. 1. Retrieve block locations. 2, 3. Read blocks to re-assemble the file.
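
The read path can be sketched the same way. The dictionaries below are hypothetical stand-ins for the Namenode's block metadata and the data nodes' local storage, and falling back to another replica when a node is unavailable is an assumption added to show why replication helps; it is not taken from the slide.

```python
# Hypothetical Namenode metadata: path -> ordered list of (block_id, replica node names).
block_locations = {
    "/logs/a.txt": [
        ("blk_0", ["dn1", "dn2", "dn3"]),
        ("blk_1", ["dn2", "dn3", "dn4"]),
    ],
}

# Hypothetical data node storage: node name -> {block_id: bytes}.
datanodes = {
    "dn1": {"blk_0": b"hello "},
    "dn2": {"blk_0": b"hello ", "blk_1": b"world"},
    "dn3": {"blk_0": b"hello ", "blk_1": b"world"},
    "dn4": {"blk_1": b"world"},
}


def read_file(path, unavailable=frozenset()):
    """Re-assemble a file from its blocks, skipping nodes listed as unavailable."""
    parts = []
    for block_id, replicas in block_locations[path]:  # step 1: block locations
        for node in replicas:                         # steps 2, 3: read each block
            if node not in unavailable and block_id in datanodes[node]:
                parts.append(datanodes[node][block_id])
                break
        else:
            raise IOError(f"no live replica for {block_id}")
    return b"".join(parts)
```

With three replicas per block, `read_file("/logs/a.txt", unavailable={"dn1", "dn2"})` still returns the whole file; the read fails only when every replica of some block is unreachable.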
  42. What About Databases?
  43. Database and Database Management System. Database: an organized collection of data. Database Management System (DBMS): software that interacts with users, other applications, and the database itself to capture and analyze data.
  45. Relational Database Management Systems (RDBMSs). RDBMSs: the dominant technology for storing structured data in web and business applications. SQL is good: rich language and toolset, easy to use and integrate, many vendors. They promise: ACID.
  47. ACID Properties. Atomicity. Consistency. Isolation. Durability.
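
Atomicity and consistency are the easiest of these properties to demonstrate in a few lines. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the `transfer` helper, and the sample balances are hypothetical, chosen only to show a transaction that either applies both updates or neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # If this debit breaks the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "bob", 500)` returns `False` and leaves both balances unchanged, because the failed debit rolls back the credit that preceded it. Isolation and durability are not exercised by this single-connection, in-memory example.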
  48. RDBMS Challenges. Web-based applications caused spikes: Internet-scale data size, high read-write rates, frequent schema changes.
  49. Scaling RDBMSs is Expensive and Inefficient [http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
  51. NoSQL. Avoidance of unneeded complexity. High throughput. Horizontal scalability and running on commodity hardware.
  52. NoSQL Cost and Performance [http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]
  53. RDBMS vs. NoSQL [http://www.couchbase.com/sites/default/files/uploads/all/whitepapers/NoSQLWhitepaper.pdf]