CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Big Data Platform Elements - Part 1 CIS 415 Lecture 3 Hina Arora


Page 1: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Big Data Platform Elements - Part 1
CIS 415 Lecture 3

Hina Arora

Page 2: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Announcements
• We have a Grader!
o Anirudh Dhawan ([email protected])
o Office Hours: Thur 10am-12pm; BA Suite 318
• Show of hands – did you complete last week’s required readings?
o Contents of Lecture-1 Deck and any supplemental notes you took in class
o Review: “Vocabulary” Section in Lecture-1 Deck
o Review: List of common data applications here
o Watch: The Beauty of Data Visualization

Page 3: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Big Data Platform Elements
[Diagram: Virtualization, Cloud Computing, Parallel Programming, and Map Reduce shown as the elements of Big Data Platforms]

Page 4: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

What will we cover today?
• Virtualization
• Cloud Computing

Page 5: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Virtualization

Page 6: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

What is Virtualization?
• “Virtualization means that Applications can use a resource without any concern for where it resides, what the technical interface is, how it has been implemented, which platform it uses, and how much of it is available” ~Rick F. Van der Lans in Data Virtualization for Business Intelligence Systems
• We’ll look at a few different types of virtualization:
o Server Virtualization (can be HW-level or OS-level Virtualization)
o Storage Virtualization
o Network Virtualization
o Desktop Virtualization
o Application Virtualization

Page 7: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Server Virtualization: HW-level virtualization
• Ability to run multiple Virtual Machines (VMs or guests) on a single Physical Machine (host).
• Each Virtual Machine emulates the underlying physical hardware and has its own Operating System (OS).
• Guest VMs are almost completely isolated from each other.
• Each guest VM can run a different OS.
• Hypervisors (or Virtual Machine Monitors or VMMs) are used to create and run VMs. There are two types of hypervisors:
o Type-1, Native or Bare-metal Hypervisors:
‒ Run directly on the host's hardware.
‒ Example: Hyper-V Hypervisor.
o Type-2 or Hosted Hypervisors:
‒ Run on the host’s OS.
‒ Example: VMware Player, VirtualBox
• Server Virtualization provides improved utilization and scalability.
Reference: https://en.wikipedia.org/wiki/Hypervisor

[Diagram: two stacks, bottom to top. Type-1: Server, Hypervisor Type-1, then VMs (each with Guest OS, Bins/Libs, App). Type-2: Server, Host OS, Hypervisor Type-2, then VMs (each with Guest OS, Bins/Libs, App).]

Page 8: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Server Virtualization: OS-level virtualization
• Ability to run multiple isolated Containers (user-space instances or guests) on a single Physical Machine (host).
• Containers do not emulate the underlying HW and don’t have their own OS (they share the host OS). This lighter footprint allows hosts to support a higher density of guest Containers (as against guest VMs). On the flip side, sharing the host OS raises Security concerns.
• Containers can also share binaries and libraries with other Containers.
• Each Container typically runs a single Application.
• Example: Docker
Reference: https://en.wikipedia.org/wiki/Operating-system-level_virtualization

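[Diagram: OS-level virtualization stack, bottom to top: Server, Host OS, Docker, then Containers, each holding an App on top of Bins/Libs that may be shared between Containers.]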

Page 9: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Review: Storage Definitions
• Block
o A sequence of bytes.
o Storage systems typically provide access to blocks.
o The OS typically abstracts other logical views like files and records.
• Striping
o Sequential blocks of data are stored on different physical storage devices in (typically) round-robin fashion.
o Example: Disk1 <A, C, E>; Disk2 <B, D, F>
o Striping is useful when requests for data arrive faster than a single storage device can deliver. Striping data across multiple storage devices allows for concurrent access to data, thereby improving performance.
• Mirroring
o Replication of data onto separate disks in real time.
o Example: Disk1 <A, B, C>; Disk2 <A, B, C>
o Improves data redundancy and reliability.
• Parity
o Data on a crashed disk can be reconstructed using data on the other disks (using the XOR operation).
o Example: Disk1 <A:11010011>; Disk2 <B:10011001>; Disk3 <PAB: 01001010>
o Essentially, PAB = A XOR B, so if any one disk crashes, you can reconstruct its contents using the XOR operation between the other two.
o Improves data redundancy.
• File System
o Controls how data is managed, stored and retrieved.
o Without a file system, we would just have a large blob of data with no way to identify different connected pieces of information.
o File systems are organized around groups of data called files, and groups of files called directories or folders.
o Distributed file systems are file systems that are spread across multiple servers.

Reference: Wikipedia
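The striping, mirroring, and parity definitions above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: blocks are plain letters or 8-bit integers, and the function names (`stripe`, `mirror`, `parity`) are made up for this example rather than taken from any real storage API.

```python
from functools import reduce


def stripe(blocks, num_disks):
    """Distribute sequential blocks across disks in round-robin order."""
    disks = [[] for _ in range(num_disks)]
    for index, block in enumerate(blocks):
        disks[index % num_disks].append(block)
    return disks


def mirror(blocks, num_disks=2):
    """Write an identical copy of every block to each disk."""
    return [list(blocks) for _ in range(num_disks)]


def parity(*blocks):
    """XOR equal-sized integer blocks together to form a parity block."""
    return reduce(lambda x, y: x ^ y, blocks)


# Striping example from the slide: Disk1 <A, C, E>; Disk2 <B, D, F>
striped = stripe(["A", "B", "C", "D", "E", "F"], 2)

# Mirroring example from the slide: Disk1 <A, B, C>; Disk2 <A, B, C>
mirrored = mirror(["A", "B", "C"])

# Parity example from the slide: A = 11010011, B = 10011001
a = 0b11010011
b = 0b10011001
p_ab = parity(a, b)

# If the disk holding A crashes, XOR the two surviving blocks to rebuild it.
rebuilt_a = parity(b, p_ab)

print(striped)
print(mirrored)
print(format(p_ab, "08b"))       # 01001010
print(format(rebuilt_a, "08b"))  # 11010011
```

The same XOR trick rebuilds B from A and PAB, which is why losing any single disk is recoverable.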

Page 10: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Storage Virtualization
• Data is abstracted into what appears to be a single storage unit, while the physical storage actually spans multiple heterogeneous devices and often locations
• Storage Virtualization provides location independence, improved utilization, performance, reliability and availability
• Example: RAID (redundant array of independent/inexpensive disks)

Popular RAID Types
(Striping provides excellent performance; Mirroring provides excellent redundancy; Parity provides good redundancy.)

RAID Type | Striping | Mirroring | Parity | Minimum Number of Disks | Example (Disk -- Blocks) | Comments
RAID 0 | Yes | No | No | 2 | Disk 1 -- A, C, E; Disk 2 -- B, D, F | Excellent Performance. No Redundancy. Do not use for critical applications.
RAID 1 | No | Yes | No | 2 | Disk 1 -- A, B, C; Disk 2 -- A, B, C | Good Performance. Excellent Redundancy.
RAID 5 | Yes | No | Yes (Distributed Parity) | 3 | Disk 1 -- A, C, PEF; Disk 2 -- B, PCD, E; Disk 3 -- PAB, D, F | Good Performance. Good Redundancy. Most cost effective. Fast Reads; Slow Writes.
RAID 10 | Yes | Yes | No | 4 | Disk 1 -- A, C, E; Disk 2 -- A, C, E; Disk 3 -- B, D, F; Disk 4 -- B, D, F | Excellent Performance. Excellent Redundancy. Great for mission critical applications. Not as cost-effective as RAID 5.

Reference: https://en.wikipedia.org/wiki/RAID
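One way to see why the cost-effectiveness comments differ between RAID levels is to compare usable capacity for the same set of disks. The sketch below is a rough illustration, not a sizing tool: it assumes equal-sized disks, treats RAID 1 as every disk holding a full copy, and treats RAID 10 as striping over two-way mirrored pairs. The function name `usable_capacity` is invented for this example.

```python
MINIMUM_DISKS = {"RAID 0": 2, "RAID 1": 2, "RAID 5": 3, "RAID 10": 4}


def usable_capacity(raid_level, num_disks, disk_size_tb):
    """Return usable capacity in TB for equal-sized disks (simplified model)."""
    if raid_level not in MINIMUM_DISKS:
        raise ValueError(f"unsupported RAID level: {raid_level}")
    if num_disks < MINIMUM_DISKS[raid_level]:
        raise ValueError(f"{raid_level} needs at least {MINIMUM_DISKS[raid_level]} disks")

    if raid_level == "RAID 0":
        # Pure striping: every disk stores unique data.
        return num_disks * disk_size_tb
    if raid_level == "RAID 1":
        # Pure mirroring: every disk stores the same data.
        return disk_size_tb
    if raid_level == "RAID 5":
        # Striping with distributed parity: one disk's worth of space holds parity.
        return (num_disks - 1) * disk_size_tb
    # RAID 10: striping across two-way mirrored pairs (assumption).
    if num_disks % 2 != 0:
        raise ValueError("RAID 10 needs an even number of disks in this model")
    return (num_disks // 2) * disk_size_tb


for level in ("RAID 0", "RAID 1", "RAID 5", "RAID 10"):
    print(level, usable_capacity(level, num_disks=4, disk_size_tb=1), "TB usable of 4 TB raw")
```

With four 1 TB disks, this model gives RAID 5 three usable TB against two for RAID 10, which is consistent with the table's note that RAID 10 is not as cost-effective as RAID 5.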

Page 11: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Review: Network Definitions
• Local Area Network (LAN)
o A computer network with interconnected devices within a limited geographical area such as a house or building.
• Wide Area Network (WAN)
o A computer network that spans large geographical areas
• IP Address
o Address of a device participating in a network
o IPv4: 32 bits | IPv6: 128 bits
o Example: 11000000.10101000.00000101.10000010 (192.168.5.130)
o Higher order bits determine network (indicated by subnet mask), and lower order bits determine host (device)
• Subnetting
o Dividing a network into smaller parts
o This affects the total number of hosts that can be addressed
• Switch
o Connects devices together on a computer network
• Router
o Carries traffic from one network/subnet to another
o Routers maintain routing tables to determine whether traffic is meant for this LAN, a connected LAN or a different network.
o Example: the home router connects home computers to the internet (these are similar networks since they both share TCP/IP protocol)
• Gateway
o Typically connects two or more (dissimilar) computer networks
Reference: Wikipedia
Image Source: http://netprivateer.com/lanwan.html
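The IP address and subnetting definitions above can be explored with Python's standard `ipaddress` module. The address 192.168.5.130 comes from the slide. The /24 subnet mask and the split into four smaller subnets are assumptions chosen only for illustration.

```python
import ipaddress

# The slide's example address, shown in dotted binary form.
address = ipaddress.ip_address("192.168.5.130")
binary = ".".join(format(octet, "08b") for octet in address.packed)
print(binary)  # 11000000.10101000.00000101.10000010

# Assumed /24 mask: the top 24 bits name the network, the low 8 bits name the host.
interface = ipaddress.ip_interface("192.168.5.130/24")
network = interface.network
host_part = int(interface.ip) & int(network.hostmask)
print(network)    # 192.168.5.0/24
print(host_part)  # 130

# Subnetting: borrow 2 host bits to divide the /24 into four /26 subnets.
subnets = list(network.subnets(prefixlen_diff=2))
print([str(subnet) for subnet in subnets])

# Each /26 has 64 addresses; the network and broadcast addresses are not assigned to hosts.
usable_hosts = subnets[0].num_addresses - 2
print(usable_hosts)  # 62
```

Borrowing host bits creates more subnets but leaves fewer addressable hosts in each, which is the trade-off the Subnetting bullet refers to.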

Page 12: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Network Virtualization
• Creation of logical, virtual networks that are decoupled from the (limitations of) underlying physical hardware.
• Example: VLAN, VPN
o Virtual Local Area Network (VLAN)
‒ Allows for grouping of hosts within a virtual LAN regardless of geographical location
‒ Provides scalability, flexibility, simplified administration, and security
o Virtual Private Network (VPN)
‒ Securely extends a private network over a public network such as the internet
‒ Users can remotely communicate with the private network as though they were directly connected to it with the same functionality, security and administrative policies
‒ Provides flexibility, simplified administration, and security
Image Source: https://en.wikipedia.org/wiki/Virtual_private_network

Page 13: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

(Remote) Desktop Virtualization
• Enables access to applications on a remote OS using a virtual desktop.
• The remote OS carries the application and data, and only the display, keyboard, and mouse information are communicated with the local client device.
• Users (on the local client devices) must establish a session and be connected with the remote server to access the application.
• Makes installation, upgrades and management of applications easier for IT.
• Two kinds: RDS, VDI
• Remote Desktop Services (RDS) aka Terminal Services
o Provides remote desktop to multiple users on a Host OS
o Provides users session-based isolation (session virtualization); users share the Host OS
o Users have no admin privileges on the Host OS
o Can support higher user density
• Virtual Desktop Infrastructure (VDI)
o Provides remote desktop to multiple users on Guest OSs
o Provides users VM-based isolation; each user gets a dedicated Guest OS
o Users have admin privileges on the Guest OS
o Supports lower user density

Page 14: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Application Virtualization
• Application Virtualization separates the Application from the OS, so Applications can be more easily deployed and delivered.
• The application is packaged and streamed from the server down the network to the client and, instead of being installed on the client device, is executed on the local device in a virtual bubble that is completely isolated from the client OS.
• Applications are streamed intelligently.
o Only required parts are streamed as and when they are used.
o Once the application has been streamed, it is cached on the client device so it doesn’t have to be streamed every time a user uses it on the client. This also means the application can be used even when the client is not connected to the server.
o When an application upgrade is available, the server copy is upgraded, and the upgrades are streamed down to the clients the next time the application is used on the client.
• Makes installation, upgrades and management of applications easier for IT.
• Examples: VMware ThinApp, Citrix XenApp and Microsoft App-V
Reference: http://blogs.msdn.com/b/ianm/archive/2010/06/11/microsoft-virtual-desktop-101-making-sense-of-vdi-rds-app-v-med-v-and-desktop-virtualisation.aspx
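The streaming behavior described above (stream parts on demand, cache them locally, allow offline use, pick up upgrades on next use) can be modeled with a small toy example. Everything here is hypothetical: the `AppServer` and `Client` classes, the part names, and the choice to discard the whole cache on a version change are assumptions for illustration and do not describe how ThinApp, XenApp, or App-V actually work.

```python
class AppServer:
    """Holds the current packaged copy of an application, split into parts."""

    def __init__(self, version, parts):
        self.version = version
        self.parts = dict(parts)

    def upgrade(self, version, parts):
        """Replace the server copy; clients pick it up on their next connected use."""
        self.version = version
        self.parts = dict(parts)


class Client:
    """Runs application parts from a local cache, streaming them only when missing."""

    def __init__(self):
        self.cached_version = None
        self.cache = {}
        self.streamed = []  # log of (version, part) pairs fetched over the network

    def use(self, part, server=None):
        """Use one part of the app; server=None models a disconnected client."""
        if server is not None and server.version != self.cached_version:
            # The server copy changed: drop stale parts (simplifying assumption).
            self.cache.clear()
            self.cached_version = server.version
        if part not in self.cache:
            if server is None:
                raise RuntimeError(f"'{part}' is not cached and the client is offline")
            self.cache[part] = server.parts[part]
            self.streamed.append((server.version, part))
        return self.cache[part]


server = AppServer("1.0", {"editor": "editor-v1", "spellcheck": "spellcheck-v1"})
client = Client()

client.use("editor", server)           # streamed from the server on first use
client.use("editor", server)           # served from the local cache
offline_result = client.use("editor")  # still works while disconnected

server.upgrade("2.0", {"editor": "editor-v2", "spellcheck": "spellcheck-v2"})
upgraded_result = client.use("editor", server)  # upgrade streamed on next use

print(client.streamed)
```

Note that the "spellcheck" part is never streamed because the client never uses it, matching the point that only required parts are sent.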

Page 15: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Cloud Computing

Page 16: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Have you used Applications Hosted on the Cloud?
What are some characteristics these applications have in common*?
• You typically sign up for service (free with ads, free trial, or subscription)
• You connect to the internet for access
• You don’t need to “install” application software, and “version upgrades” are pushed seamlessly
• You expect reliable, on-demand, self-service of the application
• You expect the ability to upgrade instantaneously (e.g., more storage, no ads, etc.)
• You rely on the service provider for infrastructure (e.g., you don’t set up a mail server)
• You rely on the service provider for security and privacy
• You rely on the service provider for backup and recovery
*Note: a lot of these services come with client apps – we are not considering that scenario here.

Page 17: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

What is Cloud Computing?
• “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
• Key enabling technologies include: (1) fast wide-area networks, (2) powerful, inexpensive server computers, and (3) high-performance virtualization for commodity hardware.”
Source: http://www.nist.gov/itl/cloud/
http://www.intel.com/content/www/us/en/cloud-computing/cloud-101-video.html

Page 18: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Deployment Models
There are 3 basic deployment models in cloud computing:
• Private Cloud
o Two kinds of private clouds:
‒ On-Prem Private Cloud: On-Prem Data Center + Network Virtualization + Cloud Orchestration Software
‒ Externally Hosted Private Cloud (also called Virtual Private Cloud): Logically isolated, user-defined, and user-controlled portion of a 3rd party hosted cloud (like AWS or Microsoft).
o Provides high degree of Control
o Good for highly-sensitive data and applications
• Public Cloud
o Third-Party Provides Cloud Services (3 different service models: IaaS, PaaS, or SaaS)
o Typically pay-as-you-go model (you pay for what you use)
o Service Provider held to agreed upon availability, reliability, privacy and security standards
o Provides high degree of Scalability
o Example: Amazon AWS, Microsoft Azure, Google Cloud
• Hybrid Cloud
o Combination of Private and Public Cloud
o Allows you to pick desired level of Control vs Scalability

Page 19: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Service Models
There are 4 basic service models in cloud computing, based on what parts of the stack the User controls vs what the Cloud Provider manages.
• Private: User controls everything from the networking to the applications. Example: user’s on-premise datacenter.
• IaaS: User controls the application down to the underlying OS, and the Cloud Provider manages the virtualization layer and the hardware. Example: getting a virtual server in the cloud.
• PaaS: User controls application and data, and the Cloud Provider provisions the underlying supporting infrastructure, typically including operating system, programming-language execution environment, database, and web servers. This allows developers to focus on application development instead of worrying about underlying hardware and software layers.
• SaaS: User gains access to application software and databases. Cloud providers install and operate application software, and manage the infrastructure and platforms that run the applications. Example: O365 in the cloud.
* Note: “Managed by Microsoft” is just an example – it’s essentially cloud provider of your choice…
Image Source: http://cloudcomputing.sys-con.com/node/2932264
Reference: https://en.wikipedia.org/wiki/Cloud_computing
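The split between what the User controls and what the Cloud Provider manages can be written down as a simple lookup. The nine-layer stack below is an assumption based on the layering commonly drawn in service-model diagrams; the slide text itself only names some of these layers. The `responsibilities` function is a hypothetical helper for this example.

```python
# Assumed stack, ordered from the top (closest to the user) to the bottom (hardware).
STACK = [
    "Applications",
    "Data",
    "Runtime",
    "Middleware",
    "OS",
    "Virtualization",
    "Servers",
    "Storage",
    "Networking",
]

# How many layers, counted from the top of the stack, the User controls in each model.
USER_CONTROLLED_LAYERS = {
    "Private": 9,  # everything from the networking to the applications
    "IaaS": 5,     # the application down to the underlying OS
    "PaaS": 2,     # application and data
    "SaaS": 0,     # the provider runs the whole stack
}


def responsibilities(model):
    """Return which layers the User controls and which the Cloud Provider manages."""
    count = USER_CONTROLLED_LAYERS[model]
    return {"user": STACK[:count], "provider": STACK[count:]}


for model in ("Private", "IaaS", "PaaS", "SaaS"):
    split = responsibilities(model)
    print(f"{model}: user controls {split['user']}; provider manages {split['provider']}")
```

Reading the output from Private to SaaS shows the boundary moving up the stack, with the provider taking over one group of layers at a time.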

Page 20: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Key Characteristics
• On-demand self-service: A consumer can provision computing capabilities as needed, automatically, without requiring human interaction with each service provider.
• Device and location independence: Users can access service using a web browser regardless of location or device used (e.g., PC, mobile phone).
• Resource pooling: Computing resources are pooled to serve multiple consumers, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.
• Scalability and elasticity: Dynamic on-demand provisioning of resources on a fine-grained, self-service basis in near real-time without users having to engineer for peak loads.
• Measured service: Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
Reference: http://www.nist.gov/itl/cloud/ and https://en.wikipedia.org/wiki/Cloud_computing
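To make "without users having to engineer for peak loads" concrete, the sketch below compares server-hours for fixed, peak-sized provisioning against elastic, metered provisioning. The hourly demand figures and the rate of 10 cents per server-hour are invented for illustration and do not reflect any provider's pricing.

```python
def server_hours_fixed(hourly_demand):
    """Fixed provisioning: run enough servers for the peak hour, all the time."""
    return max(hourly_demand) * len(hourly_demand)


def server_hours_elastic(hourly_demand):
    """Elastic provisioning: run only as many servers as each hour needs."""
    return sum(hourly_demand)


# Hypothetical number of servers needed in each of eight consecutive hours.
hourly_demand = [2, 2, 3, 8, 10, 4, 2, 2]

# Hypothetical metered price, in cents, to avoid floating-point rounding.
RATE_CENTS_PER_SERVER_HOUR = 10

fixed_cost = server_hours_fixed(hourly_demand) * RATE_CENTS_PER_SERVER_HOUR
elastic_cost = server_hours_elastic(hourly_demand) * RATE_CENTS_PER_SERVER_HOUR

print("Fixed (peak-sized):", fixed_cost, "cents")
print("Elastic (pay for use):", elastic_cost, "cents")
```

The gap between the two totals is the idle capacity that peak-sized provisioning pays for; measured service is what makes the elastic total billable per unit of use.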

Page 21: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Advantages and Risks
• Advantages
o Scalability and elasticity by design (dynamic on-demand provisioning of resources)
o Convenience by design (device and location independence)
o Continuous Availability by design (on-demand self-service)
o Improved Reliability due to use of multiple redundant sites
o Faster Deployment since infrastructure set up is quick, and software integration is easier
o Cost Reduction due to savings on sunk cost of infrastructure, licenses, and maintenance
• Risks
o Limited Control over infrastructure, software, and data
o Security and Privacy of data is at the mercy of the Service Provider
o Dependency on the Provider can lead to vendor lock-in and migration challenges
o Downtime of service can occur due to Service Provider outage or network access issues
Reference: https://en.wikipedia.org/wiki/Cloud_computing

Page 22: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

What did we learn today?
• Four key elements make up big data platforms:
o Virtualization, Cloud Computing, Parallel Programming and Map Reduce.
• “Virtualization means that Applications can use a resource without any concern for where it resides, what the technical interface is, how it has been implemented, which platform it uses, and how much of it is available.”
o Virtualization can occur at different levels of the stack: Server, Storage, Network, Desktop and Application.
• “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”
o Three Deployment Models: Private, Public, Hybrid.
o Four Service Models: Private, IaaS, PaaS, SaaS.
o There are Advantages and Risks involved in Cloud Computing that one must be aware of.

Page 23: CIS 415 Lecture 3 - Big Data Platform Elements - Part 1

Required Readings for this Lecture
• Contents of this Deck
o Note: Anything I’ve linked to as “Source”, “Reference”, or “Optional Reading” in the deck is not required reading.
• Supplemental notes you take during class
• Homework: spend 5-10 minutes on each of these Sites: Amazon AWS, Microsoft Azure, Google Cloud
o Do you now see a number of familiar terms on these sites?
o What deployment models do they cover?
o What service models do they cover?
o Note how they all have very similar competing offers (including free trials to improve adoption).