Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC


Description

One of the most popular NoSQL databases, MongoDB is a key building block for big data analysis. MongoDB can store unstructured data and makes it easy to analyze that data with commonly available tools. This session will go over how big data analytics can improve sales outcomes by identifying users with a propensity to buy, based on information processed from social networks. All attendees will have a MongoDB instance on a public cloud, plus sample code to run big data analytics.

Transcript of Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

Page 1: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
Page 2: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
Page 3: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

We offer MongoDB-as-a-Service on any cloud of your choice. You can read more about our MongoDB-as-a-Service in the white paper on our website: http://www.cumulogic.com/resources/mongodb_wp/

Page 4: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

The goal of this boot camp is to give you hands-on experience with MongoDB database-as-a-service, show you how to load data, and walk through a sample application to analyze that data. We will use a small sample Twitter application for our hands-on lab, which will help you write a MongoDB application. We will also briefly discuss a few performance-related considerations so you can analyze and tune the performance of your databases. At the same time, you will see how easily you can launch a fully managed MongoDB instance in the cloud.
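To set expectations for the hands-on portion, here is a minimal sketch in Python with pymongo of the first two steps: connecting to a MongoDB instance and loading a document. The host, credentials, database and collection names are placeholders for whatever your provisioned instance gives you, not values from the lab itself.

```python
# Minimal sketch: connect to a MongoDB instance and load one sample document.
# All names and the connection string below are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://user:password@your-mongodb-host:27017/")
db = client["bootcamp"]     # hypothetical database name
tweets = db["tweets"]       # collection is created lazily on first insert

# Load one tweet-like document; MongoDB accepts it without a predefined schema.
tweets.insert_one({
    "user": "alice",
    "text": "Trying out MongoDB at Cloud Expo Bootcamp NYC",
    "retweet_count": 3,
})
print(tweets.count_documents({}))
```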

Page 5: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

About a decade ago, business applications were transactional in nature and most of the issues were related to executing transactions (e.g. credit card processing) with low latency; as a result, enterprise data was more "relational" in nature and was therefore "structured". Fast forward to today and businesses are trying to solve a different problem. The nature of business applications has changed, and enterprises are trying to figure out how to use all the data in their enterprise systems, social media, machine logs and so on, to understand how that data impacts their business and how they can gain competitive advantage by leveraging the nuggets in it. With the diverse nature of data sources and data formats, we need newer technologies that scale and provide answers, or identify those nuggets in the data, at much higher speed and lower cost than traditional SQL databases or data warehouse systems. Hence we see a slew of new database technologies being developed that promise to help solve these problems. Depending on the nature of the data or the problem they solve, we can group these new database technologies into three major categories: (1) document-oriented databases, which store and crunch data in document formats, (2) key-value databases such as Riak and Redis, and (3) graph databases. Depending on the type of data, you can use one of these databases to solve your data analytics problems. Today, we focus on MongoDB.

Page 6: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

When should we use a NoSQL database versus a SQL database, and which NoSQL database should we choose? As I mentioned before, the problems that NoSQL databases solve are related to the nature and amount of data we want to process in our next-generation applications. We need databases that can scale to petabytes of data at a fraction of the cost of a relational database. We need database systems which can help us quickly analyze petabytes of data and provide results in realtime – hence the speed and velocity of data access is critical. NoSQL database systems can provide high-speed, low-latency access to large amounts of data. One key criterion to consider when choosing a NoSQL database is the nature of your applications and the main issues with them – are they operational or analytical? For example, for batch-processing, analytical apps you may be better off with Hadoop, while for operational concerns of scalability and realtime processing you may want to choose MongoDB. So consider these criteria when making your decision, do some experiments, and find the database that best fits your application's needs.

Page 7: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

Let's take a look at the key feature set of MongoDB at a very high level.
1. MongoDB is a document-oriented database server. It stores objects as BSON (pronounced "bison"), a binary version of JSON, and it supports dynamic schemas – which essentially means it is a schema-less database. There is no rigid SQL-like schema for storing the data. This gives you flexibility in handling data from different sources such as social networks, machine logs or CRM systems.
2. MongoDB supports indexing just like traditional SQL indexing, which means you can index any field to improve query performance; high-cardinality fields benefit most. (FYI – high cardinality here means a field whose value varies across records. For example, if we are storing data about employees, the field that varies most is the phone number, not the city name or company name.) See the sketch after this list.
3. Most of you may be familiar with the concept of database sharding. MongoDB is a horizontally scalable database and supports sharding – which means it stores data in smaller chunks on several data nodes for low-latency access. Hence MongoDB is widely used in the cloud, because you can scale the database by adding shards as your data grows and maintain that low latency of data access even as the size of the data grows.
4. MongoDB is designed to be resilient for data durability and supports replica sets, which can be geographically distributed.
5. MongoDB supports map-reduce operations and provides fast updates to the data.
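To make points 1 and 2 concrete, here is a small pymongo sketch (names are illustrative, not from the lab code) showing documents with different shapes stored side by side in one collection, and an index created on a high-cardinality field.

```python
# Sketch of dynamic schemas and indexing with pymongo; all names are illustrative.
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017/")["demo"]
employees = db["employees"]

# Two documents with different fields coexist -- no schema migration needed.
employees.insert_one({"name": "Ada", "phone": "212-555-0100", "city": "New York"})
employees.insert_one({"name": "Grace", "phone": "646-555-0199", "skills": ["python", "mongodb"]})

# Index the high-cardinality field (phone) rather than a low-variance one (city).
employees.create_index([("phone", ASCENDING)])

print(employees.find_one({"phone": "212-555-0100"}))
```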

FAQs: When do you want to use Hadoop vs. MongoDB for map-reduce? Answer: You want to use Hadoop for batch jobs, where you can fire up analytics on offline data, whereas you can use MongoDB for realtime data analytics.
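As a hedged illustration of the "realtime analytics" side of this answer, the sketch below uses MongoDB's aggregation pipeline (the usual in-database alternative to its map-reduce command) to group live tweet documents by user. The collection and field names are assumptions, not the lab's actual schema.

```python
# Group tweets by user and rank the most active users, computed on live data.
from pymongo import MongoClient

tweets = MongoClient("mongodb://localhost:27017/")["bootcamp"]["tweets"]

pipeline = [
    {"$group": {"_id": "$user", "tweet_count": {"$sum": 1}}},  # count tweets per user
    {"$sort": {"tweet_count": -1}},                            # most active users first
    {"$limit": 10},
]
for row in tweets.aggregate(pipeline):
    print(row["_id"], row["tweet_count"])
```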

Question: How does sharding work in MongoDB? Answer: MongoDB sharding works by spreading writes across multiple data nodes. Mongos, which is the MongoDB routing process, directs each read or write to the appropriate data node (refer to the sharding diagram).
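Here is a minimal pymongo sketch of how a client would interact with such a cluster through mongos. It assumes a sharded cluster (config servers plus shards) is already running; the host, database, collection and shard-key names are placeholders.

```python
# Sketch: enable sharding through mongos and write via the router.
from pymongo import MongoClient

client = MongoClient("mongodb://your-mongos-host:27017/")  # connect to mongos, not a data node

# Enable sharding on the database, then shard the collection on a chosen key.
client.admin.command("enableSharding", "bootcamp")
client.admin.command("shardCollection", "bootcamp.tweets", key={"user": "hashed"})

# From here on, reads and writes go through mongos, which routes them to the right shard.
client["bootcamp"]["tweets"].insert_one({"user": "alice", "text": "hello from a shard"})
```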

Page 8: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

Since MongoDB scales very well horizontally, it is one of the most widely used databases in the cloud. Given the complexity of managing MongoDB while maintaining availability, data durability and performance, you may want to leverage platforms that provide MongoDB-as-a-Service: a single web service call provisions a dedicated MongoDB deployment, fully sharded and replicated, which scales automatically. You will get a chance to use the MongoDB service on our platform shortly.

Page 9: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

The specific MongoDB architecture that you choose will impact performance, availability and data durability. MongoDB is flexible and supports high-availability and sharding architectures to provide the level of redundancy, performance and SLA you want for your service. MongoDB supports replica-set and sharded deployment architectures: replica sets provide high availability and data durability, while sharding provides scalability. You can configure shards on top of replica sets to achieve the best of both, reliability and scalability.

Page 10: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

This is a replica set with three replica nodes in two datacenters or two regions of a public cloud. Replication in MongoDB is asynchronous ("eventual consistency" on the secondaries), which means data on the replicas may temporarily be out of sync with the primary node. You may want to use this architecture for data redundancy rather than for scaling: in this architecture you still send reads and writes to the primary node, which means that even with multiple nodes your application wouldn't necessarily scale better. To maintain this level of redundancy yet improve scalability, you can use sharding, as in the next slide.
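A small pymongo sketch of this point, with placeholder host and set names: by default the driver sends reads and writes to the primary, and you have to opt in to reading from secondaries, accepting possibly stale results.

```python
# Connect to a three-node replica set; host names and set name are placeholders.
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://node1.example.com:27017,node2.example.com:27017,node3.example.com:27017/"
    "?replicaSet=rs0"
)
db = client["bootcamp"]

# Default behaviour: reads and writes both go to the primary.
db["tweets"].insert_one({"user": "alice", "text": "written to the primary"})

# Opt in to reading from secondaries, accepting possibly out-of-date data.
stale_ok = db.get_collection("tweets", read_preference=ReadPreference.SECONDARY_PREFERRED)
print(stale_ok.count_documents({}))
```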

Page 11: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

This is a three-shard deployment architecture which uses three replica sets and can live in a single region or datacenter or be distributed geographically. With this architecture you get the benefits of both: data redundancy with replica sets and high scalability with shards. Each shard can itself be a replica set, which provides data redundancy at each node level. But keep in mind that there is overhead to sharding and replication, so choose what's best for your database.

Page 12: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

Now let's take a look at a sample application. We have a sample Twitter app to experiment with hands-on. We will use MongoDB-as-a-Service on the cloud and use the sample app to analyze Twitter data.
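The lab's actual Twitter application is not reproduced in this transcript; the sketch below only illustrates the same idea under assumed field names: tweet documents stored in MongoDB and a simple query that flags users whose tweets suggest a propensity to buy.

```python
# Store tweet-like documents and query for a crude propensity-to-buy signal.
from pymongo import MongoClient

tweets = MongoClient("mongodb://localhost:27017/")["bootcamp"]["tweets"]

tweets.insert_many([
    {"user": "bob",   "text": "thinking about buying a new phone", "followers": 120},
    {"user": "carol", "text": "great keynote at cloud expo",       "followers": 4500},
])

# Users whose tweets mention "buying" -- a simple proxy for purchase intent.
for doc in tweets.find({"text": {"$regex": "buying", "$options": "i"}}):
    print(doc["user"], doc["text"])
```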

Page 13: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
Page 14: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC

Just like any database, the performance of a MongoDB database must be monitored and optimized for a given workload or application type. These are the key metrics you want to look at in MongoDB: (1) CPU, (2) memory, (3) ops counters – the total number of operations executed over a period of time, which, together with the queue statistics, shows you the number of active and pending operations, and (4) background flush – the number of times MongoDB flushes in-memory data to disk. You want to keep an eye on this number and tune it if you wish to reduce the frequency of disk writes. There are other metrics which we will see during our hands-on lab.
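As a starting point for that lab, here is a minimal pymongo sketch that pulls several of these metrics from a running server with the serverStatus command. The exact document layout varies by MongoDB version and storage engine, so treat the field names as indicative rather than guaranteed.

```python
# Read basic server metrics from a running mongod; connection string is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
status = client.admin.command("serverStatus")

print("connections:", status["connections"]["current"])
print("opcounters:", status["opcounters"])              # inserts, queries, updates, deletes
print("resident memory (MB):", status["mem"]["resident"])

# backgroundFlushing is only reported by the older MMAPv1 storage engine.
if "backgroundFlushing" in status:
    print("background flushes:", status["backgroundFlushing"]["flushes"])
```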

Page 15: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
Page 16: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
Page 17: Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC