Become Data-Driven with Hadoop-as-a-Service
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting

Be more specific, sure!
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or the relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics which operate on the data as it comes in, as well as design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tool
○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho). We will also create initial reports and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham, NC
www.mammothdata.com | @mammothdataco
Agenda
● What is a Data-Driven Company?
● Challenges to Becoming a Data-Driven Company
● Reasons Companies Overcome These Challenges
● Moving to Real Time
● Stages of Development
● Stage One: Data Consolidation and Analytics
● Stage Two: Computer-Aided Decision Making
● Stage Three: Real-Time Decision Making
● Stage Four: Semantic Web, Natural Language, Everything as a Service
● Spark
● Hadoop/Spark as a Service
● Workloads
● Accommodating Workloads
● Vision of Commoditized Computing
● Yesterday, Today and Tomorrow
www.mammothdata.com | @mammothdataco
What is a Data Driven Company?
To be data-driven means cultivating a mindset throughout the fabric of the business to continually use analytics to make fact-based business decisions.
It is not a “technology” or a product.

A data-driven company has a mature flow of data and makes decisions using cross-sections of data such as:
● Sales to manufacturing
● Its supply chain
● Finance
● Industry and public sources
www.mammothdata.com | @mammothdataco
Challenges to Becoming a Data Driven Company
Business
● Tradition
● Rapidly changing, undocumented business processes
● Lack of management support
● Undocumented meaning of business data

Technical
● Disparate data sources
● Poorly documented data sources
● Non-integrated proprietary systems
● Poorly structured data
● Traditional data warehouses were cost-prohibitive and scaled poorly (both up and down)
www.mammothdata.com | @mammothdataco
Reasons Companies Overcome These Challenges
Cost effectiveness - by integrating supply chain, finance, manufacturing, and sales data, we can make better, more timely decisions (e.g. supply chain integration).

Opportunity - by using public data, internal data, and data about our customers, we can find new opportunities (e.g. customer intimacy).

Competitiveness - competitors are making better use of data to target our customers and potential customers.

Regulatory Requirements - regulations are driving innovation in finance and health care. Often this requires a better understanding of the data and more effective means of storing and analyzing it.
www.mammothdata.com | @mammothdataco
Moving to Real Time
While companies are moving to become more data driven, they are also moving to more “real-time” processing of data and decision making.
This affects not only the processes companies use to make decisions, but the way they process and store data.
This also affects relationships with suppliers and other vendors -- it often involves negotiating new interfaces (technical and personal) with other companies.
Storage systems and data processing systems have to be adapted to summarize or analyze data “on the fly.”
For historical or trend analysis, batch processing is still necessary.
www.mammothdata.com | @mammothdataco
Stages of Development
1. Data Consolidation and Analytics
2. Computer-Aided Decision Making
3. Real-Time Computer-Aided Decision Making
4. Semantic Web, Natural Language, Everything as a Service
www.mammothdata.com | @mammothdataco
Stage One: Consolidation and Analytics
● Consolidate major sources of data into one meta-schema.
● Create views into the data that analyze data across sources.
● Deploy a data governance system which allows tagging, discovery, and management of the data.
● Create dashboards and reports for the data.

Technical example:
● Deploy Hadoop into the cloud (ideally) or on premises.
● Create Hive or Impala tables that map to the data.
● Create views that bind multiple tables into an easy form for reports.
● Deploy a governance tool (Navigator, Hadoop Revealed, etc.).
● Create reports using Tableau or a similar tool.
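As a concrete illustration of the "views that bind multiple tables" step, here is a minimal sketch using SQLite as a lightweight stand-in for Hive/Impala. The table and column names (sales, manufacturing, supply_vs_demand) are hypothetical examples, not from the deck:

```python
import sqlite3

# Two "source" tables consolidated behind a single view, the way
# Hive/Impala views bind multiple tables into an easy form for reports.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE sales (product TEXT, units_sold INTEGER)")
cur.execute("CREATE TABLE manufacturing (product TEXT, units_built INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("widget", 120), ("gadget", 75)])
cur.executemany("INSERT INTO manufacturing VALUES (?, ?)",
                [("widget", 150), ("gadget", 60)])

# One view across both sources: the reporting layer only sees this.
cur.execute("""
    CREATE VIEW supply_vs_demand AS
    SELECT s.product,
           s.units_sold,
           m.units_built,
           m.units_built - s.units_sold AS surplus
    FROM sales s JOIN manufacturing m ON s.product = m.product
""")

rows = cur.execute(
    "SELECT product, surplus FROM supply_vs_demand ORDER BY product"
).fetchall()
print(rows)  # [('gadget', -15), ('widget', 30)]
```

The same pattern applies at Hadoop scale: the view hides the join logic, so dashboards and reporting tools query one clean shape instead of the raw source tables.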
www.mammothdata.com | @mammothdataco
Stage Two: Computer Aided Decision Making
● Identify human workflows for data: what are you doing with those reports? What decisions are made? By whom, when, why?
● Deploy workflow tools, machine learning algorithms, and similar technologies to automate those decisions while sending the backing data (“Why did it do that?”) to the responsible parties.
● Define “roles” in the decision-making process; codify decisions as a set of rules and values (e.g. if backlog > 4 then notify_sales(1 month wait time), notify_supplier(up_order 4 * last_month)).
● This involves a lot of business process re-engineering: not only defining processes that are currently “behavioral” or “intuitive,” but also redefining the role in which individuals see themselves (manager of the decision or process vs. “worker” of the process/decision).
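The backlog rule above could be codified along these lines. This is a hypothetical sketch: notify_sales and notify_supplier stand in for real workflow or notification hooks, and the thresholds are the example values from the slide:

```python
# Hypothetical stand-ins for real workflow/notification integrations.
def notify_sales(wait_time_months):
    return f"sales notified: quote {wait_time_months} month wait time"

def notify_supplier(order_qty):
    return f"supplier notified: up order to {order_qty} units"

def backlog_rule(backlog_months, last_month_order):
    """Codified decision: if backlog exceeds 4 months, warn sales to
    quote longer lead times and raise the supplier order to 4x last
    month's quantity."""
    actions = []
    if backlog_months > 4:
        actions.append(notify_sales(1))
        actions.append(notify_supplier(4 * last_month_order))
    return actions

print(backlog_rule(5, 250))  # both notifications fire
print(backlog_rule(3, 250))  # [] - no action needed
```

The point is not the rule itself but that it is explicit: the backing data ("why did it do that?") can be logged and sent to the responsible parties alongside each action.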
www.mammothdata.com | @mammothdataco
Stage Three: Real Time Computer Aided Decision Making
Even with well-codified processes and computer-aided decision making, systems still must be adapted to provide data in real time, and business processes need to change so the organization can react to data in real time.
Suppliers must start providing data as feeds that are accurate to the moment.
The organization must deploy data processing systems that can handle the data in streams of events.
Public data and its effects on the business are understood and reactive systems are put into place (weather, financial, social media, news, regulatory, etc.)
If (DJIA < (DJIA_12 * .92)) then (change_sales_quota(sales_quota * .8), reduce_supplier_order(20 percent), reduce_mfg_qty(20 percent))
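That rule could be sketched in Python as follows. This is an illustrative, hypothetical implementation; the 8% trigger and 20% reductions come from the rule on the slide, while the function and parameter names are invented for the example:

```python
def react_to_market(djia, djia_12, sales_quota, supplier_order, mfg_qty):
    """If the DJIA falls more than 8% below its 12-month reference,
    cut the sales quota, supplier order, and manufacturing quantity
    by 20% each; otherwise leave them unchanged."""
    if djia < djia_12 * 0.92:
        return {
            "sales_quota": sales_quota * 0.8,
            "supplier_order": supplier_order * 0.8,
            "mfg_qty": mfg_qty * 0.8,
        }
    return {
        "sales_quota": sales_quota,
        "supplier_order": supplier_order,
        "mfg_qty": mfg_qty,
    }

# A drop from a 37,000 reference to 33,000 (about -11%) trips the rule.
print(react_to_market(djia=33_000, djia_12=37_000,
                      sales_quota=100, supplier_order=500, mfg_qty=400))
```

Wired to a streaming feed of market events, a rule like this reacts within moments rather than at the next quarterly review.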
This is attainable in nearly any business sector with today’s technology.
www.mammothdata.com | @mammothdataco
Stage Four: Semantic Web, NLP, Everything as a Service
The semantics of the business are mapped to ontologies so that every bit of data and the vocabulary of the data are understood in relation to each other.
“Hello Computer, raise the sales quota by 10 percent, reduce the product price by 2 percent, divide the southeast into two territories and add the appropriate capacity to the system to handle the increased data.”
The business itself is broken up into services along with the software/data resources necessary to service it.
Most decisions are computer aided, but new strategies are described and implemented semantically.
New opportunities are identified by the system through machine learning and pattern recognition.
Obtaining this with today’s technology is still cost prohibitive except in limited fields.
www.mammothdata.com | @mammothdataco
Spark
www.mammothdata.com | @mammothdataco
Spark
● Spark is based on an in-memory Directed Acyclic Graph (DAG).
● Map-Reduce could be said to be the edges 10, 3, 8 of a DAG, but what about nodes 11 and 2, which can happen in parallel to 3?
● Spark is still mostly a “batch” processing system.
● Spark Streaming is micro-batching.
● Your whole working dataset needs to fit into memory.
(wikipedia DAG diagram)
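The parallelism point can be sketched without Spark at all, using Python's stdlib topological sorter. The node names mirror the DAG diagram referenced above; the scheduling logic is a simplified illustration, not how Spark's scheduler is actually implemented:

```python
from graphlib import TopologicalSorter

# DAG of computation steps: each node maps to the steps it depends on.
# Nodes "3", "11", and "2" all depend only on "10", so nothing orders
# them relative to each other - they may run in parallel. A fixed
# map -> shuffle -> reduce pipeline cannot express that.
dag = {
    "3":  {"10"},
    "11": {"10"},
    "2":  {"10"},
    "8":  {"3", "11", "2"},  # "8" joins all three branches
}

ts = TopologicalSorter(dag)
ts.prepare()
schedule = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # every node in `ready` may run in parallel
    schedule.append(ready)
    ts.done(*ready)

print(schedule)  # [['10'], ['11', '2', '3'], ['8']]
```

The middle batch is the whole argument: a DAG scheduler sees three independent branches and dispatches them together, where a Map-Reduce chain would serialize them across separate jobs.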
www.mammothdata.com | @mammothdataco
● With that said, Spark is FAR, FAR faster than Map-Reduce.
● Spark integrates with YARN and Mesos, but requires neither.
● Generally, Spark loads from HDFS, S3, or an RDBMS via Spark SQL.
● Storm tends to offer superior performance at scale for Complex Event Processing.
● Spark’s REPL (Read-Evaluate-Print Loop) support makes it a superior choice for your everyday data scientist (i.e. the mathematician who writes bad Python).
● Spark supports Java, Java 8, Scala, and Python.
● Generally, most things will be Spark soon.
Spark
www.mammothdata.com | @mammothdataco
Hadoop / Spark as a Service
www.mammothdata.com | @mammothdataco
Different workloads:
● Streaming: dedicated “ready” hardware to meet SLA and QoS requirements.
● Business Intelligence: dedicated, predictable storage requirements, but often random utilization requirements.
● Specialized: generally similar to streaming (and often are streaming systems); specialized analytics (e.g. fraud detection) with generally predictable workloads and utilization requirements.
● Exploratory: before we create a specialized system or regular report, we need to see if it is worthwhile. Could take any shape in disk and CPU requirements, but is transient by nature.
Workloads
www.mammothdata.com | @mammothdataco
● We don’t know who, how, or when someone may need Hadoop, Spark, Storm, Kafka, etc.
● So, have a pool of hardware and/or cloud resources.
● We need to make sure users are allowed to use them.
● We need to control how much each gets, but offer enough, and provision more if we run out.
● We don’t want every department to have its own Hadoop cluster.
Accommodating multiple workloads
www.mammothdata.com | @mammothdataco
Vision of Commoditized Computing
Infrastructure
● A common set of boxes with CPUs and NICs, at most divergent in classification (think Amazon medium, XLarge, etc.)
● A common set of boxes with storage that can be dedicated to purpose on demand
● Common switched/bonded network
● Thin management layer (where the workload originates / is managed from)
● Virtualized, Docker-style
Workload Management
● Work comes in or is scheduled, and is delegated to currently purposed assets.
● After a configured threshold is reached, more assets are purposed to that workload type.
● If work drops below a configured threshold, assets are de-purposed and returned to the pool.
● Storage is managed in the same way.
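The threshold-driven pool described above can be sketched in a few lines. This is a toy model under assumed defaults (high-water mark of 5 queued items, low-water mark of 1); real systems would schedule against Mesos/Kubernetes-style resource offers:

```python
class AssetPool:
    """Toy model: assets are purposed to a workload type when its queue
    crosses a high-water mark and returned to the pool when it falls
    below a low-water mark."""

    def __init__(self, total_assets, high=5, low=1):
        self.free = total_assets
        self.high, self.low = high, low
        self.purposed = {}   # workload type -> asset count
        self.queued = {}     # workload type -> pending work items

    def submit(self, workload, items=1):
        self.queued[workload] = self.queued.get(workload, 0) + items
        self._rebalance(workload)

    def finish(self, workload, items=1):
        self.queued[workload] = max(0, self.queued.get(workload, 0) - items)
        self._rebalance(workload)

    def _rebalance(self, workload):
        q = self.queued.get(workload, 0)
        if q > self.high and self.free > 0:
            # Above threshold: purpose another asset to this workload.
            self.free -= 1
            self.purposed[workload] = self.purposed.get(workload, 0) + 1
        elif q < self.low and self.purposed.get(workload, 0) > 0:
            # Below threshold: de-purpose and return the asset to the pool.
            self.free += 1
            self.purposed[workload] -= 1

pool = AssetPool(total_assets=4)
pool.submit("streaming", items=6)   # crosses the high-water mark
print(pool.purposed, pool.free)     # {'streaming': 1} 3
pool.finish("streaming", items=6)   # queue empties, asset comes back
print(pool.purposed, pool.free)     # {'streaming': 0} 4
```

Storage would follow the same rebalance loop, with capacity rather than queue depth as the threshold metric.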
www.mammothdata.com | @mammothdataco
Vision of Commoditized Computing
[Diagram: a pool of assets divided into Unused, Storage, and Compute]
www.mammothdata.com | @mammothdataco
Vision of Commoditized Computing
[Diagram: the same pool with assets re-purposed among Unused, Storage, and Compute]
www.mammothdata.com | @mammothdataco
The largest Hadoop-as-a-Service clusters were hand-built with devops tools (Chef, Ansible, etc.).
Blue Data offers one solution where you can click to deploy based on quotas, etc.
The next generation of “as a Service” will be built with Mesos, Kubernetes, and Docker.
…No, Amazon doesn’t actually give you this: they give you some resources and a bill, not really resource management.
Yesterday, Today and Tomorrow
www.mammothdata.com | @mammothdataco
Questions?
● Leader in Big Data/NoSQL Consulting
● Focused on new data architectures
● Vendor independent
● Practical and pragmatic
● A risk-mitigated approach to new technology