Become Data-Driven with Hadoop-as-a-Service
www.mammothdata.com | @mammothdataco
The Leader in Big Data Consulting

Be more specific, sure!
● BI/Data Strategy
○ Development of a business intelligence / data architecture strategy.
● Installation
○ Installation of Hadoop or the relevant technology.
● Data Consolidation
○ Load data from diverse sources into a single scalable repository.
● Streaming
○ Mammoth will write ingestion and/or analytics which operate on the data as it comes in, as well as design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.
● Visualization Tool
○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho). We will also create initial reports and provide training to the employees who will analyze the data.
Mammoth Data, based in downtown Durham, NC
www.mammothdata.com | @mammothdataco
Agenda
● What is a Data-Driven Company?
● Challenges to Becoming a Data-Driven Company
● Reasons Companies Overcome These Challenges
● Moving to Real Time
● Stages of Development
● Stage One: Data Consolidation and Analytics
● Stage Two: Computer-Aided Decision Making
● Stage Three: Real-Time Decision Making
● Stage Four: Semantic Web, Natural Language, Everything as a Service
● Spark
● Hadoop/Spark as a Service
● Workloads
● Accommodating Workloads
● Vision of Commoditized Computing
● Yesterday, Today and Tomorrow
www.mammothdata.com | @mammothdataco
What is a Data Driven Company?
To be data-driven means cultivating a mindset throughout the fabric of the business to continually use analytics to make fact-based business decisions.
It is not a “technology” or a product.

A data-driven company has a mature flow of data and makes decisions using cross-sections of data such as:
● Sales to manufacturing
● Its supply chain
● Finance
● Industry and public sources
www.mammothdata.com | @mammothdataco
Challenges to Becoming a Data Driven Company
Business
● Tradition
● Rapidly changing, undocumented business processes
● Lack of management support
● Undocumented meaning of business data

Technical
● Disparate data sources
● Poorly documented data sources
● Non-integrated proprietary systems
● Poorly structured data
● Traditional data warehouses were cost-prohibitive and scaled poorly (both up and down)
www.mammothdata.com | @mammothdataco
Reasons Companies Overcome These Challenges
Cost effectiveness - by integrating supply chain, finance, manufacturing, and sales data, we can make better, more timely decisions (e.g. supply chain integration).

Opportunity - by using public data, internal data, and data about our customers, we can find new opportunities (e.g. customer intimacy).

Competitiveness - competitors are making better use of data to target our customers and potential customers.

Regulatory Requirements - regulations are driving innovation in finance and health care. Often this requires a better understanding of the data and more effective means of storing and analyzing it.
www.mammothdata.com | @mammothdataco
Moving to Real Time
While companies are moving to become more data driven, they are also moving to more “real-time” processing of data and decision making.
This affects not only the processes companies use to make decisions, but the way they process and store data.
This also affects relationships with suppliers and other vendors -- it often involves negotiating new interfaces (technical and personal) with other companies.
Storage systems and data processing systems have to be adapted to summarize or analyze data “on the fly.”
For historical or trend analysis, batch processing is still necessary.
www.mammothdata.com | @mammothdataco
Stages of Development
1. Data Consolidation and Analytics
2. Computer-Aided Decision Making
3. Real-Time Computer-Aided Decision Making
4. Semantic Web, Natural Language, Everything as a Service
www.mammothdata.com | @mammothdataco
Stage One: Consolidation and Analytics
● Consolidate major sources of data into one meta-schema.
● Create views into the data that analyze data across sources.
● Deploy a data governance system which allows tagging, discovery, and management of the data.
● Create dashboards and reports for the data.

Technical example:
● Deploy Hadoop into the cloud (ideally) or on premises.
● Create Hive or Impala tables that map to the data.
● Create views that bind multiple tables into an easy form for reports.
● Deploy a governance tool (Navigator, Hadoop Revealed, etc.).
● Create reports using Tableau or a similar tool.
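As a concrete illustration of the "views that bind multiple tables" step, here is a minimal sketch using SQLite as a lightweight stand-in for Hive/Impala. The table and column names (sales, manufacturing, supply_vs_demand) are hypothetical examples, not from the deck:

```python
import sqlite3

# Two "source" tables consolidated behind a single view, the way
# Hive/Impala views bind multiple tables into an easy form for reports.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE sales (product TEXT, units_sold INTEGER)")
cur.execute("CREATE TABLE manufacturing (product TEXT, units_built INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("widget", 120), ("gadget", 75)])
cur.executemany("INSERT INTO manufacturing VALUES (?, ?)",
                [("widget", 150), ("gadget", 60)])

# One view across both sources: the reporting layer only sees this.
cur.execute("""
    CREATE VIEW supply_vs_demand AS
    SELECT s.product,
           s.units_sold,
           m.units_built,
           m.units_built - s.units_sold AS surplus
    FROM sales s JOIN manufacturing m ON s.product = m.product
""")

rows = cur.execute(
    "SELECT product, surplus FROM supply_vs_demand ORDER BY product"
).fetchall()
print(rows)  # [('gadget', -15), ('widget', 30)]
```

The same pattern applies at Hadoop scale: the view hides the join logic, so dashboards and reporting tools query one clean shape instead of the raw source tables.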
www.mammothdata.com | @mammothdataco
Stage Two: Computer Aided Decision Making
● Identify human workflows for data: what are you doing with those reports? What decisions are made? By whom, when, why?
● Deploy workflow tools, machine learning algorithms, and similar technologies to automate those decisions while sending the backing data (“Why did it do that?”) to the responsible parties.
● Define “roles” in the decision-making process; codify decisions as a set of rules and values (e.g. if backlog > 4 then notify_sales(1 month wait time), notify_supplier(up_order 4 * last_month)).
● This involves a lot of business process re-engineering: not only defining processes that are currently “behavioral” or “intuitive,” but also redefining the role in which individuals see themselves (manager of the decision or process vs. “worker” of the process/decision).
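The backlog rule above could be codified along these lines. This is a hypothetical sketch: notify_sales and notify_supplier stand in for real workflow or notification hooks, and the thresholds are the example values from the slide:

```python
# Hypothetical stand-ins for real workflow/notification integrations.
def notify_sales(wait_time_months):
    return f"sales notified: quote {wait_time_months} month wait time"

def notify_supplier(order_qty):
    return f"supplier notified: up order to {order_qty} units"

def backlog_rule(backlog_months, last_month_order):
    """Codified decision: if backlog exceeds 4 months, warn sales to
    quote longer lead times and raise the supplier order to 4x last
    month's quantity."""
    actions = []
    if backlog_months > 4:
        actions.append(notify_sales(1))
        actions.append(notify_supplier(4 * last_month_order))
    return actions

print(backlog_rule(5, 250))  # both notifications fire
print(backlog_rule(3, 250))  # [] - no action needed
```

The point is not the rule itself but that it is explicit: the backing data ("why did it do that?") can be logged and sent to the responsible parties alongside each action.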
www.mammothdata.com | @mammothdataco
Stage Three: Real Time Computer Aided Decision Making
Even with well-codified processes and computer-aided decision making, systems still must be adapted to provide data in real time, and business processes need to change so the organization can react to data in real time.
Suppliers must start providing data as feeds that are accurate to the moment.
The organization must deploy data processing systems that can handle the data in streams of events.
Public data and its effects on the business are understood and reactive systems are put into place (weather, financial, social media, news, regulatory, etc.)
If (DJIA < (DJIA_12 * .92)) then (change_sales_quota(sales_quota * .8), reduce_supplier_order(20 percent), reduce_mfg_qty(20 percent))
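That rule could be sketched in Python as follows. This is an illustrative, hypothetical implementation; the 8% trigger and 20% reductions come from the rule on the slide, while the function and parameter names are invented for the example:

```python
def react_to_market(djia, djia_12, sales_quota, supplier_order, mfg_qty):
    """If the DJIA falls more than 8% below its 12-month reference,
    cut the sales quota, supplier order, and manufacturing quantity
    by 20% each; otherwise leave them unchanged."""
    if djia < djia_12 * 0.92:
        return {
            "sales_quota": sales_quota * 0.8,
            "supplier_order": supplier_order * 0.8,
            "mfg_qty": mfg_qty * 0.8,
        }
    return {
        "sales_quota": sales_quota,
        "supplier_order": supplier_order,
        "mfg_qty": mfg_qty,
    }

# A drop from a 37,000 reference to 33,000 (about -11%) trips the rule.
print(react_to_market(djia=33_000, djia_12=37_000,
                      sales_quota=100, supplier_order=500, mfg_qty=400))
```

Wired to a streaming feed of market events, a rule like this reacts within moments rather than at the next quarterly review.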
This is attainable in nearly any business sector with today’s technology.
www.mammothdata.com | @mammothdataco
Stage Four: Semantic Web, NLP, Everything as a Service
The semantics of the business are mapped to ontologies so that every bit of data and the vocabulary of the data are understood in relation to each other.
“Hello Computer, raise the sales quota by 10 percent, reduce the product price by 2 percent, divide the southeast into two territories and add the appropriate capacity to the system to handle the increased data.”
The business itself is broken up into services along with the software/data resources necessary to service it.
Most decisions are computer aided, but new strategies are described and implemented semantically.
New opportunities are identified by the system through machine learning and pattern recognition.
Obtaining this with today’s technology is still cost prohibitive except in limited fields.
www.mammothdata.com | @mammothdataco
Spark
www.mammothdata.com | @mammothdataco
Spark
● Spark is based on an in-memory Directed Acyclic Graph (DAG).
● Map-Reduce could be said to be the edges 10, 3, 8 of a DAG, but what about nodes 11 and 2, which can happen in parallel to 3?
● Spark is still mostly a “batch” processing system.
● Spark Streaming is micro-batching.
● Your whole working dataset needs to fit into memory.
(wikipedia DAG diagram)
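The parallelism point can be sketched without Spark at all, using Python's stdlib topological sorter. The node names mirror the DAG diagram referenced above; the scheduling logic is a simplified illustration, not how Spark's scheduler is actually implemented:

```python
from graphlib import TopologicalSorter

# DAG of computation steps: each node maps to the steps it depends on.
# Nodes "3", "11", and "2" all depend only on "10", so nothing orders
# them relative to each other - they may run in parallel. A fixed
# map -> shuffle -> reduce pipeline cannot express that.
dag = {
    "3":  {"10"},
    "11": {"10"},
    "2":  {"10"},
    "8":  {"3", "11", "2"},  # "8" joins all three branches
}

ts = TopologicalSorter(dag)
ts.prepare()
schedule = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # every node in `ready` may run in parallel
    schedule.append(ready)
    ts.done(*ready)

print(schedule)  # [['10'], ['11', '2', '3'], ['8']]
```

The middle batch is the whole argument: a DAG scheduler sees three independent branches and dispatches them together, where a Map-Reduce chain would serialize them across separate jobs.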
www.mammothdata.com | @mammothdataco
● With that said, Spark is FAR, FAR faster than Map-Reduce.
● Spark integrates with YARN and Mesos, but requires neither.
● Generally, Spark loads from HDFS, S3, or an RDBMS via Spark SQL.
● Storm tends to offer superior performance at scale for Complex Event Processing.
● Spark’s REPL (Read-Evaluate-Print Loop) support makes it a superior choice for your everyday data scientist (i.e. the mathematician who writes bad Python).
● Spark supports Java, Java 8, Scala, and Python.
● Generally, most things will be Spark soon.
Spark
www.mammothdata.com | @mammothdataco
Hadoop / Spark as a Service
www.mammothdata.com | @mammothdataco
Different workloads:
● Streaming: dedicated “ready” hardware to meet SLA and QoS requirements.
● Business Intelligence: dedicated, predictable storage requirements, but often random utilization requirements.
● Specialized: generally similar to streaming (and often are streaming systems); specialized analytics (e.g. fraud detection) with generally predictable workloads and utilization requirements.
● Exploratory: before we create a specialized system or regular report, we need to see if it is worthwhile. Could take any shape in disk and CPU requirements, but is transient by nature.
Workloads
www.mammothdata.com | @mammothdataco
● We don’t know who, how, or when someone may need Hadoop, Spark, Storm, Kafka, etc.
● So, have a pool of hardware and/or cloud resources.
● We need to make sure users are allowed to use them.
● We need to control how much each gets, but offer enough, and provision more if we run out.
● We don’t want every department to have its own Hadoop cluster.
Accommodating multiple workloads
www.mammothdata.com | @mammothdataco
Vision of Commoditized Computing
Infrastructure
● A common set of boxes with CPUs and NICs, at most divergent in classification (think Amazon medium, XLarge, etc.)
● A common set of boxes with storage that can be dedicated to purpose on demand
● Common switched/bonded network
● Thin management layer (where the workload originates / is managed from)
● Virtualized, Docker-style
Workload Management
● Work comes in or is scheduled, and is delegated to currently purposed assets.
● After a configured threshold is reached, more assets are purposed to that workload type.
● If work drops below a configured threshold, assets are de-purposed and returned to the pool.
● Storage is managed in the same way.
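The threshold-driven pool described above can be sketched in a few lines. This is a toy model under assumed defaults (high-water mark of 5 queued items, low-water mark of 1); real systems would schedule against Mesos/Kubernetes-style resource offers:

```python
class AssetPool:
    """Toy model: assets are purposed to a workload type when its queue
    crosses a high-water mark and returned to the pool when it falls
    below a low-water mark."""

    def __init__(self, total_assets, high=5, low=1):
        self.free = total_assets
        self.high, self.low = high, low
        self.purposed = {}   # workload type -> asset count
        self.queued = {}     # workload type -> pending work items

    def submit(self, workload, items=1):
        self.queued[workload] = self.queued.get(workload, 0) + items
        self._rebalance(workload)

    def finish(self, workload, items=1):
        self.queued[workload] = max(0, self.queued.get(workload, 0) - items)
        self._rebalance(workload)

    def _rebalance(self, workload):
        q = self.queued.get(workload, 0)
        if q > self.high and self.free > 0:
            # Above threshold: purpose another asset to this workload.
            self.free -= 1
            self.purposed[workload] = self.purposed.get(workload, 0) + 1
        elif q < self.low and self.purposed.get(workload, 0) > 0:
            # Below threshold: de-purpose and return the asset to the pool.
            self.free += 1
            self.purposed[workload] -= 1

pool = AssetPool(total_assets=4)
pool.submit("streaming", items=6)   # crosses the high-water mark
print(pool.purposed, pool.free)     # {'streaming': 1} 3
pool.finish("streaming", items=6)   # queue empties, asset comes back
print(pool.purposed, pool.free)     # {'streaming': 0} 4
```

Storage would follow the same rebalance loop, with capacity rather than queue depth as the threshold metric.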
www.mammothdata.com | @mammothdataco
Vision of Commoditized Computing
[Diagram: a pool of assets divided into Unused, Storage, and Compute]
www.mammothdata.com | @mammothdataco
Vision of Commoditized Computing
[Diagram: the same pool with assets re-purposed among Unused, Storage, and Compute]
www.mammothdata.com | @mammothdataco
The largest Hadoop-as-a-Service clusters were hand-built with devops tools (Chef, Ansible, etc.).
Blue Data offers one solution where you can click to deploy based on quotas, etc.
The next generation of “as a Service” will be built with Mesos, Kubernetes, and Docker.
…No, Amazon doesn’t actually give you this: they give you some resources and a bill, not really resource management.
Yesterday, Today and Tomorrow
www.mammothdata.com | @mammothdataco
Questions?
● Leader in Big Data/NoSQL Consulting
● Focused on new data architectures
● Vendor independent
● Practical and pragmatic
● A risk-mitigated approach to new technology