Managing Big Data using New Innovations with HPCC Systems
Bob Foreman – Senior Software Engineer/ECL Instructor
Twitter: #ATO2017 #HPCCSystems
Welcome!
• HPCC Systems has been open source since June 2011
• Although the base technology has remained consistent, the last 6 years have seen many new support technologies unfold.
• These technologies have enhanced and extended the base technology, and HPCC Systems remains ahead of the curve with these new innovations.
• We will look at many of them in this presentation.
Today’s Agenda:
• Quick intro on the platform and history before 2011
• Open Source in 2011
• Machine Learning February 2012 - 2017, many changes
• Continuous updates and improvements in speed and compiler power.
• Changes in the ECL Watch (Version 5 and 6)
• ECL Playground
• New services, like WSSQL
• Plugin support (Ganglia, Nagios, Kafka, Security Manager, etc.)
• EMBED support - new feature for EMBED
• KEL lite
• Looking ahead to Version 7
History of HPCC Systems (High Performance Computing Cluster)
Open sourcing a long established big data strategy
Why Does HPCC Systems Exist?
• It was NOT developed with the idea of selling the technology to anybody else!
• It was all created only to solve some of the data-handling problems that we encountered as we were developing our products.
The Result Of All That Development?
HPCC Systems
A single, fully-integrated platform supporting the entire life cycle of Big Data product development:
• Raw Data Ingest – Thor
• Data Transformation to Product – Thor
• End-user Query Development – Thor
• End-user Query Delivery – ROXIE
The Complete Big Data Value Chain
• Collection – collecting structured, unstructured and semi-structured data
• Ingestion – consuming vast amounts of data including extraction, transforming and loading
• Discovery & Cleansing - clean up, formatting and statistical analysis of the data
• Integration – linking, indexing and data fusion
• Analysis – statistics and machine learning
• Delivery – querying, visualization, redundancy, and enterprise-class availability
HPCC Systems Platform
There are two types of clusters in HPCC Systems:
• Data Refinery (THOR) – Processes every one of billions of records in order to create billions of "improved" records – runs one job at a time.
• Rapid Data Delivery Engine (ROXIE) – Searches quickly for a particular record or set of records – handles thousands of concurrent transactions per second.
• Both are tightly coupled to the infrastructure that supports their operation, and the ECL programming language that defines the work done on them.
Data Flow Oriented Big Data Platform
[Diagram labels: Raw data from several sources; Batch requests for scoring and analytics; Batch Processed Data; ESP Middleware Services; Portal; Batch Subscribers]
Thor
• Shared Nothing MPP Architecture
• Commodity Hardware
• Batch ETL and Analytics
ECL
• Easy to use
• Implicitly Parallel
• Compiles to C++
ROXIE
• Shared Nothing MPP Architecture
• Commodity Hardware
• Real-time Indexed Based Query
• Low Latency, Highly Concurrent and Highly Redundant
Thor – The Batch Processing Analytics Engine
[Diagram labels: Raw data from several sources; Thor; ECL; ROXIE; Batch reporting requests; Reporting; Batch Subscribers]
Massively Parallel Extract Transform and Load (ETL) engine
• Built from the ground up as a parallel data environment.
• Leverages inexpensive locally attached storage.
• Doesn’t require a SAN infrastructure.
Enables data integration on a scale not previously available
• Current LexisNexis person data build process generates 350 billion intermediate results at peak.
Suitable for:
• Massive joins/merges
• Massive sorts and transformations
• Any N² problem
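To make the "massive joins" use case concrete, here is a minimal Python sketch of a hash-partitioned join, the general technique a shared-nothing cluster can use for this kind of work. It is not Thor's implementation. The function names (`partition`, `local_join`, `distributed_join`), the node count, and the sample records are all hypothetical; the only point is that routing both inputs by the join key lets each node join its own slice without talking to the others.

```python
from collections import defaultdict

def partition(records, key, num_nodes):
    """Route each record to a node by hashing its join key."""
    parts = [[] for _ in range(num_nodes)]
    for rec in records:
        parts[hash(rec[key]) % num_nodes].append(rec)
    return parts

def local_join(left, right, key):
    """Hash join performed entirely within one node's slice."""
    lookup = defaultdict(list)
    for r in right:
        lookup[r[key]].append(r)
    return [{**l, **r} for l in left for r in lookup.get(l[key], [])]

def distributed_join(left, right, key, num_nodes=4):
    """Partition both inputs on the key, then join node by node."""
    left_parts = partition(left, key, num_nodes)
    right_parts = partition(right, key, num_nodes)
    joined = []
    for node in range(num_nodes):
        joined.extend(local_join(left_parts[node], right_parts[node], key))
    return joined

# Hypothetical sample data.
people = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Raj"}, {"id": 3, "name": "Li"}]
addresses = [
    {"id": 1, "city": "Raleigh"},
    {"id": 3, "city": "Atlanta"},
    {"id": 3, "city": "Boca Raton"},
]
result = sorted(distributed_join(people, addresses, "id"),
                key=lambda r: (r["id"], r["city"]))
```

Because matching keys always land on the same node, the per-node joins are independent, which is what lets this style of join scale out across commodity hardware.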
ROXIE – The Real-Time Analytics Delivery Engine
A massively parallel, high throughput, structured query response engine.
Ultra fast due to its read-only nature.
Allows indices to be built onto data for efficient multi-user retrieval of data.
Suitable for:
• Volumes of structured queries
• Full text ranked Boolean search
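As an illustration of "full text ranked Boolean search" over a read-only index, here is a small Python sketch of an inverted index with an AND query. It does not reflect how ROXIE stores or ranks anything; `build_index`, `search_and`, the term-frequency scoring, and the sample documents are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index

def search_and(index, terms):
    """Return ids of documents containing every term, best score first."""
    terms = [t.lower() for t in terms]
    if not terms or any(t not in index for t in terms):
        return []
    matches = set.intersection(*(set(index[t]) for t in terms))
    # Score = total occurrences of the query terms in the document.
    scored = [(sum(index[t][d] for t in terms), d) for d in matches]
    return [d for score, d in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Hypothetical documents.
docs = {
    "d1": "big data platform for big data delivery",
    "d2": "data delivery engine",
    "d3": "machine learning library",
}
index = build_index(docs)
hits = search_and(index, ["data", "delivery"])
```

The index is built once and never modified by queries, so any number of readers can search it at the same time without locking, which is the property the slide attributes to ROXIE's read-only nature.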
HPCC Systems Hardware
• Clusters of commercial off-the-shelf components (COTS). Components are ideally homogeneous (all processing/disk storage components same) and the system is tightly coupled.
• Nodes are managed en masse instead of individually, which allows coordinated processing like global sorts (unlike Grid systems).
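A global sort is a good example of processing that needs the nodes to coordinate. The Python sketch below shows one common approach, range partitioning: agree on split points, route every value to the node that owns its range, then sort locally. This is an illustration only and not a description of how HPCC Systems sorts; `global_sort` and the sample data are hypothetical, and the split points are taken from the full data for simplicity where a real system would sample.

```python
import bisect

def global_sort(node_data, num_nodes):
    """Redistribute and sort so that node i holds only values that come
    before those on node i+1, with each node sorted locally."""
    # Step 1: choose split points (simplified: uses all values, not a sample).
    everything = sorted(v for part in node_data for v in part)
    if not everything:
        return [[] for _ in range(num_nodes)]
    splitters = [everything[len(everything) * i // num_nodes]
                 for i in range(1, num_nodes)]
    # Step 2: route each value to the node that owns its range.
    redistributed = [[] for _ in range(num_nodes)]
    for part in node_data:
        for v in part:
            redistributed[bisect.bisect_right(splitters, v)].append(v)
    # Step 3: every node sorts its own range independently.
    return [sorted(part) for part in redistributed]

# Hypothetical unsorted data spread over three nodes.
nodes = [[42, 7, 19], [3, 88, 51], [64, 12, 30]]
sorted_nodes = global_sort(nodes, 3)
flattened = [v for part in sorted_nodes for v in part]
```

Reading the nodes in order yields a fully sorted sequence, which only works because all nodes agreed on the same split points before sorting.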
Thor Cluster
• Brute force: Thor operates on massive amounts of data where datasets typically contain billions of records
• Open Data Model: The data model is defined by the user, not constrained by the limitations of a strict key-value paradigm
• Scalable: Horizontally linear scalability provides room to accommodate future data and performance growth
• Truly parallel: Datagraph Nodes can be processed in parallel as data seamlessly flows through them, effectively avoiding the well-known “long tail problem”, resulting in higher and predictable performance.
• Powerful optimizer: The HPCC Systems optimizer ensures submitted ECL code executes at the maximum possible speed for the underlying hardware. Advanced techniques such as lazy execution and code reordering are thoroughly utilized to maximize performance
ROXIE Cluster
• Low latency: Data queries typically complete sub-second
• Not a key-value store: ROXIE is not limited by the constraints of key-value data stores, allowing for complex queries, multi-key retrieval, fuzzy matching and more
• Highly available: ROXIE operates in critical environments under the most rigorous service level requirements
• Scalable: Horizontally linear scalability provides room to accommodate future data and performance growth
• Highly concurrent: In a typical environment, thousands of concurrent clients can be simultaneously executing transactions on the same ROXIE system
• Redundant: A shared-nothing architecture with no single point of failure provides extreme fault tolerance
HPCC Systems Platform
• Batteries included: All components create a consistent and homogeneous platform
• Over 15 years of experience: The HPCC Systems platform is the technology underpinning LexisNexis data offerings – its development began in 1999
• Few moving parts: HPCC Systems is an integrated solution extending across the entire data lifecycle, from data ingest and transformation to data delivery – no third party tools needed
• Multiple data formats: Supported out of the box, including fixed and variable length, delimited records, and XML
• ECL inside: One language to describe both: the data transformations in Thor and data delivery strategies in ROXIE. Solutions to complex data problems are expressed easily and directly in terms of high level ECL primitives.
• Consistent tools: Thor and ROXIE share the same set of tools, which provides consistency across the platform.
Data on HPCC Systems
• Open Data Model: The data model is defined by the user, as standard files, records, and fields (tables, rows, and columns)
• Simple: Solutions to complex data problems can be expressed easily and directly in terms of high level ECL primitives
• Implicitly parallel: Data is always in distributed datasets whose parts are managed by the DFU, eliminating the need for programmers to manage the complexity of working with distributed datasets
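The idea of implicit parallelism over distributed file parts can be sketched in a few lines of Python. This is a toy model, not the DFU or the HPCC Systems runtime: `split_into_parts`, `run_on_all_parts`, `count_adults`, the thread pool, and the sample records are all assumptions. What it shows is that the "user code" is written against a single dataset while the platform decides how to split the data and run the work on every part.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_parts(records, num_parts):
    """Deal records round-robin into num_parts parts, one per node."""
    parts = [[] for _ in range(num_parts)]
    for i, rec in enumerate(records):
        parts[i % num_parts].append(rec)
    return parts

def run_on_all_parts(parts, operation):
    """Apply the same operation to every part concurrently, in part order."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return list(pool.map(operation, parts))

def count_adults(part):
    """The 'user code': written as if for one local dataset."""
    return sum(1 for rec in part if rec["age"] >= 18)

# Hypothetical dataset: ages 0 through 39.
records = [{"name": f"person{i}", "age": i} for i in range(40)]
parts = split_into_parts(records, 4)
total_adults = sum(run_on_all_parts(parts, count_adults))
```

Note that `count_adults` contains no threading or partitioning logic; that separation is the part of the model that corresponds to the slide's claim.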
Data on HPCC Systems
• Data is stored in ISAM Files
• Native support for:
  • Flat files, with fixed or variable-length records
  • CSV-type files (any delimiters may be used)
  • XML datasets
  • New JSON format support
• Each Record is always whole and complete on a single node
• A Record may have as many fields as needed
• Indexes are always LZW compressed and may contain “payload” fields in addition to search terms
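To show what "payload" fields buy you, here is a rough Python sketch of a sorted index whose entries carry extra fields alongside the search key. It is a conceptual model only: the `PayloadIndex` class, its field names, and the sample records are hypothetical, and it does not model LZW compression or the on-disk index format.

```python
import bisect

class PayloadIndex:
    """Sorted index whose entries carry extra 'payload' fields."""

    def __init__(self, records, key_field, payload_fields):
        entries = sorted(
            (rec[key_field], tuple(rec[f] for f in payload_fields))
            for rec in records
        )
        self.keys = [k for k, _ in entries]
        self.payloads = [p for _, p in entries]
        self.payload_fields = payload_fields

    def lookup(self, key):
        """Binary-search the key and return the payload of every match."""
        lo = bisect.bisect_left(self.keys, key)
        hi = bisect.bisect_right(self.keys, key)
        return [dict(zip(self.payload_fields, p)) for p in self.payloads[lo:hi]]

# Hypothetical records; "notes" stands in for bulky fields left out of the index.
records = [
    {"last_name": "Smith", "first_name": "Ann", "city": "Raleigh", "notes": "long text ..."},
    {"last_name": "Jones", "first_name": "Raj", "city": "Atlanta", "notes": "long text ..."},
    {"last_name": "Smith", "first_name": "Li", "city": "Durham", "notes": "long text ..."},
]
index = PayloadIndex(records, "last_name", ["first_name", "city"])
smiths = index.lookup("Smith")
```

A query that only needs the payload fields is answered from the index alone, with no second read of the underlying record.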
What is ECL? (Enterprise Control Language)
• Declarative programming language:“Describes what needs to be done, not how to do it”
• Powerful: Unlike Java, high level primitives such as JOIN, TRANSFORM, PROJECT, SORT, DISTRIBUTE, MAP, etc. are available. Higher level code means fewer programmers and shorter time to deliver complete projects
• Extensible: As new definitions are created, they become primitives that other programmers can use
• Implicitly parallel: Parallelism is built into the underlying platform. The programmer need not be concerned with it
• Maintainable: A High level programming language, no side effects and definition encapsulation provide for more succinct, reliable and easier to troubleshoot code
• Complete: Unlike Pig and Hive, ECL provides for a complete programming paradigm.
• Homogeneous: One language to express data algorithms across the entire HPCC Systems platform, including data ETL and delivery.
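The declarative style can be approximated in Python, with the caveat that this is an analogy and not ECL. In the sketch below, each `Definition` only describes how a value is derived from other definitions; nothing runs until a result is requested, and definitions that are never needed are never evaluated. The `Definition` class and the sample data are hypothetical.

```python
class Definition:
    """A value described as a function of other definitions.
    Nothing is computed until value() is called; results are cached."""

    def __init__(self, compute, *inputs):
        self._compute = compute
        self._inputs = inputs
        self._cached = None
        self._done = False
        self.evaluations = 0

    def value(self):
        if not self._done:
            args = [d.value() for d in self._inputs]
            self._cached = self._compute(*args)
            self._done = True
            self.evaluations += 1
        return self._cached

# Definitions state WHAT each dataset is, not when or how to compute it.
raw = Definition(lambda: [
    {"name": "Ann", "age": 41},
    {"name": "Raj", "age": 17},
    {"name": "Li", "age": 29},
])
adults = Definition(lambda ds: [r for r in ds if r["age"] >= 18], raw)
by_age = Definition(lambda ds: sorted(ds, key=lambda r: r["age"]), adults)
names = Definition(lambda ds: [r["name"] for r in ds], by_age)
unused = Definition(lambda ds: len(ds), raw)

# The only "action": it pulls in just the definitions it depends on.
result = names.value()
```

Because definitions have no side effects, a system built this way is free to reorder, cache, or skip them, which is the property that makes optimization and implicit parallelism possible.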
ECL – The Data Flow Oriented Programming Language
• An easy to use, data-centric programming language optimized for large-scale data management and query processing
• Highly efficient: automatically distributes workload across all nodes
• 80% more efficient than C++, Java and SQL; 1/3 reduction in programmer time to maintain/enhance existing applications
• Benchmark against SQL (5 times more efficient) for code generation
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing
• Large library of built-in modules to handle common data manipulation tasks
Declarative programming language … powerful, extensible, implicitly parallel, maintainable, complete and homogeneous
Machine Learning
Machine Learning and HPCC Systems
• The HPCC Machine Learning Library contains an extensible collection of machine learning routines which are easy and efficient to use and are designed to execute in parallel across a cluster.
• In 2012 the first set of modules was released:
  o Associations (ML.Associate)
  o Classify (ML.Classify)
  o Cluster (ML.Cluster)
  o Correlations (ML.Correlate)
  o Discretize (ML.Discretize)
  o Distribution (ML.Distribution)
  o Field Aggregates (ML.FieldAggregates)
  o Regression (ML.Regression)
  o Visualization (ML.VL)
• https://hpccsystems.com/download/free-modules/ecl-ml
Machine Learning and HPCC Systems
• In 2017, there are several new ML algorithms, some already implemented and others under development.
• These algorithms now use the ECL bundle technology.
  o PBBlas (parallel block basic linear algebra subprograms)
  o Time Series (TS)
  o Neural Networks
  o Deep Learning
  o Ensemble
  o NFold Cross Validation
  o Population Estimate
  o LDA (Linear Discriminant Analysis)
  o LSA (Latent Semantic Analysis)
  o StepwiseLogistic
  o SVM (Support Vector Machine)
• https://github.com/hpcc-systems/ecl-ml
Internal Updates and Improvements
HPCC Version 6.x Internal Improvements
• Virtual Slave THOR (better sharing of resources – e.g. RAM)
• Parallel Activity Execution (taking advantage of multiple CPU cores)
• Affinity support in Thor (Binding a slave process to a single CPU socket)
• Optimized merge sort for large number of cores
• LZ4 compression for temporary files
• Refresh Boolean option on persist
• Parallel child query execution in Thor
• Memory management improvements
• Lookup JOINS in child queries
• Compiler optimization
• Improved INDEX reads on THOR and ROXIE
• Enhanced Performance Test Suite
References:
https://hpccsystems.com/resources/blog/lchapman/hpcc-systems-60x-feature-highlights-part-1
https://hpccsystems.com/resources/blog/lchapman/hpcc-systems-60x-feature-highlights-part-2
https://hpccsystems.com/resources/blog/lchapman/hpcc-systems-62x-here-whats-it-you
https://hpccsystems.com/blog/performance_improvements_640
ECL Watch (Version 5 and 6)
ECL Watch
• Long awaited face lift in Version 5.0
• Upgrade was completed in Version 6.0 with even more features
• Ability to spray multiple files of same type with one click.
• New File Uploader
• Hex Previewer
• Enhanced filtering throughout
• Improved Query Viewer, including Package Maps
• New Plug-in interface
• Improved Workunit Graphs
• Built in Visualization
References:
https://www.youtube.com/watch?v=fupH_to2i84#action=share
https://www.youtube.com/watch?v=wm4xtNsR4bA#action=share
ECL Playground
ECL Playground
• New ESP web service
References:
http://cdn.hpccsystems.com/releases/CE-Candidate-6.4.2/docs/ECL_Playground-6.4.2-1.pdf
http://cdn.hpccsystems.com/podcasts/2012_0904_v1_ECL_Playground.mp3
WSSQL
WSSQL
• Add-on service that provides an SQL interface into HPCC Systems
• Submit SQL queries directly to HPCC via SOAP
• Access HPCC data files and Published queries
• Analyze HPCC data using familiar SQL syntax
• Supports SQL SELECT or CALL syntax
  • Access HPCC data files as DB Tables
  • Access published queries as DB Stored Procedures
• Supports SQL Create and Load Syntax
• Harnesses the full power of HPCC under the covers
  • Submitted SQL request generates ECL code which is submitted, compiled, and executed on your target cluster
  • Automatic Index fetching capabilities for quicker data fetches
• Creates entry-point for programmatic data access
• Leverage HPCC data without need to learn and write ECL!
  • Opens the door for non ECL programmers to access HPCC data.
References:
http://cdn.hpccsystems.com/releases/CE-Candidate-6.4.0/docs/WsSQL_ESP_Web_Service_Users_Guide-6.4.0-1.pdf
https://hpccsystems.com/download/free-modules/WSSQL
Plugins!
Library and Datastore PlugIns
• New Plugin interface with ECL Watch
• Built ins (Debug, File Services)
• Audit and Logging
• dMetaphone (double metaphone)
• Apache Kafka
• Security Manager
• Redis
• Memcached
References:
https://github.com/hpcc-systems/HPCC-Platform/tree/master/plugins
Embedded Language PlugIns
• C++
• R Integration
• Couchbase
• Java
• JavaScript
• MySQL
• Python/Python 3
• SQLite3
• Cassandra
• AWS SQS (Simple Queue Service)
References:
https://hpccsystems.com/resources/blog/lchapman/using-your-favorite-language-or-data-source-hpcc-systems
https://hpccsystems.com/resources/blog/richardkchapman/projecting-fields-embeds
Use and abuse of the EMBED feature: https://hpccsystems.com/bb/viewtopic.php?f=41&t=1509
New Horizons - Working with TensorFlow
Working with TensorFlow
• Wonderful blog written by Richard Chapman
• TensorFlow is a new open-source program from Google
• Performs linear algebra operations on tensors (matrices) and connects multiple operations together.
• Particularly suited for machine learning applications and large datasets
• Works with HPCC 6.2 and greater versions
• Implemented in ECL using Python EMBED
• Shows how a TensorFlow model could be used inside an ECL workflow!
• This test resulted in enhanced Python plug-in capabilities.
References:
https://hpccsystems.com/resources/blog/richardkchapman/embedding-tensorflow-operations-ecl
https://www.tensorflow.org/
KEL (Knowledge Engineering Language)
Knowledge Engineering Language
• Designed for Data Modeling
• KEL expands the ECL specification of data flows and algorithms.
• Presumes that the user wants control over:
• the logical data model
• the analytic logic
• the mathematics
• ENTITY, MODEL, and ASSOCIATION
• Data Mapping (USE)
• Logic (GLOBAL)
• OUTPUT or QUERY
References:
https://hpccsystems.com/download/free-modules/kel-lite
Sample KEL:
Coming Soon… HPCC Version 7
Coming soon in 2018…
• Ease of use – ECL Watch Dashboards, session management, query access
• Reliability and Stability
• Machine Learning
• Security – ROXIE SSL, Encryption in Transit, Restricted file access, dropzone whitelists
• Interoperability (Spark, UnicodeLib, R, Tensorflow)
• Dali Replacement for DFS
• Opportunistic Improvements
• Text Search, XML Improvements
• Multi-core support
• Extended Built-In Visualization
• Cloud/Hive 360 Support
And contributions and suggestions from YOU !!!
References:
https://track.hpccsystems.com/secure/Dashboard.jspa
https://hpccsystems.com/community/how-to-contribute
Getting Started
Install:
1. Oracle’s VirtualBox:
https://www.virtualbox.org/wiki/Downloads
2. ECL IDE and Documentation:
https://hpccsystems.com/download/developer-tools/ecl-ide
https://hpccsystems.com/download/documentation
Getting Started
Run:
1. Launch your VM player.
2. Import the HPCC Virtual Machine .ova file:
http://hpccsystems.com/download/hpcc-vm-image
3. Note the IP next to the IP Address: prompt at the top of the VM.
This IP address is the key to allowing the HPCC Systems client tools to access the environment.
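Once you have the IP address, a quick way to keep it handy is to turn it into the ECL Watch URL. The Python sketch below does that and rejects a mistyped address. Port 8010 is assumed here as the ECL Watch default and the address shown is a made-up example; check the banner of your own VM for the real values. `ecl_watch_url` is a hypothetical helper, not part of any HPCC Systems tool.

```python
import ipaddress

ECL_WATCH_PORT = 8010  # assumption: default ECL Watch port; verify on your VM

def ecl_watch_url(vm_ip, port=ECL_WATCH_PORT):
    """Validate the VM's IPv4 address and build the ECL Watch URL from it."""
    address = ipaddress.IPv4Address(vm_ip)  # raises ValueError if malformed
    return f"http://{address}:{port}"

# Hypothetical address; use the one shown at the top of your VM window.
url = ecl_watch_url("192.168.56.101")
```

The same address is what you would enter as the server in the ECL IDE preferences.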
That’s All Folks!
And there’s so much more to learn!!!
Thanks for Attending!
Email me! [email protected] https://hpccsystems.com/community/events/All-Things-Open-2017