Pivotal Data Lake Architecture & its role in security analytics

Derek Lin, Principal Data Scientist, Pivotal

Transcript of Pivotal Data Lake Architecture & its role in security analytics

  1. Derek Lin, Principal Data Scientist, Pivotal. Copyright 2014 EMC Corporation. All rights reserved.
  2. Agenda: Information Security Analytics & Use Cases; Data Lake; Data Science: Extracting Value from Data Lake; Demo
  3. Information Security Analytics: Landscape; Application Areas
  4. CIO Survey: Top Concerns. 54%: What to Collect; 85%: How to Analyze. Sources: Barclays September 2013 CIO Survey; KPMG January 2014 CIO/CFO Survey
  5. A Big Data Analytics Response. More sophisticated adversaries and sophisticated methods. Limited human capacity combined with massive amounts of events: 40% of all survey respondents are overwhelmed with the security data they already collect; 35% have insufficient time or expertise to analyze what they collect. Security tools, tactics and defenses becoming outdated: content is static and not as dynamic as the threat landscape; segregated by too many point products, tool interfaces, disparate data sets. Source: EMA, The Rise of Data-Driven Security, Crawford, Aug 2012 (survey sample size = 200)
  6. Enterprise Information Security Analytics: Insider Threat; Asset Risk; Malware Threat
  7. APT Kill Chain: Advanced Persistent Threat (APT). (1) Phishing and Zero Day Attack: a handful of users are targeted by two phishing attacks; one user opens the zero-day payload (CVE-2011-0609). (2) Back Door: the user machine is accessed remotely by the Poison Ivy tool. (3) Lateral Movement: attacker elevates access to important user, service and admin accounts, and specific systems. (4) Data Gathering: data is acquired from target servers and staged for exfiltration. (5) Exfiltrate: data is exfiltrated via encrypted files over FTP to an external, compromised machine at a hosting provider.
  8. Anatomy of an Attack | Anatomy of a Response
  9. Technology Landscape in the Kill Chain. Perimeter penetration; Malware beacon; Lateral movement; Staging & exfiltration. Single source real-time; Proactive, limited-sources rule-based methods over short-range; Proactive, multi-sources data-driven methods over long-range; Reactive, manual post-incident response. Host/Network Analysis & Search/Indexing; Data-lake enabled analytics; IDS/IPS; Anti-virus; SIEM; DLP
  10. Analytics Opportunities in Threat Defense. Malware discovery; Host infection detection; Malware Command & Control beaconing activity detection; Perimeter penetration prevention/detection; Anomalous VPN login detection; Denial of service attack mitigation; Local IP black list construction; Watering hole attack detection; Chat room monitoring; Phishing attack detection; Web server attack detection; In-network anomaly detection; Anomalous resource access detection; Critical server activity monitoring; IR efficiency improvement; Semi-automated analysis; SIEM efficiency improvement; Threat feed normalization; Alert prioritization. (Malware Threat)
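
The slide above names Command & Control beaconing detection without describing a method. One common heuristic, offered here as a minimal sketch and not as the presenter's approach, is to flag host/destination pairs whose connection times are unusually regular. The function names, the `(host, destination, timestamp)` event layout, and the `max_cv` and `min_events` thresholds are all hypothetical.

```python
from statistics import mean, pstdev


def beacon_score(timestamps):
    """Coefficient of variation of inter-arrival gaps; lower means more regular."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if len(gaps) < 2:
        return None  # not enough gaps to judge regularity
    avg = mean(gaps)
    if avg == 0:
        return None  # all events share one timestamp
    return pstdev(gaps) / avg


def find_beacon_candidates(events, max_cv=0.1, min_events=5):
    """events: iterable of (host, destination, unix_timestamp) tuples (assumed layout)."""
    by_pair = {}
    for host, dest, ts in events:
        by_pair.setdefault((host, dest), []).append(ts)

    hits = []
    for pair, stamps in by_pair.items():
        if len(stamps) < min_events:
            continue
        cv = beacon_score(stamps)
        if cv is not None and cv <= max_cv:
            hits.append((pair, cv))
    # Most regular pairs first.
    return sorted(hits, key=lambda item: item[1])


if __name__ == "__main__":
    regular = [("10.0.0.5", "203.0.113.9", t) for t in range(0, 360, 60)]
    irregular = [("10.0.0.7", "198.51.100.2", t) for t in (0, 5, 200, 210, 900, 905)]
    print(find_beacon_candidates(regular + irregular))
```

Real malware often adds jitter to its check-in interval, so a production detector would need a looser test than a single coefficient-of-variation cutoff.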
  11. Information Security Analytics Areas: Insider Threat; Asset Risk; Malware Threat
  12. Malicious Insider Threat. Survey questions: What do you think is the biggest threat to your organization's confidential data? Does your organization have any systems in place to stop employees accessing confidential information or taking data? Source: http://www.logrhythm.com/Portals/0/resources/LogRhythm_Survey_15.42014.pdf
  13. Analytics Opportunities in Identity, Access, and Management. Anomalous user-to-resource access detection: user-resource access data; Role and activity auditing: + role and provisioning data; Privilege escalation auditing: + privilege escalation data; IT support personnel auditing: support ticket data, command activity data. (Insider Threat)
  14. Information Security Analytics Areas: Insider Threat; Asset Risk; Malware Threat
  15. Information Asset Management
  16. Analytics Opportunities in Asset Management. Document categorization for risk labeling: unstructured data, user access data; Asset risk profiling: vulnerability scanner data, user access data. (Asset Risk)
  17. Data Lake: Needs and Trend; Analytics Support
  18. Life Before Data Lake. Departmental Warehouse; Enterprise Apps; Reporting; Non-Agile Models; Spread marts; Prioritized Operational Processes; Errant data and marts; Departmental Warehouse; Siloed Analytics; Data Sources; Non-Prioritized Data Provisioning; Static schemas grow over time
  19. Impact of Status Quo. High-value data is hard to reach and leverage. Data scientists are last in line for data: queued after prioritized operational processes. Data is moving in batches from warehouse(s) to data scientists' desktops: in-memory analytical work (w/ R, SAS, SPSS, Excel); sampled, driving model accuracies down. There is a cottage industry of analytics, rather than centrally managed harnessing of analytics: non-standardized initiatives; frequently not aligned with corporate business goals. Slow time-to-insight & reduced business impact
  20. Data-Driven Digital Media Analytics. Targeting & Retention; Social Media Analysis; Campaign optimization. Transaction History; Purchases; Clickstream; Customer Data. Unified data supporting re-usable predictive models. Data size: GB, TB, PB
  21. Data-Driven Financial Protection Analytics. Web Optimization; Fraud Detection; Product Recommendation. ATM; Member Data; Transactional Log; Firewall; Clickstream; Phone Channel. Unified data supporting re-usable predictive models. Data size: GB, TB
  22. Data-Driven IT Operation Analytics. Failure Prediction; Root Cause Analysis; Project Risk Forecasting. Server or VM Performance Metrics; CMDB; Configuration Setting; Alerts & Incident; Server logs; Network Performance Metrics. Unified data supporting re-usable predictive models. Data size: GB, TB/PB
  23. Data-Driven Security Analytics. Insider Threat Detection; Malware Detection; DDoS Mitigation. AD/Auth; Asset/Role; Netflow; DNS/Firewall/Proxy; Critical Server; Packet Capture. Unified data supporting re-usable predictive models. Defense with breadth in variety and depth in time. Data size: GB, TB/PB
  24. Pivotal Business Data Lake Architecture. Centralized Management: System monitoring; System management. Unified Data Management Tier: Data mgmt. services; MDM; RDM; Audit and policy mgmt. Processing Tier: Workflow Management; In-memory; MPP database. Existing Sources; Unified Sources; Flexible Actions; Real-time ingestion; Micro batch ingestion; Batch ingestion; Real-time insights; Interactive insights; Batch insights; HDFS; New Data Sources
  25. Pivotal Business Data Lake Architecture. Centralized Management; Unified Data Management Tier: Data Dispatch; MDM; RDM; Data Dispatch. Processing Tier: Spring XD; Pivotal GemFire XD; HAWQ. Unified Sources; Flexible Actions; Clickstream; Sensor Data; Weblogs; Network Data; CRM Data; ERP Data; Pivotal GemFire; Pivotal RabbitMQ; Redis; Pivotal CF; Pivotal HD; Command Center; Existing Sources; New Data Sources
  26. Data Lake: More Than a Data Repository. Iterate and experiment, failing fast, for a fast cycle of value generation. Analytics Support; Fast Query; Data Store
  27. Data Science Tools: Commercial; Open Source (or Free); PL/R, PL/Python, PL/Java
  28. MADlib In-Database Functions. Predictive Modeling Library. Linear Systems: Sparse and Dense Solvers. Matrix Factorization: Singular Value Decomposition (SVD); Low-Rank. Generalized Linear Models: Linear Regression; Logistic Regression; Multinomial Logistic Regression; Cox Proportional Hazards Regression; Elastic Net Regularization; Sandwich Estimators (Huber-White, clustered, marginal effects). Machine Learning Algorithms: Principal Component Analysis (PCA); Association Rules (Affinity Analysis, Market Basket); Topic Modeling (Parallel LDA); Decision Trees; Ensemble Learners (Random Forests); Support Vector Machines; Conditional Random Field (CRF); Clustering (K-means); Cross Validation. Descriptive Statistics: Sketch-based Estimators: CountMin (Cormode-Muthukrishnan), FM (Flajolet-Martin), MFV (Most Frequent Values); Correlation; Summary. Support Modules: Array Operations; Sparse Vectors; Random Sampling; Probability Functions
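
Among the sketch-based estimators listed above, CountMin (Cormode-Muthukrishnan) is compact enough to illustrate directly. The following is a standalone Python sketch of the data structure, not MADlib's in-database implementation; the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the per-row hash are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=272, depth=5):
        self.width = width  # counters per row
        self.depth = depth  # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Salt the hash with the row number to get one hash function per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


if __name__ == "__main__":
    cms = CountMinSketch()
    for _ in range(100):
        cms.add("10.0.0.1")
    for _ in range(3):
        cms.add("10.0.0.2")
    print(cms.estimate("10.0.0.1"), cms.estimate("10.0.0.2"))
```

The appeal for security logs is that memory stays fixed at `width * depth` counters no matter how many distinct source addresses or domains appear in the stream.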
  29. Data Science: Extracting Value from Data Lake. Technology & Tools; People
  30. Evolution of Data Analytics in Security. BI and compliance-driven: data goes in, hard to extract value. Investigation-driven: fast queries over large data. Behavior-metrics driven: single source metrics, simple correlation, rule-based, high false positive. Data-science driven: leverage full contextual info, multi-source, automatic, for low false positives
  31. What is Data Science? The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to: identify correlations and causal relationships; classify and predict events; identify patterns and anomalies; and infer probabilities, interest and sentiment.
  32. Data Science: The Next Security Frontier. Beyond signatures; beyond simple metrics for thresholding; beyond manual engineering of rules. Monitor each and every entity in its environmental context, with a 360 view, over a long time window, with advanced mathematics
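
To make the contrast with "simple metrics for thresholding" concrete, the sketch below scores each entity against its own history instead of one global cutoff. It is an illustration under assumptions, not the method used in the talk: the function names, the daily-count inputs, the robust z-score (median and median absolute deviation), and the 3.5 cutoff are all choices made for the example.

```python
from statistics import median


def robust_z(history, value):
    """Robust z-score of value against an entity's own history (median / MAD)."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    if mad == 0:
        # History is flat: any departure from it is maximally unusual.
        return 0.0 if value == med else float("inf")
    # 0.6745 rescales MAD to be comparable to a standard deviation.
    return 0.6745 * (value - med) / mad


def score_entities(history_by_entity, today_by_entity, threshold=3.5, min_history=5):
    """Return {entity: score} for entities whose value today departs from their baseline."""
    flagged = {}
    for entity, value in today_by_entity.items():
        history = history_by_entity.get(entity, [])
        if len(history) < min_history:
            continue  # too little history to form a baseline
        score = robust_z(history, value)
        if abs(score) >= threshold:
            flagged[entity] = score
    return flagged


if __name__ == "__main__":
    # Hypothetical daily counts of resources accessed per user.
    history = {
        "alice": [10, 12, 11, 9, 10, 11, 10],
        "bob": [200, 220, 210, 190, 205, 215, 210],
    }
    today = {"alice": 60, "bob": 212}
    print(score_entities(history, today))
```

In this made-up data a single global threshold of, say, 100 accesses per day would flag bob's normal activity and miss alice's sixfold jump, while the per-entity baseline flags only alice.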
  33. Pivotal Network Intelligence Demo
  37. What is a Data Scientist? Programming Skills; Mathematical/Statistical Skills
  38. One Team Member: Mathematical/Statistical Skills; Programming Skills
  39. Another Team Member: Mathematical/Statistical Skills; Programming Skills
  40. Yet Another: Programming Skills; Mathematical/Statistical Skills
  41. Together: Programming Skills; Mathematical/Statistical Skills
  42. Take-Home Messages. Information Security = Big Data Problem. Data Lake is more than a data store. Data Science drives value from Information Security Data Lake