9781119250555 Big Data Management for Dummies Informatica Ed

download 9781119250555 Big Data Management for Dummies Informatica Ed

of 53

Transcript of 9781119250555 Big Data Management for Dummies Informatica Ed

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    1/53

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    2/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.

    http://www.informatica.com/

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    3/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Big DataManagement

    by Mike Wessler

    Informatica Special Edition

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    4/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Big Data Management For Dummies ®, Informatica Special EditionPublished byJohn Wiley & Sons, Inc.111 River St.Hoboken, NJ 07030-5774www.wiley.com

    Copyright © 2016 by John Wiley & Sons, Inc.

    No part of this publication may be reproduced, stored in a retrieval system or transmitted in any formor by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except aspermitted under Sections 107 or 108 of the 1976 United States Copyright Act, without the prior writtenpermission of the Publisher. Requests to the Publisher for permission should be addressed to thePermissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011,fax (201) 748-6008, or online at http://www.wiley.com/go/permissions .

    Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com,Making Everything Easier, and related trade dress are trademarks or registered trademarks of JohnWiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not be usedwithout written permission. Informatica and the Informatica logo are registered trademarks ofInformatica. All other trademarks are the property of their respective owners. John Wiley & Sons,Inc., is not associated with any product or vendor mentioned in this book.

    LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: THE PUBLISHER AND THE AUTHOR MAKENO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY ORCOMPLETENESS OF THE CONTENTS OF THIS WORK AND SPECIFICALLY DISCLAIM ALLWARRANTIES, INCLUDING WITHOUT LIMITATION WARRANTIES OF FITNESS FOR APARTICULAR PURPOSE. NO WARRANTY MAY BE CREATED OR EXTENDED BY SALES ORPROMOTIONAL MATERIALS. THE ADVICE AND STRATEGIES CONTAINED HEREIN MAY NOT BESUITABLE FOR EVERY SITUATION. THIS WORK IS SOLD WITH THE UNDERSTANDING THAT THEPUBLISHER IS NOT ENGAGED IN RENDERING LEGAL, ACCOUNTING, OR OTHER PROFESSIONALSERVICES. IF PROFESSIONAL ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENTPROFESSIONAL PERSON SHOULD BE SOUGHT. NEITHER THE PUBLISHER NOR THE AUTHORSHALL BE LIABLE FOR DAMAGES ARISING HEREFROM. THE FACT THAT AN ORGANIZATIONOR WEBSITE IS REFERRED TO IN THIS WORK AS A CITATION AND/OR A POTENTIAL SOURCEOF FURTHER INFORMATION DOES NOT MEAN THAT THE AUTHOR OR THE PUBLISHERENDORSES THE INFORMATION THE ORGANIZATION OR WEBSITE MAY PROVIDE ORRECOMMENDATIONS IT MAY MAKE. FURTHER, READERS SHOULD BE AWARE THAT INTERNETWEBSITES LISTED IN THIS WORK MAY HAVE CHANGED OR DISAPPEARED BETWEEN WHENTHIS WORK WAS WRITTEN AND WHEN IT IS READ.

    For general information on our other products and services, or how to create a custom For Dummies book for your business or organization, please contact our BusinessDevelopment Department in the U.S. at 877-409-4177, contact [email protected] , orvisit www.wiley.com/go/custompub . For information about licensing the For Dummies brand for products or services, contact BrandedRights&[email protected] .

    ISBN: 978-1-119-25053-1 (pbk); ISBN: 978-1-119-25055-5 (ebk)

    Manufactured in the United States of America

    10 9 8 7 6 5 4 3 2 1

    Publisher’s AcknowledgmentsSome of the people who helped bring this book to market include the following:

    Project Editor: Carrie A. JohnsonAcquisitions Editor: Amy FandreiEditorial Manager: Rev Mengle

    Business Development Representative: Karen Hattan

    Special Help from Informatica: JohnHaddad, Murthy Mathiprakasam

    Production Editor: Tamilmani Varadharaj

    http://www.wiley.com/http://www.wiley.com/go/permissionsmailto:[email protected]://www.wiley.com/go/custompubmailto:BrandedRights&[email protected]:BrandedRights&[email protected]://www.wiley.com/go/custompubmailto:[email protected]://www.wiley.com/go/permissionshttp://www.wiley.com/

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    5/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Table of Contents

    Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1About This Book ........................................................................ 1Icons Used in This Book ............................................................ 2Beyond the Book ........................................................................ 2

    Chapter 1: Identifying Big Data . . . . . . . . . . . . . . . . . . . . . .3Evolution of Data over the Years ............................................. 3Introducing Big Data .................................................................. 5The Vs of Big Data ...................................................................... 5Identifying Different Sources of Data ....................................... 6How Big Data Is Used in Business ............................................ 8

    Chapter 2: Understanding the Challengesof Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11

    Identifying the Traditional Challenges of Big Data .............. 11Emerging Next-Generation Challenges .................................. 12State of Big Data Projects ........................................................ 13Understanding Why Businesses Are Struggling

    with Big Data ......................................................................... 15Introducing Big Data Management ........................................ 16

    Understanding the layers of big data .......................... 17Defining big data management capabilities ............... 17Overcoming obstacles with big data

    management ............................................................... 18

    Chapter 3: Building Blocks of EffectiveData Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . .19

    Understanding a Big Data Laboratory versus Factory ........ 19Identifying the Three Pillars of Data Management .............. 22

    Integration ...................................................................... 22Governance..................................................................... 23

    Security ........................................................................... 23Diving Deep into Big Data Management Processes ............. 25Empowering the Big Data Team ............................................. 27

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    6/53

    Big Data Management For Dummies, Informatica Special Editioniv

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Chapter 4: Using Big Data Managementin the Wild . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29

    Implementing Big Data Management in Business ................ 30Identifying Big Data Tools ....................................................... 32Leveraging the Right Tools ..................................................... 34Considering Commercial Tools Built atop Open Source

    Projects ................................................................................. 35Combining Management with Integration,

    Governance, and Security ................................................... 36

    Chapter 5: Ten Essential Tips for Succeedingwith Big Data Management . . . . . . . . . . . . . . . . . . . . . 39Design Use Cases for Business Value .................................... 39Automate and Centralize Your Data Management .............. 40Leverage Data Lakes ................................................................ 40Create Collaborative Methods for Governance ................... 41Identify Data Quality Issues Early .......................................... 42See Your Data and Relationships

    with a 360-Degree View ....................................................... 42Work with Expert Vendors to Accelerate

    Your Deployments ............................................................... 43Look for Process Repeatability .............................................. 43Align Your Vocabulary ............................................................ 44Automate Key Processes ........................................................ 44

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    7/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Introduction

    B ig data is the subject of great energy and excitement,and for good reason. The prospect of channeling all thedata in the universe (and that is a lot of data) into analyticalengines to understand relationships between entities, iden-

    tify illusive patterns, and predict future events is exciting! Itis changing our lives and altering the way businesses see usas consumers. When used correctly, businesses find that bigdata unleashes a wealth of information and insights whichtranslate to higher profits, reduced costs, and less risk; it isa win!

    The downside is, despite all the hype, many big data projectsstruggle to deliver on those lofty promises. The fact is that

    while technology evolved and data grew at an exponentialpace, the processes to manage big data were left behind. Theresult was frustration with many big data projects.

    This book provides a solution through big data management.Based on three pillars of integration, governance, and secu-rity, big data management provides a set of processes andtechnologies that make big data projects successful. Thesepillars will deliver data that is clean, governed, and secure to

    discover insights and turn them into real business value.

    About This BookBig data management is the solution to struggling big dataprojects in business today. Too much emphasis is placedon individual technologies and not enough focus on founda-tional components of data integration, data governance, anddata security. Applying these foundational components andtheir underlying processes will improve the effectiveness ofbig data projects that rely too much on new and unmanagedpoint-solution technology and “do-it-yourself” (DIY) manualprocesses.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    8/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    The focus of this book is learning what big data managementis and how to apply it to real-world big data projects. You will

    learn strategies, processes, and identify tools to implementbig data management so you can deliver big data projectsfaster and with greater value.

    Icons Used in This BookThroughout this book, you will occasionally see special iconsto bring your attention to a point that needs emphasized. I willkeep them brief, and sometimes a little funny, but if you seeone, take note because it’s something you should know.

    Tips indicate information that you may find useful. Often, theyrelate to an experience I had (or I wish I had at the time), orthey add additional context to a topic.

    If you see this icon, it’s probably something that will help youlater. You won’t find the meaning of life here, but you may findsome advice that will make your life easier.

    Warning means just that; be careful! I use warnings to alert youto common mistakes and serious issues for you to avoid.

    I am a technical person at heart and I love to understand howand why things work (or don’t). Yes, this is a Dummies book,but sometimes I delve deeper into a subject so you under-stand the “why” and “how” for a key topic.

    Beyond the BookThis book can’t teach you everything about big data, big datamanagement, analytics, or exciting technologies like Hadoopor NoSQL. I encourage you to research these topics on your

    own, if not from a professional perspective, at least explorebig data and the impact it will have on you as both a con-sumer and as a citizen in our ever-connected society.

    One good place to visit is the Informatica Big Data Ready webpage at informatica.com/bigdataready .

    Big Data Management For Dummies, Informatica Special Edition2

    https://www.informatica.com/ready/bigdatareadyhttps://www.informatica.com/ready/bigdataready

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    9/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Identifying Big DataIn This Chapter

    ▶ Understanding the evolution of data

    ▶ Explaining big data

    ▶ Leveraging big data in business

    D ata has evolved over the years and will continue toevolve. Originally a stable stream of well-structureddata, the growth of technology has unleashed a flood of varieddata from a myriad of sources. The flood of big data can over-whelm those who are unprepared, but for those ready for bigdata, many new business opportunities await. In this chapter,I explore how data has evolved into big data and how big datais used in business.

    Evolution of Dataover the Years In the early days of data processing (early term for IT), datacame from relatively few, well-defined sources; and once itcame into the computers of the day, it was structured. Thatis, structured data is in a known format of data size and type;think alphanumeric data for customer names, identificationnumbers, and sales numbers. Over time, the size and numberof data sources grew to be large, but they grew in a predict-able manner and maintained their structured format.

    Technologies to store and manage data such as RelationalDatabase Management Systems (RDMS) evolved and excelledat managing this kind of structured data. Programming lan-guages and Business Intelligence (BI) reporting tools were

    Chapter 1

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    10/53

    Big Data Management For Dummies, Informatica Special Edition4

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    developed to glean value from the data. As structured datagrew, iterative improvements in technology kept pace.

    As new technologies emerged, they generated a new type ofdata: unstructured data. Unstructured data comes in a varietyof data types, sizes, and formats. Examples of unstructureddata include audio and video files, pictures and images, andunstructured text streams such as mobile texts or socialmedia posts.

    Some people use the term semi-structured data to further dif-

    ferentiate within unstructured data. Social media posts andmobile texts, which are loosely formatted text, are consideredsemi-structured data. Log files and sensor equipment are alsosemi-structured examples.

    Technologies to manage structured data struggled to adaptto support unstructured data as well. Many establishedtechnologies offered support for unstructured data, butoften the implementation wasn’t as mature as what existed

    for structured data. In response, new technologies such asHadoop and NoSQL emerged especially suited for unstruc-tured data.

    Hadoop is an open source software framework from theApache Software Foundation used to store and process largenon-relational data sets using a distributed architecture. NoSQLis a class of databases engineered to process large unstruc-tured and semi-structured datasets. Commercialized and open

    source distributions of these technologies exist includingCloudera, Hortonworks, and MapR for Hadoop and ApacheCassandra, MongoDB, MarkLogic, and Couchbase for NoSQL.

    Beyond formatting structure, data itself is categorized into dif-ferent categories that are useful to understand:

    ✓ Traditional data: Data already existing in legacy systems,corporate databases, and local data stores such as Excel

    spreadsheets. This tends to be structured data and iswell managed using existing technologies.

    ✓ Enrichment data: Data which is specific to a purpose andsupplements traditional data. This is often external suchas demographically available data about the customersalready stored in traditional customer tables.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    11/53

    Chapter 1: Identifying Big Data 5

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    ✓ Emerging data: Data which is new, external, and is oftenbig data; usually in an unstructured, non-traditional

    format. Examples include social media, sensor, or logdata. Data that’s internal to the enterprise includesemails, documents and comments, and machine log files.

    Data is continually evolving and will continue to do so as tech-nology grows. Technology scientists, vendors, and businesseshave the challenge of keeping up with this evolution, and thelatest major evolutionary step is big data.

    Introducing Big DataIn its simplest terms, big data isn’t just a lot of data; it’s a lotof data being generated very rapidly and in a lot of differentformats. Big data is by its nature both structured and increas-ingly unstructured, and it’s being generated at a very fast ratefrom a variety of data sources. These factors ensure that bigdata is “big” in that there’s a lot of it, and it’s increasing at anexplosive rate.

    Beyond raw size, big data exceeds the capacity of existingtraditional systems to store and process it; new technologiesand processes are required to make effective use of big data.Big data is very often unstructured and not stored within anorganization’s corporate databases; it’s external and doesn’tneatly fit into predefined formats.

    Traditional technologies and methodologies are simply notsuited to capture, store, or process big data. The new technol-ogies and methodologies are critical for businesses and dataprofessionals to understand when they enter into the world ofbig data.

    The Vs of Big DataDefinitions of big data vary and will evolve because it’s stilla relatively young field. However, most reputable definitionsinclude reference to the original “Three Vs of Big Data”:

    ✓ Volume: The vast size of the data in terms of actual sizeand number of data items

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    12/53

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    13/53

    Chapter 1: Identifying Big Data 7

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    How much is data growing? Consider these facts:

    ✓ The amount of data in the world doubles every two yearsbased on multiple sources.

    ✓ By 2020, there will be 450 billion transactions on theInternet every day, according to International DataCorporation (IDC).

    ✓ As of 2013, there were between 2.5 to 3 zettabytes (ZB),but by 2020 there will be 44 ZB of data per IDC.

    Where in the world is all this data coming from? New technolo-gies, sensors on existing technologies, metadata (for example,data about data), and data about nearly everything a person ordevice does add to the data universe every second. Examplesinclude the following:

    ✓ The over 7 billion mobile devices in use today allow forone device per person on Earth — wow! Furthermore,consider the many apps on each device that generate

    location, communication, purchase, picture, video, andsocial media data.

    ✓ Online activities for web users ranging from browsing,communications, and commerce are other commonexamples. Usage patterns and preferences contained insessions yield a gold mine of useful data to be harvested.

    ✓ Sensors in the everyday devices you use in your lives.Increasingly, telemetry and location data within cars,

    home appliances, and entertainment devices are addedas new features and capabilities are introduced.

    ✓ Sensors within medical, scientific, and manufacturingdevices. As each device becomes more capable and net-worked, the data generated increases. Everything fromhospital beds tracking a patient’s detailed statistics tosensitive controllers on the factory floor are examples.

    Networked chips embedded within appliances, manufacturingdevices, mobile devices, and others are part of the Internet ofThings (IoT). IoT is a growing contributor to big data and isitself a rapidly expanding technology.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    14/53

    Big Data Management For Dummies, Informatica Special Edition8

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    While traditional data will always grow, unstructured enrich-ment and emerging data are growing even faster. Understanding

    the magnitude of data growth and having an appreciation ofits sources better posture you to understand big data and itsdirection.

    How Big Data Is Usedin Business

    A strong theme for businesses is to leverage technology toenhance operational effectiveness, reduce costs, and reachnew and existing customers with better products and servicesfaster than the competition. Much of the value proposition ofbig data is to be able to identify key information about yourenvironment and customers so you can act more quickly toseize an opportunity.

    Examples of how businesses can use big data are vast andindustry specific, but common use cases include

    ✓ Business revenue generating and risk reduction opera-tions to find new opportunities, reach out to new custom-ers, increase customer loyalty, and increase operationalefficiency.

    ✓ IT cost-reduction activities such as data warehouse opti-mization and offloading, and building centralized datalakes. Additionally, cost reductions can be realized byimplementing projects to update existing IT infrastruc-ture and operations to better leverage external datasources rather than reproducing in-house.

    ✓ Industry-specific activities to improve predictive mainte-nance in manufacturing environments, reduce healthcarecosts while improving outcomes, service improvementsin utilities and telecommunications, and identify and

    reduce fraud and default in insurance and financialindustries.

    The limit of how big data can be leveraged is increasingly lessabout the limits of technology and more about the imagina-tion and management processes of its practitioners.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    15/53

    Chapter 1: Identifying Big Data 9

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Serving up big data via the cloudCloud computing is about provid-ing something useful as a service .Infrastructure as a Service (IaaS),Platform as a Service (PaaS), andSoftware as a Service (SaaS) are themost common forms of this comput-ing architecture. Current examples

    of IaaS and PaaS from large vendorsinclude Amazon Web Services (AWS)and Microsoft Azure.

    Cloud computing in terms of big dataincludes accessing external data asa service. Rather than attempting to

    store data internally, you must accessit via a service. It can be conceptu-ally simple as customer demographicdata or Twitter feeds or more com-plex industry-specific data in a privatecloud. Another example is providingbig data analytical processing capa-

    bilities as SaaS. Many companiescan’t establish a big data analyticscapability in-house so they wisely usea big data SaaS offering to acceler-ate their implementation at lower costand reduced complexity.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    16/53

    Big Data Management For Dummies, Informatica Special Edition10

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    17/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Understanding theChallenges of Big Data

    In This Chapter ▶ Highlighting the challenges of big data

    ▶ Learning why businesses struggle with big data

    ▶ Overcoming challenges with big data management

    B ig data promises incredible opportunities, but unfor-tunately, a lot of work and complexity are involved in

    unlocking those opportunities. Some challenges are obvious,while others are more subtle. Beyond technical obstacles, theoperational and management challenges are often the mostdifficult to address. Fortunately, the intelligent use of a datamanagement methodology can solve these challenges. In thischapter, I discuss the challenges of big data and introduce big

    data management as a solution.

    Identifying the TraditionalChallenges of Big Data

    Some challenges of big data are obvious and tie back to the

    three Vs of big data: volume, variety, and velocity. These chal-lenges manifest themselves as

    ✓ Large, ever expanding volume of data across a multitudeof sources

    ✓ Different data types, especially with unstructured data

    ✓ Constant generation of new data to the point of systemoverload

    Chapter 2

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    18/53

    Big Data Management For Dummies, Informatica Special Edition12

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    New roles are emerging inside organizations like data scien-tists who need very quick access to data for discovering new

    insights. Automation systems and new types of data-drivenapplications also require constant access to trusted dataassets. All these requirements can be challenging to deliveragainst as new data platforms, like Hadoop, emerge, whichintroduce new skills and new systems to integrate and manage.

    These represent the traditional and obvious challenges ofbig data. On the positive side, while they are indeed daunt-ing challenges, they are well understood by business and

    technical folks alike. Furthermore, evolution of technologyand increases in processing power and storage help mitigatesome of their impact. However, on the negative side, theseobstacles don’t reflect the full spectrum of challenges facedby big data practitioners; there’s another set of more subtleyet equally important next-generation challenges.

    Emerging Next-GenerationChallenges As companies delve into big data projects and technical andbusiness specialists roll up their sleeves and “get dirty,” theyfind a whole new class of challenges they may have neverseriously considered before. These challenges are every bitas daunting as the more traditional challenges, and in fact

    they’re often more difficult to overcome.Next-generation challenges include the following:

    ✓ Varied data sources: Data comes from and resides inmany different internal and external sources, includ-ing data warehouses, data marts, data lakes, generatedreports, the cloud, and third-party resources. Data com-monly originates from business transactions, web and

    machine log files, and social media. ✓ Data silos: Valuable data may not be accessible due tooverly stringent polices and/or politics, or it may be sodistributed, segregated, and walled-off that accessing it istoo resource intensive and not repeatable. Or there maybe useful data, but it’s in a silo you aren’t aware of; thus,you miss a full 360-degree view of the data.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    19/53

    Chapter 2: Understanding the Challenges of Big Data 13

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    ✓ Increasing security risks: Data breaches are big newsand legitimately can spell disaster for a company or

    organization; no one wants their CIO in the newspapersfor a data breach. Sensitive data exists by itself andaggregation of seemingly non-sensitive data can becomea security issue.

    ✓ Lack of data governance: In absence of a unified datagovernance policy, either too many controls or notenough as relate to data quality and data sharing areenforced, and there’s no uniform process of accessingand managing (curating) data. At best, your efforts toaccess, prepare, and curate data are inconsistent andinefficient; at worst, you either can’t access data, can’ttrust the data, or you create a potential security issue.

    ✓ Too many emerging and changing technologies: Bigdata is still evolving with new vendors, technology,and open source projects. Keeping track of this shiftinglandscape is difficult, and standardizing on a big dataplatform and methodology is both technically and often

    politically complex. ✓ Value is difficult to unlock: Data in itself has little value,but finding the important relationships within a data uni-verse to identify actionable information is the real chal-lenge. IT, business, and management stakeholders mustbe equipped with technology, policy, and the will to findand exploit opportunities from data before real value isachieved.

    Next-generation big data challenges are often as process- andpolicy-driven as they are technical. In these cases, faster CPUchipsets won’t help; it takes people breaking down institutionalbarriers, implementing new policies, and leveraging smartertoolsets that simplify work to overcome these obstacles.

    State of Big Data Projects Unfortunately, traditional- and next-generation challengesoften encumber projects and frustrate businesses with a nega-tive impact on the perception of big data. Industry expertsclaim big data will solve everyone’s problems overnight, butmany projects haven’t experienced stellar results.

    Most savvy technology, business, and executive folks under-stand the value of big data someday; regrettably, despite their

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    20/53

    Big Data Management For Dummies, Informatica Special Edition14

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    efforts, that day is not today. Many big data projects havesimilar issues:

    ✓ Increased complexity: Pulling out the valuable insightsfrom data is harder than originally imagined. Finding,accessing, integrating, and preparing diverse data canconsume most of the project resources.

    ✓ Extended delays: Moving from inception to deliveringa production-ready product takes too long. Subsequentprojects also take nearly as long.

    ✓ Moving beyond the Proof of Concept (POC): The POCproject either languishes indefinitely without showingresults, or it’s a one-off event that’s difficult to reproduceon a larger scale.

    ✓ Immature processes: Business processes, data gover-nance, and security compliance standards aren’t yetmature enough to support the effort.

    ✓ Unexpected cost: Time and resources invested exceed

    what was initially expected. ✓ Slow Time to Value: The ROI being delivered is less thanwhat was originally promised.

    You can see in Figure 2-1 the complex ecosystem of big dataprojects with many data inputs, business outputs, cross-system processes, and multi-disciplinary stakeholders; nowonder these are difficult to manage.

    BusinessObjectives

    DataSources

    Analytic Apps, Enterprise Apps, etc.

    Data Lakes, Data Warehouses, Data Marts, NoSQL

    • Takes too long to get the data• Can I trust the data?• We have a lot of sensitive data

    • Too many one-off projects• Hard to build and maintain• Many compliance requirements

    Laboratory (insights) Factory (actions)

    Documentsand Emails

    Social Media,

    Web Logs

    Relational,Mainframe

    Machine Device,Cloud

    ImproveFraud

    Detection

    ReduceSecurity

    Risk

    IncreaseCustomer

    Loyalty

    ImprovePredictive

    Maintenance

    IncreaseOperationalEfciency

    Data Modeler Data Scientist Data Analyst Data Steward Data Engineer Business

    Figure 2-1: Common problems with big data projects.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    21/53

    Chapter 2: Understanding the Challenges of Big Data 15

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    The ultimate effect of these problems is a degradation of faithin big data projects by business leaders. Executives concep-

    tually understand big data can provide value, but for amultitude of reasons, they become hesitant to aggressivelypursue future projects. This perception is unfortunatebecause once expectations are properly set and managed,coupled with the right data management methodology,great results are possible. However, to get past these initialhurdles, it’s necessary to understand why big data projectsexperience problems.

    Beware the magic “cure-all” solution regardless of the tech-nology involved. Vendors love to sell companies tools thatpromise to erase complex problems, make everything faster,and reduce costs while bringing in huge profits, often with thesingle click of a button. The truth is, tools can bring you greatoutcomes, but it takes work to make it happen. Be realisticand know how to separate hype from reality.

    Understanding Why Businesses Are Struggling with Big Data

    Why the disconnect between the wonderful promises of bigdata versus the reality of many big data projects? The answeris usually a combination of factors, but often a set of commonthemes is found in struggling big data projects.

    Companies struggling with big data projects typically havefallen prey to one or more of the following pitfalls:

    ✓ Trying to do it all yourself: Do It Yourself (DIY) is howmany projects begin, but they struggle uphill as teamsstumble through a steep learning curve for technology,processes, and governance. Inefficiencies, delays, andlost opportunities are common with this trial-and-error

    approach. Appealing at first as an attempt to keep costslow or as a POC effort, DIY seldom pays off in the long run.

    ✓ Not enough trained and experienced people: Closelytied to DIY, not having a sufficient pool of expertise onthe project will cause delays and inefficiencies whiledriving up costs. Developing big data skillsets takes timeand resources; it’s not something staff can easily do inaddition to their other duties.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    22/53

    Big Data Management For Dummies, Informatica Special Edition16

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    ✓ Custom, hand-coded solutions: A by-product of DIY,inexperienced teams are developing custom, hand-coded

    solutions, which aren’t repeatable for future efforts.Common with data cleansing and integration processes,these home-grown solutions are hard to initially developand aren’t reusable when new datasets or projects areintroduced. Thus, over time, a myriad of custom solu-tions are written at great expense with limited futurebenefit.

    ✓ Not leveraging appropriate tools and best practices: Reinventing the wheel time and time again while ignor-ing the expertise more experienced big data experts(and vendors) have developed is wasteful in terms oftime and resources. Taking too much of a narrow viewwithout leveraging the greater body of expertise andtools frequently leads to frustration, delay, and reducedpositive results.

    Did you notice these common errors are more about meth-odology than they are about raw technology? And their rootsare in next-generation challenges more than traditional chal-lenges? If you want to be successful with big data, you need tounderstand and embrace big data management.

    Is your big data project entirely IT driven or does it havea business leader as champion? Where is the funding (andhopefully there is funding) coming from and what and whenis the projected ROI? And is the staff trained and allocatedas a dedicated resource or is this yet another side project astime allows? Be sure to ask these questions and consider theconsequences.

    Introducing Big Data Management

    Big data management is defined as the capture, integration,administration, and governance of big data for use in anorganization’s data-related applications, often with an empha-sis on business intelligence and analytical applications. It spansboth technical and non-technical aspects of access, integration,governance, and security of data being used within an organi-zation to highlight relationships, identify trends, or otherwiseprovide a more complete view of the business environment. In

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    23/53

    Chapter 2: Understanding the Challenges of Big Data 17

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    particular, the focus is on the integration, governance,and security of big data. However, to really understand big data

    management, one must first comprehend the hierarchy of bigdata architectural layers.

    Understanding the layersof big dataOperating on big data is best visualized as a three-layered

    architectural hierarchy: ✓ Visualization and analysis: This includes visualiza-tion tools like Tableau and Qlik and analytic tools withadvanced statistics and machine-learning algorithms likeR, SAS, and H2O.

    ✓ Big data management: This includes the technologyneeded to integrate, govern, and secure big data such aspre-built connectors and transformations, data quality,data lineage, data mastering, and data masking. Platformvendors for big data management include Informaticaand a variety of startups with specific point solutions.

    ✓ Storage persistence layer: This includes persistencetechnologies and distributed processing frameworkslike Hadoop, Cassandra, data warehouses, and MPPappliances.

    The world of big data management exists in the middle,between the visualization and analytics and the storage andprocessing framework layers. Big data management interfaceswith the data at the storage layer, processes that data, and pro-vides datasets for visualization and analysis to the upper layer.

    Defining big data management

    capabilities Big data management performs several core functions withina defined hierarchy. Some functions are entirely technical,while others are non-technical. The core functions are

    ✓ Integration: Finding, accessing, integrating, cleansing,preparing, and delivering the data

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    24/53

    Big Data Management For Dummies, Informatica Special Edition18

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    ✓ Governance: Managing and curating the data to ensure itis high quality, clean, trusted, and “fit-for-use”

    ✓ Security: Protecting the data from unauthorized accessand manipulation

    In Chapter 3, I dive deep into each function of big data man-agement as you explore the details of integration, governance,and security.

    Overcoming obstacles withbig data management Assume you have established a successful big data manage-ment solution; what problems can it solve? Big data manage-ment can help you

    ✓ Reduce the time to access, integrate, and prepare data

    ✓ Enhance the trustworthiness and quality of the data ✓ Protect sensitive data and reduce security breaches

    ✓ Enforce compliance and standards across the board

    ✓ Create repeatable processes to accelerate future efforts

    ✓ Bring projects from proof of concept to production faster,and with less risk and effort

    ✓ Adapt as technology and data evolves

    Big data management is the solution needed to get you pastthe traditional and next generation stumbling blocks commonto so many projects today. Using big data management won’tmagically solve your problems overnight, but it will move youmuch closer to meeting the expectations and achieving theROI that has been promised for so long.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    25/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Building Blocks of EffectivData Management

    In This Chapter ▶ Exploring the differences between a big data laboratory and factory

    ▶ Learning the three pillars of big data management

    ▶ Detailing the processes within big data management

    ▶ Enhancing effectiveness through empowerment and intelligentresource management

    I f you’ve been reading this book straight through, at thispoint you know what big data is and why big data projectsencounter problems; you are now ready to take a deep diveinto big data management to understand why it is core to abig data strategy. I cover in detail the three pillars of big datamanagement and what they do and why they’re critical toyour project’s success. Next, you explore the processes asso-ciated with big data management in detail. Finally, I end thechapter with help on how to empower your team and makethe most of the resources you already have in place. Thischapter provides the foundational knowledge to truly under-stand big data management.

    Understanding a Big DataLaboratory versus FactoryBefore I delve into the details of big data management, I mustapply context to the environments in which big data is used.Depending on the environment, the requirements for an

    Chapter 3

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    26/53

    Big Data Management For Dummies, Informatica Special Edition20

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    effective implementation differ in several key ways; I explorethose ways and why they matter.

    In terms of environments, everyone often thinks in terms ofdevelopment, test, and production, but with big data that isn’tentirely accurate. Rather, think in terms of laboratory environ-ment versus a factory environment.

    Big data laboratory environments are typified by data scientiststesting a series of datasets with a set of analytical algorithmsin an attempt to identify key insights that bring potential value

    to the business. This is entirely scientific experimentation withthe intent to find datasets and analytical models that can bepassed to the operations team who will put the solution intoproduction (aka “the factory”); that is where real businessvalue will be derived. I use the term factory, but some organiza-tions refer to it as operations or production.

    Big data factories take the model provided by the laboratoryand put those solutions into use as production. In big data

    analytics uses cases, these environments are managed by ateam of IT specialists with business analysts reviewing andapplying the resulting insights to the business. However,more common use cases (for example, next best offers, frauddetection, new data-driven products, predictive maintenance,and so on) strive to deliver actionable information directly tothe end-user in real time. This eliminates the need for a busi-ness user acting as a middleman layer between the end-usersand data. Big data factories are focused on data products that

    provide actionable information directly to end business usersand consumers.

    There is a relationship and dependency between laboratoriesand factories:

    ✓ Laboratories experiment until they have a solution thatwill provide business value. They pass that solution on tothe factory for production.

    ✓ Factories implement the solution provided by the labora-tories to generate real business value.

    ✓ Without factories, laboratories would have no reasonto exist.

    ✓ Without laboratories, factories would have no solution toimplement in production.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    27/53

    Chapter 3: Building Blocks of Effective Data Management 21

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    There are, however, critical differences between the needs oflaboratories and factories:

    ✓ Self-service autonomy in the laboratory is criticalbecause data scientists will be conducting many, manyexperiments until they find a winning solution. Datascientists need the freedom to set up and execute experi-ments themselves rapidly to pursue their research.

    ✓ Data in the laboratory is less subject to cleansing thanin a factory environment. There’s relatively little impactif data isn’t fully curated in a laboratory environment aslong as it’s an authentic and accurate representation ofthe real world.

    ✓ Operational agility in factory environments is importantbecause after a valuable insight is identified, the businessmust be nimble enough to exploit its temporary advan-tage. It does little good to have key information if you’reunable to take advantage of that knowledge or if a com-petitor implements it first.

    ✓ Data integrity, timeliness, and trustworthiness are alsoimportant requirements in a factory environment; theonly thing worse than no data is bad data. Taking actionbased on incorrect or outdated information can be costlyin terms of time and resources; plus it degrades the trustthe business has on big data for future operations.

    ✓ Laboratories and factories must conform to corporatesecurity policies to protect sensitive data and adhere to

    regulatory compliance.

    During the transition from laboratory to factory, make sureapplicable governance and security policies are followedand put in place. On occasion, a less stringent laboratoryenvironment introduces elements that aren’t permitted in acontrolled factory environment.

    Why does this matter in terms of big data management?

    Assuming you accept that laboratories and factories arecritical to the big data operations within a company, thosetwo distinct environments must be managed appropriately.The needs and differences of the environments must berespected and carefully managed; recognizing these differ-ent approaches to integration, governance, and security isimportant when evaluating big data management platforms.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    28/53

    Big Data Management For Dummies, Informatica Special Edition22

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Fortunately, big data management can be architected to beflexible enough to meet the needs of laboratories and facto-

    ries once you understand how it works.

    Identifying the Three Pillarsof Data Management

    The power of flexibility of big data management comes from

    its architecture. Rather than attempting an overly complexarchitecture, big data management builds on three founda-tional pillars:

    ✓ Integration

    ✓ Governance

    ✓ Security

    In Figure 3-1, you see the pillars of data management.

    As you can see in Figure 3-1, there are only three pillars, buteach pillar encompasses multiple processes.

    IntegrationIntegration ingests and processes data to achieve a result;this processing must be scalable, repeatable, and agile. Thelongest delays in big data projects occur during integration;smarter integration will reduce these time frames, automateprocesses, and allow for rapid ingestion of new data. Keycomponents of integration include

    Big DataIntegration

    • Simple Visual Environment & Dynamic Templates• Optimized Execution & Flexible Deployment• 100’s of Pre-built Transforms, Connectors & Parsers• Broker-based Data Ingestion

    Big DataSecurity

    • Sensitive Data Discovery & Classication

    • Proliferation Analysis

    • Risk Assessment

    • Persistent & Dynamic Data Masking

    Big DataGovernance

    • Collaboration self-service Capabilities• Universal Metadata Catalog & Business Glossary• Proling, Discovery, and Data Quality• 360˚ Data Relationship Views• End-to-end Data Lineage

    Figure 3-1: Understanding the pillars of data management.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    29/53

    Chapter 3: Building Blocks of Effective Data Management 23

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    ✓ Agile and high performance ingestion of next genera-tion data

    ✓ Automated and scalable integration, cleansing, andmastering of next-generation data

    ✓ Optimized and readily usable tools for ingestion andprocessing coupled with repeatable processes

    GovernanceGovernance defines the processes to access and administerdata, ensures the quality of the data, how it is tagged and cat-aloged, and that it is fit-for-purpose. Essentially, the businessand IT teams must have confidence their data is clean andvalid. Key components of governance include

    ✓ Collaborative governance to allow everyone to partici-pate in holistic data stewardship

    ✓ 360-degree view knowledge graph of data assets showingsemantic, operational, and usage relationships

    ✓ Trust and confidence that the data is fit-for-purpose

    ✓ Data quality, provenance, end-to-end lineage and trace-ability, and audit readiness

    SecuritySecurity identifies and manages sensitive data with a 360-degree ring of risk assessment and analysis. Security mustoccur at the source, not just at the perimeter. Identifyingwhich data is sensitive (credit card information, emailaddresses, addresses, Social Security numbers, and other per-sonally identifiable information ) and which data aggregatedtogether becomes sensitive is a growing challenge. Key com-ponents of security are

    ✓ 360 degrees of sensitive data discovery, classification,and protection

    ✓ Data proliferation and risk analysis

    ✓ Masking and encryption for sensitive data

    ✓ Security policy creation and management

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    30/53

    Big Data Management For Dummies, Informatica Special Edition24

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Security is huge, and many organizations rightfully protecttheir data like a grizzly bear protecting her cubs. This can

    become an obstacle for data access (ingestion as part ofintegration), especially if you can’t prove you have sufficientsecurity and governance controls in place.

    It is far better to do the security, compliance, and governancework up front to alleviate data owners’ concerns beforerequesting sensitive data. You must demonstrate you haveappropriate security controls in place; otherwise data ownerswill block your efforts.

    Exploring data governance strategyOf the three pillars of data manage-ment, governance is often the mostforeign to people with a technicalbackground. Governance is about the policies, procedures, techniques,and technology you use to administeryour data to ensure it is trustworthy,accurate, available, and actionable.

    Governance is inherently a bureau-cratic process; regulations, laws,and auditors require controls to exist.That frequently concerns people

    because they think governance mustbe a hindrance, and that perceptionisn’t correct. Governance can eitherwork for or against you, dependingon how it is approached.

    If you have weak governance, datawill effectively become “locked up”because there is no established pro-cess to “free” it. Every time you wantdata, it’s a battle to gain access.If you establish strong governanceprocesses allowing access to data

    across your enterprise, you will havecreated a standardized, repeatableprocess that will pay many dividends.Getting data becomes streamlinedbecause you already have polices toaccess that data.

    Governance also improves the qual-ity and trustworthiness of your data,and it helps identify relationshipswithin that data. In this context, youuse governance techniques and technologies to enrich, curate, and

    tag metadata, thus making the datamore useful and actionable.

    Knowing the provenance (origin) ofdata, and tracing it from creation to its current state (end-to-end lin-eage), allows that data to be muchmore transparent and trustworthy to the business. Done correctly, youwill continuously enrich a goldendata record of your customers, and that brings real value to the table.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    31/53

    Chapter 3: Building Blocks of Effective Data Management 25

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Diving Deep into Big Data Management Processes

    Much of the heavy lifting of big data management occurswithin integration. During integration, data ingestion, cleans-ing, preparation, and processing occur; however, security andgovernance also have processes as well. Understanding theseprocesses will enhance your ability to manage big data moreeffectively.

    Key big data management processes include

    ✓ Access data: Set up repeatable, well-managed processesto acquire data from both traditional and next genera-tion data sources. Multiple data sources will be used, sohaving pre-configured access tools and connectors are agreat timesaver.

    ✓ Integrate data: Establish processes to prepare and nor-malize data for a myriad of data sources. This process isoften very challenging; resist the temptation to rely onmanual methods, and leverage automation and repeat-ability as much as possible.

    ✓ Cleanse data: Review the data to ensure it’s ready foruse; that means checking for incomplete or inaccuratedata and resolving any data errors that may bias analysisor negatively impact business operations and decision

    making. Beware this process can be tedious, and leverageautomation options when available.

    ✓ Master data: Organize your data into logical domainsthat make sense to your business such as customers,products, and services. Furthermore, you can add enrich-ment data to further paint a clearer picture of your cus-tomers, products, and services and their relationships.

    ✓ Secure data: A mix of governance and security allows

    you to establish security rules and then implement thoserules. First, you must determine how you will manageyour sensitive data. Next, you must find and assess therisk of your sensitive data and implement rules via policyand technology. This process is very important but proneto be under-addressed by those inexperienced in big datamanagement.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    32/53

    Big Data Management For Dummies, Informatica Special Edition26

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    ✓ Explore and analyze data: Implement a data laboratoryto perform experiments with a clear business goal in

    mind. Based on your hypotheses, find what data existsand how it can be analyzed to create a model that deliv-ers results. Then determine if the results are beneficialto the business; remember that providing actionableinformation and processes is the goal. Develop best prac-tices to enhance agility and processes before pushing thesolution into the factory.

    ✓ Explore and analyze for business needs: Test outdata products to see if they provide a real value for thebusiness; often you just need to try something to see if itworks. It is common to use A/B testing to determine if anew data product adds value to the business. Make itera-tive improvements over time as you learn what works,what doesn’t work, and what can be improved.

    ✓ Operationalize the insights: Automate and streamlineyour processes to create a steady pipeline of action-able insights to business users. It’s not enough to have

    occasional production runs from the big data factory;the factory must be running regularly to be truly produc-tive, meet business service-level agreements (SLAs), andachieve the expected ROI.

    These processes aren’t necessarily linear, although they havea general flow with reiteration and loopback as necessary.Really, these processes run as a cycle as data is brought intothe system, processed, tested, and then implemented for the

    business; then the next data project or test is started.

    The system will ingest data from data sources, clean, inte-grate, and manage that data, and then pass it to analyticapplications for processing to develop insights and finallyto business applications in the form of actionable informa-tion, all while applying big data management processes.Understanding the processes of big data management enablesyou to better manage environments.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    33/53

    Chapter 3: Building Blocks of Effective Data Management 27

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Empowering the Big Data TeamIt is not cliché to say a company’s greatest asset is its people;it’s the truth. The challenge is what can be done to increasetheir effectiveness and ability to produce results, and in thiscontext I mean working with big data.

    First, understand the role and needs of each team memberor category of member. There will be a mix of data scientists,modelers, analysts, stewards, engineers, and business users,

    all with different perspectives, skill levels, and needs. Somewill require greater self-service autonomy (in the laboratoryenvironment), while others require operational agility (inthe factory environment); your job is to identify their needswithin the big data environment.

    Next, incorporate the three pillars of big data managementinto the team members’ operating principles and environ-ment. Using a disciplined approach, ensuring that in particu-

    lar governance and security processes are followed, is oneof the biggest favors you can do for your team. Not havinggovernance policies enabling hassle-free access to data willdoom your team to needless headaches negotiating accessto needed data. Failing to have necessary security controls inplace also adds to data access issues, but worse yet it opensup the risk the team could be associated with a data breach.Make sure your team understands the value of governanceand security and uses it to their advantage.

    Next, get help for your team in terms of training, effectivetechnology, outside experts, and vendor experience. Oddsare your team is already overworked; why make them dothings the “hard way” by denying those tools and expertiseto increase their effectiveness? Forcing your team to work inisolation devoid of the great work already done with big datawill send the team down a path of one-off, custom solutions,manual processes, and tedious work that is not reproducible.

    That DIY approach results in frustration for the team andcostly lost opportunities for the business.

    Finally, consider what you can do with what you already haveby creating repeatable, automated processes and standard-ized technologies. Rather than re-inventing the wheel and

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    34/53

    Big Data Management For Dummies, Informatica Special Edition28

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    expending resources for each new project or dataset, seizeopportunities where you can

    ✓ Reuse existing infrastructure and tools

    ✓ Reuse skillsets, expertise, and processes

    ✓ Reuse previous projects’ components

    Working big data projects is initially complex work, but whenquality big data management principles are followed, thatwork can be reused again and again to the benefit of the team

    and the business.

    Taking steps to empower your big data staff isn’t just rightfor them as employees, but it yields benefits for the companyas well.

    Your people are an investment, and those in the big data fieldknow their value. There’s an industry shortage of qualifieddata scientists, data engineers, and those who have knowl-

    edge and experience in the big data world, and that shortageis expected to increase in the near future. You must be willingto develop and retain your highly skilled big data workforce;otherwise they may go elsewhere under favorable marketconditions.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    35/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Using Big DataManagement in the Wild

    In This Chapter ▶ Identifying common use cases and drivers in big data managementprojects

    ▶ Learning more about big data management tools and techniques

    ▶ Highlighting several big data management toolsets in use today

    ▶ Reviewing real-world business experiences of big data management

    U nderstanding the foundational concepts of big datamanagement is essential, but you must also understandhow the concepts exist in the practical world. Businessesinitiate big data management projects for various purposesand from multiple perspectives; it’s important to understandthe drivers of those efforts. The ability to identify key attri-butes in big data management tools and how to effectivelyuse those tools within the principles of big data managementin production environments is critical information I provide.Finally, I identify some useful toolsets and highlight busi-ness experiences with those tools. In this chapter, you gainan understanding of how to merge the big data managementprinciples with real-world business operations.

    Chapter 4

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    36/53

    Big Data Management For Dummies, Informatica Special Edition30

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Implementing Big Data Management in Business

    Companies initiate big data management projects for a varietyof reasons. While the industries and circumstances may varygreatly, most projects originate from two emphases:

    ✓ Business centric

    ✓ IT centric

    Business-centric big data management projects are just that:focused on generating a business benefit. Often, the intent isto generate new or additional revenue where it is relativelyeasy to calculate the ROI. In other cases, more subtle ben-efits occur such as better understanding the preferences orrelationships of perspective customers or improving existingbusiness processes. Even more subtle benefits are avoidanceor detection of specific conditions such as fraud, claims, orpreventing a component failure via the Internet of Things(IoT). These projects are frequently initiated by businessanalysts and executives.

    Examples of common business-centric uses cases include

    ✓ Using credit card transaction and money transfer data forreal-time fraud detection

    ✓ Analyzing website visitor behavior with clickstream data ✓ Processing customer data for a customer 360 initiativeto gain a complete picture of customers’ demographics,interests, patterns, and behavior

    ✓ Devising and implementing a program to increasecustomer loyalty

    ✓ Using predictive analytics to detect failing componentsand replace them before an assembly line is impacted

    ✓ Improving hospital patient outcomes and reducing thetotal cost of care

    IT-centric big data management projects often seek toimprove a process or provide an analytical or data capa-bility that previously didn’t exist. These projects provide

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    37/53

    Chapter 4: Using Big Data Management in the Wild 31

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    infrastructure cost savings and an indirect business benefitand are more difficult to calculate ROI. For example, IT may

    create a new data lake or build a Hadoop cluster that providesenhanced capability for the organization, but in isolation theseprojects don’t generate revenue unless they support a businessprocess or initiative. These projects are often generated by ITas a consolidation or modernization initiative or in response tobusiness requests to explore a new capability.

    Examples of common IT-centric uses cases include

    ✓ Building a staging environment to offload data storageand Extract, Load, and Transform (ELT) workloads from adata warehouse

    ✓ Extending a data warehouse with a Hadoop-basedOperational Data Store (ODS)

    ✓ Building a centralized data lake that stores all enterprisedata required for big data analytics projects

    These use cases are just the tip of the big data iceberg; beloware examples of real companies’ positive experiences:

    ✓ A large bank wanted to reduce the time required todetect fraudulent events. By better and faster prepara-tion of data before processing with anti-fraud software,fraud events were detected quickly. The bank also inte-grated data from outside sources including credit card,money transfers, and mortgage payment information to

    further enhance their fraud analysis and detection. ✓ A well-known insurance firm sought to better understandtheir customers and to generate more carefully targetedand personalized marketing campaigns. By using bigdata management tools, the company ingested, cleaned,and matched data from customer profiles, partner data,previous history, web logs, and social media activity.This data created a 360-view of customers enabling morecustomized and effective marketing campaigns.

    ✓ A large healthcare insurer wanted to lower the cost ofcare while improving patient outcomes. The companyleveraged big data management tools, improving its ana-lytics infrastructure, which allowed more effective patient-provider collaboration and member engagement. Theseefforts improved patient outcomes and reduced costs,while retaining and increasing the number of those insured.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    38/53

    Big Data Management For Dummies, Informatica Special Edition32

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Putting big data management in the context of business andIT-centric projects is beneficial in understanding how big data

    management can help your business.

    Identifying Big Data Tools Within big data management architecture, there are processesand tools. Often the processes used are dependent on oneor more tools for implementation. Rather than focusing onspecific products (which can encompass several tools), youmust first identify the categories of tools common in big dataenvironments.

    Big data tools are commonly separated into the followingcategories:

    ✓ Data Ingestion: Ingest (obtain, import, and process) datafrom different sources at various latencies (for example,including real-time), in an efficient usable manner lever-aging pre-built connectors to simplify the data ingestionprocess.

    ✓ Data Management: The end-to-end tools and processesused to integrate, govern, secure, and administer thetransformation of source data into data that is “fit-for-purpose” and in compliance with corporate and regula-tory policies.

    ✓ Data Integration: Combine data from different sourcesusing a variety of transformations such as filtering,joining, sorting, and aggregating while establishing rela-tionships within the datasets to provide a unified view.

    ✓ Data Quality: Clean up data to ensure it’s fit for itsintended purposes to appropriately address incompleteor irrelevant entries, eliminate duplications, standardizeand normalize data, and ensure data exceptions are han-dled properly, preferably in an automated and repeatable

    fashion. ✓ Metadata Catalog: A dedicated repository to collect,manage, and report on data assets, their relationships,and the processes used to integrate, govern, and securethose assets. A universal metadata catalog that spans theentire data infrastructure landscape is the foundation forbig data management.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    39/53

    Chapter 4: Using Big Data Management in the Wild 33

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    ✓ Master Data Management: Enforce and ensure theaccuracy and accountability of the critical data in an

    organization to provide a common point of referenceand truth.

    ✓ Data Masking: De-identify, obfuscate, or otherwiseobscure sensitive data, such as credit card numbers, sorelational integrity is maintained, yet key sensitive valuesaren’t accessible.

    ✓ Data Security Analytics: Analyze and assess the risk of adata security breach by identifying the location and pro-

    liferation, and tracking the usage of sensitive data. ✓ Streaming Analytics: Collect, process, and analyze multi-latency data (including real-time) to provide event-basedinsights and alerts within a time-interval of maximumbusiness impact (often in real-time or near real-time).

    ✓ Big Data Analytics: Apply analytical formulas and algo-rithms to datasets to answer questions based on bigdata; these algorithms are employed by data experts to

    test hypotheses and validate analytic models used toimprove business outcomes.

    ✓ Data Lakes: Collect and store all types of data as origi-nally sourced for use as a live archive, data exploration,and an operational data store for pre-processing andpreparing data for big data analytics.

    ✓ Data Warehouses: Collect and store structured data intoa large repository for the purpose of applying analytics

    and generating reports.

    When evaluating tools and software packages, ask “what doesthis actually do and where does it fit within my big data archi-tecture?” Often, if you can’t find a satisfactory answer of whata product does or how it complements or replaces an existingtechnology within your IT infrastructure, you should bewarethat it may have limited or no value. The same conceptapplies with big data management tools; if the perspective

    tool doesn’t include functionality listed in the above catego-ries, there may be more marketing hype than substance.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    40/53

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    41/53

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    42/53

    Big Data Management For Dummies, Informatica Special Edition36

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    ✓ Features and enhancements made possible only by ven-dors with extensive expertise and large R&D engineering

    departments ✓ Comfort level and compliance assurance that a paidvendor is behind the product

    In the world of big data management, Informatica has taken asimilar approach of integrating their data management exper-tise with open source packages.

    For example, Informatica leverages several open sourcetools as part of their big data management stack. Specifically,Informatica leverages open source MapReduce, Spark, Hive onTez, YARN, Navigator, and Sqoop. As a user, you can expectthe functionally of each open source package, but additionalenhanced capabilities from Informatica.

    Combining Management withIntegration, Governance,

    and SecurityBig data management pillars of integration, governance, andsecurity form the overarching hierarchy of management pro-cesses. Those processes are implemented via technologies —often open source technologies. In Figure 4-1, you see howintegration, governance, and security ride atop big datamanagement processes powered by open source and vendortechnologies.

    Figure 4-1: Integrating big data management processes and technology.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    43/53

    Chapter 4: Using Big Data Management in the Wild 37

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    The approach shown in Figure 4-1 is common in the big dataworld. Using a mix of open source tools (Hadoop, Spark) witha vendor big management engine (Informatica Blaze) and uni-versal metadata catalog (Informatica Live Data Map), big datamanagement processes are applied to technology to deliverintegration, governance, and security.

    Introducing Informatica big datamanagement products v10Informatica, a leader in data man-agement technology, has recentlyreleased its v10 family of new andupgraded products. Providing the three pillars of big data management, these tools merit investigation.

    Tools of particular interest to thebig data management practitionerinclude

    ✓ Big Data Management v10: Complete delivery of integration,governance, and security builton Hadoop

    ✓ Secure at Source v10: Enablesdata security intelligence allow-ing identification, analysis, andmitigation of security risks

    ✓ Master Data Management

    (MDM) v10: Provides MDM capa-bilities to provide a single view ofdata and 360-degree view of rela- tionships and transactions

    To learn more about these bigdata management products, visit the Informatica website at www.informatica.com .

    http://www.informatica.com/http://www.informatica.com/http://www.informatica.com/http://www.informatica.com/http://www.informatica.com/http://f/Wiley/Carrie's%20files/Big%20Data%20Management%20For%20Dummies,%20Informatica%20Special%20Ed/FM%20Author/www.informatica.comhttp://f/Wiley/Carrie's%20files/Big%20Data%20Management%20For%20Dummies,%20Informatica%20Special%20Ed/FM%20Author/www.informatica.com

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    44/53

    Big Data Management For Dummies, Informatica Special Edition38

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    45/53

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Ten Essential Tips forSucceeding with Big Data

    ManagementIn This Chapter

    ▶ Designing use cases with business value

    ▶ Leveraging data lakes

    ▶ Governing data collaboratively ▶ Working with expert vendors

    ▶ Automating key processes

    M anaging big data is the key to successful big dataprojects. Beyond technology, the management tech-

    niques deployed make the difference between success andfailure. In this chapter, I identify tips and techniques to makeyou more effective at managing big data in the real world.

    Design Use Cases forBusiness Value

    Delivering value early and often is a function of your overallstrategy. Sure, you can develop a very large and ambitiousplan that promises a substantial payout at the end of the proj-ect, but that approach is fraught with risk and is often difficultto sell to senior leadership. Overly sized projects are difficultfor new teams to tackle, and if issues occur (as they usuallydo), the future of the project is jeopardized.

    Chapter 5

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    46/53

    Big Data Management For Dummies, Informatica Special Edition40 A better way is to establish uses cases that deliver smallervictories earlier in the process. Using smaller, agile teams

    who focus on rapid, iterative development practices to showvalue early has many benefits. First, agile development doessolve many of the challenges and mitigate risks found in largerprojects; plus agile is the current, favored development meth-odology in many organizations. Next, a project that showsvalue early is easier to “sell” to management initially and tosustain as the project continues. Finally, smaller, more realisticgoals are easier to achieve while building the confidence andcapability of agile teams. When designing your uses cases and

    assembling your teams, focus on quicker wins that show a ben-efit rather than risking an overly ambitious project.

    Automate and Centralize YourData Management

    Big data management practices are the defining factor in aproject’s success. Too often, organizations find themselveswith fragmented, disparate teams attempting manual or one-off processes under the guise of big data management. Theseefforts often evolve from a lack of centralized direction fromthe top and never define an enterprise toolset to manage bigdata. In these situations, despite individuals’ best efforts,success is often elusive.

    To avoid this situation, at the outset define a core team tomanage big data for the organization. Empower this team andbreak down organizational barriers to their success, oftenrelated to data ownership and access. Next, give them theenterprise-class tools required to do the heavy lifting of inte-gration, governance, and security in an automated manner.The goal is to develop a team of data experts who focus theirtime and the organization’s resources on solving data chal-lenges rather than battling with other internal groups or

    struggling with inefficient tools and labor-intensive processes.

    Leverage Data Lakes Not all efforts require the same characteristics of data qual-ity or access, or are even used for the same purpose. In somecases, data scientists use data experimentation, visualization,and advanced analytic tools to explore possibilities. In other

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    47/53

    Chapter 5: Ten Essential Tips for Succeeding with Big Data Management 41

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    cases, business analysts use reporting and BI tools to makekey decisions. Many other examples exist, and they all have

    different data requirements.

    Data lakes provide powerful capabilities to meet differentrequirements from the same repository of raw data. A datalake stores a large amount of raw data in its native format; it’stagged with unique identifiers and metadata, but it’s still raw.When a business question is asked, the data is queried andthe appropriate, smaller subset of data is returned to answerthe question. This is in contrast to data marts, which offer a

    pre-built subset of data designed for a specific use case; inmany situations the siloed nature of data marts is a liability.

    The power of the data lake is that the same repository of rawdata can be used for different use cases. Data scientists canuse the data lake for their research while business analystsaccess more curated and governed datasets for their opera-tional requirements. Sharing the same data lake for differentpurposes adds flexibility to the organization without the over-

    head cost of redundant, purpose-specific data marts.

    Create Collaborative Methodsfor Governance

    Establishing data governance is not a one-time event. As with

    most policies, after creation it is necessary to continuouslymonitor results and periodically revise policies to ensure theyare relevant and make sense for the organization.

    Establishing data governance methods is a collaborativeapproach between multiple stakeholders, but the two coregroups are IT staff and business experts. The work itself is acontinual cycle with key phases being discovery, definition,execution, and continuous monitoring. During discovery, auto-

    mated tools discover data domains, and data stewards profilethe data to gain insight into the quality of the data. During defi-nition phase, the business glossary, metadata, data taxonomy,data quality, data matching, data access, and retention rules areestablished. In the execution phase, data stewardship, masterdata management, and provisioning policies are applied. Thefinal phase uses measurement and continuous monitoring ofresults. The cycle is repeated as needed, and changes are identi-fied and implemented. Throughout this cycle, careful collabora-

    tion between business and IT stakeholders occurs.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    48/53

    Big Data Management For Dummies, Informatica Special Edition42A universal metadata catalog is a key capability supportingdata governance processes. It provides the ability to holistically

    understand and manage data assets and their relationshipsfacilitating search, discovery, collaboration, and automation.

    Identify Data Quality Issues EarlyFew problems get better by themselves with age; the sameconcept holds true with data quality issues. Data quality issuesdegrade the integrity and trust in big data output. The soonerissues are identified, the sooner they can be addressed.

    Tools and processes exist to identify problem data. Afterdata has been initially ingested, cleansed, and processed, themethod of applying data scorecards begins. You must definedata profiles for the data quality scorecards and rules to beapplied to the data. Once applied, you can address exceptionsin the data both automatically and through alerts that requirehuman intervention, and monitor scorecard results. Use ofdata scorecards will help ensure data quality is maintained.

    See Your Data and Relationshipswith a 360-Degree View

    Seeing the complete picture of your data provides the great-est opportunity to identify opportunities and mitigate risks.Multiple views of data exist, but two primary perspectives toconsider are security and relationships.

    Viewing data from a security perspective entails several differ-ent factors. First, you need to identify and protect your sensi-tive data across the whole of your operations, not just in onearea. Siloed data protection is not effective if one area isn’tprotected; take a holistic approach. Set up effective securitypolicies with audit triggers and notifications for key events.These efforts will give you a much more complete view and,therefore, control of your data.

    Identifying relationships within datasets is where businessvalue and opportunities exist; you must see these relation-ships to seize the greatest benefit possible. Place your effortsinto identifying relationships between customers, intermedi-ate parties, and products to know where best to leveragesales opportunities or anticipate specific events.

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distribution, or unauthorized use is strictly prohibited.

  • 8/18/2019 9781119250555 Big Data Management for Dummies Informatica Ed

    49/53

    Chapter 5: Ten Essential Tips for Succeeding with Big Data Management 43

    These materials are © 2016 John Wiley & Sons, Inc. Any dissemination, distr ibution, or unauthorized use is strictly prohibited.

    Work with Expert Vendors to Accelerate Your Deployments

    A common error in companies engaged in big data projectsis they try to do too much themselves without assistance.Big data management and analytics is an inherently complexdiscipline; it’s not easy. Furthermore, most IT staff are almostentirely occupied just keeping current operations running;engaging in major projects outside their area of expertise often

    results in long-running projects with suboptimal results.

    A wise solution is to intelligently engage vendors specializingin big data technologies, management, and analytics fromthe very start of your project. Include in-house IT and busi-ness experts because they understand business processesand IT peculiarities of the organization better than anyoneelse. However, leverage the deep expertise and technologiesoffered by reputable vendors to do the heavy lifting specific

    to big data management. Leveraging the right mix of in-houseknowledge with outside expertise will yield positive resultsfaster and with less risk for your organization.

    Look for Process RepeatabilityThe most complex and time-consuming steps related to big datamanagement are accessing, cleansing, and integrating the data. Itis a given that the first time you perform these processes, it willbe tricky. That said, successful organizations position themselvesto endure this only once, rather than repeating the exact samestruggles every time a new dataset is identified and accessed.

    Implementing process repeatability is a key to success. First,avoid custom programming solutions unique to a situationand manual processes. Just as code reuse is a key to effectiveprogramming, standardizing and reusing processes and logicare highly beneficial with data management. Processes andlogic related to data cleansing and integration are often goodcandidates to standardize and reuse. Find and document pat-tern