Seize the Data With SAP_final

EDITOR’S NOTE CONFUSED BY BIG DATA HYPE? BUZZWORDS DON’T HELP CUT THROUGH THE SAP-HADOOP FOG BENEATH THE SURFACE OF DATA TRANSFORMATION Seize the Data With SAP An ever-growing assortment of business intelligence and data warehousing offerings from SAP—plus the newest swarm of industry buzzwords—can confound organizations exploring ways to house data and mine it for insight. BY ETHAN JEWETT

description

sap data

Transcript of Seize the Data With SAP_final

  • HOME

    EDITOR'S NOTE

    CONFUSED BY BIG DATA HYPE? BUZZWORDS DON'T HELP

    CUT THROUGH THE SAP-HADOOP FOG

    BENEATH THE SURFACE OF DATA TRANSFORMATION

    EDITOR'S NOTE

    A Discipline, Not a Technology

    SAP has so many approaches to business intelligence and data warehousing that one can be forgiven for not grasping them all. With the roster evolving annually, some overarching framework of understanding is needed. That's what consultant Ethan Jewett, an SAP Mentor who specializes in BI and data management issues, offers in this three-part guide.

    Elsewhere on SearchSAP, Jewett has written that the practical methods involve either the data warehouse modeling and management application, Business Warehouse, or a mix of SAP and third-party tools. But data warehousing is really a discipline for integrating and managing data over time.

    First, he puts the biggest buzzwords in data management in perspective and shows why plain-English words like honesty, integrity and transparency are better signposts. Honesty, for example, means telling the truth about the accuracy of data so users know how reliable it is for prediction and other analysis.

    Next, Jewett addresses Hadoop, today's most talked-about technology for big data, and the target of many BI projects. He lays out how well SAP products such as HANA and Data Services integrate with Hadoop's various parts.

    The guide closes by examining the analytics and visualization software through which BI users interact with data. It's impossible to understand these user-interface tools without knowing something about the data transformation that products including BW use to first filter and organize the data.

    Jewett says separating the layers invites trouble by masking data-integrity issues, and he argues for a seamless, integrated approach that gives users the power to fix problems on the spot.

    David Essex, Executive Editor, SearchSAP

    ANALYTICS

    Confused By Big Data Hype? Buzzwords Don't Help

    As anyone who's spent any time in IT knows, buzzwords are big. And nowhere are they bigger than in data management. The hype swirling around big data, driven by declarations like "Data scientists running real-time, in-memory predictive analytics on big data will surely be a game changer for your business!", is commonplace in today's market.

    In reality, the products or services behind these data buzzwords may disappoint. Gleaning important insights from your data is a difficult, labor-intensive and often tedious process.

    This isn't to say that all vendorspeak is meaningless, but it can be tough to tell the difference between genuine technical terms and cant. The former is a marker of expertise, while the latter is indicative of sloppy thinking. I don't consider myself an expert in statistics or data science, but I've learned enough about the concepts and techniques behind the buzzwords to know that each comes with its own tradeoffs and pitfalls. If we're not clear on what these techniques are when embarking on data-driven projects, we run the risk of project failure.

    FALSE PREDICTIONS

    For example, what we often refer to as predictive analytics are algorithms that find potential correlations and trends in data. Predictive algorithms aren't actually predictive; at best, they tell you what will probably happen if the future is like the past. At worst, they are highly susceptible to false positives, in which correlations that don't actually exist are mistakenly identified.

    False correlations appear because of the random distribution of the data or through an error in the analysis method. They don't indicate a real-world phenomenon. Tools that make predictive algorithms accessible and easier to run can exacerbate the false positive problem because the more analyses are run, the more likely it is that random error will produce the appearance of a correlation. Untrained operators, and even trained operators, have a tendency to forget about the analyses that didn't find a significant correlation and remember the analyses that got a result. But finding correlations is inevitable when running enough analyses. True analytics software should be smart enough to recognize this.

    Worse, error is rarely random. There is always a process by which the data was gathered and consolidated. At every step in that process there is the opportunity to introduce errors in the data. These errors will tend to introduce false correlations. For example, you might do an analysis on profitability data, but that data is missing sales figures for several products from the western U.S. because of a bug that was introduced to the system earlier in the year. Your predictive analytics software will show that the region's profit contribution has been going downhill and will probably continue down that path. In reality, your analysis is missing sales data but including cost and overhead data for these products. Your software might make it look like you should cut overhead to compensate, when it should really remind you to check your data, or suggest that the software deployment introducing the bug seems to be correlated with the change in performance for the region and the disappearance of revenue for several products. Software vendors haven't included this kind of analysis functionality in their software because it's very hard to engineer and it doesn't address the buzzwords that are driving software industry sales.
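    The point that running enough analyses makes a "finding" inevitable can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a description of any analytics product: the function names, the 12-point series length and the 0.6 correlation threshold are all chosen here for illustration. Every series is pure random noise, so any correlation that clears the threshold is a false positive by construction.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def count_spurious(num_analyses, n_points=12, threshold=0.6, seed=0):
    """Correlate many random series against one random target.

    All inputs are noise, so every "hit" counted here is spurious.
    n_points and threshold are illustrative assumptions.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(num_analyses):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson(candidate, target)) >= threshold:
            hits += 1
    return hits

if __name__ == "__main__":
    for runs in (10, 100, 1000):
        print(runs, "analyses ->", count_spurious(runs), "apparent correlations")
```

    An operator who remembers only the hits and forgets the misses would come away from the larger runs with a list of relationships that do not exist.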

    UNDERSTANDING THE VALUE OF DATA

    On that note, I'd like to introduce a few terms of my own that could help unlock the potential of data:

    Honesty: Always showing the data as it is to the best of our ability. For example, showing error bars on our charts and making sure that we don't imply a level of accuracy that doesn't exist, both in the data and in visualizations based on our data. The predictive software mentioned above might give the impression that the data is reliable when it's not. This software would fail an honesty test.

    Integrity: Making sure that our data directly reflects reality. This means expending effort to avoid a situation in which the measurement, collection and preparation methods we use introduce their own trends into our data. The example above, with its missing sales figures for several stock keeping units, shows a lack of integrity in our data preparation.

    Transparency: Ensuring the honesty and integrity of our data. Ideally, when a person is looking at any data, the details of every step of the process, from measurement to collection and aggregation to visualization, should be available so that the viewer can assess the quality of the data. For example, the analytics software mentioned above that shows a profit margin trend line for the western U.S. should also show information about the source of the data, which might lead an operator to notice that the beginning of the downward trend correlated with the introduction of a new software deployment.

    This kind of transparency requires maintaining meaningful data lineage information and making that information directly available in the analytic context.
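    As a rough illustration of integrity and transparency working together, the sketch below attaches lineage notes to each figure and flags products that carry cost but no revenue, which is the gap in the western U.S. example. The class names, field names and sample values are hypothetical, invented for this sketch rather than taken from any SAP product.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LineageStep:
    stage: str   # e.g. "collection" or "aggregation" (hypothetical labels)
    detail: str  # free-text note about what happened at this stage

@dataclass
class ProductFigures:
    product: str
    region: str
    revenue: Optional[float]  # None means no sales figures arrived
    cost: float
    lineage: List[LineageStep] = field(default_factory=list)

def missing_revenue(rows):
    """Return rows that carry cost but no revenue: a likely integrity gap."""
    return [r for r in rows if r.revenue is None and r.cost > 0]

def describe(row):
    """Render a figure together with its lineage so a viewer can judge it."""
    steps = " -> ".join(f"{s.stage} ({s.detail})" for s in row.lineage)
    revenue = "MISSING" if row.revenue is None else f"{row.revenue:.2f}"
    return (f"{row.product} [{row.region}] revenue={revenue} "
            f"cost={row.cost:.2f} | {steps}")

if __name__ == "__main__":
    rows = [
        ProductFigures("SKU-1", "West", None, 120.0,
                       [LineageStep("collection", "nightly sales extract"),
                        LineageStep("aggregation", "monthly rollup")]),
        ProductFigures("SKU-2", "West", 300.0, 90.0,
                       [LineageStep("collection", "nightly sales extract")]),
    ]
    for row in missing_revenue(rows):
        print("CHECK DATA:", describe(row))
```

    The design choice worth noting is that the lineage travels with the figure, so the warning can be shown in the same place as the trend line instead of in a separate system.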

    The bar I'm setting here is high, perhaps, but here's the takeaway: The persistent and most common problem in the data management business isn't handling size, providing speed or automatically predicting the future. The problem is getting quality data in front of experts in an honest, transparent format that provides good interactions with the data and helps them draw their own conclusions with confidence.

    Buzzwords like big data, real time, in-memory and predictive analytics don't provide business value on their own, but in the service of honesty, integrity and transparency they can make a major contribution to the value of our business data.

    INTEGRATION

    Cut Through the SAP-Hadoop Fog

    Hadoop is hot. But what is Hadoop? It's an umbrella project under The Apache Software Foundation that includes several core tools for handling data processing on large computing clusters. There is also a large ecosystem of related tools around the core Hadoop project, and there are multiple Hadoop distributions from companies like Cloudera, Hortonworks, IBM, Intel and MapR. Each distribution offers some combination of the core tools, ecosystem tools and, often, proprietary replacements for other pieces of the Hadoop pie that the distribution packager considers better in some way.

    There is no one tool or set of tools called Hadoop, so it is wise to react cautiously when vendors claim to offer Hadoop integration. The vendor may integrate with a single tool in the Hadoop core or ecosystem, or with several, or with none at all. SAP's integration with Hadoop suffers from this confusion as much as any vendor's, so I thought it would be worthwhile to dig into exactly how SAP's software integrates with the various Hadoop tools.

    First, let's define Hadoop. It includes a few core tools. Those are:

    Hadoop Distributed File System (HDFS): A distributed file system that can run on a large cluster of computers to store huge amounts of data. Other Hadoop tools tend to be set up to use data stored on HDFS.

    YARN (Yet Another Resource Negotiator): The core cluster resource management framework. Most Hadoop ecosystem tools run on a YARN cluster.

    MapReduce: A system for doing parallel processing of large data sets; it's based on a Google research paper from 2004. This was the original Hadoop, but few vendors that offer Hadoop integration use MapReduce directly.
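    For readers who have not seen the MapReduce programming model, the following single-process sketch shows its three phases: map, shuffle and reduce. It is an illustration of the idea only, assuming a simple word-count task; it does not use the Hadoop API, and a real cluster would spread each phase across many machines. All function names are hypothetical.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def sum_reducer(_key, values):
    """Add up the counts emitted for one word."""
    return sum(values)

def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)

if __name__ == "__main__":
    print(word_count(["big data", "Big hype"]))
```

    Because the mapper sees one record at a time and the reducer sees one key at a time, both phases can be run in parallel, which is what makes the model suitable for large clusters.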

    DIFFERENCES IN INTEGRATION

    Hadoop also has a massive ecosystem of tools built around or on top of these core tools. Some ecosystem projects are also hosted at Apache. Others live elsewhere. The following are a few key projects hosted at the Apache Software Foundation (ASF):

    Hive: Billed as the Hadoop data warehouse, Hive is a distributed database with a data definition and query language, called HQL, that is similar to standard SQL. Hive tables can be managed by Hive, or they can be defined as external tables on top of files on HDFS, HBase and many other data sources. In this way, Hive is often a gateway to data stored in Hadoop ecosystem tools.

    Pig: A language and execution platform for creating data analysis programs.

    HBase: A massively parallel, short-request database, originally modeled on Google's Bigtable research paper.

    Other projects include Spark (in-memory cluster computing and streaming framework), Shark (Hive on Spark), Mahout (analytics algorithms library), ZooKeeper (a centralized service for maintaining information on configuration and other factors) and Cassandra (similar to HBase).
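    Hive's external table concept, described in the Hive entry above, is easier to picture with a concrete statement. The helper below is a small sketch that builds an HQL `CREATE EXTERNAL TABLE` statement for delimited text files already sitting on HDFS. The function name, table name, columns and path are all hypothetical, and the helper does no identifier escaping, so it should be read as an illustration and not as production code.

```python
def external_table_ddl(table, columns, hdfs_path, delimiter=","):
    """Build HQL that defines a Hive table over existing delimited files.

    columns is a list of (name, hive_type) pairs. Nothing is copied:
    Hive only records where the files live and how to parse them.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_path}'"
    )

if __name__ == "__main__":
    # Hypothetical table and path, for illustration only.
    print(external_table_ddl(
        "sales_raw",
        [("product", "STRING"), ("region", "STRING"), ("revenue", "DOUBLE")],
        "/data/sales",
    ))
```

    Once a table like this is defined, any tool that can send HQL to Hive can query the underlying files as if they were an ordinary database table.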

    So how do SAP's products integrate with Hadoop tools? At the moment, SAP offers what it calls Hadoop integration in SAP HANA, Sybase IQ, SAP Data Services and SAP BusinessObjects Business Intelligence (BI). Each of these integrates with Hadoop tools differently.

    SAP HANA and Sybase IQ both support forwarding queries and other operations to a remote Apache Hive system as if the Hive tables were local tables. In Sybase IQ, this setup is called a remote database, and in HANA the setup is through the Smart Data Access mechanism. IQ also supports what it calls a MapReduce API, a type of user-defined function that processes data on the database server. Despite SAP lumping this API under its Hadoop integration marketing, it has nothing to do with Hadoop.

    SAP BusinessObjects BI supports access to Apache Hive schemas through the universe concept, much as you might connect to any other database. This type of connection theoretically allows access to data in many different storage systems through Hive's external table concept, including HBase, Cassandra and MongoDB.

    THE HADOOP INTEGRATION PROMISE

    So far we've seen that SAP's Hadoop integration is usually just Hive integration. Integrating with Hive via HQL is great and is what most vendors mean when they claim Hadoop integration. But it's different from the image of deep integration across the varied Hadoop ecosystem tools that these vendors want to project.

    SAP Data Services starts to deliver on the Hadoop integration promise a bit more. In addition to the ability to load data to and from Hive, Data Services can create and read HDFS files directly and do some transformation push-down operations using Pig scripts. This means that data can be joined and filtered directly in the Hadoop cluster rather than needing to move to the Data Services server to be processed. Data Services also is able to offload its text data processing onto a Hadoop cluster as MapReduce jobs. So here, SAP is justified in implying deeper integration across multiple Hadoop tools.

    Lastly, a word of warning: The Hadoop ecosystem moves fast and enterprise software often lags Hadoop. According to SAP's product availability matrix, support for Hive, Pig and HDFS is limited to fairly old versions that don't support the latest improvements in performance, high availability and cluster capacity. Check vendor claims of support for your versions of specific Hadoop tools carefully because Hadoop versioning is confusing and enterprise software vendor representatives may not understand it fully.
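    The version check recommended above can be made routine with a few lines of code. The sketch below compares the tool versions installed on a cluster with the newest versions a vendor says it supports. Every version number in it is made up for illustration and does not come from SAP's product availability matrix, and the parser assumes plain dotted numeric versions, which real distribution version strings often are not.

```python
def parse_version(text):
    """Turn '0.12.1' into (0, 12, 1). Assumes plain dotted numbers only."""
    return tuple(int(part) for part in text.split("."))

def unsupported_tools(installed, supported_max):
    """Report tools that are unlisted or newer than the vendor supports.

    installed and supported_max both map tool name -> version string.
    """
    problems = {}
    for tool, version in installed.items():
        limit = supported_max.get(tool)
        if limit is None:
            problems[tool] = "not listed in support matrix"
        elif parse_version(version) > parse_version(limit):
            problems[tool] = f"installed {version} is newer than supported {limit}"
    return problems

if __name__ == "__main__":
    # Hypothetical versions, for illustration only.
    installed = {"hive": "0.13.0", "pig": "0.12.0", "hdfs": "2.4.0"}
    supported_max = {"hive": "0.12.0", "pig": "0.12.0"}
    for tool, reason in unsupported_tools(installed, supported_max).items():
        print(f"{tool}: {reason}")
```

    A check like this does not replace reading the vendor's documentation, but it turns a vague assurance of "Hadoop support" into a per-tool question with a yes or no answer.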

    OUTLOOK

    Beneath the Surface of Data Transformation

    Data transformation and preparation, data visualization and business intelligence software are undergoing a sea change, even if it sometimes seems nothing much has changed in the last 15 years.

    The transformation overtaking the industry appears to be in its early days but is driven by persistent problems with IT agility, data quality and the lack of transparency in the systems that manage and display data.

    We are clearly moving in the direction of faster and more visual interaction with data, but we are only scratching the surface with regard to understanding and interacting with it.

    THE OLD STANDBYS

    In current standard software products, data transformation operations like combining, filtering and fixing data are strictly separate from data visualization and analysis functions. Transforming or changing data is a task usually reserved for technical people and accomplished with process-oriented tools like SAP's Data Services and Business Warehouse (BW) and standard computer programming languages like Java or Python.

    The output of transformation tools, usually fairly static database tables, is the input for separate data analysis and visualization. Most tools, like SAP's Crystal Reports, allow users to run prepared queries to illustrate a single aggregated slice of the database. More advanced data analysis tools allow the user to navigate with some flexibility within the bounds of the pre-existing data set. Usually these more flexible tools appear as analytics tools (SAP's Analysis for Office or Design Studio dashboards), though there is no reason these types of flexible but constrained analyses might not be useful in business process contexts.

    Some existing tools, usually billed as self-service BI or data exploration, incorporate basic data preparation capabilities, usually using a process- or programming-based view of the data preparation stage. Tableau Software and QlikView were two pioneers of this approach, providing fairly advanced data visualization capabilities on a platform where the user was responsible for all data loading and preparation tasks. SAP's Lumira follows in these footsteps, giving users a way to load new data, connect to existing data sets or join some combination of data sets, and then visualize the data.

    But the strict separation of the visualization or analysis process from data transformation is a nagging weakness of all these tools. When do people realize there's a problem with data that needs to be resolved? When they are visualizing it or running analytics functions on it. So why not allow a user to fix the problem then and there?

    ON THE HORIZON

    A different approach to data transformation is emerging as a popular alternative. It trades the process-oriented approach for one more closely aligned with the internal structure of the data being processed. That approach is to display even very large data sets as spreadsheets and provide the user with data transformation options that are mapped onto the spreadsheet paradigm. This is not a new approach, but the cohort of tools developed around 2010 to 2012 (OpenRefine, Data Wrangler, IBM's BigSheets) was the first of this type to gain widespread adoption.

    The idea is that the spreadsheet or table is a pretty direct visual representation of the raw structure of many standard data formats. Showing a database table in a tabular format makes its structure and a small amount of the data in the table explicit. Given the proper tools, that structure and data can be manipulated in a way that is immediately visible in the spreadsheet view, and which can be mapped back onto the original data set.

    It appears that spreadsheet-driven data transformation has legs, getting good uptake in the form of OpenRefine. And it's receiving significant attention in upcoming products like Trifacta and Spark Cloud, the latter of which uses related concepts of tabular representations of data. This approach begins to address the severe lack of analytics and visualization tools integrated into the data transformation process, giving the people processing data the tools to assess and understand data as they change it. But deep analytics and specialized visualization tools remain separate.
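    The spreadsheet-driven idea, in which edits made on a small visible sample are mapped back onto the full data set, can be sketched in a few lines. This is a toy model under stated assumptions, not how OpenRefine or any other named tool is implemented: the class and function names are hypothetical, rows are plain dictionaries, and each transformation is a function from one row to a new row.

```python
def trim_column(column):
    """Build an operation that strips whitespace from one column."""
    def op(row):
        new = dict(row)
        new[column] = new[column].strip()
        return new
    return op

def replace_value(column, old, new_value):
    """Build an operation that rewrites one exact value in one column."""
    def op(row):
        new = dict(row)
        if new[column] == old:
            new[column] = new_value
        return new
    return op

class TableSession:
    """Apply operations to a visible sample and record them for replay."""

    def __init__(self, sample_rows):
        self.preview = [dict(r) for r in sample_rows]
        self.history = []

    def apply(self, op):
        # The user sees the effect on the sample immediately.
        self.preview = [op(r) for r in self.preview]
        self.history.append(op)
        return self.preview

    def replay(self, full_rows):
        # The same recorded steps are mapped back onto the full data set.
        for row in full_rows:
            for op in self.history:
                row = op(row)
            yield row

if __name__ == "__main__":
    session = TableSession([{"region": " West "}, {"region": "W"}])
    session.apply(trim_column("region"))
    print(session.apply(replace_value("region", "W", "West")))
    print(list(session.replay([{"region": "W "}, {"region": "East"}])))
```

    Keeping the history as a list of small operations is the design choice that lets a change made while looking at a few hundred rows be applied consistently to millions.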

    THE FUTURE

    The current trend is to make data transformation a more visual experience, making the results of data transformations on the data set itself more explicit and immediate. But the job of extracting meaning from data is still left to more specialized interfaces, usually operating on aggregated slices of the full data set and often featuring visual abstractions like charts and graphs.

    But there's a tension implicit in this arrangement: As stated already, understanding data and extracting meaning from it is an integral part of the process of transforming data. One can't really know how to transform a data set without understanding it, and it's usually in the process of extracting meaning from data that we find problems with the data that need to be fixed, or realize that the data is incomplete for our purposes and has to be augmented with another data set. In other words, the process of visualization is exactly the point at which we want to be able to change the underlying data, but our tools prohibit us from doing this.

    I expect that over the next five to 10 years, we will begin to see this tension addressed in earnest, with more products allowing editing or augmentation of data through the visualization interface.

    It's currently an active area of research, including in the Palladio research project, on which (full disclosure) I work as lead developer.

    In some sense, the products based on the spreadsheet paradigm are one of the first mass-market implementations of this approach. Most likely, these products and others like them will continue to improve their visualization capabilities while maintaining the ability to change data through these visualizations. If visualization-focused vendors are paying attention, they will also start to incorporate data manipulation capabilities into their visualization tools. It will be interesting to see who will manage to address this gap most quickly and comprehensively.

    ABOUT THE AUTHOR

    ETHAN JEWETT is an independent consultant and SAP Mentor who focuses on business intelligence, data management and performance management. Follow him on Twitter: @esjewett.

    Seize the Data With SAP is a SearchSAP.com e-publication.

    Scot Petersen | Editorial Director

    Jason Sparapani | Managing Editor, E-Publications

    Joe Hebert | Associate Managing Editor, E-Publications

    David Essex | Executive Editor

    Linda Koury | Director of Online Design

    Neva Maniscalco | Graphic Designer

    Doug Olender | Publisher

    Annie Matthews | Director of Sales

    TechTarget 275 Grove Street, Newton, MA 02466

    www.techtarget.com

    © 2014 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form or by any means without written permission from the publisher. TechTarget reprints are available through The YGS Group.

    About TechTarget: TechTarget publishes media for information technology professionals. More than 100 focused websites enable quick access to a deep store of news, advice and analysis about the technologies, products and processes crucial to your job. Our live and virtual events give you direct access to independent expert commentary and advice. At IT Knowledge Exchange, our social community, you can get advice and share solutions with peers and experts.

    COVER PHOTOGRAPH: DIGITAL VISION/THINKSTOCK

    STAY CONNECTED!

    Follow @SearchSAP today