
Reporting - Data Architecture Strategy

(Draft 2012-06-27 for Discussion)

BUILDING THE DATA FOUNDATION FOR REPORTING & ANALYTICS


TABLE OF CONTENTS

1 EXECUTIVE SUMMARY
  1.1 INTRODUCTION
  1.2 ARCHITECTURE DIAGRAM
  1.3 PRIMARY TOPICS

2 WHAT DATA WILL BE MADE AVAILABLE FOR REPORTING & ANALYTICS
  2.1 SELECTING DATA FROM SOURCE SYSTEMS AT THE TABLE LEVEL
  2.2 IDENTIFICATION OF DESIRED SAP TABLES
    2.2.1 COMMONLY USED TABLES BY OTHER SAP CUSTOMERS
    2.2.2 ADDING SAP TABLES BASED ON ANTICIPATED BUSINESS NEED
    2.2.3 FLEXIBILITY TO ADD ADDITIONAL SAP TABLES IN FUTURE
  2.3 IDENTIFICATION OF DESIRED TABLES FROM OTHER SYSTEMS
  2.4 CONSIDERATIONS RELATED TO DATA TYPE AND MOVEMENT
    2.4.1 TYPE OF DATA
    2.4.2 TOTAL VOLUME OF DATA
    2.4.3 METHOD OF DATA MOVEMENT

3 HOW WILL THE DATA BE MOVED?
  3.1 EXTRACTORS FOR PRE-DEVELOPED CONTENT
  3.2 EXTRACTORS FOR TABLE LEVEL EXTRACTION AND LOADING
  3.3 RFCs AND BAPIs
  3.4 REPLICATION


1 Executive Summary

1.1 Introduction

This Data Architecture and Strategy document (the “Data Architecture”) is the second document in a series of three documents that together outline and memorialize the reporting and analytics strategy of Kiewit.

The other two documents are the [Reporting Process Strategy] document and the [Reporting Tool Selection] document. The Data Architecture consists of a diagram (see Section 1.2) and accompanying text that describes each aspect of the diagram.

The purpose of the Data Architecture is to provide an end-to-end view of where Kiewit is headed with respect to the data layer needed to support transactional and operational reporting, as well as a variety of analytical applications.

The Data Architecture needs to allow for rapid, incremental success in the areas of transactional and operational reporting, while at the same time laying the groundwork for advanced and sophisticated achievements in areas such as predictive analytics. The architecture must also be the readily adaptable bridge between the more stable domain of data collection in transactional systems and the evolving marketplace of new front-end tools and delivery methods.

1.2 Architecture Diagram

See [attached / appendix]

1.3 Primary Topics

The primary topics described are:

- WHAT data will be included?
- HOW will the data be moved?
- WHERE will the data be moved?
- WHAT development will take place on the data prior to consumption for reporting and analytics?
- WHO will do that development, and WHEN during the process?

These questions will be addressed according to key strategies and principles, without regard to specific toolsets.


2 What Data Will be Made Available for Reporting & Analytics

The data will consist of both SAP data and non-SAP structured data from other Kiewit systems such as Hard Dollar and Telematics. Unstructured data (e.g., retained email for legal compliance) is not included in the scope.

The first step in the strategy is to identify the data needed to support anticipated reporting and analytics needs. Not all of the data collected in SAP and non-SAP systems will have relevance for reporting and analytics. The strategy in this section outlines the process to determine the subset of relevant data to include.

2.1 Selecting Data from Source Systems at the Table Level

The first strategic decision is whether to pull pre-delivered content (i.e., fixed data sets for specific reporting or analytical use) or alternatively, to pull table-level data from the source systems. The primary pros and cons of each alternative are:

“Pre-developed content”

- Pros:
  - Matches to a specific data source and/or output tool, giving rapid results for the defined scope
- Cons:
  - Only matches to a specific data source and/or output tool
  - Not transparent, which makes modification difficult
  - Pulls may take more time, since pre-defined extractors are more complicated

“Table by table basis”

- Pros:
  - Flexible for developers later when reporting needs change, since whole tables are available, and specific new extractors do not need to be developed
  - Method proven during Kiewit POC for V0
  - Other?

To lay the proper groundwork for long term viability, the Data Architecture relies on pulling data at the table level. Existing data assets that have been sourced as pre-developed content will be maintained as needed; however, new development should focus on a data foundation built from tables pulled from the SAP and non-SAP source systems.


2.2 Identification of Desired SAP Tables

Of primary importance is determining the desired SAP tables to include in the reporting and analytics universe. Since SAP has well over 80,000 tables, it is impractical and actually counterproductive for the development team to simply pull them all. On the other hand, an adequate “cushion” of tables is sought to ensure that progress in development of reports and analytic outputs is not derailed by the need to stop the process while new tables are brought into the reporting environment.

2.2.1 Commonly Used Tables by other SAP customers

SAP subject matter experts will provide a list of “usual suspects” that are tables commonly used by SAP customers for their reporting and analytics. This list should include both tables with substantive data, and ancillary related tables. This list should be viewed as a starting point, and not as a final list that would meet Kiewit’s needs.

Another list to inform Kiewit’s decision would be the list of SAP tables currently being pulled for reporting by TIC. This list is attached as [Appendix A-1]. There are differences in the SAP environments and the reporting needs of Kiewit and TIC, so this list should also not be viewed as a final list to meet Kiewit’s needs. However, it is a helpful comparison point.

In the next step, the SME list will be compared with the TIC list. A new list will be created, called the Baseline SAP Table List, which will include all tables that appear on either the SME list or the TIC list. The Baseline SAP Table List will comprise the minimum list of tables to be pulled from SAP into the reporting and analytic environment.
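The construction of the Baseline SAP Table List can be sketched as a simple set union. The table names below are illustrative examples only, not the actual SME or TIC lists:

```python
# Hypothetical sketch of building the Baseline SAP Table List: the union of
# the SME "usual suspects" list and the TIC list. Table names are examples.
sme_list = {"VBAK", "VBAP", "BKPF", "BSEG"}   # tables suggested by SAP SMEs
tic_list = {"BKPF", "BSEG", "EKKO", "EKPO"}   # tables currently pulled by TIC

# Every table appearing on either list is included; duplicates collapse.
baseline_sap_table_list = sorted(sme_list | tic_list)
print(baseline_sap_table_list)
```

A table appearing on both lists (e.g., BKPF in this example) is included only once in the baseline.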

2.2.2 Adding SAP Tables Based on Anticipated Business Need

Since Kiewit is new to SAP, in the near future the business users are unlikely to know SAP table names of interest. Rather than seeking to gather their direct input at this time, the strategy is to anticipate their likely needs and pull adequate tables to cover those needs. In addition to the baseline list of tables described in Section 2.2.1, the following tables will also be included:

- Tables utilized for the V0 Proof of Concept (list from KieCore)
- Tables anticipated for usage in V1 (table list to be developed by consultation among [WHO] and an SAP subject matter expert)
- Other tables of likely business interest and not already included, based on the installed modules of SAP (table list to be developed by consultation among [WHO] and an SAP subject matter expert)
- Ancillary tables that are needed for meaningful reporting and analytics (table list to be developed by a subject matter expert with knowledge and tools to find related ancillary tables)


These tables are then added to the table list developed in Section 2.2.1, forming the comprehensive initial set of SAP tables. This set is intended to be comprehensive and should need only minimal additions in the future.

2.2.3 Flexibility to Add Additional SAP Tables in Future

Ideally, the process described in Sections 2.2.1 and 2.2.2 will result in the pulling of all of the SAP data needed to meet the near and intermediate term desires of the business for reporting and analytics, without cluttering the reporting landscape with thousands of tables that are clearly not relevant. By pulling tables, and not pre-designed outputs, there is always the flexibility to develop and redevelop data assets starting with the tables, rather than having to go back to the first step of building a new specific extractor. In the event that a handful of tables are not included in the initial universe, and are later identified as important for the reporting landscape, those additional tables could be readily added. The protocol for adding tables needs to be established after the tool selection for data movement.

2.3 Identification of Desired Tables from Other Systems

KieCore has currently identified [670] tables from Hard Dollar, Telematics, and other applications that are desired for the reporting and analytics environment. These tables are attached in [Appendix ___]. Changes to this list would be made based on the [SEE PROCESS DOCUMENT.] As with the SAP tables, there is future flexibility to add additional tables from a variety of non-SAP source databases.

2.4 Considerations Related to Data Type and Movement

After the desired initial data scope is determined, an evaluation should be made regarding the characteristics of the data and any particular implications related to the nature of the data and/or the proposed methods of moving the data. Some of the implications are highly dependent on tool selection for data movement, and on tool selection and nature of the target reporting system.

2.4.1 Type of Data

Some categories of data present more technical constraints than others. For example, SAP cluster tables are not accessible in the same ways as SAP non-clustered tables. Another example is that SAP data from different SAP functional modules (e.g., SD and FI) cannot be readily combined for reporting purposes inside the SAP landscape, making cross-functional reporting a challenge.


2.4.2 Total Volume of Data

At certain thresholds, very large data volumes become unwieldy and more expensive to manage. Compression of data and design of the database can reduce the total volume of data in the reporting environment. If the source SAP data is not already compressed, then compression in the target reporting environment would often be in the range of 7:1 to 10:1, depending on the specific SAP table being compressed. [Do Hard Dollar, Telematics and other applications pose similar “big data” challenges?] With regard to database design, a compressed ODS could hold a fraction of the data volume of the source systems, while some multidimensional data warehouse designs could have a data volume that grows non-linearly, faster than the rate of growth in the source systems.
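As a rough sizing illustration, the 7:1 to 10:1 compression range can be applied to an assumed source volume. The 5 TB figure below is purely hypothetical, not a measured Kiewit volume:

```python
# Illustrative arithmetic only: applying the 7:1 to 10:1 compression range
# to an assumed (hypothetical) uncompressed SAP source data volume.
source_tb = 5.0  # assumed uncompressed source volume, in TB

for ratio in (7, 10):
    compressed_tb = source_tb / ratio
    print(f"{ratio}:1 compression -> {compressed_tb:.2f} TB in the reporting environment")
```

Under these assumptions, 5 TB of uncompressed source data would occupy roughly 0.5 to 0.7 TB in the compressed reporting environment.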

2.4.3 Method of Data Movement

In some methods of data movement, there is an economy of scale when moving multiple tables. In other methods of data movement, there is no economy of scale, whether with regard to establishing the initial pull of data or with regard to maintenance.


3 How will the Data be Moved?

The Data Architecture analyzes a number of methods to move the data from the source SAP and non-SAP systems into the target, and identifies the recommended methods for both the near and longer term. The methods reviewed are:

- Proprietary Extractors for pre-developed content
- Proprietary/Generic Extractors for table level extraction and loading
- RFCs and BAPIs
- Replication

3.1 Extractors for Pre-Developed Content

A variety of proprietary tools exist in the market to act as a go-between for specific data sets and specific outputs. Without reviewing specific tools, which is beyond the scope of the Data Architecture, we reviewed the pros and cons of this approach from an architectural standpoint.

On the positive side, when absolutely no modifications or customizations are required, the use of these proprietary tools can speed the time to deployment. On the negative side, very few implementations are truly “out of the box.” When customizations are required, they make the data movement layer brittle, in addition to negating the primary benefit of speed to solution.

Another downside to extractors is that they work through the central instance of SAP or another application. In the case of SAP, the extractions of pre-defined content can be extremely expensive, and must be done in windows where the added load will not interfere with SAP transactions.

Because the Data Architecture is a long term strategy and not the means to meet a short term specific need, proprietary extractors for pre-defined content are excluded from the going forward road map. They do not provide an adequate groundwork for sourcing data for myriad purposes in the future, from operational reporting to predictive analytics, and for the variety of tools (and versions/variations) that could be expected over the lifecycle of the SAP deployment.

3.2 Extractors for Table level extraction and loading

[Describe method used in V0 Proof of Concept]

3.3 RFCs and BAPIs

RFCs (Remote Function Calls) and BAPIs (Business Application Programming Interfaces) call the SAP central instance and communicate data requests via the ABAP coding language. The benefit is the ability to access all types of SAP data, including cluster tables. The downside is the potentially tremendous burden placed on the SAP central instance by the requests. Again, the result is often running the RFCs and BAPIs during quiet windows.

3.4 Replication

Replication technology is used for a wide variety of SAP purposes, such as mirroring, reporting and HA/DR. (See SAP Technical Note _______________). However, care must be taken not to use trigger-based replication strategies with SAP data sources. Replication is also a viable strategy for the non-SAP data sources needed for the reporting environment.

Replication can be of a push or pull variety, but some basics are always present. There is always some type of initializing “snapshot” of data, followed by a continual stream of updates. The initial snapshot presents some, but not all, of the challenges of a batch extract. After the initial snapshot and the placing of tables under replication, there is very little ongoing administration needed. The ongoing administration consists primarily of:

- Monitoring system resources (memory)
- Enforcing change management procedures with regard to activities that cause contention with replication, such as initiating SAP Transports
- Following protocols when updates are made to SAP and SAP tables
- Updating replication parameters when changes to the network occur (such as renaming of the source database)
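The snapshot-then-stream pattern can be sketched in miniature. The data structures and change format below are hypothetical; real replication tools manage this internally:

```python
# Toy sketch of replication's two phases: an initializing snapshot, then a
# continual stream of updates applied to the target. All names are illustrative.
def apply_snapshot(target, snapshot):
    """Load the initializing snapshot into the (empty) target."""
    target.clear()
    target.update(snapshot)

def apply_change(target, change):
    """Apply one entry from the continual stream of updates."""
    op, key, row = change
    if op == "upsert":
        target[key] = row
    elif op == "delete":
        target.pop(key, None)

target = {}
apply_snapshot(target, {1: "row A", 2: "row B"})
for change in [("upsert", 3, "row C"), ("delete", 1, None), ("upsert", 2, "row B v2")]:
    apply_change(target, change)
print(target)  # snapshot state plus the streamed changes
```

Once the snapshot is loaded, the target stays continuously current by applying each change as it arrives, which is why no lengthy batch windows are needed.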

Pros:

- Elimination of batches, deltas, queues, and other challenges inherent in extractor-based approaches
- Higher uptime for reporting and analytics – less chance of replication failing than of batch failing
- Data is always “complete”
- Data is “always on” – no shutdowns for lengthy batch runs
- Efficiency of scale in pulling large numbers of tables and large data volumes
- No development of custom extracts
- Resilient – not brittle
- Makes real-time data available; data “as of” a particular time is still available
- Enables new methods to drive SAP workflows
- Not burdensome on the SAP transactional system – “offloads” the reporting burden, resulting in better performance for both the reports and the SAP transactional system

Cons:

- Replication alone does not result in easily usable SAP data – additional software is needed
- Limited software vendors offer replication-based data movement tools specifically for SAP data
- Perception that the purpose of replication is only for real-time data

3.5 Summary of Data Movement Recommendations

The Data Architecture recommends that initial efforts focus on moving data via extractors for table level extraction and loading (Section 3.2). Longer term, one or more of these extractors could be replaced by a replication-based approach (Section 3.4). The criteria for switching tables to replication include:

- The batch has a frequent failure rate (a specific batch, or various batches in the aggregate)
- The batch processing time has outgrown the available time window (may occur as data volumes grow)
- Need to drive up efficiency over the longer term (i.e., redeploy resources from batch management to other initiatives)
- Certain data needs to be refreshed more frequently than the batch window (e.g., to facilitate month-end close)

If some tables are updated by batch and others by replication, time coherence may be managed by the design of the reporting and analytics queries. (Time coherence is also of concern when all tables are loaded by batch, although management techniques may differ.)

Extractors for pre-defined content (Section 3.1) are not a part of the going forward road map for new development, based on the limits described in that section. Any such extractors used for V0 reporting will be maintained unless the cost of maintenance outweighs the benefits, and a determination is made to redevelop the content based on one of the recommended methods of data movement.

RFCs and BAPIs (Section 3.3) are recommended only for specific data types, such as SAP cluster tables. Longer term, these RFCs would be eligible for replacement by replication-based methods under the same criteria described above.


4 Where the Data will be Moved – Initial Target is ODS

This section of the Data Architecture describes the initial target of the data identified in Section 2, and whose movement is described in Section 3.

This initial target is a single Operational Data Store (“ODS”). This section describes why an ODS is the first target, strategic considerations for the establishment of the ODS, and the basic development that will take place inside the ODS.

4.1 Why an ODS as the Initial Target for Data?

One design constraint is to move data out of the multiple source systems only once (i.e., into one target only). If data is needed in multiple reporting/analytic environments, then it should be moved from the initial target into subsequent downstream systems. This design constraint is necessary to achieve a number of objectives related to data integrity, management of extracts, multi-source data integration, etc. It therefore becomes necessary to carefully select the initial target for the data moved out of the source systems.

An ODS is only one alternative considered as the initial target for the data; other alternatives considered were a multidimensional data warehouse and a tabular database.

The Data Architecture takes an “AND” approach instead of an “OR” approach to database models. Relational and multidimensional databases offer different functionality and benefits, and both are included in the Data Architecture (see diagram in Section 1.2). An ODS (relational database) can act as a steppingstone to data warehouses (multidimensional database) and tabular databases (tabular model with many benefits similar to multidimensional databases). Conversely, neither a data warehouse nor a tabular database is suited to act as a steppingstone to the other alternatives. Therefore, the logical first target for the data is the ODS. This choice sets up the pathway to the downstream multidimensional and tabular environments. It also serves as the data source for operational reporting.

While logic dictates targeting the ODS to receive the data moved from Kiewit’s source systems, there is the positive side benefit that the ODS offers the shortest time to solution. While reporting against the ODS will not meet all of Kiewit’s business needs, it is capable of meeting the vast majority of operational reporting needs. Meeting these needs quickly will create a win-win scenario, or an upward spiral: because business users see useful reports from their transactional systems, they put more focus and effort into ensuring that data is entered properly into those systems.

Multi-dimensional data warehouses and tabular in-memory databases offer functionality and performance that are not available from an ODS. These alternatives will be discussed in later sections of this Data Architecture.


4.2 Establishment of the ODS

This Data Architecture includes a single ODS database comprised of multiple schemas related to the SAP and non-SAP data sources.

4.2.1 Single Instance – One ODS database

Having a single database instance will optimize performance, since joins between tables in a single instance take advantage of the database indexes and optimizers.

When joins are made in a single instance, there is no need for distributed transactions, which largely negate the performance improvements of modern ANSI SQL databases. Also, joins made in a single database require less maintenance effort than joins across databases.

Sometimes having multiple databases is useful to allow different security access to be enforced at the database level. Also, having different databases allows for different backup options (by database).

However, this Data Architecture concluded that the benefits of multiple databases are outweighed by the benefits of a single instance.

4.2.2 Schemas

4.2.3 Building Tables to Receive the Imported Tables

Each table from both SAP and non-SAP sources will need to have a corresponding table established in the ODS. A determination will need to be made on whether to add “audit columns” for these topics:

- Create date/time
- Created by [ID of program doing the pull]
- Update date/time
- Updated by [ID of program doing the update]

[discussion of cost of these audit columns, and benefit of having them]
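The audit-column approach can be sketched as follows. The column names, data types, and schema name below are hypothetical; actual DDL will depend on the database engine selected:

```python
# Minimal sketch (hypothetical names and types) of generating a receiving
# table that mirrors a source table and appends the four audit columns.
AUDIT_COLUMNS = [
    ("create_ts",  "TIMESTAMP"),    # Create date/time
    ("created_by", "VARCHAR(32)"),  # ID of program doing the pull
    ("update_ts",  "TIMESTAMP"),    # Update date/time
    ("updated_by", "VARCHAR(32)"),  # ID of program doing the update
]

def receiving_table_ddl(table_name, source_columns):
    """Return CREATE TABLE DDL: source columns plus the audit columns."""
    cols = list(source_columns) + [f"{name} {sqltype}" for name, sqltype in AUDIT_COLUMNS]
    return f"CREATE TABLE ods.{table_name} (\n  " + ",\n  ".join(cols) + "\n);"

# Example with a pair of illustrative source columns:
print(receiving_table_ddl("bkpf", ["mandt CHAR(3)", "belnr CHAR(10)"]))
```

Generating the receiving tables programmatically keeps the audit columns uniform across every imported table, which simplifies later lineage and troubleshooting queries.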

4.2.4 Catalog Tables

The database engine will automatically create the catalog tables after [STEP X]. [Discuss how the catalogs can help developers find data.] Using one database will result in one set of catalog tables. A unified catalog is one aspect of a streamlined development environment that positions IT for faster cycle times in development and response times for business requests.
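A unified catalog lets a developer discover every table with one query. In this sketch, sqlite3 stands in for the actual ODS engine, whose catalog views will differ, and the table names are illustrative:

```python
# Illustration of searching one unified catalog to find data. sqlite3 is a
# stand-in for the ODS engine; its catalog is the sqlite_master table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bkpf (mandt TEXT, belnr TEXT)")    # sample SAP-sourced table
con.execute("CREATE TABLE hd_cost (job TEXT, amount REAL)")  # sample Hard Dollar table

# One database instance means one catalog to query for every table.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # -> ['bkpf', 'hd_cost']
```

Because both SAP-sourced and non-SAP tables live in the same instance, a single catalog query covers the whole reporting universe, rather than one search per database.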


4.3 Populating the ODS

4.4 Developing Views

4.5 Maintaining the ODS


5 Data Warehouse – Sequentially follows the ODS

This section of the Data Architecture describes the strategy for leveraging the data that has been sourced into the ODS, as described in Section 4, via the establishment of a data warehouse. First, there is a discussion of the design considerations for establishing both tabular and multi-dimensional models in [either the same or separate data warehouses]. Next, this section describes strategic considerations for the establishment of each model, populating the data warehouse with data, the further development of the data, and maintenance considerations.

Because the tabular model is relatively new when compared with relational and multidimensional models, this section will provide some additional background information about the tabular model as a “level set.”

5.1 Why Two Data Models for Analytics?

Kiewit’s data is a corporate asset, and to manage that asset for the greatest return requires an “AND” approach rather than an “OR” approach to analytic data models. While there is some overlap in the functionality and benefits of the tabular model and the multidimensional model, each has specific strengths and drawbacks. The Data Architecture makes available both models, allowing the best attributes of each to be available to support business decisions and processes.

The establishment of the Tabular and the Multidimensional models may be performed linearly or concurrently. If done linearly, then the Data Architecture recommends establishing the tabular model first, since it is simpler, faster to solution, and will be a useful tool for prototyping for the multidimensional model.

The Tabular and Multidimensional models can reside in one data warehouse. [DISCUSS MORE ABOUT THIS ARCHITECTURE, AND CRITERIA FOR SEPARATING]

5.2 Tabular Model

A Tabular Database is built on a tabular model and runs in memory. There is a smaller installed base and shorter history with Tabular Databases than with the traditional multidimensional data warehouse model, because the multidimensional model predates in-memory database developments.

The popularity of the tabular model has been growing because it offers a unique bundle of benefits when compared with both relational databases and multidimensional databases. These benefits include:


- Rapid initial development – ability to leverage the existing relational model, without the need for building star schemas and dealing with the resultant ETL complexities
- Faster, simpler development of data for specific reporting and analytic needs when compared with multidimensional development
- Eliminates snapshots (e.g., quantity by time period) because the calculations can be done on the fly at query time, thanks to the power of the in-memory database
- Extremely fast end-user experience
- Faster performance for distinct counts when compared with multidimensional

Drawbacks of the tabular model, which can be overcome with the multidimensional model, include:

- Not suited for very complex models and data sets
- Doesn’t support many-to-many relationships
- No writeback support
- Other

5.2.1 Establishing the Tabular Model

5.2.2 Populating the Tabular Database

5.2.3 Development of Data Sets for Reporting using the Tabular Model

5.2.4 Maintenance

5.3 Multidimensional Model

5.3.1 Establishing the Multidimensional Model

5.3.2 Populating the Multidimensional Warehouse

[IMPORT FROM DOC #1 REGARDING ETL]

5.3.3 Cube Development

5.3.4 Maintenance


6 Reporting Development Environment

[introduction]

6.1 Operational Reporting

6.1.1 Transactional Reports

[discuss reports available directly from source systems]

6.1.2 Operational Reports

6.1.3 Ad Hoc Queries

6.1.4 Self Service

6.1.5 Data Discovery (overlaps ad hoc, self serve & analytics)

6.2 Analytics

6.2.1 Data Mining/Predictive Analytics

6.2.2 Relation to Planning and Forecasting

6.3 Delivery Layer

6.3.1 Portal

6.3.2 Dashboard / Drilldowns

6.3.3 Remote Access to Kiewit network

6.3.4 Mobile Devices and BYOD
