PPDM data loading best practice
Transcript of PPDM data loading best practice
www.etlsolutions.com
Data Loading Best Practice
PPDM Association 2012 Data Symposium
Agenda
Data loading challenges
Best practices in tools and methodology
We’ll be taking a look at loading data into PPDM, but much of this applies to generic data loading too
Agenda
Data loading challenges
Best practices in tools and methodology
We’ve been listening to the Data Manager’s perspective
• PPDM Conference, Houston
• PNEC, Houston
• Data Managers' challenges:
  – Education
  – Certification
  – Preserving knowledge
  – Process
Data management is difficult and important
Different data movement scenarios
Data migration, loading and integration
But all require mapping rules for best practice
The business view of data migration can be an issue
• Often started at the end of a programme
• Seen as a business issue (moving the filing cabinet), not technical
• However, the documents in the filing cabinet need to be read, understood and translated to the new system; obsolete files need to be discarded
Different data migration methodologies are available
PDM (Practical Data Migration)
• Johny Morris
• Training course
• PDM certification
• Abstract
• V2 due soon
Providers
• Most companies providing data migration services/products have a methodology
• Ours is PDM-like, but more concrete and less abstract
Agenda
Data loading challenges
Best practices in tools and methodology
Methodology
As an example, our methodology (core migration):
Landscape analysis
Data discovery
Data modelling
Data assurance
Data review
Data cleansing
Requirements analysis
Project scoping
Configuration
Migration design
Testing design
Migration development
Testing development
Execution
Review
Legacy decommissioning
Firstly, review the legacy landscape
[diagram of the legacy landscape: SAP, applications, reports, archives, Access databases, Excel/VBA and other satellite systems]
Eradicate failure points
Beware the virtual waterfall process
[diagram: Requirements, Signoff, Agile Development, Migrate]
Agenda
Data loading challenges
Best practices in tools and methodology
Rules
• In data migration, integration or loading, one area of commonality is the link between source and target
• This requires design, definition, testing, implementation and documentation
• The aim is automated loading of external data into a common store
• This requires best practice
Rules are required
Best practice: A single version of truth
PPDM 3.8
• So for each of these data loaders we want a single version of truth
• Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work
• We want to remove boiler plate components that are only indirectly related to the business rules by which data is loaded
• Let’s look at what goes into a data loader and where the duplication and unnecessary work comes from...
The PPDM physical model
• PPDM comes to us as a physical projection, rather than a logical model – maps directly to a relational database
• Access therefore via SQL, PL/SQL; low level detail is important i.e. how relationships are implemented (e.g. well header to borehole)
• Access considerations: primary keys, foreign keys, data types (conversions, maximum lengths); load order dictated by FKs ("PPDM Load of the Rings"); relationships and their cardinality, etc.
• SQL errors only show up at runtime, so turnaround can be slow
• All of this metadata is available in machine readable format, so we should use it
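As an illustration of using that machine-readable metadata, here is a minimal Python sketch that derives a load order from foreign-key dependencies by topological sort. The tables and FK pairs are simplified stand-ins, not the real PPDM dictionary; in practice the map would be read from the catalog or the PPDM metadata tables.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical FK metadata: child table -> set of parent tables it references.
# In practice this would be read from the PPDM data dictionary.
fk_parents = {
    "WELL": {"BUSINESS_ASSOCIATE"},
    "WELLBORE": {"WELL"},
    "WELL_LOG": {"WELLBORE"},
    "BUSINESS_ASSOCIATE": set(),
}

# Parents must be loaded before the children that reference them.
load_order = list(TopologicalSorter(fk_parents).static_order())
print(load_order)  # ['BUSINESS_ASSOCIATE', 'WELL', 'WELLBORE', 'WELL_LOG']
```

Because the order is computed from the metadata, adding a table or changing an FK changes the load order automatically, with no hand-maintained sequence to fall out of date.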
External data sources
• Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, database
• The data model is unlikely to be as rich as PPDM, but there is some definition of the content e.g. Excel workbooks have a tabular layout with column titles, worksheets are named
• It can be hard to find people with the relevant skills - you sometimes see ad hoc, non-standard implementations because the developer used whatever skills he/she had: perl, python, xslt, sql
• So the next clue is that we should use the model information: what elements, attributes and relationships are defined, rather than details of how we access it
• Abstract out the data access layer; don’t mix data access with the business rules required to move them into PPDM
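A minimal sketch of that abstraction, assuming hypothetical field names: each source format hides behind the same record interface, so the business rule never touches file handling, XPath or SQL.

```python
from abc import ABC, abstractmethod
from typing import Iterator
import csv, io
import xml.etree.ElementTree as ET

class RecordSource(ABC):
    """Hides *how* data is accessed; business rules see plain dicts."""
    @abstractmethod
    def records(self) -> Iterator[dict]: ...

class CsvSource(RecordSource):
    def __init__(self, text: str):
        self.text = text
    def records(self):
        yield from csv.DictReader(io.StringIO(self.text))

class XmlSource(RecordSource):
    def __init__(self, text: str, element: str):
        self.root, self.element = ET.fromstring(text), element
    def records(self):
        for el in self.root.iter(self.element):
            yield dict(el.attrib)

def to_well_rows(source: RecordSource) -> list[dict]:
    """Business-rule layer: one rule, any source format."""
    return [{"WELL_NAME": r["name"].strip().upper()} for r in source.records()]

csv_src = CsvSource("name\n alpha-1 \n")
xml_src = XmlSource('<wells><well name="beta-2"/></wells>', "well")
print(to_well_rows(csv_src))  # [{'WELL_NAME': 'ALPHA-1'}]
print(to_well_rows(xml_src))  # [{'WELL_NAME': 'BETA-2'}]
```

The same `to_well_rows` rule runs unchanged against CSV or XML, which is the point: the mapping logic is written once against the model, not once per access technology.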
PPDM 3.8
Challenges with domain expert mapping rules
• A common step for defining how a data source is to be loaded is for a domain expert to write it up in Excel
• Not concerned with data access, but some details will creep in, e.g. specifying an XPath
• When lookups, merging/splitting values, string manipulation or conditional logic appear, the description can become ambiguous
• Also note the duplication: the model metadata is being written into the spreadsheet; if the model changes, the spreadsheet needs to be manually updated
Challenges with developer mapping rules
• The example here probably wouldn't pass a code inspection, but it does illustrate the type of issues that can arise
• Firstly, duplication: this is reiterating the Excel rules – they need to match up, but while a domain expert might follow the simple example previously, low-level code can be tricky to discuss
• Secondly, metadata is again duplicated: the names of the tables and columns appear in the SQL statements, and the max length of the name column is checked
• Thirdly, boiler-plate code: select/update/insert conditional logic
• Fourthly, data access code appears in the rules
• Finally, the code becomes hard to maintain as the developer moves on to other roles
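The select/update/insert boilerplate criticised above looks roughly like this sketch (SQLite and a simplified hypothetical table standing in for a PPDM entity). Note how the table name, column names and maximum length all reappear in code, duplicating the model metadata:

```python
import sqlite3

# Hypothetical simplified table standing in for a PPDM entity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE well (uwi TEXT PRIMARY KEY, well_name TEXT)")

def upsert_well(conn, uwi, name):
    """The select/update/insert boilerplate the slides warn about:
    table names, column names and length checks all repeated in code."""
    if len(name) > 60:  # max length duplicated from the model
        raise ValueError("well_name too long")
    row = conn.execute("SELECT 1 FROM well WHERE uwi = ?", (uwi,)).fetchone()
    if row:
        conn.execute("UPDATE well SET well_name = ? WHERE uwi = ?", (name, uwi))
    else:
        conn.execute("INSERT INTO well (uwi, well_name) VALUES (?, ?)", (uwi, name))

upsert_well(conn, "100/01-01", "Alpha")
upsert_well(conn, "100/01-01", "Alpha No. 1")  # second call takes the UPDATE branch
print(conn.execute("SELECT well_name FROM well").fetchall())  # [('Alpha No. 1',)]
```

Every loaded table needs a near-identical copy of this routine, which is exactly the kind of indirectly-related component a tool should generate from metadata rather than a developer writing by hand.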
Documentation of mapping rules
• Word document for sign-off
• Data Management record: how data was loaded; stored in your MDM data store; can be queried
• PPDM mapping tables
Test artifacts
• Here is where you do require some duplication
• Tests are stories:
  • They define what the system should do
  • If it does, the system is good enough, provided the tests are complete
• If we use a single version of truth to generate the tests, the tests will duplicate errors, not find them
Agenda
Data loading challenges
Best practices in tools and methodology
Tools
Use tools
• Use available metadata
• Abstract out the data access layer
• Higher-level DSL for the mapping rules:
  • Increase team communication – developer/business
  • Reduce boiler-plate code
• One definition:
  • Replace Excel and code
  • Generate documentation
An example of a graphical tool: Altova MapForce
• Tools such as Talend, Mule DataMapper and Altova MapForce take a predominantly graphical approach
• The metadata is loaded on the left and right (source/target), with connecting lines between them
• In addition to the logic gates for more complex processing, code snippets can be added to implement most business logic
• Issues:
  • Is it really very easy to read? The example here is a simple mapping; imagine PPDM well log curves, reference data tables etc.
  • It isn't easy to see what really happens: a+b versus an "adder" – e.g. follow the equal() to Customers – what does that actually do?
• But: it can generate documentation and an executable from that single definitive mapping definition
• Typing errors etc. are mostly eliminated
ETL Solutions’ Transformation Manager
• An alternative is to use a textual DSL: again the metadata has been loaded
• No data access code
• Metadata is used extensively: for example warnings, primary keys for identification, relationships
• Typing errors are checked at design time, and model or element changes affecting the code are quickly detected, e.g. PPDM 3.8 to 3.9
• Relationships are used to link transforms: a more logical view with no need to understand the underlying constraints; the complexity of the model doesn't matter, as the project becomes structured naturally
• FK constraints are used to determine load order
• Metadata is pulled in directly from the source, e.g. PPDM, making use of all the hard work put in by the PPDM Association
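The "one definition" idea can be sketched in miniature: hold the mapping rules as data, then derive both the executable transform and the sign-off documentation from that single source. This is an illustrative Python sketch, not Transformation Manager's actual DSL; the field and column names are hypothetical.

```python
# Single definition of the mapping, as data rather than Excel + code.
# Source/target names are illustrative, not real PPDM columns.
MAPPING = [
    {"source": "Well Name", "target": "WELL.WELL_NAME", "transform": str.upper},
    {"source": "TD (m)",    "target": "WELL.FINAL_TD",  "transform": float},
]

def apply_mapping(record: dict) -> dict:
    """Execute the rules against one source record."""
    return {m["target"]: m["transform"](record[m["source"]]) for m in MAPPING}

def document_mapping() -> str:
    """Generate sign-off documentation from the same definition."""
    return "\n".join(f'{m["source"]} -> {m["target"]}' for m in MAPPING)

print(apply_mapping({"Well Name": "alpha-1", "TD (m)": "2450.5"}))
print(document_mapping())
```

Because the document is generated rather than maintained by hand, it cannot drift out of step with the code: change the rule and both outputs change together.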
Generated documentation

Keeping the PPDM data manager happy
One of the many questions a data manager has about the data he/she manages – data lineage: how did this data get here?
• PPDM provides tables to record data lineage
• Transformation Manager can generate documentation for the PPDM metadata module
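To make the lineage idea concrete, here is a sketch that writes a queryable lineage record alongside each load. The table here is a hypothetical simplification; PPDM's actual metadata-module tables are richer, but the principle is the same: every loaded row can be traced back to its source document.

```python
import sqlite3, datetime

conn = sqlite3.connect(":memory:")
# Hypothetical simplified lineage table, not the real PPDM schema.
conn.execute("""CREATE TABLE data_lineage (
    target_table TEXT, target_key TEXT,
    source_doc TEXT, loaded_by TEXT, load_date TEXT)""")

def record_lineage(conn, table, key, source_doc, loaded_by):
    """Answer 'how did this data get here?' with a queryable record."""
    conn.execute(
        "INSERT INTO data_lineage VALUES (?, ?, ?, ?, ?)",
        (table, key, source_doc, loaded_by,
         datetime.date.today().isoformat()))

record_lineage(conn, "WELL", "100/01-01", "wells_2012.xlsx", "loader_v1")
rows = conn.execute(
    "SELECT source_doc FROM data_lineage WHERE target_table='WELL'").fetchall()
print(rows)  # [('wells_2012.xlsx',)]
```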
Agenda
Data loading challenges
Best practices in tools and methodology
Project management
Key points
• Be aware: look at data migration methodologies; select appropriate components
• Look for and remove large risky steps
• Start early: ensure the correct resources will be available; no nasty budget surprises
• Use tools; build a happy virtual team
Questions
• Did you know about these tables?
• Who uses them?
• How do you use them?
• What features would be truly useful in a data loader tool?
www.etlsolutions.com
Raising data management standards

Contact us for more information:
Karl Glenn, Business Development
[email protected]
+44 (0) 1912 894040
Read more on our website:
http://www.etlsolutions.com/what-we-do/oil-and-gas/
Images from Free Digital Photos freedigitalphotos.net