LabKey Server ETL Workshop
LabKey Software, Friday, September 20, 2013
1
2
Understand basic workings of LabKey Server
  Administrator & developer views
Know how to use LabKey’s Query capability
Build a module to extend LabKey
  Update data model with incremental scripts
  Expose data & metadata to LabKey Server
Learn ETL options
  Run ETLs
  Create simple ETLs
Objectives
3
Alternate talking & doing
Using Amazon-hosted VMs running LabKey Server + SQL Server
  Run via Remote Desktop
  Everyone has a VM with full admin rights
  Everyone has their own SQL Server instance
Workshop, not one-way training
Course format
4
Never done this before
  Probably “bugs” in course material
The code is fresh
  Code from LabKey “trunk”
  Basic ETL services in place
  Extending over next few months
Keeping fingers crossed for reliable wifi
Caveats
5
About LabKey Server
Getting Connected
LabKey Folder Setup
Data in LabKey
LabKey SQL
Database & Module Architecture
Building a Module
ETL in Modules
Q & A
https://hosted.labkey.com/project/ETLTraining/begin.view?
Agenda
LabKey Server
[Architecture diagram. Labels: LabKey Server (modular, Java-based web app); LabKey Database (PostgreSQL, MS SQL) with LabKey Schemas and More Schemas; File System; File System 2; SAS Share; Data 1; Data 2; external Oracle, MS SQL, and MySQL databases]
Nelson et al., LabKey Server: An open source platform for scientific data integration, analysis and collaboration
7
See instructions on getting to your server at Amazon
  Should connect via Remote Desktop
  You can use SQL Management Studio to get direct access to the database
Full admin gives you power to break anything
  Won’t be true in FHCRC environment
Getting Connected
8
Start server with icon on desktop
  Production installs use a Windows Service
Use web browser on remote desktop machine
  You’ll connect to http://localhost:8080/labkey
Set up a site administrator password
Server will “upgrade itself”
  Run SQL scripts to initialize modules
  We’ll go over this process later when you build your own modules
Starting The Server
9
Site is the server administration level
  Connectivity to resources, site-wide groups
Projects are top-level folders
  Add groups, customized interfaces
Subfolders secure subsets of data
Physically each container is a row in a database with a GUID
  Other tables often have a “container” column
Try the tutorial
Basic Organization and Security
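The container idea above can be sketched outside LabKey. The following is a minimal illustration in Python with SQLite, not LabKey's actual schema: the table names (containers, samples) and column names are hypothetical, chosen only to show how a GUID-valued "container" column scopes rows to a folder.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: one row per folder, and a data table scoped by folder.
    CREATE TABLE containers (entity_id TEXT PRIMARY KEY, name TEXT, parent TEXT);
    CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, container TEXT, label TEXT);
""")

# Each folder is identified by a GUID.
project = str(uuid.uuid4())
subfolder = str(uuid.uuid4())
conn.executemany(
    "INSERT INTO containers VALUES (?, ?, ?)",
    [(project, "Study", None), (subfolder, "Restricted", project)],
)
conn.executemany(
    "INSERT INTO samples (container, label) VALUES (?, ?)",
    [(project, "S-1"), (subfolder, "S-2"), (subfolder, "S-3")],
)

def rows_in_folder(container_id):
    """Return only the rows whose container column matches the given folder."""
    cur = conn.execute(
        "SELECT label FROM samples WHERE container = ? ORDER BY label",
        (container_id,),
    )
    return [label for (label,) in cur]

print(rows_in_folder(subfolder))
```

Filtering every query on the container column is what lets a subfolder expose only its own subset of a shared table.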
Data Connectivity in LabKey
A relational data store designed for scientists
  Built on a robust SQL database
  Property and vocabulary service
  Secure SQL query service
  Data grid for exploring data
  File sharing and linking
10
[Diagram: a UI, ETL, or custom application sends a SQL query or a table + column list through the API layer to the LabKey Query Service in LabKey Server, which sends translated SQL to the relational DB]
11
LabKey data model terminology
Tabular data: data in the form of rows and columns
Schema: a named collection of related tables and queries
Metadata: information about the data contained in a tabular data set, including field names, types, formats, links
Query: a named, saved SQL SELECT statement written in LabKey SQL; can be parameterized
Custom grid view
  Subset of query functionality (field list, sort, filter)
  Intended for UI definition (not defined in SQL)
  Can do implicit joins via lookups
12
Tutorial: Data Analysis
Import a spreadsheet into a list
Explore the data grid view of the list
  Sort
  Filter
  Paging
Create a scatter plot of the data
View the plot over subsets of the data
Change the ARVRegimen field to be a lookup
Lookups in LabKey Server
“Lookup” is a special field type
  A field in one table whose values consist of key values from another table
  Target: the table whose key values are kept in the lookup
  Title field: attribute of the target; specifies the field of the target that will be displayed in place of the key values contained in the lookup
In SQL terms, known as a single-column FOREIGN KEY
  Always many-to-one or one-to-one from lookup field table to target
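Since a lookup is a single-column foreign key in SQL terms, the idea can be shown with plain SQL. The sketch below uses Python's sqlite3 module rather than LabKey itself; the regimen and patient tables and their columns are hypothetical, and the explicit JOIN stands in for what a lookup's title field displays automatically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regimen (              -- the lookup target
        regimen_id INTEGER PRIMARY KEY,
        title      TEXT NOT NULL        -- plays the role of the title field
    );
    CREATE TABLE patient (
        patient_id  INTEGER PRIMARY KEY,
        arv_regimen INTEGER REFERENCES regimen(regimen_id)  -- the lookup field
    );
    INSERT INTO regimen VALUES (1, 'Regimen A'), (2, 'Regimen B');
    INSERT INTO patient VALUES (101, 1), (102, 2), (103, 1);
""")

# Show the target's title in place of the stored key value (many-to-one).
rows = conn.execute("""
    SELECT p.patient_id, r.title
    FROM patient p
    LEFT JOIN regimen r ON r.regimen_id = p.arv_regimen
    ORDER BY p.patient_id
""").fetchall()
print(rows)
```

Renaming a regimen means updating one row in the target table; every patient row that points at it picks up the new display value.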
13
14
Display more meaningful data values
Allow users to explore data without writing SQL
Constrain user input to a fixed set of choices
Allow updating display values in one place
Add expression columns to base data sets
Uses of lookups
Configuring fields
15
The Field Editor is the main UI for configuring field-level properties
For developer-defined tables, data is supplied in XML
16
LabKey allows folks to write SQL
  But they don’t get access to the underlying database
Within any folder, the available schemas can be browsed
Create new Queries
  Equivalent to database views
SQL In LabKey
Query Schema Browser
17
New Query
18
Query Web Part
19
20
Full SELECT syntax
  Update/Insert/Merge accessible via ETL pipeline, APIs, UI
Easy lookup syntax replaces JOIN in many cases
Use || for string concat (like Oracle, PostgreSQL)
PIVOT queries
GROUP_CONCAT
PARAMETERS
LabKey SQL vs MS SQL
21
Joins
Group_Concat: all visits for a patient
PIVOT: one column for each visit
Queries to Try
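For readers who want to see the shape of these two queries before trying them on the server, here is a rough equivalent in Python with SQLite. It is not LabKey SQL: the visits table and its values are made up, SQLite's group_concat stands in for GROUP_CONCAT, and the pivot is written as conditional aggregation rather than with a PIVOT clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (patient_id INTEGER, visit TEXT, cd4 INTEGER);
    INSERT INTO visits VALUES
        (101, 'V1', 350), (101, 'V2', 420),
        (102, 'V1', 510);
""")

# GROUP_CONCAT style: one row per patient, listing all of that patient's visits.
all_visits = conn.execute("""
    SELECT patient_id, group_concat(visit, ',') AS visits
    FROM visits
    GROUP BY patient_id
    ORDER BY patient_id
""").fetchall()

# PIVOT style: one column for each visit, built with conditional aggregation.
pivoted = conn.execute("""
    SELECT patient_id,
           MAX(CASE WHEN visit = 'V1' THEN cd4 END) AS V1,
           MAX(CASE WHEN visit = 'V2' THEN cd4 END) AS V2
    FROM visits
    GROUP BY patient_id
    ORDER BY patient_id
""").fetchall()

print(all_visits)
print(pivoted)
```

The conditional-aggregation form needs the visit names written out by hand, which is the bookkeeping a PIVOT query is meant to remove.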
22
LabKey Server is based on modules
  Look in Admin->Folder Management->FolderType
Each module can provide
  HTML views
  JavaScript/CSS
  LabKey SQL queries
    Enables easy movement of sets of queries between servers
  ETL definitions
  Reports in R and JavaScript
  Database-level schema definition
    Only run at restart so DBAs can approve
  XML to add metadata to database schema
  Java code
LabKey Modules
23
See tutorial
Building first Module
24
Provenance
  For every row in HIDRA_Prime, know when & how it got there
Auditing
  For every row that leaves HIDRA_Prime, know when & how it left
  Down to individual patient info
History of all runs
Clear packaging & deployment
Re-invent the axle, but not the wheel…
  Use stored procs (coming soon)
  Wrap existing ETL frameworks
ETLs: Why In LabKey
25
Still under development
Basic functionality is in place
  Query-based ETLs
  Checkers (identify whether work is to be done)
  Scheduling
  Logging all output
LabKey ETL Infrastructure
26
User Interface
  Management User Interface
    Scheduling
    Lists of Transform Runs
    Detail views
  ETL Creation
Stored procedure-based ETLs
Support for external ETL packages (SSIS, Kettle)
Still Not Done
27
Change identification
Initiation
Query
Transformation
Staging
Load/Merge
Finalize
ETL Steps (from Design Spec)
28
ETLs are defined in the etls directory of a module
  Each ETL is an XML file
Each ETL consists of a set of Transform Steps
Key components of a Transform
  Source Query (LabKey SQL for now)
  Destination Table
    May be in unrelated database
  Filter Strategy
    Identifies rows to transform & if there is work to do
  Schedule
ETL Basics
29
Choose which rows to move to target table
SelectAllFilterStrategy
  Just get all the data, every time
ModifiedSinceFilterStrategy
  Rows with a DateTime column newer than last run
  Records most recent value
RunFilterStrategy
  Based on incrementing integer value (e.g. Run ID)
  Any rows with higher value than last time are transferred
  Useful for rows written by previous ETLs
But can “forget” previous runs and re-run from scratch
  “Reset State” in the UI
Filter Strategies
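The incremental strategies share one idea: remember the highest value seen so far and transfer only rows above it. Below is a small Python sketch of that idea for the timestamp case. It illustrates the concept and is not LabKey's implementation; the function name modified_since and the sample rows are hypothetical, and passing None as the remembered value plays the role of “Reset State”.

```python
from datetime import datetime

def modified_since(rows, timestamp_column, last_seen):
    """Return the rows newer than last_seen, plus the new high-water mark."""
    selected = [
        r for r in rows
        if last_seen is None or r[timestamp_column] > last_seen
    ]
    # If nothing was selected, keep the previous mark unchanged.
    new_mark = max((r[timestamp_column] for r in selected), default=last_seen)
    return selected, new_mark

source = [
    {"id": 1, "updtDtTm": datetime(2013, 9, 1)},
    {"id": 2, "updtDtTm": datetime(2013, 9, 10)},
]

first, mark = modified_since(source, "updtDtTm", None)    # first run: every row
second, mark = modified_since(source, "updtDtTm", mark)   # nothing changed: no work
source[0]["updtDtTm"] = datetime(2013, 9, 20)             # a source row is updated
third, mark = modified_since(source, "updtDtTm", mark)    # only the updated row

print(len(first), len(second), [r["id"] for r in third])
```

The same function works for a run-based strategy if the column holds an incrementing integer instead of a timestamp.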
30
How to add data to target table
truncate
  Delete all rows and add the selected ones
append
  Add new rows to the target table
  Will fail if duplicate primary keys
merge
  Update or insert
  Matches primary keys
Target Options
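The three target options differ only in how existing rows are treated. The following Python sketch models the target table as a dictionary keyed by primary key; the function name load and the sample rows are hypothetical, and a real database would apply these semantics inside a transaction.

```python
def load(target, rows, option, key="id"):
    """Apply rows to target (a dict keyed by primary key) under one target option."""
    if option == "truncate":
        target.clear()                      # delete all existing rows first
        for row in rows:
            target[row[key]] = row
    elif option == "append":
        for row in rows:
            if row[key] in target:          # duplicate primary key: fail
                raise ValueError(f"duplicate primary key: {row[key]}")
            target[row[key]] = row
    elif option == "merge":
        for row in rows:
            target[row[key]] = row          # update if the key exists, else insert
    else:
        raise ValueError(f"unknown target option: {option}")
    return target

table = {1: {"id": 1, "name": "old"}}
load(table, [{"id": 1, "name": "new"}, {"id": 2, "name": "b"}], "merge")
print(table)
```

Running the same two rows with "append" instead of "merge" would raise an error, because key 1 is already present.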
31
<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <description>Replaces target with source query.</description>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <!-- truncate: delete all target rows, then add the selected ones -->
      <destination schemaName="patient" queryName="etl_target" targetOption="truncate"/>
    </transform>
  </transforms>
  <!-- select every source row on every run -->
  <incrementalFilter className="SelectAllFilterStrategy" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>
Overwrite Full Table Every Hour
32
<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <description>Replaces target with source query.</description>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <!-- merge: update rows whose primary key matches, insert the rest -->
      <destination schemaName="patient" queryName="etl_target" targetOption="merge"/>
    </transform>
  </transforms>
  <!-- select only rows whose Date column is newer than the last run -->
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>
Merge Changed Rows
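Before dropping an ETL file into a module, it can help to confirm that it is well-formed XML, with plain ASCII double quotes around every attribute value; a typographic quote pasted from a slide makes the file malformed. The sketch below uses only Python's standard library to parse a merge definition like the one above and print its key settings. It checks well-formedness and reads attributes; it does not validate against LabKey's ETL schema.

```python
import xml.etree.ElementTree as ET

ETL_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Overwrite</name>
  <transforms>
    <transform id="1hour">
      <source schemaName="external" queryName="etl_source" />
      <destination schemaName="patient" queryName="etl_target" targetOption="merge"/>
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Date" />
  <schedule><poll interval="1h"></poll></schedule>
</etl>"""

# Elements live in the ETL namespace, so lookups need a prefix mapping.
NS = {"etl": "http://labkey.org/etl/xml"}

root = ET.fromstring(ETL_XML)  # raises ParseError if the XML is malformed

summary = {
    "targets": [
        t.find("etl:destination", NS).get("targetOption")
        for t in root.findall("etl:transforms/etl:transform", NS)
    ],
    "filter": root.find("etl:incrementalFilter", NS).get("className"),
    "interval": root.find("etl:schedule/etl:poll", NS).get("interval"),
}
print(summary)
```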
33
Couple of key tables in the dataintegration schema
TransformConfiguration
  One row for each ETL
  Controls whether ETL is active
  Quick access to state of last run
TransformRun
  Stores information about every transform
  Success or failure
  Total # of rows transferred
Pipeline
  Detailed log of steps
Storing ETL Information
34
Try an Early HIDRA ETL
35
Enable hidra and hidra_uw_intake
36
Amalga_Import has some Data
37
Let’s Try a Transform
38
<?xml version="1.0" encoding="UTF-8"?>
<etl xmlns="http://labkey.org/etl/xml">
  <name>Amalga to hidraPrime - Patients</name>
  <description>Move uw_patient, uw_patientidentifier, uw_encounter from Amalga to hidraPrime</description>
  <transforms>
    <transform id="patient">
      <source schemaName="AmalgaImport_queries" queryName="uw_patient" timestampColumnName="updtDtTm" />
      <destination schemaName="hidraPrime" queryName="Patient" targetOption="merge"/>
    </transform>
    <transform id="patientidentifier_mrn">
      <source schemaName="AmalgaImport_queries" queryName="uw_patientidentifier_mrn" timestampColumnName="lastUpdateTime"/>
      <destination schemaName="hidraPrime" queryName="PatientIdentifier" targetOption="merge"/>
    </transform>
    <transform id="patientidentifier_epi">
      <source schemaName="AmalgaImport_queries" queryName="uw_patientidentifier_epi" timestampColumnName="lastUpdateTime" />
      <destination schemaName="hidraPrime" queryName="PatientIdentifier" targetOption="merge"/>
    </transform>
    <transform id="encounter">
      <source schemaName="AmalgaImport_queries" queryName="uw_encounter" timestampColumnName="lastUpdateTime" />
      <destination schemaName="hidraPrime" queryName="Encounter" targetOption="merge"/>
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="lastUpdateTime" />
</etl>
Files in: C:\LabKey\modules\hidra_uw_intake
A look inside
39
SELECT
  (SELECT OID
   FROM AmalgaImport_azAEID.AEID204
   WHERE AEID204.EIDForOID = UW_PID601.EIDForOID) AS GPID,
  LName AS LastName,
  FName AS FirstName,
  MName AS MiddleName,
  MotherMaidenName AS MaidenNameMother,
  DOB,
  Sex AS Gender,
  Language AS PrimaryLanguage,
  PatientAlias,
  Race,
  Street1 AS AddressLine1,
  Street2 AS AddressLine2,…
FROM AmalgaImport_azADT.UW_PID601
Patient Query
40
Nothing happens
Change some data in Amalga_Import.azADT.UW_PID601
  Remember to update the updtDtTm field
Now try again
Run Again
41
42
Researchers often have data in existing relational databases
  LIMS systems
  Clinical data
  Locally-developed applications
LabKey Server offers two mechanisms to incorporate this data
  Define an external schema connection (link)
  Use Extract, Transform and Load support (copy)
Data in external databases
43
LabKey Server consists of many separate modules
Server modules usually contain SQL scripts to create the database objects used by the module
  CREATE or ALTER, TABLEs and VIEWs in native syntax
  Schema usually specific to a module
  Supported DBs: PostgreSQL and Microsoft SQL Server
  Script runner figures out which scripts are needed for upgrade
Database tables and LabKey Server modules
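How a script runner might decide which scripts an upgrade needs can be sketched as follows. This is an illustration of the general idea only, not LabKey's actual algorithm: it assumes each incremental script is labeled with a (from, to) version pair, and the version numbers and the function name scripts_to_run are hypothetical.

```python
def scripts_to_run(scripts, installed, target):
    """Pick an ordered chain of (from, to) scripts leading from installed to target."""
    chain = []
    current = installed
    while current < target:
        # Scripts that start at the current version and do not overshoot the target.
        candidates = [s for s in scripts if s[0] == current and s[1] <= target]
        if not candidates:
            raise LookupError(f"no script starts at version {current}")
        step = max(candidates, key=lambda s: s[1])  # prefer the biggest jump
        chain.append(step)
        current = step[1]
    return chain

# Versions as (major, minor) tuples; (0, 0) means the module is not installed yet.
available = [
    ((0, 0), (13, 10)),
    ((13, 10), (13, 11)),
    ((13, 11), (13, 12)),
    ((13, 10), (13, 12)),   # a rolled-up script covering two increments
]

fresh_install = scripts_to_run(available, (0, 0), (13, 12))
upgrade = scripts_to_run(available, (13, 11), (13, 12))
print(fresh_install)
print(upgrade)
```

A fresh install takes the rolled-up path, while a server already at an intermediate version runs only the increment it is missing.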
44
After install or upgrade, the SQL sent to the database is mostly SELECTs and 1-row UPDATE/INSERT/DELETE
  SELECTs can be issued by a user or an application in LabKey SQL
  LabKey translates into the back-end database dialect
45
Provides a way to link from LabKey Server to another data source so that LabKey’s functions and Client API work directly on the external data
LabKey translates its own SQL into the dialect of the external schema
Supported databases include Oracle, SAS, and MySQL in addition to Postgres and SQL Server
Options:
  Expose only some tables to LabKey
  Read-only or read/write
  Implement folder-based security if a containerId is included
  Add additional metadata (for example, field display properties) via an XML file
External schemas and data sources
[Diagram: Folder 1 and Folder 2, with Files, Proteomics, and Flow content]
Tabular data rows and files are visible in folders
Folders, files and tabular data