OpenClinica sqldatamart Lindsay Stevens Viral Hepatitis Clinical Research Program Kirby Institute,...

41
OpenClinica sqldatamart Lindsay Stevens Viral Hepatitis Clinical Research Program Kirby Institute, UNSW Australia 1

Transcript of OpenClinica sqldatamart Lindsay Stevens Viral Hepatitis Clinical Research Program Kirby Institute,...

OpenClinica sqldatamartLindsay Stevens

Viral Hepatitis Clinical Research ProgramKirby Institute, UNSW Australia

1

©

#OC15Europe

Agenda Context

Features

Setup

Client use cases

Questions, or: Implementation details Feature ideas

©

#OC15Europe

Context Clinical research program

Focus on Hepatitis C treatment and prevention

Mainly phase 3, 4, observational studies

2-3 projects in/out per year

Local and International participating institutions

©

#OC15Europe

Problem: Study Setup Typical scope:

1. OpenClinica study

2. SAS extraction code (from XML to CSV or Excel)

3. Reports (typically Access)a) Standard reports (subject list, discrepancies, etc.)

b) Study-specific reports

4. Something special (tablet forms, system integrations, etc.)

sqldatamart removes steps 2 and 3a, simplifies 3b

©

#OC15Europe

Problem: Running Reports Typical steps:

1. Full study extract to XML

2. Run SAS code

3. Open report in Access

sqldatamart removes steps 1 and 2

©

#OC15Europe

Requirements

1. Easier setup per study

2. Easier to run reports

3. Provide access to all clinical and operational data

4. Low maintenance

5. Restrict study data access to staff with OpenClinica role

©

#OC15Europe

Previous Solution Access generates a report database in postgres

Written using SQL and Access VBA

Limitations: Access can only handle 255 columns per table Access is single core only, and a bit slow Roles used static passwords Refresh required dropping the report database

Making friends with postgres allows many cool things

©

#OC15Europe

Current Solution Postgres generates a report database in postgres

Written using SQL and PLPGSQL (postgresql procedural language)

Limitations: Synchronous commits during function execution

Can use dblink extension if this becomes a problem

Concurrent materialized view refresh requires unique index Can use array index

©

#OC15Europe

Feature Overview Common study data

Subject details, groups, event / CRF status SDV status, discrepancy notes Metadata, study user roles

Study specific data One matview per item group (long item names) One view per item groups (short item names)

Support functions Snapshot study to SAS / Stata files Create pass thru views in Access

©

#OC15Europe

Open Sourced All the code is on github

github.com/lindsay-stevens-kirby/openclinica_sqldatamart MIT License

Documentation is in the repository wiki

Code or bug report contributions welcome

Written to suit our OpenClinica usage patterns If your pattern is incompatible, maybe you or I can add a fix

©

#OC15Europe

Some Postgres Terminology Matview = Materialized View

A table that has an associated query which can be used to regenerate the table

Allows the database to cache (store) the results of a query

FT = Foreign Table A table that is actually a view to a resource outside of the

current database Allows the database to access other databases, files, etc.

©

#OC15Europe

Data Flow: Diagram

1 2 3

©

#OC15Europe

Data Flow: 1 openclinica_fdw schema

Facilitates connection to OpenClinica database

Copy of some catalog tables to look up the definition of tables, views and indexes

Copy of these tables, views and indexes

Home of the support functions

©

#OC15Europe

Data Flow: 2 dm schema

Facilitates running the core queries in one go for all data in the instance, keeping their definitions in one place

Home of the support queries

©

#OC15Europe

Data Flow: 3 Study schemas

Facilitates segmenting data by study

Transforms data into item group matviews

©

#OC15Europe

Some OpenClinica Terminology

Study

Subject

Event

CRF

ItemGroup

Item

ODM Schema Definition = Metadata Schema Instance = Clinicaldata

My Study

Subject 1

Event 1 Repeat 1

CRF 1

ItemGroup 1 Repeat 1

Item 1 Value

©

#OC15Europe

Common Matviews From dm schema, copied to each study, currently 15.

1. Clinicaldata: item data with some metadata

2. Discrepancy_notes_all: all parent and child notes

3. Discrepancy_notes_parent: just the parent notes

4. Metadata: study level metadata

5. Metadata_event_crf_ig: filtered metadata, top part

©

#OC15Europe

Common Matviews

6. Metadata_crf_ig_item: filtered metadata, bottom part

7. Metadata_study: study info, plus the schema name

8. Response_set_labels: CRF item value labels

9. Sdv_status_history: current and previous sdv statuses

10.User_account_roles: user accounts with study roles

©

#OC15Europe

Common Matviews

11.Subject_event_crf_expected: subjects x all possible

12.Subject_event_crf_status: what exists

13.Subject_event_crf_join: results of 11+12

14.Subject groups: group class assignment details

15.Subjects: subject details

©

#OC15Europe

Item Group Matviews Named using item group OID

Row identifier columns Study Site Subject Event CRF Item group

©

#OC15Europe

Item Group Matviews Item columns

Each item gets a column If study has item names > 12 characters, name is item_oid Otherwise, first 12 of item_name then first 45 of description

Coded items get a second “_label” column my_item my_item_label

Multi-choice items split into column pairs for each choice Choice value appended to name Probably problems if using long coded values

©

#OC15Europe

Item Group Matviews Item data values are stored as text

There is only one “item value” column varchar(4000)

Each value is cast to a type TEXT: ST, PDATE, FILE, “_label” columns NUMERIC: INT, REAL DATE: DATE

We don’t use PDATE or FILE so they don’t get any special handling

©

#OC15Europe

Item Group Alias Views Each item group matview has an twin “Alias” view

Alias view is for helping statistical software SAS and Stata variable name limits around 32 characters item_name is probably the shortest meaningful identifier Alias view renames matview names using item_name Example:

SELECT myitem_my_first_item_with_a_really_long_name

AS myitem

FROM ig_my_matview

©

#OC15Europe

Group Roles Group roles are collections of permissions which can be

assigned to Login roles (real users) or other group roles

A dm_admin group role is created, which owns all sqldatamart objects

A group role is created for each study schema

Study group roles have read only access to objects in that schema Named like dm_study_my_study_name

©

#OC15Europe

Login Roles A login role is created for each OpenClinica user

The local name of the user email is used E.g. email [email protected] => Lstevens

Login roles are given group roles matching OpenClinica

How these login roles authenticate is up to you Advanced setup describes single-sign on with SSPI Could use passwords, certificates, LDAP, etc.

©

#OC15Europe

Basic Setup Overview Create a select-only role in the OpenClinica database

Allow connections from a report database using that role

Install postgres 9.3+,

Install pgAgent (scheduled task runner)

Run bat script updated with the environment details Or execute the psql statements manually Or rewrite for bash

Create pgAgent job to refresh the database as desired

©

#OC15Europe

Advanced Setup Overview All of basic, plus some bonus how-tos

Encrypted connections Between OpenClinica and report database server Between report database server and client

Single sign on using SSPI authentication against AD

Data access logging

Performance settings suggestions

Least-privilege maintenance role

©

#OC15Europe

Maintenance pgAgent used as job runner

Could implement using Task Scheduler, Cron, etc...

Job runner tasks: Refresh data (if study not locked recently) Build schema for new study if not exist Rebuild study if event or CRF definitions updated Sync user roles (create / revoke / grant)

Speed: pgAgent keeps statistics so you can tune refresh frequency Takes ~5 min for our 11 study database, 500k item_data

©

#OC15Europe

Clients Anything that can talk to postgresql

Most practical for office environment is ODBC

Many common applications can talk ODBC SAS, Stata, Microsoft Office, LibreOffice, etc.

Many languages have their own database connection libraries e.g. python, java, c#, R

©

#OC15Europe

Client Setup psqlODBC driver

Optional: If using advanced setup, copy of report database server

issuer certificate for verification Might be unique to my environment but Access seems to

have very short DNS lookup timeout Currently solved by adding record to hosts file I submitted a patch for psqlODBC driver to allow specifying a

postgres service name as a connection parameter and putting the host IP in that, but the patch has not been accepted ... yet

©

#OC15Europe

Client Helper Scripts SAS script:

Interacts with special sqldatamart function Generates code to export a study to .sas7bdat Accepts general filters e.g. if you want only subject “x” Runs said code Result is a “snapshot” of all study data

Stata script and function, same as above except to .dta

These special functions live in the “public” schema so all users can execute them

©

#OC15Europe

Client Helper Scripts Access VBA module:

Import into a database Run the function, providing parameters:

Study schema name ODBC connection string or path to FileDSN

Creates pass through queries for each study matview

©

#OC15Europe

Client Examples - SAS SAS can interact with postgres without snapshot files

Create a foreign library for the study:

LIBNAME myStudyNameLibName ODBC SCHEMA=myStudyNameSchemaNameNOPROMPT="fileDSN reference string, or ODBC connection string"; RUN;

Use this library, e.g. for table of subjects per site by sex:

DATA myStudyNameLibName.subjects ; PROC FREQ; TABLE site_name*subject_sex; RUN;

©

#OC15Europe

Client Examples - Stata Stata can interact with postgres without snapshot files

Load a dataset

odbc load, table("myStudyNameSchemaName.subjects") noquote connectionstring("fileDSN reference string, or ODBC connection string")

Use the dataset, e.g. for table of subjects per site by sex:

tabulate site_name subject_sex

©

#OC15Europe

Client Examples - Access Access can interact with postgres via pass thru queries

Run the setup code or manually create each p-t query

Equivalent table of subjects per site by sex:

TRANSFORM Count(subjects.subject_id) AS CountSubjectSELECT subjects.site_nameFROM subjectsGROUP BY subjects.site_namePIVOT subjects.subject_sex;

Complex queries best done as pass thru since Access downloads each full set to execute query locally

©

#OC15Europe

Implementation Details Batch script feeds psql all CREATE FUNCTION files

Batch script feeds psql variables for build script

psql substitutes variables into build script

Build script uses the functions to create database

Create foreign table for catalog to get database definition

Use catalog to create foreign tables for all tables, views

©

#OC15Europe

Implementation Details Create matview for each foreign table: “cache layer 1”

Create “dm” schema with core queries: “cache layer 2” Clinicaldata, metadata, metadata-derivatives, etc

Create study schemata

Copy core queries filtered for that study

Create item group matviews: “cache layer 3”

User creation

©

#OC15Europe

Implementation Details Item group matviews pivot data using max aggregates

Dynamic SQL creates column names, types, case statements to filter out blanks and null codes

Simplified representation of what happens:

Itemgroup item Item_value

Visit visit_date 2015-06-01

Visit visit_attended Yes

Visit visit_reason_not_done ‘’

Itemgroup visit_date visit_attended visit_reason_not_done

Visit 2015-06-01 Yes [NULL]

OC

DM

©

#OC15Europe

Customisation Examples Change a core query (in dm schema)

The “dm_create_dm_...” functions are just wrappers around a CREATE MATERIALIZED VIEW statement

Edit the query and rebuild the database

Change how login roles are created Currently uses regex ^([A-z]+)@ Edit each “dm_users_...” function where this is used

Should delegate this to a function so it is easier to change

©

#OC15Europe

Customisation Examples Add a new core query (in dm schema)

Copy/paste/rename existing “dm_create_dm...” function Put your query between the $query$ string delimiters Add a call to it in the build script:

After the call to dm_create_ft_openclinica_matview_indexes Before the call to dm_create_study_schemas

Make sure it has a “study_name” column Should be copied by dm_create_study_common_matviews

That is, you might not need to rebuild

©

#OC15Europe

Feature Ideas Variable and value labelling in snapshot scripts

Reorganise clinical/metadata into modular views so that each child is a composition instead of a “select distinct”

Snapshot script for R / python(pandas)

Add more database IDs everywhere For unique indexes on item group matviews For links in reports that go straight to OpenClinica pages For data cleaning tracking against custom views

Mechanism to automatically re-create post-build custom study postgres views after rebuild