From Excel To Database (Jun 21, 2017)
Bo Yao 06/2017
From Excel To Database
Outline
• Excel vs. Database
• Fundamental Knowledge of Database
• Database Design Strategies
• Database Services Provided By BICF
– Project Examples
Data Record Issues
• Most experimental data and results are recorded in Excel files
How to understand the variables and inputs recorded by other people?
How to clean, select, or combine data from several Excel files?
How to safely transfer data from a departing staff member to a new hire?
How to avoid typos and mismatches in Excel files?
How to control data access permissions and data usage?
………
Reasons for Issues
• How to understand the variables and inputs recorded by other people?
– No codebook or dictionary
• How to clean, select, or combine data from several Excel files?
– Weak search function in Excel
• How to safely transfer data from a departing staff member to a new hire?
– No centralized data management. No code record standards
• How to avoid typos and mismatches in Excel files?
– No self-check or validation when inputting data
• How to control data access permissions and data usage?
– Weak data access control functions in Excel
Excel vs. Database
Feature | Excel File | Online Record | Online Advantage
Access Location | Local machine | Internet share | Easier for collaborators
Data Source | Multiple copies on different machines | Single data source | Easier for data version control and maintenance
Data Input | Slow, with a risk of wrong input | Quick and standard input | 1) Validates the user's input 2) Allows batch input
Access Permission Control | Weak | Strong | Contains multiple access protections
View Change History | None | Possible | Clinical information change history is recorded
Unexpected Information Deletion | None | Can be recovered | A clinical information deletion can be recovered within a short time
Data Backup | None | Periodic data backup | Avoids data loss
Data Summary | Weak | Strong | Quickly generates summary graphs and records
Suggestions
• Excel
– Quick
– Flexible
– Personal
– Small projects
– Temp / Short-term
• Database
– Design before usage
– Standard
– Team work or shared
– Large projects
– Long-term
About Database
• Many Database Systems
– Flat Database: .txt, .ini, Registry, Excel, xml
– Relational Database: Oracle, SQL Server, MySQL
– Key-Value Database: Redis, Tokyo Cabinet, Flare
– Document Oriented Database: MongoDB, CouchDB
– Distributed Database: Cassandra, Voldemort
Learn Relational Database
• A relational database (RDB) is a collective set of multiple data sets organized by tables, records and columns. RDBs establish a well-defined relationship between database tables. Tables communicate and share information, which facilitates data searchability, organization and reporting. (https://www.techopedia.com/definition/1234/relational-database-rdb)
• Top Questions
o How to assign variables to tables?
o How to set up constraints between these tables?
o How to speed up search queries?
Example
School Management System Database: MySQL
Fundamental Knowledge – Database Components
Database
• Table 1: variables
• Table 2: variables
• Table 3: variables
• Table 4: variables
Fundamental Knowledge - Variables
• Name
• Type
– string: varchar, text…
– number: int, float, decimal…
– date: date, datetime, timestamp
– blob
• Default value
• Is Null
• Is Auto Increment
• Is Key
– Primary key
– Foreign key
– Unique key
• Charset
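The variable attributes listed above can be tried out in a small sketch. It assumes SQLite (through Python's standard library) as a stand-in for MySQL, and the Visit table and its columns are hypothetical names chosen for illustration only; MySQL spells auto increment as AUTO_INCREMENT rather than AUTOINCREMENT.

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Visit (
        ID        INTEGER PRIMARY KEY AUTOINCREMENT,  -- key + auto increment
        MRN       VARCHAR(40) NOT NULL,               -- string, must not be null
        VisitDate DATE NOT NULL,                      -- date
        Weight    DECIMAL(5,2) DEFAULT NULL,          -- number, may be null
        Status    VARCHAR(20) NOT NULL DEFAULT 'new'  -- default value
    )
    """
)

# ID and Status are omitted, so auto increment and the default value fill them in.
conn.execute("INSERT INTO Visit (MRN, VisitDate) VALUES ('K3212d', '2016-03-23')")
conn.execute("INSERT INTO Visit (MRN, VisitDate) VALUES ('Ge23ds3', '2016-05-12')")
rows = conn.execute("SELECT ID, MRN, Weight, Status FROM Visit ORDER BY ID").fetchall()

# A missing NOT NULL value is rejected instead of being stored silently.
try:
    conn.execute("INSERT INTO Visit (VisitDate) VALUES ('2016-06-12')")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

print(rows)           # [(1, 'K3212d', None, 'new'), (2, 'Ge23ds3', None, 'new')]
print(null_rejected)  # True
```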
Fundamental Knowledge - Keys
• Keys let the data check and constrain itself
– Primary Key: identifier for a row; unique in the table; automatically indexed
– Foreign Key: value is limited to the list of values of a variable in another table
– Unique Key: unique in the table
Examples - key
Employee Table
Column | Type | Row 1 | Row 2
PersonID | Varchar(10) | 155556 | 155557
Name | Varchar(255) | Eric Yao | Tiger Yao
Birthday | Date | 12/12/2010 | 11/11/2011
SSN | Varchar(20) | 111-11-1111 | 222-22-2222
Department | Varchar(255) | Clinical Sciences | Bioinformatics
JobTitle | Varchar(255) | Web Developer I | Postdoc
… | … | … | …
Primary Key: PersonID
Unique Key: SSN
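A minimal sketch of how a primary key and a unique key reject wrong input, assuming PersonID is the primary key and SSN the unique key of the Employee table. SQLite (Python standard library) stands in for MySQL, only some of the columns are kept, and the "Someone Else" rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employee (
        PersonID   VARCHAR(10) PRIMARY KEY,  -- assumed primary key
        Name       VARCHAR(255) NOT NULL,
        SSN        VARCHAR(20) UNIQUE,       -- assumed unique key
        Department VARCHAR(255)
    )
    """
)
conn.execute("INSERT INTO Employee VALUES ('155556', 'Eric Yao', '111-11-1111', 'Clinical Sciences')")
conn.execute("INSERT INTO Employee VALUES ('155557', 'Tiger Yao', '222-22-2222', 'Bioinformatics')")


def try_insert(row):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO Employee VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical rows: the first reuses a PersonID, the second reuses an SSN.
duplicate_id = try_insert(("155556", "Someone Else", "333-33-3333", "Bioinformatics"))
duplicate_ssn = try_insert(("155558", "Someone Else", "111-11-1111", "Bioinformatics"))
new_person = try_insert(("155558", "Someone Else", "333-33-3333", "Bioinformatics"))

print(duplicate_id, duplicate_ssn, new_person)  # False False True
```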
Examples
Employee Table
PersonID | Varchar(10)
Name | Varchar(255)
Birthday | Date
SSN | Varchar(20)
Department | Varchar(255)
JobTitle | Varchar(255)
… | …
Salary Table
PersonID | Varchar(10)
Salary | Decimal
… | …
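One way the two tables could be linked is a foreign key from Salary.PersonID to Employee.PersonID. The sketch below assumes that link, uses SQLite (Python standard library) in place of MySQL, and fills in hypothetical salary figures purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when this is on
conn.executescript(
    """
    CREATE TABLE Employee (
        PersonID VARCHAR(10) PRIMARY KEY,
        Name     VARCHAR(255) NOT NULL
    );
    CREATE TABLE Salary (
        PersonID VARCHAR(10) NOT NULL REFERENCES Employee(PersonID),
        Salary   DECIMAL(10,2) NOT NULL
    );
    INSERT INTO Employee VALUES ('155556', 'Eric Yao'), ('155557', 'Tiger Yao');
    -- Salary amounts are hypothetical.
    INSERT INTO Salary VALUES ('155556', 50000), ('155557', 45000);
    """
)

# The shared PersonID lets the two tables be combined in one query.
joined = conn.execute(
    """
    SELECT e.Name, s.Salary
    FROM Employee e
    JOIN Salary s ON s.PersonID = e.PersonID
    ORDER BY e.PersonID
    """
).fetchall()

# A salary for a PersonID that is not in Employee is rejected by the foreign key.
try:
    conn.execute("INSERT INTO Salary VALUES ('999999', 10000)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(joined)           # [('Eric Yao', 50000), ('Tiger Yao', 45000)]
print(orphan_rejected)  # True
```

Keeping salary in its own table also means it can be given stricter access permissions than the rest of the employee record.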
Fundamental Knowledge – Codebook and Dictionary
• A codebook summarizes the categories of each variable
• A codebook standardizes data input
– Race: Asian, African American, White, …
– Smoking Status: Current Smoker, Former Smoker, Non Smoker, …
– Diagnosis: Yolk sac tumor, Embryonal carcinoma, Choriocarcinoma, …
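As a rough illustration of how a codebook standardizes input, the sketch below checks free-text entries against the categories listed above. The function name standardize and the whitespace and case handling are assumptions made for the example, not part of any existing tool.

```python
# Codebook categories copied from the slide; the elided entries ("…") are left out.
CODEBOOK = {
    "Smoking Status": ["Current Smoker", "Former Smoker", "Non Smoker"],
    "Diagnosis": ["Yolk sac tumor", "Embryonal carcinoma", "Choriocarcinoma"],
}


def standardize(variable, raw_value):
    """Map a free-text entry to its codebook category, or raise ValueError."""
    # Trim, collapse repeated whitespace, and ignore case before comparing.
    cleaned = " ".join(raw_value.split()).lower()
    for category in CODEBOOK[variable]:
        if category.lower() == cleaned:
            return category
    raise ValueError(f"{raw_value!r} is not a codebook value for {variable}")


print(standardize("Smoking Status", "  former  smoker "))  # Former Smoker

try:
    standardize("Diagnosis", "Yolk sack tumor")  # typo: not in the codebook
except ValueError as error:
    print(error)
```

Entries that fail the check are reported instead of being stored, which is the kind of typo that goes unnoticed in a spreadsheet.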
Example of Codebook and Dictionary
Codebook: https://qbrc.swmed.edu/projects/gct/documents/GCT%20CodeBook_v3.4.pdf
Dictionary: https://qbrc.swmed.edu/projects/gct/documents/GCT_dictionary_v3.4.pdf
Simple Conclusions – Relational Database
• Consists of several tables
• Tables are linked by foreign keys
• Keys are set as data constraints (self-check)
• Codebook / dictionary sets the data input standards
How to design MySQL Database
• Main considerations before design
– Database size
– Data loading methods
– Data sensitivity
– End users
– The aims of data collection
– User account controls
– Data backup
– Data encryption
Basic Requirements
• Data Consistency: no mismatch
• Least Redundancy: good space usage
• Scalable: potential for bigger data
• Quick Query: query performance
• Data Standards: avoid typos
Some rules
• Single copy for valued data
– A valued variable exists in only one table
• Keep performance from degrading as records increase
– The number of records in one table should be less than 10^7
• Keys / constraints to avoid wrong input
– Link as many tables as possible
• Atomic information stored in an individual cell (e.g. avoid information like 'black,white' in one Race cell)
– Combined values in one 'cell' are difficult to search or index
• Set codebooks for categorized variables
– Data standards
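The atomic-information rule can be sketched as follows: a combined cell such as 'black,white' is split into one row per value, which can then be indexed and searched exactly. The PatientRace table, the index name, and the three spreadsheet rows are hypothetical, and SQLite (Python standard library) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PatientRace (
        PatientID INTEGER NOT NULL,
        Race      VARCHAR(40) NOT NULL,
        PRIMARY KEY (PatientID, Race)   -- one row per patient and value
    );
    CREATE INDEX idx_race ON PatientRace (Race);
    """
)

# Spreadsheet-style cells that combine several values (hypothetical rows).
excel_rows = [(1, "black,white"), (2, "white"), (3, "asian")]

# Split each combined cell so that every stored value is atomic.
for patient_id, cell in excel_rows:
    for race in cell.split(","):
        conn.execute("INSERT INTO PatientRace VALUES (?, ?)", (patient_id, race.strip()))

# An exact, indexable match replaces a substring search over combined cells.
white_patients = [
    row[0]
    for row in conn.execute(
        "SELECT PatientID FROM PatientRace WHERE Race = ? ORDER BY PatientID", ("white",)
    )
]
print(white_patients)  # [1, 2]
```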
Database Design Practice
Database design task
• Question: Create a MySQL database ‘test’ to contain this information. (No data input, only schema)
Sample ID (auto-increment) | 1 | 2 | 3
Patient MRN * | K3212d | Ge23ds3 | Kid02112
Surgery Date * | 03/23/2016 | 05/12/2016 | 06/12/2016
Procedure * | Surgery | Biopsy | Biopsy
Sequencing Platform | Illumina | Affymetrix | Agilent
Data Type | Raw | Processed | Processed
Create Date | 05/18/2017 | 05/18/2017 | 05/18/2017
MySQL Tools
• MySQL management tool
– phpMyAdmin
• Database Client Tool
– DbVisualizer
– DataGrip
Codebook Tables
• CodeProcedure
CREATE TABLE CodeProcedure (
  ID int(2) NOT NULL,
  Proc varchar(40) NOT NULL,
  PRIMARY KEY (ID),
  UNIQUE KEY Proc (Proc)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
• CodeSeqPlatform
CREATE TABLE CodeSeqPlatform (
  ID int(2) NOT NULL,
  SeqPlatform varchar(40) NOT NULL,
  PRIMARY KEY (ID),
  UNIQUE KEY SeqPlatform (SeqPlatform)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
• CodeTypeData
CREATE TABLE CodeTypeData (
  ID int(2) NOT NULL,
  TypeData varchar(40) NOT NULL,
  PRIMARY KEY (ID),
  UNIQUE KEY TypeData (TypeData)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Sample Table
CREATE TABLE Sample (
  ID int(10) unsigned NOT NULL AUTO_INCREMENT,
  MRN varchar(40) NOT NULL,
  DateSurgery date NOT NULL,
  Proc int(2) NOT NULL,
  SeqPlatform int(2) DEFAULT NULL,
  TypeData int(2) DEFAULT NULL,
  CreateDate date DEFAULT NULL,
  PRIMARY KEY (ID)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Check Database Schema
(created by DbVisualizer)
Add Data Constraints
• Add foreign keys
ALTER TABLE Sample ADD CONSTRAINT s_procedure
  FOREIGN KEY (Proc) REFERENCES CodeProcedure(ID);
ALTER TABLE Sample ADD CONSTRAINT s_seqplatform
  FOREIGN KEY (SeqPlatform) REFERENCES CodeSeqPlatform(ID);
ALTER TABLE Sample ADD CONSTRAINT s_typedata
  FOREIGN KEY (TypeData) REFERENCES CodeTypeData(ID);
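To see these constraints at work without a MySQL server, the schema can be approximated in SQLite through Python's standard library. This is only a sketch: SQLite does not support ALTER TABLE ... ADD CONSTRAINT, so the foreign keys are declared inside CREATE TABLE, the MySQL-specific ENGINE and CHARSET options are dropped, and the codebook IDs (1, 2, 3) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce foreign keys
conn.executescript(
    """
    CREATE TABLE CodeProcedure   (ID INTEGER PRIMARY KEY, Proc VARCHAR(40) NOT NULL UNIQUE);
    CREATE TABLE CodeSeqPlatform (ID INTEGER PRIMARY KEY, SeqPlatform VARCHAR(40) NOT NULL UNIQUE);
    CREATE TABLE CodeTypeData    (ID INTEGER PRIMARY KEY, TypeData VARCHAR(40) NOT NULL UNIQUE);
    CREATE TABLE Sample (
        ID          INTEGER PRIMARY KEY AUTOINCREMENT,
        MRN         VARCHAR(40) NOT NULL,
        DateSurgery DATE NOT NULL,
        Proc        INTEGER NOT NULL REFERENCES CodeProcedure(ID),
        SeqPlatform INTEGER REFERENCES CodeSeqPlatform(ID),
        TypeData    INTEGER REFERENCES CodeTypeData(ID),
        CreateDate  DATE
    );
    -- Codebook IDs are hypothetical.
    INSERT INTO CodeProcedure VALUES (1, 'Surgery'), (2, 'Biopsy');
    INSERT INTO CodeSeqPlatform VALUES (1, 'Illumina'), (2, 'Affymetrix'), (3, 'Agilent');
    INSERT INTO CodeTypeData VALUES (1, 'Raw'), (2, 'Processed');
    """
)

# First sample from the design task, stored as codebook IDs.
conn.execute(
    "INSERT INTO Sample (MRN, DateSurgery, Proc, SeqPlatform, TypeData, CreateDate) "
    "VALUES ('K3212d', '2016-03-23', 1, 1, 1, '2017-05-18')"
)

# Joining the codebook tables turns the IDs back into readable labels.
decoded = conn.execute(
    """
    SELECT s.ID, s.MRN, p.Proc, q.SeqPlatform, t.TypeData
    FROM Sample s
    JOIN CodeProcedure p        ON p.ID = s.Proc
    LEFT JOIN CodeSeqPlatform q ON q.ID = s.SeqPlatform
    LEFT JOIN CodeTypeData t    ON t.ID = s.TypeData
    """
).fetchall()

# Procedure code 99 is not in the codebook, so the foreign key rejects the row.
try:
    conn.execute("INSERT INTO Sample (MRN, DateSurgery, Proc) VALUES ('Kid02112', '2016-06-12', 99)")
    bad_code_rejected = False
except sqlite3.IntegrityError:
    bad_code_rejected = True

print(decoded)            # [(1, 'K3212d', 'Surgery', 'Illumina', 'Raw')]
print(bad_code_rejected)  # True
```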
Final Database Schema
(created by DbVisualizer)
Quick Summary
• Codebook
• Meaningful naming
• Data type selection
• Key selection
Database Services From BICF
• Help desk for consulting
– Database design
– Web portal design and development
– Training
• Complete service for design and implementation
– Database: database design, data loading, maintenance, and periodic backup
– Web portal: design, development, deployment, and maintenance
Project Example
• Help Desk - Nutrition Center
– Database: help with database design to speed up data queries
– Website Security: code checking to enhance website security
– Website Enhancement: advice on the web user interface and functions to improve web usability
– Project Management
• Complete service – Children’s Hospital
• Pediatric Biobank
– Record patients’ clinical data
– Database and Web Portal
Pediatric Biobank
Secure Account System
User-‐friendly Data Input and Search
Track Account Login History
Track Clinical Data Change History
Collaborators Online Record Tool
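One common way to track data change history is a history table filled by triggers. The sketch below illustrates that general technique only; it is not a description of how the Pediatric Biobank is implemented. The Patient and PatientHistory tables and the trigger names are hypothetical, and SQLite (Python standard library) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Patient (
        ID        INTEGER PRIMARY KEY,
        Diagnosis VARCHAR(255)
    );
    CREATE TABLE PatientHistory (
        HistoryID    INTEGER PRIMARY KEY AUTOINCREMENT,
        PatientID    INTEGER NOT NULL,
        OldDiagnosis VARCHAR(255),
        NewDiagnosis VARCHAR(255),
        ChangedAt    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    -- Record every change of Diagnosis.
    CREATE TRIGGER patient_update AFTER UPDATE OF Diagnosis ON Patient
    BEGIN
        INSERT INTO PatientHistory (PatientID, OldDiagnosis, NewDiagnosis)
        VALUES (OLD.ID, OLD.Diagnosis, NEW.Diagnosis);
    END;
    -- Record deletions so the last value can be restored later.
    CREATE TRIGGER patient_delete AFTER DELETE ON Patient
    BEGIN
        INSERT INTO PatientHistory (PatientID, OldDiagnosis, NewDiagnosis)
        VALUES (OLD.ID, OLD.Diagnosis, NULL);
    END;
    """
)

conn.execute("INSERT INTO Patient VALUES (1, 'Yolk sac tumor')")
conn.execute("UPDATE Patient SET Diagnosis = 'Embryonal carcinoma' WHERE ID = 1")
conn.execute("DELETE FROM Patient WHERE ID = 1")

history = conn.execute(
    "SELECT PatientID, OldDiagnosis, NewDiagnosis FROM PatientHistory ORDER BY HistoryID"
).fetchall()
for entry in history:
    print(entry)
```

After the delete, the history table still holds the last diagnosis, so an unexpected deletion could be undone from it.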
Hardware Architecture
Outside Internet
Firewall BICF Virtual Server
Clinical Server
Website Database
UTSW Internal User
Data Backup Server
Data Classification
• To standardize the input of clinical data, we classify the variables
– Basic Information
– Diagnosis
– Chemotherapy
– Radiation
– Stem Cell Transplant
– Cancer Predisposition
– Family History
– Others
Pediatric Biobank Tool
Data input and query
• Patient Search
• Patient Information Input
Data Protection
Data, protected by:
• UTSW Firewall
• Secure HTTP web access
• Clinical Server Authentication
• MySQL Database Authentication
• Sensitive Data Encrypted in Database
Other Functions
• Dynamic data summary functions
• Print specific-format record
• Monitor illegal access and email alert
• Single unexpected data deletion recovery (in one month)
BICF Help Desk
• http://www.utsouthwestern.edu/labs/bioinformatics/
• Contact us: [email protected]
• Help Desk: 10AM – 11AM daily
• Location: NB5.604