From Excel To Database · Jun 21, 2017

Bo Yao 06/2017 From Excel To Database


Page 1:

Bo Yao 06/2017

From Excel To Database

Page 2:

Outlines

• Excel vs. Database
• Fundamental Knowledge of Database
• Database Design Strategies
• Database Services Provided By BICF
  – Project Examples

Page 3:

Data Record Issues

• Most experimental data and results are recorded in Excel files

How to understand the variables and inputs recorded by other people?
How to clean, select, or combine data from several Excel files?
How to safely transfer data from a departing person to a new hire?
How to avoid typos and mismatches in Excel files?
How to control data access permissions and data usage?

………

Page 4:

Reasons for Issues

• How to understand the variables and inputs recorded by other people?
  – No codebook or dictionary
• How to clean, select, or combine data from several Excel files?
  – Weak search function in Excel
• How to safely transfer data from a departing person to a new hire?
  – No centralized data management. No code record standards
• How to avoid typos and mismatches in Excel files?
  – No self-check or validation when inputting data
• How to control data access permissions and data usage?
  – Weak data access control functions in Excel

Page 5:

Excel vs. Database

Feature | Excel File | Online Record | Online Advantage
Access Location | Local machine | Internet share | Easier for collaborators
Data Source | Multiple copies on different machines | Single data source | Easier for data version control and maintenance
Data Input | Slow and wrong-input risk | Quick and standard input | 1) Validate user's input 2) Allow batch input
Access Permission Control | Weak | Strong | Contains multiple access protections
View Change History | None | Possible | Clinical information change history is recorded
Unexpected Information Deletion | None | Can be recovered | Clinical information deletion can be recovered in a short time
Data Backup | None | Periodic data backup | Avoids data loss
Data Summary | Weak | Strong | Quickly generates summary graphs and records

Page 6:

Suggestions

• Excel
  – Quick
  – Flexible
  – Personal
  – Small projects
  – Temp / Short-term

• Database
  – Design before usage
  – Standard
  – Team work or shared
  – Large projects
  – Long-term

Page 7:

About Database

• Many Database Systems
  – Flat Database: .txt, .ini, Registry, Excel, xml
  – Relational Database: Oracle, SQL_Server, MySQL
  – Key-Value Database: Redis, Tokyo_Cabinet, Flare
  – Document Oriented Database: MongoDB, CouchDB
  – Distributed Database: Cassandra, Voldemort

Page 8:

Learn Relational Database

• A relational database (RDB) is a collective set of multiple data sets organized by tables, records and columns. RDBs establish a well-defined relationship between database tables. Tables communicate and share information, which facilitates data searchability, organization and reporting. (https://www.techopedia.com/definition/1234/relational-database-rdb)

• Top Questions
  o How to assign variables to tables?
  o How to set up constraints between these tables?
  o How to speed up search queries?

Page 9:

Example

School Management System
Database: MySQL

Page 10:

Fundamental Knowledge – Database Components

[Diagram: a Database contains Table 1, Table 2, Table 3, and Table 4; each table contains variables]

Page 11:

Fundamental Knowledge - Variables

• Name
• Type
  – String: varchar, text…
  – Number: int, float, decimal…
  – Date: date, datetime, timestamp
  – Blob
• Default value
• Is Null
• Is Auto Increment
• Is Key
  – Primary key
  – Foreign key
  – Unique key
• Charset

Page 12:

Fundamental Knowledge - Keys

• Keys let the data check and constrain itself

  – Primary Key: identifier for a row; unique in the table; automatically indexed
  – Foreign Key: value is limited to the values of a variable in another table
  – Unique Key: unique in the table
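The three key types above can be tried out in a few lines. The following is a minimal sketch, not part of the original slides: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the Department table and the DepartmentID column are hypothetical names added only to give the foreign key something to reference.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE Department (              -- hypothetical lookup table
    ID   INTEGER PRIMARY KEY,          -- primary key: identifies each row
    Name VARCHAR(255) NOT NULL UNIQUE  -- unique key: no duplicate names
);
CREATE TABLE Employee (
    PersonID     VARCHAR(10) PRIMARY KEY,
    Name         VARCHAR(255) NOT NULL,
    SSN          VARCHAR(20) UNIQUE,
    DepartmentID INTEGER REFERENCES Department(ID)  -- foreign key
);
INSERT INTO Department (ID, Name) VALUES (1, 'Clinical Sciences'), (2, 'Bioinformatics');
INSERT INTO Employee VALUES ('155556', 'Eric Yao', '111-11-1111', 1);
""")


def try_insert(row):
    """Return 'ok' if the row is accepted, otherwise the constraint error text."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO Employee VALUES (?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)


results = {
    "valid": try_insert(("155557", "Tiger Yao", "222-22-2222", 2)),
    "duplicate_primary_key": try_insert(("155556", "Someone Else", "333-33-3333", 1)),
    "duplicate_unique_key": try_insert(("155558", "Someone Else", "111-11-1111", 1)),
    "unknown_foreign_key": try_insert(("155559", "Someone Else", "444-44-4444", 99)),
}
for case, outcome in results.items():
    print(f"{case}: {outcome}")
```

Only the first row is accepted; the other three are rejected by the database itself, which is the self-check that a spreadsheet cell cannot provide.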

Page 13:

Examples - key

Employee Table

Variable | Type | Record 1 | Record 2
PersonID | Varchar(10) | 155556 | 155557
Name | Varchar(255) | Eric Yao | Tiger Yao
Birthday | Date | 12/12/2010 | 11/11/2011
SSN | Varchar(20) | 111-11-1111 | 222-22-2222
Department | Varchar(255) | Clinical Sciences | Bioinformatics
JobTitle | Varchar(255) | Web Developer I | Postdoc
… | … | … | …

Callouts in the figure: Primary Key, Unique Key

Page 14:

Examples

Employee Table

Variable | Type
PersonID | Varchar(10)
Name | Varchar(255)
Birthday | Date
SSN | Varchar(20)
Department | Varchar(255)
JobTitle | Varchar(255)
… | …

Salary Table

Variable | Type
PersonID | Varchar(10)
Salary | Decimal
… | …
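Both tables carry PersonID, so a record in one can be matched to a record in the other. A minimal sketch of that link follows, again using Python's sqlite3 as a stand-in for MySQL. Declaring Salary.PersonID as a foreign key is an assumption (the slide only shows the shared column), and the salary amounts are made-up illustration values.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Employee (
    PersonID VARCHAR(10) PRIMARY KEY,
    Name     VARCHAR(255) NOT NULL
);
CREATE TABLE Salary (
    -- Assumption: PersonID is declared as a foreign key to Employee.
    PersonID VARCHAR(10) PRIMARY KEY REFERENCES Employee(PersonID),
    Salary   DECIMAL(10, 2) NOT NULL
);
INSERT INTO Employee VALUES ('155556', 'Eric Yao'), ('155557', 'Tiger Yao');
INSERT INTO Salary VALUES ('155556', 50000), ('155557', 60000);  -- made-up amounts
""")

# Join the two tables on the shared key to read name and salary together.
rows = conn.execute("""
    SELECT e.Name, s.Salary
    FROM Employee AS e
    JOIN Salary AS s ON s.PersonID = e.PersonID
    ORDER BY e.PersonID
""").fetchall()

for name, salary in rows:
    print(name, salary)
```

Keeping salary in its own table means each value is stored once and the Employee table can be shared without exposing it.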

Page 15:

Fundamental Knowledge – Codebook and Dictionary

• A codebook summarizes the categories of a variable
• A codebook standardizes data input

Race
• Asian
• African American
• White
• …

Smoking Status
• Current Smoker
• Former Smoker
• Non Smoker
• …

Diagnosis
• Yolk sac tumor
• Embryonal carcinoma
• Choriocarcinoma
• …

Page 16:

Example of Codebook and Dictionary

Codebook: https://qbrc.swmed.edu/projects/gct/documents/GCT%20CodeBook_v3.4.pdf
Dictionary: https://qbrc.swmed.edu/projects/gct/documents/GCT_dictionary_v3.4.pdf

Page 17:

Simple Conclusions – Relational Database

• Consists of several tables
• Tables are linked by foreign keys
• Keys act as data constraints (self-check)
• A codebook / dictionary sets the data input standards

Page 18:

How to design MySQL Database

• Main considerations before design
  – Database size
  – Data loading methods
  – Data sensitivity
  – End users
  – The aims of data collection
  – User account controls
  – Data backup
  – Data encryption

Page 19:

Basic Requirements

• Data Consistency: no mismatch
• Least Redundancy: good space usage
• Scalable: potential for bigger data
• Quick Query: query performance
• Data Standards: avoid typos

Page 20:

Some rules

• Single copy for valued data
  – A valued variable only exists in one table
• Keep performance from degrading as the number of records grows
  – The number of records in one table should be less than 10^7
• Use keys / constraints to avoid wrong input
  – Link as many tables as possible
• Store atomic information in each individual cell (e.g. avoid information like 'black,white' in one Race cell)
  – Combined values in one 'cell' are difficult to search or index
• Use a codebook for categorical variables
  – Data standards
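The atomic-information rule is easier to see with a small example. The sketch below is an illustration rather than the author's design: it uses Python's sqlite3 as a stand-in for MySQL, and the table names CodeRace, Patient, and PatientRace, the MRN values, and the race assignments are all hypothetical. Instead of storing 'black,white' in one cell, each value gets its own row in a link table that points at a codebook table.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; all names and rows here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE CodeRace (                -- codebook: the allowed categories
    ID   INTEGER PRIMARY KEY,
    Race VARCHAR(40) NOT NULL UNIQUE
);
CREATE TABLE Patient (
    ID  INTEGER PRIMARY KEY,
    MRN VARCHAR(40) NOT NULL UNIQUE
);
CREATE TABLE PatientRace (             -- one row per (patient, race) pair
    PatientID INTEGER NOT NULL REFERENCES Patient(ID),
    RaceID    INTEGER NOT NULL REFERENCES CodeRace(ID),
    PRIMARY KEY (PatientID, RaceID)
);
INSERT INTO CodeRace VALUES (1, 'Black'), (2, 'White'), (3, 'Asian');
INSERT INTO Patient VALUES (1, 'MRN-A'), (2, 'MRN-B');
-- Patient 1 has two values, stored as two rows rather than 'black,white'.
INSERT INTO PatientRace VALUES (1, 1), (1, 2), (2, 2);
""")


def patients_with_race(race):
    """Return the MRNs of all patients recorded with the given race label."""
    rows = conn.execute("""
        SELECT p.MRN
        FROM Patient AS p
        JOIN PatientRace AS pr ON pr.PatientID = p.ID
        JOIN CodeRace AS c ON c.ID = pr.RaceID
        WHERE c.Race = ?
        ORDER BY p.ID
    """, (race,)).fetchall()
    return [mrn for (mrn,) in rows]


print(patients_with_race("White"))
print(patients_with_race("Black"))
```

The search is an exact match on an indexed code, so no substring matching on a combined string is needed.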

Page 21:

Database Design Practice

Database design task

• Question: Create a MySQL database 'test' to contain this information. (No data input, only schema)

Field | Sample 1 | Sample 2 | Sample 3
Sample ID (auto-increment) | 1 | 2 | 3
Patient MRN * | K3212d | Ge23ds3 | Kid02112
Surgery Date * | 03/23/2016 | 05/12/2016 | 06/12/2016
Procedure * | Surgery | Biopsy | Biopsy
Sequencing Platform | Illumina | Affymetrix | Agilent
Data Type | Row | Processed | Processed
Create Date | 05/18/2017 | 05/18/2017 | 05/18/2017

Page 22:

MySQL Tools

• MySQL management tool
  – phpMyAdmin

• Database client tools
  – DbVisualizer
  – DataGrip

Page 23:

Codebook Tables

• CodeProcedure

    CREATE TABLE CodeProcedure (
      ID int(2) NOT NULL,
      Proc varchar(40) NOT NULL,
      PRIMARY KEY (ID),
      UNIQUE KEY Proc (Proc)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

• CodeSeqPlatform

    CREATE TABLE CodeSeqPlatform (
      ID int(2) NOT NULL,
      SeqPlatform varchar(40) NOT NULL,
      PRIMARY KEY (ID),
      UNIQUE KEY SeqPlatform (SeqPlatform)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

• CodeTypeData

    CREATE TABLE CodeTypeData (
      ID int(2) NOT NULL,
      TypeData varchar(40) NOT NULL,
      PRIMARY KEY (ID),
      UNIQUE KEY TypeData (TypeData)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Page 24:

Sample Table

    CREATE TABLE Sample (
      ID int(10) unsigned NOT NULL AUTO_INCREMENT,
      MRN varchar(40) NOT NULL,
      DateSurgery date NOT NULL,
      Proc int(2) NOT NULL,
      SeqPlatform int(2) DEFAULT NULL,
      TypeData int(2) DEFAULT NULL,
      CreateDate date DEFAULT NULL,
      PRIMARY KEY (ID)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Page 25:

Check Database Schema

[Schema diagram, created by DbVisualizer]

Page 26:

Add Data Constraints

• Add foreign keys

    ALTER TABLE Sample ADD CONSTRAINT s_procedure
      FOREIGN KEY (Proc) REFERENCES CodeProcedure(ID);

    ALTER TABLE Sample ADD CONSTRAINT s_seqplatform
      FOREIGN KEY (SeqPlatform) REFERENCES CodeSeqPlatform(ID);

    ALTER TABLE Sample ADD CONSTRAINT s_typedata
      FOREIGN KEY (TypeData) REFERENCES CodeTypeData(ID);
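To see what these foreign keys do in practice, the schema can be exercised end to end. The following is a rough sketch under stated assumptions, not the deck's MySQL setup: it uses Python's sqlite3 in place of MySQL, declares the foreign keys inline in CREATE TABLE because SQLite does not support ALTER TABLE ... ADD CONSTRAINT, and picks the code IDs arbitrarily. The first data type is written as 'Raw' on the assumption that 'Row' in the task table is a typo.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; code IDs below are arbitrary choices.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE CodeProcedure   (ID INTEGER PRIMARY KEY, Proc        VARCHAR(40) NOT NULL UNIQUE);
CREATE TABLE CodeSeqPlatform (ID INTEGER PRIMARY KEY, SeqPlatform VARCHAR(40) NOT NULL UNIQUE);
CREATE TABLE CodeTypeData    (ID INTEGER PRIMARY KEY, TypeData    VARCHAR(40) NOT NULL UNIQUE);
CREATE TABLE Sample (
    ID          INTEGER PRIMARY KEY AUTOINCREMENT,
    MRN         VARCHAR(40) NOT NULL,
    DateSurgery DATE NOT NULL,
    Proc        INTEGER NOT NULL REFERENCES CodeProcedure(ID),
    SeqPlatform INTEGER REFERENCES CodeSeqPlatform(ID),
    TypeData    INTEGER REFERENCES CodeTypeData(ID),
    CreateDate  DATE
);
INSERT INTO CodeProcedure VALUES (1, 'Surgery'), (2, 'Biopsy');
INSERT INTO CodeSeqPlatform VALUES (1, 'Illumina'), (2, 'Affymetrix'), (3, 'Agilent');
INSERT INTO CodeTypeData VALUES (1, 'Raw'), (2, 'Processed');
""")


def add_sample(mrn, date_surgery, proc_id, platform_id=None, type_id=None, create_date=None):
    """Insert one sample and return its auto-generated ID."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO Sample (MRN, DateSurgery, Proc, SeqPlatform, TypeData, CreateDate) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (mrn, date_surgery, proc_id, platform_id, type_id, create_date),
        )
    return cur.lastrowid


first_id = add_sample("K3212d", "2016-03-23", 1, 1, 1, "2017-05-18")

# Procedure code 9 is not in the codebook, so the foreign key rejects the row.
try:
    add_sample("Ge23ds3", "2016-05-12", 9)
    rejected = None
except sqlite3.IntegrityError as exc:
    rejected = str(exc)

# Join back to the codebook tables to read labels instead of numeric codes.
rows = conn.execute("""
    SELECT s.ID, s.MRN, p.Proc, q.SeqPlatform, t.TypeData
    FROM Sample AS s
    JOIN CodeProcedure AS p ON p.ID = s.Proc
    LEFT JOIN CodeSeqPlatform AS q ON q.ID = s.SeqPlatform
    LEFT JOIN CodeTypeData AS t ON t.ID = s.TypeData
""").fetchall()
print(rows, rejected)
```

The Sample table stores only small integer codes; the readable labels live once in the codebook tables and are recovered with a join.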

Page 27:

Final Database Schema

[Schema diagram, created by DbVisualizer]

Page 28:

Quick Summary

• Codebook
• Meaningful naming
• Data type selection
• Key selection

Page 29:

Database Services From BICF

• Help desk for consulting
  – Database design
  – Web portal design and development
  – Training

• Complete service for design and implementation
  – Database: database design, data loading, maintenance, and periodic backup
  – Web portal: design, development, deployment, and maintenance

Page 30:

Project Example

• Help Desk - Nutrition Center
  – Database: help with database design to speed up data queries
  – Website Security: code checking to enhance website security
  – Website Enhancement: advice on the web user interface and functions to improve usability

Page 31:

Project Management

• Complete service – Children's Hospital

• Pediatric Biobank
  – Record patients' clinical data
  – Database and Web Portal

Page 32:

Pediatric Biobank

• Secure Account System
• User-friendly Data Input and Search
• Track Account Login History
• Track Clinical Data Change History

[Diagram labels: Collaborators, Online Record Tool]

Page 33:

Hardware Architecture

[Diagram components: Outside Internet, Firewall, BICF Virtual Server, Clinical Server, Website, Database, UTSW Internal User, Data Backup Server]

Page 34:

Data Classification

• To standardize the input of clinical data, we classify the variables into categories:
  – Basic Information
  – Diagnosis
  – Chemotherapy
  – Radiation
  – Stem Cell Transplant
  – Cancer Predisposition
  – Family History
  – Others

Page 35:

Pediatric Biobank Tool

Data input and query

[Screenshots: Patient Search, Patient Information Input]

Page 36:

Data Protection

[Diagram: the data is surrounded by the following protections]

• UTSW Firewall
• Secure HTTP web access
• Clinical Server Authentication
• MySQL Database Authentication
• Sensitive Data Encrypted in Database

Page 37:

Other Functions

Dynamic Data Summary

Functions
• Print specific-format record
• Monitor illegal access and email alert
• Single unexpected data deletion recovery (in one month)

Page 38:

BICF Help Desk

• http://www.utsouthwestern.edu/labs/bioinformatics/

• Contact us
  [email protected]
  Help Desk: 10AM – 11AM daily.
  Location: NB5.604