Handling Large Amounts of Biological Data Xiaobin Guan, Ph.D. Senior Oracle DBA/Bioinformatician...

Handling Large Amounts of Biological Data

Xiaobin Guan, Ph.D.
Senior Oracle DBA/Bioinformatician

National Institutes of Health

Session ID: 40364

Introduction

Bioinformatics
In Silico
Large Database
DNA Sequence
Using CLOB
Using Partition Tables

NISC Database Environment

NIH Intramural Sequencing Center
Established in 1997
A multi-disciplinary genomics facility
Large-scale DNA sequencing
Applied Biosystems (ABI) DNA Analyzers
Produces 10,000 DNA sequences per day

NISC Pipeline

The Laboratory Information Management System (LIMS).

Move the sequencing data from each PC to a partition (/area1) on our main Unix Server.

A Perl script is then run to validate the trace name and run folder name, and to check for duplicates. The data are then moved to another partition (/area2).

Phred is run on each trace file to get rid of the low quality bases at the beginning and end of each read.

NISC Pipeline

Vector screening is then performed on each read, and the regions where vector is found are masked out.

Contaminant checking uses BLAST to screen for contaminants. The information about contamination is then stored in the database.

A QC report is generated to show the quality and other information.

Why CLOB?

To store DNA sequences
Combination of 'ACGT' character strings
The length can be more or less than 4 KB

LOBs vs. LONG/LONG RAW

Feature                          LONG, LONG RAW   LOB
-------------------------------  ---------------  ----------
Columns of this type per table   1                Multiple
Capacity                         Up to 2 GB       Up to 4 GB
Data stored out-of-line          No               Yes
Object type support              No               Yes
Random piece-wise access         No               Yes

A Simple Create Table Statement

CREATE TABLE dna_sequence1
  (base_id       NUMBER(6),
   base_sequence CLOB)
TABLESPACE example;
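As a minimal sketch of how such a table might be used, a short sequence can be inserted as an ordinary string literal and its length read back with DBMS_LOB.GETLENGTH. The base_id value and the sequence below are made up for illustration and do not come from the NISC data.

```sql
-- Illustrative row only: base_id 1 and this short sequence are made-up values.
INSERT INTO dna_sequence1 (base_id, base_sequence)
VALUES (1, 'actcggtactgggacccatgtggtggatttc');

-- Read the row back together with the length of the stored CLOB, in characters.
SELECT base_id,
       DBMS_LOB.GETLENGTH(base_sequence) AS seq_length
FROM   dna_sequence1
WHERE  base_id = 1;
```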

Specify the Segment Name, and LOB Storage

CREATE TABLE dna_sequence2
  (base_id       NUMBER(6),
   base_sequence CLOB)
LOB (base_sequence) STORE AS dna_seq_lob
  (TABLESPACE lob_seg_ts)
TABLESPACE example;

Specify the Index Name and Index Storage

CREATE TABLE dna_sequence3
  (base_id       NUMBER(6),
   base_sequence CLOB)
LOB (base_sequence) STORE AS dna_seq_lob1
  (TABLESPACE lob_seg_ts
   INDEX dna_seq_clob_idx (TABLESPACE nisc_index))
TABLESPACE example;

Check Segment and Index Name

SELECT table_name, column_name, segment_name, index_name
FROM user_lobs;

TABLE_NAME      COLUMN_NAME     SEGMENT_NAME                INDEX_NAME
--------------- --------------- --------------------------- ------------------------
DNA_SEQUENCE1   BASE_SEQUENCE   SYS_LOB0000040338C00002$$   SYS_IL0000040338C00002$$
DNA_SEQUENCE2   BASE_SEQUENCE   DNA_SEQ_LOB                 SYS_IL0000040341C00002$$
DNA_SEQUENCE3   BASE_SEQUENCE   DNA_SEQ_LOB1                DNA_SEQ_CLOB_IDX

Query the Table

SELECT * FROM dna_sequence WHERE base_id = 20;

20 actcggtactgggacccatgtggtggatttctatccttgaagctgcacgtaaagacccggtttttgcgggtatctctgataatgccaccgctcaaatcgctacagcgtgggcaagtgcactggctgactacgccgcagcacataaatctatgccgcgtccggaaattctggcctcctgccaccagacgctggaaaactgcctgatagagtccacccgcaatagcatggatgccactaataaagcgatgctggaatctgtcgcagcagagatgatgagcgtttctgacggtgttatgcgtctgcctttattcctcgcgatgatcctgcctgttcagttgggggcagctaccgctgatgcgtgtaccttcattccggttacgcgtgaccagtccgacatctatgaagtctttaacgtggcaggttcatcttttggttcttatgctgctggtgatgttctggacatgcaatccgtcggtgtgtacagccagttacgtcgccgctatgtgctggtggcaagctccgatggcaccagcaaaaccgcaaccttcaagatggaagacttcgaaggccagaatgtaccaatccgaaaaggtcgcactaacatctacgttaaccgtattaagtctgttgttgataacggttccggcagcctacttcactcgtttactaatgctgctggtgagcaaatcactgttacctgctctctgaactacaacattggtcagattgccctgtcgttctccaaagcgccggataaaagcactgagatcgcaattgagacggaaatcaatattgaagccggctctgagctgatcccgctgatcacca

In-line or Out-of-line Storage

In-line: ENABLE STORAGE IN ROW
Out-of-line: DISABLE STORAGE IN ROW
Tablespaces
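One way to confirm which setting a LOB column actually uses is the IN_ROW column of USER_LOBS. The query below is a sketch that assumes the tables live in the current schema. With ENABLE STORAGE IN ROW, Oracle keeps LOB values of up to roughly 4,000 bytes in the table row and moves larger values to the LOB segment.

```sql
-- IN_ROW is YES for ENABLE STORAGE IN ROW and NO for DISABLE STORAGE IN ROW.
-- CHUNK is the LOB chunk size in bytes.
SELECT table_name, column_name, segment_name, in_row, chunk
FROM   user_lobs
ORDER  BY table_name, column_name;
```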

CLOB Usage

Table structure

– The table contains two CLOB columns:
   BASECALLS stores the DNA sequences
   BASEQUALS stores the quality scores of each sequence

– The length of both fields varies from a few hundred to about 6,000 characters

Test Protocol

Create tablespaces
– Four for the 4 tables, and two for LOB storage

Create four test tables
– T1, in-line, one tablespace
– T2, in-line, two tablespaces
– T3, out-of-line, one tablespace
– T4, out-of-line, two tablespaces

Test Table 1 (T1)

CREATE TABLE T1
  (CALL_ID   NUMBER(10) NOT NULL,
   TRACE_ID  NUMBER(10) NOT NULL,
   BASECALLS CLOB       NOT NULL,
   BASEQUALS CLOB)
TABLESPACE "TEST_CALL1"
LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL1" ENABLE STORAGE IN ROW)
LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL1" ENABLE STORAGE IN ROW);

Test Table 2 (T2)

...
LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB1" ENABLE STORAGE IN ROW)
LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB1" ENABLE STORAGE IN ROW);

Test Table 3 (T3)

...
LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL3" DISABLE STORAGE IN ROW)
LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL3" DISABLE STORAGE IN ROW);

Test Table 4 (T4)

...
LOB("BASECALLS") STORE AS (TABLESPACE "TEST_CALL_LOB2" DISABLE STORAGE IN ROW)
LOB("BASEQUALS") STORE AS (TABLESPACE "TEST_CALL_LOB2" DISABLE STORAGE IN ROW);

Results

In-line/out-of-line                       IN-LINE               OUT-OF-LINE
Tablespace usage                          One TS   Two TS       One TS   Two TS
----------------------------------------  -------  -----------  -------  -------------
Table name                                T1       T2           T3       T4
Initial space used (MB)                   6        7 (2+5)      6        7 (2+5)
Space used after 10000 row insert (MB)    46       47 (42+5)    162      163 (2+161)
Total insert time (sec)                   10       11           47       48
Ranking                                   1        2            3        4
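The slides do not say how the space figures were collected. One possible way to get comparable numbers, assuming the four test tables are owned by the current user, is to sum the table and LOB segments from USER_SEGMENTS:

```sql
-- Size of each test table segment and of the LOB segments that belong to it.
-- Assumes T1..T4 exist in the current schema; this is not the author's script.
SELECT s.segment_name,
       s.segment_type,
       s.tablespace_name,
       ROUND(s.bytes / 1024 / 1024, 1) AS size_mb
FROM   user_segments s
WHERE  s.segment_name IN ('T1', 'T2', 'T3', 'T4')
   OR  s.segment_name IN (SELECT l.segment_name
                          FROM   user_lobs l
                          WHERE  l.table_name IN ('T1', 'T2', 'T3', 'T4'))
ORDER  BY s.tablespace_name, s.segment_name;
```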

DBMS_LOB Package

Functions/Procedures to Read or Return LOB Values

Subprogram        F/P  Description
----------------  ---  ------------------------------------------------------------
COMPARE()         F    Compares the values of two LOBs
GETCHUNKSIZE()    F    Gets the chunk size used when reading and writing. This works only on internal LOBs and does not apply to external LOBs (BFILEs).
GETLENGTH()       F    Gets the length of the LOB value
INSTR()           F    Returns the matching position of the nth occurrence of the pattern in the LOB
READ()            P    Reads data from the LOB starting at the specified offset
SUBSTR()          F    Returns part of the LOB value starting at the specified offset

Functions/Procedures to Write LOB Values

Subprogram            F/P  Description
--------------------  ---  ------------------------------------------------------
APPEND()              P    Appends the LOB value to another LOB
COPY()                P    Copies all or part of a LOB to another LOB
ERASE()               P    Erases part of a LOB, starting at a specified offset
LOADFROMFILE()        P    Loads BFILE data into an internal LOB
LOADCLOBFROMFILE()    P    Loads character data from a file into a LOB
LOADBLOBFROMFILE()    P    Loads binary data from a file into a LOB
TRIM()                P    Trims the LOB value to the specified shorter length
WRITE()               P    Writes data to the LOB at a specified offset
WRITEAPPEND()         P    Writes data to the end of the LOB

Functions/Procedures for BFILEs

Subprogram        F/P  Description
----------------  ---  ------------------------------------------------------------
FILECLOSE()       P    Closes the file. Use CLOSE() instead.
FILECLOSEALL()    P    Closes all previously opened files
FILEEXISTS()      F    Checks if the file exists on the server
FILEGETNAME()     P    Gets the directory alias and file name
FILEISOPEN()      F    Checks if the file was opened using the input BFILE locators. Use ISOPEN() instead.
FILEOPEN()        P    Opens a file. Use OPEN() instead.

Call Functions in SQL

SELECT dbms_lob.getlength(base_sequence) FROM dna_sequence1;

DBMS_LOB.GETLENGTH(BASE_SEQUENCE)
---------------------------------
                              878
                             1269
                              893
                              872
                              961
                              807
                              806
                              808
                              833
                              837

10 rows selected.
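Other DBMS_LOB functions can be called from SQL in the same way. As a sketch, the query below uses DBMS_LOB.INSTR to locate a short motif in each stored sequence. The motif 'gaattc' (the EcoRI recognition site) is chosen only for illustration and is not taken from the slides.

```sql
-- Position of the first occurrence of the motif in each sequence.
-- Arguments: LOB, pattern, starting offset, nth occurrence.
-- A result of 0 means the motif was not found.
SELECT base_id,
       DBMS_LOB.INSTR(base_sequence, 'gaattc', 1, 1) AS first_match_pos
FROM   dna_sequence1;
```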

Call Procedures in PL/SQL

DECLARE
  v_dna_seq    CLOB;
  v_seq_amt    BINARY_INTEGER := 10;
  v_seq_buffer VARCHAR2(10);
BEGIN
  v_dna_seq := 'atctcgagtagctgaagctccaatgntggtggaattcacgagttgctt';
  DBMS_LOB.READ (v_dna_seq, v_seq_amt, 1, v_seq_buffer);
  DBMS_OUTPUT.PUT_LINE('The first 10 bases for this DNA sequence are: ' || v_seq_buffer);
END;
/

The first 10 bases for this DNA sequence are: atctcgagta

PL/SQL procedure successfully completed.
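The write procedures follow the same pattern, except that a LOB stored in a table has to be locked before it is changed. The block below is a sketch that assumes a row with base_id = 1 and a non-NULL base_sequence already exists in dna_sequence1. The appended bases are made up.

```sql
DECLARE
  v_dna_seq CLOB;
  v_piece   VARCHAR2(20) := 'ggatcc';  -- illustrative bases, not real data
BEGIN
  -- Select the LOB locator FOR UPDATE so the row is locked before writing.
  SELECT base_sequence
  INTO   v_dna_seq
  FROM   dna_sequence1
  WHERE  base_id = 1
  FOR UPDATE;

  -- Append the piece to the end of the stored sequence.
  DBMS_LOB.WRITEAPPEND(v_dna_seq, LENGTH(v_piece), v_piece);

  COMMIT;
END;
/
```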

Substr vs. dbms_lob.substr

SUBSTR(the_string, from_character, number_of_characters)

DBMS_LOB.SUBSTR(the_string, number_of_characters, from_character)

Substr vs. dbms_lob.substr

CREATE TABLE substring (str VARCHAR2(20), lob CLOB);
INSERT INTO substring VALUES ('Oracle10G', 'Oracle10G');

ow03@NISCDEV.NHGRI.NIH.GOV> SELECT substr(str, 7, 3), dbms_lob.substr(lob, 7, 3) lob FROM substring;

SUB LOB
--- ----------
10G acle10G
10G acle10G

ow03@NISCDEV.NHGRI.NIH.GOV> SELECT substr(str, 7, 3), dbms_lob.substr(lob, 3, 7) lob FROM substring;

SUB LOB
--- ----------
10G 10G
10G 10G
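Because the two functions take their arguments in opposite order, one way to avoid mixing them up in PL/SQL is to call DBMS_LOB.SUBSTR with named parameters. This is a small sketch using the same 'Oracle10G' string as above:

```sql
SET SERVEROUTPUT ON

DECLARE
  v_lob CLOB := 'Oracle10G';
BEGIN
  -- Named notation makes the amount/offset order explicit:
  -- 3 characters starting at character 7.
  DBMS_OUTPUT.PUT_LINE(
    DBMS_LOB.SUBSTR(lob_loc => v_lob, amount => 3, offset => 7));  -- prints 10G
END;
/
```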

Lob Usage Limitation

A LOB column cannot be used:

In an ORDER BY or GROUP BY clause, or in an aggregate function.
In a SELECT... DISTINCT or SELECT... UNIQUE statement, or in a join.
In ANALYZE... COMPUTE or ANALYZE... ESTIMATE statements.
As a primary key column.
In a SELECT through a database link (ORA-22992: cannot use LOB locators selected from remote tables).
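Some of these limits can often be worked around by operating on a scalar value derived from the LOB rather than on the LOB itself. The statements below are sketches, not part of the original talk. The database link name remote_db is hypothetical, and the 10-character prefix is an arbitrary choice.

```sql
-- Sort by a number derived from the LOB instead of by the LOB itself.
SELECT base_id,
       DBMS_LOB.GETLENGTH(base_sequence) AS seq_length
FROM   dna_sequence1
ORDER  BY seq_length DESC;

-- Group on a VARCHAR2 prefix of the LOB (the first 10 bases here).
SELECT DBMS_LOB.SUBSTR(base_sequence, 10, 1) AS prefix,
       COUNT(*)                              AS n
FROM   dna_sequence1
GROUP  BY DBMS_LOB.SUBSTR(base_sequence, 10, 1);

-- Copy LOB data from a remote table with INSERT ... SELECT instead of
-- selecting the remote LOB locator directly.
-- remote_db is a hypothetical database link name.
INSERT INTO dna_sequence1 (base_id, base_sequence)
SELECT base_id, base_sequence
FROM   dna_sequence1@remote_db;
```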

Partitioning and Its Usage Scenarios at NISC

Partition Method

Range partitioning, introduced in Oracle 8.
Hash partitioning, introduced in 8i.
List partitioning, introduced in 9i Release 1.
Composite partitioning. Range-hash partitioning was introduced in 8i, and range-list partitioning was introduced in 9i Release 2.

This is a good example of how Oracle adds functionality with each new release.

Benefit of Partitioning

The time needed for each operation can be significantly reduced because each segment is smaller.
Improved query performance.
I/O can be balanced among disks.
Reduced downtime.
Part of the table can be put into read-only mode.
Easy to implement.

When to Partition

When a table becomes large. 2 GB is a common general guideline.

When the data is additive, meaning new data goes into a new partition.
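As a rough way to apply the 2 GB guideline, the segments in a schema can be listed by size. This sketch assumes the candidate tables are owned by the current user:

```sql
-- Table segments larger than 2 GB in the current schema, largest first.
SELECT segment_name,
       segment_type,
       ROUND(bytes / 1024 / 1024 / 1024, 2) AS size_gb
FROM   user_segments
WHERE  segment_type = 'TABLE'
AND    bytes > 2 * 1024 * 1024 * 1024
ORDER  BY bytes DESC;
```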

Work with Range Partition

Create a table with range partitioning.
Convert a non-partitioned table to a partitioned table.
Merge/split partitions.
Tablespace usage with partitions.
Maintain range partitions.

Partitioning Usage Examples

Create tablespace
Create table
Add partition
Drop partition
Exchange partition
Move partition
Merge partition
Split partition
Truncate partition
Rename partition

Create Partitioned Table

CREATE TABLE dna_sequence
  (base_id       NUMBER(6),
   base_sequence CLOB)
LOB (base_sequence) STORE AS dna_seq_lob2
TABLESPACE example
PARTITION BY RANGE (base_id)
  (PARTITION dna_sequence1 VALUES LESS THAN (100) TABLESPACE dna_sequence_p1,
   PARTITION dna_sequence2 VALUES LESS THAN (200) TABLESPACE dna_sequence_p2,
   PARTITION dna_sequence3 VALUES LESS THAN (300) TABLESPACE dna_sequence_p3);
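Once the table is range-partitioned on base_id, a query can either name a partition directly or rely on the optimizer to skip partitions that cannot contain matching rows. Both forms are sketched below against the table just defined:

```sql
-- Read one partition explicitly by name.
SELECT COUNT(*)
FROM   dna_sequence PARTITION (dna_sequence2);

-- Or filter on the partition key: base_id values 100 to 199 all fall in
-- partition dna_sequence2, so the other partitions can be pruned.
SELECT base_id,
       DBMS_LOB.GETLENGTH(base_sequence) AS seq_length
FROM   dna_sequence
WHERE  base_id BETWEEN 100 AND 199;
```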

Query the Partitioned Table

SELECT table_name, partition_name, tablespace_name, high_value
FROM user_tab_partitions
ORDER BY partition_name;

TABLE_NAME       PARTITION_NAME       TABLESPACE_NAME      HIGH_VALUE
---------------- -------------------- -------------------- ----------
DNA_SEQUENCE     DNA_SEQUENCE1        DNA_SEQUENCE_P1      100

Add Partition

ALTER TABLE dna_sequence
  ADD PARTITION dna_sequence4 VALUES LESS THAN (400)
  TABLESPACE dna_sequence_p1;

Drop Partition

ALTER TABLE dna_sequence DROP PARTITION dna_sequence4;

Run partition.sql;


Exchange Partition

CREATE TABLE dna_sep03
  AS SELECT *
  FROM dna_sequence
  WHERE 1=2;

ALTER TABLE dna_sequence
  EXCHANGE PARTITION dna_sequence3 WITH TABLE dna_sep03;

Move Partition

ALTER TABLE dna_sequence
  MOVE PARTITION dna_sequence4 TABLESPACE dna_sequence_p2 NOLOGGING;

Split Partition

ALTER TABLE dna_sequence
  SPLIT PARTITION dna_sequence4 AT (350)
  INTO (PARTITION dna_sequence4 TABLESPACE dna_sequence_p1,
        PARTITION dna_sequence5 TABLESPACE dna_sequence_p2)
  PARALLEL (DEGREE 5);

TABLE_NAME        PARTITION_NAME       TABLESPACE_NAME      HIGH_VALUE
----------------- -------------------- -------------------- ----------
DNA_SEQUENCE      DNA_SEQUENCE1        DNA_SEQUENCE_P1      100
DNA_SEQUENCE      DNA_SEQUENCE2        DNA_SEQUENCE_P2      200
DNA_SEQUENCE      DNA_SEQUENCE3        DNA_SEQUENCE_P3      300
DNA_SEQUENCE      DNA_SEQUENCE4        DNA_SEQUENCE_P1      350
DNA_SEQUENCE      DNA_SEQUENCE5        DNA_SEQUENCE_P2      400
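Merging is the reverse of splitting. The slides list it but show no statement, so the following is only a sketch of how the two partitions produced by the split above could be merged back into one. The target partition name and tablespace are choices made here for illustration.

```sql
-- Merge two adjacent range partitions into one.
-- The merged partition takes the higher upper bound (400 here); for range
-- partitions it may reuse the name of the higher-bound partition.
ALTER TABLE dna_sequence
  MERGE PARTITIONS dna_sequence4, dna_sequence5
  INTO PARTITION dna_sequence5
  TABLESPACE dna_sequence_p2;
```

Indexes on the affected partitions may need to be rebuilt after a merge.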

Truncate Partition

ALTER TABLE dna_sequence
  TRUNCATE PARTITION dna_sequence4 DROP STORAGE;

Rename Partition/Table

Rename partition
– ALTER TABLE dna_sequence RENAME PARTITION dna_sequence4 TO dna_sequence5;

Rename table
– ALTER TABLE dna_sequence RENAME TO dna_seq;
– RENAME dna_seq TO dna_sequence;

Conclusion

With proper use of Oracle features such as CLOBs and table partitioning, it becomes much easier to manage a database containing large amounts of biological data.

Major Benefits using CLOB and Partitioning at NISC

Space savings: proper use of CLOB
Better performance: put big tables into smaller segments
Better maintenance: easier backup and recovery; less downtime

Q&A

Questions and Answers

Reminder – please complete the OracleWorld online session survey

Thank you.

Xiaobin Guan, Ph.D.
NISC/NIH
Xiaobin_Guan@nih.gov
