Shaobing Su Supervisor: Dr. Lawrence B. Holder Committee: Dr. Diane J. Cook


Page 1: Title
Applications of knowledge discovery to molecular biology: Identifying structural regularities in proteins
Shaobing Su
Supervisor: Dr. Lawrence B. Holder
Committee: Dr. Diane J. Cook, Dr. Edward Bellion

Page 2: Outline
Motivation and goal of the research
SUBDUE knowledge discovery system
Proteins and PDB
Methods and results
Discussion and conclusion
Future research

Page 3: Motivation and Goal
The explosive amount of molecular biology information needs to be analyzed to help understand the underlying structure-function relationship in proteins and other macromolecules.
Apply SUBDUE to the Brookhaven Protein Data Bank (PDB) to identify biologically meaningful patterns.

Page 4: SUBDUE knowledge discovery system
SUBDUE discovers patterns (substructures) in structural data sets
SUBDUE represents data as a labeled graph
Inputs: vertices and edges
Outputs: discovered patterns and instances

Page 5: Example
[Figure: example graph; recoverable labels: object, triangle, square, shape, on]
Vertices: objects or attributes
Edges: relationships
4 instances of [substructure shown in the figure]

Page 6: SUBDUE's search algorithm
Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set
DL of the graph: the number of bits necessary to completely describe the graph
Search for the substructure that results in the maximum compression
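As a rough illustration of the compression idea on this slide, the sketch below scores a candidate substructure by how much smaller the graph becomes once each instance is collapsed into a single vertex. It uses graph size (vertices plus edges) as a stand-in for description length in bits. That simplification, and the function names `graph_size` and `compression`, are assumptions of this example and not SUBDUE's actual bit-level encoding.

```python
def graph_size(num_vertices: int, num_edges: int) -> int:
    """Crude proxy for description length: count vertices plus edges."""
    return num_vertices + num_edges


def compression(graph_v, graph_e, sub_v, sub_e, instances):
    """Ratio size(G) / (size(S) + size(G|S)); larger means better compression.

    G|S is G with each non-overlapping instance of S collapsed to one vertex.
    Edges linking an instance to the rest of G are kept, so each instance
    removes (sub_v - 1) vertices and sub_e edges.
    """
    original = graph_size(graph_v, graph_e)
    compressed_v = graph_v - instances * (sub_v - 1)
    compressed_e = graph_e - instances * sub_e
    compressed = graph_size(compressed_v, compressed_e)
    return original / (graph_size(sub_v, sub_e) + compressed)


# A 20-vertex chain (19 edges) with 4 instances of a 2-vertex, 1-edge pattern.
print(compression(20, 19, 2, 1, 4))
```

A value above 1 means that describing the substructure once plus the collapsed graph is smaller than the original graph. With zero instances the ratio falls below 1, because the substructure definition is pure overhead.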

Page 7: Inexact graph match approach
Find instances with a slight distortion: insertion, deletion, and substitution of edges/vertices.
Threshold parameter: specifies the amount of distortion allowed.
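The threshold idea can be sketched for the special case of linear chains of labeled vertices, which is the shape of the secondary structure patterns reported later in this talk. The example below swaps in a plain edit distance over vertex labels for a real inexact graph match and accepts a candidate when its distortion is at most threshold times the pattern size. Both the edit-distance stand-in and the size definition (vertices plus connecting edges) are assumptions of this sketch, not SUBDUE's matching routine.

```python
def edit_distance(a, b):
    """Levenshtein distance between two label sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute x by y
        prev = curr
    return prev[-1]


def matches(pattern, candidate, threshold):
    """Accept the candidate if its distortion is within threshold * pattern size."""
    size = 2 * len(pattern) - 1  # vertices plus connecting edges of a chain
    return edit_distance(pattern, candidate) <= threshold * size


pattern = ["h_1_14", "h_1_15", "h_1_6"]
candidate = ["h_1_15", "h_1_15", "h_1_6"]
print(matches(pattern, candidate, 0.0), matches(pattern, candidate, 0.2))
```

With a threshold of 0.0 only exact copies are accepted; raising the threshold admits candidates that differ by a few substitutions, insertions, or deletions.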

Page 8: Overview of proteins
Most important biomolecule
Composed from 20 amino acids
Structural hierarchy
Very diverse structure and function

Page 9: Structural hierarchy in proteins
Primary structure (sequence of protein)
Secondary structure (helix, sheet, random)
Tertiary structure (3-D)

Page 10: Primary structure of proteins
Average 100-150 residues (a.a.) linked head to tail
N-terminus and C-terminus
Peptide bond, alpha-carbon
[Figure: dipeptide backbone H3N(+) - C1 - C - N - C2 - C - O(-) with side chains R1 and R2; the peptide bond joins the first a.a. (N-terminus) to the second a.a. (C-terminus)]

Page 11: Secondary structure elements
Ordered backbone arrangement: helix and sheet
Helix (0% to 90%; average 11 a.a.; several types)
Sheet (2 to 15 strands per sheet; parallel and anti-parallel; average 6 a.a. per strand)
[Figures: right-handed α-helix; two-stranded parallel β-sheet; two-stranded anti-parallel β-sheet]

Page 12: Tertiary structure of proteins
Highly complicated 3-D arrangement
Folding of its secondary structure elements

Page 13: Brookhaven Protein Data Bank (PDB)
Brookhaven National Laboratory
Over 6000 experimentally determined 3-D structures of biomolecules
Majority: protein structures

Page 14: Contents of PDB
SEQRES: sequence of a.a. (three-letter code)
HELIX: starting, ending, and type
SHEET: starts, ends, sense
ATOM: (x, y, z) coordinates for each atom in the protein

Page 15: Applications of SUBDUE to PDB - Methods and Results
July 1997 PDB™ release (6000 PDB)
Global data set (4000 PDB)
Category data sets: hemoglobin, myoglobin, ribonuclease A

Page 16: Flowchart of Research
[Flowchart with two stages, Preprocessing and Application; boxes: Brookhaven PDB, Graphic representation, Inputs to SUBDUE, Patterns in Category, Instance mapping, Patterns in Global others]

Page 17: Preprocessing
Compile PDB list for each category
model.c: extract first model
seq.c: extract sequence info and convert to graphic format
secondary.c: extract secondary structure info and convert to graphic format
coor.c: extract 3D coordinates and convert to graphic format

Page 18: Primary structure and its representation
Sample PDB lines:
SEQRES 1 150 ALA ASN LYS THR 1ASH 139
SEQRES 2 150 LYS SER LEU GLU 1ASH 140
Sequence (N-terminus to C-terminus): ALA ASN LYS THR LYS SER LEU GLU
SUBDUE graphic input (ALA ASN):
v 1 ALA - - - ALA residue
v 2 ASN - - - ASN residue
e 1 2 bond - - - a peptide bond between ALA and ASN
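The conversion on this slide can be sketched in a few lines of Python. This is not the original seq.c; it assumes the simplified, whitespace-separated SEQRES lines shown on the slide rather than the fixed-column PDB format, and the function names are hypothetical.

```python
def residues_from_seqres(lines):
    """Collect three-letter residue codes from slide-style SEQRES lines."""
    residues = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == "SEQRES":
            # Skip the record name, serial number and residue count; keep
            # only tokens that look like three-letter residue codes.
            residues += [t for t in tokens[3:]
                         if len(t) == 3 and t.isalpha() and t.isupper()]
    return residues


def sequence_to_subdue(residues):
    """One vertex per residue, one 'bond' edge per peptide bond."""
    lines = [f"v {i} {name}" for i, name in enumerate(residues, start=1)]
    lines += [f"e {i} {i + 1} bond" for i in range(1, len(residues))]
    return lines


sample = [
    "SEQRES 1 150 ALA ASN LYS THR 1ASH 139",
    "SEQRES 2 150 LYS SER LEU GLU 1ASH 140",
]
print("\n".join(sequence_to_subdue(residues_from_seqres(sample)[:2])))
```

For the first two residues this prints the same three lines as the slide, without the trailing comments.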

Page 19: Secondary structure and its representation - HELIX
Sample PDB lines (starting, ending, type):
HELIX 1 ASN 1 HIS 13 1
HELIX 2 ASN 20 ASN 36 1
Vertex: h_type_length
Helix length: Hlength = SeqNum(last a.a.) - SeqNum(first a.a.)
SUBDUE graphic input:
v 1 h_1_12 - - - helix 1, type 1, length 12
v 2 h_1_16 - - - helix 2, type 1, length 16
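As a small worked example of the Hlength formula, the sketch below builds the vertex label from one slide-style HELIX line. It assumes the simplified seven-field layout shown on the slide (record, serial, start residue, start number, end residue, end number, type), not the real fixed-column PDB record, and `helix_vertex` is a hypothetical name.

```python
def helix_vertex(line):
    """Turn 'HELIX serial res1 seq1 res2 seq2 type' into 'h_type_length'."""
    _record, _serial, _res1, start_seq, _res2, end_seq, helix_type = line.split()
    length = int(end_seq) - int(start_seq)  # Hlength as defined on the slide
    return f"h_{helix_type}_{length}"


for i, line in enumerate(["HELIX 1 ASN 1 HIS 13 1",
                          "HELIX 2 ASN 20 ASN 36 1"], start=1):
    print(f"v {i} {helix_vertex(line)}")
```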

Page 20: Secondary structure and its representation - SHEET
Sample PDB lines (sense, length):
SHEET 1 TYR 284 ILE 286 0
SHEET 2 HIS 292 THR 294 -1
Vertex: s_sense_length
SUBDUE graphic input:
v 1 s_0_2 - - - strand 1, sense 0, length 2
v 2 s_-1_2 - - - strand 2, sense -1, length 2

Page 21: Overall secondary structure representation
PDB lines:
HELIX 1 THR 3 MET 13 1
HELIX 2 ASN 24 ASN 34 1
HELIX 3 SER 50 GLN 60 1
SHEET 1 LYS 41 HIS 48 0
SHEET 2 MET 79 THR 87 -1
SUBDUE graphic input:
v 1 h_1_10
v 2 h_1_10    e 1 2 sh
v 3 s_0_7     e 2 3 sh
v 4 h_1_10    e 3 4 sh
v 5 s_-1_8    e 4 5 sh
The sequential relationship is represented as edge "sh".
[Figure: visualization of the elements from N-terminus to C-terminus]
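The overall representation on this slide can be reproduced with a short sketch: each HELIX or SHEET line becomes a labeled vertex, the elements are ordered by their starting sequence number, and consecutive elements are joined by an "sh" edge. This is an illustration under the slide's simplified line layout, not the original secondary.c, and the names below are hypothetical.

```python
PREFIX = {"HELIX": "h", "SHEET": "s"}


def secondary_structure_graph(lines):
    """Build SUBDUE-style vertex and 'sh' edge lines from HELIX/SHEET lines."""
    elements = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] not in PREFIX:
            continue  # ignore any other record type
        kind, _serial, _res1, start, _res2, end, code = tokens
        start, end = int(start), int(end)
        elements.append((start, f"{PREFIX[kind]}_{code}_{end - start}"))
    elements.sort(key=lambda item: item[0])  # N-terminus to C-terminus
    vertices = [f"v {i} {label}" for i, (_, label) in enumerate(elements, start=1)]
    edges = [f"e {i} {i + 1} sh" for i in range(1, len(elements))]
    return vertices + edges


slide_lines = [
    "HELIX 1 THR 3 MET 13 1",
    "HELIX 2 ASN 24 ASN 34 1",
    "HELIX 3 SER 50 GLN 60 1",
    "SHEET 1 LYS 41 HIS 48 0",
    "SHEET 2 MET 79 THR 87 -1",
]
print("\n".join(secondary_structure_graph(slide_lines)))
```

Sorting by start position is what places the strand beginning at residue 41 between the second and third helices, matching the vertex order on the slide.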

Page 22: Tertiary structure and its representation
Sample PDB lines (last three columns: X, Y, Z):
ATOM CA ALA 1 10.369 0.997 10.519
ATOM CA ASN 2 6.691 0.239 9.830
Vertex: backbone carbon; edge: distance (vs, s)
Distance (Å): $\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$
v 1 CA_ALA
v 2 CA_ASN
e 1 2 vs - - - very short distance
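The distance computation and its discretization can be checked with the two sample atoms. The sketch below assumes the cut-offs that appear later in the talk (vs for distances up to 4 Å, s for distances above 4 Å and up to 6 Å) and emits no edge beyond 6 Å. The handling of longer distances and the function names are assumptions of this example, not a description of coor.c.

```python
import math


def ca_distance(p, q):
    """Euclidean distance between two (x, y, z) points, in angstroms."""
    return math.dist(p, q)  # available in Python 3.8+


def distance_label(d):
    """Map a distance to an edge label; None means no edge is emitted."""
    if d <= 4.0:
        return "vs"  # very short
    if d <= 6.0:
        return "s"   # short
    return None


ala = (10.369, 0.997, 10.519)
asn = (6.691, 0.239, 9.830)
d = ca_distance(ala, asn)
print(f"{d:.3f} A -> e 1 2 {distance_label(d)}")
```

Using labels instead of raw coordinates keeps the representation independent of the choice of origin, since distances between atoms do not change when the coordinate frame moves.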

Page 23: Rationale for representation choice - Criteria
Patterns identified by SUBDUE must be representative of each category
Patterns discovered by SUBDUE should discriminate one category from others

Page 24: Primary sequence
Vertex: a.a. residue name
Edge: peptide bond
v 1 ARG    v 2 GLU    v 3 ALA
e 1 2 bond    e 2 3 bond
[Figure: ARG - bond - GLU - bond - ALA]

Page 25: Secondary structure elements
Type of the helix
Starting and ending points (a.a. name and seq number)
[Figure: Helix 1 drawn from N-terminus to C-terminus with attributes type (1), length (12), starts (ASN), ends (HIS)]

Page 26: Other ways of representing helix
Separate type and length
Combine type and length
[Figure: Helix 1 with separate type (1) and length (12) vertices, compared with a single combined vertex Helix_1_12]

Page 27: Tertiary structure
(x, y, z) coordinates vary with different origin choice
Avoid numeric values; use vs (≤ 4 Å), s (4 Å < dist ≤ 6 Å)
[Figure: backbone carbons C1 (x 10.4, y 1.0, z 10.5) and C2 (x 6.7, y 0.2, z 9.8) joined by a "vs" edge]

Page 28: Results: Primary structure patterns
Hemo_seq (63/65)
Hemo_sequence: THR LYS THR TYR PHE PRO HIS PHE ASP LEU SER HIS GLY SER ALA GLN VAL LYS GLY HIS GLY LYS LYS VAL ALA ASP ALA LEU THR ASN ALA VAL ALA HIS VAL ASP ASP MET PRO ASN ALA LEU SER ALA LEU SER THR LEU ALA ALA HIS LEU PRO ALA GLU PHE THR PRO ALA VAL HIS ALA SER LEU ASP LYS PHE LEU ALA SER VAL SER THR VAL LEU THR SER LYS TYR
Myo_seq (67/103)
Myoglo_sequence: VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU THR LEU GLU LYS PHE ASP ARG
Ribo_A (59/68)
Ribonuclease_A_sequence: GLY GLN THR ASN CYS TYR GLN SER TYR SER THR MET SER ILE THR ASP CYS ARG GLU THR GLY SER SER LYS TYR PRO ASN CYS ALA TYR LYS THR THR GLN ALA ASN LYS HIS ILE ILE VAL ALA CYS GLU GLY ASN PRO TYR VAL PRO VAL HIS PHE ASP ALA SER VAL

Page 29: Primary structure patterns
Unique to each sample category
Hemoglobin and myoglobin proteins share little sequence similarity

Page 30: Results: Hemo secondary structure patterns
The secondary structure patterns discovered in hemoglobin (Hemo)
Each cell gives the pattern name, its footnote number in brackets, and (# of instances in Hemo / Global_Other).
Threshold | Pattern 1 | Pattern 2 | Pattern 3
0.0 | Hemo_s_1_0.0 [1] (50 / 0) | Hemo_s_2_0.0 [2] (52 / 0) | Hemo_s_3_0.0 [3] (50 / NA)
0.1 | Hemo_s_1_0.1 [4] (51 / NA) | Hemo_s_2_0.1 [5] (58 / NA) | Hemo_s_3_0.1 [6] (52 / NA)
0.2 | Hemo_s_1_0.2 [7] (90 / NA) | Hemo_s_2_0.2 [8] (98 / NA) | Hemo_s_3_0.2 [9] (92 / NA)
0.3 | Hemo_s_1_0.3 [10] (95 / NA) | Hemo_s_2_0.3 [11] (107 / NA) | Hemo_s_3_0.3 [12] (100 / NA)
[1] h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
[7] h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

Page 31: Results: Myo secondary structure patterns
The secondary structure patterns discovered in myoglobin (Myo)
Each cell gives the pattern name, its footnote number in brackets, and (# of instances in Myo / Global_Other).
Threshold | Pattern 1 | Pattern 2 | Pattern 3
0.0 | Myo_s_1_0.0 [1] (81 / 0) | Myo_s_2_0.0 [2] (82 / 0) | Myo_s_3_0.0 [3] (81 / 0)
0.1 | Myo_s_1_0.1 [4] (81 / NA) | Myo_s_2_0.1 [5] (84 / NA) | Myo_s_3_0.1 [6] (81 / NA)
0.2 | Myo_s_1_0.2 [7] (83 / NA) | Myo_s_2_0.2 [8] (84 / NA) | Myo_s_3_0.2 [9] (83 / NA)
0.3 | Myo_s_1_0.3 [10] (83 / NA) | Myo_s_2_0.3 [11] (84 / NA) | Myo_s_3_0.3 [12] (84 / NA)
[1] h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25

Page 32: Results: Ribo_A secondary structure patterns
The secondary structure patterns discovered in ribonuclease A (Ribo_A)
Each cell gives the pattern name, its footnote number in brackets, and (# of instances in Ribo_A / Global_Other).
Threshold | Pattern 1 | Pattern 2 | Pattern 3
0.0 | Ribo_A_s_1_0.0 [1] (25 / 0) | Ribo_A_s_2_0.0 [2] (25 / 0) | Ribo_A_s_3_0.0 [3] (25 / 0)
0.1 | Ribo_A_s_1_0.1 [4] (27 / NA) | Ribo_A_s_2_0.1 [5] (27 / NA) | Ribo_A_s_3_0.1 [6] (27 / NA)
0.2 | Ribo_A_s_1_0.2 [7] (27 / NA) | Ribo_A_s_2_0.2 [8] (27 / NA) | Ribo_A_s_3_0.2 [9] (27 / NA)
0.3 | Ribo_A_s_1_0.3 [10] (36 / NA) | Ribo_A_s_2_0.3 [11] (36 / NA) | Ribo_A_s_3_0.3 [12] (36 / NA)
[1] h_1_10 -> h_1_10 -> s_0_7 -> s_0_7 -> h_1_10 -> s_0_3 -> s_0_3 -> s_-1_4 -> s_-1_4 -> s_-1_8 -> s_-1_1 -> s_-1_10 -> s_-1_10 -> s_-1_8 -> s_-1_8 -> s_-1_5 -> s_-1_3
[10] h_1_10 -> h_1_10 -> s_0_7 -> h_1_10 -> s_0_3 -> s_-1_4 -> s_-1_8 -> s_-1_8 -> s_-1_6

Page 33: Results: Tertiary structural patterns
SUBDUE finds small patterns (2 or 3 a.a.)
Not unique for each category of proteins
Not biologically meaningful

Page 34: Visualization of secondary structure patterns - hemoglobin
[Figure: complete hemoglobin structure; 2 instances of the pattern, drawn from N-terminus to C-terminus]

Page 35: Visualization of secondary structure patterns - myoglobin
[Figure: complete myoglobin structure; 1 instance of the pattern, drawn from N-terminus to C-terminus]

Page 36: Visualization of secondary structure patterns - ribonuclease_A
[Figure: complete ribonuclease_A structure; 1 instance of the pattern, drawn from N-terminus to C-terminus]

Page 37: Discussion - Hemoglobin
Hemoglobin: A, B, C, D chains
Two types of patterns identified by SUBDUE: one for A, C chains, the other for B, D chains
Patterns exist in a majority of hemoglobin proteins
No instances of the best hemoglobin pattern found in other proteins in the global data set

Page 38: Occurrence of hemo patterns
The occurrences of the best hemoglobin patterns
PDB Name | Occurrence | Species
pdb2hhb.ent | B, D chains [1]; A, C chains [7] | human
pdb1sdl.ent | NO | human
pdb1bbb.ent | B chain [1] | human
pdb4hhb.ent | B, D chains [1]; A, C chains [7] | human
pdb1thb.ent | A, C, B, D chains [7] | human
pdb3hhb.ent | B chain [1]; A chain [7] | human
pdb1sdk.ent | NO | human
pdb1cbm.ent | A, B, C, D chains [1] | human
pdb1cls.ent | NO | human
pdb1hbb.ent | B, D chains [1]; A, C chains [7] | human
pdb1hba.ent | B, D chains [1]; A, C chains [7] | human
pdb2hbc.ent | B chain [1]; A chain [7] | human
pdb1cbl.ent | A, B, C, D chains [1] | human
pdb1hga.ent | N/A | human
pdb1hgb.ent | N/A | human
pdb1hgc.ent | N/A | human
pdb2hbd.ent | B chain [1]; A chain [7] | human
pdb2hbf.ent | B chain [1]; A chain [7] | human
pdb1hho.ent | B chain [1]; A chain [7] | human
pdb1nih.ent | B, D chains [1]; A, C chains [7] | human
pdb1coh.ent | B, D chains [1]; A, C chains [7] | human
pdb1fdh.ent | G chain [1] | human
pdb2hco.ent | B chain [1]; A chain [7] | human
pdb1cmy.ent | B, D chains [1] | human
pdb1hbs.ent | B, D, F, H chains [1]; A, C, E, G chains [7] | human
pdb1hco.ent | B chain [1]; A chain [7] | human
N/A: secondary structure information not available
[1] instance found with a threshold of 0.0 (default): Hemo_s_1_0.0
[4] instance found with a threshold of 0.1: Hemo_s_1_0.1
[7] instance found with a threshold of 0.2: Hemo_s_1_0.2
[10] instance found with a threshold of 0.3: Hemo_s_1_0.3

Page 39: Occurrence of hemo patterns - continued
The occurrences of the best hemoglobin patterns
PDB Name | Occurrence | Species
pdb1bab.ent | B chain [1]; A, C chains [7] | human
pdb1dxu.ent | B, D chains [1]; A, C chains [7] | human
pdb1dxv.ent | B, D chains [1]; A, C chains [7] | human
pdb1dxt.ent | B, D chains [1]; A, C chains [7] | human
pdb1gbu.ent | B, D chains [10] | human
pdb1gbv.ent | B, D chains [10] | human
pdb1hdb.ent | NO | human
pdb1dsh.ent | NO | human
pdb2hhe.ent | N/A | human
pdb1gli.ent | B, D chains [1]; A, C chains [7] | human
pdb2hbe.ent | B chain [1] | human
pdb1ibe.ent | NO | horse
pdb2mhb.ent | NO | horse
pdb2dhb.ent | B chain [4]; A chain [7] | horse
pdb1hds.ent | N/A | deer
pdb1hda.ent | B, D chains [1] | bovine
pdb2pgh.ent | B, D chains [1] | pig
pdb1out.ent | NO | trout
pdb1ouu.ent | NO | trout
pdb1pbx.ent | B chain [1] | antarctic fish
pdb1hbh.ent | NO | antarctic fish
pdb1ith.ent | NO | innkeeper worm
N/A: secondary structure information not available
[1] instance found with a threshold of 0.0 (default): Hemo_s_1_0.0
[4] instance found with a threshold of 0.1: Hemo_s_1_0.1
[7] instance found with a threshold of 0.2: Hemo_s_1_0.2
[10] instance found with a threshold of 0.3: Hemo_s_1_0.3

Page 40: Discussion - Myoglobin
Myoglobin: one chain
One dominant pattern identified by SUBDUE
Patterns exist in most myoglobin proteins
No instances of the best myoglobin pattern found in other proteins in the global data set

Page 41: Discussion - Hemoglobin and Myoglobin
Similar secondary structure patterns
Hemoglobin B, D chains (from N- to C-terminus):
h_1_14 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20
Myoglobin chain (from N- to C-terminus):
h_1_15 -> h_1_15 -> h_1_6 -> h_1_6 -> h_1_19 -> h_1_9 -> h_1_18 -> h_1_25
Hemoglobin A, C chains (from N- to C-terminus):
h_1_15 -> h_1_15 -> h_1_6 -> h_1_1 -> h_1_19 -> h_1_8 -> h_1_18 -> h_1_20

Page 42: Discussion - Hemoglobin and Myoglobin
Consistent with the genetic studies
Hemoglobin and myoglobin share one ancestral gene
Divergence occurred in the course of evolution: one copy of the gene for myoglobin, four copies for hemoglobin
The last helix of hemoglobin is shorter, and one of the helices in the hemoglobin A, C chains almost disappears, allowing conformational change

Page 43: Discussion - Ribonuclease A proteins
All patterns have three helices of the same size
Several strands appear twice, indicating participation in the formation of two sheets
Ribonuclease S protein (S-protein fragment) also has the pattern

Page 44: Conclusion of the results
Secondary structure patterns discovered by SUBDUE are representative of each category
Secondary structure patterns discovered by SUBDUE are distinct for each category
SUBDUE has the ability to discover biologically interesting patterns from PDB and other similar MB databases

Page 45: Comparison with other related studies
Different graphic representation
Predefined patterns with exact or inexact graph match
Not applied systematically to PDB or other DB
SUBDUE would perform a similar task if the inexact graph match routine is incorporated

Page 46: Conclusions of the study
Abstraction over 3D structure to its secondary structural elements is suitable for discovery
The secondary structure patterns that SUBDUE discovered for each category can be used as a signature for its class
Inexact graph match is useful for finding similar patterns
SUBDUE is suitable for knowledge discovery in MB structural DB

Page 47: Future Research
More consistent and detailed description of secondary structure
Add relative positions of the secondary structural elements to represent spatial relationship
Investigate alternative representation: more suitable 3D coordinates representation; weighting on different edges
Inexact graph match in predefined substructure
More collaboration with domain scientists