Molecular Machine Learning: A Personal Introduction · Molecular Machine Learning: A Personal...

Post on 10-Jun-2020

12 views 2 download

Transcript of Molecular Machine Learning: A Personal Introduction · Molecular Machine Learning: A Personal...

Molecular Machine Learning: Molecular Machine Learning: A Personal IntroductionA Personal Introduction

Byoung-Tak Zhang

Adapted from “Biological Machine Learning: From Biology to Machine Learning and Back” presented November 9, 2002

as Invited Talk at Korea Information Science Society SIG CVPR

Center for Bioinformation Technology (CBIT) & Biointelligence Laboratory

School of Computer Science and EngineeringSeoul National University

http://bi.snu.ac.kr/http://cbit.snu.ac.kr/

2(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

From BT to IT and Back:From BT to IT and Back:Biological Machine LearningBiological Machine Learning

Bioscience

Biotechnology

Bioinformatics

MolecularComputation

Neural Computation

Evolutionary Computation

Biocomputing

Bio-InspiredMachine Learning

BiotechnicalMachine Learning

S/W

BT

H/W

S/W

IT

H/W

3(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

OutlineOutline

Biological Machine Learning: BITML Inspired by Biology: NN & GA

ML for Biology: BioinformaticsML Using Biology: Molecular ML

Conclusions and Outlook

ML Inspired by Biology:ML Inspired by Biology:

Neural NetworksNeural NetworksGenetic AlgorithmsGenetic Algorithms

5(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

The Brain vs. ComputerThe Brain vs. Computer

1. 1011 neurons with1014 synapses

2. Speed: 10-3 sec3. Distributed processing4. Nonlinear processing5. Parallel processing

1. 1011 neurons with1014 synapses

2. Speed: 10-3 sec3. Distributed processing4. Nonlinear processing5. Parallel processing

1. A single processor with complex circuits

2. Speed: 10 –9 sec 3. Central processing4. Arithmetic operation

(linearity) 5. Sequential processing

1. A single processor with complex circuits

2. Speed: 10 –9 sec 3. Central processing4. Arithmetic operation

(linearity) 5. Sequential processing

6(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

From Biological Neuron to From Biological Neuron to Artificial NeuronArtificial Neuron

7(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Activation FunctionScaling Function

Output Comparison

Information Propagation

Error Backpropagation

Input x1

Input x2

Input x3

Output

Input Layer Hidden Layer Output Layer

Weights

Activation Function

Multilayer Multilayer PerceptronPerceptron (MLP)(MLP)

∑∈

−≡outputsk

kkd otwE 2)(21)(

iiiii w

Ewwww∂∂

−=∆∆+← η ,

x )(xfo =

8(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Learning as Error MinimizationLearning as Error Minimization

9(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Gradient DescentGradient DescentGradient Descent

iiiii w

Ewwww∂∂

−=∆∆+← η ,

∑∈

−=∆Dd

idddi xotw )(η

10(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Application Example:Application Example:Autonomous Land Vehicle (ALV)Autonomous Land Vehicle (ALV)

NN learns to steer an autonomous vehicle.960 input units, 4 hidden units, 30 output units Driving at speeds up to 70 miles per hour

Weight valuesfor one of the hidden units

Image of aforward -mountedcamera

ALVINN System

11(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Charles Darwin (1859)“Owing to this struggle for life, any variation, however slight and from

whatever cause proceeding, if it be in any degree profitable to an individual of any species, in its infinitely complex relations to other organic beings and to external nature, will tend to the preservation of that individual, and will generally be inherited by its offspring.”

12(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Evolutionary ComputationEvolutionary Computation

What is the Evolutionary Computation? 4Stochastic search (or problem solving) techniques that

mimic the metaphor of natural biological evolution.

Metaphor

EVOLUTION

IndividualFitness

Environment

PROBLEM SOLVING

Candidate SolutionQualityProblem

13(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

General Structure of General Structure of GAsGAs

solutions

1100101010

1011101110

0011011001

1100110001

110010 1110

101110 1110

110010 1010 crossover

mutation00110

101110 1010

10011

00110 10010

evaluation

1100101110

1011101010

0011001001

solutions

fitnesscomputation

roulettewheel

selectionnew population

encodingchromosomes

14(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

MajorMajor Evolutionary AlgorithmsEvolutionary Algorithms

Genetic Programming

Evolution Strategies

Genetic Algorithms Evolutionary

ProgrammingClassifier Systems

• Genetic representation of candidate solutions• Genetic operators• Selection scheme• Problem domain

Hybrids: BGA

15(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Selection StrategiesSelection Strategies

Proportionate Selection4 Reproduce offspring in proportion to fitness fi.

Ranking Selection4 Select individuals according to rank(fi).

Tournament Selection4 Choose q individuals at random, the best of which survives.

Other Ways

( ) ( )( )∑ =

= λ

1jtj

tit

isaf

afap

( )⎪⎩

⎪⎨

≤≤

≤≤=

λµ

µµ

i

iap t

is

,0

1,1

( ) ( ) ( )( )qqq

tis iiap −−+−= λλ

λ11

16(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

, where rx, rσ , rα ∈{-, d, D, i, I, g, G}, e.g. rdIIασ rrr xr

⎪⎪⎪⎪⎪⎪

⎪⎪⎪⎪⎪⎪

′−⋅+

′−⋅+

′−+

′−+

=′

GiSiTiiS

giSiTiS

IiSiTiS

iiSiTiS

DiTiS

diTiS

iS

i

rxxx

rxxx

rxxx

rxxx

rxx

rxx

rx

x

i

i

i

teintermedia dgeneralize panmictic)(

teintermedia dgeneralize)(

teintermedia panmictic2/)(

teintermedia2/)(

discrete panmicticor

discreteor

ionrecombinat no

,,,

,,,

,,,

,,,

,,

,,

,

χ

χ

ES: Recombination OperatorES: Recombination Operator

17(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

m{τ,τ’,β} : Iλ → Iλ is an asexual operator.4nσ = n, nα = n(n-1)/2

41 < nσ < n, nα = 0

4nσ = 1, nα = 0

)),(,0(

)1,0())1,0()1,0(exp(

ασ

βααττσσ

′′+=′

⋅+=′⋅+⋅′⋅=′

CNxx

NNN

jjj

iii

( )0873.02

2

1

1

≈∝′

⎟⎠⎞⎜

⎝⎛∝

βτ

τ

n

n

)1,0())1,0()1,0(exp(

iiii

iii

NxxNN

⋅′+=′⋅+⋅′⋅=′

σττσσ

)1,0())1,0(exp( 0

iii NxxN

⋅′+=′⋅⋅=′

στσσ

n/10 ∝τ

ES: Mutation OperatorES: Mutation Operator

18(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Geometric Analogy Geometric Analogy -- Fitness LandscapeFitness Landscape

ML for BioinformaticsML for Bioinformatics

20(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Computers Meet BiosciencesComputers Meet Biosciences

Bioinformation Technology(BIT)

BT IT

21(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Structural Genomics

FunctionalGenomics Proteomics Pharmaco-

genomics

AGCTAGTTCAGTACA

TGGATCCATAAGGTA

CTCAGTCATTACTGC

AGGTCACTTACGATA

TCAGTCGATCACTAG

CTGACTTACGAGAGT

Microarray (Biochip)

Infrastructure of Bioinformatics

Areas and Workflow of BioinformaticsAreas and Workflow of Bioinformatics

22(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Topics in BioinformaticsTopics in Bioinformatics

Structure analysisStructure analysis4 Protein structure comparison4 Protein structure prediction 4 RNA structure modeling

Pathway analysisPathway analysis4Metabolic pathway4 Regulatory networks

Sequence analysisSequence analysis4 Sequence alignment4 Structure and function prediction4 Gene finding

Expression analysisExpression analysis4 Gene expression analysis4 Gene clustering

23(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Machine Learning Techniques for Machine Learning Techniques for BioinformaticsBioinformatics

Sequence Alignment4 Simulated Annealing4 Genetic Algorithms

Structure and Function Prediction4 Hidden Markov Models4Multilayer Perceptrons4 Decision Trees

Molecular Clustering and Classification4 Support Vector Machines4 Nearest Neighbor Algorithms

Expression (DNA Chip Data) Analysis4 Self-Organizing Maps4 Bayesian Networks

24(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Gene Finding Gene Finding

Upstream Open Reading FrameDownstream

mRNA

25(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Gene Finding ProgramsGene Finding ProgramsName Methods Organism

ER Discriminant Analysis Human, Arabidopsis

GENSCAN (seems the most accurate)

Semi Markov Modelvertebrate, caenorhabditis, arabidopsis, maize

GRAIL Neural Networkhuman, mouse, arabidopsis, drosophila, E.coli

GenLang Definite Clause Grammer Vertebrate, Drosophila, Dicot

GenView Linear combination Human, Mouse, Diptera

GeneFinder(FGENEH,etc.)

LDAHuman, E.coli, Drosophila, Plant, Nematode, Yeast

GeneID Perceptron,rules Vertebrate

GeneMark 5th-Markov Almost all model organism

GeneParser Neural networks Human

Genie GHMM Human (vertebrate)

GlimmerInterpolated Markov models (IMMs)

microbial

MORGAN Decision Tree vertebrate

MZEF Quadratic Discriminant Analysis Human, mouse, Arabidopsis, Pombe

NetPlantGene Combined Neural Networks A. thaliana

OC1 Decision tree Human

PROCRUSTES Spliced alignment vertebrate

Sorfind Rule base Human

VEIL HMM vertebrate

Hogehoge Wonderful method extraterrestrial

26(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

27(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Learning SchemeLearning Scheme

Training setAATGCGTACCTCATACGACCACAACGAATGAATATGATGT………

Training setAATGCGTACCTCATACGACCACAACGAATGAATATGATGT………

Test setTCGACTACGAGCCTCATCGACGAACGAATGAATATGATGT………

Test setTCGACTACGAGCCTCATCGACGAACGAATGAATATGATGT………

PredictionMethod

PredictionMethod

Learning (Model Construction)

Outputinput

inputoutput

28(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

29(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Neural Networks in GRAILNeural Networks in GRAIL

CATATTCAAGAATTGAAGCGTGTAGTCCTGACTTGAGAGCTGTAGATGACGTGCTTATATGTTC………………………..

Known Sequence

0.7 0.8 0.1 0.3 … 0.9 0.2

0.4 0.2 0.6 0.1 … 0.4 0.5

x1x2

xn

0.2 0.9 0.3 0.1 … 0.8 0.3

0.6 0.3 0.2 0.8 … 0.2 0.4

Coding potential valueGC Composition

LengthDonor

Intron vocabulary

1

0

0

1

t1t2

tn

x3 t3

Exon

Input Layer

Hidden Layer

Output Layer

Weights

Training

Preprocessing

Testing

ATGACGTACGATCCCGTGACGGTGACGTGAGCTGACGTGCCGTCGTAGTAATTTAGCGTGA………………………..

Unknown Sequence

0.6 0.3 0.2 0.8 … 0.2 0.4x f(x) ?∑

−≡outputsk

kkd otwE 2)(21)(

iiiii w

Ewwww∂∂

−=∆∆+← η ,

o

30(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Protein Structure PredictionProtein Structure Prediction

Amino acid sequences of protein determine its 3D conformation

MNIHRSTPITIARYGRSRNKTQDFEELSSIRSAEPSQSFSPNLGSPSPPETPNLSHCVSCIGKYLLLEPLEGDHVFRAVHLHSGEELVCKVFDISCYQESLAPCF

Sequence Structure Function

31(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Motif, Domain, FamilyMotif, Domain, Family

Domain consists of combinations of motifs, the size of domains varies from about 25 to 30 amino acid residues to about 300, with an average of about 100.

Motif (protein sequence pattern): is recognizable combinations of α helices and β strands that appear in a number of proteins.

Protein family consist of members which has1) Same function; and2) Clear evolutionary relationship; and 3) Patterns of conservation, some positions are

more conserved than the others, and some regions seem to tolerate insertions and deletions more than other regions, the similarity usually > 25% .

32(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Neural Network for Structure Neural Network for Structure PredictionPrediction

33(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

SequenceSequence--toto--Structure Network Structure Network ArchitectureArchitecture

Frequency of amino acid‘E’ in multiple alignments

of some protein familyCurrent input window

Each input unit consists of 21 frequencies

34(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

First level: sequence-to-structure

Second level: structure-to-structure

Third level: jury decision

Overall Architecture

The input is based on a profile made from amino acid occurrences in columns of a multiple alignment of sequences with high similarity of the query sequence.

35(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

cDNAcDNA MicroarrayMicroarray

cDNA clones(probes)

PCR product amplificationpurification

Printing

Microarray

Hybridize target to microarray

mRNA target

Excitation

Laser 1Laser 2

Emission

Scanning

Analysis

Overlay images and normalize

0.1nl/spot

36(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Image AnalysisImage Analysis

Scanned images probe intensities numerical values for higher-level analysis

Array target segmentationBackground intensity extraction Target detection

Target intensity extractionRatio analysis

37(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Gene Expression Data MiningGene Expression Data Mining

Data preprocessing:

- Normalization

- Discretization

- Gene selectionLearning:

- Greedy search

- EM algorithm

- Classification

- Clustering

- Analysis of gene regulations

38(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Cancer Classification with DNA Cancer Classification with DNA MicroarraysMicroarrays

39(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Gene Expression ProfilingGene Expression Profiling

[DNA microarray dataset]

[Gene Cluster 1]

[Gene Cluster 2]

[Gene Cluster 3]

[Gene Cluster 4]

40(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Cell CycleCell Cycle--regulated Genes in regulated Genes in S. S. cerevisiaecerevisiae (Yeast)(Yeast)

Identify cell cycle-regulated genes by cluster analysis.4104 genes are already known to

be cell-cycle regulated.4Known genes are clustered into

6 clusters.Cluster 104 known genes and other genes together.The same clustersimilar functional categories.

[Fig.] 104 known gene expression levels according to the cell cycle(row: time step, column: gene).

41(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Probabilistic Clustering Using Probabilistic Clustering Using Latent VariablesLatent Variables

gi: ith genezk: kth clustertj: jth time stepp(gi|zk): generating probability

of ith gene given kth clustervk=p(t|zk): prototype of kth

cluster

)()()|()|()(

i

kkiikki p

zpzpzpzpg

ggg ==∈

∑∑ ∑=i j k

kjkikij ztpzpzpgztf ))|()|()(log(),,( gg

∑=j

kjijki vxsimilarity ),( vx

: (*) objective function(maximized by EM)

42(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Experimental Results:Experimental Results:Prototype Expression Levels of Found ClustersPrototype Expression Levels of Found Clusters

[Fig.] Prototype expression levels of genes found to be cell cycle-regulated (4 clusters).

• The genes in the same cluster show similar expression patterns during the cell cycle.• The genes with similar expression patterns are likely to have correlated functions.

43(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

NCI Drug Discovery ProgramNCI Drug Discovery Program

NCI 60 cell lines data set

44(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Bayesian NetworksBayesian Networks

Represent the joint probability distribution among random variables efficiently using the concept of conditional independence.

BA

C D

Enet) Bayes example (by the )|()|(),|()()(

rule)chain (by ),,,|(),,|(),|()|()(),,,,(

CEPBDPBACPBPAPDCBAEPCBADPBACPABPAP

EDCBAP

==

•A, C and D are independent given B.

•C asserts dependency between A and B.

•A, B and E are independent given C.

An edge denotes the possibility of the causal relationship between nodes.

45(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Bayesian Network LearningBayesian Network Learning

Dependence analysis [Margaritis ’00]4Mutual information and χ2 test

Score-based search

• D: data, S: Bayesian network structure

4NP-hard problem4Greedy search4Heuristics to find good massive network structures

quickly (local to global search algorithm)

∏ ∏ ∏= = = Γ

Γ⋅=

=

n

i

q

j

r

kijk

ijkijk

ijij

iji i NN

Sp

SDpSpSDp

1 1 1 )()(

)()(

)(

)|()(),(

αα

αα

46(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

A Small Bayesian Network for A Small Bayesian Network for Classification of CancerClassification of Cancer

Zyxin

Leukemiaclass

MB-1

C-mybLTC4S

1.3/340/38RBF networks1/340/38Neural trees2/340/38Bayes nets

Test errorTraining error

•The Bayesian network was learned by full searchusing BD (Bayesian Dirichlet) score with uninformative prior [Heckerman ’95] from the DNA microarray data for cancer classification(http://waldo.wi.mit.edu/MPR/).

[Table] Comparison of the classification performance with other methods [Hwang ’00].

47(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Experimental Results:Experimental Results:DrugDrug--Drug DependencyDrug DependencyDrug-drug activity correlations

<Part of the learned Bayesian network structure>

- Three drugs “Aphidicolin-glycinate”, “Floxuridine”, and “Cytarabine” directly depend on each other.

- “Cyclocytidine” directly depends on “Cytarabine” and vice versa.

Confirmation

48(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Experimental Results:Experimental Results:GeneGene--Drug DependencyDrug Dependency

Gene expression-drug activity correlations

- The negative correlation between “ASNS” and “L-asparagine” is mediated by two other genes.

- The relationships revealed by the Bayesian network is putative and should be verified by biological experiments. exploratory analysis

`

<Part of the learned Bayesian network structure>

Confirmation

Discovery

49(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

ML Using Biology: ML Using Biology: BiocomputingBiocomputing

50(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

BioinformationBioinformation Technology (BIT)Technology (BIT)

BTBTITIT

Bioinformatics (in silico Biology)

Biocomputing (e.g. DNA Computing)

Model

Tool

Tool

NNGA

MC

AL

51(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Biology as Inspiration: HistoryBiology as Inspiration: History

Neural Networks4McCulloch & Pitts (1943)4Rosenblatt (1958)

Genetic Algorithms4Fogel (1960s)4Rechenberg (1960s)4Holland (1975)

Artificial Life 4Langton (1988)

DNA Computing4Adleman (1994)

52(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Biology and Artificial Intelligence Biology and Artificial Intelligence (AI)(AI)Symbolic AI

Rule-Based Systems

Connectionist AI Neural Networks

Evolutionary AI Genetic Algorithms

Molecular AI: DNA Computing

53(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

DNA ComputingDNA Computing: Biology as : Biology as TechnologyTechnology

011001101010001 ATGCTCGAAGCT

54(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

LiquidLiquid--Phase Phase 3D3D Biochemical Biochemical Reaction as ComputingReaction as Computing

R ∨ ¬P ∨ ¬Q S

Q ∨ ¬T ∨ ¬S T P

¬R

55(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Computing: Hybridization & Computing: Hybridization & LigationLigation

CGTACCTTAGGCT

AGCTTAGGATGGCATGG AATCCGATGCATGGC

CGTACCTTAGGCTAGCTTAGGATGGCATGGAATCCGATGCATGGC

CGTACCTTAGGCTAGCTTAGGATGGCATGGAATCCGATGCATGGC

CGTACCTTAGGCT

AGCTTAGGATGGCATGGAATCCGATGCATGGC+

+

Ligation

Hybridization

Dehybridization

56(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Complementary

Magnetic Beads

Magnet

Solution Detection: Bead SeparationSolution Detection: Bead Separation

57(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

HPPHPP

...

......

...ATGATGACGACG

TGCTGC

CGACGA

TAATAAGCAGCA

CGTCGT...

...

...

...... ......

...

10

3

2 56

4

SolutionSolution

ATGTGCTAACGAACG

ACGCGAGCATAAATGTGCCGTACGCGAGCATAAATGTGCCGT

TAAACG

CGACGT

TAAACGGCAACG

...

...

...

...

CGACGTAGCCGT

...

...

...

ACGCGAGCATAAATGTGCCGTACGCGAGCATAAATGTGCCGTACGCGTAGCCGT

ACGCGT

......

...

...

...

ACGGCATAAATGTGCACGCGTACGCGAGCATAAATGCGATGCCGT

ACGCGAGCATAAATGTGCCGTACGCGAGCATAAATGTGCCGT

... ... .........

ACGCGAGCATAAATGTGCCGTACGCGAGCATAAATGTGCCGT

...

.........

...

Decoding

Ligation

Encoding

Gel Electrophoresis

Affinity Column

ACGCGAGCATAAATGTGCACGCGT

ACGCGAGCATAAATGCGATGCACGCGT

ACGCGAGCATAAATGTGCACGCGT

ACGCGAGCATAAATGCGATGCACGCGT

2

0 13 4

56

Node 0: ACG Node 3: TAANode 0: ACG Node 3: TAANode 1: CGA Node 4: ATGNode 1: CGA Node 4: ATGNode 2: GCA Node 5: TGCNode 2: GCA Node 5: TGC

Node 6: CGTNode 6: CGT

DNA Computing: DNA Computing: BioBio--Lab ProcedureLab Procedure

PCR(Polymerase

Chain Reaction)

58(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Why DNA Computing?Why DNA Computing?

6.022 × 1023 molecules / moleImmense, brute force search of all possibilities4Desktop: 109 operations / sec4Supercomputer: 1012 operations / sec41 µmol of DNA: 1026 reactions

Favorable energetics: Gibb’s free energy

1 J for 2 × 1019 operationsStorage capacity: 1 bit per cubic nanometer

-1mol 8kcalG −=∆

59(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

DNA Computers vs. Conventional DNA Computers vs. Conventional ComputersComputers

Electronic data are vulnerable but can be backed up easily

DNA is sensitive to chemical deterioration

Setting up only requires keyboard input

Setting up a problem may involve considerable preparations

Smaller memoryCan provide huge memory in small space

Can do substantially fewer operations simultaneously

Can do billions of operationssimultaneously

Fast at individual operationsSlow at individual operationsMicrochip-based computersDNA-based computers

60(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

DNA Computing for AI and DNA Computing for AI and CogSciCogSci

Memory4Associative memory with enormous density46 x 10^23 molecules / mole43 grams of water contains 10^22 molecules

Inference4Massive parallel reactive process410^19 operations for Joule

Learning4Generalization through biochemical reaction43-dimensional diffusion in solution4Global parameterization, e.g. temperature

61(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Molecular Machine LearningMolecular Machine Learning

Molecular Learning Machines

Molecular Computing

Machine Learning

62(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Example: Concept Learning for a Example: Concept Learning for a Janitor RobotJanitor Robot

four, fivefloorfaculty, staffstatuscs, eedepartment

ValuesAttribute

Examples: x1: <cs, faculty, four>+ x2: <ee, faculty, five>-

Conceptsh1: <faculty>h2: <cs, faculty>h3: <cs, faculty, four>

More general:h1 >g h2 >g h3

ee faculty staff four five

cs ∧ faculty ee ∧ facultycs ∧ staff ee ∧ staff faculty ∧ four faculty ∧ five

cs ∧ staff ∧ five ee ∧ faculty ∧ four cs ∧ faculty ∧ five cs ∧ faculty ∧ four

cs

More general

More specific

63(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

DNADNA--Based Concept LearningBased Concept Learning< >, <cs>, <ee>, <faculty>, <staff>, <four>, <five>, <cs, faculty>, <cs, staff>, <ee, faculty>, <ee, staff>, <cs, four>, <cs, five>, <ee, four>, <ee, five>, <faculty, four>, <faculty, five>, <staff, four>, <staff, five>, <cs, faculty, four>, <cs, faculty, staff>, <cs, staff, four>, <cs, staff, five>, <ee, faculty, four>, <ee, faculty, staff>, <ee, staff, four>, <ee, staff, five>

<>, <cs>, <faculty>, <four>, <cs, faculty>, <cs, four>, <faculty, four>,<cs, faculty, four>

<cs, faculty, four> (+)

< >, <cs>, <ee>, <faculty>, <staff>, <four>, <five>, <cs, faculty>, <cs, staff>, <ee, faculty>, <ee, staff>, <cs, four>, <cs, five>, <ee, four>, <ee, five>, <faculty, four>, <faculty, five>, <staff, four>, <staff, five>, <cs, faculty, four>, <cs, faculty, staff>, <cs, staff, four>, <cs, staff, five>, <ee, faculty, four>, <ee, faculty, staff>, <ee, staff, four>, <ee, staff, five>

cs four

staffee

?status?dept

five

faculty

?floor

<>, <cs>, <faculty>, <four>, <cs, faculty>, <cs, four>, <faculty, four>,<cs, faculty, four>

<cs, staff, five> (-)

<faculty>, <four>, <cs, faculty>, <cs, four>,<faculty, four>,<cs, faculty, four>

<four>, <cs, four>,

Intersection

<faculty>, <cs, faculty>,<faculty, four>,<cs, faculty, four>

Difference

<cs, staff, four> is classified as negative!

< >, <cs>, <staff>, <four>,<cs, staff>, <cs, four>, <staff, four>,

<cs, staff, four>

<faculty>, <four>, <cs, faculty>, <cs, four>,<faculty, four>,<cs, faculty, four>

<cs, staff, four> ?

Majority Voting

64(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

BioBio--Lab ProcedureLab Procedure

1. Sequence Design and Synthesis1. Sequence Design and Synthesis

2. Hybridization2. Hybridization

3. Ligation3. Ligation

4. Learning (Affinity Separation)4. Learning (Affinity Separation)

5. Classification5. Classification

Initial Version Space

cs four

staffee

?status?dept

five

faculty

?floor

Primer

Sticky End

cs faculty four

cs faculty five

cs faculty ?

cs staff four

cs ? ?

ee faculty ?

ee ? ?

? ? ?

...

...

...

...

65(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

1. Generating the Concept Space1. Generating the Concept Space

One DNA molecule for one hypothesisSticky ends and “don’t care symbols”

cs four

staffee

?status?dept

five

faculty

?floor

cs faculty four

cs faculty five

cs faculty ?

cs staff four

cs ? ?

ee faculty ?

ee ? ?

? ? ?

...

...

...

...

Primer

Sticky End

66(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

2. Given a Positive Example: 2. Given a Positive Example: <<cscs, faculty, four>+, faculty, four>+

Given a positive example, select all hypotheses that are consistent with the example (and don’t cares)

cs

? dept

faculty

? status

four

? floor Affinity separation by magnetic beads

A

BA n B A

A A n B

67(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

staff

five

ee

3. Given a Negative Example: 3. Given a Negative Example: <<cscs, staff, five>, staff, five>--

Given a negative example, filter out all hypothesesthat are consistent with the example

A – B AA

A A - B

B

A n B

Affinity separation by magnetic beads

68(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

4. Given an Unknown Example:4. Given an Unknown Example:<<cscs, staff, four>?, staff, four>?

Compute Y = A n B and N = A – B

Answer yes if |Y| > |N|, answer no, otherwise.

This example isnegative!

A

A

A

Y

N

B Amplify(Query)

YN

Fluorescence or UV

A n B

Y A n BN A - B

A - B

69(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Experimental Results (1)Experimental Results (1)

Generation of the initial concept space4Sequence design: H-measure, similarity, and Tm4Attribute: 20 mer4Sticky end: 10 mer4Total: 80 mer (3 attributes per hypothesis)

100 bp

75 bp

50 bp

25 bp

Gel Electrophoretogramfor hybridization & ligation

70(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Experimental Results (2)Experimental Results (2)Learning from the first (positive) example4<cs, faculty, four> +

M 1 2 3 4 5 6 7 8 M 1 2 3 4 5 6 7 8 9 10 M9 10 M

100bp75bp

50bp25bp

M : 25bp marker (KDR)

lane 1: cs / 4

lane 2: cs / ?floor

lane 3: ?dept / 4

lane 4: ?dept / ?floor

lane 5: ee / 5

lane 6: No Primers

lane 7: ligation mixture

lane 8: sup. after cs/ ?dept

lane 9: sup. after faculty / ?stat

lane 10: sup. after 4/ ?floot

<cs, faculty, four>, <cs, faculty, ?>, <cs, ?, four>, <?, faculty, four>,<cs, ?, ?>, <?, faculty, ?>, <?, ?, four>, <?, ?, ?>

71(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Experimental Results (3)Experimental Results (3)

Learning from the second (negative) example4<cs, staff, five> -

• <cs, faculty, four>, <cs, faculty, ?>, <cs, ?, four><?, faculty, four>, <?, faculty, ?>, <?, ?, four>

Answering to the query (unknown example)4<cs, staff, four> ?

• (+): <cs, ?, four>, <?, ?, four>• (-): <cs, faculty, four>,<cs, faculty, ?>

<?, faculty, four>, <?, faculty, ?>• Should be classified as “negative”• Result

4 (+) : (-) = 0.762 : 1.134

UV spectrophotometry

0

0.5

1

1.5

2

2.5

3

3.5

4

180.00 90.00 45.00 22.50 11.25 5.63 2.81 1.41 0.70

Concentration (μg/ml)

72(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Experimental Results (4)Experimental Results (4)

M: 25 bp marker (KDR)

lane p1: cs / 4

lane p2: ?dept / 4

lane p3: ee / 5

lane p4: No Primers

lane n1: cs / 4

lane n2 cs / ?floor

lane n3: ?dept / 4

lane n4: ?dept / ?loor

lane n5: No PrimersFigure 3. PCR product of majority voting

confirmed by 3% agarose gel electrophoresis

100bp

M p1 p2 p3 p4 n1 n2 n3 n4 n5 M p1 p2 p3 p4 n1 n2 n3 n4 n5

75bp50bp

25bp

(+): <cs, ?, four>, <?, ?, four>(-): <cs, faculty, four>,<cs, faculty, ?>,

<?, faculty, four>,<?, faculty, ?>

73(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Conclusions and Outlook Conclusions and Outlook

Bio-inspired ML: Machine learning models have been inspired by biological systems.4 Neural networks from central nervous systems4 Genetic algorithms from natural selection

ML for Biology: Bio-inspired machine learning techniques are applied back to study biology.4 Neural networks for protein structure analysis4 Genetic algorithms for sequence alignment

Biotechnical ML: The resulting advancement in biotechnology will further develop the machine learning technology, not just in models but in substrates as well.4 DNA-based inductive machine learning4 Learning to diagnose diseases from natural DNA

74(c) 2002 SNU Biointelligence Lab, http://bi.snu.ac.kr/

Collaborating labs서울대바이오지능 & 인공지능연구실서울대세포및미생물공학연구실

한양대프로테오믹스연구실

㈜바이오니아 & ㈜바이오인포메틱스

Supported by

과기부국가지정연구실사업

산자부차세대신기술연구개발사업

More information at http://bi.snu.ac.kr/

http://cbit.snu.ac.kr/

AcknowledgementsAcknowledgements