MA 881 Topics in High Dimensional Data Analysis: Unit 01...

MA 576: Generalized Linear Models- Unit 12 - p. 1 MA 881 Topics in High Dimensional Data Analysis: Unit 01 Introduction and Genesis of High dimensional data Instructor : Surajit Ray

Page 1: MA 881 Topics in High Dimensional Data Analysis: Unit 01 ...math.bu.edu/people/sray/teaching/math881/notes/unit01.pdfMA 576: Generalized Linear Models- Unit 12 - p. 16 Types of Vaccine

MA 881 Topics in High Dimensional Data Analysis: Unit 01

Introduction and Genesis of High Dimensional Data

Instructor : Surajit Ray

Genesis of high dimensional data.

■ Microarray Gene Expression

■ Medical Imaging.

■ Immunoinformatics.

Gene Expression Data: TBP-TAND interaction

■ Description of experiment

Interested in clustering of genes only.

Odontogenic Tumor study

Data from Wright Lab (Dentistry, UNC)

Goal: Study early tumorigenesis using a mouse model and relate the findings to human tumors.

■ Tg.Ac transgenic mice readily develop odontogenic tumors (35%)

■ Odontogenic tumor types based on histology (phenotype)
➥ Type I: Mixed Cell tumor
➥ Type II: Ameloblastomas
➥ Type III: Complex Odontomas

■ Number of genes examined in the microarray > 16,000. [Obviously, filtering will reduce this number considerably.]

Histological classifications are not perfect.

Microarray Technology

■ Animation of cDNA microarray technology

■ Example of microarray data from http://www.ncbi.nlm.nih.gov/geo/

Challenges due to data collection techniques: Pre-processing

■ Screening.

■ Transformation / Measure of central tendency.

■ Normalization/Standardization.

■ Batch Correction. Example: Distance Weighted Discrimination (DWD).
➥ Adjustment of systematic microarray data biases, Bioinformatics, Vol. 20, no. 1, 2004, pages 105-114.
➥ Illustration: Movie
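
The pre-processing steps listed above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the course: the function name `preprocess`, the `min_sd` screening threshold, and the simulated intensities are all assumptions made for the example, and batch correction (DWD) is not attempted here.

```python
import numpy as np

def preprocess(raw, min_sd=0.1):
    """Toy pre-processing of a (genes x samples) matrix of positive intensities.

    min_sd is a hypothetical screening threshold on the log2 scale.
    """
    # Transformation: move to the log2 scale.
    logged = np.log2(raw)
    # Screening: keep genes that vary across samples by more than min_sd.
    keep = logged.std(axis=1, ddof=1) > min_sd
    screened = logged[keep]
    # Normalization: subtract each array's (column's) median log-intensity.
    normalized = screened - np.median(screened, axis=0, keepdims=True)
    # Standardization: z-score each remaining gene across samples.
    mean = normalized.mean(axis=1, keepdims=True)
    sd = normalized.std(axis=1, ddof=1, keepdims=True)
    return (normalized - mean) / sd, keep

# Simulated data (an assumption): 200 genes measured on 6 arrays.
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=8.0, sigma=1.0, size=(200, 6))
raw[0] = 256.0  # a gene with no variation, which screening should drop
z, keep = preprocess(raw)
print(z.shape, bool(keep[0]))
```

Screening is done before normalization here so that a completely flat gene is removed before the per-array medians are subtracted; other orderings are equally defensible.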

Nature of Microarray data

■ High dimensionality comes from the number of genes that can be put on a chip

■ Usually not of high dimension in the number of samples
➥ What are the problems?
➥ Is it an HDLSS (high dimension, low sample size) scenario?
➥ Can classical techniques be used to analyse microarray data?
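
One concrete reason classical techniques struggle in this setting can be illustrated numerically. The sketch below uses made-up sizes (20 samples, 500 genes) and simulated Gaussian data, both of which are assumptions. It shows that the sample covariance matrix has rank at most n - 1, so it cannot be inverted, and methods that rely on its inverse cannot be applied directly.

```python
import numpy as np

# Hypothetical sizes chosen for illustration: many more genes than samples.
n_samples, n_genes = 20, 500
rng = np.random.default_rng(1)
X = rng.normal(size=(n_samples, n_genes))   # rows are samples, columns are genes

# p x p sample covariance matrix of the genes.
S = np.cov(X, rowvar=False)

# Centering removes one degree of freedom, so the rank is at most n - 1.
rank = np.linalg.matrix_rank(S)
print(S.shape, rank)
```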

Analysis of Gene Expression Data

■ Clustering and Classification.

■ Regression Analysis (disease association).

■ Mixture regression.

■ Bi-Clustering/Two way clustering.

■ Combining microarray data with sequence data and Gene Ontology information.
➥ Network?

Gene/drug clustering

Source: http://www.bio.davidson.edu/people/macampbell/CSU/

Example: Medical Image Segmentation

Types of Image data

■ Magnetic Resonance Imaging (MRI): channels T1 and T2.

■ Fluorodeoxyglucose (FDG)-positron emission tomography (PET): detects chemical and metabolic changes in disease states.

Superimposed images for composite analysis

■ 8 Channel Brain Array

Medial Representation of kidney

Representing the right kidney using spokes protruding from 15 medial atoms.

Geometry of M-rep models

The kidney M-rep model gives rise to high-dimensional data, e.g. 15 x 8 = 120 dimensions for each kidney.

M-rep Movie

Challenges

■ High dimension low sample size

■ Very structured geometric information

■ Representation needs analysis of data belonging to manifolds

■ Check for over-parametrization

■ Will PCA work?

■ PCA vs PGA (principal geodesic analysis)

■ Phase transition of PCA
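
As a baseline for the PCA questions above, ordinary PCA can still be computed when the sample size is far below the dimension by working with the small n x n Gram matrix instead of the p x p covariance. The sketch below is only an illustration: the sample size of 10 and the simulated data are assumptions, and it treats the 15 x 8 = 120 coordinates as plain Euclidean numbers, which ignores the manifold structure that motivates PGA.

```python
import numpy as np

# Assumed setup: 10 simulated objects, each described by 15 x 8 = 120 numbers.
n, p = 10, 15 * 8
rng = np.random.default_rng(2)
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                      # center each coordinate

# Dual route: eigen-decompose the n x n Gram matrix.
G = Xc @ Xc.T / (n - 1)
evals, U = np.linalg.eigh(G)                 # eigenvalues in ascending order
order = np.argsort(evals)[::-1][: n - 1]     # keep the n - 1 non-zero components
evals, U = evals[order], U[:, order]

# Map back to unit-length principal directions in the 120-dimensional space.
V = Xc.T @ U / np.sqrt((n - 1) * evals)
scores = Xc @ V                              # component scores for each object

print(V.shape, scores.shape)
```

At most n - 1 components carry any variance, however large p is, which is one way to see why the behaviour of PCA in this regime needs separate study.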

History of Immunology

The Chinese are credited with making the observation that deliberately infecting people with mild forms of smallpox could prevent infection with more deadly forms and provide lifelong protection.

Introduction of first generation of vaccines for use in humans

■ 1798 Smallpox

■ 1885 Rabies

■ 1897 Plague

■ 1923 Diphtheria

■ 1926 Pertussis

■ 1927 Tuberculosis (BCG)

■ 1927 Tetanus

■ 1935 Yellow Fever

Types of Vaccine

■ Live attenuated vaccines

■ Inactivated or ’killed’ vaccines

■ Recombinant sub-unit envelope vaccines

■ Recombinant vectored vaccines

■ DNA vaccines and replicons
➥ Involve HIV genetic sequences which, once injected, induce expression of HIV antigens by human cells.
➥ In the case of replicons, these sequences are wrapped in the outer coat of an unrelated virus.

Cartoon view of Immunology

Immunological Bioinformatics

Issues to be addressed

■ Sequence Analysis: Viral/microbial evolution. Being lower forms of life (or even unicellular organisms), they can mutate much more easily than we can.

■ Gene Expression Analysis: Which genes/molecules/proteins are expressed at a certain time point in the viral life cycle?

■ Classification and Prediction: Which of these proteins will bind to the MHC molecule present in a particular human being?

Adaptive Immune System

■ Destruction of infected cells and tumor cells by cytotoxic T-lymphocytes (CTLs).

■ CTLs are effector cells derived from T8-lymphocytes during cell-mediated immunity.

■ The TCRs and CD8 molecules on the surface of naive T8-lymphocytes are designed to recognize peptide epitopes bound to MHC-I molecules on antigen-presenting cells (APCs).

The bottleneck: binding to MHC molecules

■ Major Histocompatibility Complex (MHC) molecules, also known as human leukocyte antigens or HLA

■ Present on various human cells e.g. dendritic cells.

■ Typically epitopes of length 9.

■ MHC I movie

Epitope Driven vaccine design

Example: MHC I binders

>A85A_MYCTU 48 GLPVEYLQV

MQLVDRVRGAVTGMSRRLVVGAVGAALVSGLVGAVGGTATAGAFSRPGLPVEYLQVPSPS

MGRDIKVQFQSGGANSPALYLLDGLRAQDDFSGWDINTPAFEWYDQSGLSVVMPVGGQSS

FYSDWYQPACGKAGCQTYKWETFLTSELPGWLQANRHVKPTGSAVVGLSMAASSALTLAI

YHPQQFVYAGAMSGLLDPSQAMGPTLIGLAMGDAGGYKASDMWGPKEDPAWQRNDPLLNV

GKLIANNTRVWVYCGNGKPSDLGGNNLPAKFLEGFVRTSNIKFQDAYNAGGGHNGVFDFP

DSGTHSWEYWGAQLNAMKPDLQRALGATPNTGPAPQGA

>A85A_MYCTU 242 KLIANNTRV

MQLVDRVRGAVTGMSRRLVVGAVGAALVSGLVGAVGGTATAGAFSRPGLPVEYLQVPSPS

MGRDIKVQFQSGGANSPALYLLDGLRAQDDFSGWDINTPAFEWYDQSGLSVVMPVGGQSS

FYSDWYQPACGKAGCQTYKWETFLTSELPGWLQANRHVKPTGSAVVGLSMAASSALTLAI

YHPQQFVYAGAMSGLLDPSQAMGPTLIGLAMGDAGGYKASDMWGPKEDPAWQRNDPLLNV

GKLIANNTRVWVYCGNGKPSDLGGNNLPAKFLEGFVRTSNIKFQDAYNAGGGHNGVFDFP

DSGTHSWEYWGAQLNAMKPDLQRALGATPNTGPAPQGA

>A85B_MYCTU 239 KLVANNTRL

MTDVSRKIRAWGRRLMIGTAAAVVLPGLVGLAGGAATAGAFSRPGLPVEYLQVPSPSMGR

DIKVQFQSGGNNSPAVYLLDGLRAQDDYNGWDINTPAFEWYYQSGLSIVMPVGGQSSFYS

DWYSPACGKAGCQTYKWETFLTSELPQWLSANRAVKPTGSAAIGLSMAGSSAMILAAYHP

QQFIYAGSLSALLDPSQGMGPSLIGLAMGDAGGYKAADMWGPSSDPAWERNDPTQQIPKL

VANNTRLWVYCGNGTPNELGGANIPAEFLENFVRSSNLKFQDAYNAAGGHNAVFNFPPNG

THSWEYWGAQLNAMKGDLQSSLGAG

>ACTB_HUMAN 180 ALPHAILRL

MDDDIAALVVDNGSGMCKAGFAGDDAPRAVFPSIVGRPRHQGVMVGMGQKDSYVGDEAQS

KRGILTLKYPIEHGIVTNWDDMEKIWHHTFYNELRVAPEEHPVLLTEAPLNPKANREKMT

QIMFETFNTPAMYVAIQAVLSLYASGRTTGIVMDSGDGVTHTVPIYEGYALPHAILRLDL

AGRDLTDYLMKILTERGYSFTTTAEREIVRDIKEKLCYVALDFEQEMATAASSSSLEKSY

ELPDGQVITIGNERFRCPEALFQPSFLGMESCGIHETTFNSIMKCDVDIRKDLYANTVLS

GGTTMYPGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTFQQMWISKQ

EYDESGPSIVHRKCF
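
Each entry above pairs a protein with a 9-residue binder and a number. For the first A85A_MYCTU entry that number appears to be the 1-based start position of the binder in the sequence. The sketch below checks that reading and also lists every overlapping 9-mer of a sequence, which is the candidate set a binding classifier would have to score. The function names are hypothetical, and only the first 60 residues of A85A_MYCTU (the first sequence line above) are used.

```python
def find_epitope(sequence, epitope):
    """Return the 1-based start of the first occurrence of epitope, or None."""
    idx = sequence.find(epitope)
    return None if idx < 0 else idx + 1

def overlapping_peptides(sequence, length=9):
    """All contiguous peptides of the given length, in sequence order."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]

# First 60 residues of A85A_MYCTU, copied from the first sequence line above.
a85a_prefix = "MQLVDRVRGAVTGMSRRLVVGAVGAALVSGLVGAVGGTATAGAFSRPGLPVEYLQVPSPS"

start = find_epitope(a85a_prefix, "GLPVEYLQV")
peptides = overlapping_peptides(a85a_prefix)
print(start, len(peptides))
```

A sequence of length L yields L - 9 + 1 candidate 9-mers, so even a single protein produces hundreds of peptides to classify.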

Challenges

Representation of peptides and classification based on the representation

■ Property based

■ Which properties to use?

■ Which classification mechanism to use?

■ Can we use X-ray crystallography and utilize the biology of MHC-peptide binding?

Example of bio-physico-chemical properties

Values of the 4 most important indexes (properties) of amino acids determining peptide-MHC binding

Code  Name           Molecular Weight  Volume  Hydropathy  Isoelectric Point
A     alanine        89.09             88.6    1.8         6.00
C     cysteine       121.16            108.5   2.5         5.02
D     aspartate      133.10            111.1   −3.5        2.77
E     glutamate      147.13            138.4   −3.5        3.22
F     phenylalanine  165.19            189.9   2.8         5.48
G     glycine        75.07             60.1    −0.4        5.97
H     histidine      155.15            153.2   −3.2        7.47
I     isoleucine     131.17            166.7   3.8         5.94
K     lysine         146.19            168.6   −3.9        9.59
L     leucine        131.17            166.7   3.8         5.98
M     methionine     149.21            162.9   1.9         5.74
N     asparagine     132.12            114.1   −3.5        5.41
P     proline        115.13            112.7   −1.6        6.30
Q     glutamine      146.14            143.8   −3.5        5.65
R     arginine       174.20            173.4   −4.5        11.15
S     serine         105.09            89.0    −0.8        5.68
T     threonine      119.12            116.1   −0.7        5.64
V     valine         117.15            140.0   4.2         5.96
W     tryptophan     204.23            227.8   −0.9        5.89
Y     tyrosine       181.19            193.6   −1.3        5.66
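
A property-based representation can be built directly from this table: each residue of a 9-mer is replaced by its four property values, giving a 9 x 4 = 36-dimensional feature vector. The sketch below is one possible encoding under that assumption. The names `AA_PROPERTIES` and `encode_peptide` are hypothetical, the values are copied from the table above, and no scaling of the four properties is attempted.

```python
import numpy as np

# (molecular weight, volume, hydropathy, isoelectric point), from the table above.
AA_PROPERTIES = {
    "A": (89.09, 88.6, 1.8, 6.00),    "C": (121.16, 108.5, 2.5, 5.02),
    "D": (133.10, 111.1, -3.5, 2.77), "E": (147.13, 138.4, -3.5, 3.22),
    "F": (165.19, 189.9, 2.8, 5.48),  "G": (75.07, 60.1, -0.4, 5.97),
    "H": (155.15, 153.2, -3.2, 7.47), "I": (131.17, 166.7, 3.8, 5.94),
    "K": (146.19, 168.6, -3.9, 9.59), "L": (131.17, 166.7, 3.8, 5.98),
    "M": (149.21, 162.9, 1.9, 5.74),  "N": (132.12, 114.1, -3.5, 5.41),
    "P": (115.13, 112.7, -1.6, 6.30), "Q": (146.14, 143.8, -3.5, 5.65),
    "R": (174.20, 173.4, -4.5, 11.15), "S": (105.09, 89.0, -0.8, 5.68),
    "T": (119.12, 116.1, -0.7, 5.64), "V": (117.15, 140.0, 4.2, 5.96),
    "W": (204.23, 227.8, -0.9, 5.89), "Y": (181.19, 193.6, -1.3, 5.66),
}

def encode_peptide(peptide):
    """Concatenate the four property values of each residue into one vector."""
    rows = [AA_PROPERTIES[aa] for aa in peptide]   # KeyError on unknown residues
    return np.array(rows, dtype=float).ravel()

x = encode_peptide("GLPVEYLQV")   # one of the MHC I binders listed earlier
print(x.shape)
```

Because the properties are on very different scales (weights near 100, hydropathy near zero), standardizing each column before classification would usually be sensible.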

Allelic Distribution

Allele  Caucasian  African  Asian  Latino
A1      15.18      5.72     4.48   7.40
A2      28.65      18.88    24.63  28.11
A3      13.38      8.44     2.64   8.07
A11     6.17       1.58     17.31  4.83
A24     9.32       2.96     22.03  13.26
B7      12.17      10.59    4.26   6.44
B8      9.40       3.83     1.33   3.82
B15     6.49       3.52     12.21  5.29

New technology: Itopia

Example: Survivin Genome

■ Output from Itopia

■ Table

Properties

■ Can determine experimental binding for several alleles for overlapping peptides

■ Ideal for epitope based vaccine design.

Challenges:

■ Large dataset

■ Can we borrow strength across alleles?

■ Can we design better experiments?

■ Regression instead of classification