notes #1

Transcript of notes #1

Page 1: notes #1

1/4/2010 TCSS588A Isabelle Bichindaritz 1

Introduction to class

Page 2: notes #1

Outline

• Introduction to class

• Introduction to machine learning / data mining

• Introduction to the Life Sciences

• Example and importance of microarray data

Page 3: notes #1

Introduction to Class

• This class focuses on learning how to apply data mining to biological and medical fields to solve some of their problems.

• Does not require prior knowledge in the application areas.

• Does not require prior knowledge in machine learning and/or data mining.

Page 4: notes #1

Introduction to Class

• Data mining specialized in:

– Statistical data analysis and inference: SPSS, R language

– Clustering: SPSS, GenePattern

– Machine learning: RapidMiner

– Classification: RapidMiner, R language

• Requirement: use biological datasets and/or medical datasets.

• Seattle area has many renowned research institutes.

Page 5: notes #1

Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

Page 6: notes #1

The Human Genome Project

• The Human Genome Project

Page 7: notes #1

Data Mining Motivation: “Necessity is the Mother of Invention”

• Data explosion problem

– Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

• We are drowning in data, but starving for knowledge!

• Solution: Data warehousing and data mining

– Data warehousing and on-line analytical processing

– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Page 8: notes #1

What Is Data Mining?

• Data mining (knowledge discovery in databases):

– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

• Alternative names and their “inside stories”:

– Data mining: a misnomer?

– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• What is not data mining?

– (Deductive) query processing

– Expert systems or small ML/statistical programs, although these are often a part of data mining

Page 9: notes #1

What Is Data Mining?

• Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

• Machine learning and knowledge discovery are concerned with discovering knowledge that may be structurally or semantically more complex (models, graphs, new theorems or theories), in particular to assist scientific discovery.

Page 10: notes #1

Data Mining: A KDD Process

– Data mining: the core of the knowledge discovery process.

[Figure: the KDD process. Databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation]
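The stages named on this slide can be read as a chain of functions. The sketch below is only an illustration under stated assumptions: the two record sources, the field names, and the trivial "mining" step (a mean) are all hypothetical, and integration is done before cleaning for brevity.

```python
from statistics import mean

# Two hypothetical sources with overlapping, partly dirty records.
source_a = [{"id": 1, "age": 34, "label": "case"},
            {"id": 2, "age": None, "label": "control"}]
source_b = [{"id": 2, "age": None, "label": "control"},
            {"id": 3, "age": 58, "label": "case"}]

def integrate(*sources):
    """Data integration: merge sources, keeping one record per id."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record["id"], record)
    return list(merged.values())

def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, label):
    """Selection: keep only the task-relevant records."""
    return [r for r in records if r["label"] == label]

def mine(records):
    """Data mining (placeholder step): summarize the selected data."""
    return {"n": len(records), "mean_age": mean(r["age"] for r in records)}

def evaluate(pattern, min_n=2):
    """Pattern evaluation: keep a pattern only if enough records support it."""
    return pattern if pattern["n"] >= min_n else None

warehouse = clean(integrate(source_a, source_b))
pattern = evaluate(mine(select(warehouse, "case")))
print(pattern)
```

In a real project each function would be replaced by a tool or library step; the point is only the order in which data flows through the stages.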

Page 11: notes #1

Machine Learning Functionalities (1)

• Concept description: characterization and discrimination

– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

• Association (correlation and causality)

– Multi-dimensional vs. single-dimensional association

– age(X, “20..29”) ∧ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]

– contains(T, “computer”) ⇒ contains(T, “software”) [support = 1%, confidence = 75%]

– Diaper ⇒ Beer [support = 0.5%, confidence = 75%]
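The bracketed figures attached to each rule are its support and confidence. A minimal sketch of how the two measures can be computed follows; the five market baskets are made up for illustration, so the resulting percentages differ from those on the slide.

```python
# Hypothetical market baskets; each transaction is a set of items.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

s = support({"diaper", "beer"}, transactions)
c = confidence({"diaper"}, {"beer"}, transactions)
print(f"Diaper => Beer [support = {s:.0%}, confidence = {c:.0%}]")
```

Support measures how often the rule applies at all; confidence measures how often the right-hand side holds when the left-hand side does.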

Page 12: notes #1

Machine Learning Functionalities (2)

• Classification and prediction

– Finding models (functions) that describe and distinguish classes or concepts for future prediction

– E.g., classify countries based on climate, or classify cars based on gas mileage

– Presentation: decision tree, classification rule, neural network

– Prediction: predict some unknown or missing numerical values

• Cluster analysis

– Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns

– Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
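A minimal sketch of the clustering principle, using k-means as a stand-in (the slide does not name an algorithm). The house coordinates and the choice of the first k points as seeds are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, k, iterations=10):
    """Plain k-means on coordinate tuples; seeds are the first k points."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical house locations forming two visible groups.
houses = [(1.0, 1.0), (8.0, 8.5), (1.5, 2.0), (9.0, 8.0), (1.0, 1.5), (8.5, 9.0)]
centroids, clusters = kmeans(houses, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Points end up close to their own centroid (high intra-class similarity) and far from the other one (low inter-class similarity), without any class label being supplied.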

Page 13: notes #1

Machine Learning Functionalities (3)

• Outlier analysis

– Outlier: a data object that does not comply with the general behavior of the data

– It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis

• Trend and evolution analysis

– Trend and deviation: regression analysis

– Sequential pattern mining, periodicity analysis

– Similarity-based analysis

• Other pattern-directed or statistical analyses
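One simple way to flag outliers is a z-score test: a value is suspicious when it lies many standard deviations from the mean. The sketch below assumes a threshold of 2 and uses invented transaction amounts; it is one of several possible techniques, not a method prescribed by these notes.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` sample standard
    deviations away from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [12.0, 15.0, 11.0, 14.0, 13.0, 12.5, 250.0, 13.5]
print(zscore_outliers(amounts))
```

With very small samples a single extreme value inflates the standard deviation, so the threshold has to be chosen with the sample size in mind.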

Page 14: notes #1

Are All the “Discovered” Patterns Interesting?

• A data mining or machine learning system/query may generate thousands of patterns, not all of which are interesting.

– Suggested approach: human-centered, query-based, focused mining

• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

• Objective vs. subjective interestingness measures:

– Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.

– Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Page 15: notes #1

Can We Find All and Only Interesting Patterns?

• Find all the interesting patterns: Completeness

– Can a data mining or machine learning system find all the interesting patterns?

– Association vs. classification vs. clustering

• Search for only interesting patterns: Optimization

– Can a data mining or machine learning system find only the interesting patterns?

– Approaches

• First generate all the patterns and then filter out the uninteresting ones.

• Generate only the interesting patterns (mining query optimization)
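The two approaches can be contrasted on frequent itemsets, taking "interesting" to mean "support at or above a threshold". This is a sketch under assumptions: the transactions and the 0.4 threshold are invented, and the level-wise search is a simplified version of the Apriori idea, which these notes do not name.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "d"}]
MIN_SUPPORT = 0.4  # hypothetical interestingness threshold

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def generate_then_filter():
    """Approach 1: enumerate every itemset, then discard the infrequent ones."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset(c) for n in range(1, len(items) + 1)
                  for c in combinations(items, n)]
    return {c for c in candidates if support(c) >= MIN_SUPPORT}, len(candidates)

def levelwise():
    """Approach 2: grow itemsets level by level, extending only frequent ones."""
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT}
    frequent, tested = set(level), len(items)
    while level:
        # Join frequent itemsets that differ by exactly one item.
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        tested += len(candidates)
        level = {c for c in candidates if support(c) >= MIN_SUPPORT}
        frequent |= level
    return frequent, tested

all_found, n_all = generate_then_filter()
pruned_found, n_pruned = levelwise()
print(all_found == pruned_found, n_all, n_pruned)
```

Both functions return the same frequent itemsets, but the second one tests fewer candidates because it never extends an itemset already known to be infrequent.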

Page 16: notes #1

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, information science, and other disciplines]

Page 17: notes #1

Data Mining: Classification Schemes

• General functionality

– Descriptive data mining

– Predictive data mining

• Different views, different classifications

– Kinds of databases to be mined

– Kinds of knowledge to be discovered

– Kinds of techniques utilized

– Kinds of applications adapted

Page 18: notes #1

Architecture of a Typical Data Mining System

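[Figure: layered architecture. From bottom to top: databases and data warehouse; data cleaning, data integration and filtering; database or data warehouse server; data mining engine; pattern evaluation; graphical user interface. A knowledge base sits alongside the data mining engine and pattern evaluation layers.]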

Page 19: notes #1

Introduction to the Life Sciences

• What is human DNA?

– DNA stands for DeoxyriboNucleic Acid

– DNA stores the genetic material, organized in chromosomes, in each cell nucleus

– DNA is transcribed into RNA in the nucleus (transcription), and the RNA is then exported to the cytoplasm

– RNA stands for RiboNucleic Acid

– RNA is translated into proteins in the cytoplasm by a molecular machine called a ribosome (translation)

– DNA → RNA → proteins
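The DNA → RNA → protein flow can be mimicked on strings. The sketch below is a simplification: it reads the coding strand directly, uses only a partial codon table (the seven entries shown follow the standard genetic code), and the example sequence is made up.

```python
# Partial codon table: only the codons needed for the example.
CODON_TABLE = {
    "AUG": "M", "GCU": "A", "UGG": "W", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(coding_strand):
    """DNA -> mRNA: copy the coding strand, replacing T with U."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> protein: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        # "X" marks a codon that is missing from the partial table.
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCTTGGAAATAA"  # hypothetical gene fragment
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

Real transcription uses the template strand and, in eukaryotes, is followed by splicing; the string version only captures the change of alphabet and the reading of triplets.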

Page 20: notes #1

Introduction to the Life Sciences

[Figure: DNA is transcribed into mRNA, rRNA and tRNA; at the ribosome, mRNA is translated into protein]

Page 21: notes #1

Introduction to the Life Sciences

• Gene expression products are any molecular compounds produced from genes (e.g., RNA)

• Genes are expressed by being transcribed into RNA, and this transcript may then be translated into protein.

Page 22: notes #1

Introduction to the Life Sciences

• DNA and RNA are composed of

– Nucleotides (the building blocks of nucleic acids), whose bases are:

• Pyrimidines

– Cytosine (C) (DNA & RNA)

– Thymine (T) (DNA)

– Uracil (U) (RNA)

• Purines

– Adenine (A) (DNA & RNA)

– Guanine (G) (DNA & RNA)

– Sugars (ribose for RNA, deoxyribose for DNA)

Page 23: notes #1

Introduction to the Life Sciences

• A succession of nucleotides composes a single strand of DNA

• Two strands of DNA pair up in the 3-D shape of a double helix, where bases are paired (bp = base pair)

• Pairing of the bases (A=T, G≡C) provides the hydrogen bonds that hold the two strands together in the double helix.
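Base pairing means one strand determines the other. A short sketch, assuming sequences are written 5’ to 3’ and counting two hydrogen bonds per A·T pair and three per G·C pair; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand):
    """Base-by-base partner of a strand (A<->T, G<->C)."""
    return strand.upper().translate(COMPLEMENT)

def reverse_complement(strand):
    """The opposite strand, written in its own 5' to 3' direction."""
    return complement(strand)[::-1]

def hydrogen_bonds(strand):
    """Hydrogen bonds in the duplex: 2 per A-T pair, 3 per G-C pair."""
    strand = strand.upper()
    return 2 * (strand.count("A") + strand.count("T")) + \
           3 * (strand.count("G") + strand.count("C"))

print(complement("ATGC"), reverse_complement("ATGC"), hydrogen_bonds("ATGC"))
```

The reversal is needed because the two strands of the helix run in opposite directions.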

Page 24: notes #1

Introduction to the Life Sciences

Page 25: notes #1

Human Genome Program, U.S. Department of Energy, Genomics and Its Impact on Medicine and Society: A 2001 Primer, 2001

Page 26: notes #1

Introduction to the Life Sciences

• Genes

– A gene is a part of the genome that can be transcribed (and, for protein-coding genes, translated)

– A gene may encode a protein or an RNA sequence

– Genes are separated by non-coding regions

– Genes are concentrated in certain regions of the genome rich in G and C

– Regions rich in A and T contain few genes

– Between the two, CpG islands (stretches rich in CG dinucleotides) separate coding regions from non-coding ones

– Non-coding regions (introns) can be parts of genes
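GC richness and CpG frequency are easy to measure on a sequence. The sketch below computes the GC fraction and the observed/expected CpG ratio; a commonly used rule of thumb calls a region of at least 200 bp a CpG island when GC content exceeds 50% and this ratio exceeds 0.6. The short test sequences are invented and far shorter than a real island.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_observed_expected(seq):
    """Observed CG dinucleotides divided by the number expected if
    C and G occurred independently: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

for fragment in ("CGCGCGAT", "ATATATAT"):  # hypothetical fragments
    print(fragment, gc_content(fragment), round(cpg_observed_expected(fragment), 2))
```

In practice the two measures are computed in a sliding window along the genome rather than on a single fragment.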

Page 27: notes #1

Introduction to the Life Sciences

• Genomes: diversity, size, structure

– Profound diversity among the genomes of living organisms

– DNA (cells), DNA or RNA (phages, viruses)

– Direction: from 5’ to 3’ of the molecule (double-stranded DNA), or both directions (single-stranded)

– Genome organized or not in chromosomes

– Human genome: 23 pairs of chromosomes (22 pairs of autosomes plus the sex chromosomes), about 3 billion base pairs, about 30,000 genes (an early estimate; later counts of protein-coding genes are closer to 20,000)

– Genomes of other species vary in size and number of genes

– The human genome has only about twice as many genes as a primitive worm

– GenBank database

Page 28: notes #1

Introduction to the Life Sciences

• Proteomes

– The proteome is the set of proteins that can be expressed from a genome

– Determination of:

• Sequence of encoding genes

• Location of the genes

• Function of protein-encoding genes

• Different biochemical states (phosphorylation, glycosylation, co-enzymes…)

Page 29: notes #1

Introduction to the Life Sciences

• Gene ontologies

– Gene Ontology Consortium

• Dynamic controlled vocabulary to describe:

– Molecular function (e.g., DNA polymerase, …)

– Biological process (e.g., DNA synthesis, respiration, …)

– Cellular component (e.g., nucleus, ribosome, …)

Page 30: notes #1

Principles of Bioinformatics

• Biological information

– Molecules at the basis of life can be represented as digital symbol strings (DNA, RNA, …)

– Digital symbols (monomers) constitute an alphabet

– Unique representation

– Importance of probabilistic models
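Treating a molecule as a string over an alphabet makes the simplest probabilistic model straightforward: estimate how often each symbol occurs, then score new strings. The sketch below assumes independent positions, a pseudocount of 1, and two invented training sequences; it is only the most basic member of the family of models the slide alludes to.

```python
from collections import Counter
from math import log2

def train_frequencies(sequences, alphabet="ACGT", pseudocount=1):
    """Estimate symbol probabilities, with a pseudocount so that
    symbols never seen in training still get a non-zero probability."""
    counts = Counter({symbol: pseudocount for symbol in alphabet})
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    return {symbol: counts[symbol] / total for symbol in alphabet}

def log2_probability(seq, model):
    """Log-probability of a sequence under the independent-symbol model."""
    return sum(log2(model[ch]) for ch in seq.upper())

model = train_frequencies(["GGCC", "GCGC"])  # hypothetical GC-rich training set
print(model)
print(log2_probability("GC", model), log2_probability("AT", model))
```

A sequence that resembles the training data scores higher than one that does not, which is the basis for richer models such as Markov chains over the same alphabet.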

Page 31: notes #1

Principles of Bioinformatics

• Database annotation quality

– In addition to natural noise, data are distorted by people’s annotations (curation of the data)

– Resulting error is very significant

– Reasons:

• Storage of positions in a sequence, not content

• Difficulty of storing content

– Need to check the data

Page 32: notes #1

Principles of Bioinformatics

• Database redundancy

– Different representations: RNA, cDNA (the corresponding complementary DNA)

– Different methods: single-pass sequence, multi-fold repetition of a sequence

– Different fragments: pre-mRNA can lead to several levels of splicing in cDNA (alternative splicing)

– Redundancy is a source of error:

• Bias from over-represented fragments for closely related segments

• Bias from over-represented fragments for correlations

• Overestimated prediction accuracy if input and output are related

Page 33: notes #1

Principles of Bioinformatics

• Database redundancy

– Better to clean the data first

– Data mining cleaning methods apply

– Difficulty of differentiating between truly analogous sequences and related ones

– A sequence profile describes amino acid variation in a family of sequences
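A sequence profile can be sketched as one frequency table per alignment column. The example assumes the family is already aligned to equal length with no gaps, and the four short peptide sequences are invented for illustration.

```python
from collections import Counter

def sequence_profile(aligned):
    """Per-column amino acid frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({aa: n / len(aligned) for aa, n in counts.items()})
    return profile

def consensus(profile):
    """Most frequent amino acid in each column."""
    return "".join(max(col, key=col.get) for col in profile)

family = ["MKVL", "MKIL", "MRVL", "MKVL"]  # hypothetical aligned family
profile = sequence_profile(family)
for position, column in enumerate(profile, start=1):
    print(position, column)
print(consensus(profile))
```

Columns with a single dominant residue are conserved across the family; columns with several residues show where variation is tolerated.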

Page 34: notes #1

Principles of Bioinformatics

• Main bioinformatics questions

– Determine the exact transition between coding and non-coding regions of genes

– Find genes in prokaryotes and eukaryotes

– Determine transcription initiation and termination

– Sequence clustering and cluster topology

– Protein structure prediction

– Protein function prediction

– Protein family classification

Page 35: notes #1

Principles of Bioinformatics

• Question

– Propose questions pertinent for bioinformatics

– Propose questions pertinent for medical informatics