Data mining ppt


Page 1: Data mining ppt

Applications and Trends in Data Mining

Data Mining For

Biological Data Analysis

Page 2: Data mining ppt

Factors that led to the development

• The past decade has seen explosive growth in:

1. Genomics
2. Proteomics
3. Functional genomics
4. Biomedical research

• Identification and comparative analysis of genomes of humans and other species for investigation of genetic networks.

• Development of new pharmaceuticals and advances in cancer therapies.

Page 3: Data mining ppt

• DNA sequences form the foundation of genetic codes of all living organisms.

• DNA sequences are composed of four basic building blocks called nucleotides:

1. adenine (A)
2. cytosine (C)
3. guanine (G)
4. thymine (T)

• These four nucleotides (or bases) are combined to form long chains that resemble a twisted ladder.
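As a small illustration of the four-letter alphabet, the sketch below counts the bases in a DNA string and builds the complementary strand (A pairs with T, and C pairs with G, which forms the rungs of the ladder). The function names and the sample sequence are assumptions made for this example only.

```python
from collections import Counter

# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def base_counts(seq: str) -> Counter:
    """Count each nucleotide, rejecting characters outside A/C/G/T."""
    seq = seq.upper()
    invalid = set(seq) - set(COMPLEMENT)
    if invalid:
        raise ValueError(f"not a DNA sequence, found: {sorted(invalid)}")
    return Counter(seq)


def complement(seq: str) -> str:
    """Return the complementary strand, base by base."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


if __name__ == "__main__":
    sample = "CTACACACGTGTAAC"  # illustrative sequence only
    print(base_counts(sample))
    print(complement(sample))
```

Rejecting unknown characters early is a design choice for this sketch; real sequence files often contain ambiguity codes such as N, which would need separate handling.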

Page 4: Data mining ppt
Page 5: Data mining ppt

• DNA sequence … CTA CAC ACG TGT AAC …

• A gene usually comprises hundreds of individual nucleotides arranged in a particular order.

• A genome is the complete set of genes of an organism.

• Genomics is the analysis of genome sequences.

• A proteome is the complete set of protein molecules present in a cell, tissue, or organism.

• Proteomics is the study of proteome sequences.

Page 6: Data mining ppt

Data mining may contribute to biological data analysis in the following aspects.

Page 7: Data mining ppt

Biological data mining has become an essential part of a new research field called bioinformatics.

Page 8: Data mining ppt

1)Semantic integration of heterogeneous, distributed genomic and proteomic databases.

• Genomic and proteomic data sets are often generated at different labs and by different methods.

• They are distributed, heterogeneous, and of wide variety.

• Integration of such data is essential to cross-site analysis of biological data.

• Such integration and linkage analysis would facilitate the systematic and coordinated analysis of genome and biological data.

Page 9: Data mining ppt

• This has promoted the development of integrated data warehouses to store and manage derived biological data.

• Data cleaning, data integration, reference reconciliation, classification, and clustering methods will facilitate the integration of biological data and the construction of data warehouses for biological data analysis.
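A minimal sketch of the reference reconciliation step, assuming two labs report the same genes under different names. The alias table, field names, and record values are hypothetical and exist only to show the idea of mapping every record onto one canonical key before merging.

```python
# Hypothetical alias table mapping lab-specific names to one canonical symbol.
ALIASES = {"p53": "TP53", "tp53": "TP53", "brca-1": "BRCA1", "brca1": "BRCA1"}


def canonical(name: str) -> str:
    """Normalize a gene name; unknown names are simply upper-cased."""
    key = name.strip().lower()
    return ALIASES.get(key, name.strip().upper())


def integrate(*sources):
    """Merge records from several (source_name, records) pairs by canonical gene."""
    merged = {}
    for source_name, records in sources:
        for rec in records:
            gene = canonical(rec["gene"])
            entry = merged.setdefault(gene, {"gene": gene, "sources": []})
            entry["sources"].append(source_name)
            # Prefix every other field with its source so values never clash.
            for field, value in rec.items():
                if field != "gene":
                    entry[f"{source_name}.{field}"] = value
    return merged


if __name__ == "__main__":
    lab_a = ("lab_a", [{"gene": "p53", "assay": "microarray"}])
    lab_b = ("lab_b", [{"gene": "TP53 ", "assay": "sequencing"},
                       {"gene": "BRCA-1", "assay": "sequencing"}])
    for row in integrate(lab_a, lab_b).values():
        print(row)
```

Keeping the source name on every merged field preserves provenance, which matters when the labs used different methods.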

Page 10: Data mining ppt

2)Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences.

• BLAST and FASTA, in particular, are widely used tools for the systematic analysis of genomic and proteomic data.

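• Biological sequence analysis methods differ from many sequential pattern analysis algorithms proposed in data mining: they must allow for gaps and mismatches between sequences in order to handle insertions, deletions, and mutations.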

• For protein sequences, two amino acids should also be considered a “match” if one can be derived from the other by substitutions that are likely to occur in nature.
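The idea of treating likely substitutions as weaker matches can be sketched with a small global alignment scorer (Needleman-Wunsch dynamic programming). The similarity groups and the score values below are illustrative assumptions, not a real scoring matrix such as BLOSUM or PAM, and this is not how BLAST or FASTA work internally.

```python
# Illustrative groups of chemically similar amino acids (assumption for this sketch).
SIMILAR_GROUPS = [set("ILV"), set("DE"), set("KR")]


def score(a: str, b: str) -> int:
    """Score one aligned pair of amino acids."""
    if a == b:
        return 2   # exact match
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 1   # likely substitution counts as a weaker match
    return -1      # mismatch


def global_alignment_score(s: str, t: str, gap: int = -2) -> int:
    """Return the best global alignment score of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap          # s prefix aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap          # t prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align two residues
                dp[i - 1][j] + gap,                            # gap in t
                dp[i][j - 1] + gap,                            # gap in s
            )
    return dp[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("KLD", "KLD"))
    print(global_alignment_score("KLD", "RIE"))
    print(global_alignment_score("KLD", "KD"))
```

With these assumed scores, two sequences that differ only by similar residues still receive a positive score, while unrelated residues and gaps are penalized.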

Page 11: Data mining ppt

• There is a combinatorial number of ways to approximately align multiple sequences. Methods developed for multiple alignment include:

1)reducing a multiple alignment to a series of pairwise alignments and then combining the results.

2)using Hidden Markov Models (HMMs).

• Multiple alignment can be used to identify highly conserved residues among genomes, and such conserved regions can be used to build phylogenetic trees to infer evolutionary relationships among species.

• Genomic and proteomic sequences isolated from diseased and healthy tissues can be compared to identify critical differences between them.

• Sequences occurring more frequently in the diseased samples may indicate genetic factors of the disease.
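One simple way to sketch the diseased-versus-healthy comparison is to count short subsequences (k-mers) and report those found in a larger share of diseased samples. The function names, the threshold, and the toy sequences are assumptions for illustration; real studies work on aligned data with statistical testing.

```python
from collections import Counter


def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_support(samples, k):
    """Fraction of samples that contain each k-mer at least once."""
    counts = Counter()
    for seq in samples:
        counts.update(kmers(seq, k))
    return {kmer: n / len(samples) for kmer, n in counts.items()}


def enriched(diseased, healthy, k=3, min_diff=0.5):
    """k-mers whose support in diseased samples exceeds healthy support by min_diff."""
    d = sample_support(diseased, k)
    h = sample_support(healthy, k)
    return sorted(km for km, s in d.items() if s - h.get(km, 0.0) >= min_diff)


if __name__ == "__main__":
    diseased = ["ACGTGA", "TTGTGA"]   # toy sequences, not real data
    healthy = ["ACGTTT", "TTACGT"]
    print(enriched(diseased, healthy, k=3, min_diff=1.0))
```

Support is measured per sample rather than per occurrence so that one long sequence with many repeats does not dominate the comparison.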

Page 12: Data mining ppt

3)Discovery of structural patterns and analysis of genetic networks and protein pathways.

• Protein sequences are folded into 3D structures, and such structures interact with each other based on the relative position and distances between them.

• Such complex interactions lead to the formation of genetic networks and protein pathways.

• It is important to develop powerful and scalable data mining methods to discover patterns and to study regularities and irregularities among complex biological networks.
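As a rough sketch of how interactions based on relative position and distance can be turned into a network, the code below links any two structures whose 3D coordinates lie within a distance cutoff. The labels, coordinates, and cutoff value are made up for illustration, and a plain distance threshold is a simplifying assumption.

```python
import math
from itertools import combinations


def contact_edges(coords, cutoff):
    """Return pairs of labels whose Euclidean distance is at most cutoff."""
    return [
        (a, b)
        for (a, pa), (b, pb) in combinations(coords.items(), 2)
        if math.dist(pa, pb) <= cutoff
    ]


if __name__ == "__main__":
    # Hypothetical 3D positions of three structures.
    coords = {"r1": (0.0, 0.0, 0.0), "r2": (3.0, 4.0, 0.0), "r3": (10.0, 0.0, 0.0)}
    print(contact_edges(coords, cutoff=6.0))
```

The resulting edge list is a graph on which pattern discovery methods could then be applied.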

Page 13: Data mining ppt

4)Association and path analysis: identifying co-occurring gene sequences and linking genes to different stages of disease development.

• Many studies have focused on the comparison of one gene to another.

• Most diseases are not triggered by a single gene but by a combination of genes acting together.

• Association analysis methods can be used to determine the kinds of genes that are likely to co-occur in target samples.

• A group of genes may contribute to a disease process; path analysis is expected to play an important role in linking genes to the different stages of disease development.
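Co-occurrence of genes in target samples can be sketched as a frequent-pair count, the simplest form of association analysis. The gene labels, the sample sets, and the support threshold below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_gene_pairs(samples, min_support=0.5):
    """Return gene pairs that co-occur in at least min_support of the samples."""
    pair_counts = Counter()
    for genes in samples:
        # Sort so that each pair is always counted under the same key.
        pair_counts.update(combinations(sorted(set(genes)), 2))
    n = len(samples)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


if __name__ == "__main__":
    # Each set lists the genes observed in one target sample (toy labels).
    samples = [
        {"G1", "G2", "G3"},
        {"G1", "G2"},
        {"G2", "G3"},
        {"G1", "G2", "G4"},
    ]
    print(frequent_gene_pairs(samples, min_support=0.5))
```

Pairs are only the starting point; the same counting idea extends to larger gene sets, at a higher computational cost.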

Page 14: Data mining ppt

5)Visualization tools in genetic data analysis.

• Alignments among genomic or proteomic sequences and the interactions between them can be:

1)expressed in graphic forms.

2)transformed into various kinds of easy-to-understand visual displays.

• These visualizations facilitate pattern understanding, knowledge discovery, and interactive data exploration.

Page 15: Data mining ppt

Thank you