KDD for Science Data Analysis Issues and Examples.

Date posted: 22-Dec-2015

KDD for Science Data Analysis: Issues and Examples

Contents

  • Introduction
  • Data Considerations
  • Brief Case Studies
      • Sky Survey Cataloging
      • Finding Volcanoes on Venus
      • Biosequence Databases
      • Earth Geophysics
      • Atmospheric Science
  • Issues and Challenges
  • Conclusion

Data Considerations

  • Image data
  • Time-series and sequence data
  • Numerical vs. categorical values
  • Structured and sparse data
  • Reliability of data

Brief Case Studies

  • Sky Survey Cataloging
  • Finding Volcanoes on Venus
  • Earth Geophysics
  • Atmospheric Science
  • Biosequence Databases

Sky Survey Cataloging

The survey consists of 3 terabytes of image data containing an estimated 2 billion sky objects

The basic problem is to generate a survey catalog which records the attributes of each object along with its class: star or galaxy

To achieve this, scientists developed the SKICAT system

Reasons why SKICAT was successful

The astronomers solved the feature extraction problem

Data mining methods contributed to solving difficult classification problems

Manual approaches were simply not feasible.

Astronomers needed an automated classifier to make the most out of the data

Decision tree methods proved to be an effective tool for finding the important dimensions for this problem
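
To illustrate how a decision tree can point to the important dimensions, here is a minimal sketch of the split-selection step: it picks the single attribute and threshold with the highest information gain. The attribute names (`ellipticity`, `noise_measure`) and the toy values are hypothetical, chosen only for illustration; this is not the SKICAT implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) for the best single binary split."""
    base = entropy(labels)
    best = (0.0, None, None)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values,
        # so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, f, t)
    return best


# Hypothetical toy catalog: one (ellipticity, noise_measure) pair per sky object.
FEATURES = ["ellipticity", "noise_measure"]
rows = [(0.05, 0.7), (0.10, 0.2), (0.08, 0.9), (0.45, 0.3), (0.60, 0.8), (0.52, 0.1)]
labels = ["star", "star", "star", "galaxy", "galaxy", "galaxy"]

gain, feature, threshold = best_split(rows, labels)
print(FEATURES[feature], threshold, gain)
```

In this toy data only the first attribute separates the two classes cleanly, so the split lands on it; a full tree would apply the same step recursively to each side.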

Finding Volcanoes on Venus

Data collected by the Magellan spacecraft

The first pass of Venus using the left-looking radar resulted in 30,000 images of 1000 x 1000 pixels each

To help geologists analyze this data set, the JPL Adaptive Recognition Tool (JARtool) was developed

Motivation for using Data mining methods

Scientists did not know much about image processing or about the SAR properties. Hence they could easily label images but not design recognizers

There was little variation in illumination and orientation of objects of interest. Hence mapping from pixel space to feature space can be performed automatically

Geologists did not have any other easy means of finding the small volcanoes. Hence they were motivated to cooperate by providing training data and other help

Earth Geophysics

Given two images taken before and after an earthquake, repeatedly registering different local regions of the two images makes it possible to infer the direction and magnitude of ground motion due to the earthquake.

An example of a geoscientific data mining system is Quakefinder, which automatically detects and measures tectonic activity in the Earth's crust by examining satellite data
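
The local registration idea can be sketched as simple block matching: take a small patch from the "before" image and search nearby offsets in the "after" image for the best match. The code below is an illustrative assumption, not Quakefinder's actual method. It works at whole-pixel resolution on a synthetic image, and the names `local_offset` and `shift_image` are hypothetical.

```python
from math import hypot


def ssd(a, b, ax, ay, bx, by, size):
    """Sum of squared differences between two size x size patches."""
    return sum(
        (a[ay + j][ax + i] - b[by + j][bx + i]) ** 2
        for j in range(size)
        for i in range(size)
    )


def local_offset(before, after, x, y, size, search):
    """Return the (dx, dy) shift that best aligns the patch at (x, y) with `after`."""
    h, w = len(after), len(after[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            nx, ny = x + dx, y + dy
            # Skip candidate patches that fall outside the image.
            if 0 <= nx and nx + size <= w and 0 <= ny and ny + size <= h:
                score = ssd(before, after, x, y, nx, ny, size)
                if best is None or score < best[0]:
                    best = (score, dx, dy)
    return best[1], best[2]


def shift_image(img, dx, dy, fill=0):
    """Synthetic 'ground motion': translate the whole image by (dx, dy)."""
    h, w = len(img), len(img[0])
    return [
        [img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else fill
         for x in range(w)]
        for y in range(h)
    ]


# Synthetic 8 x 8 "before" image with one small, asymmetric feature.
before = [[0] * 8 for _ in range(8)]
before[2][2], before[3][3], before[4][2] = 9, 5, 7
after = shift_image(before, dx=2, dy=1)

offset = local_offset(before, after, x=2, y=2, size=3, search=2)
magnitude = hypot(*offset)
print(offset, magnitude)
```

Repeating `local_offset` over a grid of patch positions would give a field of displacement vectors, from which direction and magnitude can be read per region.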

Atmospheric Science

The data mining tool used is called CONQUEST

Parallel testbeds were employed by CONQUEST to enable rapid extraction of spatio-temporal features for content-based access.

One goal of this tool is the development of “learning” algorithms that look for novel patterns, event clusters, etc.

[Figure: Retrieved sea level pressure fields]

Biosequence Databases

The largest DNA database is GenBank, which holds about 400 million letters of DNA from a variety of organisms

The pressing data mining tasks for biosequence data include:

Finding genes in the DNA sequences of various organisms.

Some gene-finding programs, such as GRAIL, GeneID, GeneParser, and Genie, use neural nets and other AI or statistical methods
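
To give a feel for the gene-finding task, the sketch below scans the forward strand for open reading frames (a start codon followed in-frame by a stop codon). This is a deliberately naive baseline offered as an assumption for illustration. It is not how GRAIL, GeneID, GeneParser, or Genie work, since those rely on neural nets and statistical models. The function name `find_orfs` and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ATG...stop stretches, forward strand only."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the stretch only if it is long enough (stop codon included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical toy sequence containing one short open reading frame.
seq = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(seq)
for begin, end in orfs:
    print(begin, end, seq[begin:end])
```

Real gene finders must also handle the reverse strand, introns, and noisy signals, which is why learned models are used instead of a fixed rule like this one.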

Issues and Challenges

  • Feature extraction
  • Minority classes
  • High degree of confidence
  • Data mining task
  • Relevant domain knowledge
  • Scalable machines and algorithms

Conclusions

KDD applications in science may in general be easier than applications in business, finance, or other areas, because science end users typically know their data in intimate detail.