ESML, Subsetting, Mining Tools - NASA...ESML, Subsetting, Mining Tools Sara Graves Rahul...
Transcript of ESML, Subsetting, Mining Tools - NASA...ESML, Subsetting, Mining Tools Sara Graves Rahul...
ESML, Subsetting, Mining Tools
Sara GravesSara Graves
Rahul RamachandranRahul Ramachandran
Information Technology and Systems Center (ITSC)Information Technology and Systems Center (ITSC)
University of Alabama in Huntsville (UAH)University of Alabama in Huntsville (UAH)
www.itsc.uah.edu
MODIS Science Team Meeting July 24, 2002
Tools Encompassing All Phasesof Scientific Analysis
• Science Data Usability– Data/Application Interoperability
• Earth Science Markup Language (ESML)
• Science Data Preprocessing– Subsetting
• Various Subsetting Tools such as HEW
• Science Data Analysis– Data Mining
• Algorithm Development and Mining (ADaM) System
• Mission/Project/Field Campaign Coordination– Electronic Collaboration
Science Data Usability
http://esml.itsc.uah.edu
Earth Science Data Characteristics
• Different formats,types and structures(18 and counting forAtmospheric Sciencealone!)
• Different states ofprocessing ( raw,calibrated, derived,modeled orinterpreted )
• Enormous volumes
Ø Heterogeneity leadsto Data usabilityproblem
HDF HDF-EOS
netCDFASCII
Binary GRIB
Data Usability Problem
DATA FORMAT 1
DATA FORMAT 1
DATA FORMAT 2
DATA FORMAT 2
DATA FORMAT 3
DATA FORMAT 3
APPLICATION
READER 1 READER 2
FORMATCONVERTER
• Requires specialized code for every format• Difficult to assimilate new data types• Makes applications tightly coupled to data
• One possible solution - enforce a Standard Data Format• Not practical, especially for legacy datasets
ESML Solution
• ESML (external metadata) files containing the structuraldescription of the data format
• Applications utilize these descriptions to figure out how toread the data files resulting in data interoperability forapplications
ESML LIBRARY
APPLICATION
ESMLFILE
ESMLFILE
ESMLFILE
DATA FORMAT 1
DATA FORMAT 1
DATA FORMAT 2
DATA FORMAT 2
DATA FORMAT 3
DATA FORMAT 3
What is ESML?
• It is a specialized markup language for Earth Sciencemetadata based on XML
• It is a machine-readable and -interpretable representationof the structure and content of any data file, regardless ofdata format
• ESML description files contain external metadata that canbe generated by either data producer or data consumer (atcollection, data set, and/or granule level)
• ESML provides the benefits of a standard, self-describingdata format (like HDF, HDF-EOS, netCDF, geoTIFF, …)without the cost of data conversion
• ESML is an Interchange Technology that allowsdata/application interoperability
ESML Tools/Products Availablehttp://esml.itsc.uah.edu
Ability to browse data files using the ESMLdescription files
ESML Data Browser
Ability to write and validate ESML descriptionsESML Editor
•WINDOWS and LINUX
•URL access
•Handles ASCII, Binary (McIDAS), HDF-EOS,GRIB files
•Handles preprocessing, wild card and symbols
ESML Library
(C++/Java JNI )
•Defines ASCII, Binary (McIDAS), HDF-EOS,GRIB formats (more formats to follow)
•Provides preprocessing, wild card, symbols andsemantic tagging capabilities
ESML Schema
FeaturesTools/Products
MODIS/CERES Collocation Application
ESMLfile
ESMLfile
ESMLfile
ESML Library
Collocation Algorithm
255
256
257
258
259
260
261
262
263
264
265
200 210 220 230 240 250 260 270 280 290 300
Sea Surface Temperature (TMI) Degree Kelvin
Ch
n 5
Tem
per
atu
re (
AM
SU
) D
egre
e K
elvi
n
MODIS CERES MISR/Others
Network
Analysis
Scientists can:• Select remote files across
the network• Select fields by modifying
semantic tags in the ESMLfile
Purpose:• To study the relationship
between shortwave flux andcloud/aerosol properties
• Important for climate changestudies
Science Data Preprocessing
http://subset.org
Currently Available/PlannedSubsetting Applications
• HEW Subsetting– Complete System (available)– Subsetting Engine Only (available)– Subsetting Center (available)– SPOT - Subsettability Checker (available)– HEW Integration with ECS (in work)– Remote Subsetting Service (planned)– Subsetting as a Web Service (planned)
• Customized Subsetting– MODIS tools (available)– Coarse-grain SSM/I Subsetter (available)
• General Purpose Customizable Subsetting– Based on ADaM Data Mining Engine (available)– Subsetting Tool using ESML (in work)
• MODIS – Land, Quality Assessment
•modland – subsetter for MODIS gridded data
• stitcher – pieces together 2 or 4 contiguous MODIS tiles
• MODIS – Atmosphere
•modair - specialized subsetter for MODIS swaths
Tools developed for MODISScientists
HEW integration with ECS
ECS EDG System
EDG ECS
Subsetter Input data
Output data
Enduser Order
submission (HTML)
Data order and reply
Subset ODL and reply
Subsetting System
Output data (Reingested)
1 2
34
5
6
7
• UAH/ITSC-writtensubsetting and interfacesoftware
• Ongoing testing with ECS6a.05 and EDG 3.4 atNSIDC, LP DAAC,GDAAC
• Enhancements for DAACsmay be made
ESML enabled generic Subsetter
ESMLfile
ESMLfile
ESMLfile
ESML Library
Subsetting Algorithm
HDF-EOS Binary/ASCII
OtherFormats
Network
Subsetted Data
For HDF-EOS data notformatted for subsettingwith the HDF-EOS library:ESML file can be used tocorrect the semantic tagrequired to subset HDF-EOS data without the needto recreate the data file
Science Data Analysis
http://datamining.itsc.uah.edu
Data Mining
• Data Mining is the task of discoveringinteresting patterns/anomalies andextracting novel information from largeamounts of data
• Data Mining is an interdisciplinary fielddrawing from areas such as statistics,machine learning, pattern recognition andothers
Iterative Nature of the DataMining Process
DATA
PREPROCESSING CLEANING
AndINTEGRATION
MINING SELECTION
AndTRANSFORMATION
DISCOVERY
KNOWLEDGEEVALUATIONAnd
PRESENTATION
ADaM Engine Architecture
PreprocessedData
PreprocessedData
DataDataTranslated
Data
Patterns/ModelsPatterns/Models
ResultsResults
OutputGIF ImagesHDF-EOSHDF Raster ImagesHDF SDSPolygons (ASCII, DXF)SSM/I MSFC
Brightness TempTIFF ImagesOthers...
Preprocessing AnalysisClustering K Means Isodata MaximumPattern Recognition Bayes Classifier Min. Dist. ClassifierImage Analysis Boundary Detection Cooccurrence Matrix Dilation and Erosion HistogramOperations PolygonCircumscript Spatial Filtering Texture OperationsGenetic AlgorithmsNeural NetworksOthers...
Selection and Sampling Subsetting Subsampling Select by Value Coincidence SearchGrid Manipulation Grid Creation Bin Aggregate Bin Select Grid Aggregate Grid Select Find HolesImage Processing Cropping Inversion ThresholdingOthers...
Processing
InputHDFHDF-EOSGIF PIP-2SSM/I PathfinderSSM/I TDRSSM/I NESDIS Lvl 1BSSM/I MSFC
Brightness TempUS RainLandsatASCII GrassVectors (ASCII Text)
Intergraph RasterOthers...
Reasons for Building a Data MiningEnvironment
Reasons for Building a Data MiningEnvironment
• Provide scientists with the capabilities to iterate
• Allow the flexibility of creative scientific analysis
• Provide data mining benefits of
• Automation of the analysis process
• Reduction of data volume
• Provide a framework to allow a well defined structure forthe entire analysis process
• Provide a suite of mining algorithms for creative analysis
• Provide capabilities to add “science algorithms” to theframework
ADaM : Mining Environment forScientific Data
• The system provides knowledge discovery, feature detection andcontent-based searching for data values, as well as for metadata.
•contains over 120 different operations•Operations vary from specialized science data-set specificalgorithms to various digital image processing techniques,processing modules for automatic pattern recognition, machineperception, neural networks, genetic algorithms and others
Extensibility of ADaM
Visualization/AnalysisPackages
TrainingData
InputData
ADaM Mining EngineAnalysisModules
InputModules
OutputModules
General PurposeAlgorithms
TrainableClassifiers
User DefinedAlgorithms
• Provide scientists with the capabilities to iterate
• Allow the flexibility of creative scientific analysis
• Is a powerful tool for research and analysis given the volume ofscience data
• Extremely useful when manual examination of data is impossible
• Allows scientists to add problem specific algorithms to theADaM toolkit
• Minimizes scientists’ data handling to allow them to maximizeresearch time
• Reduces “reinventing the wheel”
Reasons for using ADaM forScientific Data Analysis
Mission/Project/Field CampaignCoordination
Electronic Collaboration
Strategic and Tactical Coordination
• Data acquisitionand integrationfrom multipleplatforms,instruments andagencies for quickexploitation
• Intra-projectcommunicationsbefore, during,and after CAMEXcampaigns
Technologies to coordinate complex projects
CAMEX-4Coordination:
pre-flight
NASA Aircraft
NOAA Aircraft
USAF Aircraft
Radars
RDBMS
CoordinationClearinghouse
Experiment PI: Coordinates with allparticipants, posts plan of the day
Aircraft Crew: Perform aircraftmaintenance and report status.
Forecaster: Contacts local weather, forecastcenters, weather support web sites to prepareand post daily morning weather briefing
NASA managers may review status ofaircraft, instruments, flight plans atvarious times throughout the mission
Preflight mission briefing andflight planning
Scalable, reliable data management
Web-based interface with customizedinformation access for different user groups;rapid development, scalability and portability
CAMEX-4Coordination:
in flight
NASA Aircraft
NOAA Aircraft
USAF Aircraft
RDBMS
CoordinationClearinghouse
Download latest satelliteimagery to web
Radars
Instrument scientists:Collect, process, and storedata on board aircraft
Forecaster: Continue monitoring satelliteimagery, radar data, and landing forecasts
CommunicationsSatellite
Transmit selected data toNational Hurricane Centerfor inclusion in computerforecast models
Experiment PI: Modify flight planas needed in response to changingweather events
CAMEX-4Coordination:
post-flight
NASA Aircraft
NOAA Aircraft
USAF Aircraft
Radars
RDBMS
CoordinationClearinghouse
Mission and Instrumentscientists: Post sortie andinstrument reports andquicklook data
Forecaster: Prepare post-flight weather briefing
Aircraft Crew: Prepare aircraftand instruments for next flight andupdate status.