NASA Earth Exchange: Improving access to large-scale data and computational infrastructure
ACCESS-11-0034 Annual Review, August 20, 2013
Ramakrishna Nemani, Petr Votava, Andrew Michaelis, Hirofumi Hashimoto, Forrest Melton
2
Background: NASA Earth Exchange
Vision: To engage and enable the Earth science community to address global Earth science challenges.
NEX is a collaborative compute platform that improves the availability of Earth science data, models, analysis tools and scientific results through a centralized environment that fosters knowledge sharing, collaboration, innovation and direct access to compute resources.
Engage:
• Network, share and collaborate
• Discuss and formulate new ideas
• Portal, Virtual Institute
Enable:
• Access to data
• Access to computing
• Access to knowledge
4
Outline
• Project background
• Updated quad chart
• Review of schedule and milestones
• Description of work accomplished and results
• Technical reports and presentations
• Discussion of next 6 month activity
• Schedule and budget summary
5
Project Background
• The main focus of the project is on supporting the NEX community by continuously improving access to data, tools, computing and knowledge.
• By improving the above, we can engage more users and teams and provide them with better and faster support
  - Need to be able to respond quickly to new requirements
  - Focus on knowledge acquisition and access
• We can also help our users to significantly scale their projects
NASA Earth Exchange: Improving access to large-scale data and computational infrastructure
PI: Ramakrishna Nemani Ph.D., NASA Ames Research Center
Goals and Objectives
• Enhance access, discovery and integration of data, models and services for the NEX communities
• Provide integrated system view of NEX data, metadata, processing libraries, models and QA information
• Provide API and client libraries to NEX tools, datasets and search capabilities
• Provide streamlined way for researchers to share their results with the community
Architecture Overview
Approach
• Inventory current NEX datasets, tools and models and engage the community in gathering requirements and use cases.
• Design a common database schema for existing NEX datasets.
• Develop API that facilitates search and access to data, tools and models and use it to implement client libraries
• Develop migration and dissemination tools for NEX users
Co-Is/Partners
• Co-I: Petr Votava, Andrew Michaelis, Dr. Hirofumi Hashimoto, Forrest Melton/CSU Monterey Bay
Key Milestones
• Preliminaries completed 07/2012
• Data integration completed 11/2012
• Process integration completed 01/2013
• System interface completed 08/2013
• Migration tools completed 01/2014
• Client libraries and tools completed 02/2014
TRLin = 6
08/13
8
Project Goal
To enhance access, discovery and integration of data, models and tools for the NEX communities.
9
Objectives for Activity During Review Period
• Complete inventory of current NEX data, metadata, tools and libraries
• Engage NEX users to gather additional data and tools requirements
• Complete initial data integration with the key NEX datasets and the existing infrastructure
• Continue rapid prototyping of database access tools based on user requirements.
• Continue integration of utilities and tools with NEX system.
• Prototype integration with NEX semantic infrastructure.
10
Project Drivers = Why
1. To directly support large-scale NASA projects such as WELD, NAFD, NCA, MEASURES, CMS, CMAC and projects in applied sciences
2. To efficiently support the fast-growing NEX community both inside and outside of NASA
  - Earth science research is a global undertaking and we aim to engage the largest possible community
  - Large global collaboratory
    • Global knowledge pool
    • Need critical mass -> everybody benefits
  - Support for large-scale science while engaging a large community
3. To provide a place for community contributions and access to these contributions
  - Knowledge, tools, data, workflows, …
11
NEX User and Project Evolution
• Number of active compute/data users at the beginning of this ACCESS project: less than 50
• Current number of active compute/data users: 158
• Largest data requirements at the beginning of this ACCESS project: 10s of TB (per project)
• Current data requirements: 100s of TB – 1PB+ (per project)
• On the NEX portal – currently 404 users and 1,252 projects (not all active)
12
ACCESS Project Overview
• Data: Provide integrated view of NEX data and metadata through API, command-line tools and query services.
• Tools: Provide mechanism to discover and manage environments for tools and utilities required by different projects and provide APIs.
• Knowledge: Cross-reference and provide access to information about datasets, tools, users, projects, publications and other docs.
• Dissemination: Establish process, policies and infrastructure for dissemination of data produced on NEX.
• Infrastructure: Components and solutions that enable the above within security and policy constraints.
13
Data Organization
• Started with inventory
• Currently over 450TB on-line and 500+TB near-line
• Feedback from summer school 2012 users, summer interns in 2013 and NEX users and PIs
• Two rounds of “Query Requirements” with the NEX science team
• Two-to-three tier system
  - Primary on-line fast storage, secondary on-line cache, near-line tape accessed through DMF
14
Query Categories and Requirements
• “Standard” queries
  - Temporal, spatial, match region by name, what data are available, …
• Data provenance
  - How was data produced (process/workflow)?
  - What were the inputs into the process?
  - Who created this dataset?
• Knowledge queries
  - Which projects work with dataset X? In what geographic region?
  - Which publications are relevant to dataset X?
• Administration queries
  - How often is the dataset updated? From where?
• Analytics queries (not addressed by this project)
  - Filter based on internal QA, Landcover or statistics
  - Large number of requests for these capabilities
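The “standard” temporal and spatial query category can be illustrated with a minimal sketch. It assumes a hypothetical `scenes` table that stores one acquisition date and one bounding box per scene; the table, column names and sample rows are illustrative only and are not the NEX schema.

```python
import sqlite3

# Hypothetical schema (not the NEX schema): one row per scene with an
# acquisition date and a latitude/longitude bounding box.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE scenes (
           scene_id TEXT PRIMARY KEY,
           sensor   TEXT,
           acquired TEXT,
           west REAL, south REAL, east REAL, north REAL
       )"""
)
# Dates are ISO 8601 strings, so string comparison orders them correctly.
conn.executemany(
    "INSERT INTO scenes VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("s1", "Landsat", "2010-04-15", -122.5, 36.0, -120.5, 38.0),
        ("s2", "Landsat", "2010-10-02", -122.5, 36.0, -120.5, 38.0),
        ("s3", "MODIS", "2010-04-20", -60.0, -10.0, -50.0, 0.0),
    ],
)


def find_scenes(conn, start, end, west, south, east, north):
    """Return IDs of scenes acquired in [start, end] whose bounding box
    intersects the query box (west, south, east, north)."""
    rows = conn.execute(
        """SELECT scene_id FROM scenes
           WHERE acquired BETWEEN ? AND ?
             AND west <= ? AND east >= ?
             AND south <= ? AND north >= ?
           ORDER BY scene_id""",
        (start, end, east, west, north, south),
    )
    return [row[0] for row in rows]


# Spring 2010 scenes that intersect a box over central California.
spring_california = find_scenes(
    conn, "2010-04-01", "2010-06-30", -123.0, 35.0, -119.0, 39.0
)
```

Two boxes intersect when each one starts before the other ends on both axes, which is why the query compares the scene's west edge with the query's east edge and so on. Matching a region by name would add a lookup from name to bounding box before this step.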
15
Data Organization Details
• Keep metadata in the original format/naming conventions
  - Researchers are used to the metadata names
  - At times extensive documentation exists to describe the metadata
• Metadata are processed by custom parsers
  - Different for different sensors (MODIS, Landsat, NAIP, …)
• Each dataset is stored in a separate set of tables, and when it is added to NEX a custom plug-in is written
  - Overrides abstract methods from the DB class
  - It is manageable, because the number of datasets is not that large (a few dozen at most), and writing generic code while maintaining the original metadata would take longer
  - We are experimenting with a semantic layer that describes and maps terms in different DBs to a common taxonomy, but it requires dynamic query rewriting and its suitability for this problem is questionable.
  - The best solution in this case seems to be either fully relational (current) or fully graph-based (future). The implementation needs to be hidden behind an API; however, users at times want access to a full RDBMS, in which case maintaining two consistent copies seems the best answer.
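The per-dataset plug-in approach can be sketched as follows. This is a minimal illustration under stated assumptions: the class names `DatasetDB` and `LandsatPlugin`, the method names, the table name and the sample metadata are all hypothetical and are not the NEX code. It shows a base class whose abstract methods each plug-in overrides, with the original metadata names kept as column names.

```python
from abc import ABC, abstractmethod


class DatasetDB(ABC):
    """Hypothetical base class; each dataset plug-in overrides the abstract methods."""

    @abstractmethod
    def table_name(self) -> str:
        """Name of the table that holds this dataset's metadata."""

    @abstractmethod
    def parse_metadata(self, raw: str) -> dict:
        """Parse sensor-specific metadata, keeping the original field names."""

    def insert_statement(self, raw: str):
        """Build a parameterized INSERT from whatever the plug-in parsed."""
        record = self.parse_metadata(raw)
        columns = ", ".join(record)
        placeholders = ", ".join("?" for _ in record)
        sql = f"INSERT INTO {self.table_name()} ({columns}) VALUES ({placeholders})"
        return sql, tuple(record.values())


class LandsatPlugin(DatasetDB):
    """Hypothetical plug-in for 'KEY = VALUE' style metadata files."""

    def table_name(self) -> str:
        return "landsat_metadata"

    def parse_metadata(self, raw: str) -> dict:
        record = {}
        for line in raw.splitlines():
            if "=" in line:
                key, value = line.split("=", 1)
                # Keep the key exactly as the data provider names it.
                record[key.strip()] = value.strip().strip('"')
        return record


# Illustrative sample input, not taken from a real scene.
sample = 'WRS_PATH = 44\nWRS_ROW = 34\nDATE_ACQUIRED = "2010-04-15"'
sql, params = LandsatPlugin().insert_statement(sample)
```

Because the shared logic lives in the base class, adding a sensor means writing only a parser and a table name, which fits the observation above that a few dozen plug-ins are cheaper than one generic parser.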
16
Tools/Utilities/Models
• Tried a number of approaches
  - Users often want custom solutions with specific library/tool versions
  - Management of this quickly gets complicated
• Using “modules” infrastructure to provide custom environments for NEX teams
  - We can easily mix and match versions as per team’s requirements
  - Also good for easy reproduction/packaging of environments
  - Will be the basis for tool contribution setup (nex/contrib)
• Access to almost all tools through a Python API or through regular command-line invocation
  - Great for integration with VisTrails workflow management system
• Mechanism to query a list of modules to be built or request a new module to be built.
• Working on adding better search and documentation capabilities
  - Also, exposing documentation externally on the NEX portal
17
Knowledge Organization
• Internal NEX knowledge graph
  - Spans data, content, web portal, tools
  - Provenance
• RDF/OWL representation
  - Triple and quad-store (MySQL and Virtuoso)
• Knowledge acquisition
  - Manual = documentation, blogs etc. (internal and external)
  - Automatic = entity extraction from text and metadata using natural language processing
    • Location, datasets used by project, sensors
• Build relationships
• Improves search – who is doing what where
  - Who is doing work in the Amazon, and what sensors are they using? Which sensors are used most frequently by NEX projects?
• Can generate project concepts, so that projects can be easily related to each other (LSI)
18
Relating Entities
[Diagram: relating entities in the NEX graph data store. Components and labels shown:
• NEX projects, wikis, … (NEX web portal): extract entities
• Publications (NEX web portal, Harvard database, …): extract entities
• GCMD concepts: link to
• NEX extension (additional concepts outside the GCMD hierarchy – data hierarchy, …): link to/define new
• NEX graph data store: queries, link to resources, links to external docs (LP DAAC, …)
• Provenance from running process: record provenance]
19
Example queries
• What is the provenance of file X?
• What is the bounding box of region R?
• What is the usage of each NASA instrument in the NEX projects, sorted by number of projects?
• What instruments are used by projects doing research in the Amazon?
• What are the most cited datasets in the remote sensing publications?
• Now that the NEX portal has been migrated to NAS, we can integrate this information with the portal much more easily.
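Two of the example queries can be sketched over a toy set of subject-predicate-object triples. This is only an illustration of the idea: the project names, the predicates `usesInstrument` and `studiesRegion`, and the in-memory list standing in for a triple store are all assumptions, not the NEX vocabulary or its RDF store.

```python
# Hypothetical triples; the predicates and project names are illustrative only.
TRIPLES = [
    ("project:A", "usesInstrument", "MODIS"),
    ("project:A", "studiesRegion", "Amazon"),
    ("project:B", "usesInstrument", "MODIS"),
    ("project:B", "usesInstrument", "Landsat"),
    ("project:B", "studiesRegion", "California"),
    ("project:C", "usesInstrument", "Landsat"),
    ("project:C", "usesInstrument", "MODIS"),
    ("project:C", "studiesRegion", "Amazon"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def instrument_usage(triples):
    """Instruments with the number of projects using each, most used first."""
    projects_by_instrument = {}
    for project, _, instrument in match(triples, p="usesInstrument"):
        projects_by_instrument.setdefault(instrument, set()).add(project)
    counts = [(inst, len(projs)) for inst, projs in projects_by_instrument.items()]
    # Sort by descending project count, then by name for a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


def instruments_in_region(triples, region):
    """Instruments used by projects that study the given region."""
    projects = {s for s, _, _ in match(triples, p="studiesRegion", o=region)}
    used = {o for s, _, o in match(triples, p="usesInstrument") if s in projects}
    return sorted(used)


usage = instrument_usage(TRIPLES)
amazon_instruments = instruments_in_region(TRIPLES, "Amazon")
```

In an RDF store the same questions would typically be written as graph patterns with grouping and joins; the pattern-matching helper here plays that role at a very small scale.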
20
Data Dissemination
• A number of facets
  - Large-scale data distribution (CMIP-5 for NCA)
  - Web-services application support (SIMS)
  - Open Access – Amazon
• Focus not only on the mechanics and implementation, but also on the development of protocols and policies
  - Often more time-consuming than implementation
21
CMIP-5 Dissemination
• Downscaled climate dataset produced on NEX (17TB)
  - Important and highly requested by the community
• First process for NEX data -> NASA distribution facility
  - Established DOI minting capabilities (through UC Digital Library)
    • http://dx.doi.org/10.7292/W0WD3XH4
  - Established a technique for DOI dataset verification through checksums, without extensive web services, that works even when the underlying naming changes.
• Data available at:
  - http://dataserver.nccs.nasa.gov/thredds/idd/bypass.html
  - And internally on NEX
  - Data had to be aggregated and reformatted for use by NCCS
    • This raises the issue of verification against the original datasets, as well as the fact that there are effectively two copies of the data in different formats
• Required extensive work with users and produced many lessons learned, which led to an updated protocol with NCCS; the protocol will differ for other facilities
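One way to verify a dataset by checksums that survives file renaming is to fingerprint the file contents only. The sketch below is an assumption about how such a check could look, not the technique actually used for the CMIP-5 DOI; the function names and the choice of SHA-256 are illustrative.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_fingerprint(directory):
    """Fingerprint that depends only on file contents, not on file names.

    Per-file digests are sorted before being combined, so renaming or
    moving files inside the directory leaves the fingerprint unchanged,
    while any change to the contents alters it.
    """
    digests = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            digests.append(file_digest(os.path.join(root, name)))
    combined = hashlib.sha256()
    for item in sorted(digests):
        combined.update(item.encode("ascii"))
    return combined.hexdigest()
```

A fingerprint recorded when the DOI is issued could later be compared with one computed on a downloaded copy. Note that this only covers byte-identical copies: data that has been aggregated or reformatted, as for NCCS above, would need a separate comparison at the level of the decoded values.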
22
NASA Satellite Irrigation Management Support (SIMS)
• ACCESS software infrastructure directly supports the SIMS project (NASA Applied Sciences)
  - Built partially on efforts from the last ACCESS project
  - Provides access to near-real-time Landsat data time-series through a data cube interface
  - The goal of the SIMS project is to develop new information products from satellite data to support growers in optimizing irrigation
    • Currently tested by 12 partner growers
  - Data visualization and queries via web services built on OPeNDAP
  - Both web-based and mobile interfaces
23
[Figure panels: crop condition (% cover), crop coefficient, crop water requirement]
An example of the SIMS web / mobile data interface, which is designed to enhance grower access to satellite-derived measures of crop condition and crop water requirements across 3.7 million ha of irrigated land in California.
24
Amazon Web Services Space Act Agreement
• Prototype process for providing access to NEX data through public cloud facilities
  - Open access to data and workflows
    • We are reaching capacity on NEX and have restrictions on access
  - Different cost model – billing for computing is under users’ control
  - We can add complete Virtual Machines with packaged environments and workflows developed and managed on NEX and accessible through the NEX web portal
  - Prototyping effort includes
    • NCA-related activities
  - NCA downscaled data (CMIP-5)
  - NEX portal linked with Amazon Web Services (open) or internal (NEX-members only) NEX work environment
25
Infrastructure
• Database setup
  - Access to database systems from all NEX components
  - Mostly MySQL-based, experimenting with Virtuoso, Neo4j and revisiting MongoDB
• Supercomputing setup
  - Work with NAS system group to enable access even from within the Pleiades supercomputer
  - Needed for easier streaming of provenance information
• Applications support
  - Separate OPeNDAP, THREDDS and FTP servers
• Security considerations
  - Moderate system = 2-factor authentication required
  - Waiver for NEX portal for OpenID and NDC users
  - One of the drivers for testing public cloud solutions to improve access
26
Immediate Benefits for Many NEX Projects (Examples)
• Web-Enabled Landsat Data (WELD)
  - Acquisition, organization and access to data and processing capabilities for monthly Landsat vegetation composites – 800+TB total data requirements
• North American Forest Dynamics (NAFD)
  - Acquisition, organization and access to data, QA, metadata and processing capabilities for Landsat (80TB)
• BIOCLIM
  - Acquisition and organization of global MODIS land and atmospheric products including swath mapping to acquisition regions (15 TB).
27
Creating Global Monthly Landsat Composites, 1999 - Present
Takes over 10,000 scenes each month using WELD system
[Figure panels: April 2010, October 2010]
Web Enabled Landsat Data: Going Global, Roy et al.
28
North American Forest Disturbance (NAFD, Goward et al.)
Expanding from 23 samples to wall-to-wall coverage
Processing 96,000 scenes from 1985-2010 on NEX
33
Summary of Activity During Review Period (1)
• Inventory of NEX tools and datasets.
  - Started with 25 existing datasets on NEX comprising about 300TB of data.
  - Worked with NEX users to better understand:
    • How they use the data, metadata and QA information
    • Which tools and utilities they use the most and what functionality is missing from the existing tools and utilities.
• We have prototyped the database access for a number of use cases, and some parts of it are already being used by NEX science teams.
• As the science teams have developed a highly sought-after downscaled climate dataset, we have prototyped a process through which the data are distributed by NASA’s NCCS facility.
34
Summary of Activity During Review Period (2)
• Set up an initial NEX-wide repository based on the “module” utilities that enables us to customize environments for specific users’ needs in terms of tool/software versions and dependencies.
• Started to integrate some of the tools and utilities for data manipulation with the NEX semantic infrastructure and prototyped an end-to-end process of semantic data and process integration with MODIS climatology processes that also includes provenance capture.
• Worked closely with several NEX projects to establish an initial NEX database and tools API, which is currently in use mainly for access to Landsat and MODIS data and metadata for both gridded and swath datasets.
35
Summary of Activity During Review Period (3)
• Added a new metadata collection capability for some datasets that enables us to better estimate future data requirements as well as provide users with additional information, mainly for QA screening purposes.
• Prototyped an automated process through which users can submit requests for data, tools and models to be included on NEX using PivotalTracker.
36
Papers and Presentations
“NASA Earth Exchange (NEX): Earth science collaborative for global change science”. Presented at IGARSS 2012.
“NASA Earth Exchange (NEX)”. Presented at Supercomputing 2012.
“Connecting Provenance and Semantic Descriptions in NASA Earth Exchange (NEX)”. Presented at AGU 2012.
37
ESDSWG Participation
• Participated in the 2012 ESDSWG meeting
• Participated in the Semantics Working Group until it was dissolved
• Currently participate in the Cloud Computing Working Group
• Plan to attend the 2013 ESDSWG meeting and expand participation to the Earth Science Collaboratory WG
38
Relationship to other funded activities
• AIST
  - Facilitates access to tools and knowledge through API for workflow integration
• CMAC (Data Mining)
  - Facilitates access to data and pre-processing tools
• CMAC (Recommendations)
  - Facilitates access to tools through workflows
• National Climate Assessment (NCA)
  - Facilitates process for NEX-produced data distribution for NCA
• BIOCLIM
  - Facilitates access to tools, data and libraries for several BIOCLIM projects.
39
Relationship to NEX
• Provides foundation for user/project work environments
• Provides access to metadata for integration with the NEX knowledge system
• Provides the overarching metadata architecture for data and processes integrated through a semantic layer
41
Summary of Work During Next Review Period (through 2/14)
• Continuous integration of tools and utilities with the NEX infrastructure based on user requirements
• Continuous integration of data with the NEX infrastructure based on user requirements
• Continue to work on the data and process interface (API) – the initial API is in Python, but we are also working with users on access to data and tools through R and MATLAB
  - The extent of this will be driven by user requirements
• Work with users to continue integrating documentation, FAQs and code samples for the tools and datasets so that they are available both on the computing platform and on the NEX web portal.
42
Cumulative Budget (3/2012 – 8/2013)
• FY12: $141,750 - All funds have been obligated
• FY13: $145,200 - All funds have been obligated
• Does it match your numbers?
43
Glossary
• API: Application Programming Interface
• BIOCLIM: Climate and Biological Response: Research and Applications
• CMAC: Computational Modeling Algorithms and Cyberinfrastructure
• CMS: Carbon Monitoring System
• DMF: Data Migration Facility
• HEC: High-End Computing
• HPC: High-Performance Computing
• NAFD: North American Forest Disturbance
• NCCS: NASA Center for Climate Simulation
• NEX: NASA Earth Exchange
• OWL: Web Ontology Language
• RDF: Resource Description Framework
• SIMS: Satellite Irrigation Management Support