
Strategic Initiative in Support of CTBT Data Processing: vDEC (virtual Data Exploitation Center)
S. Vaidya (1), E.R. Engdahl (2), R. LeBras (3), K. Koch (3), O. Dahlman (3)

(1) Lawrence Livermore National Laboratory, (2) University of Colorado, (3) CTBTO

Vision: The Comprehensive Test Ban Treaty (CTBT) is receiving renewed attention in the Obama administration, in light of this administration's long-range objective of steering towards a nuclear-weapon-free world. While the current International Data Center (IDC) has successfully detected several low-magnitude explosions, opportunities exist to further enhance its detection and localization capabilities, drawing upon the more advanced tools and techniques for large-scale data analysis blossoming in the private and public sectors.

Over the years, the IMS sensor data volume has grown to tens of terabytes, and it will continue to grow as more receiver stations come on line. Exploiting this valuable repository of historical data to extract better sensor models, earth models, phase classifiers, and association algorithms could prove extremely valuable, especially as the network targets even lower-threshold occurrences. Innovative approaches for automatically mining large databases are providing a valuable service for applications ranging from text and image search to outlier detection in financial transactions and target tracking and analysis. These approaches have matured from the purely empirical to machine-learning-based and statistically driven algorithms adapted to the physics of the problem. Furthermore, methods for extracting extremely weak signatures from noisy data and improving the accuracy of anomaly characterization are also becoming more robust; these too could be brought to bear to enhance IDC effectiveness for qualifying small-magnitude events.

As such, we believe it is timely to formulate a strategic R&D thrust focused on the means of handling present and future IMS sensor data sets, taking into account the state of the art, the potential for advances in areas such as data structures and query techniques, the shaping of sensor data for exploitation, and emerging methodologies for pattern discovery. The long-term goal of this effort will be to assist the CTBTO analyst in making robust, expedient decisions, aided by a historical perspective, in the face of rapidly growing multi-sensor information and the increasing need for more accurate and timely feedback for event characterization.

To facilitate such an activity, we propose to define a virtual development environment, vDEC (Virtual Data Exploitation Center), which will (1) connect experts from different disciplines with the IDC framework, to assess and develop upgrades for the current data processing infrastructure, with a view to lowering the cost and improving the accuracy and efficiency of data analysis; (2) enable a pathway for new technology insertion and upkeep within the CTBTO operational environment; and (3) aid in constructing data exploitation architectures and system solutions for enhancing the fidelity of time-critical on-site inspections (OSI).

Mission: The objective of this Center will be to link together, via the world-wide web, computer scientists and statisticians from academia, industry, and stakeholder agencies with domain experts, to focus on the IDC data processing challenge. Current analyses rely upon slow, expensive, and often error-prone manual correction of automated front-end procedures. vDEC's goal will be to evolve robust, efficient automated methodologies for waveform analysis and network processing, in an environment where new and innovative ideas can be tested and applied in context and with minimal delay. vDEC will exploit open, IDC-defined data sets for algorithmic development and prototyping, consistent with current standards for data formats and interfaces. The Center will simultaneously champion industry partnerships so as to draw in methods and best practices from the private sector, as well as define a conduit for technology migration and long-term maintenance of IDC infrastructure enhancements. The goal will be to evolve and drive a phased roadmap which progressively integrates modules into the IDC data processing chain. Projects will be selected on the basis of their usefulness to the IDC over a rolling 3-5 year time horizon.

Technical Portfolio: The following enumerates some of the research themes, in varying degrees of maturity, that could be exploited for CTBT verification.

Probabilistic Classification and Inferencing: The current IDC software architecture automatically isolates and labels detections, assigns phases, and identifies events, creating an output bulletin of seismic activity for subsequent analyst review. Human analysts then meticulously check each event to remove redundancies and false positives and to insert missed events. The discrepancy between the automated bulletin and the reviewed bulletin could be substantially reduced by incorporating modern machine learning methods for phase/event classification that use historical data from past analyst-reviewed bulletins as learning sets for training classifiers. A number of prior-data-driven techniques have evolved over the years to address such probabilistic classification challenges, including kernel methods, nonparametric methods, clustering, and decision trees. These remain candidates for seismic event extraction.

Drawing conclusions from IDC data sets currently relies upon a series of pipelined stages, where distinct software modules perform tasks based upon locally derived information. This has the advantage of efficiency and simplifies modular implementation. However, incorrect front-end labeling can propagate through the chain to produce erroneous end results with no recourse to midcourse correction. Thus, tasks such as automated detection of weak sub-threshold signals, or cooperative correction of signatures based upon multiple sensors, remain a human responsibility. The question to be asked is whether this sequential aspect of current processing can be converted into a bidirectional flow via probabilistic inference techniques, so that information at later stages about hypothesized events can effectively revise earlier detection and classification decisions. Emergent hierarchical probabilistic models offer the possibility of drawing conclusions based upon all the evidence available, not just the local context, by consistently updating the uncertainties in different information sources. The time may be right to explore some of these avenues for qualifying event signatures.
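As a toy illustration of this bidirectional-flow idea, the hedged sketch below shows how a hypothesized event can revise the credibility of a marginal station trigger via a simple Bayes update. All probabilities are illustrative assumptions, not calibrated IMS values.

```python
# Hedged sketch: a hypothesized event revises the probability that a weak
# station trigger is a real P arrival. Numbers are illustrative only.
def posterior_detection(p_trigger_given_arrival, p_trigger_given_noise, p_arrival_prior):
    """Bayes update: P(real arrival | trigger) for one station."""
    num = p_trigger_given_arrival * p_arrival_prior
    den = num + p_trigger_given_noise * (1.0 - p_arrival_prior)
    return num / den

# Local (station-only) view: a weak trigger with a 5% prior of being a real arrival.
print(posterior_detection(0.6, 0.1, 0.05))   # ~0.24

# With a hypothesized event whose predicted travel time matches the trigger,
# the prior on a real arrival at this station rises, and the same trigger
# becomes far more credible.
print(posterior_detection(0.6, 0.1, 0.5))    # ~0.86
```

In a real system the prior would come from a network-level association or event-hypothesis stage rather than a hand-set constant; the point is only that later-stage evidence can flow back to revise an earlier detection decision.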

Seismic Data Storage and Search: The IDC currently stores seismic data in a relational database (e.g. Oracle, MySQL) where records are indexed and queried by a predetermined set of standard attributes (schema), such as the station name, the number of channels, and time. Relational databases, however, become impractical as the data size grows, especially when more complex attributes are desired, since computing those indices would require interrogating all the records in the database, an increasingly time-consuming task. To address this challenge, data-intensive businesses such as Google and Yahoo! have developed efficient, schema-neutral distributed storage paradigms, hosting the data on thousands of nodes, moving the computation closer to each node, and relying upon redundancy to protect against transient faults or hardware failures. In addition, a wide database research community is exploring alternative methods of spatial and temporal data structuring and layout optimization for rapid access and query. These skills can be leveraged to assist in IDC archival data search and content extraction.
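The following is a minimal sketch of the "move the computation to the data" pattern, assuming waveform data partitioned across nodes. The partition naming, the 40 Hz sample rate, and the summary attributes are hypothetical stand-ins, and Python's process pool is used in place of a real distributed framework.

```python
# Hedged sketch: a derived attribute (peak absolute amplitude and RMS per
# hour-long segment) is computed in parallel where each data partition lives,
# and only the small summaries are gathered for indexing and query.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def summarize_partition(seed: int) -> dict:
    """Stand-in for a worker co-located with one storage node/partition."""
    rng = np.random.default_rng(seed)
    waveform = rng.normal(size=3600 * 40)        # one hour at 40 Hz (synthetic)
    return {"partition": f"seg-{seed:04d}",      # hypothetical partition id
            "peak_amp": float(np.max(np.abs(waveform))),
            "rms": float(np.sqrt(np.mean(waveform ** 2)))}

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        summaries = list(pool.map(summarize_partition, range(8)))
    # Only the compact summaries travel back; they can be indexed and queried
    # without re-reading the raw waveforms.
    print(sorted(summaries, key=lambda s: s["peak_amp"], reverse=True)[:3])
```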

Physics-Based/Empirical Earth Models: Production of high-quality ground truth for seismic characterization remains a costly endeavor, requiring state-of-the-art supercomputing to model the geology of the transmission medium; and even then, the geographic sensor coverage in many areas is insufficient to yield either high-fidelity 3D models or comprehensive empirical calibration of travel times. Stochastic techniques are evolving which incorporate physics-based constraints on seismic travel times to cull the incoming data sets and improve predictions. These methods can prove extremely valuable for tomographic inversion and event location, as well as for providing more accurate training data for classification and inferencing.

In addition, as more diverse sensors come on line, the opportunity arises to automatically federate and fuse data sources to establish event correlations. Such data fusion can not only aid the inferencing process but also serve to update existing models for phase/event classification, thereby further enhancing the validity of the final output. These techniques could prove extremely valuable for OSI, where time to solution is of the essence.
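As a hedged sketch of the stochastic-location idea, the example below performs a brute-force grid search over a Gaussian travel-time-residual likelihood under a deliberately simplified constant-velocity model. The station geometry, velocity, and pick errors are illustrative assumptions, not a real earth model.

```python
# Hedged sketch: probabilistic event location with a toy travel-time model.
import numpy as np

V = 8.0                      # assumed uniform P velocity, km/s
stations = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 120.0], [90.0, 110.0]])
sigma = 0.3                  # assumed picking error, s

def travel_time(src, stn):
    return np.linalg.norm(src - stn) / V

# Synthesize picks from a "true" source, then recover it by grid search
# over a log-likelihood surface built from travel-time residuals.
rng = np.random.default_rng(2)
true_src, t0 = np.array([40.0, 60.0]), 10.0
picks = np.array([t0 + travel_time(true_src, s) for s in stations])
picks += rng.normal(scale=sigma, size=len(stations))

xs = ys = np.arange(0.0, 130.0, 1.0)
best, best_ll = None, -np.inf
for x in xs:
    for y in ys:
        src = np.array([x, y])
        tt = np.array([travel_time(src, s) for s in stations])
        # Origin time that best fits these picks, then Gaussian log-likelihood.
        t0_hat = np.mean(picks - tt)
        ll = -0.5 * np.sum(((picks - tt - t0_hat) / sigma) ** 2)
        if ll > best_ll:
            best, best_ll = (x, y, t0_hat), ll

print("true source:", true_src, "recovered (x, y, origin time):", best)
```

A better earth model simply replaces travel_time with predictions from a calibrated 1-D or 3-D model; the probabilistic machinery around it stays the same.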

Continuous Station Self-Calibration and Self-Diagnosis: Today's station-specific "recipes" and error-masking rules require manual design on a per-station basis and do not adapt to changing monitoring conditions, leaving several stations effectively non-functional on an ad hoc basis. To remedy this, one approach may be to implement algorithms for fault checking and self error correction at the collection nodes, so as to filter out random faults and sensor drift as well as keep the data processing pipeline (and, by extension, the human analyst) informed of upstream station uncertainties. In addition, techniques which provide feedback to the collection network based upon the validity of end results can provide a valuable learning tool for continuous sensor calibration. Such methods are widely deployed in the machine learning and robotics communities for mapping and localization and could prove useful in the vDEC domain. Alternatively, it may be as or more effective to calibrate sensor parameters based upon data from other stations; this would require no manually defined ground-truth events, as long as a reasonable fraction of stations were operational.
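A minimal sketch of the station-level self-diagnosis idea follows, assuming an hourly background-noise level tracked with an exponentially weighted moving average (EWMA) and flagged when it drifts from its nominal calibrated value. The thresholds and the injected drift are illustrative assumptions.

```python
# Hedged sketch: flag a station whose background noise level drifts away
# from its assumed calibrated value (1.0 in these synthetic units).
import numpy as np

rng = np.random.default_rng(3)
hours = 500
noise_rms = 1.0 + 0.02 * rng.normal(size=hours)
noise_rms[300:] += np.linspace(0.0, 0.8, hours - 300)   # simulated gain drift

alpha, tol = 0.05, 0.15          # EWMA smoothing factor and tolerance band
baseline = noise_rms[0]
for hour, level in enumerate(noise_rms):
    baseline = (1 - alpha) * baseline + alpha * level
    # Alarm on a sudden jump relative to the EWMA, or on slow drift of the
    # EWMA itself away from the nominal calibrated level.
    if abs(level - baseline) > tol or abs(baseline - 1.0) > tol:
        print(f"hour {hour}: possible drift, level={level:.2f}, ewma={baseline:.2f}")
        break
```

Such a flag could both annotate the data sent downstream with a station-level uncertainty and trigger a recalibration request at the collection node.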

The views expressed here do not necessarily reflect the opinion of the United States Government, the United States Department of Energy, or Lawrence Livermore National Laboratory.

vDEC will champion and harness innovations in data exploitation to improve the effectiveness of CTBT verification.

vDEC: Virtual Data Exploitation Center for the CTBTO

The vDEC concept (figure) groups its elements into Resources (Data Archives, Analysts and Experts, Processing Infrastructure), Methods (Signal Processing, Statistical Inferencing, Physics-Based Modeling), and Evaluation (Performance Metrics, Validation & Verification). Each element is described below.

Data Archives: Large volumes of data are currently contained in independent repositories and locally analyzed. These will be used to benchmark vDEC. In addition, approaches will be evolved to enhance IDC data search based upon pattern attributes, the increasing rate of data collection, the timeliness of information required, and the number of simultaneous users.

Analysts and Experts: In addition to local analysts and experts, participation from a multidisciplinary scientific community will provide key insights for interpreting the data and designing new methods and models for data processing.

Processing Infrastructure: To evolve prototype solutions to the large-scale processing challenge, vDEC will maintain a development and test environment for proving in next-generation capabilities.

Signal Processing: Methods developed will include refinements to the existing signal processing pipeline, signal correlation techniques, continuous local sensor self-calibration with event feedback, and cooperative sensor processing to eliminate outliers.
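As an illustration of the signal-correlation element, the hedged sketch below slides a template waveform along a synthetic trace and reports the best normalized cross-correlation; the sample rate, wavelet, and noise level are assumed for the example.

```python
# Hedged sketch: template matching by normalized cross-correlation to flag
# a repeating signal buried in noise. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
fs = 40.0                                   # assumed sample rate, Hz
t = np.arange(0, 2, 1 / fs)
template = np.sin(2 * np.pi * 5 * t) * np.exp(-3 * t)   # toy 2 s wavelet

trace = 0.1 * rng.normal(size=int(120 * fs))            # 2 min of noise
onset = int(45 * fs)
trace[onset:onset + template.size] += template          # bury one copy at 45 s

# Sliding Pearson correlation of the template against every trace window.
nwin = trace.size - template.size + 1
tmpl = (template - template.mean()) / (template.std() * template.size)
cc = np.empty(nwin)
for i in range(nwin):
    win = trace[i:i + template.size]
    cc[i] = np.dot(tmpl, (win - win.mean()) / max(win.std(), 1e-12))

best = int(np.argmax(cc))
print(f"best match at {best / fs:.1f} s, cc = {cc[best]:.2f}")   # expect ~45 s, high cc
```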

Statistical Inferencing: State-of-the-art statistical classification and inferencing methods coupled with multi-modal data fusion have not yet been adequately exploited for the IDC/OSI communities. They present attractive opportunities for enhancing robustness and speed to solution, and they complement existing model-driven approaches.
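To make the classification element concrete, the sketch below trains a probabilistic phase classifier in the style discussed under the technical portfolio. The features, labels, and data are synthetic placeholders, so the reported accuracy is meaningless here; only the workflow (train on analyst-reviewed labels, emit class probabilities for downstream fusion) is of interest.

```python
# Hedged sketch: a probabilistic phase classifier trained on features from
# past analyst-reviewed bulletins. Feature names and data are illustrative
# assumptions, not the IDC schema.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Assume each historical detection carries a few waveform-derived features
# (e.g. dominant period, amplitude ratio, slowness) and an analyst label.
n = 3000
features = rng.normal(size=(n, 3))
labels = rng.choice(["P", "S", "noise"], size=n)   # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Probabilistic output: class membership probabilities per detection, which a
# downstream association or fusion stage could weigh instead of a hard label.
print(clf.predict_proba(X_test[:5]))
print("held-out accuracy:", clf.score(X_test, y_test))   # ~chance on random labels
```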

Physics-Based Modeling: More accurate earth models can not only enhance the fidelity of event location but also assist in providing the training data for more accurate sensor models and downstream knowledge discovery.

Performance Metrics: vDEC will define a set of metrics to evaluate the performance of existing and new methods for meeting the stated objectives. These metrics will not only indicate the relative merits of competing approaches, but also set a baseline for sensitivity analysis to provide feedback for upstream data collection and processing.
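One candidate metric could be framed as precision and recall of the automated bulletin against the analyst-reviewed bulletin, as in the hedged sketch below; the toy bulletins and the 10 s matching tolerance are assumptions.

```python
# Hedged sketch: precision/recall of an automated event bulletin against the
# reviewed bulletin, matching events by origin-time proximity only.
reviewed = [100.0, 250.0, 400.0, 900.0]        # reviewed origin times, s
automated = [101.5, 255.0, 398.0, 600.0, 905.0]
tol = 10.0                                     # assumed match tolerance, s

matched = set()
true_pos = 0
for t_auto in automated:
    hit = next((i for i, t_rev in enumerate(reviewed)
                if i not in matched and abs(t_auto - t_rev) <= tol), None)
    if hit is not None:
        matched.add(hit)
        true_pos += 1

precision = true_pos / len(automated)          # fraction of automated events that are real
recall = true_pos / len(reviewed)              # fraction of reviewed events recovered
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A fuller metric would also match on location and magnitude and weight low-magnitude events more heavily, since those are the cases where automation currently falls short.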

Validation & Verification: The techniques and tools developed will conform to current standards and data formats and will undergo extensive parametric validation and verification testing prior to prototyping.

International Monitoring System (IMS) Sensor Network: The IMS sensor data volume is growing rapidly. Exploiting this valuable repository of historical data to extract better sensor models, earth models, phase classifiers, and association algorithms could prove valuable for detecting low-threshold events. Advanced data mining techniques applied to such a database could not only improve the accuracy and efficiency of event characterization, but could also guide future collection strategies. Furthermore, some of the more advanced data analysis approaches under consideration could result in a complete redesign of the processing environment to provide much higher fidelity in real time.

IDC/OSI Objectives: The International Data Center (IDC) and On-Site Inspection (OSI) teams verify compliance with the Comprehensive Test Ban Treaty (CTBT). The IDC generates event bulletins and strives to detect and localize events at even lower thresholds, with greater accuracy, and less manual intervention. The OSI teams investigate potential violations of the Treaty and must collect and analyze multi-modal data samples efficiently. These objectives motivate the development of new data processing paradigms as incorporated in vDEC above.

Data Centres (figure): IDC, NDCs, USGS, ISC.