ODSC and iRODS
![Page 1: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/1.jpg)
Open Data Science Conference and iRODS User Group Meeting
Raminder Singh
Research Data Services
Research Technologies, Indiana University
July 7th, 2016
![Page 2: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/2.jpg)
ODSC East 2016
https://www.odsc.com/boston
![Page 3: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/3.jpg)
Technologies Discussed
• Julia is a high-level, high-performance dynamic programming language for technical computing with familiar syntax. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library.
• Stan is a platform for statistical modeling, data analysis, and prediction in the social, biological, and physical sciences, engineering, and business.
• Scikit-learn is a Python library with classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN. It is designed to interoperate with other libraries such as NumPy and SciPy.
• Apache Spark provides an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.
• Apache Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
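The Map/Reduce paradigm described for Hadoop can be illustrated with a small word count in plain Python. This is only a single-machine sketch of the idea, not Hadoop's API; the function names `map_fragment`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical and chosen for this example.

```python
from collections import defaultdict


def map_fragment(fragment):
    """Map step: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(fragments):
    # Each fragment is independent, so a cluster could run (or re-run)
    # map_fragment on any node. Here the fragments are processed in a loop.
    pairs = []
    for fragment in fragments:
        pairs.extend(map_fragment(fragment))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["data science data", "open data"]))
```

Because the map step never looks at more than one fragment, a failed fragment can simply be mapped again elsewhere, which is the property the slide refers to as "executed or re-executed on any node".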
![Page 4: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/4.jpg)
Keynote Speakers
![Page 5: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/5.jpg)
About Companies of Keynote Speakers
• Booz Allen Hamilton: Core business is the provision of management, technology, and security services to civilian government agencies. http://www.boozallen.com/datascience
• RapidMiner: Integrated environment for machine learning, data mining, text mining, predictive analytics, and business analytics. https://rapidminer.com/
• CrowdFlower: Data enrichment and data mining offered as Software as a Service. https://www.crowdflower.com/
![Page 6: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/6.jpg)
Other Interesting Speakers
![Page 7: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/7.jpg)
Topics for Training Workshops
• Using R for Data Analytics
– https://github.com/zachmayer/forecast
• Building a Real-time Recommender System with Spark ML, Kafka, and the PANCAKE STACK
– http://advancedspark.com/
• Analyzing Open Data in Healthcare using Public APIs and Reproducible Workflows
– https://github.com/jhajagos/health-open-data-workshop
![Page 8: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/8.jpg)
List of Good Talks Available Online
• Kirk Borne – “2 Most Important Things in Data Science”
– https://www.opendatascience.com/conferences/odsc-east-2016-kirk-borne-the-2-most-important-things-in-data-science/
• Experiment
• Data collection
• Tomorrow’s Map Room: Data Portals
– https://www.opendatascience.com/blog/tomorrows-map-room-data-portals/
• Interactive Data Visualizations in R with Shiny and ggplot2
– https://www.opendatascience.com/conferences/odsc-east-2016-joe-cheng-zev-ross-interactive-data-visualizations-in-r-with-shiny-and-ggplot2/
• Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. It is comparable to Shiny in R or D3 in JavaScript. http://bokeh.pydata.org
– https://www.opendatascience.com/conferences/odsc-east-2016-peter-wang-interactive-viz-of-a-billion-points-with-bokeh-datashader/
• Exaptive Xap Store is an 'app store' for data applications. They are standardizing a set of libraries to be used to create networks. http://www.exaptive.com/data-application-gallery
![Page 9: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/9.jpg)
![Page 10: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/10.jpg)
Objectives for Attending
• iRODS features and architecture
• User Community
• Use Cases and Solutions built over iRODS
• Future development and directions
Questions
• Can I write rules in other languages?
• Is it possible to attach iRODS to existing storage?
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical policy recommendations?
![Page 11: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/11.jpg)
![Page 12: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/12.jpg)
![Page 13: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/13.jpg)
iRODS Implements Four Main Functions
Data Virtualization: iRODS provides a logical representation of files stored in physical storage locations. We call this logical view a virtual file system, and the capabilities it provides Data Virtualization.
Data Discovery: Information about data, called metadata, is extremely useful for Data Discovery, that is, locating relevant data within large data sets.
Workflow Automation: Once data is stored and available in the catalog, it often needs to be migrated, secured, or otherwise processed. The iRODS rule engine automates such steps.
Secure Collaboration: Data is most useful when it’s in the hands of the right people. There is a recognized need in the public research community to publish data sets that accompany written articles.
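The first two functions can be pictured with a toy in-memory catalog. This is a minimal sketch for illustration only and is not the iRODS API: the `ToyCatalog` and `DataObject` classes, the paths, and the metadata attributes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    logical_path: str                              # the name users see
    replicas: list                                 # physical locations of the bytes
    metadata: dict = field(default_factory=dict)   # attribute -> value


class ToyCatalog:
    """In-memory stand-in for a catalog (hypothetical, not iRODS code)."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_path, physical_path, **metadata):
        """Record one physical copy and optional metadata under a logical path."""
        obj = self._objects.setdefault(logical_path, DataObject(logical_path, []))
        obj.replicas.append(physical_path)
        obj.metadata.update(metadata)
        return obj

    def resolve(self, logical_path):
        """Data virtualization: map a logical name to its physical locations."""
        return list(self._objects[logical_path].replicas)

    def find(self, **criteria):
        """Data discovery: logical paths whose metadata matches every criterion."""
        return sorted(
            path
            for path, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


if __name__ == "__main__":
    catalog = ToyCatalog()
    catalog.register("/zone/home/alice/run1.csv", "/mnt/disk1/a93f.csv", project="genomics")
    catalog.register("/zone/home/alice/run1.csv", "s3://bucket/a93f.csv")
    catalog.register("/zone/home/bob/notes.txt", "/mnt/disk2/77b1.txt", project="survey")
    print(catalog.resolve("/zone/home/alice/run1.csv"))
    print(catalog.find(project="genomics"))
```

Users only ever handle the logical path, so the physical copies can move or multiply without changing how the data is named or found.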
![Page 14: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/14.jpg)
![Page 15: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/15.jpg)
![Page 16: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/16.jpg)
![Page 17: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/17.jpg)
EMC² Case of Adaptive Hierarchical Metadata Using Metalnx
![Page 18: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/18.jpg)
![Page 19: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/19.jpg)
Getting R to talk to iRODS
Bernhard Sonderegger, Nestlé Institute of Health Sciences
• The R language is an environment with a large and highly active user community in the field of data science. At NIHS we have developed the R-irods package, which allows user-friendly access to iRODS data objects and metadata from the R language. Information is passed to the R functions as native R objects (e.g. data frames) to facilitate integration with existing R code and to allow data access using standard R constructs.
• To maximize performance and maintain a simple architecture, the implementation relies heavily on the iCommands C++ code wrapped using Rcpp bindings.
• The R-irods package has been engineered to have semantics equivalent to the iCommands and can easily be used as a basis for further customization. At NIHS we have created an ontology-aware package on top of R-irods to ensure consistent metadata annotations and to facilitate query construction.
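The slides do not show the R-irods API, so the sketch below does not try to reproduce it. It only illustrates, in Python, what an ontology-aware layer might do: check annotations against a controlled vocabulary and build metadata queries from validated terms. The vocabulary, the function names, and the choice of an iRODS general-query style string are assumptions made for this example.

```python
# Hypothetical controlled vocabulary: attribute name -> allowed values.
VOCABULARY = {
    "tissue": {"liver", "muscle", "plasma"},
    "organism": {"human", "mouse"},
}


def validate_annotation(attribute, value, vocabulary=VOCABULARY):
    """Reject attribute/value pairs that are not part of the vocabulary."""
    if attribute not in vocabulary:
        raise ValueError(f"unknown attribute: {attribute!r}")
    if value not in vocabulary[attribute]:
        raise ValueError(f"{value!r} is not an allowed value for {attribute!r}")
    return attribute, value


def build_query(attribute, value, vocabulary=VOCABULARY):
    """Build a metadata query string using only validated terms."""
    attribute, value = validate_annotation(attribute, value, vocabulary)
    return (
        "SELECT COLL_NAME, DATA_NAME "
        f"WHERE META_DATA_ATTR_NAME = '{attribute}' "
        f"AND META_DATA_ATTR_VALUE = '{value}'"
    )


if __name__ == "__main__":
    print(build_query("tissue", "liver"))
```

Routing both annotation and query construction through the same vocabulary check is one way to keep metadata consistent: a misspelled or differently capitalized term fails early instead of silently producing an empty result.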
![Page 20: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/20.jpg)
![Page 21: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/21.jpg)
![Page 22: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/22.jpg)
![Page 23: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/23.jpg)
Review
Questions
• Can I write rules in other languages?
– YES
• Is it possible to attach iRODS to existing storage?
– YES. There are tools to load the data.
• What does it take to implement data policy rules for Research Data Alliance (RDA) practical policy recommendations?
– The policy workbook at https://github.com/DICE-UNC/policy-workbook is a reference implementation of the RDA recommendations. It needs some work to update and test the rules with the latest version of iRODS.
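As an illustration of the first answer, here is a minimal sketch of a rule written in Python. The slides do not say which mechanism was meant; this assumes the iRODS Python rule engine plugin, where rules are plain functions that take `rule_args`, `callback`, and `rei`. The `FakeCallback` class is hypothetical and exists only so the rule can be exercised outside an iRODS server.

```python
def acPostProcForPut(rule_args, callback, rei):
    """Policy hook intended to run after a data object has been uploaded."""
    # Ask the server to write one line to its log.
    callback.writeLine("serverLog", "acPostProcForPut fired from a Python rule")


class FakeCallback:
    """Hypothetical stand-in that records writeLine calls instead of logging."""

    def __init__(self):
        self.lines = []

    def writeLine(self, stream, message):
        self.lines.append((stream, message))


if __name__ == "__main__":
    cb = FakeCallback()
    acPostProcForPut([], cb, None)
    print(cb.lines)
```

On a real server the plugin would supply `callback` and `rei`. Check the plugin documentation for the installed iRODS version before relying on this shape.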
![Page 24: ODSC and iRODS](https://reader033.fdocuments.us/reader033/viewer/2022042618/589c72a31a28abe96c8b68b7/html5/thumbnails/24.jpg)
iRODS User Group Meeting notes and slides
• http://irods.org/documentation/articles/irods-user-group-meeting-2016/ - Use Case slides
• http://irods.org/wp-content/uploads/2016/06/technical-overview-2016-web.pdf - Tech report
• http://slides.com/irods/ - Workshop slides
• https://github.com/DICE-UNC/policy-workbook - RDA policies implementation
• http://www.cyverse.org/ - iRODS as a service
• http://irods.org/documentation/articles/ - Other articles
• http://www.odum.unc.edu/
• http://datafed.org/about/use-cases/
• http://renci.org/news/virtual-institute-for-social-research/