GAIA Tech1 Data Repositories Meeting
Ingrid Bàrcena, HPC and Storage services manager
Ricard de la Vega, Portals and Repositories manager
GAIA Tech1 meeting
Madrid, May 24, 2011
Outline
1. What is CESCA?
2. CESCA services
  • HPC and Storage
  • Network
  • University e-Administration
  • Portals and Repositories
3. Digital Repositories
  • Overview
  • Two examples: DSpace and web archiving
  • Long term preservation
4. CESCA and GAIA
  • What is done
  • What could be done
Centre de Supercomputació de Catalunya
• Patrons:
  • Generalitat de Catalunya
  • Fundació Catalana per a la Recerca i la Innovació
  • Universitat de Barcelona
  • Universitat Autònoma de Barcelona
  • Universitat Politècnica de Catalunya
  • Universitat Pompeu Fabra
  • Universitat de Girona
  • Universitat Rovira i Virgili
  • Universitat de Lleida
  • Universitat Oberta de Catalunya
  • Universitat Ramon Llull
  • Consell Superior d’Investigacions Científiques
• Public consortium created in 1991
• ICTS since 2000
Our Services
HPC and Storage
19.48 Tflop/s peak performance
50 research projects (203 users)
Main areas:
• Materials Science (31%)
• Life Science (32%)
• Environmental Science (28%)
• Astronomy and Astrophysics (5%)
+ 3.5 HC used during 2010
+ 50 scientific applications available
Disk Library
NetApp FAS3170
150 TB
21 TB FC drives
126 TB SATA drives
6 Pharma Labs
10 Academic research groups
HPC Service
Storage Service
Tape Library
ADIC i2000
156 TB
6 LTO-4 drives
300 slots
NetBackup 6.5
2 Software Packages
Drug Design Service
Network services
+80 connected institutions
2 core nodes at 10 Gbps
Flexible bandwidth
Services: IPv6, multimedia, Remote Access Service, Voice over IP, eduroam, security...
21 institutions in Catalonia
40 countries
24 ISP and operators
Services: Multicast, IPv6, NTP Server, F root server (A and J,
.com and .net coming soon)...
University e-Administration Projects
e-Register
• URV: production 02-01-11
• UdL: production 03-14-11
• Sadiel: 32.692 €
e-Vote
• Bid price : 405.000 €
• Awarded (03-18-10): Scytl, 345.000 €
• Production: 02-01-11
SCD (e-Identity and e-Signature)
• Available:
EC-UR and EC-URV
ER-CESCA, -URV, -UPC, -UdL, -UPF
• In development: ER-UdG, ER-UB, ER-UAB and ER-UVic
GPI
Improvements (02-03-11)
• Inteum Sentinel and Technology Publisher
• Office 2007; separate VMs per university; e-mail sending
Licence renewal: UB and UPC
Investment: 1.046,97 €
e-Archive
• Transfer agreement: 12-7-10
• Inst. ATLAS: 17.800 €
• Integr. Doc. Mgt: awarded to IECI, 51.920 € (02-12-11)
• Production: 06-01-11
Cluster: 15 BL460c G6 (2 x Intel Xeon E5530 QC); 480 GB; 4.3 TB;
Citrix XenServer; 2 F5 BIG-IP 1600 load balancers; 110.487 €
Data layer
…
F5 BIG-IP load balancers
31-03-11
Portals and Repositories
Since 2001
18 universities
10,577 doctoral theses
www.tdx.cat
Since 2005
22 institutions
24,564 research papers, eprints…
www.recercat.cat
Since 2006
328 journals
129,235 articles
www.raco.cat
Since 2009
10 universities
1,814 learning objects
www.mdx.cat
Since 2006
39,587 websites crawled
118,039 versions crawled
249M files in 7.5 TB
www.padicat.cat
Since 2010
22 institutions
24,564 research papers, eprints…
www.recercat.cat
Since 2006
Turnkey development
Evolutionary maintenance
http://recyt.fecyt.es
Pilot 2009-10
420 websites crawled
790 versions crawled
http://recyt.fecyt.es
(restricted IP address)
Digital Repositories
• A repository captures, stores, indexes, preserves and distributes digital content.
• Data + Metadata
  • Dublin Core (DC)
  • METS, MODS, MARC21…
  • VO?
  • Astronomical?
• Main issues
  • Access (search / browse)
  • Preservation
  • Interoperability
    – Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), based on Dublin Core metadata
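As an illustration of the OAI-PMH interoperability point, here is a minimal Python sketch that builds a ListRecords request for Dublin Core metadata (metadataPrefix oai_dc) and reads identifiers and titles from a response. The base URL, the sample record and the function names are made up for the example; the sketch makes no network call and ignores resumption tokens and error responses.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (base_url is hypothetical)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def extract_titles(xml_text):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


# Hand-written sample response, not taken from any real repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample thesis</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    print(extract_titles(SAMPLE_RESPONSE))
```

A real harvester would fetch the URL, then repeat the request with the resumptionToken returned by the server until the list is complete.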
Repositories taxonomy
Towards a European e-Infrastructure for e-Science Digital Repositories. 7th e-Concentration Meeting, Brussels, 12-14th October, 2009
Repositories Hardware
• High availability
• Load balancing
• Easy scalability
• 24x7 monitoring
[Diagram layers: Balancers; Services; Data; Storage Area Network (disk, tape)]
Repositories Software
• For general purpose
  • DSpace, EPrints, Fedora, Islandora…
  • Implemented in: (logos in the original slide)
• For journal management
  • Open Journal Systems (OJS)
  • Implemented in: (logos in the original slide)
• For web archives preservation
  • Heritrix, NutchWAX, WERA, Wayback, Webcurator…
  • Implemented in: (logos in the original slide)
Example of a general purpose repository (DSpace)
• For digital objects, like PDFs, images, videos, data…
• Indexes metadata and PDFs for searching
Example of a web archive (PADICAT)
• PADICAT consists of collecting, processing and providing permanent access to the entire cultural, scientific and general output of Catalonia in digital format. It is the Catalan web sites archive.
               PANDORA        UK ARCHIVE     IA             VEFSAFN        BNF            Kulturarw3     Netarchive
Scope          Australia      UK             World          Iceland        France         Sweden         Denmark
Begin          1996           2004           1996           2004           2002           1996           2005
Open access    ?              ?              ?              ? (since 2009) ?              ?              ?
Search by URL  ?              ?              ?              ?              ?              ?              ?
S. by keyword  ?              ?              ?              ?              ?              ?              ?
Directory      ?              ?              ?              ?              ?              ?              ?
N. websites    26,630         8,308          -              -              -              -              > 1.1 million
N. crawls      60,276         32,618         150 billion    -              -              -              4.5 billion
Space          4.63 TB        7.59 TB        -              -              180 TB         -              155 TB
Date           16-12-2010     12-01-2011     13-12-2011     13-01-2011     13-01-2011     26-11-2010     08-2010
(? = yes/no mark not recoverable from the slide)
- Open Access
- Search by URL and keyword
- Catalogue and thematic directory
www.padicat.cat
Since 2006
- 39,587 websites crawled
- 118,039 versions crawled
- 249M files in 7.5 TB
Web archive software architecture
[Architecture diagram]
1. Harvest: Web Curator Tool, Heritrix, ARC files
2. Index and search: ArcIndexer (index for URL searching, Wayback); Hadoop + NutchWAX (index for keyword searching, WERA)
3. Catalogue and browse: catalog database (crawl metadata)
PADICAT’s indexes
• Until now (< 100,000 website versions crawled)
  • For search by URL (like Internet Archive)
    – Index with ArcIndexer (~100 GB) + visualize with Wayback √
  • For search by keyword
    – Index with Hadoop+NutchWAX + visualize with WERA √
• Now (120,000 website versions crawled)
  • Performance problems for keyword indexing
  • Two solutions under evaluation:
    – Index with a new version of NutchWAX + visualize with TNH (The New Hotness, from IA)
    – Index with JB (James Brown, from IA) + visualize with TNH
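To make the idea of a "search by URL" index concrete, the following Python sketch keeps, for each normalised URL, a sorted list of capture timestamps and answers "which versions exist" and "which version was current at a given date". It is only an in-memory illustration: the class, the normalisation rules and the timestamp format are assumptions for the example, not the index format used by ArcIndexer or Wayback.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from urllib.parse import urlsplit


def url_key(url):
    """Normalise a URL so that trivially different spellings share one key.

    Assumed rules for this sketch: ignore the scheme, lower-case the host,
    drop a leading 'www.' and a trailing slash.
    """
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


class UrlIndex:
    """Maps each normalised URL to its capture timestamps, kept sorted."""

    def __init__(self):
        self._captures = defaultdict(list)

    def add(self, url, timestamp):
        """Register a capture; timestamps are 'YYYYMMDDhhmmss' strings,
        so lexicographic order equals chronological order."""
        captures = self._captures[url_key(url)]
        position = bisect_left(captures, timestamp)
        if position == len(captures) or captures[position] != timestamp:
            captures.insert(position, timestamp)

    def versions(self, url):
        """All capture timestamps known for this URL, oldest first."""
        return list(self._captures.get(url_key(url), []))

    def latest_before(self, url, timestamp):
        """Most recent capture at or before the timestamp, or None."""
        captures = self._captures.get(url_key(url), [])
        position = bisect_right(captures, timestamp)
        return captures[position - 1] if position else None


if __name__ == "__main__":
    index = UrlIndex()
    index.add("http://www.example.cat/", "20100101000000")
    index.add("example.cat", "20060501120000")
    print(index.versions("http://example.cat"))
    print(index.latest_before("example.cat", "20091231235959"))
```

Keyword search needs a full-text index over page content, which is far larger than a URL index of this kind; that is consistent with keyword indexing being the part that ran into performance problems.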
Long term preservation
• The e-infrastructure must ensure long term access to the data, without failure.
• To succeed, the following must be taken into account:
  • Replication (more than one copy)
  • Media refresh
  • Format migration
  • Data integrity (checksums)
  • Contingency and recovery plan
  • Preservation plan
  • ...
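The data integrity item can be sketched as a periodic fixity check: record a checksum for every file once, then recompute and compare later. The Python sketch below is an assumption-laden illustration (SHA-256 as the algorithm, a plain dictionary as the manifest, hypothetical function names); it does not describe how any particular repository software stores its checksums.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(directory, manifest):
    """Return the relative paths that are missing or whose content changed."""
    root = Path(directory)
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged


def demo():
    """Create one file, record its checksum, corrupt it, and verify twice."""
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "thesis.pdf"  # hypothetical file name
        sample.write_bytes(b"original content")
        manifest = build_manifest(tmp)
        before = verify(tmp, manifest)
        sample.write_bytes(b"corrupted content")
        after = verify(tmp, manifest)
        return before, after


if __name__ == "__main__":
    print(demo())
```

A file reported by verify would then be restored from one of the other copies, which is why checksums and replication appear together in the list.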
An example of long term preservation
The “preservation history” of TDX (doctoral theses)…
• 2001 – 80 GB, 8,000 access hits
  • SW: ETD-db (+ MySQL, Glimpse…) from Virginia Tech
  • HW: HP V2500 with 16 processors, 4 GB memory, 227 GB disk
  • HW: StorageTek TimberWolf 9740 with 2.7 TB of 9840 tapes
Born in a supercomputer!
• Hardware migrations
  • 2003 (cpu + disk)
    – HP rp5430 with 2 processors, 704 GB memory
    – HP EVA V.2 with 2.8 TB disk
  • 2006 (cpu + tape)
    – High availability HP cluster with 32 ProLiant DL360 nodes
    – Adic Scalar i2000 (from 9840 tapes to LTO3 tapes)
  • 2009 (disk)
    – NetApp FAS3170 with 60 TB disk
• Software migrations
  • 2010 – DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
• Replication
  • On disk: online version (1)
  • One backup on the tape library (2)
  • Another backup in a fireproof cabinet (3)
  • Another backup at a remote centre 50 km away (4)
  • A dark copy in the MetaArchive Cooperative
    – Private LOCKSS (Lots of Copies Keep Stuff Safe) network
    – 10 more copies around the world (14)
• Data Integrity
  • Checksums on DSpace (online version)
  • Checksums on LOCKSS (dark copies)
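Replication and checksums work together: with several copies, a damaged one can be spotted because its checksum disagrees with the others. The Python sketch below shows that idea as a simple majority vote over replica checksums. It is a deliberate simplification for illustration and is not the LOCKSS polling protocol; the replica names and the function are hypothetical.

```python
from collections import Counter
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 hex digest of a replica's content (algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(replicas):
    """Compare replicas by checksum.

    replicas maps a replica name to its content. Returns the checksum held
    by a strict majority and the sorted names of replicas that disagree.
    """
    if not replicas:
        raise ValueError("no replicas to audit")
    digests = {name: checksum(content) for name, content in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        raise ValueError("no majority among replicas; manual inspection needed")
    to_repair = sorted(name for name, digest in digests.items() if digest != winner)
    return winner, to_repair


if __name__ == "__main__":
    # Four copies named after the slide's list; contents are made up.
    copies = {
        "online": b"thesis",
        "tape": b"thesis",
        "cabinet": b"thesis",
        "remote": b"thesi5",  # simulated corruption
    }
    print(audit_replicas(copies)[1])
```

With more copies, more simultaneous failures can be tolerated before the majority is lost, which is the argument for the additional dark copies.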
• 2011 – 300 GB, more than 3.5 million access hits
  • SW: DSpace (+ PostgreSQL, Java, Solr, …) from MIT & HP Labs
  • HW: High availability HP cluster with 32 ProLiant DL360 nodes
  • HW: NetApp FAS3170 with 60 TB disk
  • HW: Adic Scalar i2000
  • SW: LOCKSS (+ Conspectus...)
  • HW: HP DL380 (LOCKSS cache)
• xxxx – …
www.tdx.cat
GAIA at CESCA: what is done
[Timeline figure, 2000 to 2011. Labels: Data Processing; IDT/IDU; Storage; Database; GDASS/COG; Backup; Data processing; Database]
GAIA and CESCA: what could be done
Preservation: dark copy, …
Data Repository
Large data transfer
Powerful searches and interoperability
Storage and Backup