OR, BIG DATA IN HUNGARY - ARCHIVING AND MINING THE ACADEMIC WEB
GEORGE KAMPIS, CEO, PETABYTE NONPROFIT RESEARCH LTD.
INNOVATION ACCELERATION BY PUBLIC DATA ANALYSIS
PETABYTE NONPROFIT RESEARCH LTD.
www.petabyte-research.org
www.hungarianscience.org
www.textrend.org
www.dynanets.org
www.futurict.szte.hu
WHAT WE DO IN FUTURICT.HU
• In the context of scientific research and higher education (in particular, in Hungary):
• Investment and return („ROI“) analysis
  • „science of success“
• Structural analysis of institutions
  • www.hungarianscience.org
• http://www.oktatas.hu/felsooktatas/projektek/tamop721_eszafejl/projekthirek/hazai_tudomanymetriai_felmeres
• New forms of publication (e.g. data sharing in papers)
CONTEXT IN FUTURICT
• „Innovation Accelerator“
• .. to help (scientific) innovation with [...] social media as well as data services
Helbing, D., & Balietti, S. (2011). How to create an innovation accelerator. The European Physical Journal Special Topics, 195(1), 101-136.
Leydesdorff, L., Rotolo, D., & De Nooy, W. (2012). Innovation as a Nonlinear Process, the Scientometric Perspective, and the Specification of an Innovation Opportunities Explorer. Technology Analysis & Strategic Management (Forthcoming).
van Harmelen, F., Kampis, G., Börner, K., van den Besselaar, P., Schultes, E., Goble, C., ... & Helbing, D. (2012). Theoretical and technological building blocks for an innovation accelerator. The European Physical Journal Special Topics, 214(1), 183-214.
„BIG DATA“
PARTIALLY SIMILAR DEVELOPMENTS
• Mendeley
  • Reference manager and collaboration network
• ResearchGate
  • Research network and publications portal w/ quality assessment
• Altmetrics
  • Article-level online metrics
• VIVO
  • Connect, share, discover
BIG (WEB) DATA IS A KEY
• Big Data in Google trends
• „deep data“
  • controversy...
• Massive Web Data: harvesting / archiving
  • Google itself...
  • The Internet Archive
  • UK web archive, British Library
WEB ARCHIVING IN HUNGARY
• None. Nope.
• „MIA“ (Magyar Internet Archivum, HU Internet Archive)
  • Various documents, plans and small-scale pilots
  • Since 2006
• Our ambition: to archive and mine HU academia = „HUA“
  • 500 NIIF institutions (NIIF = National Information Infrastructure Development)
  • 42 HAS (Hungarian Academy of Sciences) research institutes
  • 47 higher education entities (universities and polytechnics)
• Now in collaboration with: OSZK (National Library), NIIF...
A RUNNING „HUA“ PILOT IN PETABYTE/FUTURICT.HU
• Hardware: Dell T710 server (2x4 core Xeon E5520, 48GB RAM, 2TB HDD)
• Software: Heritrix crawlers called from API and CURL, spawned from timed scripts...
• Not downloaded: exe, gz, iso, jar, mp3, ogg, ppt, rar, wav, xls, xlsx, zip
• Many technical issues: Flash pages, portlet containers (e.g. WebSphere), CMSs (e.g. Joomla)...
• In operation since April 2013.
• Longitudinal archiving in mirror format (2-weekly periods), using a form of „diff“ developed in-house
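The slides do not show how the crawl filter or the in-house „diff“ work, so the following is only a minimal sketch under stated assumptions. It takes the excluded extensions from the list above and compares two mirror snapshots by content hash. The function names (`is_excluded`, `fingerprint`, `diff_snapshots`) and the `{url: hash}` snapshot layout are hypothetical and are not the pilot's actual implementation.

```python
import hashlib
from urllib.parse import urlparse

# Extensions listed on the slide as "not downloaded".
EXCLUDED_EXTENSIONS = {
    "exe", "gz", "iso", "jar", "mp3", "ogg",
    "ppt", "rar", "wav", "xls", "xlsx", "zip",
}


def is_excluded(url: str) -> bool:
    """Return True if the last path segment ends in an excluded extension."""
    path = urlparse(url).path.lower()
    filename = path.rsplit("/", 1)[-1]
    if "." not in filename:
        return False
    return filename.rsplit(".", 1)[-1] in EXCLUDED_EXTENSIONS


def fingerprint(content: bytes) -> str:
    """Hash page content so snapshots can be compared without storing both."""
    return hashlib.sha256(content).hexdigest()


def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two {url: fingerprint} mirrors from consecutive periods.

    Assumption: a page counts as changed when its content hash differs.
    """
    old_urls, new_urls = set(old), set(new)
    return {
        "added": sorted(new_urls - old_urls),
        "removed": sorted(old_urls - new_urls),
        "changed": sorted(u for u in old_urls & new_urls if old[u] != new[u]),
    }
```

With a diff of this kind, only the added and changed pages would need to be stored for each 2-weekly period.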
THE PROCESSING OF RESULTS
• Future plans: keyword extraction, timed (dynamic) keyword nets, correlation with support programs and grant calls (to analyze ROI in publications, citations, ...terms)
• „The Science of Success“ (A.-L. Barabási)
  • http://www.eccs13.eu/index.php/satellites
  • http://barabasilab.com/success/
  • http://www.facebook.com/SuccessScience
• Bottleneck: availability of public funding data; open data initiatives need to be enforced
• In this pilot phase: basic statistics, turnover rates etc.
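The planned keyword extraction and timed keyword nets are not specified further in the slides. As one possible reading, the sketch below picks the most frequent non-stopword terms per document and counts keyword co-occurrences per archiving period. The stopword list, the frequency-based extraction, and all function names are assumptions made for illustration only.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (assumption; a real one would be
# language-specific and much longer).
STOPWORDS = {"the", "and", "of", "in", "to", "a", "for", "is", "on", "with"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms of a document."""
    # [^\W\d_]+ matches runs of letters, including accented Hungarian ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def keyword_net(documents, top_n=5):
    """Count how often two keywords co-occur in the same document."""
    edges = Counter()
    for text in documents:
        keywords = sorted(set(extract_keywords(text, top_n)))
        for pair in combinations(keywords, 2):
            edges[pair] += 1
    return edges


def timed_keyword_nets(snapshots, top_n=5):
    """Build one keyword net per archiving period: {period: edge counts}."""
    return {
        period: keyword_net(docs, top_n)
        for period, docs in snapshots.items()
    }
```

Comparing the edge counts of consecutive periods would then show which keyword pairs appear or disappear over time, for example around a grant call.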
QUICK RESULTS, BASIC STATS
• All 89 HU academic institutions: 86GB total (text 42GB)
• Rank distributions (total)
[Figure: rank distributions of total size; panels: HAS, Higher Ed.]
QUICK RESULTS, BASIC STATS 2.
• Rank distributions (text, i.e. html, doc, docx, rtf, pdf, ps)
[Figure: rank distributions of text size; panels: HAS, Higher Ed.]
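A rank distribution of this kind can be computed by sorting the sites by archived size and numbering them from the largest down. The sketch below assumes the sizes are available as a `{site: size}` mapping. The function names are hypothetical, and the log-log transform is only one common way to plot such data, not necessarily the one used for the figures.

```python
import math


def rank_distribution(sizes):
    """Return (rank, site, size) tuples, largest site first (rank 1)."""
    ordered = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, site, size)
        for rank, (site, size) in enumerate(ordered, start=1)
    ]


def log_log_points(sizes):
    """Return (log10 rank, log10 size) pairs for a log-log rank plot."""
    return [
        (math.log10(rank), math.log10(size))
        for rank, _, size in rank_distribution(sizes)
        if size > 0  # log10 is undefined for empty sites
    ]
```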
QUICK FIRST INSIGHTS
• (Outliers are chem. catalogs and astronomy datasets)
• Average size: 974 MB per site (median: 137 MB [!])
• Average text size: 474 MB per site (median: 47 MB [!])
• For comparison:
  • Kampis website @ ELTE = 180 MB (text only)
• Hypothesis: useful comparisons and metrics possible
• Add dynamic aspect...
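The gap between mean and median reported above (974 MB vs. 137 MB, a ratio of roughly 7; 474 MB vs. 47 MB for text, roughly 10) is what a few very large outlier sites produce. The sketch below illustrates the effect with Python's standard `statistics` module. The sample sizes are invented toy numbers, not data from the pilot, and `mean_median_ratio` is a hypothetical helper.

```python
from statistics import mean, median


def mean_median_ratio(sizes_mb):
    """Return (mean, median, mean/median) for a list of site sizes in MB."""
    avg = mean(sizes_mb)
    mid = median(sizes_mb)
    return avg, mid, avg / mid


# Invented toy sizes: four small sites and one large outlier.
toy_sizes_mb = [10, 20, 30, 40, 900]
avg, mid, ratio = mean_median_ratio(toy_sizes_mb)
print(f"mean={avg} MB, median={mid} MB, ratio={ratio:.1f}")
```

A single outlier is enough to pull the mean far above the typical site, which is why the median is the more robust summary here.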
CONCLUSIONS, SUGGESTIONS
• Very first steps, only 2 months into the pilot
• Data intensive, has natural timing
• Big (web) data are important for research assessment
• Big data are often small (also elsewhere...)
• Lends itself to readily available indexes and derivative measures
• We have shown the simplest yet instructive case („size matters“)
• Caveat: need normalizations!