
OR, BIG DATA IN HUNGARY - ARCHIVING AND MINING THE ACADEMIC WEB

GEORGE KAMPIS, CEO, PETABYTE NONPROFIT RESEARCH LTD.

INNOVATION ACCELERATION BY PUBLIC DATA ANALYSIS

PETABYTE NONPROFIT RESEARCH LTD.

www.petabyte-research.org

www.hungarianscience.org

www.textrend.org

www.dynanets.org

www.futurict.szte.hu

CONTEXT IN FUTURICT

• „Innovation Accelerator“

• ... to help (scientific) innovation with [...] social media as well as data services

Helbing, D., & Balietti, S. (2011). How to create an innovation accelerator. The European Physical Journal Special Topics, 195(1), 101-136.

Leydesdorff, L., Rotolo, D., & De Nooy, W. (2012). Innovation as a Nonlinear Process, the Scientometric Perspective, and the Specification of an Innovation Opportunities Explorer. Technology Analysis & Strategic Management (Forthcoming).

van Harmelen, F., Kampis, G., Börner, K., van den Besselaar, P., Schultes, E., Goble, C., ... & Helbing, D. (2012). Theoretical and technological building blocks for an innovation accelerator. The European Physical Journal Special Topics, 214(1), 183-214.

„BIG DATA“

PARTIALLY SIMILAR DEVELOPMENTS

• Mendeley: reference manager and collaboration network

• ResearchGate: research network and publications portal w/ quality assessment

• Altmetrics: article-level online metrics

• VIVO: connect, share, discover

BIG (WEB) DATA IS A KEY

• Big Data in Google trends

• „deep data“
  • controversy...

• Massive Web Data: harvesting / archiving
  • Google itself...
  • The Internet Archive
  • UK web archive, British Library

WEB ARCHIVING IN HUNGARY

• None. Nope.

• „MIA“ (Magyar Internet Archivum, HU Internet Archive)
  • Various documents, plans and small-scale pilots
  • Since 2006

• Our ambition: to archive and mine HU academia = „HUA“
  • 500 NIIF institutions (NIIF = National Information Infrastructure Development)
  • 42 HAS (Hungarian Academy of Sciences) research institutes
  • 47 higher education entities (universities and polytechnics)

• Now in collaboration with: OSZK (National Library), NIIF...

A RUNNING „HUA“ PILOT IN PETABYTE/FUTURICT.HU

• Hardware: Dell T710 server (2x4 core Xeon E5520, 48GB RAM, 2TB HDD)

• Software: Heritrix crawlers called via API and cURL, spawned from timed scripts...

• Not downloaded: exe, gz, iso, jar, mp3, ogg, ppt, rar, wav, xls, xlsx, zip

• Many technical issues: Flash pages, portlet containers (e.g. WebSphere), CMSs (e.g. Joomla)...

• Operation since April 2013

• Longitudinal archiving in mirror format (2-weekly periods), using a form of „diff“ developed in-house
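The exclusion list and the 2-weekly „diff“ idea can be illustrated with a minimal Python sketch. This is not the pilot's own code, which is not shown in these slides: the function names, the use of SHA-256 content hashes, and the dict-based snapshot layout are all assumptions made purely for illustration.

```python
import hashlib
from urllib.parse import urlparse

# Extensions listed on the slide as "not downloaded".
SKIPPED_EXTENSIONS = {
    "exe", "gz", "iso", "jar", "mp3", "ogg",
    "ppt", "rar", "wav", "xls", "xlsx", "zip",
}


def is_skipped(url: str) -> bool:
    """Return True if the URL path ends in one of the skipped extensions."""
    last_segment = urlparse(url).path.lower().rsplit("/", 1)[-1]
    if "." not in last_segment:
        return False
    return last_segment.rsplit(".", 1)[-1] in SKIPPED_EXTENSIONS


def fingerprint(content: bytes) -> str:
    """Content hash used to detect changes (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two snapshots given as {url: fingerprint} mappings.

    Only the 'added' and 'changed' pages would need to be stored again
    in a mirror-style longitudinal archive.
    """
    old_urls, new_urls = set(old), set(new)
    return {
        "added": sorted(new_urls - old_urls),
        "removed": sorted(old_urls - new_urls),
        "changed": sorted(u for u in old_urls & new_urls if old[u] != new[u]),
    }
```

Comparing fingerprints rather than full page bodies keeps each 2-weekly comparison cheap; a real crawler would also have to cope with pages whose content changes trivially on every fetch (timestamps, session tokens), which this sketch ignores.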

THE PROCESSING OF RESULTS

• Future plans: keyword extraction, timed (dynamic) keyword nets, correlation with support programs and grant calls (to analyze ROI in publications, citations, ...terms)

• „The Science of Success“ (A.-L. Barabási)
  • http://www.eccs13.eu/index.php/satellites
  • http://barabasilab.com/success/
  • http://www.facebook.com/SuccessScience

• Bottleneck: availability of public funding data; open data initiatives need to be enforced

• In this pilot phase: basic statistics, turnover rates etc.
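To make the planned „timed (dynamic) keyword nets“ more concrete, here is a hedged Python sketch. It assumes keywords have already been extracted for each document and builds one co-occurrence network per archiving period; the function names and the input layout (period label mapped to a list of per-document keyword lists) are hypothetical and not taken from the pilot.

```python
from collections import Counter
from itertools import combinations


def keyword_net(documents):
    """Build a co-occurrence network from per-document keyword lists.

    Returns a Counter mapping an alphabetically ordered keyword pair
    to the number of documents in which both keywords occur.
    """
    edges = Counter()
    for keywords in documents:
        for pair in combinations(sorted(set(keywords)), 2):
            edges[pair] += 1
    return edges


def timed_keyword_nets(snapshots):
    """Map each period label to the keyword net of that period."""
    return {period: keyword_net(docs) for period, docs in snapshots.items()}


def edge_changes(earlier, later):
    """Edge-weight differences (later minus earlier), non-zero entries only."""
    changes = {}
    for pair in set(earlier) | set(later):
        delta = later[pair] - earlier[pair]
        if delta:
            changes[pair] = delta
    return changes
```

The edge changes between consecutive periods are the kind of signal that could later be correlated with grant calls, once public funding data become available.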

HOW BIG IS BIG?

QUICK RESULTS, BASIC STATS

• All 89 HU academic institutions: 86 GB total (text 42 GB)

• Rank distributions (total)

[Figure: rank distributions of total size; panels: HAS, Higher Ed.]

QUICK RESULTS, BASIC STATS 2.

• Rank distributions (text, i.e. html, doc, docx, rtf, pdf, ps)

[Figure: rank distributions of text size; panels: HAS, Higher Ed.]
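A rank distribution of site sizes can be computed along the following lines. This is a minimal sketch, not the analysis code behind the slides: the function names and the input layout (site name mapped to size in MB) are assumptions, and the log-log transform is included only because rank-size data are commonly plotted that way.

```python
import math


def rank_distribution(sizes_mb):
    """Return (rank, site, size) triples, rank 1 being the largest site."""
    ordered = sorted(sizes_mb.items(), key=lambda item: item[1], reverse=True)
    return [(rank, site, size) for rank, (site, size) in enumerate(ordered, start=1)]


def log_log_points(ranked):
    """Convert ranked sizes to (log10 rank, log10 size) pairs for plotting."""
    return [(math.log10(rank), math.log10(size)) for rank, _, size in ranked if size > 0]
```

Running this separately on the HAS institutes and on the higher education entities would yield the two panels of each rank-distribution slide.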

QUICK FIRST INSIGHTS

• Outliers are chemical catalogs and astronomy datasets

• Average size: 974 MB per site (median: 137 MB [!])

• Average text size: 474 MB per site (median: 47 MB [!])

• For comparison: Kampis website @ ELTE = 180 MB (text only)

• Hypothesis: useful comparisons and metrics possible

• Add dynamic aspect...
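The gap between average and median reported above is the signature of a heavily skewed distribution, where a few very large sites pull the mean up. A short Python sketch makes this explicit; the mean/median ratio as a skew indicator and the toy size list are illustrative assumptions, while the "reported" figures are the ones from the slide.

```python
from statistics import mean, median


def skew_ratio(sizes):
    """Mean divided by median; values far above 1 indicate a few very large sites."""
    return mean(sizes) / median(sizes)


# Figures reported on the slide (MB per site): (average, median).
reported = {"total": (974, 137), "text": (474, 47)}
for label, (avg, med) in reported.items():
    print(f"{label}: mean/median = {avg / med:.1f}")

# Hypothetical toy data: one outlier dominates the mean but not the median.
toy_sizes = [10, 20, 30, 40, 5000]
print(f"toy: mean/median = {skew_ratio(toy_sizes):.1f}")
```

With ratios of roughly 7 (total) and 10 (text), the median is arguably the more representative "typical site" figure, which is also why normalizations matter before comparing institutions.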

CONCLUSIONS, SUGGESTIONS

• Very first steps, only 2 months into the pilot

• Data intensive, has natural timing

• Big (web) data are important for research assessment

• Big data are often small (also elsewhere...)

• Suggests itself for readily available indexes and derivative measures

• We have shown the simplest, yet instructive, case („size matters“)

• Caveat: need normalizations!

FUTURE WORKS... ARE LEFT TO THE (NEAR) FUTURE

THANK YOU!

• Coworkers: Laszlo Gulyas (PhD), Sandor Soos (PhD), Balazs Balint (MSc), Zsolt Juranyi (BSc), Attila Palmai (BSc student)

• This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TÁMOP-4.2.2.C-11/1/KONV-2012-0013).