Web and Social Media
Web and Social Media research at SZTAKI
Zsolt Fekete, Andras Benczur
Institute for Computer Science and Control, Hungarian Academy of Sciences
and Eötvös University
[email protected]
http://datamining.sztaki.hu
14 June 2013
TÁMOP-4.2.2.C-11/1/KONV-2012-0013
Informatics Laboratory
• Data Mining and Search Group
o Zsolt Fekete, head
• Data Warehouse and Business Intelligence group
o Csaba Sidlo, head
• Groups within the lab
o Lajos Ronyai, Theory of Computing group
o Daniel Marx, ERC Starting Grant winner, Parameterized Complexity
o Andras Kornai, Human Language Technologies
Hardware
• 50-node old dual-core Hadoop
• 5-node new Hadoop/HBase
• 260TB net Isilon
Big Data – “Momentum” group
Awarded by the President of the Hungarian Academy of Sciences in 2012
SZTAKI Text Mining Center
• Funded by the President of the Hungarian Academy of Sciences
• Led by Prof. Laszlo Monostori, Research Laboratory on Engineering & Management Intelligence
o Informatics Laboratory (András Benczúr)
o Laboratory of Parallel and Distributed Systems (Péter Kacsuk)
o Internet Technology Department (István Tétényi)
o Department of Distributed Systems (László Kovács)
• Topics:
o trend monitoring; novelty recognition; concept-flow, concept-mapping;
o analysis, monitoring and visualization of theme, professional relation, joint authorship, citations, etc.
o opinion extraction; semantic annotation; domain ontology development;
o identification and resolution of names of persons and organizations;
o plagiarism detection
Connection to FuturICT.hu Work Plan
• Science of Science
o SZTAKI Text Mining Center
o Web classification
o Metadata extraction
o SZTAKI Plagiarism Detection toolkit
• Fully Distributed Learning (and Networks)
o Recommender systems
o Distributed and streaming architectures
o Network influence in recommender systems
Automatic metadata extraction
• Articles in pdf form
• Extracting
o Title
o Authors
o References
o Etc.
• Used techniques
o Computing features (text, visual info)
o Machine Learning: SVM, CRF
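As a rough illustration of the "computing features" step, the sketch below turns one text line of a PDF page into a few textual and visual features of the kind an SVM or CRF could consume. The `Line` record, the feature names and the thresholds are hypothetical, chosen only for this example; they are not taken from the SZTAKI system.

```python
from dataclasses import dataclass
import re


@dataclass
class Line:
    """One text line extracted from a PDF page (hypothetical record)."""
    text: str
    font_size: float  # in points
    y_rel: float      # vertical position on the page: 0.0 = top, 1.0 = bottom
    page: int         # 0-based page index


def line_features(line: Line, body_font_size: float) -> dict:
    """Compute simple visual and textual features for one line."""
    tokens = line.text.split()
    n = max(len(tokens), 1)  # avoid division by zero on empty lines
    return {
        # visual features (assumed useful for spotting titles)
        "rel_font_size": line.font_size / body_font_size,
        "near_top": line.y_rel < 0.25,
        "first_page": line.page == 0,
        # textual features
        "n_tokens": len(tokens),
        "capitalized_ratio": sum(t[0].isupper() for t in tokens) / n,
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", line.text)),
        "starts_like_reference": bool(re.match(r"^\s*\[\d+\]", line.text)),
    }


# Toy input: a title-like line and a reference-like line.
title = Line("Cross-Lingual Web Spam Classification", 18.0, 0.10, 0)
ref = Line("[3] A. Author. Some paper title. 2012.", 9.0, 0.80, 7)
f_title = line_features(title, body_font_size=10.0)
f_ref = line_features(ref, body_font_size=10.0)
```

A classifier would then be trained on such feature dictionaries with labels like title, author or reference; a CRF would additionally use the labels of neighbouring lines.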
• Save resources, select quality and topic
• Legal regulation (porn, illicit content)
• Web scale data (Test: ClueWeb09 25TB – 0.5 Billion English language docs)
Julien Masanes, Philippe Rigaux
Internet Memory, Paris
Cross-Lingual Web Spam Classification. Garzó, Daróczy, Kiss, Siklósi, Benczúr. WebQuality 2013 (@WWW)
The classification power of Web features. Erdelyi, Benczur, Daroczy, Garzo, Kiss, Siklosi. Internet Mathematics, under revision
Crosslingual Web Classification
• Expensive human labeling task language by language?
• How can models be “translated”?
Terms in the English model translated into Portuguese to classify in the target language.
Strongest positive and negative predictions are used for training a model in the target language.
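The two ideas above can be sketched on toy data as follows. This is a minimal illustration that assumes a linear bag-of-words model and a word-to-word dictionary; the weights, the dictionary entries and the function names are made up for the example and do not reproduce the published method.

```python
from collections import Counter


def translate_model(weights: dict, dictionary: dict) -> dict:
    """Map term weights of a source-language linear model to target-language terms."""
    translated = {}
    for term, w in weights.items():
        for target_term in dictionary.get(term, []):
            # If several source terms map to one target term, add their weights.
            translated[target_term] = translated.get(target_term, 0.0) + w
    return translated


def score(doc: str, weights: dict) -> float:
    """Bag-of-words linear score: positive suggests spam, negative non-spam."""
    counts = Counter(doc.lower().split())
    return sum(weights.get(t, 0.0) * c for t, c in counts.items())


def strongest_predictions(docs: list, weights: dict, k: int) -> list:
    """Return the k highest- and k lowest-scoring docs as pseudo-labeled pairs."""
    ranked = sorted(docs, key=lambda d: score(d, weights))
    negatives = [(d, 0) for d in ranked[:k]]
    positives = [(d, 1) for d in ranked[-k:]]
    return positives + negatives


# Hypothetical English model and English-to-Portuguese dictionary.
english_weights = {"casino": 2.0, "cheap": 1.0, "university": -1.5, "research": -1.0}
en_to_pt = {
    "casino": ["cassino"],
    "cheap": ["barato"],
    "university": ["universidade"],
    "research": ["pesquisa"],
}
pt_weights = translate_model(english_weights, en_to_pt)

pt_docs = [
    "cassino barato cassino online",
    "universidade pesquisa ciencia",
    "barato hotel praia",
    "pesquisa noticias",
]
pseudo_labeled = strongest_predictions(pt_docs, pt_weights, k=1)
```

The pseudo-labeled pairs would then serve as training data for a classifier built directly on target-language features; `k` should stay well below half the number of documents so the two sets do not overlap.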
KopFIRE: Technology in the cloud
• BonFIRE FP7 Future Internet Research and Experimentation testbed
• KOPI: A plagiarism detection toolkit
o http://kopi.sztaki.hu/
o Translation plagiarism (English and Hungarian)
o Now serving Wikipedia
o Service puts very heavy load on search index (sentence based checks, existing suboptimal code)
o Index ported to several distributed key-value stores
o New alpha version now fed with Web data
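To show why sentence-based checks load the index so heavily, here is a minimal sketch in which every sentence of a checked document becomes one key lookup. A plain Python dict stands in for a distributed key-value store; the sentence splitter, the normalization and the hashing scheme are assumptions for illustration, not KOPI's actual implementation.

```python
import hashlib
import re


def sentences(text: str) -> list:
    """Naive sentence splitter on '.', '!' and '?' (an assumption for this sketch)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_key(sentence: str) -> str:
    """Normalize a sentence (lowercase, words only) and hash it to a fixed-size key."""
    normalized = " ".join(re.findall(r"\w+", sentence.lower()))
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def index_document(store: dict, doc_id: str, text: str) -> None:
    """Add every sentence of a document to the key-value index."""
    for s in sentences(text):
        store.setdefault(sentence_key(s), set()).add(doc_id)


def check_document(store: dict, text: str) -> dict:
    """Count, per indexed document, how many sentences of `text` it shares."""
    hits = {}
    for s in sentences(text):  # one key lookup per sentence
        for doc_id in store.get(sentence_key(s), ()):
            hits[doc_id] = hits.get(doc_id, 0) + 1
    return hits


store = {}  # stand-in for a distributed key-value store
index_document(store, "wiki:A", "The web is large. Spam is a problem! Search helps.")
index_document(store, "wiki:B", "Search helps. Trends change over time.")
hits = check_document(store, "Spam is a problem. search helps? Something new.")
```

Exact-match hashing only catches copied sentences that survive normalization; detecting translation plagiarism would need an additional step that this sketch does not cover.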
Search for events in time
SZTAKI Full Text Search Technology
Trend analysis
• Temporal data (e.g. blogs)
• Visualizing trends
o Words
o Groups of words
• Challenges
o Big data techniques
o Temporal text indexing
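A toy version of temporal text indexing might look like the sketch below: each word maps to per-month post counts, from which a trend line for a word or a group of words can be read off. The monthly bucketing and the function names are assumptions for illustration; a Web-scale index would need the distributed techniques listed above.

```python
from collections import Counter, defaultdict
from datetime import date


def build_temporal_index(posts):
    """Map each word to a Counter of {(year, month): number of posts containing it}."""
    index = defaultdict(Counter)
    for day, text in posts:
        bucket = (day.year, day.month)
        for word in set(text.lower().split()):  # count each post once per word
            index[word][bucket] += 1
    return index


def trend(index, words, buckets):
    """Per-bucket counts for a group of words, summed over the group."""
    return [sum(index[w][b] for w in words if w in index) for b in buckets]


# Hypothetical blog posts as (date, text) pairs.
posts = [
    (date(2013, 4, 2), "election debate tonight"),
    (date(2013, 4, 20), "election results"),
    (date(2013, 5, 3), "football final"),
    (date(2013, 5, 9), "election recount and football"),
]
index = build_temporal_index(posts)
months = [(2013, 4), (2013, 5)]
election_trend = trend(index, ["election"], months)
sports_trend = trend(index, ["football", "final"], months)
```

Note that a post containing several words of a group is counted once per matching word, so group trends here are sums of word trends rather than distinct post counts.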
Network Influence in Recommenders
Mobility Data Stream processing (Orange D4D)
Stream Processing Architecture Overview
Goal: hide Storm details from the user
• Streaming infrastructure pluggable (could combine with Stratosphere)
• Persistence layer pluggable
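One way to read the "pluggable" design is that user code is written against two small interfaces, one for the streaming engine and one for persistence, so that either can be swapped. The sketch below is an assumption about how such a layering could look; the interface names, `LocalEngine` and `MemoryStore` are hypothetical in-process stand-ins and do not call Storm or Stratosphere.

```python
from typing import Callable, Iterable, Protocol


class Store(Protocol):
    """What user code needs from a persistence layer (hypothetical interface)."""
    def save(self, record: dict) -> None: ...


class StreamEngine(Protocol):
    """What user code needs from a streaming backend (hypothetical interface)."""
    def run(self, source: Iterable[dict],
            operator: Callable[[dict], dict], sink: Store) -> None: ...


class LocalEngine:
    """In-process stand-in; a Storm-backed engine would expose the same method."""
    def run(self, source, operator, sink):
        for event in source:
            sink.save(operator(event))


class MemoryStore:
    """In-memory stand-in for a real persistence backend."""
    def __init__(self):
        self.records = []

    def save(self, record):
        self.records.append(record)


def tag_cell(event: dict) -> dict:
    """User-defined operator: knows nothing about the engine or the store."""
    return {**event, "cell": f"{event['lat']:.1f},{event['lon']:.1f}"}


def run_pipeline(engine: StreamEngine, store: Store, events: Iterable[dict]) -> None:
    """Wire the user operator to whichever engine and store are plugged in."""
    engine.run(events, tag_cell, store)


store = MemoryStore()
run_pipeline(LocalEngine(), store, [{"user": 1, "lat": 5.348, "lon": -4.027}])
```

Replacing `LocalEngine` or `MemoryStore` with another implementation of the same interface leaves `tag_cell` and `run_pipeline` unchanged, which is the point of hiding engine details from the user.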
Conclusions
• SZTAKI covers a chain of research topics
o Web data acquisition
o cleansing and metadata extraction
o search, temporal analytics
o influence detection
o recommendation
• Science of Science
o SZTAKI Text Mining Center
o Multilingual classification for quality, genre, spam
o Metadata extraction from pdf publications over the Web
o SZTAKI Plagiarism Detection toolkit
• Fully Distributed Learning and Networks
o Distributed and streaming architectures
o Network influence in recommender systems
Recent publications
• Pálovics, Benczúr. Temporal influence over the Last.fm social network. IEEE ASONAM 2013
• Garzó et al., Cross-Lingual Web Spam Classification. WebQuality 2013
• Erdélyi et al., The classification power of Web features. Internet Mathematics, under revision
• L. Kocsis, A. György, A. N. Bán, BoostingTree: Parallel Selection of Weak Learners in Boosting, with Application to Ranking. Machine Learning, to appear.
• Garzó et al., Real-time streaming mobility analytics. NetMob 2013
• Göbölös-Szabó, Prytkova, Spaniol, Weikum. Cross-Lingual Data Quality for Knowledge Base Acceleration across Wikipedia Editions. QDB 2012
• Eom, Frahm, Benczúr, Shepelyansky. Time evolution of Wikipedia network ranking. arXiv, 2013.
• C. Sidló, A. Garzó, A. Molnár, A.A. Benczúr, Infrastructures and Bounds for Distributed Entity Resolution, in Proc. QDB in conj. VLDB 2011.
Questions?
Zsolt Fekete
Head, Data Mining and Search
member of the “Big Data” lab
http://datamining.sztaki.hu/
[email protected]
14 June 2013