The Role of Data in IS Research
Frank Hopfgartner
University of Glasgow
@OkapiBM25
Question
Do you use a dataset for your research?
Intended Learning Outcomes
• By the end of this session, you will be able to
– Explain the need for datasets for scientific research
– List components that comprise test collections
– Identify appropriate datasets to answer research hypotheses
– Create your own test collections
Outline
• Importance of Data
• Getting Data
• Using Datasets for IS Research
Why do we use data?
Because it helps us to understand our world
Example: Ngram Viewer
Source: https://books.google.com/ngrams
Example: Online publishing
D. Corney, D. Albakour, M. Martinez, S. Moussa
“What do a Million News Articles look like?” in Proc. NewsIR’16, pp. 42-47, 2016.
Sampling from over 93,000 different news sources recorded in September 2015
Large-scale main News outlets
Single-author Blogs
Summarising: Types of data
Quantitative & Qualitative
Numeric and Textual
Comparison (like with like)
Context
Point-in-time
Longitudinal (series and interval)
Outline
• Importance of Data
• Getting Data
• Using Datasets for IS Research
Example: Opening UK Government
Source: https://data.gov.uk/
Example: UK Data Archive
Over 5,000 data collections
Largely economic and social
Founded in 1967
Office for National Statistics
Medical Research Council
http://www.data-archive.ac.uk/
Example: UK Data Service
https://www.ukdataservice.ac.uk
large-scale government surveys
international macrodata
business microdata
qualitative studies
census data from 1971 to 2011
Non-Public Data
Example: Google Trends
https://www.google.com/trends/home/all/GB
Question
But what if I want to analyse non-public data?
Some people just hack…
http://www.theguardian.com/news/2016/apr/03/what-you-need-to-know-about-the-panama-papers
Disclaimer: This is not an appeal to perform any illegal activities.
Create your own data
• Record data, e.g.,
– Log files of users using information access systems
– Sensor records
– Digitise documents (accepting copyright)
– …
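As a rough illustration of the first bullet, interaction logs are often written as one timestamped record per user action. The sketch below assumes a JSON Lines layout; the function name `log_event` and the field names (`timestamp`, `user`, `action`, `item`) are hypothetical and not taken from any system mentioned in these slides.

```python
import io
import json
from datetime import datetime, timezone


def log_event(stream, user_id, action, item_id, when=None):
    """Append one interaction record to `stream` as a JSON line.

    All field names are illustrative assumptions, not a standard schema.
    """
    record = {
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
        "user": user_id,   # assumed to be a pseudonymous identifier
        "action": action,  # e.g. "query", "click", "play"
        "item": item_id,
    }
    stream.write(json.dumps(record) + "\n")
    return record


# Usage with an in-memory buffer; a real logger would open a file in append mode.
buffer = io.StringIO()
fixed_time = datetime(2016, 4, 3, 12, 0, tzinfo=timezone.utc)
log_event(buffer, "user-001", "query", "lifelogging", when=fixed_time)
log_event(buffer, "user-001", "click", "doc-42", when=fixed_time)
logged = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Keeping one self-describing record per line makes it easy to filter or aggregate the log later, for example by user, action, or time of day.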
Example: Campus wide IPTV provider
• Campus wide IPTV provider
• Live and VoD content
• 16 genres
• 33 channels
• Over 7000 different programme names
• Over 500 unique users
J. Yuan, F. Sikrivaya, F. Hopfgartner, A. Lommatzsch, M. Mu. Context-Aware LDA: Balancing Relevance and Diversity in TV Content
Recommenders. In Proc. RecSysTV workshop, Vienna, Austria, 2015.
Example: Log user interaction data
[Figure: Category Distribution. Axes: time of day, day of week, categories; colour scale: categories chosen count. Categories shown: ARTS, CHILDRENS, COMEDY, DRAMA, ENTERTAINMENT, FACTUAL, FILM, LEARNING, LIFESTYLE, MUSIC, NEWS, NULL, RELIGIONANDETHICS, SPORT, SPORTS, WEATHER]
J. Yuan, F. Sikrivaya, F. Hopfgartner, A. Lommatzsch, M. Mu. Context-Aware LDA: Balancing Relevance and Diversity in TV Content
Recommenders. In Proc. RecSysTV workshop, Vienna, Austria, 2015.
Example: Video retrieval platform
F. Hopfgartner, D. Scott, H. Wang, Y. Yang, Z. Zhang, M. Zhou, C. Gurrin. Helping the Helpers: How Video Retrieval Can Assist
Special Interest Groups. In Proc. MMM'13: 19th International Conference on Multimedia Modeling, pp. 493-495, 2013.
F. Hopfgartner and J. M. Jose. Semantic User Profiling Techniques for personalised multimedia recommendation. Multimedia Systems 14(4-5):255-274, 2010.
F. Hopfgartner and J. M. Jose. An experimental evaluation of ontology-based user profiles. Multimedia Tools and Applications 73(2):1029-1051, 2014.
Summarising: What do I need to consider?
Documentation
Terms of deposit
Permissions and re-use
Software
Methodology
Time
Place
Sampling
Data collection
Editorial control
Classification
Coding
Outline
• Importance of Data
• Getting Data
• Using Datasets for IS Research
Use Case: Evaluation of Information Access Systems
Input → Information Access System → Output
Examples: Web Search Engines
Example: Social Media Search Engines
Example: Product Search Engines
Examples: Multimedia Search Engines
Example: Libraries
How do we evaluate information access systems?
Test collection
• Document collection
• Topic set
• Relevance assessments
But how can we compare with state-of-the-art?
• System A
• System B
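To make the three test collection components concrete, here is a minimal sketch of how a document collection, a topic set, and relevance assessments can be combined to compare two systems on the same data. The class `IRTestCollection`, the toy documents and rankings, and the choice of precision at k as the metric are all illustrative assumptions; evaluation campaigns define their own protocols and metrics.

```python
from dataclasses import dataclass, field


@dataclass
class IRTestCollection:
    """Hypothetical container for the three components of a test collection."""
    documents: dict  # doc_id -> document text
    topics: dict     # topic_id -> statement of the information need
    qrels: dict = field(default_factory=dict)  # topic_id -> set of relevant doc_ids


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def mean_precision_at_k(run, collection, k):
    """Average precision@k over every topic in the collection."""
    scores = [
        precision_at_k(run.get(topic_id, []), collection.qrels.get(topic_id, set()), k)
        for topic_id in collection.topics
    ]
    return sum(scores) / len(scores)


# Toy data, invented purely for illustration.
collection = IRTestCollection(
    documents={"d1": "coffee machines", "d2": "lifelogging", "d3": "espresso", "d4": "weather"},
    topics={"t1": "how to make coffee", "t2": "what is lifelogging"},
    qrels={"t1": {"d1", "d3"}, "t2": {"d2"}},
)
system_a = {"t1": ["d1", "d3", "d2"], "t2": ["d2", "d4", "d1"]}
system_b = {"t1": ["d2", "d1", "d4"], "t2": ["d4", "d1", "d2"]}

score_a = mean_precision_at_k(system_a, collection, k=2)
score_b = mean_precision_at_k(system_b, collection, k=2)
```

Because both systems are scored against the same topics and the same relevance assessments, the two numbers are directly comparable, which is the point of sharing a test collection.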
Evaluation Campaigns
TREC
CLEF
FIRE
NTCIR
Common dataset
Pre-defined tasks
Ground truth
Evaluation protocol
Evaluation metrics
Focus on different domains
Microblogging
Ad-hoc and Web Search
Multimedia
Federated Web Search
XML Retrieval
Information Access in the Legal Domain
Document Similarity
…
Example projects
CLEF Initiative
Source: http://www.isical.ac.in/~fire/2013/slides/other_clef_fire13.pdf
CLEF Tracks
Source: http://www.clef-initiative.eu/track/series
eHealth
ImageCLEF
LifeCLEF
Living Labs for IR (LL4IR)
News Recommendation Evaluation Lab (NewsREEL)
Uncovering Plagiarism, Authorship and Social Software Misuse (PAN)
Social Book Search (SBS)
CLEF ’16
NewsREEL
In CLEF NewsREEL, participants can develop stream-based news recommendation algorithms and have them benchmarked (a) online by millions of users over the period of a few months in a living lab, and (b) offline by simulating a live stream.
F. Hopfgartner, T. Brodt, J. Seiler, B. Kille, A. Lommatzsch, M. Larson, R. Turrin, A. Sereny
“Benchmarking News Recommendations: The CLEF NewsREEL Use Case,” in SIGIR Forum, 49(2):129-136, 2015
Example: News Articles
Source (Image): T. Brodt of plista.com
Profit = Clicks on recommendations
Benchmarking metric: Click-Through-Rate
[Diagram: "Request article" and "Request recommendation" message flows]
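As a small worked example of the benchmarking metric, the sketch below computes a click-through rate from a list of logged events. It assumes the rate is defined as clicks on recommendations divided by recommendation requests served; the event type names `recommendation_request` and `click` and the sample log are hypothetical.

```python
from collections import Counter


def click_through_rate(events):
    """Return clicks divided by recommendation requests (0.0 if none were served)."""
    counts = Counter(event["type"] for event in events)
    requests = counts["recommendation_request"]
    if requests == 0:
        return 0.0
    return counts["click"] / requests


# Hypothetical event log: four recommendation requests and one click.
events = [
    {"type": "recommendation_request", "user": "u1"},
    {"type": "recommendation_request", "user": "u2"},
    {"type": "click", "user": "u2"},
    {"type": "recommendation_request", "user": "u3"},
    {"type": "recommendation_request", "user": "u1"},
]
ctr = click_through_rate(events)
```

With this toy log, `ctr` is one click over four requests.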
Dataset
• Traffic and content updates of nine German-language news content provider websites
• Traffic: Reading article, clicking on recommendations
• Updates: adding and updating news articles
B. Kille, F. Hopfgartner, T. Brodt, T. Heintz
“The plista Dataset” in Proc. NRS'13: International Workshop and Challenge on News Recommender Systems, Hong Kong, China, pp. 16-23, 2013.
Evaluation using offline dataset
[Diagram labels: Idomaar; simulate stream; request articles]
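The idea of simulating a stream from a recorded dataset can be sketched as a replay loop: events are processed in time order, and each recommendation may only use what has been seen so far. This is not Idomaar's interface or the official NewsREEL protocol. The most-popular baseline, the event fields (`t`, `type`, `user`, `article`), and the success criterion (the user later reads a recommended article) are all assumptions made for illustration.

```python
from collections import Counter


def replay(events, k=2):
    """Replay recorded events in time order as a stand-in for a live stream.

    Returns (hits, requests) for a most-popular baseline recommender.
    """
    events = sorted(events, key=lambda e: e["t"])
    popularity = Counter()
    hits = requests = 0
    for i, event in enumerate(events):
        if event["type"] == "read":
            popularity[event["article"]] += 1
        elif event["type"] == "request":
            requests += 1
            # Recommend the k most-read articles so far, excluding the current one.
            candidates = [
                article for article, _ in popularity.most_common()
                if article != event["article"]
            ][:k]
            # Assumed success criterion: the same user reads a recommended article later.
            later_reads = {
                e["article"] for e in events[i + 1:]
                if e["type"] == "read" and e["user"] == event["user"]
            }
            if later_reads & set(candidates):
                hits += 1
    return hits, requests


# Toy recorded log, invented for illustration.
events = [
    {"t": 1, "type": "read", "user": "u1", "article": "a1"},
    {"t": 2, "type": "read", "user": "u2", "article": "a1"},
    {"t": 3, "type": "read", "user": "u3", "article": "a2"},
    {"t": 4, "type": "request", "user": "u3", "article": "a2"},
    {"t": 5, "type": "read", "user": "u3", "article": "a1"},
    {"t": 6, "type": "request", "user": "u1", "article": "a1"},
    {"t": 7, "type": "read", "user": "u1", "article": "a3"},
]
result = replay(events)
```

Replaying in timestamp order keeps the recommender from using future reads when it builds a recommendation; the future is consulted only to score the outcome.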
Example results
B. Kille, A. Lommatzsch, R. Turrin, A. Sereny, M. Larson, T. Brodt, J. Seiler, F. Hopfgartner
“Overview of CLEF NewsREEL 2015: News Recommendation Evaluation Lab,” in Working Notes of CLEF 2015, Toulouse, France, 2015.
Example projects
NTCIR
Source: Hideo Joho
NTCIR-12 Tasks
Second round:
Search-Intent Mining
Mobile Click
Temporal Information Access
Spoken Query & Spoken Document Retrieval
QA Lab for Entrance Exam
First round:
Medical NLP for Clinical Documents
Personal Lifelog Access & Retrieval
Short Text Conversation
LifeLog @ NTCIR-12
Encourage research advances in organising and retrieving from lifelog data.
What is The Quantified Self?
The Quantified Self is about obtaining self-knowledge through self-tracking.
Self-tracking is also referred to as lifelogging, self-analysis, or self-hacking.
Example: Visual Lifelogging
Visual Lifelog of a day
2,000 pictures a day
Slide: Cathal Gurrin
Lifelogging Challenges
The challenges are how to sense the person, their actions, and their life, and how to make this accessible using appropriate interfaces, search, recommendation engines and visual/aural feedback. A further challenge is exploiting the lifelog to identify context for adaptive information services.
Source (Graphic): DAI-Labor, Berlin
Multimodal dataset with information needs
TEST COLLECTION
• Created by three individuals over 10+ days
• 18.18GB
• 88,124 images
• Accompanying output of 1,000 concepts (825MB)
• Data processed pre-release (removal of personal content; face blurring, translation of concepts)
• Detailed user queries and judgments generated by the lifelogging data gatherers
C. Gurrin, H. Joho, F. Hopfgartner, L. Zhou, R. Albatal
“NTCIR-Lifelog: The First Test Collection for Lifelog Research”, in Proc. SIGIR'16: ACM International Conference on Information Retrieval, Pisa, Italy, to appear.
Tasks
Evaluate different methods of retrieval and access.
T1: LIFELOG SEMANTIC ACCESS (LSAT)
• Models the retrieval need from lifelogs (Known-Item Search)
• Retrieve N segments that match information need
• Interactive or Automatic participation
– Interactive: Time limit for fair and comparative evaluation in an interactive system with users
– Automatic: Fully-automatic retrieval system. Automated query processing
T2: LIFELOG INSIGHT
• Models the need for reflection over lifelog data
• Exploratory task, the aim is to:
– encourage broad participation
– encourage novel methods to visualise and explore lifelogs
• Same data as LSAT task
• Presented via demo/poster.
Task 1: Lifelog Semantic Access
Find the moment(s) where I use my coffee machine.
Find the moment(s) where I am in the kitchen.
Find the moment(s) where I am playing with my phone.
Find the moment(s) where I am preparing breakfast.
Task 2: Lifelog Insight Task
Provide insights on the time I spend taking breakfast.
Provide insights on the time I spend driving to work.
Provide insights on the time I spend reading a paper.
Provide insights on the time I spend working on the computer.
Final thoughts
• Data plays an essential role in scientific research since it is used to prove or disprove a hypothesis
• You are now familiar with various sources where you can get datasets that might be useful for your own research
• When selecting data, question its credibility, e.g., is it biased? Can it be used to support your hypotheses?
• Consider accessibility of the data you want to analyse. Are you allowed to use it? Can others (e.g., other researchers) access the data?