Data matters-bournemouth-2015
Data Matters
Alan Dix, Talis & University of Birmingham
http://alandix.com/ref2014/
University of Birmingham
Tiree
Tiree Tech Wave, 22-26 October 2015
today I am not talking about …
• intelligent internet interfaces
• visualisation and sampling
• situated displays, eCampus, small device – large display interactions
• fun and games, virtual crackers, artistic performance, slow time
• creativity and Bad Ideas
• modelling dreams and regret and the emergence of self
…
… or even lots of lights
http://www.hcibook.com/alan/projects/firefly/
I am talking about ...
REF data analysis
long tail of small data
REF
REF 2014: Research Excellence Framework
approx 5 yearly research assessment in the UK
not just about the UK … lots of countries are thinking of doing something similar ... and looking to REF as an example
REF elements
three elements:
outputs (mainly papers)
impact
environment
focus of this work
REF panels
4 main panels, 36 sub-panels, ~200K outputs
sub-panel 11: computer science and informatics
I was on this panel, but NO confidential data here: everything is in the public domain
REF profiles
every output graded: 4* / 3* / 2* / 1*
individual grades confidential and destroyed
each ‘Unit of Assessment’ (dept) given a profile
http://results.ref.ac.uk/Results/ByUoa/11/Outputs
sub-area profiles
N.B. computing only
each output given ACM code
originally to enable allocation to panelists
… but, also used to create sub-area profiles …
sub-area profiles
From Morris Sloman’s slides & panel report
theoretical areas: 30-40% 4*
applied/human areas: 10-20% 4*
data not information
sub-panel report warning: "These data should be treated with circumspection …"
however, already affecting institutional policy: hiring, internal investment
… and may influence research council policy
possible reasons for variation …
1. best applied work is weak – including HCI :-/
2. long tail – weak researchers choose applied areas
3. latent bias – despite panel’s efforts to be fair
can bibliometrics disentangle these?
metrics and assessment
citation metrics known to be good post-hoc correlates of sophisticated measures
… but not for individuals and small cohorts, and there is a danger of gaming and policy distortion
suitable for verifying large-scale patterns (and HEFCE using them for this)
data used for analysis
all in public domain
(virtually) complete list of outputs:
– excluding a few confidential ones
– for each: name, DOI, ACM topic area, Scopus citations
Google Scholar citations for each
– gathered after REF (not used in assessment)
UoA and sub-area profiles
metrics used
Scopus (late 2013 census)
– with/without 2012/13 outputs, as these have few citations
‘Normalised Scopus’
– using ‘contextual data’, corrects for different citation patterns between areas
– places output in top 1%, 5%, 10% of its area worldwide
Google Scholar (late 2014 census)
– with/without 2012/13; zero treated as zero/missing
seven variants – all give similar results
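The ‘Normalised Scopus’ idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the area names, the threshold numbers and the 25% band are all invented here, and the real ‘contextual data’ may be structured quite differently.

```python
# Hypothetical thresholds: the minimum citation count needed to reach each
# worldwide "top N%" band in a given area. All numbers are invented.
THRESHOLDS = {
    "hci": {1: 60, 5: 25, 10: 15, 25: 7},
    "logic": {1: 40, 5: 18, 10: 10, 25: 5},
}


def band(area, citations):
    """Return the tightest top-N% band the output reaches, or None."""
    for pct in sorted(THRESHOLDS[area]):
        if citations >= THRESHOLDS[area][pct]:
            return pct
    return None


def share_in_top_quartile(outputs):
    """Percentage of (area, citations) pairs in the top 25% of their own area."""
    hits = sum(1 for area, cites in outputs if band(area, cites) is not None)
    return 100.0 * hits / len(outputs)
```

Because each output is compared only with its own area, the same raw citation count can land in different bands in different areas, which is the point of the normalisation.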
results … massive differences
[Chart: ratio between % citations in top quartile and % REF 4*; ‘winners’ and ‘losers’ marked]
‘scatter’ graph
[Chart: % outputs in top quartile for citations against % outputs awarded REF 4*, as rank scores; ‘winners’ and ‘losers’ marked]
diagram thanks to Andrew Howes
Another way of looking at it …
world ranking within own field
recall REF …
for example, HCI research (web similar) …
on average …
• HCI/CSCW paper needs to be in top 0.5% worldwide to get 4*
• logic/algorithms paper just needs to be in top 5%
10 fold difference
and just as you thought it was all over …
… institutional effects
look at +/- 25% REF compared with citations
N.B. use high-end weighted measure as money is focused (4:1:0:0)
of 35 losers, 25 are post-1992 universities
of 17 winners, 16 are pre-1992 universities
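A minimal sketch of the 4:1:0:0 weighting and the +/- 25% comparison, assuming each unit has a REF profile of star-level percentages and a citation-based score expressed on the same scale. How that citation-based score is derived is not given on the slide, so `citation_score` is a hypothetical input and the function names are illustrative only.

```python
# Funding-style weights from the slide: 4* counts 4, 3* counts 1, the rest 0.
WEIGHTS = {"4*": 4, "3*": 1, "2*": 0, "1*": 0}


def weighted_score(profile):
    """profile maps star level to percentage of outputs at that level."""
    return sum(WEIGHTS.get(star, 0) * pct for star, pct in profile.items())


def classify(ref_score, citation_score, tolerance=0.25):
    """Label a unit by how far its REF score departs from the citation-based one."""
    ratio = ref_score / citation_score
    if ratio > 1 + tolerance:
        return "winner"
    if ratio < 1 - tolerance:
        return "loser"
    return "neutral"
```

For example, a profile of 20% 4*, 50% 3*, 25% 2* and 5% 1* gives a weighted score of 4 × 20 + 1 × 50 = 130.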
an example …
XXXXXXX – a new university
YYYYYYYY – an old university
World Rankings
REF
and Gender?
"Female authors in main panel B were significantly less likely to achieve a 4* output than male authors with the same metrics ratings. When considered in the UOA models, women were significantly less likely to have 4* outputs than men whilst controlling for metric scores in the following UOAs: Psychology, Psychiatry and Neuroscience; Computer Science and Informatics; Architecture, Built Environment and Planning; Economics and Econometrics."
The Metric Tide (HEFCE, 2015)
implicit bias?
HEFCE analysis: male staff in computing are 1/3 more likely to get a 4* than female staff
areas and types of institution disadvantaged by REF are often those with more women
… implications for future recruitment?
future for research assessment?
• pure metrics?
• metrics as part (e.g. older outputs)
• metrics as under-girding (burden of proof)
• human process – metrics for in-process feedback
..
long tail of small data
Big Data
everyone is talking about it
Twitter, Google, Facebook, NSA, universities, … and funding
Big Data does it with MapReduce
Semantic Data does it with RDF
the long tail
size of data set
a few very large data sets
e.g. Twitter, streams, Open Govt., OS, geonames, dbpedia
the small data of ordinary life:
from local bus timetables to squash club league tables
stories of small data …
Walking Wales
Learning analytics
Open Data Islands and Communities
Musicology
Alan Walks Wales
1058 miles (1700km)
3 million footfalls
3 ½ months
April-July 2013
focus on IT at the margins
one thousand miles of poetry, technology and community
vision
personal: encircling, encompassing, pilgrimage, homecoming
practical: IT for the walker & IT for local communities
philosophical: reflections on walking and space, locality and identity
research: personal agenda and living lab
lots of data
data
location: GPX ... batteries ... sporadic signals ....
bio-sensing: ECG (heart), EDA (skin) and accelerometers
audio and images: in the moment
text: after the event
implicit
explicit
The largest ECG trace in the public domain
challenges (1)
location: GPX – merging and mending
bio-sensing: ECG & EDA – special formats & volume
audio and images: volume, transcription and annotation
text: semantic markup, synchronising sources
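One way to picture the ‘merging and mending’ of GPX is sketched below. It is not the process actually used for the walk data: it assumes GPX 1.1 files with plain UTC timestamps (no fractional seconds), and the function names and the five-minute gap threshold are hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def read_points(gpx_text):
    """Yield (time, lat, lon) for each timestamped track point in a GPX string."""
    root = ET.fromstring(gpx_text)
    for pt in root.iterfind(".//gpx:trkpt", GPX_NS):
        t = pt.find("gpx:time", GPX_NS)
        if t is None or not t.text:
            continue  # skip points with no timestamp
        when = datetime.strptime(t.text.strip(), "%Y-%m-%dT%H:%M:%SZ")
        yield when, float(pt.get("lat")), float(pt.get("lon"))


def merge_tracks(gpx_texts):
    """Merge several recordings into one time-ordered list, first reading wins."""
    seen = {}
    for text in gpx_texts:
        for when, lat, lon in read_points(text):
            seen.setdefault(when, (lat, lon))
    return [(when, *seen[when]) for when in sorted(seen)]


def find_gaps(points, max_gap=timedelta(minutes=5)):
    """Return (start, end) pairs where the signal dropped for longer than max_gap."""
    return [(a[0], b[0]) for a, b in zip(points, points[1:]) if b[0] - a[0] > max_gap]
```

The gaps reported by `find_gaps` are the stretches that would still need mending by hand or by interpolation, for instance where batteries ran out or the signal was sporadic.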
challenges (2)
documentation: methodology of creation, data formats – for other people to use!
meta-data: for machines to use
PR: telling the world about it!
academic culture: we do not value data!
an offer
multiple synchronisable data streams
largest public domain ECG trace
post-hoc analysis: simulate real use
please use it!
Learning analytics
macro-analytics: university strategy, MOOCs
micro-analytics: individual course, student, resource
time frames for learning analytics
days and hours: email, during lectures and labs, student meetings, gaps
week: preparing for teaching, exercises
months/mid-semester: reporting points, staff meetings, cohort/student progress
end of semester/term/year: exams, exam boards, course review
start of semester/term/year: preparing for new courses or re-runs, rollover!
years: new courses, professional development, appraisal, promotion
Open Data
everyone is doing it
Governments, Cities, local gov.
In C21 Data is Power
why not an island?
island data flows
[Diagram: Community (groups and individuals), rest of the world, other communities; flows numbered 1, 2, 3, 4]
island data flows: from community to world
Community
groups and individuals
rest of the world
1
• visibility and control
• identity and empowerment
• level of detail
• local knowledge
island data flows: from world to community
Community
groups and individuals
rest of the world
2
• making the most of open data
• local decision making
• lobbying and negotiation
island data flows: within the community
Community
groups and individuals
3
• gossip is not enough!
• sparse, dispersed population
• social cohesion and economic benefits
island data flows: between communities
Community
groups and individuals
other communities
4
• sharing best practice
• brand presence
• interlinked data
benefits to …
the community
• empowerment and control
• availability of information
• communication within and between communities
the world
• improved quality of data
• level of detail of data
• local knowledge and understanding
In Concert
Concert ephemera
1750–1800 Calendar of London Concerts
1815–1895 Concert Life in London
1894–1944 Concert Programme Exchange (BL)
External sources
MusicBrainz
MBz id as connection into Linked Data, BBC, etc.
Authoritative sources (future)
e.g. British Library BNB, Concert Programmes metadata
concert database
classic digital humanities?
original sources
selected sources
systematic sample
transcription & extraction (medium expertise)
interpretation (high expertise)
digitised sources
authoritative data
analysis & use (high expertise)
academic publication
large digital archive (e.g. BBC)
possibly create linkage
Barriers to progress
effort and expertise
authority and quality
digital acontextuality
openness
Openness and Reward
Career development
Leverhulme & REF
Building the discipline?
Re-envisioning the Digital Archive: Curation and Use
big bang to incremental
digitised sources
authoritative data
academic publication
...
big bang to incremental
problem focused augmentation
transform cost-benefit
digital archive
academic publications
...
partial enhancement & interpretation
scenario-focused investigations
=> reflection and requirements
digital symbiosis
suggestion and confirmation
provenance and authority
spreadsheet as user interface
semantics through interaction
themes and take-aways ...
data in context
heterogeneity and linking
value and values
ethics and empowerment
… and please use my data