Ordering the chaos: Creating websites with imperfect data
Ordering the chaos: creating websites using imperfect data
Andrew Stretton
Oxford University Web SIG November 2014
Who am I, what is ChemBio Hub?
• Andrew Stretton – Data Architect and Developer
github.com/strets123
@strets123
linkedin (google me)
• Chembio Hub
http://chembiohub.ox.ac.uk (feel free to link to us!)
@oxchembiohub
github.com/thesgc
Chembio Hub exists to support research at the
interface of chemistry and biology
by enabling sharing of reagents, expertise and data across 20+ departments
Who are we trying to connect and how?
User 1:Scientist at Oxford
User 2:Potential collaborator
Could be in industry or anywhere in academia
Unpublished results
Negative Data
Equipment
Methods
Areas of expertise
Questions and answers
Contacts
Reagents
Publications
Held on other sites or social networks
Organised/linked to by ChemBio Hub
Stored and curated by ChemBio Hub
? Not sure yet
All of these parts require tagging entities in text. How can we do it cheaply and sustainably?
What sorts of messy data are we working with?
• Full text from procedures, biographies, web sites
• Raw CSV/ Excel formats from multiple machines or departmental processes
• “Standard” XML and JSON formats from various sources that do not map perfectly to our application
• Multiple external databases to submit data to
How do most of our users like their web-based tools?
Simple Search
Flexible data management
Comprehensive, overlapping tagging
Clear progress, seamless experience
What do we sometimes give them?
• Incomplete or many-to-one tagging
• Hyperlinks instead of the right information from the other site
• Dumb search
• Inflexible schemas
• Lack of linking between datasets
What strategies do we have to deal with messy data?
Create more helpful data management apps
Fill in gaps in tagging by using search engines
Consider creating databases of flat files
Create map reduce / database views for schema normalisation and data analysis
Web crawling - not as hard or messy as it used to be
Let’s look at this one first, happy to discuss other areas later…
How do we fill in gaps on un-tagged data?
Let’s do an experiment…
github.com/strets123/web-sig-2014/
Elasticsearch - information extraction on-the-fly
• Take a dataset of 18801 companies
~50% tagged
>80% have some text data
[Bar chart, 0% to 100% of companies with each field: overview or description; overview; description; tags]
Source data: http://jsonstudio.com/resources/
github.com/strets123/web-sig-2014/
Use the “significant terms” feature…
• Which description/overview words are most strongly linked to each tag?
travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel]
education: [students, teachers, learning, education, student, educational]
music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans]
realestate: [estate, real, agents, property, listings]
search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay]
jobs: [job, jobs, employers, career]
onlinemarketing: [marketing, seo, agency, optimization]
projectmanagement: [project, projects, task, collaboration, teams, management]
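A minimal sketch of the kind of request that could produce lists like these, assuming Elasticsearch's `significant_terms` aggregation is run once per tag. The field names `tag_list` and `description` and the aggregation name `linked_words` are assumptions for illustration, not taken from the talk's repository. On recent Elasticsearch versions an analysed text field may need the `significant_text` aggregation instead.

```python
import json


def significant_terms_request(tag, tag_field="tag_list",
                              text_field="description", size=10):
    """Build a search body asking which words in `text_field` are unusually
    common among documents carrying `tag`, compared with the whole index.

    Field names are hypothetical; adjust them to the real mapping.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "linked_words": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_request("music")
print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint; the words come back as buckets under `aggregations.linked_words`.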
Now let’s test these queries
• Which companies have no tag but are most likely to need tagging with “music”…
uPlaya
Description uPlaya provides independent or unsigned musicians with immediate feedback on their music….
Category games_video
Tags -
Webceleb
Description Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music.….
Category games_video
Tags -
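One way such a test query might look, as a sketch only: match the significant terms for a tag against the description, and exclude every document that already has a value in the tag field. The field names are assumptions, and the `bool` / `exists` form shown is the current Query DSL, which may differ from what the original 2014 code used.

```python
def untagged_candidates_request(terms, tag_field="tag_list",
                                text_field="description", size=5):
    """Build a search body that ranks untagged documents by how well their
    text matches the significant terms found for a tag.

    Field names are hypothetical; adjust them to the real mapping.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Any of the terms may match; more matches score higher.
                "must": {"match": {text_field: " ".join(terms)}},
                # Keep only documents with no value in the tag field.
                "must_not": {"exists": {"field": tag_field}},
            }
        },
    }


music_terms = ["music", "artists", "musicians", "songs", "labels"]
request_body = untagged_candidates_request(music_terms)
```

The top hits are then candidates for the “music” tag, to be accepted automatically or reviewed by hand.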
But what if we have
NO TAGS?
A process to extract tags from text…
• Index data
• Assign resources (e.g. Amazon spot instance for large dataset)
• List word counts with the least frequent first
• Exclude lowest counts
• Aggregate the significant terms for each word
• Filter words that have a lot of high scoring significant terms
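The steps above can be sketched in plain Python on a toy corpus. This is not the Elasticsearch implementation: it stands in for the index with in-memory sets, and scores words with a JLH-style heuristic (absolute change times relative change in document frequency), which is my understanding of the default `significant_terms` scoring. The sample texts, thresholds and function names are all invented for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def significant_terms(seed, docs, min_score):
    """Score words in documents containing `seed` against the whole corpus.

    score = (fg% - bg%) * (fg% / bg%), a JLH-style heuristic.
    """
    foreground = [d for d in docs if seed in d]
    bg_counts = Counter(w for d in docs for w in d)
    fg_counts = Counter(w for d in foreground for w in d)
    scores = {}
    for word, fg_n in fg_counts.items():
        fg_pct = fg_n / len(foreground)
        bg_pct = bg_counts[word] / len(docs)
        score = (fg_pct - bg_pct) * (fg_pct / bg_pct)
        if score > min_score:
            scores[word] = score
    return scores


def extract_tags(texts, min_docs=2, min_terms=2, min_score=1.5):
    docs = [set(tokenize(t)) for t in texts]           # "index" the data
    doc_freq = Counter(w for d in docs for w in d)     # word counts
    candidates = sorted(                               # least frequent first,
        (w for w, n in doc_freq.items() if n >= min_docs),  # lowest counts excluded
        key=lambda w: (doc_freq[w], w),
    )
    tags = {}
    for word in candidates:
        scored = significant_terms(word, docs, min_score)
        related = sorted(scored, key=lambda w: (-scored[w], w))
        if len(related) >= min_terms:                  # keep words with several strong terms
            tags[word] = related
    return tags


sample_texts = [  # invented descriptions
    "indie music labels and artists",
    "music streaming for artists and fans",
    "cheap flights and hotels for travel",
    "travel deals on flights and hotels",
    "cloud computing infrastructure",
    "tools for teams",
]
tags = extract_tags(sample_texts)
for tag, related in tags.items():
    print(f"{tag}: {related}")
```

Common words such as "and" and "for" drop out because nothing is unusually frequent alongside them, which is the point of the final filtering step.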
What does this give us?
athletes: [athletes, coaches, athlete, coach, sports, fans]
avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game]
clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure]
dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features]
dial: [dial, calling, calls, voip, number, call, voice, phone]
exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health]
indie: [indie, labels, artists, music]
logos: [logos, branding, flash, design]
pci: [pci, dss, hipaa, compliance, sensitive, compliant]
portland: [portland, oregon, inc, founded]
ringtones: [ringtones, ringtone, personalization, games]
traders: [traders, forex, trader, trading, quotes, stock, trade]
yellow: [yellow, pages, directory, local]
abc: [abc, cnn, nbc, television]
argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin]
aviation: [aviation, aircraft, aerospace, defense, transportation]
airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]
What else can we do with this?
• Filter words that have a lot of high scoring significant terms
• De-duplicate where large overlaps exist
• Assign levels of tags in order of frequency
• Use to categorise new data on the fly using percolate
• Curate manually
• Generate a sidebar menu
github.com/strets123/web-sig-2014/
Use Elasticsearch phrase suggester to create phrase tags
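A small sketch of the de-duplication step, assuming "large overlap" is measured as Jaccard similarity between two tags' term lists. The measure, the 0.25 threshold and the greedy keep-first strategy are assumptions, not the talk's stated method; the term lists are taken from the earlier slides.

```python
def jaccard(a, b):
    """Share of terms two lists have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def merge_overlapping(tag_terms, threshold=0.25):
    """Greedy de-duplication: fold a tag into the first kept tag whose
    term list overlaps with it by at least `threshold`."""
    kept = {}
    aliases = {}
    for tag, terms in tag_terms.items():
        match = next(
            (k for k, k_terms in kept.items() if jaccard(terms, k_terms) >= threshold),
            None,
        )
        if match is None:
            kept[tag] = list(terms)
        else:
            aliases[tag] = match
    return kept, aliases


tag_terms = {
    "travel": ["travel", "travelers", "trip", "trips", "hotels", "flights",
               "traveler", "travellers", "airline", "hotel"],
    "indie": ["indie", "labels", "artists", "music"],
    "airline": ["airline", "fares", "airlines", "flights", "flight",
                "travel", "tickets", "hotel", "air"],
}
kept, aliases = merge_overlapping(tag_terms)
print(list(kept))   # tags that survive
print(aliases)      # tags folded into another tag
```

Here "airline" shares four of fifteen distinct terms with "travel", so it is folded into it, while "indie" stays separate.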
Advantages over direct curation / supervised learning:
• Simplicity and pragmatism
• Applicable to novel domains
– e.g. Chemical Biology
• Auto generated tags choose more appropriate word combinations than manual curators
• No need for complex data formats like RDF
• Data from many sources can be mixed
– e.g. categories from other universities’ sites…
Where might this technology lead?
• How about a tag-based file system?
• How about an implicit social network?
• Elasticsearch is really easy to scale…
• Which websites, filesystems and datasets do you need to categorise?
– Do you really need RDF ontologies, curators etc. or can you just do something simple?
Summary
• We now have many options to categorise and tidy up messy data
• Managing variations on schemas takes a lot of resources – leave it to the data owners if you can!
• When it comes to tagging…
– Perfection is in the eye of the beholder
– Sustainability is really important
Thanks
• Thanks to the Research informatics team at the NDM Structural Genomics Consortium
– Paul Barrett
– Karen Porter
– Michael O’Hagan
– Brian Marsden
– David Damerell
– Sefa Garsot
– Anthony Bradley
• Thanks to the InfoDev team at IT services for answering my endless questions about webauth
• Funders:
– John Fell Fund
– NDM Strategic
– Wellcome Trust
– Higher Education Funding Council
• To everyone here for listening
Any Questions?
• Andrew Stretton
github.com/strets123
@strets123
linkedin (google me)
• Chembio Hub
http://chembiohub.ox.ac.uk
@oxchembiohub
github.com/thesgc
Simple example categorisation code in Python available here:
github.com/strets123/web-sig-2014/
Appendix of other messy data techniques
How do we make it easy to add spreadsheet data to a
system?
Working with flat files
• Sometimes a flat file is the right schema for a dataset
– User defined formats
– Different types of research
– Only some of the fields are relevant when comparing experiments
– Data is not in memory unless needed
• Pandas and HDF allow SQL-like queries on flat files
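As a rough illustration of SQL-like queries on a flat file, the sketch below loads an invented plate-reader export with pandas and filters and aggregates it. The column names and values are made up. An in-memory CSV is used so the example runs without extra dependencies; with an HDF5 store written in table format, a comparable `where` filter can, as far as I know, be passed to `pandas.read_hdf` so that only matching rows are loaded, which needs the optional PyTables package.

```python
import io

import pandas as pd

# Invented instrument export standing in for a real flat file.
raw = io.StringIO(
    "well,compound,concentration_um,signal\n"
    "A1,cmpd-1,10,0.91\n"
    "A2,cmpd-1,1,0.40\n"
    "B1,cmpd-2,10,0.15\n"
    "B2,cmpd-2,1,0.05\n"
)
readings = pd.read_csv(raw)

# Roughly: SELECT well, signal FROM readings
#          WHERE concentration_um = 10 AND signal > 0.5
hits = readings.query("concentration_um == 10 and signal > 0.5")[["well", "signal"]]

# Roughly: SELECT compound, AVG(signal) FROM readings GROUP BY compound
mean_signal = readings.groupby("compound")["signal"].mean()

print(hits)
print(mean_signal)
```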
Helpful data management
• Data Wrangler
– https://player.vimeo.com/video/19185801
• Raw
– http://raw.densitydesign.org
• Take these as inspiration for our tool for re-shaping biochemistry data
Simplifying web crawling
• Modern web crawling patterns use class selectors instead of XPath
– Less likelihood of change
• Content can be crawled using a backend web browser
– Dynamic JavaScript elements are included
• Using a website’s data for classification is more acceptable than wholesale reproduction
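To show why class-based selection tends to survive page redesigns better than a positional path, here is a minimal sketch using only the standard library. A real crawler would more likely use a scraping library with CSS selector support; `ClassTextExtractor`, the HTML snippet and the class names are all invented for illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class,
    regardless of where the element sits in the page."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested element inside a match
        elif self.css_class in classes:
            self.depth = 1              # entering a matching element
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


page = (
    '<div id="x9"><h2 class="company-name">uPlaya</h2>'
    '<p class="summary">Feedback for <b>unsigned</b> musicians</p></div>'
)
extractor = ClassTextExtractor("summary")
extractor.feed(page)
print(extractor.results)
```

Wrapping the paragraph in another `div` or moving it elsewhere on the page would not change the result, whereas a position-based path would need updating.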
Managing multiple JSON schemas with views
Couchbase
PostgreSQL – also supported by Rails/Activerecord
Why views over JSON can be useful
• Expose only required fields from e.g. RDF
• Input format may change but we don’t want crawler to break
• Required fields may change
• Versions are easy to support if format normalisation is in the database layer
• Storage is cheap
• View code is executed only once
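The idea of normalising formats in a view can be sketched as a single mapping function that exposes only the required fields, whichever input layout a document arrived in. In Couchbase this would be a JavaScript map function and in PostgreSQL a SQL view over a JSON column; the Python below only illustrates the shape of the logic. Both input layouts and all field names are hypothetical.

```python
def company_view(doc):
    """Map-style view: emit only the fields the site needs,
    whatever the input layout."""
    if "overview" in doc:
        # Hypothetical older layout: free text plus comma-separated tags.
        text = doc["overview"]
        tags = [t.strip() for t in (doc.get("tag_list") or "").split(",") if t.strip()]
    else:
        # Hypothetical newer layout: nested profile plus a list of tags.
        text = doc.get("profile", {}).get("description", "")
        tags = list(doc.get("tags", []))
    return {"name": doc["name"], "text": text, "tags": tags}


old_style = {"name": "Acme", "overview": "Music marketplace", "tag_list": "music, indie"}
new_style = {"name": "Beta", "profile": {"description": "Travel deals"}, "tags": ["travel"]}

rows = [company_view(d) for d in (old_style, new_style)]
print(rows)
```

If the input format changes again, only this function needs another branch; the crawler and the code reading the view stay the same.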