Ordering the chaos: Creating websites with imperfect data

Post on 11-Jul-2015


Ordering the chaos: creating websites using imperfect data

Andrew Stretton

Oxford University Web SIG November 2014

Who am I, what is ChemBio Hub?

• Andrew Stretton – Data Architect and Developer

github.com/strets123

@strets123

linkedin (google me)

• ChemBio Hub

http://chembiohub.ox.ac.uk (feel free to link to us!)

@oxchembiohub

github.com/thesgc

ChemBio Hub exists to support research at the interface of chemistry and biology by enabling sharing of reagents, expertise and data across 20+ departments

Who are we trying to connect and how?

User 1: Scientist at Oxford

User 2: Potential collaborator

Could be in industry or anywhere in academia

Unpublished results

Negative Data

Equipment

Methods

Areas of expertise

Questions and answers

Contacts

Reagents

Publications

Held on other sites or social networks

Organised/linked to by ChemBio Hub

Stored and curated by ChemBio Hub

? Not sure yet

All of these parts require tagging entities in text. How can we do it cheaply and sustainably?

What sorts of messy data are we working with?

• Full text from procedures, biographies, web sites

• Raw CSV/Excel formats from multiple machines or departmental processes

• “Standard” XML and JSON formats from various sources that do not map perfectly to our application

• Multiple external databases to submit data to

How do most of our users like their web-based tools?

Simple Search

Flexible data management

Comprehensive, overlapping tagging

Clear progress, seamless experience

What do we sometimes give them?

• Incomplete or many-to-one tagging

• Hyperlinks instead of the right information from the other site

• Dumb search

• Inflexible schemas

• Lack of linking between datasets

What strategies do we have to deal with messy data?

Create more helpful data management apps

Fill in gaps in tagging by using search engines

Consider creating databases of flat files

Create map-reduce / database views for schema normalisation and data analysis

Web crawling - not as hard or messy as it used to be

Let’s look at filling in gaps in tagging first; happy to discuss the other areas later…

How do we fill in gaps on un-tagged data?

Let’s do an experiment…

github.com/strets123/web-sig-2014/

Elasticsearch - information extraction on the fly

• Take a dataset of 18801 companies

– ~50% are tagged

– >80% have some text data

[Bar chart: share of companies (0% to 100%) that have each field: overview or description, overview, description, tags]

Source data: http://jsonstudio.com/resources/

github.com/strets123/web-sig-2014/

Use the “significant terms” feature…

• Which description/overview words are most strongly linked to each tag?

travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel]

education: [students, teachers, learning, education, student, educational]

music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans]

realestate: [estate, real, agents, property, listings]

search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay]

jobs: [job, jobs, employers, career]

onlinemarketing: [marketing, seo, agency, optimization]

projectmanagement: [project, projects, task, collaboration, teams, management]
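
As a rough illustration of the question on this slide, the request body for a significant terms aggregation might be built as below. This is only a sketch: the field names `tags` and `description`, the aggregation name `linked_words` and the helper function are assumptions, not taken from the talk's repository, and whether the text field can be aggregated on depends on the index mapping.

```python
import json


def significant_terms_request(tag, tag_field="tags", text_field="description", size=10):
    """Build a search body asking which words stand out in documents carrying `tag`.

    Field names are hypothetical; adjust them to the actual index mapping.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "linked_words": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_request("music")
print(json.dumps(body, indent=2))
```

The body would be sent to the index's search endpoint; the words come back as buckets under the `linked_words` aggregation.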

Now let’s test these queries

• Which companies have no tag but are most likely to need tagging with “music”…

uPlaya

Description: uPlaya provides independent or unsigned musicians with immediate feedback on their music…

Category: games_video

Tags: -

Webceleb

Description: Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music…

Category: games_video

Tags: -
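
A query of this kind could look roughly like the sketch below: match the words linked to a tag against the description, and exclude anything that already has tags. The field names and helper are assumptions, the syntax follows recent Elasticsearch versions rather than necessarily the one used in 2014, and it presumes untagged records simply lack the tags field.

```python
def untagged_candidates_request(words, tag_field="tags", text_field="description", size=10):
    """Build a search body for untagged documents whose text matches `words`.

    Hypothetical field names; assumes untagged documents have no value in `tag_field`.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Rank documents by how well their text matches the linked words.
                "must": [{"match": {text_field: " ".join(words)}}],
                # Skip documents that already carry any tag.
                "must_not": [{"exists": {"field": tag_field}}],
            }
        },
    }


music_words = ["music", "artists", "musicians", "songs"]
request_body = untagged_candidates_request(music_words)
```

The top hits are then the best candidates for receiving the “music” tag.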

But what if we have NO TAGS?

A process to extract tags from text…

• Index data

• Assign resources (e.g. Amazon spot instance for large dataset)

• List word counts with the least frequent first

• Exclude lowest counts

• Aggregate the significant terms for each word

• Filter words that have a lot of high-scoring significant terms
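
To make the process concrete, here is a small in-memory stand-in written in plain Python rather than Elasticsearch. It is a sketch under stated assumptions: the sample texts, thresholds and function names are invented for illustration, and the score (change in frequency multiplied by relative frequency) is modelled on the JLH heuristic that Elasticsearch documents for significant terms, not copied from the talk's code.

```python
import re
from collections import Counter

# Hypothetical sample data, standing in for the indexed company descriptions.
SAMPLE_TEXTS = [
    "indie music labels and artists",
    "music artists sell songs to fans",
    "artists and fans share music playlists",
    "cloud computing infrastructure for private deploy",
    "hybrid cloud computing infrastructure",
    "public cloud infrastructure and computing",
]


def tokenize(text):
    """Lower-case a text and return its distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def significant_terms(docs, word):
    """Score terms that are more common in docs containing `word` than overall."""
    foreground = [d for d in docs if word in d]
    fg_counts = Counter(t for d in foreground for t in d)
    bg_counts = Counter(t for d in docs for t in d)
    scores = {}
    for term, fg in fg_counts.items():
        fg_pct = fg / len(foreground)
        bg_pct = bg_counts[term] / len(docs)
        if fg_pct > bg_pct:
            scores[term] = (fg_pct - bg_pct) * (fg_pct / bg_pct)
    return scores


def extract_tags(texts, min_doc_count=2, min_score=0.5, min_terms=3):
    """Return {word: linked terms} for words with enough high-scoring terms."""
    docs = [tokenize(t) for t in texts]  # "index" the data
    doc_counts = Counter(t for d in docs for t in d)
    # Least frequent words first, with the lowest counts excluded.
    candidates = sorted(
        (w for w, c in doc_counts.items() if c >= min_doc_count),
        key=lambda w: (doc_counts[w], w),
    )
    tags = {}
    for word in candidates:
        scores = significant_terms(docs, word)
        strong = sorted(
            (t for t, s in scores.items()
             if s >= min_score and doc_counts[t] >= min_doc_count),
            key=lambda t: (-scores[t], t),
        )
        # Keep only words with a lot of high-scoring significant terms.
        if len(strong) >= min_terms:
            tags[word] = strong
    return tags


tags = extract_tags(SAMPLE_TEXTS)
```

On this toy data a filler word such as “and” is dropped because nothing besides itself scores highly alongside it, which is the effect the final filtering step is after.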

What does this give us?

athletes: [athletes, coaches, athlete, coach, sports, fans]

avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game]

clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure]

dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features]

dial: [dial, calling, calls, voip, number, call, voice, phone]

exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health]

indie: [indie, labels, artists, music]

logos: [logos, branding, flash, design]

pci: [pci, dss, hipaa, compliance, sensitive, compliant]

portland: [portland, oregon, inc, founded]

ringtones: [ringtones, ringtone, personalization, games]

traders: [traders, forex, trader, trading, quotes, stock, trade]

yellow: [yellow, pages, directory, local]

abc: [abc, cnn, nbc, television]

argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin]

aviation: [aviation, aircraft, aerospace, defense, transportation]

airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]

What else can we do with this?

• Filter words that have a lot of high-scoring significant terms

• De-duplicate where large overlaps exist

• Assign levels of tags in order of frequency

• Use to categorise new data on the fly using percolate

• Curate manually

• Generate a sidebar menu

github.com/strets123/web-sig-2014/

Use the Elasticsearch phrase suggester to create phrase tags
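
The de-duplication step could be approached as in the sketch below, which drops a tag when its term list overlaps heavily with one already kept. The Jaccard measure, the 0.5 threshold and the “coaches” list are assumptions made for illustration; the “athletes” and “indie” lists are the ones shown earlier.

```python
def jaccard(a, b):
    """Overlap between two term lists: shared terms divided by all terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def dedupe_tags(tags, threshold=0.5):
    """Keep one tag from each group whose term lists overlap heavily."""
    kept = {}
    # Visit the longest term lists first so the richest tag of a group survives.
    for tag, terms in sorted(tags.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if all(jaccard(terms, other) < threshold for other in kept.values()):
            kept[tag] = terms
    return kept


candidate_tags = {
    "athletes": ["athletes", "coaches", "athlete", "coach", "sports", "fans"],
    "coaches": ["coaches", "athletes", "coach", "sports", "athlete"],  # hypothetical
    "indie": ["indie", "labels", "artists", "music"],
}
deduped = dedupe_tags(candidate_tags)
```

Here “coaches” shares five of six terms with “athletes” and is dropped, while “indie” is kept.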

Advantages over direct curation / supervised learning:

• Simplicity and pragmatism

• Applicable to novel domains

– e.g. Chemical Biology

• Auto generated tags choose more appropriate word combinations than manual curators

• No need for complex data formats like RDF

• Data from many sources can be mixed

– e.g. categories from other universities’ sites…

Where might this technology lead?

• How about a tag-based file system?

• How about an implicit social network?

• Elasticsearch is really easy to scale…

• Which websites, filesystems and datasets do you need to categorise?

– Do you really need RDF ontologies, curators etc. or can you just do something simple?

Summary

• We now have many options to categorise and tidy up messy data

• Managing variations on schemas takes a lot of resources – leave it to the data owners if you can!

• When it comes to tagging…

– Perfection is in the eye of the beholder

– Sustainability is really important

Thanks

• Thanks to the Research informatics team at the NDM Structural Genomics Consortium

– Paul Barrett

– Karen Porter

– Michael O’Hagan

– Brian Marsden

– David Damerell

– Sefa Garsot

– Anthony Bradley

• Thanks to the InfoDev team at IT services for answering my endless questions about webauth

• Funders:

– John Fell Fund

– NDM Strategic

– Wellcome Trust

– Higher Education Funding Council

• To everyone here for listening

Any Questions?

• Andrew Stretton

github.com/strets123

@strets123

linkedin (google me)

• ChemBio Hub

http://chembiohub.ox.ac.uk

@oxchembiohub

github.com/thesgc

Simple example categorisation code in Python is available here:

github.com/strets123/web-sig-2014/

Appendix of other messy data techniques

How do we make it easy to add spreadsheet data to a system?

Working with flat files

• Sometimes a flat file is the right schema for a dataset

– User defined formats

– Different types of research

– Only some of the fields are relevant when comparing experiments

– Data is not in memory unless needed

• Pandas and HDF allow SQL-like queries on flat files
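
A minimal sketch of what SQL-like queries on a flat file can look like with pandas. The column names and values are invented for illustration, and the data is read from an in-memory CSV string so the example is self-contained; with an HDF store the same kind of filter can be pushed down to the file, which needs the optional PyTables dependency.

```python
import io

import pandas as pd

# Hypothetical flat file: one row per experiment.
csv_text = """experiment,compound,temperature_c,yield_pct
A1,aspirin,25,71.5
A2,aspirin,40,83.0
B1,caffeine,25,64.2
"""

df = pd.read_csv(io.StringIO(csv_text))

# Roughly: SELECT experiment, yield_pct FROM df WHERE yield_pct > 70
high_yield = df.query("yield_pct > 70")[["experiment", "yield_pct"]]

# Roughly: SELECT compound, AVG(yield_pct) FROM df GROUP BY compound
mean_yield = df.groupby("compound")["yield_pct"].mean()
```

Only the columns relevant to a given comparison need to be selected, which fits the point above that only some fields matter when comparing experiments.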

Helpful data management

• Data Wrangler

– https://player.vimeo.com/video/19185801

• Raw

– http://raw.densitydesign.org

• Take these as inspiration for our tool for re-shaping biochemistry data

Simplifying web crawling

• Modern web crawling patterns use class selectors instead of XPath

– Less likely to change

• Content can be crawled using a backend web browser

– Dynamic JavaScript elements are included

• Using a website’s data for classification is more acceptable than wholesale reproduction
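
To illustrate the class-selector idea from the first bullet, the sketch below pulls out the text of every element carrying a given class, wherever it sits in the page. It uses only the standard library as a stand-in; a real crawler would more likely use a CSS-selector library, and the HTML snippet and class names are invented for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching element
        self.parts = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_by_class(html, css_class):
    """Return the text of each element with class `css_class`, in page order."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.results


page = (
    '<div id="c17"><h2 class="company-name">uPlaya</h2>'
    '<p class="company-description">Feedback for <b>unsigned</b> musicians</p></div>'
    '<div><h2 class="company-name">Webceleb</h2></div>'
)
names = text_by_class(page, "company-name")
```

Because the lookup depends only on the class name, it keeps working if the surrounding layout is rearranged, whereas a positional XPath would need updating.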

Managing multiple JSON schemas with views

Couchbase

PostgreSQL – also supported by Rails/ActiveRecord

Why views over JSON can be useful

• Expose only required fields from e.g. RDF

• Input format may change but we don’t want crawler to break

• Required fields may change

• Versions are easy to support if format normalisation is in the database layer

• Storage is cheap

• View code is executed only once
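
The idea of normalising formats in a view can be sketched in plain Python as a map function that exposes only the required fields, whichever input format a document arrived in. This is a stand-in for a Couchbase or PostgreSQL view, not either system's API; the field names (`title`, `overview`, `tag_list`) are assumptions about what the source formats might contain.

```python
def company_view(doc):
    """Map one raw document to the fields the application needs."""
    name = doc.get("name") or doc.get("title")
    text = doc.get("description") or doc.get("overview") or ""
    tags = doc.get("tags") or doc.get("tag_list") or []
    if isinstance(tags, str):
        # Some sources store tags as one comma-separated string.
        tags = [t.strip() for t in tags.split(",") if t.strip()]
    return {"name": name, "text": text, "tags": tags}


# Two hypothetical source formats describing the same kind of record.
raw_docs = [
    {"name": "uPlaya", "overview": "Feedback for musicians", "tag_list": "music, feedback"},
    {"title": "Webceleb", "description": "Music marketplace", "tags": ["music"], "extra": 1},
]
rows = [company_view(d) for d in raw_docs]
```

If an input format changes, only the view function needs updating; the stored raw documents and the code consuming the rows stay as they are.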