Ordering the chaos: Creating websites with imperfect data

Ordering the chaos: creating websites using imperfect data Andrew Stretton Oxford University Web SIG November 2014


Page 1: Ordering the chaos: Creating websites with imperfect data

Ordering the chaos: creating websites using imperfect data

Andrew Stretton

Oxford University Web SIG, November 2014

Page 2: Ordering the chaos: Creating websites with imperfect data

Who am I, what is ChemBio Hub?

• Andrew Stretton – Data Architect and Developer

github.com/strets123

@strets123

linkedin (google me)

• Chembio Hub

http://chembiohub.ox.ac.uk (feel free to link to us!)

@oxchembiohub

github.com/thesgc

Page 3: Ordering the chaos: Creating websites with imperfect data

ChemBio Hub exists to support research at the interface of chemistry and biology by enabling sharing of reagents, expertise and data across 20+ departments

Page 4: Ordering the chaos: Creating websites with imperfect data

Who are we trying to connect and how?

User 1: Scientist at Oxford

User 2: Potential collaborator

Could be in industry or anywhere in academia

Unpublished results

Negative Data

Equipment

Methods

Areas of expertise

Questions and answers

Contacts

Reagents

Publications

Held on other sites or social networks

Organised/linked to by ChemBio Hub

Stored and curated by ChemBio Hub

? Not sure yet

Page 5: Ordering the chaos: Creating websites with imperfect data

All of these parts require tagging entities in text. How can we do it cheaply and sustainably?

Page 6: Ordering the chaos: Creating websites with imperfect data

What sorts of messy data are we working with?

• Full text from procedures, biographies, web sites

• Raw CSV/Excel formats from multiple machines or departmental processes

• “Standard” XML and JSON formats from various sources that do not map perfectly to our application

• Multiple external databases to submit data to

Page 7: Ordering the chaos: Creating websites with imperfect data

How do most of our users like their web-based tools?

Simple Search

Flexible data management

Comprehensive, overlapping tagging

Clear progress, seamless experience

Page 8: Ordering the chaos: Creating websites with imperfect data

What do we sometimes give them?

• Incomplete or many-to-one tagging

• Hyperlinks instead of the right information from the other site

• Dumb search

• Inflexible schemas

• Lack of linking between datasets

Page 9: Ordering the chaos: Creating websites with imperfect data

What strategies do we have to deal with messy data?

Create more helpful data management apps

Fill in gaps in tagging by using search engines

Consider creating databases of flat files

Create map reduce / database views for schema normalisation and data analysis

Web crawling - not as hard or messy as it used to be

Page 10: Ordering the chaos: Creating websites with imperfect data

What strategies do we have to deal with messy data?

Fill in gaps in tagging by using search engines

Let’s look at this one first; happy to discuss other areas later…

Page 11: Ordering the chaos: Creating websites with imperfect data

How do we fill in gaps on un-tagged data?

Let’s do an experiment…

github.com/strets123/web-sig-2014/

Page 12: Ordering the chaos: Creating websites with imperfect data

Elasticsearch - information extraction on-the-fly

• Take a dataset of 18801 companies

– ~50% tagged

– >80% have some text data

[Bar chart, 0% to 100%: share of companies with an overview or description, an overview, a description, tags]

Source data: http://jsonstudio.com/resources/

github.com/strets123/web-sig-2014/

Page 13: Ordering the chaos: Creating websites with imperfect data

Use the “significant terms” feature…

• What description/overview words most strongly linked to each tag?

travel: [travel, travelers, trip, trips, hotels, flights, traveler, travellers, airline, hotel]

education: [students, teachers, learning, education, student, educational]

music: [music, artists, musicians, songs, labels, playlists, bands, song, artist, fans]

realestate: [estate, real, agents, property, listings]

search engine optimization: [seo, optimization, engine, ppc, marketing, search, click, pay]

jobs: [job, jobs, employers, career]

onlinemarketing: [marketing, seo, agency, optimization]

projectmanagement: [project, projects, task, collaboration, teams, management]
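A minimal sketch of the kind of request body that asks Elasticsearch for these words, using its significant_terms aggregation. The field names tag_list and overview and the aggregation name tag_words are assumptions for illustration, not taken from the talk's repository.

```python
import json


def significant_terms_query(tag, tag_field="tag_list", text_field="overview", size=10):
    """Build a search body asking which words in text_field are unusually
    common among documents that carry the given tag.

    tag_field and text_field are hypothetical field names.
    """
    return {
        "size": 0,  # only the aggregation is wanted, not the matching documents
        "query": {"term": {tag_field: tag}},
        "aggs": {
            "tag_words": {
                "significant_terms": {"field": text_field, "size": size}
            }
        },
    }


body = significant_terms_query("music")
print(json.dumps(body, indent=2))
```

The body would be sent to the index's _search endpoint; the words come back as buckets under the tag_words aggregation.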

Page 14: Ordering the chaos: Creating websites with imperfect data

Now let’s test these queries

• Which companies have no tag but are most likely to need tagging with “music”…

uPlaya

Description: uPlaya provides independent or unsigned musicians with immediate feedback on their music….

Category: games_video

Tags: -

Webceleb

Description: Webceleb is music marketplace and community where musicians and fans engage and profit from discovering, purchasing and downloading the latest independent music.….

Category: games_video

Tags: -
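One way such a test query might look, as a sketch: match documents that have no tag field at all but whose text contains at least one of the significant words. It assumes untagged companies simply lack the (hypothetical) tag_list field, and it uses the bool/exists syntax of recent Elasticsearch releases; older releases expressed the "field is missing" condition differently.

```python
def untagged_candidates_query(terms, tag_field="tag_list", text_field="overview", size=10):
    """Build a search body for untagged documents whose text mentions
    any of the given significant terms (field names are hypothetical)."""
    return {
        "size": size,
        "query": {
            "bool": {
                # keep only documents where the tag field is absent
                "must_not": [{"exists": {"field": tag_field}}],
                # each matching term adds to the relevance score
                "should": [{"match": {text_field: term}} for term in terms],
                "minimum_should_match": 1,
            }
        },
    }


music_terms = ["music", "artists", "musicians", "songs", "labels"]
body = untagged_candidates_query(music_terms)
```

Documents that match more of the terms score higher, so the top hits are the strongest candidates for the tag.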

Page 15: Ordering the chaos: Creating websites with imperfect data

But what if we have

NO TAGS?

Page 16: Ordering the chaos: Creating websites with imperfect data

A process to extract tags from text…

• Index data

• Assign resources (e.g. Amazon spot instance for large dataset)

• List word counts with the least frequent first

• Exclude lowest counts

• Aggregate the significant terms for each word

• Filter words that have a lot of high scoring significant terms
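The steps above can be sketched in plain Python on a toy corpus. This is not the Elasticsearch implementation: the score is a simple ratio of how often a word appears alongside the candidate versus in the corpus overall, standing in for Elasticsearch's own significance scoring, and the thresholds are arbitrary assumptions.

```python
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]


def candidate_tags(docs, min_count=2, min_score=1.5, min_related=2):
    """Return {word: related words} for words with enough strongly linked terms."""
    doc_words = [set(tokenize(d)) for d in docs]  # the "index"
    n_docs = len(doc_words)
    doc_freq = Counter(w for words in doc_words for w in words)

    # List words with the least frequent first, excluding the lowest counts.
    candidates = sorted(
        (w for w, c in doc_freq.items() if c >= min_count),
        key=lambda w: (doc_freq[w], w),
    )

    result = {}
    for word in candidates:
        # Foreground set: documents that contain the candidate word.
        foreground = [words for words in doc_words if word in words]
        co_counts = Counter(w for words in foreground for w in words if w != word)
        related = []
        for other, count in co_counts.items():
            fg_rate = count / len(foreground)
            bg_rate = doc_freq[other] / n_docs
            if count >= min_count and fg_rate / bg_rate >= min_score:
                related.append(other)
        # Keep only words with several high-scoring related terms.
        if len(related) >= min_related:
            result[word] = sorted(related)
    return result


docs = [
    "indie music labels artists",
    "indie artists music fans",
    "cloud computing infrastructure",
    "cloud infrastructure hosting",
    "music streaming cloud",
    "travel hotels flights",
]
tags = candidate_tags(docs)
```

On a real dataset the counting and scoring would be left to the search engine; the sketch only shows the shape of the pipeline.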

Page 17: Ordering the chaos: Creating websites with imperfect data

What does this give us?

athletes: [athletes, coaches, athlete, coach, sports, fans]

avatars: [avatars, avatar, multiplayer, virtual, casual, 3d, games, chat, create, game]

clouds: [clouds, cloud, hybrid, computing, private, deploy, public, infrastructure]

dashboards: [dashboards, bi, reports, analytics, reporting, self, analysis, intelligence, features]

dial: [dial, calling, calls, voip, number, call, voice, phone]

exercise: [exercise, sleep, nutrition, fitness, weight, healthy, health]

indie: [indie, labels, artists, music]

logos: [logos, branding, flash, design]

pci: [pci, dss, hipaa, compliance, sensitive, compliant]

portland: [portland, oregon, inc, founded]

ringtones: [ringtones, ringtone, personalization, games]

traders: [traders, forex, trader, trading, quotes, stock, trade]

yellow: [yellow, pages, directory, local]

abc: [abc, cnn, nbc, television]

argentina: [argentina, buenos, aires, chile, uruguay, colombia, brazil, mexico, latin]

aviation: [aviation, aircraft, aerospace, defense, transportation]

airline: [airline, fares, airlines, flights, flight, travel, tickets, hotel, air]

Page 18: Ordering the chaos: Creating websites with imperfect data

What else can we do with this?

• Filter words that have a lot of high scoring significant terms

• De-duplicate where large overlaps exist

• Assign levels of tags in order of frequency

• Use to categorise new data on the fly using percolate

• Curate manually

• Generate a sidebar menu

• Use the Elasticsearch phrase suggester to create phrase tags

github.com/strets123/web-sig-2014/
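As an illustration of the de-duplication step, a sketch that drops a candidate tag when its term list overlaps heavily with one already kept. Jaccard similarity and the 0.5 threshold are assumptions chosen for the example; the "artists" entry is hypothetical, while "indie" and "logos" come from the earlier output.

```python
def jaccard(a, b):
    """Share of terms that two term lists have in common."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def deduplicate(tag_terms, threshold=0.5):
    """Keep a tag only if it does not overlap strongly with an earlier kept tag."""
    kept = {}
    for tag, terms in tag_terms.items():
        if all(jaccard(terms, other) < threshold for other in kept.values()):
            kept[tag] = terms
    return kept


tag_terms = {
    "indie": ["indie", "labels", "artists", "music"],
    "artists": ["artists", "indie", "labels", "music", "songs"],  # hypothetical near-duplicate
    "logos": ["logos", "branding", "flash", "design"],
}
unique_tags = deduplicate(tag_terms)
```

Because earlier entries win, ordering the input by frequency decides which of two overlapping tags survives.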

Page 19: Ordering the chaos: Creating websites with imperfect data

Advantages over direct curation / supervised learning:

• Simplicity and pragmatism

• Applicable to novel domains

– e.g. Chemical Biology

• Auto-generated tags can choose more appropriate word combinations than manual curators

• No need for complex data formats like RDF

• Data from many sources can be mixed

– e.g. categories from other universities’ sites…

Page 20: Ordering the chaos: Creating websites with imperfect data

Where might this technology lead?

• How about a tag-based file system?

• How about an implicit social network?

• Elasticsearch is really easy to scale…

• Which websites, filesystems and datasets do you need to categorise?

– Do you really need RDF ontologies, curators etc. or can you just do something simple?

Page 21: Ordering the chaos: Creating websites with imperfect data

Summary

• We now have many options to categorise and tidy up messy data

• Managing variations on schemas takes a lot of resources – leave it to the data owners if you can!

• When it comes to tagging…

– Perfection is in the eye of the beholder

– Sustainability is really important

Page 22: Ordering the chaos: Creating websites with imperfect data

Thanks

• Thanks to the Research Informatics team at the NDM Structural Genomics Consortium

– Paul Barrett

– Karen Porter

– Michael O’Hagan

– Brian Marsden

– David Damerell

– Sefa Garsot

– Anthony Bradley

• Thanks to the InfoDev team at IT services for answering my endless questions about webauth

• Funders:

– John Fell Fund

– NDM Strategic

– Wellcome Trust

– Higher Education Funding Council

• To everyone here for listening

Page 23: Ordering the chaos: Creating websites with imperfect data

Any Questions?

• Andrew Stretton

github.com/strets123

@strets123

linkedin (google me)

• ChemBio Hub

http://chembiohub.ox.ac.uk

@oxchembiohub

github.com/thesgc

Simple example categorisation code (in Python) available here:

github.com/strets123/web-sig-2014/

Page 24: Ordering the chaos: Creating websites with imperfect data

Appendix of other messy data techniques

Page 25: Ordering the chaos: Creating websites with imperfect data

How do we make it easy to add spreadsheet data to a system?

Page 26: Ordering the chaos: Creating websites with imperfect data

Working with flat files

• Sometimes a flat file is the right schema for a dataset

– User defined formats

– Different types of research

– Only some of the fields are relevant when comparing experiments

– Data is not in memory unless needed

• Pandas and HDF allow SQL-like queries on flat files

Page 27: Ordering the chaos: Creating websites with imperfect data

Helpful data management

• Data Wrangler

– https://player.vimeo.com/video/19185801

• Raw

– http://raw.densitydesign.org

• Take these as inspiration for our tool for re-shaping biochemistry data

Page 28: Ordering the chaos: Creating websites with imperfect data

Simplifying web crawling

• Modern web crawling patterns use class selectors instead of XPath

– Less likelihood of change

• Content can be crawled using a backend web browser

– Dynamic JavaScript elements are included

• Using a website’s data for classification is more acceptable than wholesale reproduction
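A small sketch of the class-selector idea using only Python's standard library: text is picked out by the element's class, so the match survives if the element moves elsewhere in the page. The class name researcher-bio and the sample HTML are made up; a real crawler would more likely use a dedicated scraping library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element whose class attribute includes target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than zero while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested element inside a match
        elif self.target_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


html = """
<div id="main"><section>
  <p class="researcher-bio">Works on <b>kinase</b> inhibitors.</p>
</section>
<p class="footer">Contact us</p></div>
"""
extractor = ClassTextExtractor("researcher-bio")
extractor.feed(html)
bios = extractor.results
```

An XPath such as /div/section/p would break as soon as the wrapping section changed, whereas the class match above does not depend on position.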

Page 29: Ordering the chaos: Creating websites with imperfect data

Managing multiple JSON schemas with views

Couchbase

PostgreSQL – also supported by Rails/ActiveRecord

Page 30: Ordering the chaos: Creating websites with imperfect data

Why views over JSON can be useful

• Expose only required fields from e.g. RDF

• Input format may change but we don’t want crawler to break

• Required fields may change

• Versions are easy to support if format normalisation is in the database layer

• Storage is cheap

• View code is executed only once
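To illustrate the idea, a sketch of a view as a function that maps raw documents in two input formats onto the few fields the site needs. It is written in Python as a stand-in for a database view or map function; the schema_version, name, title, tag_list and tags fields are all hypothetical.

```python
def company_view(doc):
    """Normalise a raw JSON document (as a dict) to the fields the site uses.

    Version 1 documents (hypothetical) store a name and a comma-separated
    tag_list string; version 2 documents store a title and a tags array.
    """
    version = doc.get("schema_version", 1)
    if version == 1:
        name = doc.get("name")
        raw_tags = doc.get("tag_list") or ""
        tags = [t.strip() for t in raw_tags.split(",") if t.strip()]
    else:
        name = doc.get("title")
        tags = list(doc.get("tags") or [])
    # Only the required fields are exposed; everything else stays in storage.
    return {"name": name, "tags": tags}


old_doc = {"name": "uPlaya", "tag_list": "music, feedback", "overview": "..."}
new_doc = {"schema_version": 2, "title": "Webceleb", "tags": ["music"], "extra": {}}
rows = [company_view(old_doc), company_view(new_doc)]
```

Supporting a further input format then means adding one branch to the view, while the crawler and the stored documents stay unchanged.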