Metadata based statistics for DSpace

Transcript of Metadata based statistics for DSpace

Page 1: Metadata based statistics for DSpace

www.atmire.com

Metadata based usage statistics

Page 2: Metadata based statistics for DSpace

OVERVIEW

1. Why DSpace statistics?

2. Usage event vs. Item metadata

3. Generating metadata based statistics

4. Linking metadata to usage events

5. Performance

6. Problem solved?

Page 3: Metadata based statistics for DSpace

Statistics solution that knows DSpace:

Structure

“Which are the most downloaded bitstreams in a collection?”

Metadata

“Who are the most popular authors in terms of downloads?”

1 - WHY DSPACE STATISTICS?

Page 4: Metadata based statistics for DSpace

USAGE EVENT VS. ITEM METADATA

2 types of metadata:

Usage event metadata

Additional information about the usage event

Item metadata

Additional information about the target of the usage event

Page 5: Metadata based statistics for DSpace

USAGE EVENT METADATA

Additional information about the usage event

Not related to repository

Also possible with other statistics solutions:

• IP address

• Country

• User Agent

• HTTP Referrer

• ...

Page 6: Metadata based statistics for DSpace

ITEM METADATA

Relate usage event to information stored in your repository.

Allows statistics queries based on item metadata.

→ Not possible with a statistics solution that is not tied to the repository.

Page 7: Metadata based statistics for DSpace

GENERATING METADATA BASED STATISTICS

How many downloads did author “Barnes, Douglas F.” get in the last year, grouped by month?

Page 13: Metadata based statistics for DSpace

LINKING METADATA TO USAGE EVENTS

Solr query:

http://localhost:8080/solr/statistics/select?facet=true&facet.offset=0&facet.mincount=1&facet.sort=false&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view&fq=-isBot:true&fq=-isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes,\+Douglas\+F.)+&wt=javabin&rows=0

Page 14: Metadata based statistics for DSpace

LINKING METADATA TO USAGE EVENTS

facet.field=dateYearMonth → group by the field dateYearMonth

fq=type:+0 → only include bitstream downloads

fq=bundleName:ORIGINAL → only include files in bundle “ORIGINAL”

fq=-isBot:true → filter out all bot statistics

fq=-isInternal:true → filter out all internal statistics

fq=time:[2014-07-01+TO+2015-06-06] → only include stats that are between Jul 1st 2014 and Jun 6th 2015

fq=+(author_mtdt:Barnes,\+Douglas\+F.)+ → only include statistics that are by author Barnes, Douglas F.
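The way these parameters combine into one request can be sketched in a few lines of Python. This is a reduced version of the query shown above, not the exact one: the function name `build_downloads_by_month_url` is hypothetical, and tuning parameters such as facet.method and wt are left out.

```python
from urllib.parse import urlencode

def build_downloads_by_month_url(base_url, author, start, end):
    """Assemble a monthly download facet query for one author (illustrative)."""
    # Escape spaces so the whole author name is matched as a single term.
    author_term = author.replace(" ", "\\ ")
    params = [
        ("q", "*:*"),
        ("rows", "0"),                      # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),   # group by month
        ("facet.mincount", "1"),
        ("fq", "type:0"),                   # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        ("fq", f"author_mtdt:{author_term}"),
    ]
    # urlencode accepts a list of pairs, so "fq" can be repeated.
    return base_url + "?" + urlencode(params)

url = build_downloads_by_month_url(
    "http://localhost:8080/solr/statistics/select",
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Every fq narrows the set of usage events; the single facet.field then does the grouping.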

Page 15: Metadata based statistics for DSpace

<response>
  <lst name="responseHeader"> ... </lst>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>
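To show how such a response turns into a report, the sketch below parses the facet counts with Python's standard XML parser and checks that the monthly counts add up to numFound. The variable names are illustrative, and the elided responseHeader is left out.

```python
import xml.etree.ElementTree as ET

response_xml = """<response>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts"><lst name="facet_fields">
    <lst name="dateYearMonth">
      <int name="2014-07">15</int><int name="2014-08">19</int>
      <int name="2014-09">15</int><int name="2014-10">10</int>
      <int name="2014-11">7</int><int name="2014-12">13</int>
      <int name="2015-01">13</int><int name="2015-02">15</int>
      <int name="2015-03">21</int><int name="2015-04">22</int>
      <int name="2015-05">12</int><int name="2015-06">2</int>
    </lst>
  </lst></lst>
</response>"""

root = ET.fromstring(response_xml)
num_found = int(root.find("result").get("numFound"))

# One <int> per month: the name attribute is the month, the text is the count.
facet = root.find(".//lst[@name='dateYearMonth']")
downloads_per_month = {e.get("name"): int(e.text) for e in facet.findall("int")}

total = sum(downloads_per_month.values())
busiest_month = max(downloads_per_month, key=downloads_per_month.get)
print(total, num_found, busiest_month)   # 164 164 2015-04
```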

Page 16: Metadata based statistics for DSpace

LINKING METADATA TO USAGE EVENTS

In a vanilla DSpace installation:

• Usage statistics only contain bitstream IDs: no metadata

• The metadata is stored in the database

Page 17: Metadata based statistics for DSpace

PROPOSED SOLUTION

1. Query the database for bitstream IDs based on the author metadata

2. Use those IDs to query Solr for statistics

Page 18: Metadata based statistics for DSpace

PROPOSED SOLUTION: DOWNSIDES

• Two queries to answer one question

• The Solr query can get very long and becomes inefficient to execute

• Inefficient but still possible
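A minimal sketch of the two-step approach shows where the long query comes from. The dictionary stands in for the DSpace database, and the function names and the extra IDs 55 and 71 are hypothetical; a real lookup would be a SQL query on the metadata tables.

```python
# Hypothetical stand-in for the database: bitstream ID -> authors of its item.
BITSTREAM_AUTHORS = {
    54: ["Khandker, Shahidur R.", "Barnes, Douglas F.", "Samad, Hussain A."],
    55: ["Ferreira, Francisco H. G."],
    71: ["Barnes, Douglas F."],
}

def bitstream_ids_for_author(author):
    """Step 1: find the bitstreams that belong to items by this author."""
    return sorted(bid for bid, authors in BITSTREAM_AUTHORS.items()
                  if author in authors)

def solr_filter_for_ids(bitstream_ids):
    """Step 2: turn those IDs into a Solr filter on the usage events."""
    return "id:(" + " OR ".join(str(bid) for bid in bitstream_ids) + ")"

ids = bitstream_ids_for_author("Barnes, Douglas F.")
print(solr_filter_for_ids(ids))   # id:(54 OR 71)

# The filter gains one clause per bitstream, so prolific authors
# produce very long queries.
print(len(solr_filter_for_ids(range(1, 5001))))
```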

Page 19: Metadata based statistics for DSpace

PROPOSED SOLUTION: DOWNSIDES

What if we want to show the 10 authors with the most downloads?

• query the database for all authors

• query Solr to get the number of usage events for each author

• sort those counts, and return the 10 highest

Page 20: Metadata based statistics for DSpace

PROPOSED SOLUTION: DOWNSIDES

Very inefficient!

• do a lot of queries

• throw away most of the results: we only need top 10
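The cost of this approach can be made visible with a small simulation. Everything here is made up for illustration: the author names, the event list, and `count_downloads`, which stands in for one Solr count query.

```python
# Hypothetical data: author -> bitstream IDs (database side) and one
# bitstream ID per recorded download (Solr side).
author_bitstreams = {"Author A": [1, 2], "Author B": [3], "Author C": [4, 5]}
download_events = [1, 1, 2, 3, 3, 3, 3, 4]
solr_queries = 0

def count_downloads(bitstream_ids):
    """Stand-in for one Solr count query restricted to the given IDs."""
    global solr_queries
    solr_queries += 1
    wanted = set(bitstream_ids)
    return sum(1 for bid in download_events if bid in wanted)

def top_authors(n):
    # One query per author, even though only n results are kept.
    counts = {a: count_downloads(ids) for a, ids in author_bitstreams.items()}
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

top = top_authors(2)
print(top)            # [('Author B', 4), ('Author A', 3)]
print(solr_queries)   # 3: one query per author in the repository
```

The number of queries grows with the number of authors, not with the size of the requested top list.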

Page 21: Metadata based statistics for DSpace

SOLR FACETS

To do a facet query:

• specify “facet.field” along with the regular query

• results will be grouped by the values they have for that field

Page 22: Metadata based statistics for DSpace

SOLR FACETS: EXAMPLE

q=type:0&facet.field=owningItem

q=type:0

search for all usage events that are bitstream downloads

facet.field=owningItem

group these by item

count the number of records in each group
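What the facet does can be mimicked in plain Python: filter first, then group and count. The events below are invented for illustration (item 1700 is hypothetical), and Solr performs the same steps inside the index rather than over a list.

```python
from collections import Counter

usage_events = [
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1700},
    {"type": 2, "owningItem": None},   # another event type, excluded by q=type:0
]

# q=type:0 -> keep only bitstream downloads
downloads = [e for e in usage_events if e["type"] == 0]

# facet.field=owningItem -> group by item and count the records per group
facet_counts = Counter(e["owningItem"] for e in downloads)
print(facet_counts.most_common())   # [(1652, 2), (1700, 1)]
```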

Page 23: Metadata based statistics for DSpace

OUR SOLUTION

• Add item metadata to Solr.

• Use built-in filtering and grouping
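Once each usage event carries the item's authors, the "top authors" question becomes a single grouping step. The sketch below imitates that with a multi-valued author_mtdt field on invented events; in Solr this would presumably be one facet on author_mtdt with a facet.limit, which is an assumption about the setup rather than something shown in these slides.

```python
from collections import Counter

# Hypothetical download events that already include the item's authors.
events = [
    {"type": 0, "author_mtdt": ["Khandker, Shahidur R.", "Barnes, Douglas F.",
                                "Samad, Hussain A."]},
    {"type": 0, "author_mtdt": ["Barnes, Douglas F."]},
    {"type": 0, "author_mtdt": ["Ferreira, Francisco H. G."]},
]

def top_authors(usage_events, limit=10):
    """Count each download once for every author of the downloaded item."""
    counts = Counter()
    for event in usage_events:
        counts.update(event["author_mtdt"])
    return counts.most_common(limit)

print(top_authors(events, limit=1))   # [('Barnes, Douglas F.', 2)]
```

No database lookup and no per-author query is needed; the counting happens in one pass over the events.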

Page 24: Metadata based statistics for DSpace

CHALLENGE: SIZE OF THE SOLR CORE

That solution creates new challenges

Metadata is duplicated in every statistical record

that takes up a lot of space

and it needs to be kept in sync

Page 25: Metadata based statistics for DSpace

SIZE OF SINGLE USAGE EVENT

<doc>
  <str name="ip">177.21.194.80</str>
  <arr name="ip_search"><str>177.21.194.80</str></arr>
  <arr name="ip_ngram"><str>177.21.194.80</str></arr>
  <int name="type">0</int>
  <int name="id">54</int>
  <date name="time">2015-05-11T04:33:49.077Z</date>
  <str name="dateYearMonth">2015-05</str>
  <str name="dateYear">2015</str>
  <str name="continent">SA</str>
  <str name="countryCode">BR</str>
  <float name="latitude">-10.0</float>
  <float name="longitude">-55.0</float>
  <arr name="bundleName"><str>ORIGINAL</str></arr>
  <arr name="containerBitstream"><int>54</int></arr>
  <arr name="owningItem"><int>1652</int></arr>
  <arr name="containerItem"><int>1652</int></arr>
  <arr name="owningColl"><int>14</int></arr>
  <arr name="containerCollection"><int>14</int></arr>
  <arr name="owningComm"><int>1</int></arr>
  <arr name="containerCommunity"><int>1</int></arr>
  <str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
  <bool name="isBot">false</bool>
  <bool name="isInternal">false</bool>
  <str name="statistics_type">view</str>
  <long name="_version_">1501767933804675072</long>
</doc>

25 elements

Page 26: Metadata based statistics for DSpace

<doc>
  <str name="ip">177.21.194.80</str>
  ...
  <arr name="author_mtdt">
    <str>Khandker, Shahidur R.</str>
    <str>Barnes, Douglas F.</str>
    <str>Samad, Hussain A.</str>
  </arr>
  <arr name="subject_mtdt">
    <str>ACCESS TO LIGHTING</str>
    <str>ACCESS TO MODERN ENERGY</str>
    <str>AGRICULTURAL LAND</str>
    <str>AGRICULTURAL RESIDUE</str>
    <str>AIR CONDITIONERS</str>
    <str>AIR POLLUTION</str>
    <str>ALTERNATIVE ENERGY</str>
    <str>ALTERNATIVE SOURCES OF ENERGY</str>
    <str>APPROACH</str>
    <str>ATMOSPHERE</str>
    <str>AVAILABILITY</str>
    <str>BASIC ENERGY</str>
    <str>BIOMASS</str>
    <str>BIOMASS BURNING</str>
    <str>BIOMASS COLLECTION</str>
    <str>BIOMASS CONSUMPTION</str>
    <str>BIOMASS ENERGY</str>
    ...
    <str>WORLD ENERGY</str>
    <str>WORLD ENERGY OUTLOOK</str>
  </arr>
  ...
</doc>

SIZE OF SINGLE USAGE EVENT WITH METADATA

3 authors

140 subjects

Page 27: Metadata based statistics for DSpace

KEEPING METADATA IN SYNC

When the metadata of an item changes

• a mistake was corrected

• extra info was added

the statistical records for that item need to be updated as well

Page 28: Metadata based statistics for DSpace

KEEPING METADATA IN SYNC

Item with 7,000 page visits and 5,000 downloads → that means updating 12,000 usage events.

• That takes time

• During that time, it takes longer to view other statistical reports

Page 29: Metadata based statistics for DSpace

PERFORMANCE

Size of single usage event

Metadata updates

Number of events

Live search queries

Page 30: Metadata based statistics for DSpace

PERFORMANCE ENHANCEMENT: SYNCING

Keep the load created by syncing metadata into the statistics as low as possible:

→ only sync while Solr is idle

→ interrupt the operation when a search request can’t be handled in time

→ interrupt the operation when Solr’s memory usage nears its maximum
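One way to picture this is a batched loop that checks the load between batches and gives up early. This is only a sketch under assumptions: `sync_metadata`, the two check callbacks and the batch size are hypothetical, and a single "is Solr idle" check stands in for both the idle test and the slow-search test.

```python
def sync_metadata(pending_events, update_event, solr_is_idle,
                  memory_near_max, batch_size=500):
    """Update usage events in batches; stop early when Solr needs the resources.

    Returns the events that still have to be synced in a later run.
    """
    remaining = list(pending_events)
    while remaining:
        # Re-check the load before every batch, so that searches win.
        if not solr_is_idle() or memory_near_max():
            break
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for event in batch:
            update_event(event)
    return remaining

# Simulated run: Solr is idle for two checks, then a search comes in.
updated = []
idle_checks = iter([True, True, False])
left = sync_metadata(range(12000), updated.append,
                     solr_is_idle=lambda: next(idle_checks),
                     memory_near_max=lambda: False)
print(len(updated), len(left))   # 1000 11000
```

The returned remainder can be picked up the next time Solr is idle, so an interrupted sync loses no work.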

Page 31: Metadata based statistics for DSpace

PERFORMANCE ENHANCEMENT: CACHING

Caching

store generated reports in a separate Solr core

retrieving them is very fast

invalidate cached reports after a set time (e.g. 24 hours)

Page 32: Metadata based statistics for DSpace

PERFORMANCE ENHANCEMENT: CACHING

Don’t delete expired cached reports

If a user requests a report whose cached copy has expired → show the outdated version

In the meantime → generate a new version

Automatically show new report when it’s done
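This serve-stale-then-refresh behaviour can be sketched as a small cache class. It is an illustration only: a dictionary replaces the separate Solr core, the class and method names are hypothetical, and the refresh is triggered by hand where a real system would run it in the background.

```python
import time

class ReportCache:
    """Serve cached reports immediately; regenerate expired ones later."""

    def __init__(self, generate, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._generate = generate
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}        # key -> (report, created_at)
        self.refresh_queue = []   # expired keys waiting for regeneration

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            # Cache miss: nothing to show yet, so generate right away.
            return self._store(key)
        report, created_at = entry
        expired = self._clock() - created_at > self._ttl
        if expired and key not in self.refresh_queue:
            self.refresh_queue.append(key)
        return report             # possibly outdated, but available at once

    def refresh_pending(self):
        """Regenerate every queued report (the 'in the meantime' step)."""
        while self.refresh_queue:
            self._store(self.refresh_queue.pop(0))

    def _store(self, key):
        report = self._generate(key)
        self._entries[key] = (report, self._clock())
        return report

# Usage with a fake clock so that expiry can be simulated.
now = [0]
generated = []

def make_report(key):
    generated.append(key)
    return f"{key} v{len(generated)}"

cache = ReportCache(make_report, ttl_seconds=10, clock=lambda: now[0])
first = cache.get("top-authors")     # miss: generated on the spot
now[0] = 11
stale = cache.get("top-authors")     # expired: old version is still served
cache.refresh_pending()
fresh = cache.get("top-authors")     # new version after the refresh
print(first, "|", stale, "|", fresh)
```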

Page 33: Metadata based statistics for DSpace

EXAMPLE: CACHE MISS

Page 34: Metadata based statistics for DSpace

EXAMPLE: CACHE MISS

Page 35: Metadata based statistics for DSpace

PROBLEM SOLVED?

Additional complexity

Number of usage events

keeps growing

Name variants

Different names for one author

Page 36: Metadata based statistics for DSpace

“Who are the most popular authors in terms of downloads?”

NAME VARIANTS USE CASE

Page 37: Metadata based statistics for DSpace

https://openknowledge.worldbank.org/most-popular/author

Page 38: Metadata based statistics for DSpace

3 name variants:

Ferreira, Francisco H. G.

Ferreira, Francisco H.G.

Ferreira, Francisco

Page 40: Metadata based statistics for DSpace

SOLUTION FOR NAME VARIANTS

Include all name variants in the Solr query:

author_mtdt:("Ferreira, Francisco H. G." OR "Ferreira, Francisco H.G." OR "Ferreira, Francisco")
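A small helper can assemble such a filter from a list of variants. In this sketch each variant is wrapped in double quotes so that the comma and the spaces stay part of one value, and the field name applies to the whole group; the function name `author_variants_filter` is hypothetical.

```python
def author_variants_filter(field, variants):
    """Build one Solr filter that matches any of the given name variants."""
    quoted = []
    for variant in variants:
        # Escape backslashes and double quotes before quoting the value.
        escaped = variant.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:(" + " OR ".join(quoted) + ")"

variants = [
    "Ferreira, Francisco H. G.",
    "Ferreira, Francisco H.G.",
    "Ferreira, Francisco",
]
name_filter = author_variants_filter("author_mtdt", variants)
print(name_filter)
```

The downloads of all three spellings then count towards the same author in one query.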

Page 41: Metadata based statistics for DSpace

ALTERNATIVE SOLUTION

If you have unique IDs (e.g. ORCID)

Index them, and search for them instead

Page 42: Metadata based statistics for DSpace

www.atmire.com

Thank you! Questions?

Page 43: Metadata based statistics for DSpace

Desktop view Phone view

Page 44: Metadata based statistics for DSpace

Desktop view

Phone view

Page 45: Metadata based statistics for DSpace

Desktop view

Phone view