Metadata based statistics for DSpace

www.atmire.com

Metadata based usage statistics


OVERVIEW

1. Why DSpace statistics?

2. Usage event vs. Item metadata

3. Generating metadata based statistics

4. Linking metadata to usage events

5. Performance

6. Problem solved?

1 - WHY DSPACE STATISTICS?

Statistics solution that knows DSpace:

Structure

“Which are the most downloaded bitstreams in a collection?”

Metadata

“Who are the most popular authors in terms of downloads?”

USAGE EVENT VS. ITEM METADATA

2 types of metadata:

Usage event metadata

Additional information about the usage event

Item metadata

Additional information about the target of the usage event

USAGE EVENT METADATA

Additional information about the usage event

Not related to repository

Also possible with other statistics solutions:

• IP address

• Country

• User Agent

• HTTP Referrer

• ...

ITEM METADATA

Relate usage event to information stored in your repository.

Allows statistics queries based on item metadata.

→ Not possible with a statistics solution that is not tied to the repository.

GENERATING METADATA BASED STATISTICS

How many downloads did author “Barnes, Douglas F.” get in the last year, grouped by month?

LINKING METADATA TO USAGE EVENTS

Solr Query http://localhost:8080/solr/statistics/select?facet=true&facet.offset=0&facet.mincount=1&facet.sort=false&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view&fq=-isBot:true&fq=-isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes,\+Douglas\+F.)+&wt=javabin&rows=0


facet.field=dateYearMonth → group by the field dateYearMonth

fq=type:+0 → only include bitstream downloads

fq=bundleName:ORIGINAL → only include files in bundle “ORIGINAL”

fq=-isBot:true → filter out all bot statistics

fq=-isInternal:true → filter out all internal statistics

fq=time:[2014-07-01+TO+2015-06-06] → only include usage events between Jul 1st 2014 and Jun 6th 2015

fq=+(author_mtdt:Barnes,\+Douglas\+F.)+ → only include usage events for items by author Barnes, Douglas F.
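The parameters above can be assembled in code rather than by hand. A minimal sketch, assuming the same Solr URL and field names as the query on this slide; the function name is hypothetical, the author is matched as a quoted phrase instead of with backslash-escaped spaces, and wt=json replaces wt=javabin for readability:

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

# Same endpoint as in the slide; adjust for a real installation.
SOLR_SELECT = "http://localhost:8080/solr/statistics/select"


def downloads_per_month_url(author, start, end):
    """Build a facet query URL: downloads for one author, grouped by month."""
    # Escape backslashes and quotes so the name stays one quoted phrase.
    phrase = author.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", "*:*"),
        ("rows", "0"),                       # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),    # group by month
        ("facet.mincount", "1"),
        ("facet.limit", "24"),
        ("fq", "type:0"),                    # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        ("fq", f'author_mtdt:"{phrase}"'),
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = downloads_per_month_url(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

urlencode takes care of the URL escaping, so the filter values can be written the way Solr reads them.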


In a vanilla DSpace installation:

• Usage statistics only contain bitstream IDs: no metadata

• The metadata is stored in the database

PROPOSED SOLUTION

1. Query the database for bitstream IDs based on the author metadata

2. Use those IDs to query Solr for statistics
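A minimal sketch of step 2, assuming step 1 has already returned the bitstream IDs from the database. The function name is hypothetical; the fields type and id are the ones visible in the usage event records:

```python
def bitstream_filter(bitstream_ids):
    """Build a Solr filter that matches downloads of the given bitstreams."""
    ids = sorted(set(bitstream_ids))
    if not ids:
        raise ValueError("no bitstream IDs: the author has no files")
    # One OR clause per bitstream: the filter grows with every file.
    return "type:0 AND id:(" + " OR ".join(str(i) for i in ids) + ")"


small = bitstream_filter([54, 55, 56])
large = bitstream_filter(range(1, 5001))  # 5,000 made-up IDs
print(small)
print(len(large), "characters in the filter for 5,000 bitstreams")
```

For an author with only a few files this stays readable. For 5,000 bitstreams the filter alone runs to tens of thousands of characters.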

PROPOSED SOLUTION: DOWNSIDES

• Two queries to answer one question

• The Solr query can get very long and inefficient to execute

• Inefficient but still possible


What if we want to show the 10 authors with the most downloads?

• query the database for all authors

• query Solr to get the number of usage events for each author

• sort those counts, and return the 10 highest


Very inefficient!

• do a lot of queries

• throw away most of the results: we only need top 10
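The cost of this approach can be sketched as follows. Everything here is a stand-in: count_downloads plays the role of "one Solr query per author", and the author names and counts are made up.

```python
def top_authors_naive(authors, count_downloads, n=10):
    """Return the n authors with the most downloads, one count query per author."""
    counts = [(author, count_downloads(author)) for author in authors]
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:n]


# Made-up data standing in for the repository and the statistics core.
fake_counts = {f"Author {i}": (i * 3) % 101 for i in range(1, 501)}
queries_sent = []


def count_downloads(author):
    queries_sent.append(author)  # each call represents one Solr request
    return fake_counts[author]


top10 = top_authors_naive(list(fake_counts), count_downloads)
print(len(queries_sent), "queries sent,", len(top10), "results kept")
```

With 500 authors, 500 queries are sent and 490 of the answers are discarded.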

SOLR FACETS

To do a facet query:

• specify "facet.field" (with facet=true) along with the regular query

• results will be grouped by the values they have for that field

SOLR FACETS: EXAMPLE

q=type:0&facet.field=owningItem

q=type:0

search for all usage events that are bitstream downloads

facet.field=owningItem

group these by item

count the number of records in each group
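What the facet computes can be imitated in a few lines of plain Python. This is only an illustration of the grouping and counting, not how Solr implements it; the events and the second item ID are made up.

```python
from collections import Counter


def facet_counts(events, facet_field, **filters):
    """Count matching events per value of facet_field, like a field facet."""
    counts = Counter()
    for event in events:
        if all(event.get(key) == value for key, value in filters.items()):
            values = event.get(facet_field, [])
            if not isinstance(values, list):
                values = [values]  # treat single values like multi-valued fields
            counts.update(values)
    return counts


events = [
    {"type": 0, "owningItem": [1652]},
    {"type": 0, "owningItem": [1652]},
    {"type": 0, "owningItem": [2001]},
    {"type": 2, "owningItem": [1652]},  # not a bitstream download: filtered out
]

downloads_per_item = facet_counts(events, "owningItem", type=0)
print(downloads_per_item.most_common())
```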

OUR SOLUTION

• Add item metadata to Solr.

• Use built-in filtering and grouping
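With the author stored on every usage event, the "10 authors with the most downloads" question becomes a single facet query. A minimal sketch: the function names are hypothetical, author_mtdt is the field used elsewhere in these slides, and the sample response carries made-up counts in the flat value/count list that Solr returns for field facets by default with wt=json.

```python
from urllib.parse import urlencode


def top_authors_params(limit=10):
    """Parameters for one query that returns the most downloaded authors."""
    return [
        ("q", "*:*"),
        ("rows", "0"),
        ("fq", "type:0"),                # bitstream downloads only
        ("fq", "-isBot:true"),
        ("facet", "true"),
        ("facet.field", "author_mtdt"),  # group by author
        ("facet.sort", "count"),         # highest counts first
        ("facet.limit", str(limit)),     # keep only the top N
        ("wt", "json"),
    ]


def parse_facet(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Made-up response fragment, for illustration only.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "author_mtdt": ["Barnes, Douglas F.", 120, "Khandker, Shahidur R.", 95]
        }
    }
}

print(urlencode(top_authors_params()))
print(parse_facet(sample_response, "author_mtdt"))
```

Sorting and truncating happen inside Solr, so only the rows that will be shown are sent back.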

CHALLENGE: SIZE OF THE SOLR CORE

That solution creates new challenges

Metadata is duplicated in every statistical record

that takes up a lot of space

and it needs to be kept in sync

SIZE OF SINGLE USAGE EVENT

<doc>
  <str name="ip">177.21.194.80</str>
  <arr name="ip_search"><str>177.21.194.80</str></arr>
  <arr name="ip_ngram"><str>177.21.194.80</str></arr>
  <int name="type">0</int>
  <int name="id">54</int>
  <date name="time">2015-05-11T04:33:49.077Z</date>
  <str name="dateYearMonth">2015-05</str>
  <str name="dateYear">2015</str>
  <str name="continent">SA</str>
  <str name="countryCode">BR</str>
  <float name="latitude">-10.0</float>
  <float name="longitude">-55.0</float>
  <arr name="bundleName"><str>ORIGINAL</str></arr>
  <arr name="containerBitstream"><int>54</int></arr>
  <arr name="owningItem"><int>1652</int></arr>
  <arr name="containerItem"><int>1652</int></arr>
  <arr name="owningColl"><int>14</int></arr>
  <arr name="containerCollection"><int>14</int></arr>
  <arr name="owningComm"><int>1</int></arr>
  <arr name="containerCommunity"><int>1</int></arr>
  <str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
  <bool name="isBot">false</bool>
  <bool name="isInternal">false</bool>
  <str name="statistics_type">view</str>
  <long name="_version_">1501767933804675072</long>
</doc>

25 elements

SIZE OF SINGLE USAGE EVENT WITH METADATA

<doc>
  <str name="ip">177.21.194.80</str>
  ...
  <arr name="author_mtdt">
    <str>Khandker, Shahidur R.</str>
    <str>Barnes, Douglas F.</str>
    <str>Samad, Hussain A.</str>
  </arr>
  <arr name="subject_mtdt">
    <str>ACCESS TO LIGHTING</str>
    <str>ACCESS TO MODERN ENERGY</str>
    <str>AGRICULTURAL LAND</str>
    <str>AGRICULTURAL RESIDUE</str>
    <str>AIR CONDITIONERS</str>
    <str>AIR POLLUTION</str>
    <str>ALTERNATIVE ENERGY</str>
    <str>ALTERNATIVE SOURCES OF ENERGY</str>
    <str>APPROACH</str>
    <str>ATMOSPHERE</str>
    <str>AVAILABILITY</str>
    <str>BASIC ENERGY</str>
    <str>BIOMASS</str>
    <str>BIOMASS BURNING</str>
    <str>BIOMASS COLLECTION</str>
    <str>BIOMASS CONSUMPTION</str>
    <str>BIOMASS ENERGY</str>
    ...
    <str>WORLD ENERGY</str>
    <str>WORLD ENERGY OUTLOOK</str>
  </arr>
  ...
</doc>

3 authors

140 subjects

KEEPING METADATA IN SYNC

When the metadata of an item changes

• a mistake was corrected

• extra info was added

the statistical records for that item need to be updated as well

KEEPING METADATA IN SYNC

Item with 7,000 page visits and 5,000 downloads → that means updating 12,000 usage events.

• That takes time

• During that time, it takes longer to view other statistical reports

PERFORMANCE

Size of single usage event

Metadata updates

Amount of events

Live search queries

PERFORMANCE ENHANCEMENT: SYNCING

Try to keep the load created by syncing metadata into the statistics as low as possible:

→ only sync while Solr is idle

→ interrupt the operation when a search request can’t be handled in time

→ interrupt the operation when Solr’s memory usage nears its max
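The three rules above can be sketched as a loop that works through small batches and stops as soon as Solr is needed elsewhere. All names are hypothetical; the three checks are passed in as callables because the slides do not say how idleness, slow searches or memory pressure are measured.

```python
def sync_in_batches(batches, apply_batch, solr_is_idle, search_is_slow, memory_near_max):
    """Apply metadata updates batch by batch; stop early when Solr is under load.

    Returns the number of batches applied, so a later run can resume from there.
    """
    done = 0
    for batch in batches:
        # Check before every batch so an interruption costs at most one batch.
        if not solr_is_idle() or search_is_slow() or memory_near_max():
            break
        apply_batch(batch)
        done += 1
    return done


# Made-up scenario: 12 batches of usage events, Solr gets busy after 5 of them.
batches = [list(range(i * 1000, (i + 1) * 1000)) for i in range(12)]
applied = []
finished = sync_in_batches(
    batches,
    apply_batch=applied.append,
    solr_is_idle=lambda: len(applied) < 5,
    search_is_slow=lambda: False,
    memory_near_max=lambda: False,
)
print(finished, "of", len(batches), "batches synced before the interruption")
```

Small batches keep each interruption cheap: the remaining batches are simply picked up on the next idle period.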

PERFORMANCE ENHANCEMENT: CACHING

Caching

store generated reports in a separate Solr core

retrieving them is very fast

invalidate cached reports after a set time (e.g. 24 hours)

PERFORMANCE ENHANCEMENT: CACHING

Don’t delete expired cached reports

If a user requests a report whose cached version has expired → show the outdated version

In the meantime → generate a new version

Automatically show new report when it’s done
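The behaviour on these two caching slides can be sketched as a small cache class. This is an in-memory illustration with hypothetical names; the slides store cached reports in a separate Solr core, which is replaced here by a plain dict, and the background job is replaced by a schedule callable.

```python
import time


class ReportCache:
    """Serve cached reports; when one has expired, serve it anyway and refresh it."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.time, schedule=None):
        self.ttl = ttl_seconds
        self.clock = clock
        # schedule(task) should run task in the background; by default it runs inline.
        self.schedule = schedule or (lambda task: task())
        self.entries = {}        # key -> (report, created_at)
        self.refreshing = set()  # keys with a refresh already under way

    def get(self, key, generate):
        """Return (report, is_outdated)."""
        entry = self.entries.get(key)
        if entry is None:
            # Cache miss: nothing to show yet, so generate right away.
            report = generate()
            self.entries[key] = (report, self.clock())
            return report, False
        report, created_at = entry
        expired = self.clock() - created_at > self.ttl
        if expired and key not in self.refreshing:
            self.refreshing.add(key)
            self.schedule(lambda: self._refresh(key, generate))
        return report, expired

    def _refresh(self, key, generate):
        self.entries[key] = (generate(), self.clock())
        self.refreshing.discard(key)
```

A caller that gets is_outdated back as True can show the old report immediately and swap in the new one once the refresh has finished.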

EXAMPLE: CACHE MISS

PROBLEM SOLVED?

Additional complexity

Number of usage events keeps growing

Name variants

Different names for one author

“Who are the Most Popular Authors in terms of downloads?”

NAME VARIANTS USE CASE

https://openknowledge.worldbank.org/most-popular/author

3 name variants:

• Ferreira, Francisco H. G.

• Ferreira, Francisco H.G.

• Ferreira, Francisco

SOLUTION FOR NAME VARIANTS

include all name variants in Solr query:

author_mtdt:("Ferreira, Francisco H. G." OR "Ferreira, Francisco H.G." OR "Ferreira, Francisco")
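A minimal sketch of building such a filter from a list of variants. The function name is hypothetical. Each variant is quoted so that its commas and spaces stay part of one term, and all variants sit inside one pair of parentheses so that the field name applies to each of them.

```python
def author_variants_filter(variants, field="author_mtdt"):
    """Build a Solr filter that matches any of the given name variants."""
    if not variants:
        raise ValueError("at least one name variant is required")
    quoted = []
    for name in variants:
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:(" + " OR ".join(quoted) + ")"


variants = [
    "Ferreira, Francisco H. G.",
    "Ferreira, Francisco H.G.",
    "Ferreira, Francisco",
]
print(author_variants_filter(variants))
```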

ALTERNATIVE SOLUTION

If you have unique IDs (e.g. ORCID)

Index those IDs and search on them instead

www.atmire.com

Thank you! Questions?


