Metadata based statistics for DSpace
Uploaded by bram-luyten
OVERVIEW
1. Why DSpace statistics?
2. Usage event vs. Item metadata
3. Generating metadata based statistics
4. Linking metadata to usage events
5. Performance
6. Problem solved?
1 - WHY DSPACE STATISTICS?
Statistics solution that knows DSpace:
• Structure: “Which are the most downloaded bitstreams in a collection?”
• Metadata: “Who are the most popular authors in terms of downloads?”
USAGE EVENT VS. ITEM METADATA
2 types of metadata:
• Usage event metadata: additional information about the usage event
• Item metadata: additional information about the target of the usage event
USAGE EVENT METADATA
Additional information about the usage event
Not related to repository
Also possible with other statistics solutions:
• IP address
• Country
• User Agent
• HTTP Referrer
• ...
ITEM METADATA
Relate usage event to information stored in your repository.
Allows statistics queries based on item metadata.
→ Not possible with a statistics solution that is not tied to the repository.
GENERATING METADATA BASED STATISTICS
How many downloads did author “Barnes, Douglas F.” get in the last year, grouped by month?
LINKING METADATA TO USAGE EVENTS
Solr Query http://localhost:8080/solr/statistics/select?facet=true&facet.offset=0&facet.mincount=1&facet.sort=false&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view&fq=-isBot:true&fq=-isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes,\+Douglas\+F.)+&wt=javabin&rows=0
facet.field=dateYearMonth
group by the field dateYearMonth
fq=type:+0
only include bitstream downloads
fq=bundleName:ORIGINAL
only include files in bundle “ORIGINAL”
fq=-isBot:true
filter out all bot statistics
fq=-isInternal:true
filter out all internal statistics
fq=time:[2014-07-01+TO+2015-06-06]
only include usage events between July 1st 2014 and June 6th 2015
fq=+(author_mtdt:Barnes,\+Douglas\+F.)+
only include usage events for items by author Barnes, Douglas F.
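The parameters above can also be assembled programmatically. The following is a minimal Python sketch, not the presenters' code: the function name `build_author_downloads_query` and its arguments are made up for illustration, and the field names are taken from the query on the slide. It assumes `author_mtdt` holds the author name as one untokenized value, so spaces are backslash-escaped as in the original query.

```python
from urllib.parse import urlencode


def build_author_downloads_query(author, start, end, months=24):
    """Return the query string for monthly downloads of one author.

    Hypothetical helper; field names follow the Solr query on the slide.
    """
    # Backslash-escape spaces so Solr treats the whole name as one term.
    escaped_author = author.replace(" ", "\\ ")
    params = [
        ("q", "*:*"),
        ("rows", "0"),                      # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),   # group by month
        ("facet.mincount", "1"),
        ("facet.limit", str(months)),
        ("fq", "type:0"),                   # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        ("fq", f"author_mtdt:{escaped_author}"),
    ]
    # A list of pairs keeps the repeated "fq" keys; urlencode escapes them.
    return urlencode(params)


query_string = build_author_downloads_query(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
```

The resulting string would be appended to the `/solr/statistics/select?` endpoint shown in the query above.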
<response>
  <lst name="responseHeader"> ... </lst>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>
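To turn such a response into a per-month table, the facet list can be read with a standard XML parser. This is an illustrative Python sketch only: the helper name `monthly_counts` is hypothetical, the sample reuses the counts from the response above, and the elided `responseHeader` is left out.

```python
import xml.etree.ElementTree as ET

# Counts copied from the response above; responseHeader omitted.
SAMPLE_RESPONSE = """
<response>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int> <int name="2014-08">19</int>
        <int name="2014-09">15</int> <int name="2014-10">10</int>
        <int name="2014-11">7</int>  <int name="2014-12">13</int>
        <int name="2015-01">13</int> <int name="2015-02">15</int>
        <int name="2015-03">21</int> <int name="2015-04">22</int>
        <int name="2015-05">12</int> <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def monthly_counts(xml_text, field="dateYearMonth"):
    """Return {month: count} for one facet field of a Solr XML response."""
    root = ET.fromstring(xml_text)
    facet = root.find(f".//lst[@name='facet_fields']/lst[@name='{field}']")
    if facet is None:
        return {}
    return {entry.get("name"): int(entry.text) for entry in facet.findall("int")}


counts = monthly_counts(SAMPLE_RESPONSE)
total_downloads = sum(counts.values())
```

The monthly counts add up to the `numFound` value of 164, which is a quick sanity check that no month was dropped.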
LINKING METADATA TO USAGE EVENTS
In a vanilla DSpace installation:
• Usage statistics only contain bitstream IDs: no metadata
• The metadata is stored in the database
PROPOSED SOLUTION
1. Query the database for bitstream IDs based on the author metadata
2. Use those IDs to query Solr for statistics
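The two steps can be sketched as follows. This is a simplified Python illustration under stated assumptions: the dictionary `AUTHOR_TO_BITSTREAMS` is a made-up stand-in for the DSpace database, both function names are hypothetical, and the filter relies on the `id` field of the usage event holding the bitstream ID.

```python
# Hypothetical stand-in for the DSpace database: author -> bitstream IDs.
AUTHOR_TO_BITSTREAMS = {
    "Barnes, Douglas F.": [54, 311, 87],
}


def find_bitstream_ids(author):
    """Step 1: look up the bitstreams of every item by this author."""
    return AUTHOR_TO_BITSTREAMS.get(author, [])


def build_id_filter(bitstream_ids):
    """Step 2: turn the IDs into one Solr filter query on the id field."""
    ids = sorted(bitstream_ids)
    if not ids:
        raise ValueError("no bitstreams found, nothing to query")
    return "id:(" + " OR ".join(str(i) for i in ids) + ")"


id_filter = build_id_filter(find_bitstream_ids("Barnes, Douglas F."))

# The filter grows with every bitstream: 5,000 IDs already give a very
# long query string.
long_filter_length = len(build_id_filter(range(1, 5001)))
```

The filter would be sent as an extra `fq` parameter next to `type:0`.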
PROPOSED SOLUTION: DOWNSIDES
• Two queries to answer one question
• The Solr query can get very long and inefficient to execute
• Inefficient but still possible
What if we want to show the 10 authors with the most downloads?
• query the database for all authors
• query Solr to get the number of usage events for each author
• sort those counts, and return the 10 highest
Very inefficient!
• do a lot of queries
• throw away most of the results: we only need top 10
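A small Python sketch makes the cost visible. It is an illustration only: `top_authors_naive` is a hypothetical name, and `count_downloads` stands for whatever function would run one Solr query per author.

```python
import heapq


def top_authors_naive(authors, count_downloads, n=10):
    """Return the n (count, author) pairs with the most downloads.

    count_downloads is called once per author, i.e. one statistics
    query for every author in the repository.
    """
    counts = [(count_downloads(author), author) for author in authors]
    # Everything except the n largest counts is discarded here.
    return heapq.nlargest(n, counts)


# Fake data instead of real Solr queries, to count the calls.
fake_counts = {"Author A": 5, "Author B": 9, "Author C": 1}
queries_made = []


def fake_count_downloads(author):
    queries_made.append(author)
    return fake_counts[author]


top_two = top_authors_naive(fake_counts, fake_count_downloads, n=2)
```

With three authors this already needs three queries for a top 2; a repository with tens of thousands of authors would need one query for each of them.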
SOLR FACETS
To do a facet query:
• specify “facet.field” along with the regular query
• results will be grouped by the values they have for that field
SOLR FACETS: EXAMPLE
q=type:0&facet.field=owningItem
q=type:0
search for all usage events that are bitstream downloads
facet.field=owningItem
group these by item
count the number of records in each group
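What the facet does can be imitated in a few lines of Python. This sketch only mimics the grouping and counting on an in-memory list; it is not how Solr implements faceting, and the sample events are invented. In a real request, `facet=true` also has to be set for `facet.field` to take effect.

```python
from collections import Counter

# Invented usage events; type 0 marks a bitstream download.
events = [
    {"type": 0, "owningItem": [1652]},
    {"type": 0, "owningItem": [1652]},
    {"type": 2, "owningItem": [1652]},   # not a bitstream download
    {"type": 0, "owningItem": [7]},
]


def facet_counts(records, field, matches):
    """Count records per value of `field`, for records passing `matches`."""
    counter = Counter()
    for record in records:
        if matches(record):
            # Multi-valued fields contribute one count per value.
            counter.update(record.get(field, []))
    return counter


# Equivalent in spirit to q=type:0&facet.field=owningItem
downloads_per_item = facet_counts(events, "owningItem", lambda r: r["type"] == 0)
```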
OUR SOLUTION
• Add item metadata to Solr
• Use built-in filtering and grouping
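With the author stored on every usage event, the "top 10 authors" question becomes a single facet query. The parameters below are a hedged Python sketch: `top_authors_params` is a hypothetical helper, and `author_mtdt` is the field name used elsewhere in these slides. `facet.sort=count` and `facet.limit` are standard Solr facet parameters.

```python
def top_authors_params(n=10):
    """Build parameters for one query returning the n most downloaded authors."""
    return {
        "q": "type:0",                 # bitstream downloads
        "rows": "0",                   # no documents, only facet counts
        "facet": "true",
        "facet.field": "author_mtdt",  # group by author metadata
        "facet.sort": "count",         # highest counts first
        "facet.limit": str(n),         # Solr returns only the top n
        "facet.mincount": "1",
    }


params = top_authors_params(10)
```

Sorting and truncating happen inside Solr, so no per-author queries are needed and no results are thrown away on the client.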
CHALLENGE: SIZE OF THE SOLR CORE
That solution creates new challenges:
• Metadata is duplicated in every statistical record
• That takes up a lot of space
• It needs to be kept in sync
SIZE OF SINGLE USAGE EVENT
<doc>
  <str name="ip">177.21.194.80</str>
  <arr name="ip_search"><str>177.21.194.80</str></arr>
  <arr name="ip_ngram"><str>177.21.194.80</str></arr>
  <int name="type">0</int>
  <int name="id">54</int>
  <date name="time">2015-05-11T04:33:49.077Z</date>
  <str name="dateYearMonth">2015-05</str>
  <str name="dateYear">2015</str>
  <str name="continent">SA</str>
  <str name="countryCode">BR</str>
  <float name="latitude">-10.0</float>
  <float name="longitude">-55.0</float>
  <arr name="bundleName"><str>ORIGINAL</str></arr>
  <arr name="containerBitstream"><int>54</int></arr>
  <arr name="owningItem"><int>1652</int></arr>
  <arr name="containerItem"><int>1652</int></arr>
  <arr name="owningColl"><int>14</int></arr>
  <arr name="containerCollection"><int>14</int></arr>
  <arr name="owningComm"><int>1</int></arr>
  <arr name="containerCommunity"><int>1</int></arr>
  <str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
  <bool name="isBot">false</bool>
  <bool name="isInternal">false</bool>
  <str name="statistics_type">view</str>
  <long name="_version_">1501767933804675072</long>
</doc>
25 elements
SIZE OF SINGLE USAGE EVENT WITH METADATA
<doc>
  <str name="ip">177.21.194.80</str>
  ...
  <arr name="author_mtdt">
    <str>Khandker, Shahidur R.</str>
    <str>Barnes, Douglas F.</str>
    <str>Samad, Hussain A.</str>
  </arr>
  <arr name="subject_mtdt">
    <str>ACCESS TO LIGHTING</str>
    <str>ACCESS TO MODERN ENERGY</str>
    <str>AGRICULTURAL LAND</str>
    <str>AGRICULTURAL RESIDUE</str>
    <str>AIR CONDITIONERS</str>
    <str>AIR POLLUTION</str>
    <str>ALTERNATIVE ENERGY</str>
    <str>ALTERNATIVE SOURCES OF ENERGY</str>
    <str>APPROACH</str>
    <str>ATMOSPHERE</str>
    <str>AVAILABILITY</str>
    <str>BASIC ENERGY</str>
    <str>BIOMASS</str>
    <str>BIOMASS BURNING</str>
    <str>BIOMASS COLLECTION</str>
    <str>BIOMASS CONSUMPTION</str>
    <str>BIOMASS ENERGY</str>
    ...
    <str>WORLD ENERGY</str>
    <str>WORLD ENERGY OUTLOOK</str>
  </arr>
  ...
</doc>
3 authors
140 subjects
KEEPING METADATA IN SYNC
When the metadata of an item changes
• a mistake was corrected
• extra info was added
the statistical records for that item need to be updated as well
Item with 7,000 page visits and 5,000 downloads → that means updating 12,000 usage events.
• That takes time
• During that time, it takes longer to view other statistical reports
PERFORMANCE
Size of single usage event
Metadata updates
Number of events
Live search queries
PERFORMANCE ENHANCEMENT: SYNCING
Keep the load created by syncing metadata into the statistics as low as possible:
• only sync while Solr is idle
• interrupt the operation when a search request can’t be handled in time
• interrupt the operation when Solr’s memory usage nears its maximum
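One way to picture this is a batch loop that checks the interruption conditions before every batch. The Python sketch below is an assumption about how such a loop could look, not the actual implementation: `sync_metadata` and the three callables passed to it are hypothetical, and how idleness or memory pressure is measured is left open.

```python
def sync_metadata(pending_batches, apply_batch, solr_is_idle, memory_near_limit):
    """Apply update batches while Solr is idle.

    Returns the batches that were not applied, so a later run can
    resume where this one was interrupted.
    """
    remaining = list(pending_batches)
    while remaining:
        # Interrupt as soon as searches are waiting or memory runs short.
        if not solr_is_idle() or memory_near_limit():
            break
        apply_batch(remaining.pop(0))
    return remaining


# Simulated run: Solr is idle for two checks, then a search comes in.
idle_checks = iter([True, True, False])
applied = []
left_over = sync_metadata(
    pending_batches=["batch-1", "batch-2", "batch-3", "batch-4"],
    apply_batch=applied.append,
    solr_is_idle=lambda: next(idle_checks),
    memory_near_limit=lambda: False,
)
```

Keeping batches small limits how long a search request has to wait before the loop notices it and steps aside.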
PERFORMANCE ENHANCEMENT: CACHING
• store generated reports in a separate Solr core
• retrieving them is very fast
• invalidate cached reports after a set time (e.g. 24 hours)
Don’t delete expired cached reports:
• If a user requests a report whose cached copy has expired → show the outdated version
• In the meantime → generate a new version
• Automatically show the new report when it’s done
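This "serve stale, refresh in the background" behaviour can be sketched in Python as below. It is a simplified illustration with several assumptions: the class `ReportCache` is hypothetical, a dictionary stands in for the separate Solr core, and the background refresh is modelled as an explicit `refresh_pending()` call rather than a real worker.

```python
import time


class ReportCache:
    """Keeps expired reports and serves them until a fresh one is ready."""

    def __init__(self, generate, ttl_seconds=24 * 3600, clock=time.time):
        self._generate = generate      # function: report key -> report
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (report, created_at)
        self.refresh_queue = []        # keys waiting for regeneration

    def get(self, key):
        """Return (report, is_stale). A miss generates the report at once."""
        entry = self._entries.get(key)
        if entry is None:
            report = self._generate(key)
            self._entries[key] = (report, self._clock())
            return report, False
        report, created_at = entry
        is_stale = self._clock() - created_at > self._ttl
        if is_stale and key not in self.refresh_queue:
            self.refresh_queue.append(key)   # regenerate later
        return report, is_stale

    def refresh_pending(self):
        """Regenerate all queued reports (would run in the background)."""
        while self.refresh_queue:
            key = self.refresh_queue.pop(0)
            self._entries[key] = (self._generate(key), self._clock())


# Simulated clock so the example does not have to wait 24 hours.
now = [0]
cache = ReportCache(lambda key: f"{key} generated at t={now[0]}",
                    ttl_seconds=10, clock=lambda: now[0])
first = cache.get("top-authors")       # miss: generated immediately
now[0] = 20
second = cache.get("top-authors")      # expired: old version is served
cache.refresh_pending()
third = cache.get("top-authors")       # fresh version is now available
```

The `is_stale` flag is what a front end could use to decide whether to poll for the new report and swap it in.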
EXAMPLE: CACHE MISS
PROBLEM SOLVED?
Additional complexity:
• Number of usage events keeps growing
• Name variants: different names for one author
“Who are the Most Popular Authors in terms of downloads?”
NAME VARIANTS USE CASE
https://openknowledge.worldbank.org/most-popular/author
3 name variants:
• Ferreira, Francisco H. G.
• Ferreira, Francisco H.G.
• Ferreira, Francisco
SOLUTION FOR NAME VARIANTS
Include all name variants in the Solr query:
author_mtdt:("Ferreira, Francisco H. G." OR "Ferreira, Francisco H.G." OR "Ferreira, Francisco")
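Such a clause can be generated from a list of known variants. The helper below is a hypothetical Python sketch, not part of DSpace: it assumes `author_mtdt` stores each name as one untokenized value, quotes every variant so the commas and spaces stay inside one term, and keeps all variants inside a single field group.

```python
def name_variants_clause(field, variants):
    """Build field:("A" OR "B" ...) matching any of the name variants."""
    if not variants:
        raise ValueError("at least one name variant is required")
    quoted = []
    for name in variants:
        # Escape backslashes and double quotes inside the quoted phrase.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:(" + " OR ".join(quoted) + ")"


clause = name_variants_clause(
    "author_mtdt",
    ["Ferreira, Francisco H. G.", "Ferreira, Francisco H.G.", "Ferreira, Francisco"],
)
```

Grouping the variants inside one pair of parentheses matters: with the standard Solr query parser, a field prefix applies only to the term or group that directly follows it.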
ALTERNATIVE SOLUTION
If you have unique IDs (e.g. ORCID):
index them and search for those instead