Adding browse to Koha using Solr

Transcript of Adding browse to Koha using Solr

KohaCon12, Edinburgh, June 5th, 2012
Adding browse to Koha using Solr
Stefano Bargioni
Pontifical University Santa Croce, Rome

Slide 1

It's very exciting for me to take part in the Koha Conference for the first time. Many thanks to the Community for everything I have learnt during these days.

Slide: The PUSC Library

Basic data about my library are summarized in this slide. We are very young, since my university was founded only 26 years ago. It was inspired by Saint Josemaría Escrivá, founder of Opus Dei.

Twenty years ago we took part in the foundation of a consortium, URBE, the Roman Union of Ecclesiastical Libraries.

Slide: Why do we need browse at PUSC?

Some LMS implement alphabetically sorted lists of headings (authors, titles, series, subjects and so on) as another kind of search. We do not think it is a must, thanks to the power of simple and advanced searches. However, our users and the nature of our data suggested that we add it to Koha.

Since moving to Koha, our catalog has improved markedly in quality: we added full authority records (we previously had only cross-references), and we started introducing subject headings. This is why we are interested in browsing headings that come from authority records as well as from bibliographic records.

Slide: How do you say?

Ancient authors, Popes, institutions and other kinds of authors, partly because of the cataloguing rules adopted by the library, create the need to help users and cataloguers choose the correct form when searching the catalog.

In the Virtual International Authority File, Dante Alighieri, who wrote the famous Divine Comedy, has hundreds of variant forms. Which is the chosen form in your library?

Slide: Grouping

Clustering and counting headings is another reason to use browse: it is useful for managing and searching series, looking at your catalog through Dewey, and so on.

Slide: Browse Functionalities

What can you ask of a browse tool? Basically, to navigate alphabetically sorted lists.
To do so, you need to extract headings from your catalog, build a sort form, and add information such as, first of all, the usage count.

Slide: Browse requirements

We tried to write a utility with the following requirements. The most important, perhaps, is the ability to include in the same list headings coming from different tags, from authority or bibliographic records.

If I'm not wrong, its implementation is independent of the MARC flavour.

Slide: The engine

We tried using Zebra, but I found it very difficult to configure.

We considered MySQL, but SQL DBMSs do not perform well when asked to extract a small subset of sorted records from a very large set of headings.

Solr was our choice as the search engine, because of its ability to work with facets. And its future integration in Koha could be a win-win for browse.

Slide: The Solr document (1)

Solr uses the document as a metaphor. Every heading we want to include in a list will be a Solr document.

In the Solr schema we defined some fields, which we discuss in the next few slides.

The most important field is the ID. Since we can have identical sort forms within the same list, we cannot use the sort form in the ID. For example, we need to distinguish the title "The Bible" from the title "Bible", even though their sort form is the same, because the non-filing characters strip out the initial article.

The ID is of course what Solr uses to delete or replace a document. It will be discussed in detail later.

Every document belongs to a list; it comes from an authority or a bibliographic record, from a tag, and from an occurrence of that tag. It also has a type: it can be a main heading, a see from, a see also, and so on.

There is no use in storing in the Solr document information about the subfields used to extract the heading. Often every subfield will be extracted, but in other cases we need only some of them.
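The "The Bible" versus "Bible" case above can be illustrated with a minimal Python sketch (the actual loader is written in Perl, and its sort-form algorithm is not reproduced here). It assumes that the number of non-filing characters is known for each heading, as the second indicator of a MARC21 245 field provides, and that normalising simply means lowercasing and collapsing spaces. The function name sort_form is hypothetical.

```python
def sort_form(heading: str, nonfiling: int = 0) -> str:
    """Drop the first `nonfiling` characters, then normalise case and spacing.

    This is an illustrative simplification, not the algorithm shown on the slides.
    """
    filed = heading[nonfiling:]
    return " ".join(filed.casefold().split())


# "The Bible" with 4 non-filing characters ("The" plus the space)
with_article = sort_form("The Bible", nonfiling=4)
without_article = sort_form("Bible")

print(with_article)                     # bible
print(with_article == without_article)  # True
```

Both headings yield the same sort form, which is why the sort form alone cannot serve as the ID.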
The configuration file reflects which subfields are extracted.

Slide: The Solr document (2)

Here is an example of a Solr document for the main author Dante Alighieri. Please note its ID.

Slide: The Solr document (3)

And this is an example of a Solr document for a title. Titles other than uniform titles do not come from authority records. They always have type 'acc', that is, 'main'. Note the ID here too.

Slide: The Solr document (4)

The ID has a complex structure: we build it by concatenating the list name, "a" for authority or "b" for bibliographic, the authid or the biblionumber, the tag, and the zero-based occurrence number.

We think this is a unique identifier. If it is not, only the last heading entered in Solr with a given ID will survive, leading to a silent error.

Slide: The Solr document (5)

This screen shows the algorithm we use to build the sort form.

Maybe there is a better way to generate sort forms, taking into account that Koha is used in many languages and that a single catalog can contain more than one script. Is International Components for Unicode, aka ICU, the solution? I'm not so experienced... sorry.

Slide: Architecture

The architecture is simple: a Solr db is updated with new or modified Koha records.

At the same time, users access the Solr db through the web and a Perl CGI.

Slide: Loading & Synchronizing (1)

An important component of browse is the loader. We wrote it in Perl, and it can run both as the initial bulk loader and as the updater.

It connects to the Koha SQL tables in read mode and adds or updates Solr documents.

Our experience with Solr suggested that we issue commit and optimize commands on a regular basis, to limit memory consumption and keep loading fast. These parameters can vary depending on the server running Solr.

Slide: Loading & Synchronizing (2)

The configuration of the loader can be a large file. I chose XML, but I know that the Koha developer Community prefers YAML.
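As an illustration of the ID structure described under "The Solr document (4)", here is a minimal Python sketch (not the Perl code of the actual loader). The talk says only that the parts are concatenated, so the ":" separator, the function name make_id, the list name "authors" and the record number 1234 are all assumptions made for the example.

```python
def make_id(list_name: str, source: str, record_id: int, tag: str, occurrence: int) -> str:
    """Build a document ID from list name, source, record number, tag and occurrence."""
    if source not in ("a", "b"):  # "a" = authority, "b" = bibliographic
        raise ValueError("source must be 'a' or 'b'")
    # The ":" separator is an assumption; the talk only says "concatenation".
    return ":".join([list_name, source, str(record_id), tag, str(occurrence)])


# A plain dict keyed by ID imitates Solr replacing a document
# when another one with the same ID is added.
index = {}
doc_id = make_id("authors", "a", 1234, "100", 0)
index[doc_id] = {"heading": "first heading"}
index[doc_id] = {"heading": "second heading"}  # silently replaces the first

print(doc_id)      # authors:a:1234:100:0
print(len(index))  # 1
```

The second assignment overwrites the first without any warning, which mirrors the silent error that a non-unique ID would cause.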
Sorry about the XML. The configuration file contains two main sections: one gathers tags coming from authority records, the other tags coming from bibliographic records.

Here are two examples. On the left, MARC21 authority tag 400 is sent to the list of authors, with type "see". Every subfield will be copied. The suffix setting ensures that the heading ends with the specified string.

The example on the right refers to MARC21 bibliographic tag 245, i.e. a title. The skip_indicator setting holds the number of the indicator that contains the non-filing (skip in filing) value.

More preferences are available for each tag, such as required_subfields and omit_subfields. They allow tags to be processed at a finer level of detail.

Slide: Loading & Synchronizing (3)

The Solr db also contains some special documents, whose type is "system". Two timestamps record the start and the end of the update process, while each list has a counter to monitor its usage.

Four MySQL tables are involved. One of them, deleted_auth_header, is new. Whenever an authority record is deleted, a slightly modified C4::AuthoritiesMarc.pm logs the event in this table.

The synchronizing process runs as a cron job. We chose to run it once a minute. A lock file ensures that only one instance runs at a time.

Slide: Querying (1)

To access the lists, we created a new page in Koha, with a link next to the Advanced Search. The screenshot shows the public lists, the "starting from" text field, and the available numbers of results per page.

This page is generated by a Perl CGI script.

Slide: Querying (2)

When listing 5 authors starting from Alighieri, we obtain this result. Each heading can be clicked to access the related documents, whose count is the number in the 3rd column. "See also" and "Used for" headings, if any, are listed in the 4th column.

The red link, available only for authors, starts a search on the rich VIAF catalog. Because VIAF is so complete, this search very often succeeds.
Of course, more links could be added, for instance to the Wikipedia Biography Portal.

The usage count is computed on the fly. It is not stored in the Solr db. For headings coming from authorities, this ensures that clicking the author name shows the exact number of bibliographic records even if the synchronization is not running.

Slide: Querying (3)

When listing titles, the result page contains titles from many tags, including series titles, even though we also have a list with only series titles. To set series titles apart, we added a special gray label.

The usage count for headings that come from bibliographic records is computed by Solr facets. In fact, there will be, for instance, seventeen Solr documents (see the last line) with the same sort form in the titles list.

Slide: Statistics

A special button for statistics is available. It shows fresh counts for each list, as well as the search counts (not shown here). It is a good way to monitor the Solr browse db.

Slide: Security

Solr interaction is driven by HTTP requests. In a standard installation, anybody could access the documents. This is very dangerous.

There are many ways to solve this issue. We chose to manage security by setting a Jetty username and password. Jetty is the application server included in the standard Solr distribution.

Slide: License and portability

This implementation of browse is open source, under the same license as Koha. However, it has not been published yet. It requires more work to become a standard Koha tool, since its author is not a Koha developer but an abecedarian. I know that Claire Hernandez of BibLibre has a lot of experience with Solr. I would be happy to share the source code with her.

Slide: Grazie

Thank you very much to the Koha Community, now in Scotland!