AN INVESTIGATION INTO QUERIES SUBMITTED TO THE
EUROPEANA WEB PORTAL
A study submitted in partial fulfillment
of the requirements for the degree of
Masters (MA) in Librarianship
at
THE UNIVERSITY OF SHEFFIELD
by
EMMA C.M. SILVEY
September 2012
Structured Abstract
Background
Online availability significantly affects how people find and use different information
resources (see, e.g. Levene, 2006). This study considers information-seeking in the cultural
heritage environment, characterised by its especially wide range of users (see, e.g.
Chaudhry & Jiun, 2005). It focuses on the Europeana Web portal (Europeana, [2012?]a),
which allows users to easily locate diverse information types and formats from different
providers.
Aims
The study aims to investigate users’ information-seeking behaviour, as revealed through
queries submitted to the Europeana portal. Of particular concern are filter usage, indicating
query refinement, and query topics shown by query classification. Potential practical
outcomes include informing how cultural heritage information is provided online, including
suggested search functionalities.
Methods
The study utilises query log analysis, concentrating initially on search filter specifications.
Further datasets contain query text and frequencies: the 150 most popular queries, random
samples (total 300 queries) and 100 queries filtered by media type. A subject-based query
classification scheme is developed and presented for the popular and random queries, then
evaluated through classification of media-filtered queries. Other aspects, such as query
language, are also considered.
Results
Approximately one-third of Filters Dataset queries are filtered, primarily by media type. Of
the classified datasets, frequent queries most often concern collections or places; whilst
place-related queries remain popular, random query samples contain far fewer collection-
based queries. Across different media, proportions of queries classified as ‘Music, Film or
Theatre’ are especially varied.
Conclusions
Overall, study findings such as the prominence of place-related queries dovetail well with
existing literature (see, e.g. Jansen et al., 2011). The study classification scheme
nevertheless indicates the importance of subject-based search and browse support. It is
therefore recommended that Europeana incorporates greater functionality concerning
query topics. More detailed consideration of individual users’ search patterns is suggested
as a future research area.
(Abstract Word Count: 297 words)
Acknowledgements
I would like to thank my dissertation supervisor, Dr Paul Clough, for his help and support
throughout this project and Dr Mark Hall for providing the query log data for analysis.
Acknowledgement is also due to the Arts and Humanities Research Council (AHRC) for
funding my MA Librarianship programme of study at the University of Sheffield (2011-
2012).
Table of Contents
Structured Abstract
Background
Aims
Methods
Results
Conclusions
Acknowledgements
Tables and Figures
Tables
Figures
Chapter 1: Introduction and Aims
1.1 The Information Environment
1.2 The Europeana Web Portal
1.3 Study Aim and Objectives
1.3.1 Aim
1.3.2 Objectives
Chapter 2: Literature Review
2.1 The Modern Information Environment
2.2 Investigating Search
2.3 Query Log Analysis
2.4 Applications of Query Log Analysis
2.5 Query Classification
2.6 Cultural Heritage Information
2.7 Query Log Analysis and Query Classification in Cultural Heritage
Chapter 3: Methodology
3.1 Background and Theoretical Basis
3.1.1 Query Log Analysis
3.1.2 Query Classification
3.2 Data Collection and Analytical Approach
3.2.1 Datasets
3.2.2 Data Analysis and Classification Scheme Development
Chapter 4: Results
4.1 Filters Dataset
4.1.1 Filters and Query ‘Selections’
4.1.2 Filter Specification
4.2 Popular Query Analysis
4.2.1 Dataset Characteristics
4.2.2 Query ‘Selections’ and Language
4.3 Random Sample Query Analysis
4.3.1 Dataset Characteristics
4.3.2 Query ‘Selections’ and Language
4.3.3 Query Classification: Final Scheme Refinement
4.3.4 Classification Scheme Mapping
4.4 Comparison of Frequent and Random Query Classification Patterns
4.5 Queries Filtered by Media Type
Chapter 5: Discussion
5.1 Europeana Querying Patterns: Filters, Query ‘Selections’ and Languages
5.2 Classification Scheme Development
Chapter 6: Conclusion
6.1 Fulfilment of Study Objectives (Section 1.3.2)
6.2 Recommendations for Cultural Heritage Information Provision
6.3 Study Limitations and Areas for Future Research
References
Appendix
Appendix 1: A Summary of the Classification Scheme Developed for Filters Dataset ‘text’ Query Refinements
Appendix 2: Results of Exploratory Mapping between Preliminary Classification ‘Primary Category’ Terms and Library of Congress Subject Headings (Library of Congress, [2012?])
Appendix 3: A Summary of the Study Classification Scheme Following Refinement based on Popular Queries Dataset Analysis
Appendix 4: Results of Potential Mapping between Final Classification Scheme Primary, Secondary and Tertiary Category terms and Existing Schemes
Information School: Address & Other Confirmations
University of Sheffield - Information School: First Employment Destination Details
for School Records
Information School: Access to Dissertation
Tables and Figures
Tables
Table 1: Baseline classification of Filters Dataset ‘text’ query selections
Table 2: The ten most popular options specified using the ‘LANGUAGE’ filter
Table 3: The ten most popular options specified using the ‘COUNTRY’ filter
Table 4: Language/Country pairs and associated ranks from Tables 2 and 3
Table 5: The ten most popular options specified using the ‘PROVIDER’ filter
Table 6: A summary of the final study classification scheme following refinement based on analysis of Random Queries datasets
Table 7: Categorisation percentages for queries in different study datasets using the Primary Categories of the study classification scheme
Table 8: Comparative ranks for queries in different datasets classified using the Primary Categories of the study classification scheme
Table 9: Categorisation percentages for popular Europeana queries filtered by media type (Europeana, [2012?]d) using the Primary Categories of the study classification scheme
Figures
Figure 1: A screenshot of a Europeana results page following query submission (with query ‘faust’), showing filtering options down the left hand side (Europeana, 2012)
Figure 2: A screenshot of the Europeana Web portal Homepage (Europeana, [2012?]a), showing the ‘Explore’ option along the top of the screen (Europeana, [2012?]e: n.p.)
Figure 3: A pie chart showing usage proportions (%) of Europeana filters specified in the Filters Dataset
Figure 4: A pie chart showing the usage proportions (%) of media type options specified in the Filters Dataset ‘TYPE’ filter
Figure 5: A chart showing frequency of usage (aggregated by century) for the Filters Dataset ‘YEAR’ filter
Figure 6: Popular Queries dataset frequency distribution, excluding the top-ranked result
Figure 7: Usage frequencies for different query ‘selections’ in the Popular Queries dataset
Figure 8: Language frequencies for the Popular Queries dataset
Figure 9: Frequency distributions for both Random Queries datasets
Figure 10: Usage frequencies for different query ‘selections’ in the Random Queries datasets
Figure 11: Language frequencies for the Random Queries datasets
Figure 12: Frequency distributions for the most popular queries specifying different Europeana media-based filtering options (Europeana, [2012?]d)
Figure 13: Frequency distributions for the most popular queries specifying different Europeana media-based filtering options (Europeana, [2012?]d), excluding the top-ranked ‘Text’ query
Figure 14: Primary Category percentages for popular Europeana queries filtered by media type (Europeana, [2012?]d)
Chapter 1: Introduction and Aims
1.1 The Information Environment
Bowler et al.’s (2011: 746) characterisation of an emerging “knowledge era” indicates the
vital importance of information in modern society; technological developments have
transformed both information provision and information-seeking behaviour (Hearst, 2009;
Jansen et al., 2011; Levene, 2006; Nicholas & Clark, 2012; Purday, 2010). In particular, rising
available volumes of Web-based information have prompted the emergence of search
services that aim to help users locate relevant material (see, e.g. Broder, 2002; Clough,
2009; Eirinaki & Vazirgiannis, 2003; Jansen & Pooch, 2001; Jansen & Spink, 2006; Levene,
2006).
Two important recent developments in online services and information provision are Web
portals, providing access to information aggregated from different providers, and
transaction-based Websites, facilitating greater interactivity between users and systems
(Agosti et al., 2012; Eirinaki & Vazirgiannis, 2003; Hearst, 2009; Levene, 2006). The former
can include specialised portals, for example tailored to different users or areas of interest
(see, e.g. Agosti et al., 2012; Kirchoff et al., 2008; Minelli et al., 2007; Ott & Pozzi, 2011;
Purday, 2009; Voorbij, 2010).
This study is situated at the interface between modern information provision and the
cultural heritage sphere. Cultural heritage information, encompassing current and historical
multimedia material, appeals to diverse users in research, education, professional and
personal interest contexts (Chaudhry & Jiun, 2005; Concordia et al., 2010; Europeana,
[2012?]c; Kirchoff et al., 2008; Meyer et al., 2007; Minelli et al., 2007; Purday, 2009).
Kirchoff et al. (2008: 255) further state that “the quantity of cultural information on the
Internet is growing rapidly”, illustrating how cultural heritage organisations are taking
advantage of new technologies to promote their resources online, for example through the
Europeana portal (Europeana, [2012?]a).
1.2 The Europeana Web Portal
Launched initially in 2008, Europeana is described as “Europe’s digital library, archive and
museum” (Purday, 2009: 919). Contributors to the Web portal, which enables “access
to...over 23 million objects” (Europeana, [2012?]c: n.p.), represent different geographical
areas and domains of “cultural and scientific heritage” (Europeana, [2012?]b: n.p.). Indeed,
the Europeana Strategic Plan 2011-2015 states the service’s “aim to give access to all of
Europe’s digitised cultural heritage by 2025” (Europeana, [2011?]a: 5). The portal therefore
has multiple distinct current and potential user groups (Concordia et al., 2010; Europeana,
[2011?]a; Europeana, [2012?]c; Purday, 2009). Funding is received from a variety of
sources, including the European Commission (Purday, 2009: 932. See also: Europeana,
[2011?]b).
The portal provides a variety of search and browse functionalities, including “a multilingual
interface” (Purday, 2009: 919) that reflects its international focus (see also: Europeana,
[2011?]a; Europeana, [2011?]b; Europeana, [2012?]c). The service has a strong interest in
keeping up-to-date with online information-seeking trends, including consideration of how
to support different stages of the search process (see, e.g. Concordia et al., 2010;
Europeana, [2011?]b). For example, a report by CIBER Research Ltd. (2011: 6) focuses
especially on developing mobile access to Europeana, since several important search
functionalities are not supported by the current mobile interface; this area is also
highlighted in the Europeana Business Plan 2012 (Europeana, [2011?]b: 16. See also:
Nicholas & Clark, 2012).
Discoverability is an additional concern. The CIBER Research Ltd. (2011: 12) report
emphasises that “it is now possible to do a detailed search for Europeana content using a
popular search engine like Google”, therefore overcoming a common problem for earlier
digital libraries (see also: Europeana, [2011?]a; Nicholas & Clark, 2012). Subject-based
access is also considered, including the potential for future provision of “thematic browse
entry points” (Europeana, [2011?]b: 16), alongside the incorporation of new multimedia
content like “3D visualisations” (Europeana, [2011?]a: 12).
The Europeana service is not limited to the Web portal. Indeed, Concordia et al. (2010: 61)
state that “the main goal of Europeana is...to build an open services platform”, noting its
contribution to partners and the wider cultural heritage sphere alongside end users (see
also: Europeana, [2011?]a; Europeana, [2011?]b). Europeana’s Strategic Plan 2011-2015,
for example, identifies “four strategic tracks – aggregate, facilitate, distribute and engage”
(Europeana, [2011?]a: 11), highlighting its multiple service facets (see also: Europeana,
[2011?]b). Similarly, Kirchoff et al. (2008: 256) note that “metadata standards” are key for
organising cultural heritage information online. The Europeana Public Domain Charter adds
a more political dimension, emphasising that Europeana “belongs to the public and must
represent the public interest” (Europeana, 2010: 1).
1.3 Study Aim and Objectives
This study investigates online queries submitted to the Europeana Web portal, focusing
specifically on query text and the specification of filters (e.g. language, named collections)
that can be selected from results pages following initial query entry (Europeana, [2012?]d)
and identified through query log metadata. Exploring broad querying patterns, alongside
more granular consideration of filtering specifications, is considered relevant for potentially
informing the portal’s search interface design and content structuring based on user needs
and preferences (see, e.g. Chaudhry & Jiun, 2005; Jansen, 2009; Levene, 2006; Minelli et al.,
2007).
1.3.1 Aim
To investigate queries submitted online to the Europeana Web portal and develop a
query classification scheme; to utilise this scheme to compare popular and other
queries submitted to the portal; and to evaluate how study findings could inform
cultural heritage information provision online.
1.3.2 Objectives
1. To conduct a literature review concerning information provision and information-
seeking online, in general and cultural heritage contexts, and situating query log
analysis within its broader methodological framework.
2. To investigate filter specifications for a sample of online queries submitted to the
Europeana Web portal.
3. To analyse the 150 most popular queries (by frequency) submitted online to
Europeana and develop a query classification scheme.
4. To refine the classification scheme developed in Objective 3 by classifying and
analysing two samples of 150 random online queries to Europeana. Hence, to
evaluate and enhance the transferability of the classification scheme to queries
other than the most popular.
5. To utilise the study classification scheme to classify and compare the 25 most
frequent queries refined by different options within Europeana’s ‘Type’ filter (see,
e.g. Europeana, [2012?]f).
6. To consider the implications of study findings for organising and presenting cultural
heritage information online, via Europeana and other providers, in particular
through tailored content structuring, interface design and provision of search
functionalities.
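Objectives 3 and 4 rest on two simple sampling operations over the query log: ranking queries by frequency and drawing random samples. The following is a minimal sketch in Python of how such datasets could be derived. The function names, the toy log and the choice to sample from distinct query strings are assumptions made for illustration; they are not taken from the study's own procedure.

```python
import random
from collections import Counter


def most_popular(queries, n):
    """Return the n most frequent query strings with their counts."""
    return Counter(queries).most_common(n)


def random_sample(queries, n, seed=0):
    """Draw n distinct query strings at random (assumption: sampling is
    over distinct queries, not over individual log entries)."""
    distinct = sorted(set(queries))  # sorted so a fixed seed is repeatable
    rng = random.Random(seed)
    return rng.sample(distinct, min(n, len(distinct)))


# Hypothetical toy log; a real log would contain many thousands of entries.
toy_log = ["paris", "faust", "paris", "mozart", "faust", "paris", "rembrandt"]

popular = most_popular(toy_log, 2)
sample = random_sample(toy_log, 2)
```

With a full log, `most_popular(log, 150)` would correspond to the popular queries dataset, and two calls to `random_sample(log, 150, seed=...)` with different seeds to the two random samples.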
Chapter 2: Literature Review
2.1 The Modern Information Environment
As noted in Section 1.1, online information-seeking is particularly significant in the modern
information environment (Europeana, 2010; Purday, 2010). Online information is
distinguished from other information resources especially by its diverse user groups and by
the fact that both “Web content and Web user behavior are highly dynamic” (Li et al., 2012:
10740. See also: Cooper, 2001; Jansen & Pooch, 2001; Park et al., 2005). Indeed, Hargittai
(2002: 1243) conceptualises “the Web as a complex set of information retrieval services”,
potentially increasing the complexity of search and the likelihood of ‘information overload’
resulting from the large and rising volume of information available online (Eirinaki &
Vazirgiannis, 2003: 1).
Several authors therefore emphasise the importance of search engines (SEs) in online
information-seeking (see, e.g. Broder, 2002; Jansen & Pooch, 2001; Jansen & Spink, 2006).
Although Levene (2006: 10) notes that general SEs cannot always index “deep web”
information stored in databases or digital libraries, whilst Hargittai (2002: 1243) questions
their popularity, Broder (2002: 8) nevertheless suggests that modern SEs are increasingly
sophisticated and effective, focusing on “attempts to blend data from multiple sources”.
Additionally, “specialized search services” (Levene, 2006: 58) catering to different user
groups and/or specific (e.g. topic-based) information requirements are increasingly being
developed alongside general Web search tools (see also: Jansen, 2009).
A particular growth area in both general and specialist online information provision is
personalisation: “adapt[ing] the information or services provided by a Web site to the
needs of a particular user or set of users” (Eirinaki & Vazirgiannis, 2003: 1). Personalisation
is generally based on characteristics like cultural background, language or geographical
location that can potentially influence users’ information-seeking behaviour (see, e.g.
Clough, 2009; Cooper, 2008; Eirinaki & Vazirgiannis, 2003; Hearst, 2009; Jansen & Spink,
2006; Jansen et al., 2007; Jansen et al., 2011; Levene, 2006; Park et al., 2005; Spink et al.,
2002).
Other emerging trends include rising mobile Internet access, with implications for system
design features like tailored mobile interfaces (Agosti et al., 2012; Bowler et al., 2011; CIBER
Research Ltd., 2011; Clough, 2009; Hearst, 2009; Levene, 2006; Purday, 2009). In particular,
CIBER Research Ltd. (2011: 6) describe how search support must be tailored to the
functionalities of different mobile devices (see also: Agosti et al., 2012; Nicholas & Clark,
2012). The report adds that online information-seeking patterns can vary depending on
access type, reflecting how “people shift between different contexts and personas” (CIBER
Research Ltd., 2011: 22). There is also growing concern with the impact of social media on
information-seeking through engagement with features like tagging and “folksonomies”
(Gabrilovich et al., 2009: 26. See also: Agosti et al., 2012; Europeana, [2011?]b). Jansen et
al. (2011: 492), for example, highlight the evolution of “real time search engines” that can
incorporate dynamic, social media-type content.
2.2 Investigating Search
Researching online information-seeking requires consideration of diverse and interlinked
system and user factors. For example, compared with pre-Web information retrieval (IR), a
wider variety of users with different levels of knowledge and experience perform Web
searches (Bowler et al., 2011; Clough, 2009; Hearst, 2009; Ross & Wolfram, 2000;
Silverstein et al., 1999). Jansen (2006: 408) further summarises how “[a] Web search engine
may be a general-purpose search engine, a niche search engine, or a searching application
on a single Web site”. Each of these has different functionalities, and search engines can
also be accessed indirectly, for example through an Application Programming Interface
(API) (see, e.g. Concordia et al., 2010; Cousins et al., 2008; Jansen et al.,
2011).
Cooper (2001: 139) states that “[t]he search process is iterative”, emphasising the need to
support different stages of information-seeking, such as query formulation, navigation and
browsing; understanding search can therefore inform both content
structuring and SE design, particularly for complex multimedia information (see, e.g. Agosti
et al., 2012; Eirinaki & Vazirgiannis, 2003; Hearst, 2009; Jansen, 2009; Levene, 2006; Spink
et al., 2002). Web searching is further impacted by contextual factors like non-search
system/user interactions (Cooper, 2001: 141), alongside people’s “use of other media for
information retrieval, their demographics, and their social support networks” (Hargittai,
2002: 1239). For example, Spink et al. (2002: 37) note the continuing importance of users’
information and technical literacy.
Information needs themselves are highly variable. Broder (2002: 3), for example, identifies
“informational…navigational…or transactional” queries, representing different searching
behaviours and thus requiring tailored search support (see also: Bowler et al., 2011;
Levene, 2006; Park et al., 2005). Jansen et al. (2011: 499) further consider different user
characteristics, proposing that differences between queries submitted to the real-time
‘Collecta’ search engine compared to general Web search could “indicat[e] a possible early
adopter audience relatively more technical than the general population”. Additionally,
Jansen and Pooch (2001: 239) hypothesise that querying patterns may vary across media
types, with three studies in their literature review suggesting that “multimedia Web queries
contain more terms than the average Web query”. Information needs may also change
during search; later queries generally become more complex and specific in response to
earlier search results (Hearst, 2009: 79).
It is also important to consider “the impact of system differences [e.g. interface features]
on user behavior” (Kurth, 1993: 99), especially when comparing querying patterns across
systems with different features, or design changes for single systems (see, e.g. Clough,
2009; Jansen & Pooch, 2001; Jansen & Spink, 2006; Jansen et al., 2011; Koch et al., 2004 in
Agosti et al., 2012: 683; Levene, 2006; Spink et al., 2002). For example, features like “query
suggestion” (Agosti et al., 2012: 666) are likely to affect query formulation, especially for
novice searchers. To summarise, SE design is closely related to wider IR theory; information
needs and system functionalities both influence search behaviour (Levene, 2006). It can
therefore be difficult to generalise study findings from different time periods and across
different systems and user groups (Jansen & Pooch, 2001; Jansen & Spink, 2006; Ross &
Wolfram, 2000).
Some common trends have nevertheless emerged from previous studies of online
information-seeking. For example, considering the ‘AltaVista’ Web SE, Silverstein et al.
(1999: 6) find that “web users type in short queries, mostly look at the first 10 results only,
and seldom modify the query”, with these patterns corroborated by the findings of other
studies (see, e.g. Agosti et al., 2012; Gabrilovich et al., 2009; Jansen & Pooch, 2001; Jansen
& Spink, 2006; Li et al., 2012). Cooper (2001: 144) similarly notes that “most users are
satisfied with the most standard features of a system”, supported by Park et al.’s (2005:
213-214) research concerning the Korean ‘NAVER’ search engine (see also: Jansen et al.,
2011). Indeed, an additional concern is the extent to which users are aware of available
search functionalities, highlighting the need for support beyond the provision of ‘advanced
search’ features (Cooper, 2001; Meyer et al., 2007).
2.3 Query Log Analysis
Query log analysis (Section 3.1.1) is a popular methodology for investigating online
information-seeking that has high practical applicability, for example influencing “the
design, personalisation and evaluation of systems” (Clough, 2009: 4. See also: Hearst, 2009;
Jansen, 2009; Jansen et al., 2009; Levene, 2006; Park et al., 2005). Query logs generally
contain standard fields such as query terms, plus metadata like date/time and some form of
user identification (Eirinaki & Vazirgiannis, 2003; Jansen, 2009; Jansen & Spink, 2006;
Jansen et al., 2011; Levene, 2006; Nicholas & Clark, 2012). Additional information can
include “user-specified modifiers” like search filters (Silverstein et al., 1999: 7), or “referrer
site” (Jansen, 2006: 409), thus also considering navigation between different parts of the
Web. The available query log fields necessarily impact on possible areas of investigation,
including the extent to which logs from different systems are comparable (Agosti et al.,
2012: 680-681).
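To make the fields described above concrete, a single query log record might be modelled as follows. This is a minimal sketch in Python, not the format of any real log: the field names, the tab-separated layout and the `NAME:value;NAME:value` filter encoding are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Optional


@dataclass
class QueryLogRecord:
    query: str                      # query terms as entered
    timestamp: datetime             # date/time metadata
    user_id: str                    # surrogate identifier (e.g. IP or cookie)
    filters: Dict[str, str] = field(default_factory=dict)  # user-specified modifiers
    referrer: Optional[str] = None  # referring site, if recorded


def parse_record(line: str) -> QueryLogRecord:
    """Parse one hypothetical tab-separated line:
    time, user, query, filters, referrer."""
    time_str, user_id, query, filter_str, referrer = line.rstrip("\n").split("\t")
    filters = {}
    for pair in filter(None, filter_str.split(";")):
        name, _, value = pair.partition(":")
        filters[name] = value
    return QueryLogRecord(
        query=query,
        timestamp=datetime.fromisoformat(time_str),
        user_id=user_id,
        filters=filters,
        referrer=referrer or None,  # empty field means no referrer recorded
    )


# Example line with two filters and no referrer.
record = parse_record("2012-03-01T10:15:00\tu42\tfaust\tTYPE:IMAGE;LANGUAGE:de\t")
```

Which of these fields a given log actually provides determines what can be analysed; a log without a `filters` field, for instance, could not support an investigation of query refinement.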
User identification is especially complex, since common surrogates like IP address or
cookies cannot always identify individuals reliably; it can be difficult to separate human and
non-human (e.g. robot) queries, whilst user privacy is an important concern from an ethical
standpoint (Agosti et al., 2012; Bowler et al., 2011; Clough, 2009; Cooper, 2001; Cooper,
2008; Eirinaki & Vazirgiannis, 2003; Jansen, 2006; Jansen et al., 2009; Jansen et al., 2011;
Kurth, 1993; Silverstein et al., 1999). Log analysis can also be conducted at the level of “term,
query, and session” (Jansen, 2006: 417), with sessions often especially difficult to identify
(see also: Jansen & Pooch, 2001). Given a frequent lack of contextual information,
delineating both users and their information needs can therefore be problematic (Cooper,
2001; Eirinaki & Vazirgiannis, 2003; Jansen & Spink, 2006; Kurth, 1993).
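The three levels of analysis can be illustrated with a short Python sketch over a toy log. Everything here is hypothetical: the records, the use of a bare user identifier and, in particular, the 30-minute inactivity cut-off used to split sessions, which is a common heuristic rather than a value taken from the literature reviewed here. The sketch also inherits the problem noted above, since it treats each identifier as one person.

```python
from collections import Counter
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed cut-off, for illustration only

# Hypothetical records: (user identifier, timestamp, query text).
log = [
    ("u1", datetime(2012, 3, 1, 10, 0), "faust"),
    ("u1", datetime(2012, 3, 1, 10, 5), "faust goethe"),
    ("u1", datetime(2012, 3, 1, 14, 0), "paris"),
    ("u2", datetime(2012, 3, 1, 10, 1), "paris"),
]

# Term level: how often each individual word occurs.
term_counts = Counter(term for _, _, q in log for term in q.lower().split())

# Query level: how often each whole query string occurs.
query_counts = Counter(q.lower() for _, _, q in log)


def split_sessions(records, timeout=SESSION_TIMEOUT):
    """Session level: group each user's queries, starting a new session
    whenever the gap since that user's previous query exceeds the timeout."""
    sessions = []
    last_seen = {}  # user -> (time of last query, that user's open session)
    for user, ts, query in sorted(records, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > timeout:
            current = []
            sessions.append(current)
        else:
            current = previous[1]
        current.append(query)
        last_seen[user] = (ts, current)
    return sessions


sessions = split_sessions(log)
```

Term and query counts follow directly from the log, whereas the session boundaries depend entirely on the chosen timeout and on how reliably the identifier maps to one user, which is why sessions are the hardest level to delineate.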
The primary disadvantage of query log analysis is its descriptive nature, whereby data
cannot account for “the underlying situational, cognitive, or affective elements of the
searching process” (Jansen, 2006: 411. See also: Bowler et al., 2011; Jansen & Pooch, 2001;
Jansen & Spink, 2006; Kurth, 1993). Kurth (1993: 100) further extends the conception of
information-seeking to consider the difficulty of determining “the information needs that
users are unable to express in the search statements that they enter into online systems”.
As noted above, it is therefore important to consider the generalisability of study findings
concerning patterns of search behaviour (see, e.g. Agosti et al., 2012; Jansen & Spink,
2006), particularly since users do not always perform searches as individuals (Bowler et al.,
2011; Hargittai, 2002; Kurth, 1993).
2.4 Applications of Query Log Analysis
As noted above (Section 2.2), online queries generally contain few terms; examining large
volumes of data through query log analysis can therefore help to broadly profile system
users’ information and service requirements (Jansen, 2009; Park et al., 2005; Spink et al.,
2002). For example, Cooper (2001: 140-141) uses query log data to investigate changes in
querying patterns through time for a University of California library catalogue, specifically
focusing on query volumes across University term/holiday periods (see also: Jansen et al.,
2011). Applications of query log analysis are nevertheless somewhat system-dependent.
Agosti et al. (2012: 663), for example, distinguish “Web search engine log analysis and
Digital Library System log analysis” based on differences in system content, functionalities
and user groups.
However, query log analysis can showcase key aspects of user-system interaction, thus
potentially “enlighten[ing]…interface development, and devising the information
architecture for content collections” (Jansen, 2006: 407. See also: Agosti et al., 2012;
Hearst, 2009; Jansen & Spink, 2006). The latter is considered particularly important for
organising and facilitating access to complex “Web sites whose content is increasing on a
daily basis, such as news sites or portals” (Eirinaki & Vazirgiannis, 2003: 3. See also: Agosti
et al., 2012). Query log data can additionally inform group or individual “user profiling”
(Eirinaki & Vazirgiannis, 2003: 3), which is an important concern given the growth of system
personalisation, particularly in e-commerce contexts (Eirinaki & Vazirgiannis, 2003;
Hargittai, 2002; Ross & Wolfram, 2000).
Query log analysis can therefore be highly relevant for both content providers and system
users. Its effective practical application nevertheless requires a broad understanding of
users and their context, meaning that findings may not apply across different systems. For
example, Cooper (2001: 143) emphasises that “[u]ser behavior varies significantly
depending upon the database being searched”, reflecting different users’ information
needs, expertise and the functionalities available to support different facets of system
usage, such as search and browse (Agosti et al., 2012: 683. See also: Jansen et al., 2011). As
such, whilst it has potential business relevance “for making managerial decisions and
establishing priorities” (Agosti et al., 2012: 681), query log analysis is unlikely to be
sufficient for supporting the growing area of truly “[u]ser-centered design” (Bowler et al.,
2011: 723).
2.5 Query Classification
Query classification (Section 3.1.2) is an approach to query log analysis that requires
consideration of several practical factors. For example, especially for short Web queries,
“search queries may be ambiguous” (Li et al., 2012: 10739), thus complicating the
classification process (see also: Gabrilovich et al., 2009). Queries may also be “affected by
ephemeral trends” like current affairs (Silverstein et al., 1999: 6), which can alter both
patterns of query subjects (see, e.g. Ross & Wolfram, 2000) and their expression through
language and terminology; query classification schemes must therefore be flexible and
adaptable enough to incorporate new fields (Beitzel et al., 2004 in Li et al., 2012: 10739;
Gabrilovich et al., 2009). Jansen et al. (2011: 499) further state that a “power-law
distribution…[is] typical for Web query terms”, suggesting that frequent and rare queries
are likely to have different characteristics. Similarly, Gabrilovich et al. (2009: 9) argue that
“rare queries…tend to contain rare words, be longer, and match fewer documents” (see
also: Beitzel et al., 2007a, 2007b in Agosti et al., 2012: 673), nevertheless representing a
significant volume of data about users’ search behaviour when aggregated (Silverstein et
al., 1999).
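The contrast between frequent ('head') and rare ('tail') queries can be illustrated with a minimal Python sketch. It assumes a hypothetical log in which each entry is simply the text of one submitted query; the function name and the toy log are assumptions made for illustration and are not drawn from any of the studies cited.

```python
from collections import Counter


def head_and_tail_share(log, head_size):
    """Split total query volume between the head_size most frequent
    distinct queries (the 'head') and all remaining queries (the 'tail')."""
    counts = Counter(log)
    total = sum(counts.values())
    head = sum(freq for _, freq in counts.most_common(head_size))
    return head / total, (total - head) / total


# Toy log (hypothetical): one very frequent query and several rare ones.
toy_log = ["paris"] * 5 + ["faust", "mona lisa", "old maps of ghent",
                           "bach cantata manuscript", "art nouveau posters"]

head, tail = head_and_tail_share(toy_log, 1)
```

In this toy case the single most frequent query and the five rare queries each account for half of all submissions, which illustrates how individually rare queries can still represent a large share of volume when aggregated.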
Query classification, commonly with multiple levels, has been utilised to help develop
general Web search taxonomies, in particular concerning query subjects (see, e.g. Agosti et
al., 2012; Chuang & Chien, 2003 in Agosti et al., 2012: 671; Gabrilovich et al., 2009; Li et al.,
2012). For example, Li et al. (2012: 10742) employ “a hierarchical category taxonomy”,
aiming to maintain the currency of Web query classification by categorising both queries
and results pages to consider the “semantic distance” between these groups (Li et al. 2012:
10740). Categories are generally not mutually exclusive (see, e.g. Gabrilovich et al., 2009:
11).
Study results have illuminated diverse aspects of online information-seeking. For example,
whilst frequent queries from Silverstein et al.’s (1999: 9, Table 4) study of the ‘AltaVista’ SE
include “sex” and “porno” (see also: Ross & Wolfram, 2000), Jansen et al.’s (2011: 504)
study of the real-time ‘Collecta’ SE conversely reveals “a high occurrence of society,
entertainment, technology, and politics”. They attribute this discrepancy to the relatively
specialist nature of real-time search and its distinct user groups, resulting in a querying
pattern that “differs from the topical characteristics of the traditional Web search” (Jansen
et al., 2011: 501). Jansen and Spink (2006: 258, Table 2) also consider variations in
searching behaviour across different locations, suggesting that UK and US searchers exhibit
different querying patterns. Based on query classification of six Web SE datasets, they
further conclude that “[t]he overall trend is towards using the Web as a tool for information
or commerce, rather than entertainment”, providing an alternative perspective to the
studies outlined above (Jansen & Spink, 2006: 260).
Query classification can therefore be relevant for both system and service development.
Gabrilovich et al. (2009: 6), for example, focus on targeted advertisements, whilst Ross and
Wolfram (2000: 957) consider “subject-based access tools” (see also: Jansen et al., 2011).
Significantly, assistance can be provided at different stages of the information-seeking
process, such as query formulation and modification, facet/filtering options and results
display; the latter could include topic-based results clustering to help provide clarity in
situations of “ambiguity or multiple aspects of a topic” (Agosti et al., 2012: 676. See also: Li
et al., 2012). Eirinaki and Vazirgiannis (2003: 7) further consider trends in online
information provision like “recommendation systems”, which allow organisations to cater
to diverse Web users.
Alternatively, classification can utilise Broder’s (2002: 3) categories (Section 2.2), although
the author does note that distinguishing the query types can be difficult (Broder, 2002: 5.
See also: Li et al., 2012). Indeed, Cooper (2001: 143) considers aspects of both query type
and topic for a library catalogue query log, noting that “[m]ost searches (40%) are power
searches [i.e. searching across catalogue fields]”, thus highlighting the importance of
considering the impact of system environment (e.g. available search options) on search
behaviour.
2.6 Cultural Heritage Information
Providing content and search functionalities aligned with user needs is especially important
for cultural heritage information given its widespread appeal (Chaudhry & Jiun, 2005;
Meyer et al., 2007; Minelli et al., 2007). As noted above (Sections 1.1/1.2), cultural heritage
organisations are focusing in particular on facilitating online and mobile access to their
collections. Even so, Agosti et al. (2012: 678) note that information in ‘Digital Library
Systems’ (DLS) – often associated with cultural heritage – retains its distinctive character in
digital environments and can be distinguished from general Web content on the basis that
“collections are explicitly organized, managed, described, and preserved”. Indeed,
Europeana’s Strategic Plan 2011-2015 (Europeana, [2011?]a: 12) highlights its aim to
provide a “comprehensive, trustworthy and authoritative collection”, which is therefore
potentially more closely aligned with library or archive models than broader online
provision.
Multilingual and multimedia search are particularly significant in cultural heritage contexts
(Concordia et al., 2010; Cousins et al., 2008; Kirchoff et al., 2008; Meyer et al., 2007; Minelli
et al., 2007; Ott & Pozzi, 2011). Image resources are especially popular (Kirchoff et al., 2008;
Politou et al., 2004), whilst the Europeana Business Plan 2012 (Europeana, [2011?]b: 7) also
notes high demand for audio-visual resources (see also: Nicholas & Clark, 2012). Technical
considerations relating to multimedia information storage and display are therefore key; as
an example, Politou et al. (2004: 300) suggest that functionalities supported by the
‘JPEG2000’ image format make it highly “applicable to cultural heritage databases”.
Understanding the nature of cultural heritage information is therefore vital for developing
effectively tailored online systems.
Voorbij (2010: 275) emphasises the variation between cultural heritage organisations,
noting especially that “libraries mainly provide access to external digital resources…while
archives and museums place their own unique resources in digitized form on their web
site”, resulting in different content management and intellectual property concerns
(Concordia et al., 2010: 68). Additionally, Meyer et al. (2007: 397) consider a potential
cultural heritage information system whose Web interface is adaptable for “professionals”
and “the general public”, reflecting the need for system design that caters to different
cultural heritage user groups (see also: Cousins et al., 2008; Dempsey, 2000 in Kirchoff et
al., 2008: 252). Indeed, Bowler et al. (2011: 745) suggest that focusing explicitly on different
users may be the most effective approach, despite the resultant complexity in system
design (see also: Europeana, [2011?]a).
Recent developments in online cultural heritage information provision have included
tailored “web portals” (Ott & Pozzi, 2011: 1366). These can help users overcome issues of
technical literacy that potentially complicate information-seeking (see, e.g. Europeana,
2010; Kirchoff et al., 2008), alongside simplifying search by aggregating content, which can
include different types (e.g. subjects, formats) of information (Agosti et al., 2012; Minelli et
al., 2007; Purday, 2009; Voorbij, 2010). Indeed, whilst many cultural heritage organisations
remain physically distinct, Bowler et al. (2011: 746) argue that “users increasingly do not
think in such organizationally restricted terms”, illustrating the need for portal-type
information access. Kirchoff et al. (2008: 258), for example, highlight the German ‘BAM’
cultural heritage Web portal, in particular describing its option to search via “a simple
Google search field”.
New technologies are therefore affecting the wider character of cultural heritage education
and research, encouraging international and interdisciplinary perspectives (Ott & Pozzi,
2011: 1369-1370). Concordia et al. (2010: 67) further suggest that a “digital cultural
commons” is evolving, aided by developments like portals, whilst Europeana’s Business
Plan 2012 (Europeana, [2011?]b: 11) states its aim to “[d]evelop the ‘European Cultural
Commons’ as a concept, a movement and a business model within the Europeana
Network”.
Personalisation and participation are increasingly important aspects of cultural heritage
information provision, as well as for Web information services more generally (Bowler et
al., 2011; Europeana, [2011?]a). Indeed, Bowler et al. (2011: 739) argue that “[l]ibrary
reference service has traditionally been adaptive and personalized”, suggesting that –
whilst its large-scale implementation in online environments may be relatively novel –
personalisation itself is not a new approach. Europeana is also aiming to increase its
service/user interaction, for example by incorporating “a corpus of user-generated objects”
(Europeana, [2011?]b: 20).
It is important to note that Web-based cultural heritage information systems are often
concerned with both “preservation and dissemination” (Politou et al., 2004: 293), meaning
that accessibility and ease of use are not the only system priorities (Europeana, 2010;
Kirchoff et al., 2008; Meyer et al., 2007; Purday, 2010). For example, digitising objects can
support both aims, influencing the changing relationship between organisations’ physical
and online presence (Europeana, 2010; Europeana, [2012?]b; Kirchoff et al., 2008; Voorbij,
2010). Nevertheless, Kirchoff et al. (2008: 252) argue that “[d]igital memory institutions do
not compete with archives, libraries and museums”, suggesting that they have different
primary remits. Focusing on archaeological information, Meyer et al. (2007: 398, citing
Richards, 1998) therefore stress the importance of developing content management and
information access provision tailored specifically to cultural heritage. For example, mobile
access and systems that account for user location are two significant and interlinked areas
that are likely to be especially pertinent to cultural heritage tourism (see, e.g. CIBER
Research Ltd., 2011; Gravano et al., 2003 in Gabrilovich et al., 2009: 5 and in Li et al., 2012:
10746).
2.7 Query Log Analysis and Query Classification in Cultural Heritage
Query log analysis is highly applicable to information system design in the complex cultural
heritage environment, helping to illuminate user information needs and behaviour, and
thus aiding resource discovery through informing the development of systems and search
features. For example, Voorbij (2010: 267) describes how data from “log file analysis…and
tools based on page tagging” have enabled cultural heritage organisations in The
Netherlands to consider both presentation and content provision, through “adapting the
web site or setting priorities for further digitization” (Voorbij, 2010: 278). Concerning
Europeana specifically, CIBER Research Ltd. (2011: 7) employ “deep log analysis” to
investigate mobile access and make system recommendations (see also: Nicholas & Clark,
2012).
Web query classification schemes may not easily translate to cultural heritage topic areas;
Li et al. (2012: 10746, Figure 2), for example, note categories like “Computers” in the
experimental KDDCUP2005 taxonomy, which may be largely irrelevant in this field. Cultural
heritage information-seeking is further complicated by the wide variety of organisational
schemes currently in use, alongside the need to consider multilingualism and multimedia
information formats (Chaudhry & Jiun, 2005; Meyer et al., 2007; Minelli et al., 2007).
Indeed, Walsh (2011: 334) notes that “[l]ocally developed taxonomies have become a
popular method for subject description in digital collections”, sometimes combined with
existing schemes like ‘Library of Congress Subject Headings’ (Walsh, 2011: 333-334; Library
of Congress, [2012?]).
Several authors have therefore suggested that domain ontologies, taxonomies and faceted
search/browse are particularly useful for cultural heritage information-seeking, as noted by
Walsh (2011) in the context of digital libraries. For example, Chaudhry and Jiun (2005) focus
on cultural heritage taxonomy development in Singapore, whilst Minelli et al. (2007)
consider results filters (see also: Chaudhry & Jiun, 2005; Clough, 2009; Kirchoff et al., 2008;
Meyer et al., 2007). An advantage of portals specifically is highlighted by Kirchoff et al.
(2008: 262), who argue that combining “[i]nformation from…heterogeneous sources – by
location, time, person, and subject – are the added value provided by portals such as BAM
or Europeana”.
This quotation introduces a classification element that can be similarly applied to queries; for
example, Minelli et al. (2007: 4) classify popular online queries from several cultural
heritage organisations into the comparable categories of “Proper names”, “Subject”,
“Place” and “Time”. Other authors emphasise the on-going importance of content
classification in facilitating effective search and browse, despite the difficulties inherent in
achieving this (see, e.g. Kirchoff et al., 2008; Walsh, 2011). Indeed, in accordance with wider
trends, Bowler et al. (2011: 727) consider the potential for “[s]ocial tagging” to enable
subject-based access to library materials. Voorbij (2010: 274) similarly notes “interest in
users’ search terms”, which could provide an alternative starting point to top-down
classification schemes, thus reflecting emerging principles of “participatory design” (Bowler
et al., 2011: 734) in online cultural heritage information provision. Indeed, Europeana itself
offers individual users some tagging functionality via the ‘My Europeana’ area (Europeana,
[2012?]g).
Chapter 3: Methodology
3.1 Background and Theoretical Basis
As noted in Section 1.3, a primary aim of this study is to create a query classification
scheme for online queries submitted to the Europeana Web portal, illuminating querying
patterns and facilitating comparison between different query types (e.g. frequent versus
random queries). The study hypothesis is that there are significant differences between
popular and other queries, meaning that its overall approach is broadly deductive.
However, as outlined below, query classification was initially undertaken using a more
inductive, data-focused approach (see, e.g. Jansen, 2009: 65).
Query classification is subject-based, with the intention of developing a classification
scheme that is specific to this study but informed by existing examples (e.g. Chaudhry &
Jiun, 2005; Minelli et al., 2007). The decision to create a new scheme reflects the lack of
existing classifications specific to cultural heritage settings like Europeana; as noted in
Section 2.7, existing subject-based schemes (e.g. for Web queries) may not be easily
transferable to this context.
3.1.1 Query Log Analysis
Query log analysis, a methodology related to broader transaction log analysis (Jansen et al.,
2009: 2), is the primary data analysis approach adopted for this study (see also: Jansen,
2009; Nicholas & Clark, 2012). Although not a new methodology, it is becoming increasingly
established in the context of online and digital system environments, thus requiring novel
approaches (Agosti et al., 2012; Clough, 2009; Cooper, 2001; Jansen, 2006; Jansen & Pooch,
2001; Nicholas & Clark, 2012). Jansen and Spink (2006: 254), focusing on the Web,
summarise the benefits of logs as data sources:
“Web transaction logs…unobtrusively [record] real interactions by real users in the
pursuit of real information needs in the complex Web information environment”.
Large volumes of data originating from different systems can therefore be gathered in
natural settings, without requiring direct interaction between users and researchers; Jansen
(2006: 424) further highlights the comparatively low cost of obtaining log data (see also:
Agosti et al., 2012; Clough, 2009; Cooper, 2001; Kurth, 1993). Nevertheless, its descriptive
nature can limit query log analysis, since it “cannot explain why something has occurred”
(Minelli et al., 2007: 3) and may be most effective when combined with other approaches
like interviews or surveys (see, e.g. Agosti et al., 2012; Clough, 2009; Jansen et al., 2009;
Kurth, 1993; Minelli et al., 2007).
Ethical issues also require consideration, since queries and query log metadata like IP
addresses can potentially identify individuals (Section 2.3). However, this project examines
only query text, frequencies and some non-identifying metadata like the use of search
filters; it was considered unlikely that query text itself would contain identifying
information given the cultural heritage context. The study therefore received a ‘No Risk’
ethical designation.
Query log analysis encompasses different stages of data “collection…preparation…and
analysis” (Jansen, 2006: 412. See also: Agosti et al., 2012; Eirinaki & Vazirgiannis, 2003;
Jansen & Pooch, 2001; Kurth, 1993). However, by using an existing dataset that was
collected and prepared before being passed to the researcher, this study focuses on the
analytical stage. Although a potential limitation, since data collection is therefore not
directed specifically towards the research (see, e.g. Jansen, 2009; Jansen et al., 2009), this is
considered the best approach given time constraints and the researcher’s limited
experience of dealing with log data. Analysis itself is conducted at the query level: “[a]
query is defined as a string list of zero or more terms submitted to a search engine” (Jansen,
2006: 418).
3.1.2 Query Classification
Query classification is a popular analytical approach, generally using topic categories like
“people, places or things” (Levene, 2006: 66. See also: Agosti et al., 2012; Clough, 2009;
Jansen & Spink, 2006; Spink et al., 2002). It can be performed either automatically or
manually and is often associated with taxonomy development (see, e.g. Agosti et al., 2012:
666). Indeed, reflecting the large numbers of queries submitted to online search engines
and comparatively high cost of manual classification (Li et al., 2012: 10746), automatic
approaches are generally considered necessary for Web-based datasets; however, some
initial manual input is often utilised, such as to help develop classification categories (see,
e.g. Agosti et al., 2012; Gabrilovich et al., 2009).
A manual approach is nevertheless considered appropriate for this study given its focus on
both frequent and random query samples (Section 1.3.2), plus Europeana’s specialist
subject area. Gabrilovich et al. (2009: 3), for example, argue that “‘Tail’ queries…do not
have enough occurrences to allow statistical learning on a per-query basis”, implying that
manual query classification could be more effective for gaining a nuanced understanding of
less popular queries.
Practical classification can involve different approaches. Li et al. (2012: 10743), in their Web
query taxonomy, note that “we can expand the names of categories from other sources”,
increasing the flexibility and relevance of the scheme for different user groups and subject
areas (see also: Concordia et al., 2010; Gabrilovich et al., 2009). Similarly, Jansen et al.
(2011: 491) classify Web queries “using the Google Directory topical hierarchy”, whilst
Agosti et al. (2012: 678) approach classification from a more library-based perspective,
considering the relevance of existing schemes and standards like “authority control rules”.
The character and granularity of the classification scheme employed or developed must
therefore be appropriate for the type of data and study purpose (Gabrilovich et al., 2009; Li
et al., 2012).
Query log analysis is strongly linked to inductive methodologies, particularly grounded
theory. Defined as “the discovery of theory from data” (Glaser & Strauss, 1967: 1), this
approach is noted specifically in the context of query log analysis by Jansen and Pooch
(2001) and Jansen (2006). Ross and Wolfram (2000: 951) similarly describe how “[c]oding
categories were developed inductively” in their Web query study. The inductive
classification approach adopted here is therefore informed by these examples.
3.2 Data Collection and Analytical Approach
Data collection involved existing Europeana query log data held by the researcher’s project
supervisor; relevant information was extracted from the log before being passed to the
researcher for analysis.
3.2.1 Datasets
Several distinct datasets are analysed in this study:
1. Filters Dataset: usage frequencies of different Europeana filters for a sample of
approximately 100,000 queries from early 2012. Intended to give a broad overview
of querying patterns and including some examples of query text (84) to seed initial
classification scheme development.
2. Popular Queries: query text and frequencies for the 150 most frequent queries
submitted to Europeana between 01.01.2012-30.06.2012. Utilised for further query
classification scheme development.
3. Random Queries: query text and frequencies for two samples of 150 random
queries submitted to Europeana between 01.01.2012-30.06.2012. Utilised for
classification scheme refinement and comparison with the ‘Popular Queries’
dataset.
4. Media Types Dataset: query text and frequencies for the 25 most popular queries
submitted to each of the Europeana ‘Type’ filter’s four main options (Europeana,
[2012?]d) between 01.01.2012-30.06.2012. Intended to further illuminate filter
usage, in particular through comparison of query classifications across media type
options.
The total sample therefore includes approximately 500 query text examples. Existing
studies involving manual query classification have generally considered around 1000-2000
queries (see, e.g. Gabrilovich et al., 2009; Ross & Wolfram, 2000). However, a smaller
sample is considered appropriate here given the relatively small scale and time constraints
of the study, alongside its consideration of aspects like filter specifications in addition to
query text.
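The difference between the Popular and Random Queries datasets can be illustrated with a minimal Python sketch. It assumes a hypothetical log in which each entry is simply the query text; the function names, the toy log and the choice to sample over distinct query strings are all assumptions made for illustration, and do not describe how the Europeana data were actually extracted.

```python
import random
from collections import Counter


def popular_queries(log, n):
    """Return the n most frequent query strings with their frequencies."""
    return Counter(log).most_common(n)


def random_queries(log, n, seed=None):
    """Return n distinct query strings drawn at random, with frequencies.

    Sampling is over distinct query strings (an assumption), so rare
    queries are as likely to be chosen as frequent ones.
    """
    counts = Counter(log)
    rng = random.Random(seed)
    chosen = rng.sample(sorted(counts), n)
    return [(query, counts[query]) for query in chosen]


# Toy log (hypothetical data, not taken from Europeana).
toy_log = ["faust", "paris", "faust", "china", "mozart", "faust", "paris"]

top_two = popular_queries(toy_log, 2)
sample_two = random_queries(toy_log, 2, seed=1)
```

Sampling over distinct strings rather than over individual submissions is one possible design choice; sampling over submissions would instead favour frequent queries and blur the intended contrast with the popular set.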
3.2.2 Data Analysis and Classification Scheme Development
Data analysis encompasses investigation of both broad querying characteristics and query
classification scheme development. The Filters Dataset is first considered (Section 4.1),
including filter specification patterns (Section 4.1.2) and classification of query text
examples (see below). Analysis then focuses on features like language and query
‘selections’, the latter of which are defined in Section 4.1.1, for the Popular Queries
(Section 4.2) and Random Queries (Section 4.3) datasets, plus refinement of the study
classification scheme.
The Filters Dataset included 84 results with ‘text’ ‘selections’ (see: Section 4.1.1), thought
to represent user-generated query refinement terms (Europeana, [2012?]d). Minelli et al.’s
(2007) categories (Section 2.7) provided a basic initial classification framework, which was
intended to offer an outline structure without restricting the exploratory nature of the
classification and emergence of categories from the data; however, “Proper names”
(Minelli et al., 2007: 4) was substituted here by ‘People’ to avoid confusion with
institutional names (Table 1).
Category (based on Minelli et al., 2007: 4)    Frequency
“Subject”                                      48
People                                         17
“Place”                                        12
“Time”                                         1
Unknown                                        6
Table 1: Baseline classification of Filters Dataset ‘text’ query selections
As shown in Table 1, only one query (‘2010’) fitted well within the ‘Time’ category, whilst over
half were classified as ‘Subject’. It should be noted that several queries were ambiguous,
meaning that exact classification figures cannot be guaranteed without further contextual
information; for example, the query ‘china’ could describe either a place or material (i.e. a
subject). The classification pattern was nevertheless considered sufficiently clear for useful
analysis.
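The proportions behind this pattern can be checked with a short Python sketch. The frequencies are copied from Table 1; the variable names are illustrative assumptions.

```python
# Frequencies copied from Table 1 (baseline classification).
table_1 = {"Subject": 48, "People": 17, "Place": 12, "Time": 1, "Unknown": 6}

# Total number of classified 'text' selections.
total = sum(table_1.values())

# Percentage share of each category, rounded to one decimal place.
shares = {category: round(100 * count / total, 1)
          for category, count in table_1.items()}
```

The total matches the 84 ‘text’ selections in the Filters Dataset, and the ‘Subject’ share is a little over 57%, consistent with the statement that over half of the queries fall into this category.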
Reflecting the predominance of ‘Subject’ queries (Table 1), plus particularly wide variability
within this group, a subject-based approach was adopted for preliminary classification
scheme development, with multiple levels to incorporate other categories (e.g. personal
names). The classification scheme as it stood at this stage is given in Appendix 1. As a result
of the data-driven approach, only categories emerging directly from the Filters Dataset are
included in Appendix 1. Given the small sample, this was also necessarily intended as a
seeding point conducive to future refinement rather than a comprehensive description of
Europeana querying patterns.
Investigation then considered whether the classification scheme’s top-level headings could
be mapped to an existing scheme, namely the ‘Library of Congress Subject Headings (LCSH)’
(Library of Congress, [2012?]). This was considered more appropriate for the cultural
heritage context than general Web classification schemes noted in Sections 2.5/2.7 (see,
e.g. Jansen et al., 2011; Li et al., 2012). Indeed, Walsh (2011: 329) notes that “LCSH has
become one of the major tools for online information retrieval”, with potentially high
practical applicability for organising content and facilitating search. Again, this was an
exploratory comparison intended to inform subsequent classification scheme refinement.
For example, the lack of an ‘LCSH’ (Library of Congress, [2012?]) ‘Current Affairs’ descriptor
prompted consideration that this heading might not be appropriate for content structuring
due to the issue of constantly changing material; the related term ‘Politics’ was therefore
searched instead.
Comparison was made by searching terms from the scheme’s Primary Categories via the
Library of Congress Subject Headings area of the Library of Congress Website (Library of
Congress, [2012?]). It was found that, whilst Primary Categories were generally too broad
to have direct ‘LCSH’ equivalents (Library of Congress, [2012?]), mapping of the narrower
Secondary/Tertiary categories was likely to be feasible. Potentially relevant descriptors
emerging from this exploratory search are given in Appendix 2.
Analysis of the larger Popular and Random Queries datasets facilitated further classification
scheme development, aiming to reach a saturation point where no new categories
emerged. Kurth (1993: 101) highlights the potential impact of sampling strategy on study
results; in this case, consideration of popular queries was intended to give the study high
practical relevance (Section 1.3.2, Objective 3), whilst considering random query samples
(Section 1.3.2, Objective 4) involved deeper, more detailed analysis of Europeana querying
characteristics (see, e.g. Gabrilovich et al., 2009; Jansen, 2006; Ross & Wolfram, 2000).
Including both frequent and other queries was also intended to introduce an element of
“[c]omparative analysis”, which is considered an essential component of the grounded
theory approach (Glaser & Strauss, 1967: 21).
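The notion of a saturation point can be made concrete with a short Python sketch. It assumes, hypothetically, that each query has already been assigned a category by hand and that queries are reviewed in fixed-size batches; the function name, batch size and toy labels are illustrative assumptions rather than part of the study procedure.

```python
def new_categories_per_batch(labels, batch_size):
    """Count the categories first seen in each successive batch of labels."""
    seen = set()
    counts = []
    for start in range(0, len(labels), batch_size):
        batch = set(labels[start:start + batch_size])
        fresh = batch - seen  # categories not encountered in earlier batches
        counts.append(len(fresh))
        seen |= fresh
    return counts


# Toy manual labels (hypothetical), in the order queries were classified.
toy_labels = ["Place", "People", "Place",
              "The Arts", "People", "Place",
              "Place", "The Arts", "People"]

per_batch = new_categories_per_batch(toy_labels, 3)
```

A run of zeros at the end of the returned list would suggest that no new categories are emerging, although in practice such a judgement also depends on sample size and on how category boundaries are drawn.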
Analysis of popular queries enabled substantial refinement of the classification scheme. In
particular, the Primary Category ‘Society and Current Affairs’ was renamed ‘Politics and
Society’, whilst the Tertiary Category ‘Organisation or Institution’ was separated to
become a new Primary Category (‘Collections, Organisations and Institutions’), reflecting
the large number of popular queries with collection or provider-based query ‘selections’.
An additional Primary Category (‘Business and Industry’) emerged, alongside multiple new
Secondary/Tertiary Categories.
The revised scheme, following modification based on frequent query analysis, is given in
Appendix 3. It became apparent that some aspects of the scheme required further
refinement, such as potential overlap between sub-categories in ‘The Arts’ and ‘Object or
Form Descriptors’. Other aspects appeared to work well, including the three-level structure
and maintenance of ‘Military and Military History’ as a distinct Primary Category, with an
expanded number of both Secondary and Tertiary categories compared with its limited
original size (Appendix 1).
Subsequent analysis of Random Queries datasets helps to clarify the scheme; the final
scheme is discussed and presented in Section 4.3.3. Classification of frequent and random
queries is again informed by a combination of inductive and deductive approaches, focusing
initially on categories emerging from the data and then considering whether these can be
mapped to existing schemes (Section 4.3.4).
The final classification scheme is then used to compare frequent and random queries, plus
those refined by media type using one of Europeana’s filtering options (Europeana,
[2012?]d) (Section 1.3.2, Objective 5). Selection of this filter is informed by both theoretical
and practical considerations, aiming to facilitate a meaningful comparison by incorporating
queries concerning different types of material (e.g. modern and historical), but taking
practical issues like the number of available options into account. Additionally, multimedia
content provision is considered particularly important in the cultural heritage sphere (see:
Section 2.6).
This study therefore considers both technical and subject-based aspects of online cultural
heritage information-seeking, with data analysis and evaluation encompassing qualitative
and quantitative approaches; both are considered valid for studies employing query log
analysis (see, e.g. Jansen & Pooch, 2001; Kurth, 1993; Ross & Wolfram, 2000), alongside LIS
research more generally (Eldredge, 2004; Hider & Pymm, 2008). It was anticipated that
some query topics would be unclear or unknown to the researcher. Foreign language
queries were therefore translated using ‘Google Translate’ (Google, [2012?]) and queries
with unknown subjects entered either into Wikipedia (Wikipedia: The Free Encyclopedia,
[2012?]), chosen for its wide subject coverage and existing association with Europeana
(Europeana, [2012?]f), or Europeana itself (Europeana, [2012?]a). A consistent approach is
intended, so that queries remaining unclear are classified as such without consultation of
additional sources.
Chapter 4: Results
4.1 Filters Dataset
4.1.1 Filters and Query ‘Selections’
This dataset contains filter specifications for a sample of 110,691 Europeana queries from
early 2012 (Section 3.2.1), illustrating usage frequencies for different filters available
through the portal. It should be noted that data includes both filters available via results
pages for query refinement (Figure 1) and what are defined in this study as ‘query
selections’.
Figure 1: A screenshot of a Europeana results page following query submission (with query
‘faust’), showing filtering options down the left hand side (Europeana, 2012)
‘Selections’, as defined for this study, include descriptors like “who, what, where or when”
(Europeana, [2012?]d: n.p.) that can be entered either directly by users or generated by
selecting options from individual object pages within Europeana, the latter therefore
indicating browsing rather than search behaviour. Indeed, whilst not present in the Filters
Dataset, other datasets considered here (Section 3.2.1) include additional ‘selections’ like
‘europeana_collectionName’ that appear to arise via browsing when content is selected
from options available via Europeana’s ‘Explore’ function (Europeana, [2012?]e: n.p.), as
shown in Figure 2.
Figure 2: A screenshot of the Europeana Web portal Homepage (Europeana, [2012?]a),
showing the ‘Explore’ option along the top of the screen (Europeana, [2012?]e: n.p.)
‘Selections’ therefore appear to represent both user-generated (i.e. free-text entry) and
system-generated (i.e. browsing) queries (see, e.g. Nicholas & Clark, 2012). They are distinct
from filters, generating new queries rather than modifying existing queries. The ‘text’
selection nevertheless seems to arise from utilisation of Europeana’s ‘Refine your search’
option, whereby users can input new terms to search within the results set of an existing
query (Europeana, [2012?]d: n.p.), therefore combining free-text querying with query
refinement.
4.1.2 Filter Specification
In total, 35,684 (32.2%) Filters Dataset queries are either filtered or involve query
‘selections’, the majority (35,365, or 31.9%) of these filtered. Figure 3 shows the usage
proportions for different filters, excluding ‘selection’ options. The “-TYPE” filter – too small
to be visible on the chart – refers to 27 queries, 26 of which specify ‘Wikipedia’. The
definition of this filter is not entirely clear, but it may arise from the existing links between
Europeana and Wikipedia (see, e.g. Europeana, [2011?]a; Europeana, [2012?]f; Nicholas &
Clark, 2012).
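The proportions quoted in this section can be rechecked from the reported counts. The following is a minimal sketch only; the variable and function names are illustrative assumptions, and treating the remainder as 'selection'-only queries is an interpretation of the text rather than a figure stated in the study.

```python
# Counts reported in Sections 4.1.1 and 4.1.2 for the Filters Dataset.
TOTAL_QUERIES = 110_691
FILTERED_OR_SELECTED = 35_684  # queries that are filtered or involve 'selections'
FILTERED_ONLY = 35_365         # the subset of those queries that are filtered


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


filtered_or_selected_pct = percentage(FILTERED_OR_SELECTED, TOTAL_QUERIES)
filtered_only_pct = percentage(FILTERED_ONLY, TOTAL_QUERIES)

# Assumption: the remainder involves query 'selections' without filters.
remainder = FILTERED_OR_SELECTED - FILTERED_ONLY

print(filtered_or_selected_pct, filtered_only_pct, remainder)
```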
Figure 3: A pie chart showing usage proportions (%) of Europeana filters specified in the
Filters Dataset
‘TYPE’ (54%) is therefore clearly the most commonly specified filter, followed by
‘PROVIDER’ (16.2%) and ‘YEAR’ (13.5%). The ‘TYPE’ filter refers to “media type”
(Europeana, [2012?]d: n.p. See also: Europeana, [2012?]f), which can be further subdivided
as shown in Figure 4.
In this example, and others below, the data contains some fields with identical names bar
the addition of speech marks (e.g. Text and “Text”). Since these appear to represent the
same options, values are combined to create Tables (2-5) and Figure 4 below, identifiable
by the label ‘Corrected’ (or ‘C.’).
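The 'Corrected' values described above amount to merging counts for labels that differ only by speech marks. The sketch below shows one way this could be done; the function names and the counts are hypothetical and are not taken from the Filters Dataset.

```python
from collections import Counter


def normalise_label(label: str) -> str:
    """Strip surrounding whitespace and straight or curly speech marks."""
    return label.strip().strip('"“”').strip()


def merge_counts(raw_counts: dict[str, int]) -> Counter:
    """Sum counts for labels that are identical once speech marks are removed."""
    merged: Counter = Counter()
    for label, count in raw_counts.items():
        merged[normalise_label(label)] += count
    return merged


# Hypothetical counts, for illustration only (not figures from the dataset).
raw = {"Text": 120, '"Text"': 30, "Image": 200, '"Image"': 15, "3D": 2}
corrected = merge_counts(raw)
print(corrected)
```

Under this approach, 'Text' and '"Text"' contribute to a single 'Text (C.)' total, whilst options that occur in only one form (such as '3D') are unchanged.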
[Figure 3 chart values: LANGUAGE 7%; RIGHTS 1%; COUNTRY 8%; YEAR 14%; "-TYPE" 0%; PROVIDER 16%; TYPE 54%]
Figure 4: A pie chart showing the usage proportions (%) of media type options specified in
the Filters Dataset ‘TYPE’ filter
Figure 4 shows that the media type options ‘Image (C.)’ and ‘Text (C.)’ are clearly dominant,
together representing almost 80% of the total filter usage. The least specified option is ‘3D’,
with only 11 occurrences, representing less than 1% of filter usage and not visible in Figure
4. The ‘Unknown’ option refers to 1045 (≈5%) occurrences that appear separately in the
dataset but whose field is unnamed.
‘YEAR’ values specified in the Filters Dataset are aggregated by century to create Figure 5. It
can be seen that the distribution is broadly positively skewed. Years within the 20th century
are most commonly specified, with 30.5% of total filter usage, followed by the 17th and 16th
centuries, each with approximately 15% of total specifications. A small number of examples
(19/4783 ≈0.4%) specify future dates; for example, there are four occurrences of the year
‘5640’.
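Aggregating 'YEAR' values by century can be sketched as follows. This is an illustrative assumption about the procedure, not the study's documented method: the cut-off year, the convention that a year such as 1900 belongs to the 19th century, and all sample values other than '5640' are hypothetical.

```python
from collections import Counter

CUTOFF_YEAR = 2012  # assumption: later years are treated as 'Future'


def century_label(year: int) -> str:
    """Map a year (1 CE or later) to a century label such as '20th'."""
    if year > CUTOFF_YEAR:
        return "Future"
    century = (year - 1) // 100 + 1
    if 10 <= century % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(century % 10, "th")
    return f"{century}{suffix}"


# Hypothetical 'YEAR' values, apart from 5640, which is reported in the text.
years = [1914, 1918, 1650, 1599, 2012, 5640, 1900]
by_century = Counter(century_label(year) for year in years)
print(by_century)
```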
[Figure 4 chart values: Sound (C.) 6%; Text (C.) 36%; Video (C.) 9%; Image (C.) 44%; 3D 0%; Unknown 5%]
Figure 5: A chart showing frequency of usage (aggregated by century) for the Filters Dataset
‘YEAR’ filter
Tables 2 and 3 show usage frequencies for the ‘LANGUAGE’ and ‘COUNTRY’ filters. Both
show only the ten most popular options; in total, there are 32 ‘LANGUAGE’ options (28
distinct, accounting for ‘Corrected’ options, including an ‘Unknown’ option) and 40
‘COUNTRY’ options (33 distinct, including an ‘Unknown’ option and amalgamating the
multi-country options ‘united kingdom’ and ‘uk’). These filters are utilised to a similar
degree, with 2447 and 2936 occurrences respectively. Languages are identified from their
abbreviations using the Library of Congress Codes for the Representation of Names of
Languages (Library of Congress, 2010).
[Figure 5 axes: x-axis ‘Century’ (Future; 21st (to 2012); 20th down to 1st; Unknown); y-axis ‘Frequency’ (0 to 1600)]
Language   Language Name (quoted from:          Frequency
           Library of Congress (2010: n.p.))
fr (C.)    “French”                             530
de (C.)    “German”                             475
es (C.)    “Spanish; Castilian”                 339
mul        “Multiple Languages”                 203
pl         “Polish”                             144
en         “English”                            143
nl         “Dutch; Flemish”                     126
Unknown    Unknown                              75
it         “Italian”                            67
sl         “Slovenian”                          53
Table 2: The ten most popular options specified using the ‘LANGUAGE’ filter
Country or Country Group Frequency
Germany 830
Belgium 249
Austria (C.) 249
France 218
United Kingdom/UK (C.) 200
The Netherlands (C.) 170
Europe 168
Spain (C.) 167
Poland (C.) 106
Unknown 71
Table 3: The ten most popular options specified using the ‘COUNTRY’ filter
Table 2 shows that French, German and Spanish are the most popular languages by a
significant margin; a ‘tail’ of less popular languages is visible even within this limited
selection. The most frequently specified countries (Table 3) are Germany, Belgium and
Austria. With the exception of the notably high figure for ‘Germany’, which is over three
times greater than that for ‘Belgium’, there appears to be less variation between country
specification frequencies compared to the language options.
Tables 2 and 3 show six clearly identifiable language/country pairs, summarised below
(Table 4), thus illustrating a reasonably strong relationship between usage patterns of these
filters in the Filters Dataset.
Language Name (quoted from:          Rank   Country or Country Group   Rank   Rank Difference
Library of Congress (2010: n.p.))                                             (Language – Country)
“French” (C.)                        1      France                     4      -3
“German” (C.)                        2      Germany                    1      1
“Spanish; Castilian” (C.)            3      Spain (C.)                 8      -5
“Polish”                             5      Poland (C.)                9      -4
“English”                            6      UK (C.)                    5      1
“Dutch; Flemish”                     7      The Netherlands (C.)       6      1
Table 4: Language/Country pairs and associated ranks from Tables 2 and 3
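The rank differences in Table 4 follow directly from the ranks in Tables 2 and 3. As a minimal sketch (the dictionary and variable names are illustrative assumptions), they can be recomputed as follows:

```python
# Ranks within the top-ten lists of Table 2 (language) and Table 3 (country).
language_rank = {"French": 1, "German": 2, "Spanish; Castilian": 3,
                 "Polish": 5, "English": 6, "Dutch; Flemish": 7}
country_rank = {"France": 4, "Germany": 1, "Spain": 8,
                "Poland": 9, "UK": 5, "The Netherlands": 6}

# Language/country pairings as given in Table 4.
pairs = [("French", "France"), ("German", "Germany"),
         ("Spanish; Castilian", "Spain"), ("Polish", "Poland"),
         ("English", "UK"), ("Dutch; Flemish", "The Netherlands")]

# Rank difference = language rank minus country rank.
rank_difference = {
    (language, country): language_rank[language] - country_rank[country]
    for language, country in pairs
}

for (language, country), diff in rank_difference.items():
    print(f"{language} / {country}: {diff:+d}")
```

A negative difference indicates that the language is ranked higher (closer to 1) than the corresponding country.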
Interestingly, Table 4 shows no exactly correspondent ranks. There are particular
discrepancies between ‘Spanish/Spain’, ‘Polish/Poland’ and ‘French/France’, each with
language ranked higher than country. This potentially illustrates how, whilst interlinked,
usage of these filters is also likely to be influenced by factors like available content. This is
related in turn to ‘PROVIDER’ specifications, with ‘PROVIDER’ the second most popular
Filters Dataset filter (Figure 3). The ten most frequently specified providers are given in
Table 5.
Provider Name                                                            Frequency
Koninklijk Instituut voor het Kunstpatrimonium (KIK) [Brussel, België]   2257
moteur Collections                                                       501
Athena                                                                   282
The European Library (C.)                                                255
Institut National de l’Audiovisuel                                       181
Europeana 1914-1918 (C.)                                                 133
Nationaal Archief                                                        131
Musée Royal de Mariemont                                                 101
Erfgoedplus.be                                                           96
Svenska litteratursällskapet i Finland                                   78
Table 5: The ten most popular options specified using the ‘PROVIDER’ filter
Table 5 shows particularly high specification of French and Belgian providers in the Filters
Dataset. For example, the Belgian ‘Koninklijk Instituut voor het Kunstpatrimonium (KIK)’ is
over four times more frequently specified than the next most popular provider, perhaps
impacting Belgium’s high position in Table 3. ‘PROVIDER’ specifications also illustrate how
Europeana brings together existing aggregators and portals like ‘Athena’ and ‘The
European Library’ (see, e.g. Cousins et al., 2008).
4.2 Popular Query Analysis
4.2.1 Dataset Characteristics
The Popular Queries dataset (Section 3.2.1) contains query text and frequencies for the 150
most frequent queries submitted to Europeana between 01/01/2012-30/06/2012. Query
text includes both free text and ‘selected’ queries. Indeed, the most popular query is ‘*:*’
(frequency = 301,419), which appears from the researcher’s exploration of Europeana to
arise following selection of certain named providers via the ‘By provider’ option of
Europeana’s ‘Explore’ function (Europeana, [2012?]e: n.p.).
The lowest query frequency is 880, giving a frequency range of 300,539. Excluding the ‘*:*’
query, which is potentially anomalous, the frequency range is 22,885, distributed as shown
in Figure 6.
Figure 6: Popular Queries dataset frequency distribution, excluding the top-ranked result
Figure 6 shows a clearly long-tailed and relatively smooth frequency distribution, even
within the most popular Europeana queries; the frequency range is 21,230 between the
query ranks 2-20 alone, compared with only 1655 between the remaining query ranks 20-
150.
4.2.2 Query ‘Selections’ and Language
Based on the dataset of 150 queries (rather than absolute query frequencies), over half of
the popular queries (86/150 or 57.3%) involve query selections, rising to 58% when the
query ‘*:*’ is included. The numbers of queries that employ particular selections are shown
in Figure 7.
[Figure 6 axes: x-axis ‘Query Rank’ (0 to 160); y-axis ‘Frequency’ (0 to 25,000)]
Figure 7: Usage frequencies for different query ‘selections’ in the Popular Queries dataset
Figure 7 shows that the most common selections relate to Europeana providers and
collections. Of those selections that have potentially been entered as free text, ‘what’
clearly has the highest frequency. This is also combined in four cases with ‘dc_type’, where
‘dc’ refers to ‘Dublin Core’, an established and prominent metadata scheme (see, e.g.
Kirchoff et al., 2008).
Query languages are also considered, although language identification could be difficult.
Place, personal and collection names and named works are therefore excluded from the
analysis, alongside language-ambiguous queries (e.g. ‘synagoge’ = ‘Synagogue’, potentially
German, Dutch or French). However, this approach has the limitation of excluding some
less ambiguous queries like ‘wien’ (=‘Vienna’ in German) and ‘maria maddalena’ (=‘Mary
Magdalene’ in Italian), meaning that absolute language frequencies are likely to be
significantly higher than those shown in Figure 8.
[Figure 7 axes: x-axis ‘'Selection' Name’; y-axis ‘Frequency’ (0 to 30)]
Multiple counts (one per language, excluding names) are included for the multilingual
query ‘sprookjes OR fairy tales OR grimm OR Perrault OR "Contes des fees" OR "basn" OR
"fiaba"’.
Figure 8: Language frequencies for the Popular Queries dataset
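For the multilingual query noted above, the separate terms must be identified before each can be assigned a language. The sketch below covers only the splitting step; the function name is hypothetical, and language identification itself remains a manual coding decision, as in the study.

```python
def split_or_query(query: str) -> list[str]:
    """Split a query on the upper-case OR operator and strip speech marks."""
    terms = [term.strip().strip('"') for term in query.split(" OR ")]
    return [term for term in terms if term]


# The multilingual query quoted in the text.
query = ('sprookjes OR fairy tales OR grimm OR Perrault OR '
         '"Contes des fees" OR "basn" OR "fiaba"')
terms = split_or_query(query)
print(terms)
```

Names such as 'grimm' and 'Perrault' would then be set aside before counting, in line with the exclusions described in this section.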
The popularity of French and German fits well with Table 2 concerning the Filters Dataset,
although English queries are comparatively more frequent here. However, the datasets are
not directly comparable, since Figure 8 represents manual coding of query text rather than
analysis of filter usage.
4.3 Random Sample Query Analysis
4.3.1 Dataset Characteristics
Two Random Queries datasets (Section 3.2.1) each contain query text and frequencies for
150 random queries submitted to Europeana between 01/01/2012-30/06/2012. As above,
query text includes both free text and ‘selected’ queries. The frequency range for the first
sample is 229 and for the second sample 149, with frequency distributions as shown in
Figure 9.
[Figure 8 axes: x-axis ‘Language’ (English, German, French, Dutch, Spanish, Italian); y-axis ‘Frequency’ (0 to 16)]
Figure 9: Frequency distributions for both Random Queries datasets
Figure 9 shows relatively smooth and closely-corresponding frequency distributions for the
two samples, which both decline steeply and end in long tails of single-occurrence queries.
The overall distribution therefore has a similar pattern to Figure 6 for the Popular Queries
dataset, but on a much smaller frequency scale.
4.3.2 Query ‘Selections’ and Language
The Random Queries datasets contain far fewer ‘selected’ queries than the Popular Queries
dataset, totalling 12.7% (19/150) and 22% (33/150) respectively. The numbers of queries
with particular selections are shown in Figure 10.
[Figure 9 axes: x-axis ‘Query Rank’ (0 to 200); y-axis ‘Frequency’ (0 to 250); series: Sample 1 and Sample 2]
Figure 10: Usage frequencies for different query ‘selections’ in the Random Queries
datasets
Figure 10 shows that the most common query ‘selections’ for both datasets are ‘what’ and
‘who’, both potentially free-text selections; indeed, one Sample 2 query specifies ‘Quoi:’ (=
‘what’ in French), suggesting free-text input. Contrasting with the Popular Queries dataset
(Section 4.2.2), Europeana providers and collections are very rarely specified, for example
with ‘europeana_provider OR europeana_country’ (see: Figure 7) not occurring in either
random sample. However, other ‘selections’ occur amongst the random samples that are
not present in the Popular Queries dataset, including ‘when’, ‘subject’ and
‘europeana_rights’.
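Identifying query 'selections' in query text can be sketched as a simple prefix check. This is an assumption about how such queries appear in the logs (a field name followed by a colon, as in 'what:World War One'); the pattern, the function name and the value in the 'Quoi:' example are hypothetical.

```python
import re

# Assumption: a 'selection' is a field name followed by a colon at the start
# of the query text. The actual log format may differ.
SELECTION_PATTERN = re.compile(r"^\s*([A-Za-z_*]+)\s*:\s*(.*)$")


def parse_selection(query: str):
    """Return (field, value) if the query starts with 'field:', else None."""
    match = SELECTION_PATTERN.match(query)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


# 'Quoi: paris' uses a hypothetical value; the text reports only the prefix.
examples = ["what:World War One", "Quoi: paris", "*:*", "faust"]
for example in examples:
    print(example, "->", parse_selection(example))
```

A check of this kind cannot distinguish free-text 'selections' from ordinary queries that happen to contain a colon (for instance, a title with a subtitle), so manual review would still be needed.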
Query languages are classified as in Section 4.2.2, with the additional consideration of
queries (e.g. ‘Tractatus qui de varietate astronomiae intitulatur’ = Latin) that appear to
represent (albeit unknown) named works and are therefore also excluded from the analysis
(Figure 11).
[Figure 10 axes: x-axis ‘'Selection' Name’; y-axis ‘Frequency’ (0 to 14); series: Sample 1 and Sample 2]
Figure 11: Language frequencies for the Random Queries datasets
The total number of language-identifiable queries is similar for both datasets: 42 and 44
respectively, comparable to the 39 examples identified from the Popular Queries dataset
(Section 4.2.2). Similarly, the three most popular languages are English, French and German
in all three cases, although Sample 1 (Random Queries) contains slightly more German than
French queries. Both random query samples contain a greater variety of languages than the
frequent queries; for example, at least one Polish, Portuguese and Norwegian query occurs
in each dataset (Figure 11).
4.3.3 Query Classification: Final Scheme Refinement
The query classification scheme was refined based on query subjects/topics in the Random
Queries datasets. New categories emerging (e.g. ‘Sport’) were primarily Secondary or
Tertiary rather than Primary Categories. However, there was extensive restructuring, in
particular splitting some Primary Categories that had become overly large with the addition
of new Secondary/Tertiary Categories. For example, ‘The Arts’ was split into Primary
Categories ‘Arts and Design’, ‘Literature and Poetry’ and ‘Music, Film and Theatre’, whilst
new Primary Category ‘Lifestyle and Entertainment’ was separated from ‘Politics and
Society’.
[Figure 11 axes: x-axis ‘Language’; y-axis ‘Frequency’ (0 to 16); series: Sample 1 and Sample 2]
Conversely, the distinction between online/physical collections was removed from ‘Named
Collections – Other Collections’, since it was felt that this was not meaningful given the
large number of collections with both a physical and an online presence. It was also decided
that ‘Object or Form Descriptors’ should refer only to formats specified using query
‘selections’ (e.g. ‘what:text’ from the Popular Queries dataset), recognising the difficulty of
determining whether non-specified queries like ‘film’ refer to subjects or desired results
formats. The aim is therefore to avoid overlap with subject categories, meaning that
queries classified as ‘Object or Form Descriptors’ do not receive additional subject
classifications.
The final query classification scheme developed for this study is summarised in Table 6.
Green highlights show a small number of terms drawn from queries themselves, either
directly or in translation (e.g. ‘Military Tribunals’ from the query ‘tribunal militar’), that are
considered particularly appropriate for representing classification scheme topics or
subjects.
Primary Categories, each followed by its Secondary and Tertiary Categories:

Primary Category: Philosophy, Mythology and Religion
Secondary/Tertiary Categories: Philosophy; Mythology; Religion; Named Figures; Ideas and
Concepts; Folk and Fairy Tales; Legends; Classical Philosophy, Mythology and Religion;
Theology and Religious History; Named Religions and Religious Groups; Named Figures:
Ministers and Officials; Named Figures: Religious Texts; Festivals and Ceremonies;
Iconography and Objects; Religious Buildings, Locations and Communities

Primary Category: Place, Civilisation and Travel
Secondary/Tertiary Categories: Geographical Features or Regions; Countries and Settlements;
Travel; Civilisation or Culture; Country; City: Capital; City: Other; Municipality, Town or
Village; Specified Address; Island (Inhabited); Region or Administrative Region; Maps and
Travel Guides; Languages; Historical Place Names; Ancient and Classical Civilisation and
Culture

Primary Category: Politics and Society
Secondary/Tertiary Categories: Named Figures; Political Leaders and Politicians; Royalty and
Nobility; News; Law and Crime; Amenities and Facilities; History and Social Change;
Organisations and Societies; Civil Ceremonies and Events; International Relations; Named
Newspapers; Journalism; History of Crime; Copyright; Housing; Hospitals and Healthcare;
Libraries; Schools and Education; Marriage; Political Agreements

Primary Category: Military and Military History
Secondary/Tertiary Categories: Named Figures; Military Engagements; Procedure and Discipline;
Military Objects; Military Leaders and Personnel; Prisoners of War; Historical Figures;
Strategy and Tactics; Treaties and Agreements; World Wars; Military Tribunals; Military
Records; Buildings, Locations and Bases; Transport; Weapons and Equipment

Primary Category: Lifestyle and Entertainment
Secondary/Tertiary Categories: Entertainment and Events; Transport; Sport; Computing; Fashion
and Beauty; Advertising; Performances; Exhibitions; Arcades; Road; Rail; Air; Other; Named
Sports and Sports Clubs; Sporting Events; Equipment; Social Media

Primary Category: Arts and Design
Secondary/Tertiary Categories: Named Figures; Named Works or Subjects; Artistic Periods,
Styles or Movements; Genres; Creators or Designers; Collectors; History of Art; Classical
Art; Portrait; Landscape; Painting, Drawing and Illustration; Engraving and Printing;
Photography; Stamps; Bookplates; Postcards; Ceramics, Enamel, Pottery and Glass; Sculpture
and Figurines; Fashion, Clothing and Jewellery; Other

Primary Category: Literature and Poetry
Secondary/Tertiary Categories: Named Figures; Named Works or Subjects; Literary Periods,
Styles or Movements; Genres; Authors and Editors; Publishers; Classical Literature; Poetry;
Literature (Fiction); Literature (Non-Fiction); Ephemera

Primary Category: Music, Film and Theatre
Secondary/Tertiary Categories: Named Figures; Named Works or Subjects; Periods, Styles or
Movements; Instruments and Equipment; Genres; Creators or Composers; Performers; Other; Folk
Music; Musical Instruments; Music; Film; Theatre

Primary Category: Architecture, Buildings and Structures
Secondary/Tertiary Categories: Named Figures; Architectural Periods, Styles or Movements;
Genres; Architects; Landscape Architecture; Castles, Palaces, Religious Buildings and
Monuments; Civic Buildings, Housing and Businesses; Engineering Structures

Primary Category: Sciences
Secondary/Tertiary Categories: Named Figures; Genres; Historical Figures; Natural History and
Biology (Non-Human); Animal Husbandry and Food Science; Human Biology and Medicine;
Archaeology; Anthropology; Geography and Cartography; Physics and Astronomy; Technology

Primary Category: Business and Industry
Secondary/Tertiary Categories: Named Companies or Manufactories; Named Products and
Advertising; Named Industries; Patents; Mining and Resource Extraction; Construction and
Manufacturing Industries

Primary Category: Generic Subjects
Secondary/Tertiary Categories: Person; Place; Object; Time; Other; Date

Primary Category: Named Collections
Secondary/Tertiary Categories: Libraries and Archives; Museums and Galleries; Other
Collections; Portals and Aggregators; Geographical Designations

Primary Category: Object or Form Descriptors
Secondary/Tertiary Categories: Europeana Query ‘Selections’: Format

Primary Category: Ambiguous or Unclear
Secondary/Tertiary Categories: Person; Place; Computing Functionality or Search Feature; Other
Table 6: A summary of the final study classification scheme following refinement based on
analysis of Random Queries datasets
4.3.4 Classification Scheme Mapping
To enhance its practical applicability, mapping between the study classification scheme
(Table 6) and existing schemes is also considered. As noted in Section 3.2.2, exploratory
mapping of an earlier version of the scheme suggested that ‘Library of Congress Subject
Headings’ (Library of Congress, [2012?]) would be more suitable for mapping the
Secondary/Tertiary Categories than Primary Categories (Appendix 2). The effectiveness of
mapping between Primary Categories and the broader, top-level headings of a Web-based
scheme, namely ‘Yahoo! Directory’ (Yahoo, Inc., 2012), is therefore considered instead. As
before, Secondary and Tertiary headings are considered in relation to the ‘LCSH’ scheme
(Library of Congress, [2012?]). The resulting potential mapping terms are given in Appendix
4.
Mapping between Primary Categories and ‘Yahoo! Directory’ (Yahoo, Inc., 2012) headings is
of mixed effectiveness. Although some Primary Categories (e.g. ‘Sciences’, ‘Business and
Industry’) have clear equivalents in the Web scheme, the majority are either too broad (e.g.
‘Politics and Society’) or too narrow (e.g. ‘Literature and Poetry’) to map successfully. This
suggests that the character of cultural heritage information, at least as revealed through
Europeana querying patterns, is indeed distinct from that of general Web information
resources. However, mapping between these schemes may be more feasible at narrower
levels of classification, which are not considered here.
In contrast, mapping between Secondary/Tertiary Categories and ‘LCSH’ (Library of
Congress, [2012?]) is generally effective; indeed, the majority of categories have direct
equivalents. Discrepancies primarily occur where categories like ‘Named Figures’ remain
too broad for direct mapping to ‘LCSH’ (Library of Congress, [2012?]), which is not
considered surprising given the small scale of this study. As such, library-based schemes
may remain more suitable than general Web schemes as a basis for classifying cultural
heritage information in online environments.
4.4 Comparison of Frequent and Random Query Classification
Patterns
As noted in Section 2.5, frequent and rare queries often have different characteristics (see,
e.g. Gabrilovich et al., 2009). The study classification scheme is therefore utilised here to
compare the topics of popular versus random queries. Given the large differences between
query frequencies (Figures 6 and 9), it is considered feasible to approximate rare queries
with the Random Queries datasets. Classification uses the Primary Categories of the
scheme (Table 6), with results as shown in Table 7.
Primary Category                              Popular       Random          Random          Random
                                              Queries (%)   Queries 1 (%)   Queries 2 (%)   Queries (Mean %)
Philosophy, Mythology and Religion            3.53          7.56            8.61            8.09
Place, Civilisation and Travel                22.4          19.6            15.3            17.5
Politics and Society                          2.94          8.00            7.18            7.59
Military and Military History                 3.53          5.78            2.87            4.33
Lifestyle and Entertainment                   1.76          4.44            2.87            3.66
Arts and Design                               10.6          10.2            14.4            12.3
Literature and Poetry                         0.588         8.89            8.61            8.75
Music, Film and Theatre                       3.53          2.67            2.39            2.53
Architecture, Buildings and Structures        1.76          2.67            6.22            4.45
Sciences                                      2.35          5.78            3.83            4.81
Business and Industry                         1.76          2.22            2.87            2.55
Generic Subjects                              1.18          7.56            4.31            5.94
Collections, Organisations and Institutions   31.2          3.11            2.39            2.75
Object or Form Descriptors                    11.8          0.444           0.478           0.461
Ambiguous or Unclear                          1.18          11.1            17.7            14.4
TOTAL                                         100           100             100             100
Table 7: Categorisation percentages for queries in different study datasets using the
Primary Categories of the study classification scheme
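The 'Random Queries (Mean %)' column and the gap between popular and random queries can be derived from the other columns of Table 7. The following is a minimal sketch for three categories; the variable names are illustrative assumptions, and because the inputs are already rounded, results may differ slightly from values computed on unrounded data.

```python
# Percentages from Table 7: (Popular, Random Queries 1, Random Queries 2).
table7 = {
    "Collections, Organisations and Institutions": (31.2, 3.11, 2.39),
    "Ambiguous or Unclear": (1.18, 11.1, 17.7),
    "Object or Form Descriptors": (11.8, 0.444, 0.478),
}

summary = {}
for category, (popular, random1, random2) in table7.items():
    random_mean = (random1 + random2) / 2
    # Gap in percentage points between popular and mean random percentages.
    summary[category] = (random_mean, abs(popular - random_mean))

for category, (random_mean, gap) in summary.items():
    print(f"{category}: random mean {random_mean:.3f}%, gap {gap:.2f} points")
```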
The primary limitation of this comparison is the difficulty of accurate classification itself,
although it is felt that this approach does at least enable clear querying patterns to emerge.
Categories are not mutually exclusive, meaning that queries with multiple subjects/topics
receive one count per category. As such, the total number of categorisations per dataset
indicates the comparative complexity of the queries; for example, the 150 most popular
queries have 170 categorisations overall, whilst the random samples appear more complex,
with 225 and 209 categorisations respectively.
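Because categories are not mutually exclusive, percentages are calculated against the total number of categorisations rather than the number of queries. The sketch below illustrates this with hypothetical coded queries; the query labels and category assignments are invented for illustration and are not drawn from the study datasets.

```python
from collections import Counter

# Hypothetical coded queries: each may receive several Primary Categories.
coded_queries = {
    "query A": ["Place, Civilisation and Travel"],
    "query B": ["Arts and Design", "Place, Civilisation and Travel"],
    "query C": ["Ambiguous or Unclear"],
    "query D": ["Military and Military History", "Politics and Society"],
}

# One count per category per query.
category_counts = Counter(
    category for categories in coded_queries.values() for category in categories
)
total_categorisations = sum(category_counts.values())

# Percentages use total categorisations as the denominator.
percentages = {
    category: 100 * count / total_categorisations
    for category, count in category_counts.items()
}

# A simple indicator of comparative query complexity.
categorisations_per_query = total_categorisations / len(coded_queries)
print(percentages, categorisations_per_query)
```

Applied to the totals reported above, the same ratio gives roughly 1.13 categorisations per query for the popular queries (170/150), against 1.5 (225/150) and about 1.39 (209/150) for the two random samples.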
Table 7 shows that classification percentages for the random samples are generally similar.
The largest discrepancies are for ‘Ambiguous or Unclear’ (6.6 percentage points), ‘Place,
Civilisation and Travel’ (4.3) and ‘Arts and Design’ (4.2); ‘Place, Civilisation and Travel’ is
more frequent in Sample 1 and the other two in Sample 2. These are also the most frequent
categories overall for the Random Queries datasets. For the Popular Queries, the most
common categories are ‘Collections, Organisations and Institutions’, ‘Place, Civilisation
and Travel’ and ‘Object or Form Descriptors’.
Between the popular and random (mean %) queries, the largest discrepancies are for
‘Collections, Organisations and Institutions’ (28.5% difference), ‘Ambiguous or Unclear’
(13.2%) and ‘Object or Form Descriptors’ (11.3%). This is likely to reflect the much lower
use of query ‘selections’ in the random samples (Section 4.3.2), which accounts for most of
the collection and form-based categorisations amongst the frequent queries. Comparative
rankings of classification categories for the different datasets are given in Table 8; cell
shading represents categories within the datasets that have equal ranks (i.e. equal
categorisation percentages).
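The ranking in Table 8 can be approximated by sorting the Table 7 percentages and flagging equal values. The sketch below does this for the Popular Queries column; it is illustrative only, and since it relies on rounded percentages, ties are inferred rather than confirmed against the underlying counts.

```python
# Popular Queries percentages from Table 7, in table order.
popular = {
    "Philosophy, Mythology and Religion": 3.53,
    "Place, Civilisation and Travel": 22.4,
    "Politics and Society": 2.94,
    "Military and Military History": 3.53,
    "Lifestyle and Entertainment": 1.76,
    "Arts and Design": 10.6,
    "Literature and Poetry": 0.588,
    "Music, Film and Theatre": 3.53,
    "Architecture, Buildings and Structures": 1.76,
    "Sciences": 2.35,
    "Business and Industry": 1.76,
    "Generic Subjects": 1.18,
    "Collections, Organisations and Institutions": 31.2,
    "Object or Form Descriptors": 11.8,
    "Ambiguous or Unclear": 1.18,
}

# Sort descending; Python's sort is stable, so ties keep table order.
ordered = sorted(popular.items(), key=lambda item: item[1], reverse=True)
values = [value for _, value in ordered]

# Categories sharing a percentage with at least one other category.
tied = {category for category, value in ordered if values.count(value) > 1}

for rank, (category, value) in enumerate(ordered, start=1):
    marker = " (tied)" if category in tied else ""
    print(f"{rank:2d}  {category}: {value}%{marker}")
```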
Rank        Popular Queries                               Random Queries 1                              Random Queries 2
(1 = high)
1           Collections, Organisations and Institutions   Place, Civilisation and Travel                Ambiguous or Unclear
2           Place, Civilisation and Travel                Ambiguous or Unclear                          Place, Civilisation and Travel
3           Object or Form Descriptors                    Arts and Design                               Arts and Design
4           Arts and Design                               Literature and Poetry                         Philosophy, Mythology and Religion
5           Philosophy, Mythology and Religion            Politics and Society                          Literature and Poetry
6           Military and Military History                 Philosophy, Mythology and Religion            Politics and Society
7           Music, Film and Theatre                       Generic Subjects                              Architecture, Buildings and Structures
8           Politics and Society                          Military and Military History                 Generic Subjects
9           Sciences                                      Sciences                                      Sciences
10          Lifestyle and Entertainment                   Lifestyle and Entertainment                   Military and Military History
11          Architecture, Buildings and Structures        Collections, Organisations and Institutions   Lifestyle and Entertainment
12          Business and Industry                         Music, Film and Theatre                       Business and Industry
13          Generic Subjects                              Architecture, Buildings and Structures        Music, Film and Theatre
14          Ambiguous or Unclear                          Business and Industry                         Collections, Organisations and Institutions
15          Literature and Poetry                         Object or Form Descriptors                    Object or Form Descriptors
Table 8: Comparative ranks for queries in different datasets classified using the Primary
Categories of the study classification scheme
4.5 Queries Filtered by Media Type
As noted in Section 2.6, multimedia formats are a particularly distinctive feature of cultural
heritage information (see, e.g. Kirchoff et al., 2008). It is therefore considered interesting to
classify and compare some small samples of queries refined by Europeana’s media filtering
options (Europeana, [2012?]d). The dataset contains the 25 most popular queries specifying
each of four main options from 01.01.2012-30.06.2012 (Section 3.2.1), thus including some
overlap with other datasets.
Variation in query frequencies indicates the comparative popularity of different filtering
options. For example, the most frequent ‘Text’ query is ‘dagras’ (frequency 23,624), whilst
that filtered by ‘Sound’ is ‘*:*’, with a much lower frequency (1310). The latter is the same
top-ranked query as for the Popular Queries dataset (Section 4.2) and also the most
popular query for the other media types. Frequency distributions for the different filtering
options are shown in Figure 12.
Figure 12: Frequency distributions for the most popular queries specifying different
Europeana media-based filtering options (Europeana, [2012?]d)
Excluding the most frequent ‘Text’ query, which is potentially anomalous, the distributions
appear as shown in Figure 13.
[Figure 12 axes: x-axis ‘Query Rank’ (0 to 30); y-axis ‘Frequency’ (0 to 25,000); series: Image, Sound, Text, Video]
Figure 13: Frequency distributions for the most popular queries specifying different
Europeana media-based filtering options (Europeana, [2012?]d), excluding the top-ranked
‘Text’ query
Figure 13 shows that ‘Image’- and ‘Video’-filtered queries are most popular, whilst ‘Sound’
is specified much less frequently. Overall, the frequency distribution has a similar long-
tailed pattern to that of the larger datasets (Figures 6 and 9). However, all four distributions
also show ‘stepped’ components, particularly ‘Sound’, ‘Text’ and ‘Video’, whose ‘steps’
appear to overlap quite strongly at the higher query ranks.
The study classification scheme (Table 6) is also tested on these more narrowly-specified
queries, considering its applicability beyond the main datasets. The aim is to classify queries
without making any scheme alterations; queries that are difficult to classify are noted, thus
facilitating evaluation of the classification scheme (see: Section 5.2). Results are given in
Table 9.
[Figure 13 axes: x-axis ‘Query Rank’ (0 to 30); y-axis ‘Frequency’ (0 to 7000); series: Image, Sound, Text, Video]
Primary Category                              Media type:   Media type:   Media type:   Media type:
                                              “Image”       “Sound”       “Text”        “Video”
                                              Queries (%)   Queries (%)   Queries (%)   Queries (%)
Philosophy, Mythology and Religion            6.45          9.09          8.33          3.57
Place, Civilisation and Travel                38.7          30.3          44.4          46.4
Politics and Society                          9.68          6.06          5.56          0.00
Military and Military History                 12.9          3.03          2.78          0.00
Lifestyle and Entertainment                   6.45          0.00          2.78          7.14
Arts and Design                               3.23          3.03          0.00          3.57
Literature and Poetry                         0.00          0.00          5.56          0.00
Music, Film and Theatre                       6.45          36.4          5.56          14.3
Architecture, Buildings and Structures        3.23          3.03          5.56          3.57
Sciences                                      0.00          0.00          5.56          3.57
Business and Industry                         0.00          0.00          5.56          3.57
Generic Subjects                              3.23          0.00          0.00          3.57
Collections, Organisations and Institutions   3.23          6.06          2.78          7.14
Object or Form Descriptors                    3.23          0.00          2.78          0.00
Ambiguous or Unclear                          3.23          3.03          2.78          3.57
TOTAL                                         100           100           100           100
Table 9: Categorisation percentages for popular Europeana queries filtered by media type
(Europeana, [2012?]d) using the Primary Categories of the study classification scheme
As in Section 4.4, comparing total categorisations indicates comparative query complexity;
‘Video’ queries appear least complex, with 28 categorisations for the 25 queries, whilst
‘Text’ queries appear most complex, with 36 categorisations. ‘Place, Civilisation and Travel’
is the most popular category for all options except ‘Sound’, for which it is second most
popular after ‘Music, Film and Theatre’, reflecting the significant number of music-related
queries like ‘beethoven’ amongst ‘Sound’ queries. The popularity of ‘Image’ specifications
for ‘Military and Military History’ queries is also noticeable, with queries including
‘weltkrieg’ (=‘world war’ in German) and ‘what:World War One’. Results from Table 9 are
presented graphically in Figure 14.
Figure 14: Primary Category percentages for popular Europeana queries filtered by media
type (Europeana, [2012?]d)
Query ‘selections’ occur infrequently within this dataset, comprising (including ‘*:*’) six
‘Sound’, four ‘Image’ and ‘Text’, and one ‘Video’ query. Three of the ‘Sound’ selections,
which all concern music, involve an example that does not occur elsewhere in the dataset
(‘europeana_rights’), suggesting usage by a distinct user group specifying media types
when using Europeana for a specific purpose (see: Section 5.1).
[Figure 14 chart: vertical axis ‘Percentage of Total Category Designations’ (0 to 50); horizontal axis ‘Classification Scheme: Primary Categories’; series: Image, Sound, Text, Video]
Chapter 5: Discussion
5.1 Europeana Querying Patterns: Filters, Query ‘Selections’ and
Languages
Filters Dataset results (Section 4.1.2) indicate a predominance of filtering by media type,
showing strong concern amongst Europeana users for information format; this kind of
functionality is also noted by Cousins et al. (2008: 133) regarding ‘The European Library’
portal (see also: Hearst, 2009). The popularity of ‘Image’ filtering specifically (Figure 4)
accords with wider literature concerning cultural heritage information-seeking (see, e.g.
Kirchoff et al., 2008; Politou et al., 2004). Contrastingly, the low popularity of ‘3D’ filtering
perhaps reflects how 3D visualisations are still an emerging form of representation for
cultural heritage material online; for example, Meyer et al. (2007: 405) note their high
potential for portraying archaeological sites.
Analysis of the Media Types dataset (Section 4.5) nevertheless reveals different results:
‘Text’ queries are most popular by a significant margin (Figure 12). This result is potentially
skewed by the anomalously high frequency of the top-ranked query ‘*:*’, although the same query
does appear amongst the other media types. Interestingly, despite an overall more sharply
declining frequency distribution for ‘Video’ queries (Figure 13), the top-ranked ‘Image’
query and ‘Video’ query have similar frequencies, potentially contrasting with the low
occurrence of this option in the Filters Dataset (Figure 4), although the datasets are not
directly comparable. Indeed, whilst Nicholas and Clark (2012: 93) assert the popularity of
“video and sound” queries submitted to Europeana (see also: Europeana, [2011?]b),
‘Sound’ filtering in particular appears infrequent in both the Filters and Media Types
Datasets (Figure 4, Figure 13).
Additional specifications in the Filters Dataset correlate well with wider trends. For
example, CIBER Research Ltd. (2011: 8) note that Europeana has a particularly high number
of French users; Nicholas and Clark (2012: 92) similarly highlight “France and Germany”,
corroborating this study’s finding of high French and German ‘LANGUAGE’ filter
specifications (Table 2). These authors also note the comparatively low prominence of UK
Europeana usage (Nicholas & Clark, 2012: 92), supported by Filters Dataset findings (Tables
2, 3). English, again followed by French and German, does appear to be the most popular
language based on analysis of the Popular and Random Queries datasets (Figures 8, 11),
although this could reflect a limitation of the manual language coding strategy for these
datasets (Section 4.2.2).
It is considered likely that discrepancies between ‘LANGUAGE’ and ‘COUNTRY’ filtering
specifications (Table 4) reflect the latter’s greater dependence on providers (i.e. content
availability) rather than Europeana users (Section 4.1.2), for example with the high position
of ‘Belgium’ in Table 3 potentially reflecting the Belgian origin of the top-ranked provider
(Table 5). Providers can include sub-collections or projects, as exemplified by the query
‘europeana_provider:"Europeana 1914 - 1918" OR europeana_country:"Europeana 1914 -
1918"’, which refers to a Europeana initiative with high user participation concerning World
War One (see, e.g. Europeana, [2011?]b: 20). Bowler et al. (2011: 730) emphasise the
“value of story as an access method”, perhaps accounting for the presence of this query in
the Popular Queries dataset; indeed, ‘Europeana 1914-1918’ is also the sixth most popular
provider appearing in the Filters Dataset (Table 5).
Usage of the ‘YEAR’ filter is also reasonably high (Figure 3), whereas time-related queries
are quite infrequent, suggesting high user awareness of this filtering option. Where time-
based terms do occur in query text, they are often accompanied by other terms, as for the
query ‘school in 1800s’ (Random Queries dataset). Some query examples are therefore
quite specific, contrasting with Park et al.’s (2005: 215) assertion that “most Web users
want to get general information”. Nevertheless, other queries appear to support this
statement; for example, the most frequent ‘Video’ queries include ‘film’ (2nd), ‘filmy’ (9th),
‘video’ (16th) and ‘films’ (17th).
Although the most frequently occurring query languages (English, French and German) are
consistent across the Popular and Random Queries datasets, random query samples exhibit
a wider overall range of languages, including queries like ‘fabeloj’ (=‘stories’ in Esperanto)
(Section 4.3.2, Figure 11). A possible relationship between language and query type is also
apparent, with queries (almost exclusively in French) concerning Classical vases occurring
across the Popular and Random Queries datasets; these perhaps represent a distinct user
group, or particularly large amount of French-language material available on this topic (e.g.
‘un vase et un dieu ou hero’ = ‘a vase and a god or hero’, ‘Vase avec dieu grec’ = ‘vase with
a Greek god’, both in French).
Multilingual querying is rare, with only two clear examples that are both from the Popular
Queries dataset: ‘what: Einfamilienhaus/Villa’ (German/Ambiguous), which is potentially
either free-text or ‘selected’ via an object page, and the complex, free-text query
‘sprookjes OR fairy tales OR grimm OR Perrault OR "Contes des fees" OR "basn" OR
"fiaba"’ (Dutch, English, French, Unknown & Italian, plus names). Indeed, Purday (2009:
933) emphasises that multilingual search is a key area of concern for Europeana (see also:
Europeana, [2011?]a), reflecting the importance of language as a factor affecting the
accessibility of cultural heritage information generally (Agosti et al., 2012; Minelli et al.,
2007). This query, which is the 11th most frequent overall and also notable for its inclusion
of operators (e.g. ‘OR’), similarly occurs amongst the most frequent ‘Video’ (18th) and
‘Sound’ (22nd) queries, suggesting that acted or spoken fairy tales are a distinct and popular
type of cultural heritage material.
The presence of query ‘selections’ occasionally makes it difficult to identify querying
characteristics, in particular distinguishing searching and browsing behaviour (see: Section
4.1.1). Indeed, the most popular query overall (‘*:*’) appears to arise from provider-based
browsing (Section 4.2.1) and also occurs across the media type specifications (Section 4.5).
The importance of facilitating both search and browse is highlighted by several authors (see,
e.g. Agosti et al., 2012; Jansen, 2009; Levene, 2006; Nicholas & Clark, 2012); this study
indicates that Europeana users do indeed utilise different approaches to search for and access
content. For example, ‘who: giorgione?utm_source=blog’, from the Random Queries dataset,
appears to support the use of blogs as additional access points (see, e.g. CIBER Research
Ltd., 2011; Nicholas & Clark, 2012).
Although the presence of ‘selections’ potentially limits the study, since queries do not always
indicate free text entered by users, their analysis nevertheless provides an interesting
additional point of comparison between the different study datasets. For example,
collection/provider ‘selections’ occur much more frequently in the Popular Queries dataset
compared to the random samples (Sections 4.2.2, 4.3.2). This is reflected in the study
classification scheme, where the proportion of queries categorised as ‘Collections,
Organisations and Institutions’ is approximately ten times higher for popular versus random
queries (Table 7, Section 4.4). Indeed, ‘selection’ occurrences are significantly higher
overall amongst the frequent queries, which is likely to indicate greater browsing
behaviour; despite high interest in specific collections, users may not always have well-
defined information requirements, perhaps reflecting a distinct user group of potential
visitors (e.g. tourists) to associated physical collections.
In contrast, the ‘europeana_rights’ selection does not occur at all in the Popular Queries
dataset. Where it does occur, query construction is also quite complex, suggesting that
users are experienced and have quite specific search goals (e.g. ‘music AND
europeana_rights:*creative* AND NOT europeana_rights:*nc* AND NOT
europeana_rights:*nd*’). These queries are in fact amongst the most frequent ‘Sound’
queries in the Media Types dataset (Section 4.5), potentially representing a particular niche
user group; for example, CIBER Research Ltd. (2011: 4) highlight use of Europeana by “the
creative and information industries”. Overall, the distribution of ‘selection’ occurrences
therefore seems to confirm Nicholas and Clark’s (2012: 91) findings concerning different
Europeana user groups, in particular suggesting both a “consumer/leisure profile…[and] a
large academic following”.
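The structure of such queries can be made explicit by separating field prefixes (‘selections’) from Boolean operators. The Python sketch below is illustrative only: the regular expression, the operator list and the function name are assumptions made for this example, and do not describe how Europeana itself parses queries.

```python
import re

# Assumption: a field prefix is a run of letters or underscores directly
# followed by a colon, e.g. 'what:' or 'europeana_rights:'.
FIELD_PREFIX = re.compile(r"\b([A-Za-z_]+):")

# Assumption: only upper-case AND, OR and NOT are treated as operators.
OPERATORS = {"AND", "OR", "NOT"}


def describe_query(query: str) -> dict:
    """List the field prefixes and Boolean operators found in a query string."""
    prefixes = FIELD_PREFIX.findall(query)
    operators = [token for token in query.split() if token in OPERATORS]
    return {"prefixes": prefixes, "operators": operators}


rights_query = (
    "music AND europeana_rights:*creative* "
    "AND NOT europeana_rights:*nc* AND NOT europeana_rights:*nd*"
)
print(describe_query(rights_query))
print(describe_query("who: giorgione?utm_source=blog"))
```

Because the pattern requires letters before the colon, the wildcard query ‘*:*’ yields no prefix under this sketch and would need to be handled as a separate case.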
Both the Popular and Random Queries datasets exhibit long-tailed query frequency
distributions (Figures 6, 9), matching the profiles of query log datasets from previous
studies (see, e.g. Gabrilovich et al., 2009; Jansen et al., 2011). Operator usage is also very
low, which is consistent with wider literature (see, e.g. Gabrilovich et al., 2009; Jansen &
Pooch, 2001). Additionally, one random query (‘navarre + evreux’ = place names) employs
an operator that does not appear to be supported by Europeana (Europeana, [2012?]d),
implying that users may be unfamiliar with available search options. Other querying
characteristics nevertheless vary significantly between the datasets, illustrating the validity
of considering both frequent and random (or rare) query samples; when aggregated, rarer
queries represent a large volume of data (Gabrilovich et al., 2009; Jansen & Spink, 2006;
Silverstein et al., 1999).
It is not considered surprising that a far higher proportion of random versus frequent
queries are later classified as ‘Ambiguous or Unclear’ (Table 7, Section 4.4), since these are
generally rare queries, which “are…difficult to classify” (Gabrilovich et al., 2009: 20).
Indeed, over half of the examples in each Random Queries dataset only occur once.
Furthermore, whilst Ross and Wolfram (2000: 951) suggest that one-term queries
specifically are not actually very common, these comprise over one fifth of each random
dataset.
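The two proportions discussed above (queries occurring only once, and one-term queries) could be computed from a table of query frequencies as sketched below in Python. The query strings are taken from examples quoted in this chapter, but the frequencies are invented for illustration and are not study data; the function names are likewise assumptions.

```python
from typing import Mapping


def singleton_share(frequencies: Mapping[str, int]) -> float:
    """Proportion of distinct queries that occur exactly once."""
    singletons = sum(1 for count in frequencies.values() if count == 1)
    return singletons / len(frequencies)


def one_term_share(frequencies: Mapping[str, int]) -> float:
    """Proportion of distinct queries consisting of a single whitespace-delimited term."""
    one_term = sum(1 for query in frequencies if len(query.split()) == 1)
    return one_term / len(frequencies)


# Invented frequencies, for illustration only (not taken from the study datasets).
toy_log = {
    "fabeloj": 1,
    "school in 1800s": 1,
    "navarre + evreux": 1,
    "Vase avec dieu grec": 2,
}
print(singleton_share(toy_log), one_term_share(toy_log))
```

Counting terms by whitespace is a simplification: prefixed or punctuated queries may need different tokenisation.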
5.2 Classification Scheme Development
Classification scheme development is considered highly relevant for Europeana, since
classification schemes and taxonomies can support information-seeking through both
search and browse (see, e.g. Chaudhry & Jiun, 2005; Hearst, 2009). The final scheme
developed for this study has fifteen Primary Categories and is hierarchical, with three levels
(Table 6, Section 4.3.3). These aspects are broadly consistent with existing schemes,
including those for general Web SEs (see, e.g. Chuang & Chien, 2003b in Agosti et al., 2012:
671; Spink et al., 2002; Ross & Wolfram, 2000).
Gabrilovich et al. (2009: 14), who utilise a more detailed scheme, note that granularity must
reflect the classification purpose (see also: Section 3.1.2). In this study, greater complexity
is considered inappropriate for the intended scheme application, potentially reducing its
practical value for Europeana users. For this reason, Tertiary Categories like ‘City (Capital)’
and ‘Literature (Fiction)’ are not sub-divided into lower-level, Quaternary Categories, which
it is felt could overcomplicate the scheme. Indeed, although they focus primarily on results
presentation, Cousins et al. (2008: 13) suggest that “[e]ven academic researchers” prefer
relatively simple approaches to online information-seeking.
Form descriptors represent a particular problem, with queries like ‘music’ potentially
referring to either a subject or desired results format (see: Section 4.3.3). As an example,
Library of Congress schemes can support both alternatives, through the “LC Subject
Headings” and “LC Genre/Form Terms” (Library of Congress, [2012?]: n.p.). However, there
is insufficient contextual information to make this distinction for most queries in this study.
In the final scheme, queries with format-based ‘selections’ like ‘what:pdf’ are therefore
classified as ‘Object or Form Descriptors’, whilst non-‘selected’ queries receive subject-
based classifications. Even so, without related metadata like subsequent queries or session
information, these distinctions may sometimes be inaccurate.
Mapping between schemes, as shown in Appendices 2 and 4, is considered important for
improving the practical usefulness of this study’s query classification scheme for content
organisation and resource discovery (see, e.g. Walsh, 2011). It is also recognised that, while
ensuring an outcome tailored specifically to Europeana queries, the primarily inductive and
single-researcher approach adopted here for scheme development could lead to undue
representation of the researcher’s own assumptions and opinions (see, e.g. Jansen et al.,
2009; Kurth, 1993; Walsh, 2011). Mapping is therefore intended to help overcome this
limitation; similarly, relevant terms like ‘housing’ are drawn from queries themselves, thus
incorporating users’ own terminology (Chaudhry & Jiun, 2005; Li et al., 2012). Indeed,
Cousins et al. (2008: 137) argue that Europeana aims “to provide a user driven portal”, also
suggesting how “social tagging” (Cousins et al., 2008: 131) approaches can complement
traditional forms of information provision (see also: Bowler et al., 2011; Europeana,
[2011?]a; Europeana, [2012?]g).
As noted in Section 4.3.4, the classification scheme’s Primary Categories do not map well to
either ‘LCSH’ (Library of Congress, [2012?]) or the Web-based ‘Yahoo! Directory’ (Yahoo!
Inc., 2012). The latter was chosen to mirror Jansen et al.’s (2011: 491) query classification
using ‘Google Directory’. However, the organisation of Primary Categories appears specific
to Europeana. This is considered likely to reflect the portal’s wide topic range; Minelli et al.
(2007: 4), for example, find that “the characteristics of queries appeared to be influenced
by the subject domain” within cultural heritage, whilst Hargittai (2002: 1242) similarly
highlights “topic-specific search strategies”.
Nevertheless, the combination of Primary Categories emerging from Europeana query log
data and Secondary/Tertiary Categories mapped to ‘LCSH’ (Library of Congress, [2012?]) is
considered advantageous for potentially facilitating both general and more complex search
and browse. Walsh (2011: 331), for example, emphasises that “LCSH…enhances both
precision and recall when searching multiple digital collections at the same time”, making it
highly applicable to a portal like Europeana. However, this kind of scheme can be overly
complex for novice users, whilst domain specificity is an additional concern (see, e.g.
Chaudhry & Jiun, 2005; Kirchoff et al., 2008; Li et al., 2012; Walsh, 2011). Given Europeana’s
diverse user groups (see, e.g. Europeana, [2012?]c; Nicholas & Clark, 2012), the study
scheme aims to enable detailed description through mapping and combining categories,
rather than the creation of complicated headings, particularly since users may not have
precise or well-expressed information requirements (Cousins et al., 2008; Walsh, 2011). The
scheme therefore draws on both library/archive (e.g. terminology) and general Web-based
(e.g. broad top-level headings) classification practices (see, e.g. Jansen et al., 2011; Walsh,
2011).
Classification of the relatively complex query ‘architecture drawing of st paul’s, deptford’,
for example, would involve multiple relatively broad categories, thus allowing different
points of access: ‘Arts and Design – Genres – Painting, Drawing and Illustration’; ‘Place,
Civilisation and Travel – Countries and Settlements – City (Capital)’; ‘Philosophy,
Mythology and Religion – Religion – Religious Buildings, Locations and Communities’ and
‘Architecture, Buildings and Structures – Castles, Palaces, Religious Buildings and
Monuments’.
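This multi-category classification can be represented as a set of hierarchical paths, each offering a separate point of access. The Python sketch below is a minimal illustration, assuming that each classification is written as levels separated by ‘ – ’; the class and function names are hypothetical and are not part of the study scheme.

```python
from typing import List, NamedTuple, Optional


class CategoryPath(NamedTuple):
    """One classification: Primary, Secondary and (optionally) Tertiary Category."""
    primary: str
    secondary: str
    tertiary: Optional[str] = None


def parse_path(text: str) -> CategoryPath:
    """Split a 'Primary – Secondary – Tertiary' string into its levels."""
    parts = [part.strip() for part in text.split(" – ")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"expected two or three levels, got {len(parts)}")
    return CategoryPath(*parts)


def access_points(paths: List[CategoryPath]) -> List[str]:
    """Return the distinct Primary Categories, in order of first appearance."""
    seen: List[str] = []
    for path in paths:
        if path.primary not in seen:
            seen.append(path.primary)
    return seen


# Classifications given in the text for 'architecture drawing of st paul's, deptford'.
paths = [
    parse_path("Arts and Design – Genres – Painting, Drawing and Illustration"),
    parse_path("Place, Civilisation and Travel – Countries and Settlements – City (Capital)"),
    parse_path("Philosophy, Mythology and Religion – Religion – "
               "Religious Buildings, Locations and Communities"),
    parse_path("Architecture, Buildings and Structures – "
               "Castles, Palaces, Religious Buildings and Monuments"),
]
print(access_points(paths))
```

Keeping each path separate, rather than merging them into one compound heading, mirrors the scheme's aim of describing complex queries by combining relatively broad categories.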
Classification of Media Types dataset queries partly tests the degree of saturation reached
during classification scheme development (Section 4.5). Overall, the scheme works well and
all queries can be classified; the structure appears appropriate and it is felt that saturation
has been reached for the Primary and Secondary Categories. However, additional Tertiary
Categories could provide useful further descriptive power, such as expansion of ‘Music,
Film and Theatre’ to include ‘Classical Music’ and ‘Religious Music’ (e.g. for the query
‘cantor’) and expansion of ‘Sciences’ to include ‘Psychology’ (e.g. for the query ‘mania’).
Future research could therefore focus on expanding the scheme’s range of Tertiary
Categories; nevertheless, manual query classification may be insufficient for achieving this
given the difficulty of classifying short and often ambiguous queries, as noted above
(Gabrilovich et al., 2009: 20).
Several authors highlight the popularity of place-related queries in Web SE studies (see, e.g.
Jansen et al., 2011; Ross & Wolfram, 2000; Spink et al., 2002). This study supports these
findings: ‘Place, Civilisation and Travel’ is either the first or second most popular category
for all study datasets classified (Sections 4.4, 4.5). Levene (2006: 243) similarly notes that
location-based queries may be longer than other types; the results of this study suggest
that this could result from the incorporation of place into multi-topic queries like ‘l'islam en
france’ (=‘Islam in France’ in French). In addition, Ross and Wolfram (2000: 954) find that
common Web topic categories can include “Pictures” and “Organizations”, which is
corroborated here (see, e.g. Figure 4, Table 8), although datasets in this study differ in their
lack of “Sexuality” queries (Ross & Wolfram, 2000: 954).
Europeana querying patterns therefore exhibit both similarities and differences vis-à-vis
general Web querying. This probably reflects the overall “cross-domain” (Purday, 2009:
919) character of the service combined with its primary heritage focus (Europeana,
[2012?]b).
Chapter 6: Conclusion
6.1 Fulfilment of Study Objectives (Section 1.3.2)
Chapter 2 outlines background literature concerning Web-based information-seeking and
cultural heritage information, whilst Chapter 3 provides additional practical and theoretical
context for query log analysis, therefore fulfilling Objective 1.
Query refinement through search filters is investigated in Chapter 4 for a Filters Dataset
(Section 4.1) and – focusing on the most popular filter – a Media Types Dataset (Section
4.5). Query ‘selections’ are also considered across the different study datasets, giving a
more nuanced view of query specification and refinement in Europeana and thus also
contributing to the fulfilment of Objective 2.
In Chapters 3 and 4, a Popular Queries dataset (150 queries) and two Random Queries
datasets (each 150 queries) are analysed, including subject-based query classification and
classification scheme development. The final classification scheme is given in Table 6,
Section 4.3.3. Dataset characteristics, including classification patterns, are compared and
contrasted in Chapters 4 and 5. Although saturation is not entirely reached for the
scheme’s Tertiary Categories, its overall structure and Primary/Secondary Categories are
considered largely complete based on classification of queries from the Media Types
Dataset (Section 5.2), with sufficient flexibility to easily incorporate new Tertiary Categories
as required.
It is therefore felt that study Objectives 3, 4 and 5 have also been completed successfully.
Based on the data shown in Tables 7-9 and Figure 14, it is also concluded that the study
hypothesis of significant difference between classification profiles of frequent and random
queries, plus those filtered by different media types, is supported; even so, the largest
differences in both cases appear to rest primarily on a small number of classification
categories.
The practical application of results presented in this study for online cultural heritage
information provision is explored below (Section 6.2), thus considering the final study
Objective 6.
6.2 Recommendations for Cultural Heritage Information Provision
Agosti et al. (2012: 671), focusing on Web portals, state that log analysis can be applied “to
improve…structure and presentation”. Several recommendations for providing Web-based
cultural heritage information have arisen from this study, primarily concerning Europeana
but with relevance for other information providers.
Concerning content organisation, the predominance of subject-based searching (Section
3.2.2) indicates that subject should be incorporated systematically into object metadata,
both facilitating search and ensuring that content structuring reflects primary information
usage. Classification scheme development and mapping (Sections 4.3.3, 4.3.4) suggest that,
for Europeana, the combination of a tailored classification scheme and incorporation of
aspects from existing schemes to enable interoperability is likely to be most effective, an
approach noted by Walsh (2011: 333-334).
Information presentation and search functionalities are additional concerns, with several
authors emphasising the importance of interface design (see, e.g. Hearst, 2009; Jansen et
al., 2007; Levene, 2006). Europeana’s simple “search box” (Purday, 2009: 926), drawing
primarily on general Web models, is felt to be very effective and filtering options appear
well-used (Section 4.1.2) (see also: Nicholas & Clark, 2012). However, Meyer et al. (2007:
401) highlight the importance of user awareness of search functionalities (see also: Cooper,
2001). In this study, many ‘selections’ amongst the frequent queries appear to indicate
browsing behaviour via Europeana’s ‘Explore’ option (Europeana, [2012?]e), the
prominence or visibility of which could therefore potentially be increased compared to
other assistance options (Figure 1).
It is also felt that ‘Subject’ should be available as a main filtering option, perhaps based on a
scheme like this study’s Primary Categories (Table 6, Section 4.3.3), to further facilitate
browsing, complement existing specification options (see, e.g. Europeana, [2012?]d) and
allow users – particularly novice searchers – to easily resolve subject-ambiguous queries.
Concordia et al. (2010: 67) suggest the alternative approach of “contextual grouping of
results sets” in Europeana, locating this kind of information-seeking assistance at the
results rather than search stage (cf. Europeana, [2011?]b: 16).
6.3 Study Limitations and Areas for Future Research
This study has two primary limitations, centred on query classification and the presence of
query ‘selections’:
1. Query classification: focusing on query text and frequencies without further
contextual information from the query log or users made accurate classification
difficult and complicated the investigation of aspects like query language. Although
this study has aimed to overcome these difficulties by adopting consistent
approaches to classification, it could be enhanced by support from alternative data
sources (see: Section 3.1.1) and/or analysis of a larger query log dataset from a
longer time period.
2. Query ‘selections’: prefixes appearing in log data, defined here as query ‘selections’
(see: Section 4.1.1), were often unclear, with the difficulty of distinguishing search
and browse in particular increasing the complexity of drawing meaningful
conclusions from the study data. It would have been useful to ascertain from
Europeana staff exactly how ‘selections’ arise in the data, rather than relying on the
researcher’s potentially limited experimentation and personal experience of the
service.
The study could be extended in several different ways. For example, it could be repeated on
a broader scale by incorporating data like session information, interviews or surveys for
different user groups (see, e.g. Hearst, 2009; Jansen et al., 2007; Park et al., 2005).
Alternatively, a narrower study could focus solely on classifying initial queries without
query ‘selections’ or refinements, or consider aspects explored briefly here in more detail
(e.g. query languages or other filtering options). It would also be interesting to assess the
transferability of the study classification scheme beyond the Europeana case study to
queries from other cultural heritage organisations, or fields outside the cultural heritage
sphere.
WORD COUNT: 14,069
Programme of Study INFT03 MA Librarianship
Module Code INF6000 Dissertation
Student Registration Number 110134998
References
Agosti, M., Crivellari, F. & Di Nunzio, G. (2012). “Web log analysis: a review of a decade of
studies about information acquisition, inspection and interpretation of user interaction”.
Data Mining and Knowledge Discovery [Online], 24(3), 663-696.
http://www.springerlink.com/content/t36px9850w1u3877/?MUD=MP [Accessed 26 June
2012].
Bowler, L., Koshman, S., Oh, J.S., He, D., Callery, B.G., Bowker, G. & Cox, R.J. (2011). “Issues
in User-Centered Design in LIS”. Library Trends [Online], 59(4), 721-752.
http://dx.doi.org/10.1353/lib.2011.0013 [Accessed 10 June 2012].
Broder, A. (2002). “A taxonomy of web search”. ACM SIGIR Forum [Online], 36(2), 3-10.
http://dl.acm.org/citation.cfm?doid=792550.792552 [Accessed 7 June 2012].
Chaudhry, A.S. & Jiun, T.P. (2005). “Enhancing access to digital information resources on
heritage: A case of development of a taxonomy at the Integrated Museum and Archives
System in Singapore”. Journal of Documentation [Online], 61(6), 751-776.
http://dx.doi.org/10.1108/00220410510632077 [Accessed 25 February 2012].
CIBER Research Ltd. (2011). Europeana: Culture on the go [Online]. [Newbury, England:
CIBER Research Ltd.?]. Available from:
http://www.pro.europeana.eu/documents/858566/858665/Culture+on+the+Go [Accessed
5 June 2012].
Clough, P. (2009). Deliverable 4.1. TrebleCLEF Query Log Analysis Workshop Report [Online].
[Pisa: TrebleCLEF?]. Available from: http://ir.shef.ac.uk/cloughie/qlaw2009/D4.1-final.pdf
[Accessed 17 March 2012].
Concordia, C., Gradmann, S. & Siebinga, S. (2010). “Not just another portal, not just another
digital library: A portrait of Europeana as an application program interface”. IFLA Journal
[Online], 36(1), 61-69. http://ifla.sagepub.com/content/36/1/61 [Accessed 14 June 2012].
Cooper, A. (2008). “A survey of query log privacy-enhancing techniques from a policy
perspective”. ACM Transactions on the Web (TWEB) [Online], 2(4), Article 19: 1-27.
http://dx.doi.org/10.1145/1409220.1409222 [Accessed 14 February 2012].
Cooper, M.D. (2001). “Usage patterns of a web-based library catalog”. Journal of the
American Society for Information Science and Technology [Online], 52(2), 137-148. Available
from: http://onlinelibrary.wiley.com [Accessed 8 June 2012].
Cousins, J., Chambers, S. & van der Meulen, E. (2008). “Uncovering cultural heritage
through collaboration”. International Journal on Digital Libraries [Online], 9(2), 125-138.
http://www.springerlink.com/content/q1067151j02j248n/fulltext.pdf [Accessed 15 August
2012].
Eirinaki, M. & Vazirgiannis, M. (2003). “Web mining for web personalization”. ACM
Transactions on Internet Technology (TOIT) [Online], 3(1), 1-27.
http://dl.acm.org/citation.cfm?doid=643477.643478 [Accessed 9 June 2012].
Eldredge, J.D. (2004). “Inventory of research methods for librarianship and informatics”.
Journal of the Medical Library Association [Online], 92(1), 83-90. Available from:
http://web.ebscohost.com [Accessed 7 February 2012].
Europeana. (2010). The Europeana Public Domain Charter [Online]. The Hague:
Europeana.eu c/o the Koninklijke Bibliotheek. Available from:
http://pro.europeana.eu/web/guest/publications [Accessed 6 June 2012].
Europeana. [2011?]a. Strategic Plan 2011-2015 [Online]. [The Hague: Europeana.eu c/o the
Koninklijke Bibliotheek?]. Available from: http://pro.europeana.eu/web/guest/publications
[Accessed 5 June 2012].
Europeana. [2011?]b. Business Plan 2012 [Online]. [The Hague: Europeana.eu c/o the
Koninklijke Bibliotheek?]. Available from: http://pro.europeana.eu/web/guest/publications
[Accessed 6 June 2012].
Europeana. (2012). faust-Europeana-Search results [Online]. The Hague: Europeana.eu c/o
the Koninklijke Bibliotheek. http://www.europeana.eu/portal/search.html?query=faust
[Accessed 31 August 2012].
Europeana. [2012?]a. Europeana – Homepage [Online]. The Hague: Europeana.eu c/o the
Koninklijke Bibliotheek. http://www.europeana.eu/portal/ [Accessed 27 February 2012].
Europeana. [2012?]b. The Europeana Foundation [Online]. The Hague: Europeana.eu c/o
the Koninklijke Bibliotheek. http://pro.europeana.eu/web/guest/foundation [Accessed 15
August 2012].
Europeana. [2012?]c. Facts and Figures [Online]. The Hague: Europeana.eu c/o the
Koninklijke Bibliotheek. http://pro.europeana.eu/web/guest/about/facts-figures [Accessed
16 May 2012].
Europeana. [2012?]d. Searching Europeana [Online]. The Hague: Europeana.eu c/o the
Koninklijke Bibliotheek. http://www.europeana.eu/portal/usingeuropeana_search.html
[Accessed 16 August 2012].
Europeana. [2012?]e. Exploring Europeana [Online]. The Hague: Europeana.eu c/o the
Koninklijke Bibliotheek. http://www.europeana.eu/portal/usingeuropeana_explore.html
[Accessed 16 August 2012].
Europeana. [2012?]f. Results in Europeana [Online]. The Hague: Europeana.eu c/o the
Koninklijke Bibliotheek. http://www.europeana.eu/portal/usingeuropeana_results.html
[Accessed 16 August 2012].
Europeana. [2012?]g. Using My Europeana [Online]. The Hague: Europeana.eu c/o the
Koninklijke Bibliotheek.
http://www.europeana.eu/portal/usingeuropeana_myeuropeana.html [Accessed 16
August 2012].
Gabrilovich, E., Broder, A., Fontoura, M., Joshi, A., Josifovski, V., Riedel, L. & Zhang, T.
(2009). “Classifying Search Queries Using the Web as a Source of Knowledge”. ACM
Transactions on The Web [Online], 3(2, Article 5), 1-28.
http://doi.acm.org/10.1145/1513876.1513877 [Accessed 27 June 2012].
Glaser, B.G. & Strauss, A.L. (1967). The Discovery of Grounded Theory: Strategies for
Qualitative Research. New York: Aldine de Gruyter (A Division of Walter de Gruyter, Inc.).
Google. [2012?]. Google Translate [Online]. [Mountain View, CA: Google Inc.?].
http://translate.google.com/ [Accessed 9 August 2012].
Hargittai, E. (2002). “Beyond logs and surveys: In-depth measures of people’s web use
skills”. Journal of the American Society for Information Science and Technology [Online],
53(14), 1239-1244. http://onlinelibrary.wiley.com/doi/10.1002/asi.10166/pdf [Accessed 10
June 2012].
Hearst, M.A. (2009). Search User Interfaces. Cambridge: Cambridge University Press.
Hider, P. & Pymm, B. (2008). “Empirical research methods reported in high-profile LIS
journal literature”. Library & Information Science Research [Online], 30(2), 108-114.
http://dx.doi.org/10.1016/j.lisr.2007.11.007 [Accessed 7 February 2012].
Jansen, B.J. (2006). “Search log analysis: What it is, what’s been done, how to do it”. Library
and Information Science Research [Online], 28(3), 407-432.
http://www.sciencedirect.com/science/article/pii/S0740818806000673 [Accessed 8 June
2012].
Jansen, B.J. (2009). Understanding User-Web Interactions via Web Analytics [Synthesis
Lectures on Information Concepts, Retrieval, and Services #6 (Series ed. Gary Marchionini)].
[San Rafael, California?]: Morgan & Claypool Publishers.
Jansen, B.J. & Pooch, U. (2001). “A review of Web searching studies and a framework for
future research”. Journal of the American Society for Information Science and Technology
[Online], 52(3), 235-246. Available from: http://onlinelibrary.wiley.com [Accessed 9 June
2012].
Jansen, B.J. & Spink, A. (2006). “How are we searching the World Wide Web? A comparison
of nine search engine transaction logs”. Information Processing and Management [Online],
42(1), 248-263. http://www.sciencedirect.com/science/article/pii/S0306457304001396
[Accessed 7 June 2012].
Jansen, B.J., Spink, A., Blakely, C. & Koshman, S. (2007). “Defining a Session on Web Search
Engines”. Journal of the American Society for Information Science and Technology [Online],
58(6), 862-871. http://onlinelibrary.wiley.com/doi/10.1002/asi.20564/pdf [Accessed 14
February 2012].
Jansen, B.J., Taksa, I. & Spink, A. (2009). “Chapter I. Research and Methodological
Foundations of Transaction Log Analysis”. In: Jansen, B.J., Spink, A. & Taksa, I. (eds.). Handbook of
Research on Web Log Analysis. pp. 1-16. Hershey, PA: Information Science Reference (An
imprint of IGI Global).
Jansen, B.J., Liu, Z., Weaver, C., Campbell, G. & Gregg, M. (2011). “Real time search on the
web: Queries, topics, and economic value”. Information Processing and Management
[Online], 47(4), 491-506.
http://www.sciencedirect.com/science/article/pii/S0306457311000082 [Accessed 27 June
2012].
Kirchhoff, T., Schweibenz, W. & Sieglerschmidt, J. (2008). “Archives, libraries, museums and
the spell of ubiquitous knowledge”. Archival Science [Online], 8(4), 251-266.
http://dx.doi.org/10.1007/s10502-009-9093-2 [Accessed 10 June 2012].
Kurth, M. (1993). “The limits and limitations of transaction log analysis”. Library Hi Tech
[Online], 11(2), 98-104.
http://www.emeraldinsight.com/journals.htm?issn=0737-8831&volume=11&issue=2&articleid=1676224&show=pdf
[Accessed 10 June 2012].
Levene, M. (2006). An Introduction to Search Engines and Web Navigation. Harlow:
Addison-Wesley (An imprint of Pearson Education Limited).
Li, L., Zhong, L., Xu, G. & Kitsuregawa, M. (2012). “A feature-free search query classification
approach using semantic distance”. Expert Systems with Applications [Online], 39(12),
10739-10748. http://www.sciencedirect.com/science/article/pii/S0957417412004642
[Accessed 26 June 2012].
Library of Congress. (2010). Codes for the Representation of Names of Languages [Online].
Washington, DC: The Library of Congress.
http://www.loc.gov/standards/iso639-2/php/code_list.php
[Accessed 15 August 2012].
Library of Congress. [2012?]. Library of Congress Subject Headings [Online]. Washington,
DC: The Library of Congress. http://id.loc.gov/authorities/subjects.html [Accessed 30 July
2012].
Meyer, É., Grussenmeyer, P., Perrin, J.-P., Durand, A. & Drap, P. (2007). “A web information
system for the management and the dissemination of Cultural Heritage data”. Journal of
Cultural Heritage [Online], 8(4), 396-411. http://dx.doi.org/10.1016/j.culher.2007.07.003
[Accessed 28 March 2012].
Minelli, S.H., Marlow, J., Clough, P., Cigarran Recuero, J.M., Gonzalo, J., Oomen, J. &
Loschiavo, D. (2007). “Gathering requirements for multilingual search of audiovisual
material in cultural heritage”. In: Proceedings of Workshop on User Centricity – state of the
art (16th IST Mobile and Wireless Communications Summit). Budapest, Hungary, 1-5 July
2007. 5pp. [Brussels: Information Society Technologies?]. Available from:
http://ir.shef.ac.uk/cloughie/papers/mobilesummit2007-minelli.pdf [Checked 1 March
2012].
Nicholas, D. & Clark, D. (2012). “Evidence of user behaviour: deep log analysis”. In: Dobreva,
M., O’Dwyer, A. & Feliciati, P. (eds.). User Studies for Digital Library Development.
pp. 85-94. London: Facet Publishing.
Ott, M. & Pozzi, F. (2011). “Towards a new era for Cultural Heritage Education: Discussing
the role of ICT”. Computers in Human Behavior [Online], 27(4), 1365-1371.
http://dx.doi.org/10.1016/j.chb.2010.07.031 [Accessed 27 March 2012].
Park, S., Lee, J.H. & Bae, H.J. (2005). “End user searching: A Web log analysis of NAVER, a
Korean Web search engine”. Library & Information Science Research [Online], 27(2),
203-221. http://dx.doi.org/10.1016/j.lisr.2005.01.013 [Accessed 27 March 2012].
Politou, E.A., Pavlidis, G.P. & Chamzas, C. (2004). “JPEG2000 and dissemination of cultural
heritage over the Internet”. IEEE Transactions on Image Processing [Online], 13(3), 293-301.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=1278354
[Accessed 7 June 2012].
Purday, J. (2009). “Think culture: Europeana.eu from concept to construction”. The
Electronic Library [Online], 27(6), 919-937.
http://www.emeraldinsight.com/journals.htm?articleid=1827227&show=abstract
[Accessed 21 April 2012].
Purday, J. (2010). “Intellectual Property Issues and Europeana, Europe’s Digital Library,
Museum and Archive”. Legal Information Management [Online], 10(3), 174-180.
http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=7891425
[Accessed 21 April 2012].
Ross, N.C.M. & Wolfram, D. (2000). “End User Searching on the Internet: An Analysis of
Term Pair Topics Submitted to the Excite Search Engine”. Journal of the American Society
for Information Science [Online], 51(10), 949-958. Available from:
http://onlinelibrary.wiley.com [Accessed 9 June 2012].
Silverstein, C., Marais, H., Henzinger, M. & Moricz, M. (1999). “Analysis of a very large web
search engine query log”. ACM SIGIR Forum [Online], 33(1), 6-12.
http://dl.acm.org/citation.cfm?doid=331403.331405 [Accessed 7 June 2012].
Spink, A., Ozmutlu, S., Ozmutlu, H.C. & Jansen, B.J. (2002). “U.S. versus European web
searching trends”. ACM SIGIR Forum [Online], 36(2), 32-38.
http://dx.doi.org/10.1145/792550.792555 [Accessed 28 March 2012].
Voorbij, H. (2010). “The use of web statistics in cultural heritage institutions”. Performance
Measurement and Metrics [Online], 11(3), 266-279.
http://dx.doi.org/10.1108/14678041011098541 [Accessed 10 June 2012].
Walsh, J. (2011). “The use of Library of Congress Subject Headings in digital collections”.
Library Review [Online], 60(4), 328-343.
http://www.emeraldinsight.com/journals.htm?articleid=1923626&show=abstract
[Accessed 7 August 2012].
Wikipedia: The Free Encyclopedia. [2012?]. Wikipedia [Online]. San Francisco, CA:
Wikimedia Foundation Inc. http://www.wikipedia.org/ [Accessed 9 August 2012].
Yahoo! Inc. (2012). Yahoo! Directory [Online]. Sunnyvale, CA: Yahoo! Inc.
http://dir.yahoo.com/ [Accessed 22 August 2012].
Appendix
Appendix 1: A Summary of the Classification Scheme Developed for
Filters Dataset ‘text’ Query Refinements
Primary Categories Secondary Categories Tertiary Categories
Philosophy, Mythology and Religion
Philosophy
Mythology
Religion
Ideas and Concepts
Ancient Greece and Rome
Abrahamic Religions
Place and Civilisation Country or Settlement
Region
Civilisation or Culture
City (Capital)
City (Other)
Town or Village
Society and Current Affairs Royalty and Nobility
Politics
Crime
Organisations and Events
Historical Figures
Journalism
Organisation or Institution
Named Event
Military and Military History Subjects
People
The Arts Visual Arts and Theatre
Music
People
People
Poetry and Literature People (Creators)
People (Fictional)
The Sciences People Historical Figures
Generic Subjects Place
Time
Person
Object or Form Descriptors Visual Arts (2D)
Object, Textile and Sculpture
Architecture
Written Word
Drawing and Painting
Photography
Stamps
Textile
Pottery and Ceramic
Ambiguous or Unclear Name
Computing or Search Functionality
Other
Forename
Surname
Appendix 2: Results of Exploratory Mapping between Preliminary
Classification ‘Primary Category’ Terms and Library of Congress
Subject Headings (Library of Congress, [2012?])
Highlighted descriptors appear in search results for multiple preliminary classification
Primary Categories.
| Preliminary Classification: Primary Categories | Potential Mapping: ‘Library of Congress Subject Headings’. Quoted from searches conducted at: Library of Congress ([2012?]: n.p.) |
|---|---|
| Philosophy, Mythology and Religion | “Philosophy”; “Philosophy, Ancient”; “Idea (Philosophy)”; “Philosophy, Modern”; “Philosophy and religion”; “Philosophy and social sciences”; “Philosophy in literature”; “Philosophy and civilization”; “Ethics”; “Mythology”; “Mythology, Classical”; “Mythology in literature”; “Religion”; “Theology”; “Religion and sociology”; “Religion and politics”; “Religion and civilization”; “Religion and culture”; “Religion and literature”; “Religion and religious literature”; “Philosophy and religion”; “Philosophy and religion in literature”; “Religions” |
| Place and Civilisation | “Philosophy and civilization”; “Place (Philosophy)”; “Names, Geographical”; “Civilization”; “Civilization in literature” |
| Society and Current Affairs | “War and society”; “Civil society”; “Politics and government”; “Politics and culture”; “Press and politics”; “Political activity”; “World politics”; “Mass media and world politics”; “Popular culture”; “Political science” |
| Military and Military History | “War and society”; “Military history”; “Military missions”; “Military paraphernalia”; “Military art and science”; “Military history, modern”; “Soldiers”; “Military campaigns”; “Military policy”; “Military life”; “Military history in literature”; “Combat”; “Armed Forces” |
| The Arts | “Arts”; “Arts and society”; “Decorative arts”; “Arts, Ancient”; “Arts, Modern”; “Arts, Classical”; “Arts and history”; “Graphic arts”; “Arts and society in literature”; “Arts in literature”; “Art”; “Performing arts and literature”; “Art criticism”; “Art and state” |
| The Sciences | “Science”; “Science and civilization”; “Science and state”; “Literature and science”; “Science and industry”; “Science and the arts”; “Historians of science”; “Philosophy and science”; “Science in literature”; “Science and civilization in literature”; “Research”; “Technology”; “Scientific literature”; “Natural history”; “Discoveries in science” |
| Generic Subjects | n/a |
| Object/Form Descriptors | Potential mapping to Library of Congress LC Genre/Form Terms (Library of Congress, [2012?]: n.p.). |
| Ambiguous or Unclear | “Anonymous persons”; “Anonymous writings”; “Anonymous art”; “Names, Personal”; “Human-computer interaction” |
Appendix 3: A Summary of the Study Classification Scheme Following
Refinement based on Popular Queries Dataset Analysis
Primary Categories Secondary Categories Tertiary Categories
Philosophy, Mythology and Religion
Philosophy
Mythology
Religion
Ideas and Concepts
Named Philosophers
Classical: Ancient Greece, Rome and Egypt
Folk and Fairy Tales
Iconography
Religious Buildings
Historical Figures (Biblical)
Named Religions
Place and Civilisation Geographical Area
Countries and Settlements
Civilisation or Culture
Region (multi-country)
City (Capital)
City (Other)
Municipality, Town or Village
Region (single country)
Island
Historical Place Names
Politics and Society Political Figures
Popular Culture
Politicians/Political Leaders
Royalty and Nobility
Historical Figures
Fashion
Journalism
Entertainment and Events
Local Government and Facilities
Crime
International Relations
Local Amenities
Healthcare
Political Agreements
Military and Military History Military Figures
Military Engagements
Military Objects
Military Leaders
Military Personnel
Prisoners of War
World War One
World War Two
Tactics and Strategy
Treaties and Agreements
The Arts Artists, Authors and Creators
Artistic Genres
Named Works
Painters and Illustrators
Photographers
Authors and Poets
Actors and Actresses
Architects
Designers
Visual Arts (2D)
Film and Theatre
Fashion and Design
Architecture
Music
Poetry and Literature
Other
Paintings
The Sciences Scientists and Scientific Figures
Historical Figures
Scientific Genres
Architecture
Natural History
Archaeology
Business and Industry Named Figures
Genres
Named Companies or Organisations
Engineering
Healthcare
Advertising
Patents
Collections, Organisations and Institutions
Libraries and Archives
Museums and Galleries
Other Collections
Portals and Aggregators
Geographical Designations
Physical Collections
Online Collections
Europeana Collections
Generic Subjects Place
Time
Person
Object
Other
Object or Form Descriptors Visual Art Formats (2D) Drawing, Painting and Illustration
Visual Art Formats (3D)
Audio or Moving Image
Results Formats
Printing
Design
Photography
Maps and Surveys
Stamps, Postcards and Bookplates
Architecture
Design
Textiles
Pottery, Ceramic and Glassware
Sculpture
Other Objects
Film
Query Selections
Ambiguous or Unclear Personal Names
Place Names
Computing Functionality or Search Feature
Other
Forename
Surname
Appendix 4: Results of Potential Mapping between Final
Classification Scheme Primary, Secondary and Tertiary Category
terms and Existing Schemes
| Final Classification: Primary Categories | Potential Mapping: Yahoo! Directory. Quoted from: Yahoo! Inc. (2012: n.p.) |
|---|---|
| Philosophy, Mythology and Religion | (“Arts & Humanities”); “Social Science”; “Society & Culture” |
| Place, Civilisation and Travel | “Recreation & Sports”; “Regional” |
| Politics and Society | “Education”; “Government”; “Health”; “News & Media”; “Society & Culture” |
| Military and Military History | (“Government”) |
| Lifestyle and Entertainment | (“Business & Economy”); “Computer & Internet”; “Entertainment”; “Recreation & Sports” |
| Arts and Design | “Arts & Humanities” |
| Literature and Poetry | “Arts & Humanities” |
| Music, Film and Theatre | “Arts & Humanities”; “Entertainment” |
| Architecture, Buildings and Structures | (No clear top-level category) |
| Sciences | “Science”; “Social Science” |
| Business and Industry | “Business & Economy” |
| Generic Subjects | n/a |
| Collections, Organisations and Institutions | “Arts & Humanities” |
| Object or Form Descriptors | n/a |
| Ambiguous or Unclear | n/a |
Final Classification: Secondary Categories
Potential Mapping: ‘Library of Congress Subject Headings’. Quoted from searches conducted at: Library of Congress ([2012?]: n.p.)
Final Classification: Tertiary Categories
Potential Mapping: ‘Library of Congress Subject Headings’. Quoted from searches conducted at: Library of Congress ([2012?]: n.p.)
Philosophy, Mythology and Religion:
Philosophy
Mythology
Religion
“Philosophy”
“Mythology”
“Religion”
Named Figures
Ideas and Concepts
Folk and Fairy Tales
Legends
Classical Philosophy,
Mythology and
Religion
Theology and
Religious History
Named Religions
“Philosophers”
“Philosophers, Modern”
“Philosophers, Ancient”
“Ethics”
“Idea (Philosophy)”
“Tales”
“Fairy tales”
“Folklore”
“Legends”
“Mythology, Classical”
“Philosophy, Ancient”
“Theology”
“Religious History”
“Religions”
and Religious
Groups
Named Figures:
Ministers and
Officials
Named Figures:
Religious Texts
Festivals and
Ceremonies
Iconography and
Objects
Religious Buildings,
Locations and
Communities
“Religious institutions”
“Clergy”
“Associate clergy”
“Church musicians”
(n/a: too broad)
“Fasts and feasts”
“Rites and ceremonies”
“Idols and images”
“Religious articles”
“Religious facilities”
“Religious communities”
Place, Civilisation and Travel:
Geographical
Features or
Regions
Countries and
Settlements
“Geography”
“Physical geography”
“Environmental geography”
“Names, Geographical”
“Human settlements”
Country
City: Capital
(n/a: not an over-arching
subject)
“Capitals (Cities)”
Travel
Civilisation or
Culture
“Travel”
“Civilization”
“Culture”
City: Other
Municipality, Town
or Village
Specified Address
Island (Inhabited)
Region or
Administrative
Region
Maps and Travel
Guides
Languages
Historical Place
Names
Ancient and
Classical Civilisation
and Culture
“Cities and towns”
“Cities and towns”
“Villages”
“Street addresses”
“Islands”
“Regions”
“Regions (Administrative
and political divisions)”
“Maps”
“Tourist maps”
“Atlases”
“Guidebooks”
“Languages”
“Languages, Modern”
“Language and languages”
“Historic sites” (?)
“Civilization, Ancient”
“Civilization, Classical”
Politics and Society:
Named Figures
(n/a: too broad)
Political Leaders and
Politicians
“Politicians”
News
Law and Crime
Amenities and
Facilities
History and
Social Change
“Foreign news”
“Press”
“Press coverage”
“Law”
“Crime”
(n/a: too broad)
“History”
“Social history”
“Social sciences and
history”
“Social change”
Royalty and Nobility
Named Newspapers
Journalism
History of Crime
Copyright
Housing
Hospitals and
Healthcare
Libraries
Schools and
Education
“Kings and rulers”
“Queens”
“Princesses”
“Princes”
“Royal houses”
“Nobility”
“Newspapers”
“Journalism”
“Crime—History”
“Copyright”
“Housing”
“Health facilities”
“Health facilities,
Proprietary”
“Hospitals”
“Hospitals, Proprietary”
“Libraries”
“Public services (Libraries)”
“Education”
“School facilities”
Organisations
and Societies
Civil
Ceremonies and
Events
International
Relations
“Societies, etc.”
“Associations, institutions, etc.”
“Fraternal organizations”
“Societies and clubs”
“Societies”
“Clubs”
(n/a: no clear equivalents)
“International relations”
“Non-state actors (International relations)”
Marriage
Political Agreements
“Marriage”
“Treaties”
“International obligations”
Military and Military History:
Named Figures
Military
Engagements
(n/a: too broad)
“Wars”
“Combat”
Military Leaders and
Personnel
Prisoners of War
Historical Figures
Strategy and Tactics
“Soldiers”
“Veterans”
“Prisoners of war”
“Ex-prisoners of war”
(n/a: no clear equivalents)
“Strategy”
“Defensive (Military
Procedure and
Discipline
Military Objects
“Battles”
“Military missions”
“Military discipline”
“Military
paraphernalia”
Treaties and
Agreements
World Wars
Military Tribunals
Military Records
Buildings, Locations
and Bases
Transport
Weapons and
Equipment
science)”
“Offensive (Military
science)”
“Tactics”
“Treaties”
“Armistices”
“World War, 1914-1918”
“World War, 1939-1945”
“Military courts”
“Courts-martial and courts
of inquiry”
“Military administration”
“Military bases”
“Transportation, Military”
“Vehicles, Military”
“Weapons”
“Military supplies”
Lifestyle and Entertainment:
Entertainment
and Events
“Entertainment
events”
Performances
Exhibitions
Arcades
“Performances”
“Exhibitions”
“Arcades”
Transport
Sport
Computing
Fashion and
Beauty
Advertising
“Transportation”
“Vehicles”
“Sports”
“Computer
systems”
“Computers”
“Fashion”
“Beauty, Personal”
“Advertising”
Road
Rail
Air
Other
Named Sports and
Sports Clubs
Sporting Events
Equipment
Social Media
“Transportation,
Automotive”
“Automobiles”
“Railroads”
“Railroad trains”
“Aeronautics, Commercial”
“Airplanes”
(n/a)
“Sports”
“Athletic clubs”
“Sports administration”
“Hosting of sporting
events”
“Sporting goods”
“Social media”
“Online social networks”
Arts and Design:
Named Figures
(n/a: too broad)
Creators or
Designers
“Artists”
“Designers”
Named Works
or Subjects
Artistic Periods,
Styles or
Movements
Genres
“Titles of works of art”
“Art--Themes, motives”
“Art genres”
“Art movements”
“Art genres”
History of Art
Classical Art
Portrait
Landscape
Painting, Drawing
and Illustration
Engraving and
Printing
Photography
Stamps
Bookplates
Postcards
Ceramics, Enamel,
“Art and history”
(plus by location e.g. “Art,
Italian—History”)
“Art, Classical”
“Art objects, Classical”
“Portraits”
“Landscapes in art”
“Painting”
“Drawing”
“Pictorial works”
“Illustrations”
“Engraving”
“Printing”
“Photography”
“Photography, artistic”
“Postage stamps”
“Bookplates”
“Postcards”
“Ceramics”
Pottery and Glass
Sculpture and
Figurines
Fashion, Clothing
and Jewellery
Other
“Decorative arts”
“Pottery”
“Enamel and enameling”
“Glass art”
“Art glass”
“Glassware”
“Sculpture”
“Small sculpture”
“Figurines”
“Fashion”
“Clothing and dress”
“Costume”
“Jewelry”
(n/a)
Literature and Poetry:
Named Figures
Named Works
or Subjects
Literary
Periods, Styles
or Movements
(n/a: too broad)
“Titles of books”
“Literary form”
“Literary
movements”
Authors and Editors
Publishers
Classical Literature
“Authors”
“Poets”
“Editors”
“Publishers and
publishing”
“Authors and publishers”
“Classical literature”
Genres
“Style, Literary”
“Literary form”
Poetry
Literature (Fiction)
Literature (Non-
Fiction)
Ephemera
“Poetry”
“Fiction”
“Non-fiction…”
“Printed ephemera”
Music, Film and Theatre:
Named Figures
Named Works
or Subjects
Periods, Styles
or Movements
Instruments
and Equipment
(n/a: too broad)
(n/a: no clear
equivalents)
“Popular music
genres”
“Film genres”
“Stage props”
“Motion pictures--
Creators or
Composers
Performers
Other
Folk Music
Musical Instruments
“Composers”
“Screenwriters”
“Dramatists”
“Entertainers”
“Actors”
“Male actors”
“Actresses”
“Musicians”
(n/a)
“Folk music”
“Musical instruments”
Genres
Setting and
scenery”
“Theaters--Stage-
setting and
scenery”
“Popular music
genres”
“Film genres”
Music
Film
Theatre
“Music”
“Motion pictures”
“Motion pictures and
television”
“Performing arts”
“Theater”
“Drama”
Architecture, Buildings and Structures:
Named Figures
Architectural
Periods, Styles
or Movements
(n/a: too broad)
(n/a: individual
examples e.g.
“International style
(Architecture)”,
“Modern
movement
(Architecture)”)
Architects
Landscape
Architecture
Castles, Palaces,
Religious Buildings
and Monuments
Civic Buildings,
Housing and
Businesses
“Architects”
“Landscape architecture”
“Castles”
“Palaces”
“Monuments”
“Religious facilities”
“Public buildings”
“Housing”
“Business enterprises”
“Industrial buildings”
Engineering
Structures
“Structural engineering”
Sciences:
Named Figures
Genres
“Scientists”
“Classification of
sciences”
Historical Figures
Natural History and
Biology (Non-
Human)
Animal Husbandry
and Food Science
Human Biology and
Medicine
Archaeology
Anthropology
Geography and
Cartography
Physics and
Astronomy
Technology
(n/a: no clear equivalents)
“Natural history”
“Zoology”
“Botany”
“Animal culture”
“Livestock”
“Domestic animals”
“Food”
“Nutrition”
“Human biology”
“Medicine”
“Archaeology”
“Anthropology”
“Geography”
“Cartography”
“Physical Sciences”
“Physics”
“Astronomy”
“Technology”
Business and Industry:
Named
Companies or
Manufactories
Named
Products and
Advertising
Named
Industries
“Business enterprises”
“Business names”
“Corporations”
“Industrial buildings”
“Factories”
“Commercial products”
“Brand name products”
“Advertising”
“Branding (Marketing)”
“Industries”
Patents
Mining and
Resource Extraction
Construction and
Manufacturing
Industries
“Patents”
“Mineral industries”
“Construction industry”
“Manufacturing industries”
Generic Subjects:
Person
Place
Object
Time
(n/a: too broad)
(n/a: too broad)
(n/a: too broad)
“Time”
Date
“Chronology, Historical”
“Days”
“Months”
Other
(n/a)
“Year”
Named Collections:
Libraries and
Archives
Museums and
Galleries
Other
Collections
Portals and
Aggregators
Geographical
Designations
“Libraries”
“Archives”
“Museums”
“Art museums”
(n/a: too broad)
“Web portals”
“Federated searching”
“Names, Geographical”
Object or Form Descriptors:
Europeana Query Selections:
Format
(n/a: this study
focusing on
Europeana
functionality)
Ambiguous or Unclear:
Person
“Anonymous persons”
“Names, Personal”
Place
Computing
Functionality or
Search Feature
Other
“Human-computer interaction”
“Anonymous writings”
“Anonymous art”