AN INVESTIGATION INTO QUERIES SUBMITTED TO THE

EUROPEANA WEB PORTAL

A study submitted in partial fulfillment

of the requirements for the degree of

Masters (MA) in Librarianship

at

THE UNIVERSITY OF SHEFFIELD

by

EMMA C.M. SILVEY

September 2012

Structured Abstract

Background

Online availability significantly affects how people find and use different information

resources (see, e.g. Levene, 2006). This study considers information-seeking in the cultural

heritage environment, characterised by its especially wide range of users (see, e.g.

Chaudhry & Jiun, 2005). It focuses on the Europeana Web portal (Europeana, [2012?]a),

which allows users to easily locate diverse information types and formats from different

providers.

Aims

The study aims to investigate users’ information-seeking behaviour, as revealed through

queries submitted to the Europeana portal. Of particular concern are filter usage, indicating

query refinement, and query topics shown by query classification. Potential practical

outcomes include informing how cultural heritage information is provided online, including

suggested search functionalities.

Methods

The study utilises query log analysis, concentrating initially on search filter specifications.

Further datasets contain query text and frequencies: the 150 most popular queries, random

samples (total 300 queries) and 100 queries filtered by media type. A subject-based query

classification scheme is developed and presented for the popular and random queries, then

evaluated through classification of media-filtered queries. Aspects like query languages are

also considered.

Results

Approximately one-third of Filters Dataset queries are filtered, primarily by media type. Of

the classified datasets, frequent queries most often concern collections or places; whilst

place-related queries remain popular, random query samples contain far fewer collection-

based queries. Across different media, proportions of queries classified as ‘Music, Film or

Theatre’ are especially varied.

Conclusions

Overall, study findings such as the prominence of place-related queries dovetail well with

existing literature (see, e.g. Jansen et al., 2011). The study classification scheme

nevertheless indicates the importance of subject-based search and browse support. It is

therefore recommended that Europeana incorporates greater functionality concerning

query topics. More detailed consideration of individual users’ search patterns is suggested

as a future research area.

(Abstract Word Count: 297 words)

Acknowledgements

I would like to thank my dissertation supervisor, Dr Paul Clough, for his help and support

throughout this project and Dr Mark Hall for providing the query log data for analysis.

Acknowledgement is also due to the Arts and Humanities Research Council (AHRC) for

funding my MA Librarianship programme of study at the University of Sheffield (2011-

2012).

Table of Contents

Structured Abstract
  Background
  Aims
  Methods
  Results
  Conclusions
Acknowledgements
Tables and Figures
  Tables
  Figures
Chapter 1: Introduction and Aims
  1.1 The Information Environment
  1.2 The Europeana Web Portal
  1.3 Study Aim and Objectives
    1.3.1 Aim
    1.3.2 Objectives
Chapter 2: Literature Review
  2.1 The Modern Information Environment
  2.2 Investigating Search
  2.3 Query Log Analysis
  2.4 Applications of Query Log Analysis
  2.5 Query Classification
  2.6 Cultural Heritage Information
  2.7 Query Log Analysis and Query Classification in Cultural Heritage
Chapter 3: Methodology
  3.1 Background and Theoretical Basis
    3.1.1 Query Log Analysis
    3.1.2 Query Classification
  3.2 Data Collection and Analytical Approach
    3.2.1 Datasets
    3.2.2 Data Analysis and Classification Scheme Development
Chapter 4: Results
  4.1 Filters Dataset
    4.1.1 Filters and Query ‘Selections’
    4.1.2 Filter Specification
  4.2 Popular Query Analysis
    4.2.1 Dataset Characteristics
    4.2.2 Query ‘Selections’ and Language
  4.3 Random Sample Query Analysis
    4.3.1 Dataset Characteristics
    4.3.2 Query ‘Selections’ and Language
    4.3.3 Query Classification: Final Scheme Refinement
    4.3.4 Classification Scheme Mapping
  4.4 Comparison of Frequent and Random Query Classification Patterns
  4.5 Queries Filtered by Media Type
Chapter 5: Discussion
  5.1 Europeana Querying Patterns: Filters, Query ‘Selections’ and Languages
  5.2 Classification Scheme Development
Chapter 6: Conclusion
  6.1 Fulfilment of Study Objectives (Section 1.3.2)
  6.2 Recommendations for Cultural Heritage Information Provision
  6.3 Study Limitations and Areas for Future Research
References
Appendix
  Appendix 1: A Summary of the Classification Scheme Developed for Filters Dataset ‘text’ Query Refinements
  Appendix 2: Results of Exploratory Mapping between Preliminary Classification ‘Primary Category’ Terms and Library of Congress Subject Headings (Library of Congress, [2012?])
  Appendix 3: A Summary of the Study Classification Scheme Following Refinement based on Popular Queries Dataset Analysis
  Appendix 4: Results of Potential Mapping between Final Classification Scheme Primary, Secondary and Tertiary Category terms and Existing Schemes

Information School: Address & Other Confirmations

University of Sheffield - Information School: First Employment Destination Details

for School Records

Information School: Access to Dissertation

Tables and Figures

Tables

Table Number    Title
1    Baseline classification of Filters Dataset ‘text’ query selections
2    The ten most popular options specified using the ‘LANGUAGE’ filter
3    The ten most popular options specified using the ‘COUNTRY’ filter
4    Language/Country pairs and associated ranks from Tables 2 and 3
5    The ten most popular options specified using the ‘PROVIDER’ filter
6    A summary of the final study classification scheme following refinement based on analysis of Random Queries datasets
7    Categorisation percentages for queries in different study datasets using the Primary Categories of the study classification scheme
8    Comparative ranks for queries in different datasets classified using the Primary Categories of the study classification scheme
9    Categorisation percentages for popular Europeana queries filtered by media type (Europeana, [2012?]d) using the Primary Categories of the study classification scheme

Figures

Figure Number    Title
1    A screenshot of a Europeana results page following query submission (with query ‘faust’), showing filtering options down the left hand side (Europeana, 2012)
2    A screenshot of the Europeana Web portal Homepage (Europeana, [2012?]a), showing the ‘Explore’ option along the top of the screen (Europeana, [2012?]e: n.p.)
3    A pie chart showing usage proportions (%) of Europeana filters specified in the Filters Dataset
4    A pie chart showing the usage proportions (%) of media type options specified in the Filters Dataset ‘TYPE’ filter
5    A chart showing frequency of usage (aggregated by century) for the Filters Dataset ‘YEAR’ filter
6    Popular Queries dataset frequency distribution, excluding the top-ranked result
7    Usage frequencies for different query ‘selections’ in the Popular Queries dataset
8    Language frequencies for the Popular Queries dataset
9    Frequency distributions for both Random Queries datasets
10    Usage frequencies for different query ‘selections’ in the Random Queries datasets
11    Language frequencies for the Random Queries datasets
12    Frequency distributions for the most popular queries specifying different Europeana media-based filtering options (Europeana, [2012?]d)
13    Frequency distributions for the most popular queries specifying different Europeana media-based filtering options (Europeana, [2012?]d), excluding the top-ranked ‘Text’ query
14    Primary Category percentages for popular Europeana queries filtered by media type (Europeana, [2012?]d)

Chapter 1: Introduction and Aims

1.1 The Information Environment

Bowler et al.’s (2011: 746) characterisation of an emerging “knowledge era” indicates the

vital importance of information in modern society; technological developments have

transformed both information provision and information-seeking behaviour (Hearst, 2009;

Jansen et al., 2011; Levene, 2006; Nicholas & Clark, 2012; Purday, 2010). In particular, rising

available volumes of Web-based information have prompted the emergence of search

services that aim to help users locate relevant material (see, e.g. Broder, 2002; Clough,

2009; Eirinaki & Vazirgiannis, 2003; Jansen & Pooch, 2001; Jansen & Spink, 2006; Levene,

2006).

Two important recent developments in online services and information provision are Web

portals, providing access to information aggregated from different providers, and

transaction-based Websites, facilitating greater interactivity between users and systems

(Agosti et al., 2012; Eirinaki & Vazirgiannis, 2003; Hearst, 2009; Levene, 2006). The former

can include specialised portals, for example tailored to different users or areas of interest

(see, e.g. Agosti et al., 2012; Kirchoff et al., 2008; Minelli et al., 2007; Ott & Pozzi, 2011;

Purday, 2009; Voorbij, 2010).

This study is situated at the interface between modern information provision and the

cultural heritage sphere. Cultural heritage information, encompassing current and historical

multimedia material, appeals to diverse users in research, education, professional and

personal interest contexts (Chaudhry & Jiun, 2005; Concordia et al., 2010; Europeana,

[2012?]c; Kirchoff et al., 2008; Meyer et al., 2007; Minelli et al., 2007; Purday, 2009).

Kirchoff et al. (2008: 255) further state that “the quantity of cultural information on the

Internet is growing rapidly”, illustrating how cultural heritage organisations are taking

advantage of new technologies to promote their resources online, for example through the

Europeana portal (Europeana, [2012?]a).

1.2 The Europeana Web Portal

Launched initially in 2008, Europeana is described as “Europe’s digital library, archive and

museum” (Purday, 2009: 919). Contributors to the Web portal, which enables “access

to...over 23 million objects” (Europeana, [2012?]c: n.p.), represent different geographical

areas and domains of “cultural and scientific heritage” (Europeana, [2012?]b: n.p.). Indeed,

the Europeana Strategic Plan 2011-2015 states the service’s “aim to give access to all of

Europe’s digitised cultural heritage by 2025” (Europeana, [2011?]a: 5). The portal therefore

has multiple distinct current and potential user groups (Concordia et al., 2010; Europeana,

[2011?]a; Europeana, [2012?]c; Purday, 2009). Funding is received from a variety of

sources, including the European Commission (Purday, 2009: 932. See also: Europeana,

[2011?]b).

The portal provides a variety of search and browse functionalities, including “a multilingual

interface” (Purday, 2009: 919) that reflects its international focus (see also: Europeana,

[2011?]a; Europeana, [2011?]b; Europeana, [2012?]c). The service has a strong interest in

keeping up-to-date with online information-seeking trends, including consideration of how

to support different stages of the search process (see, e.g. Concordia et al., 2010;

Europeana, [2011?]b). For example, a report by CIBER Research Ltd. (2011: 6) focuses

especially on developing mobile access to Europeana, since several important search

functionalities are not supported by the current mobile interface; this area is also

highlighted in the Europeana Business Plan 2012 (Europeana, [2011?]b: 16. See also:

Nicholas & Clark, 2012).

Discoverability is an additional concern. The CIBER Research Ltd. (2011: 12) report

emphasises that “it is now possible to do a detailed search for Europeana content using a

popular search engine like Google”, therefore overcoming a common problem for earlier

digital libraries (see also: Europeana, [2011?]a; Nicholas & Clark, 2012). Subject-based

access is also considered, including the potential for future provision of “thematic browse

entry points” (Europeana, [2011?]b: 16), alongside the incorporation of new multimedia

content like “3D visualisations” (Europeana, [2011?]a: 12).

The Europeana service is not limited to the Web portal. Indeed, Concordia et al. (2010: 61)

state that “the main goal of Europeana is...to build an open services platform”, noting its

contribution to partners and the wider cultural heritage sphere alongside end users (see

also: Europeana, [2011?]a; Europeana, [2011?]b). Europeana’s Strategic Plan 2011-2015,

for example, identifies “four strategic tracks – aggregate, facilitate, distribute and engage”

(Europeana, [2011?]a: 11), highlighting its multiple service facets (see also: Europeana,

[2011?]b). Similarly, Kirchoff et al. (2008: 256) note that “metadata standards” are key for

organising cultural heritage information online. The Europeana Public Domain Charter adds

a more political dimension, emphasising that Europeana “belongs to the public and must

represent the public interest” (Europeana, 2010: 1).

1.3 Study Aim and Objectives

This study investigates online queries submitted to the Europeana Web portal, focusing

specifically on query text and the specification of filters (e.g. language, named collections)

that can be selected from results pages following initial query entry (Europeana, [2012?]d)

and identified through query log metadata. Exploring broad querying patterns, alongside

more granular consideration of filtering specifications, could help to inform the portal’s

search interface design and content structuring based on user needs

and preferences (see, e.g. Chaudhry & Jiun, 2005; Jansen, 2009; Levene, 2006; Minelli et al.,

2007).

1.3.1 Aim

To investigate queries submitted online to the Europeana Web portal and develop a

query classification scheme; to utilise this scheme to compare popular and other

queries submitted to the portal; and to evaluate how study findings could inform

cultural heritage information provision online.

1.3.2 Objectives

1. To conduct a literature review concerning information provision and information-

seeking online, in general and cultural heritage contexts, and to situate query log

analysis within its broader methodological framework.

2. To investigate filter specifications for a sample of online queries submitted to the

Europeana Web portal.

3. To analyse the 150 most popular queries (by frequency) submitted online to

Europeana and develop a query classification scheme.

4. To refine the classification scheme developed in Objective 3 by classifying and

analysing two samples of 150 random online queries to Europeana. Hence, to

evaluate and enhance the transferability of the classification scheme to queries

other than the most popular.

5. To utilise the study classification scheme to classify and compare the 25 most

frequent queries refined by different options within Europeana’s ‘Type’ filter (see,

e.g. Europeana, [2012?]f).

6. To consider the implications of study findings for organising and presenting cultural

heritage information online, via Europeana and other providers, in particular

through tailored content structuring, interface design and provision of search

functionalities.

Chapter 2: Literature Review

2.1 The Modern Information Environment

As noted in Section 1.1, online information-seeking is particularly significant in the modern

information environment (Europeana, 2010; Purday, 2010). Online information is

distinguished especially from other information resources by its diverse user groups, plus

the fact that both “Web content and Web user behavior are highly dynamic” (Li et al., 2012:

10740. See also: Cooper, 2001; Jansen & Pooch, 2001; Park et al., 2005). Indeed, Hargittai

(2002: 1243) conceptualises “the Web as a complex set of information retrieval services”,

potentially increasing the complexity of search and the likelihood of ‘information overload’

resulting from the large and rising volume of information available online (Eirinaki &

Vazirgiannis, 2003: 1).

Several authors therefore emphasise the importance of search engines (SEs) in online

information-seeking (see, e.g. Broder, 2002; Jansen & Pooch, 2001; Jansen & Spink, 2006).

Although Levene (2006: 10) notes that general SEs cannot always index “deep web”

information stored in databases or digital libraries, whilst Hargittai (2002: 1243) questions

their popularity, Broder (2002: 8) nevertheless suggests that modern SEs are increasingly

sophisticated and effective, focusing on “attempts to blend data from multiple sources”.

Additionally, “specialized search services” (Levene, 2006: 58) catering to different user

groups and/or specific (e.g. topic-based) information requirements are increasingly being

developed alongside general Web search tools (see also: Jansen, 2009).

A particular growth area in both general and specialist online information provision is

personalisation: “adapt[ing] the information or services provided by a Web site to the

needs of a particular user or set of users” (Eirinaki & Vazirgiannis, 2003: 1). Personalisation

is generally based on characteristics like cultural background, language or geographical

location that can potentially influence users’ information-seeking behaviour (see, e.g.

Clough, 2009; Cooper, 2008; Eirinaki & Vazirgiannis, 2003; Hearst, 2009; Jansen & Spink,

2006; Jansen et al., 2007; Jansen et al., 2011; Levene, 2006; Park et al., 2005; Spink et al.,

2002).

Other emerging trends include rising mobile Internet access, with implications for system

design features like tailored mobile interfaces (Agosti et al., 2012; Bowler et al., 2011; CIBER

Research Ltd., 2011; Clough, 2009; Hearst, 2009; Levene, 2006; Purday, 2009). In particular,

CIBER Research Ltd. (2011: 6) describe how search support must be tailored to the

functionalities of different mobile devices (see also: Agosti et al., 2012; Nicholas & Clark,

2012). The report adds that online information-seeking patterns can vary depending on

access type, reflecting how “people shift between different contexts and personas” (CIBER

Research Ltd., 2011: 22). There is also growing concern with the impact of social media on

information-seeking through engagement with features like tagging and “folksonomies”

(Gabrilovich et al., 2009: 26. See also: Agosti et al., 2012; Europeana, [2011?]b). Jansen et

al. (2011: 492), for example, highlight the evolution of “real time search engines” that can

incorporate dynamic, social media-type content.

2.2 Investigating Search

Researching online information-seeking requires consideration of diverse and interlinked

system and user factors. For example, compared with pre-Web information retrieval (IR), a

wider variety of users with different levels of knowledge and experience perform Web

searches (Bowler et al., 2011; Clough, 2009; Hearst, 2009; Ross & Wolfram, 2000;

Silverstein et al., 1999). Jansen (2006: 408) further summarises how “[a] Web search engine

may be a general-purpose search engine, a niche search engine, or a searching application

on a single Web site”. Each of these offers different functionalities, and search

engines can also be accessed indirectly, for example through an Application

Program Interface (API) (see, e.g. Concordia et al., 2010; Cousins et al., 2008; Jansen et al.,

2011).

Cooper (2001: 139) states that “[t]he search process is iterative”, emphasising the need to

support different stages of information-seeking, such as formulating queries plus navigation

or browsing behaviours; understanding search can therefore inform both content

structuring and SE design, particularly for complex multimedia information (see, e.g. Agosti

et al., 2012; Eirinaki & Vazirgiannis, 2003; Hearst, 2009; Jansen, 2009; Levene, 2006; Spink

et al., 2002). Web searching is further impacted by contextual factors like non-search

system/user interactions (Cooper, 2001: 141), alongside people’s “use of other media for

information retrieval, their demographics, and their social support networks” (Hargittai,

2002: 1239). For example, Spink et al. (2002: 37) note the continuing importance of users’

information and technical literacy.

Information needs themselves are highly variable. Broder (2002: 3), for example, identifies

“informational…navigational…or transactional” queries, representing different searching

behaviours and thus requiring tailored search support (see also: Bowler et al., 2011;

Levene, 2006; Park et al., 2005). Jansen et al. (2011: 499) further consider different user

characteristics, proposing that differences between queries submitted to the real-time

‘Collecta’ search engine compared to general Web search could “indicat[e] a possible early

adopter audience relatively more technical than the general population”. Additionally,

Jansen and Pooch (2001: 239) hypothesise that querying patterns may vary across media

types, with three studies in their literature review suggesting that “multimedia Web queries

contain more terms than the average Web query”. Information needs may also change

during search; later queries generally become more complex and specific in response to

earlier search results (Hearst, 2009: 79).

It is also important to consider “the impact of system differences [e.g. interface features]

on user behavior” (Kurth, 1993: 99), especially when comparing querying patterns across

systems with different features, or design changes for single systems (see, e.g. Clough,

2009; Jansen & Pooch, 2001; Jansen & Spink, 2006; Jansen et al., 2011; Koch et al., 2004 in

Agosti et al., 2012: 683; Levene, 2006; Spink et al., 2002). For example, features like “query

suggestion” (Agosti et al., 2012: 666) are likely to affect query formulation, especially for

novice searchers. To summarise, SE design is closely related to wider IR theory; information

needs and system functionalities both influence search behaviour (Levene, 2006). It can

therefore be difficult to generalise study findings from different time periods and across

different systems and user groups (Jansen & Pooch, 2001; Jansen & Spink, 2006; Ross &

Wolfram, 2000).

Some common trends have nevertheless emerged from previous studies of online

information-seeking. For example, considering the ‘AltaVista’ Web SE, Silverstein et al.

(1999: 6) find that “web users type in short queries, mostly look at the first 10 results only,

and seldom modify the query”, with these patterns corroborated by the findings of other

studies (see, e.g. Agosti et al., 2012; Gabrilovich et al., 2009; Jansen & Pooch, 2001; Jansen

& Spink, 2006; Li et al., 2012). Cooper (2001: 144) similarly notes that “most users are

satisfied with the most standard features of a system”, supported by Park et al.’s (2005:

213-214) research concerning the Korean ‘NAVER’ search engine (see also: Jansen et al.,

2011). Indeed, an additional concern is the extent to which users are aware of available

search functionalities, highlighting the need for support beyond the provision of ‘advanced

search’ features (Cooper, 2001; Meyer et al., 2007).

2.3 Query Log Analysis

Query log analysis (Section 3.1.1) is a popular methodology for investigating online

information-seeking that has high practical applicability, for example influencing “the

design, personalisation and evaluation of systems” (Clough, 2009: 4. See also: Hearst, 2009;

Jansen, 2009; Jansen et al., 2009; Levene, 2006; Park et al., 2005). Query logs generally

contain standard fields such as query terms, plus metadata like date/time and some form of

user identification (Eirinaki & Vazirgiannis, 2003; Jansen, 2009; Jansen & Spink, 2006;

Jansen et al., 2011; Levene, 2006; Nicholas & Clark, 2012). Additional information can

include “user-specified modifiers” like search filters (Silverstein et al., 1999: 7), or “referrer

site” (Jansen, 2006: 409), thus also considering navigation between different parts of the

Web. The available query log fields necessarily impact on possible areas of investigation,

including the extent to which logs from different systems are comparable (Agosti et al.,

2012: 680-681).
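
To make the notion of query log fields more concrete, the following is a minimal sketch of how such records might be read and summarised. It assumes a hypothetical tab-separated log with four fields (timestamp, user identifier, query text and a comma-separated list of filters); the field layout, the sample lines and the names `LogRecord` and `parse_line` are illustrative assumptions and do not describe the actual Europeana log format.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogRecord:
    """One query log entry (hypothetical field layout)."""
    timestamp: datetime
    user_id: str
    query: str
    filters: tuple


def parse_line(line: str) -> LogRecord:
    # Assumed layout: timestamp <TAB> user id <TAB> query <TAB> filters
    ts, user_id, query, filters = line.rstrip("\n").split("\t")
    return LogRecord(
        timestamp=datetime.fromisoformat(ts),
        user_id=user_id,
        query=query.strip().lower(),  # normalise case before counting
        filters=tuple(f for f in filters.split(",") if f),
    )


# Invented sample lines, for illustration only.
sample_lines = [
    "2012-01-05T10:00:00\tu1\tFaust\tTYPE:TEXT",
    "2012-01-05T10:02:00\tu1\tfaust\t",
    "2012-01-05T11:30:00\tu2\tparis\tTYPE:IMAGE,COUNTRY:france",
]

records = [parse_line(line) for line in sample_lines]

# Query-level frequency, share of filtered queries, and mean terms per query.
query_counts = Counter(r.query for r in records)
filtered_share = sum(1 for r in records if r.filters) / len(records)
mean_terms = sum(len(r.query.split()) for r in records) / len(records)
```

Which summaries are possible depends entirely on the fields present: without a filters field, for instance, the share of filtered queries could not be computed at all.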

User identification is especially complex, since common surrogates like IP address or

cookies cannot always identify individuals reliably; it can be difficult to separate human and

non-human (e.g. robot) queries, whilst user privacy is an important concern from an ethical

standpoint (Agosti et al., 2012; Bowler et al., 2011; Clough, 2009; Cooper, 2001; Cooper,

2008; Eirinaki & Vazirgiannis, 2003; Jansen, 2006; Jansen et al., 2009; Jansen et al., 2011;

Kurth, 1993; Silverstein et al., 1999). Log analysis can also be conducted based on “term,

query, and session” (Jansen, 2006: 417), with the latter often especially difficult to identify

(see also: Jansen & Pooch, 2001). Given a frequent lack of contextual information,

delineating both users and their information needs can therefore be problematic (Cooper,

2001; Eirinaki & Vazirgiannis, 2003; Jansen & Spink, 2006; Kurth, 1993).
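
The difficulty of session identification can be illustrated with a simple inactivity heuristic: consecutive queries from the same user identifier are placed in one session unless the gap between them exceeds a fixed threshold. The sketch below is an assumption-laden illustration only; the 30-minute threshold, the `sessionise` function and the sample events are hypothetical, and the approach is not attributed to any of the authors cited above.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Assumed inactivity threshold; the appropriate value is system-dependent.
SESSION_GAP = timedelta(minutes=30)


def sessionise(events, gap=SESSION_GAP):
    """Group (user_id, timestamp, query) tuples into per-user sessions.

    Returns a list of sessions, each a list of query strings.
    """
    sessions = []
    # Sort by user, then time, so groupby sees each user's events together.
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    for _user, user_events in groupby(ordered, key=lambda e: e[0]):
        current = []
        last_time = None
        for _, ts, query in user_events:
            # Start a new session when the inactivity gap is exceeded.
            if last_time is not None and ts - last_time > gap:
                sessions.append(current)
                current = []
            current.append(query)
            last_time = ts
        sessions.append(current)
    return sessions


# Invented sample events, for illustration only.
events = [
    ("u1", datetime(2012, 1, 5, 10, 0), "faust"),
    ("u1", datetime(2012, 1, 5, 10, 5), "faust goethe"),
    ("u1", datetime(2012, 1, 5, 12, 0), "paris"),
    ("u2", datetime(2012, 1, 5, 10, 1), "mozart"),
]
sessions = sessionise(events)
```

Because the user identifier may be a shared IP address and the threshold is arbitrary, sessions produced in this way are approximations rather than reliable records of individual search episodes.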

The primary disadvantage of query log analysis is its descriptive nature, whereby data

cannot account for “the underlying situational, cognitive, or affective elements of the

searching process” (Jansen, 2006: 411. See also: Bowler et al., 2011; Jansen & Pooch, 2001;

Jansen & Spink, 2006; Kurth, 1993). Kurth (1993: 100) further extends the conception of

information-seeking to consider the difficulty of determining “the information needs that

users are unable to express in the search statements that they enter into online systems”.

As noted above, it is therefore important to consider the generalisability of study findings

concerning patterns of search behaviour (see, e.g. Agosti et al., 2012; Jansen & Spink,

2006), particularly since users do not always perform searches as individuals (Bowler et al.,

2011; Hargittai, 2002; Kurth, 1993).

2.4 Applications of Query Log Analysis

As noted above (Section 2.2), online queries generally contain few terms; examining large

volumes of data through query log analysis can therefore help to broadly profile system

users’ information and service requirements (Jansen, 2009; Park et al., 2005; Spink et al.,

2002). For example, Cooper (2001: 140-141) uses query log data to investigate changes in

querying patterns through time for a University of California library catalogue, specifically

focusing on query volumes across University term/holiday periods (see also: Jansen et al.,

2011). Applications of query log analysis are nevertheless somewhat system-dependent.

Agosti et al. (2012: 663), for example, distinguish “Web search engine log analysis and

Digital Library System log analysis” based on differences in system content, functionalities

and user groups.

However, query log analysis can showcase key aspects of user-system interaction, thus

potentially “enlighten[ing]…interface development, and devising the information

architecture for content collections” (Jansen, 2006: 407. See also: Agosti et al., 2012;

Hearst, 2009; Jansen & Spink, 2006). The latter is considered particularly important for

organising and facilitating access to complex “Web sites whose content is increasing on a

daily basis, such as news sites or portals” (Eirinaki & Vazirgiannis, 2003: 3. See also: Agosti

et al., 2012). Query log data can additionally inform group or individual “user profiling”

(Eirinaki & Vazirgiannis, 2003: 3), which is an important concern given the growth of system

personalisation, particularly in e-commerce contexts (Eirinaki & Vazirgiannis, 2003;

Hargittai, 2002; Ross & Wolfram, 2000).

Query log analysis can therefore be highly relevant for both content providers and system

users. Its effective practical application nevertheless requires a broad understanding of

users and their context, meaning that findings may not apply across different systems. For

example, Cooper (2001: 143) emphasises that “[u]ser behavior varies significantly

depending upon the database being searched”, reflecting different users’ information

needs, expertise and the functionalities available to support different facets of system

usage, such as search and browse (Agosti et al., 2012: 683. See also: Jansen et al., 2011). As

such, whilst it has potential business relevance “for making managerial decisions and

establishing priorities” (Agosti et al., 2012: 681), query log analysis is unlikely to be

sufficient for supporting the growing area of truly “[u]ser-centered design” (Bowler et al.,

2011: 723).

2.5 Query Classification

Query classification (Section 3.1.2) is an approach to query log analysis that requires

consideration of several practical factors. For example, especially for short Web queries,

“search queries may be ambiguous” (Li et al., 2012: 10739), thus complicating the

classification process (see also: Gabrilovich et al., 2009). Queries may also be “affected by

ephemeral trends” like current affairs (Silverstein et al., 1999: 6), which can alter both

patterns of query subjects (see, e.g. Ross & Wolfram, 2000) and their expression through

language and terminology; query classification schemes must therefore be flexible and

adaptable enough to incorporate new fields (Beitzel et al., 2004 in Li et al., 2012: 10739;

Gabrilovich et al., 2009). Jansen et al. (2011: 499) further state that a “power-law

distribution…[is] typical for Web query terms”, suggesting that frequent and rare queries

are likely to have different characteristics. Similarly, Gabrilovich et al. (2009: 9) argue that

“rare queries…tend to contain rare words, be longer, and match fewer documents” (see

also: Beitzel et al., 2007a, 2007b in Agosti et al., 2012: 673), nevertheless representing a

significant volume of data about users’ search behaviour when aggregated (Silverstein et

al., 1999).

Query classification, commonly with multiple levels, has been utilised to help develop

general Web search taxonomies, in particular concerning query subjects (see, e.g. Agosti et

al., 2012; Chuang & Chien, 2003 in Agosti et al., 2012: 671; Gabrilovich et al., 2009; Li et al.,

2012). For example, Li et al. (2012: 10742) employ “a hierarchical category taxonomy”,

aiming to maintain the currency of Web query classification by categorising both queries

and results pages to consider the “semantic distance” between these groups (Li et al., 2012:

10740). Categories are generally not mutually exclusive (see, e.g. Gabrilovich et al., 2009:

11).

Study results have illuminated diverse aspects of online information-seeking. For example,

whilst frequent queries from Silverstein et al.’s (1999: 9, Table 4) study of the ‘AltaVista’ SE

include “sex” and “porno” (see also: Ross & Wolfram, 2000), Jansen et al.’s (2011: 504)

study of the real-time ‘Collecta’ SE conversely reveals “a high occurrence of society,

entertainment, technology, and politics”. They attribute this discrepancy to the relatively

specialist nature of real-time search and its distinct user groups, resulting in a querying

pattern that “differs from the topical characteristics of the traditional Web search” (Jansen

et al., 2011: 501). Jansen and Spink (2006: 258, Table 2) also consider variations in

searching behaviour across different locations, suggesting that UK and US searchers exhibit

different querying patterns. Based on query classification of six Web SE datasets, they

further conclude that “[t]he overall trend is towards using the Web as a tool for information

or commerce, rather than entertainment”, providing an alternative perspective to the

studies outlined above (Jansen & Spink, 2006: 260).

Query classification can therefore be relevant for both system and service development.

Gabrilovich et al. (2009: 6), for example, focus on targeted advertisements, whilst Ross and

Wolfram (2000: 957) consider “subject-based access tools” (see also: Jansen et al., 2011).

Significantly, assistance can be provided at different stages of the information-seeking

process, such as query formulation and modification, facet/filtering options and results

display; the latter could include topic-based results clustering to help provide clarity in

situations of “ambiguity or multiple aspects of a topic” (Agosti et al., 2012: 676. See also: Li

et al., 2012). Eirinaki and Vazirgiannis (2003: 7) further consider trends in online

information provision like “recommendation systems”, which allow organisations to cater

to diverse Web users.

Alternatively, classification can utilise Broder’s (2002: 3) categories (Section 2.2), although

the author does note that distinguishing the query types can be difficult (Broder, 2002: 5.

See also: Li et al., 2012). Indeed, Cooper (2001: 143) considers aspects of both query type

and topic for a library catalogue query log, noting that “[m]ost searches (40%) are power

searches [i.e. searching across catalogue fields]”, thus highlighting the importance of

considering the impact of system environment (e.g. available search options) on search

behaviour.

2.6 Cultural Heritage Information

Providing content and search functionalities aligned with user needs is especially important

for cultural heritage information given its widespread appeal (Chaudhry & Jiun, 2005;

Meyer et al., 2007; Minelli et al., 2007). As noted above (Sections 1.1/1.2), cultural heritage

organisations are focusing in particular on facilitating online and mobile access to their

collections. Even so, Agosti et al. (2012: 678) note that information in ‘Digital Library

Systems’ (DLS) – often associated with cultural heritage – retains its distinctive character in

digital environments and can be distinguished from general Web content on the basis that

“collections are explicitly organized, managed, described, and preserved”. Indeed,

Europeana’s Strategic Plan 2011-2015 (Europeana, [2011?]a: 12) highlights its aim to

provide a “comprehensive, trustworthy and authoritative collection”, which is therefore

potentially more closely aligned with library or archive models than broader online

provision.

Multilingual and multimedia search are particularly significant in cultural heritage contexts

(Concordia et al., 2010; Cousins et al., 2008; Kirchoff et al., 2008; Meyer et al., 2007; Minelli

et al., 2007; Ott & Pozzi, 2011). Image resources are especially popular (Kirchoff et al., 2008;

Politou et al., 2004), whilst the Europeana Business Plan 2012 (Europeana, [2011?]b: 7) also

notes high demand for audio-visual resources (see also: Nicholas & Clark, 2012). Technical

considerations relating to multimedia information storage and display are therefore key; as

an example, Politou et al. (2004: 300) suggest that functionalities supported by the

‘JPEG2000’ image format make it highly “applicable to cultural heritage databases”.

Understanding the nature of cultural heritage information is therefore vital for developing

effectively tailored online systems.

Voorbij (2010: 275) emphasises the variation between cultural heritage organisations,

noting especially that “libraries mainly provide access to external digital resources…while

archives and museums place their own unique resources in digitized form on their web

site”, resulting in different content management and intellectual property concerns

(Concordia et al., 2010: 68). Additionally, Meyer et al. (2007: 397) consider a potential

cultural heritage information system whose Web interface is adaptable for “professionals”

and “the general public”, reflecting the need for system design that caters to different

cultural heritage user groups (see also: Cousins et al., 2008; Dempsey, 2000 in Kirchoff et

al., 2008: 252). Indeed, Bowler et al. (2011: 745) suggest that focusing explicitly on different

users may be the most effective approach, despite the resultant complexity in system

design (see also: Europeana, [2011?]a).

Recent developments in online cultural heritage information provision have included

tailored “web portals” (Ott & Pozzi, 2011: 1366). These can help users overcome issues of

technical literacy that potentially complicate information-seeking (see, e.g. Europeana,

2010; Kirchoff et al., 2008), alongside simplifying search by aggregating content, which can

include different types (e.g. subjects, formats) of information (Agosti et al., 2012; Minelli et

al., 2007; Purday, 2009; Voorbij, 2010). Indeed, whilst many cultural heritage organisations

remain physically distinct, Bowler et al. (2011: 746) argue that “users increasingly do not

think in such organizationally restricted terms”, illustrating the need for portal-type

information access. Kirchoff et al. (2008: 258), for example, highlight the German ‘BAM’

cultural heritage Web portal, in particular describing its option to search via “a simple

Google search field”.

New technologies are therefore affecting the wider character of cultural heritage education

and research, encouraging international and interdisciplinary perspectives (Ott & Pozzi,

2011: 1369-1370). Concordia et al. (2010: 67) further suggest that a “digital cultural

commons” is evolving, aided by developments like portals, whilst Europeana’s Business

Plan 2012 (Europeana, [2011?]b: 11) states its aim to “[d]evelop the ‘European Cultural

Commons’ as a concept, a movement and a business model within the Europeana

Network”.

Personalisation and participation are increasingly important aspects of cultural heritage

information provision, as well as for Web information services more generally (Bowler et

al., 2011; Europeana, [2011?]a). Indeed, Bowler et al. (2011: 739) argue that “[l]ibrary

reference service has traditionally been adaptive and personalized”, suggesting that –

whilst its large-scale implementation in online environments may be relatively novel –

personalisation itself is not a new approach. Europeana is likewise aiming to increase

interaction between the service and its users, for example by incorporating “a corpus of user-generated objects”

(Europeana, [2011?]b: 20).

It is important to note that Web-based cultural heritage information systems are often

concerned with both “preservation and dissemination” (Politou et al., 2004: 293), meaning

that accessibility and ease of use are not the only system priorities (Europeana, 2010;

Kirchoff et al., 2008; Meyer et al., 2007; Purday, 2010). For example, digitising objects can

support both aims, influencing the changing relationship between organisations’ physical

and online presence (Europeana, 2010; Europeana, [2012?]b; Kirchoff et al., 2008; Voorbij,

2010). Nevertheless, Kirchoff et al. (2008: 252) argue that “[d]igital memory institutions do

not compete with archives, libraries and museums”, suggesting that they have different

primary remits. Focusing on archaeological information, Meyer et al. (2007: 398, citing

Richards, 1998) therefore stress the importance of developing content management and

information access provision tailored specifically to cultural heritage. For example, mobile

access and systems that account for user location are two significant and interlinked areas

that are likely to be especially pertinent to cultural heritage tourism (see, e.g. CIBER

Research Ltd., 2011; Gravano et al., 2003 in Gabrilovich et al., 2009: 5 and in Li et al., 2012:

10746).

2.7 Query Log Analysis and Query Classification in Cultural Heritage

Query log analysis is highly applicable to information system design in the complex cultural

heritage environment, helping to illuminate user information needs and behaviour, and

thus aiding resource discovery through informing the development of systems and search

features. For example, Voorbij (2010: 267) describes how data from “log file analysis…and

tools based on page tagging” have enabled cultural heritage organisations in The

Netherlands to consider both presentation and content provision, through “adapting the

web site or setting priorities for further digitization” (Voorbij, 2010: 278). Concerning

Europeana specifically, CIBER Research Ltd. (2011: 7) employ “deep log analysis” to

investigate mobile access and make system recommendations (see also: Nicholas & Clark,

2012).

Web query classification schemes may not easily translate to cultural heritage topic areas;

Li et al. (2012: 10746, Figure 2), for example, note categories like “Computers” in the

experimental KDDCUP2005 taxonomy, which may be largely irrelevant in this field. Cultural

heritage information-seeking is further complicated by the wide variety of organisational

schemes currently in use, alongside the need to consider multilingualism and multimedia

information formats (Chaudhry & Jiun, 2005; Meyer et al., 2007; Minelli et al., 2007).

Indeed, Walsh (2011: 334) notes that “[l]ocally developed taxonomies have become a

popular method for subject description in digital collections”, sometimes combined with

existing schemes like ‘Library of Congress Subject Headings’ (Walsh, 2011: 333-334; Library

of Congress, [2012?]).

Several authors have therefore suggested that domain ontologies, taxonomies and faceted

search/browse are particularly useful for cultural heritage information-seeking, as noted by

Walsh (2011) in the context of digital libraries. For example, Chaudhry and Jiun (2005) focus

on cultural heritage taxonomy development in Singapore, whilst Minelli et al. (2007)

consider results filters (see also: Chaudhry & Jiun, 2005; Clough, 2009; Kirchoff et al., 2008;

Meyer et al., 2007). An advantage of portals specifically is highlighted by Kirchoff et al.

(2008: 262), who argue that combining “[i]nformation from…heterogeneous sources – by

location, time, person, and subject – are the added value provided by portals such as BAM

or Europeana”.

This quote introduces a classification element that can be similarly applied to queries; for

example, Minelli et al. (2007: 4) classify popular online queries from several cultural

heritage organisations into the comparable categories of “Proper names”, “Subject”,

“Place” and “Time”. Other authors emphasise the on-going importance of content

classification in facilitating effective search and browse, despite the difficulties inherent in

achieving this (see, e.g. Kirchoff et al., 2008; Walsh, 2011). Indeed, in accordance with wider

trends, Bowler et al. (2011: 727) consider the potential for “[s]ocial tagging” to enable

subject-based access to library materials. Voorbij (2010: 274) similarly notes “interest in

users’ search terms”, which could provide an alternative starting point to top-down

classification schemes, thus reflecting emerging principles of “participatory design” (Bowler

et al., 2011: 734) in online cultural heritage information provision. Indeed, Europeana itself

offers individual users some tagging functionality via the ‘My Europeana’ area (Europeana,

[2012?]g).

Chapter 3: Methodology

3.1 Background and Theoretical Basis

As noted in Section 1.3, a primary aim of this study is to create a query classification

scheme for online queries submitted to the Europeana Web portal, illuminating querying

patterns and facilitating comparison between different query types (e.g. frequent versus

random queries). The study hypothesis is that there are significant differences between

popular and other queries, meaning that its overall approach is broadly deductive.

However, as outlined below, query classification was initially undertaken using a more

inductive, data-focused approach (see, e.g. Jansen, 2009: 65).

Query classification is subject-based, with the intention of developing a classification

scheme that is specific to this study but informed by existing examples (e.g. Chaudhry &

Jiun, 2005; Minelli et al., 2007). The decision to create a new scheme reflects the lack of

existing classifications specific to cultural heritage settings like Europeana; as noted in

Section 2.7, existing subject-based schemes (e.g. for Web queries) may not be easily

transferable to this context.

3.1.1 Query Log Analysis

Query log analysis, a methodology related to broader transaction log analysis (Jansen et al.,

2009: 2), is the primary data analysis approach adopted for this study (see also: Jansen,

2009; Nicholas & Clark, 2012). Although not a new methodology, it is becoming increasingly

established in the context of online and digital system environments, thus requiring novel

approaches (Agosti et al., 2012; Clough, 2009; Cooper, 2001; Jansen, 2006; Jansen & Pooch,

2001; Nicholas & Clark, 2012). Jansen and Spink (2006: 254), focusing on the Web,

summarise the benefits of logs as data sources:

“Web transaction logs…unobtrusively [record] real interactions by real users in the

pursuit of real information needs in the complex Web information environment”.

Large volumes of data originating from different systems can therefore be gathered in

natural settings, without requiring direct interaction between users and researchers; Jansen

(2006: 424) further highlights the comparatively low cost of obtaining log data (see also:

Agosti et al., 2012; Clough, 2009; Cooper, 2001; Kurth, 1993). Nevertheless, its descriptive

nature can limit query log analysis, since it “cannot explain why something has occurred”

(Minelli et al., 2007: 3) and may be most effective when combined with other approaches

like interviews or surveys (see, e.g. Agosti et al., 2012; Clough, 2009; Jansen et al., 2009;

Kurth, 1993; Minelli et al., 2007).

Ethical issues also require consideration, since queries and query log metadata like IP

addresses can potentially identify individuals (Section 2.3). However, this project examines

only query text, frequencies and some non-identifying metadata like the use of search

filters; it was considered unlikely that query text itself would contain identifying

information given the cultural heritage context. The study therefore received a ‘No Risk’

ethical designation.

Query log analysis encompasses different stages of data “collection…preparation…and

analysis” (Jansen, 2006: 412. See also: Agosti et al., 2012; Eirinaki & Vazirgiannis, 2003;

Jansen & Pooch, 2001; Kurth, 1993). However, by using an existing dataset that was

collected and prepared before being passed to the researcher, this study focuses on the

analytical stage. Although a potential limitation, since data collection is therefore not

directed specifically towards the research (see, e.g. Jansen, 2009; Jansen et al., 2009), this is

considered the best approach given time constraints and the researcher’s limited

experience of dealing with log data. Analysis itself is conducted at the query level: “[a]

query is defined as a string list of zero or more terms submitted to a search engine” (Jansen,

2006: 418).
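
As a minimal illustration of what analysis at the query level can involve, the following Python sketch counts how often each query string occurs and splits a query into its terms. It is not the procedure used to prepare the study data, which was extracted before being passed to the researcher. The normalisation step (lower-casing and collapsing whitespace) and the function names are assumptions introduced here purely for illustration.

```python
from collections import Counter


def normalise(query: str) -> str:
    """Lower-case a query and collapse runs of whitespace (an assumed normalisation)."""
    return " ".join(query.lower().split())


def query_frequencies(raw_queries):
    """Count how often each normalised query string occurs in a list of raw queries."""
    return Counter(normalise(q) for q in raw_queries)


def terms(query: str):
    """Split a query into terms; an empty query yields zero terms."""
    return normalise(query).split()
```

Allowing an empty list of terms mirrors Jansen's definition of a query as a list of "zero or more terms".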

3.1.2 Query Classification

Query classification is a popular analytical approach, generally using topic categories like

“people, places or things” (Levene, 2006: 66. See also: Agosti et al., 2012; Clough, 2009;

Jansen & Spink, 2006; Spink et al., 2002). It can be performed either automatically or

manually and is often associated with taxonomy development (see, e.g. Agosti et al., 2012:

666). Indeed, reflecting the large numbers of queries submitted to online search engines

and comparatively high cost of manual classification (Li et al., 2012: 10746), automatic

approaches are generally considered necessary for Web-based datasets; however, some

initial manual input is often utilised, such as to help develop classification categories (see,

e.g. Agosti et al., 2012; Gabrilovich et al., 2009).

A manual approach is nevertheless considered appropriate for this study given its focus on

both frequent and random query samples (Section 1.3.2), plus Europeana’s specialist

subject area. Gabrilovich et al. (2009: 3), for example, argue that ““Tail” queries…do not

have enough occurrences to allow statistical learning on a per-query basis”, implying that

manual query classification could be more effective for gaining a nuanced understanding of

less popular queries.

Practical classification can involve different approaches. Li et al. (2012: 10743), in their Web

query taxonomy, note that “we can expand the names of categories from other sources”,

increasing the flexibility and relevance of the scheme for different user groups and subject

areas (see also: Concordia et al., 2010; Gabrilovich et al., 2009). Similarly, Jansen et al.

(2011: 491) classify Web queries “using the Google Directory topical hierarchy”, whilst

Agosti et al. (2012: 678) approach classification from a more library-based perspective,

considering the relevance of existing schemes and standards like “authority control rules”.

The character and granularity of the classification scheme employed or developed must

therefore be appropriate for the type of data and study purpose (Gabrilovich et al., 2009; Li

et al., 2012).

Query log analysis is strongly linked to inductive methodologies, particularly grounded

theory. Defined as “the discovery of theory from data” (Glaser & Strauss, 1967: 1), this

approach is noted specifically in the context of query log analysis by Jansen and Pooch

(2001) and Jansen (2006). Ross and Wolfram (2000: 951) similarly describe how “[c]oding

categories were developed inductively” in their Web query study. The inductive

classification approach adopted here is therefore informed by these examples.

3.2 Data Collection and Analytical Approach

Data collection involved existing Europeana query log data held by the researcher’s project

supervisor; relevant information was extracted from the log before being passed to the

researcher for analysis.

3.2.1 Datasets

Several distinct datasets are analysed in this study:

1. Filters Dataset: usage frequencies of different Europeana filters for a sample of

approximately 100,000 queries from early 2012. Intended to give a broad overview

of querying patterns and including some examples of query text (84) to seed initial

classification scheme development.

2. Popular Queries: query text and frequencies for the 150 most frequent queries

submitted to Europeana between 01.01.2012-30.06.2012. Utilised for further query

classification scheme development.

3. Random Queries: query text and frequencies for two samples of 150 random

queries submitted to Europeana between 01.01.2012-30.06.2012. Utilised for

classification scheme refinement and comparison with the ‘Popular Queries’

dataset.

4. Media Types Dataset: query text and frequencies for the 25 most popular queries

submitted to each of the Europeana ‘Type’ filter’s four main options (Europeana,

[2012?]d) between 01.01.2012-30.06.2012. Intended to further illuminate filter

usage, in particular through comparison of query classifications across media type

options.

The total sample therefore includes approximately 500 query text examples. Existing

studies involving manual query classification have generally considered around 1000-2000

queries (see, e.g. Gabrilovich et al., 2009; Ross & Wolfram, 2000). However, a smaller

sample is considered appropriate here given the relatively small scale and time constraints

of the study, alongside its consideration of aspects like filter specifications in addition to

query text.
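
The distinction between the Popular Queries and Random Queries datasets can be sketched as follows. This is a hedged illustration only: the datasets were extracted by the project supervisor, and the extraction method is not documented here. The sketch assumes that random samples are drawn without overlap from the set of distinct query strings; sampling from all submissions instead would weight frequent queries more heavily. The function names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def most_frequent(freqs: Counter, n: int):
    """Return the n most frequent (query, frequency) pairs."""
    return freqs.most_common(n)


def random_samples(freqs: Counter, sample_size: int, n_samples: int, seed: int = 0):
    """Draw n_samples non-overlapping random samples of distinct queries.

    Assumption: sampling is over distinct query strings, not over submissions.
    """
    rng = random.Random(seed)          # fixed seed so the draw is repeatable
    distinct = sorted(freqs)           # sort for a stable starting order
    drawn = rng.sample(distinct, sample_size * n_samples)
    return [
        [(q, freqs[q]) for q in drawn[i * sample_size:(i + 1) * sample_size]]
        for i in range(n_samples)
    ]
```

Under these assumptions, `most_frequent(freqs, 150)` would correspond to the Popular Queries dataset and `random_samples(freqs, 150, 2)` to the two Random Queries samples.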

3.2.2 Data Analysis and Classification Scheme Development

Data analysis encompasses investigation of both broad querying characteristics and query

classification scheme development. The Filters Dataset is first considered (Section 4.1),

including filter specification patterns (Section 4.1.2) and classification of query text

examples (see below). Analysis then focuses on features like language and query

‘selections’, the latter of which are defined in Section 4.1.1, for the Popular Queries

(Section 4.2) and Random Queries (Section 4.3) datasets, plus refinement of the study

classification scheme.

The Filters Dataset included 84 results with ‘text’ ‘selections’ (see: Section 4.1.1), thought

to represent user-generated query refinement terms (Europeana, [2012?]d). Minelli et al.’s

(2007) categories (Section 2.7) provided a basic initial classification framework, which was

intended to offer an outline structure without restricting the exploratory nature of the

classification and emergence of categories from the data; however, “Proper names”

(Minelli et al., 2007: 4) was substituted here by ‘People’ to avoid confusion with

institutional names (Table 1).

Category (based on Minelli et al., 2007: 4)    Frequency
“Subject”                                      48
People                                         17
“Place”                                        12
“Time”                                         1
Unknown                                        6

Table 1: Baseline classification of Filters Dataset ‘text’ query selections

As shown in Table 1, only one query (‘2010’) fit well within the ‘Time’ category, whilst over

half were classified as ‘Subject’. It should be noted that several queries were ambiguous,

meaning that exact classification figures cannot be guaranteed without further contextual

information; for example, the query ‘china’ could describe either a place or material (i.e. a

subject). The classification pattern was nevertheless considered sufficiently clear for useful

analysis.
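
The proportions implied by Table 1 can be checked with a short calculation. The Python sketch below uses only the frequencies reported in Table 1; the variable names are illustrative.

```python
# Frequencies as reported in Table 1 (Filters Dataset 'text' query selections).
table_1 = {"Subject": 48, "People": 17, "Place": 12, "Time": 1, "Unknown": 6}

total = sum(table_1.values())  # 84 queries in total

# Percentage share of each category, rounded to one decimal place.
shares = {category: round(100 * count / total, 1) for category, count in table_1.items()}

for category, share in shares.items():
    print(f"{category}: {share}%")
# Subject: 57.1%
# People: 20.2%
# Place: 14.3%
# Time: 1.2%
# Unknown: 7.1%
```

The 'Subject' share of 57.1% corresponds to the observation that over half of the queries fall into this category.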

Reflecting the predominance of ‘Subject’ queries (Table 1), plus particularly wide variability

within this group, a subject-based approach was adopted for preliminary classification

scheme development, with multiple levels to incorporate other categories (e.g. personal

names). The classification scheme as it stood at this stage is given in Appendix 1. As a result

of the data-driven approach, only categories emerging directly from the Filters Dataset are

included in Appendix 1. Given the small sample, this was also necessarily intended as a

seeding point conducive to future refinement rather than a comprehensive description of

Europeana querying patterns.

Investigation then considered whether the classification scheme’s top-level headings could

be mapped to an existing scheme, namely the ‘Library of Congress Subject Headings (LCSH)’

(Library of Congress, [2012?]). This was considered more appropriate for the cultural

heritage context than general Web classification schemes noted in Sections 2.5/2.7 (see,

e.g. Jansen et al., 2011; Li et al., 2012). Indeed, Walsh (2011: 329) notes that “LCSH has

become one of the major tools for online information retrieval”, with potentially high

practical applicability for organising content and facilitating search. Again, this was an

exploratory comparison intended to inform subsequent classification scheme refinement.

For example, the lack of an ‘LCSH’ (Library of Congress, [2012?]) ‘Current Affairs’ descriptor

prompted consideration that this heading might not be appropriate for content structuring

due to the issue of constantly changing material; the related term ‘Politics’ was therefore

searched instead.

Comparison was made by searching terms from the scheme’s Primary Categories via the

Library of Congress Subject Headings area of the Library of Congress Website (Library of

Congress, [2012?]). It was found that, whilst Primary Categories were generally too broad

to have direct ‘LCSH’ equivalents (Library of Congress, [2012?]), mapping of the narrower

Secondary/Tertiary categories was likely to be feasible. Potentially relevant descriptors

emerging from this exploratory search are given in Appendix 2.

Analysis of the larger Popular and Random Queries datasets facilitated further classification

scheme development, aiming to reach a saturation point where no new categories

emerged. Kurth (1993: 101) highlights the potential impact of sampling strategy on study

results; in this case, consideration of popular queries was intended to give the study high

practical relevance (Section 1.3.2, Objective 3), whilst considering random query samples

(Section 1.3.2, Objective 4) involved deeper, more detailed analysis of Europeana querying

characteristics (see, e.g. Gabrilovich et al., 2009; Jansen, 2006; Ross & Wolfram, 2000).

Including both frequent and other queries was also intended to introduce an element of

“[c]omparative analysis”, which is considered an essential component of the grounded

theory approach (Glaser & Strauss, 1967: 21).

Analysis of popular queries enabled substantial refinement of the classification scheme. In

particular, the Primary Category ‘Society and Current Affairs’ was renamed ‘Politics and

Society’, whilst the Tertiary Category ‘Organisation or Institution’ was separated to

become a new Primary Category (‘Collections, Organisations and Institutions’), reflecting

the large number of popular queries with collection or provider-based query ‘selections’.

An additional Primary Category (‘Business and Industry’) emerged, alongside multiple new

Secondary/Tertiary Categories.

The revised scheme, following modification based on frequent query analysis, is given in

Appendix 3. It became apparent that some aspects of the scheme required further

refinement, such as potential overlap between sub-categories in ‘The Arts’ and ‘Object or

Form Descriptors’. Other aspects appeared to work well, including the three-level structure

and maintenance of ‘Military and Military History’ as a distinct Primary Category, with an

expanded number of both Secondary and Tertiary categories compared with its limited

original size (Appendix 1).

Subsequent analysis of Random Queries datasets helps to clarify the scheme; the final

scheme is discussed and presented in Section 4.3.3. Classification of frequent and random

queries is again informed by a combination of inductive and deductive approaches, focusing

initially on categories emerging from the data and then considering whether these can be

mapped to existing schemes (Section 4.3.4).

The final classification scheme is then used to compare frequent and random queries, plus

those refined by media type using one of Europeana’s filtering options (Europeana,

[2012?]d) (Section 1.3.2, Objective 5). Selection of this filter is informed by both theoretical

and practical considerations, aiming to facilitate a meaningful comparison by incorporating

queries concerning different types of material (e.g. modern and historical), but taking

practical issues like the number of available options into account. Additionally, multimedia

content provision is considered particularly important in the cultural heritage sphere (see:

Section 2.6).

This study therefore considers both technical and subject-based aspects of online cultural

heritage information-seeking, with data analysis and evaluation encompassing qualitative

and quantitative approaches; both are considered valid for studies employing query log

analysis (see, e.g. Jansen & Pooch, 2001; Kurth, 1993; Ross & Wolfram, 2000), alongside LIS

research more generally (Eldredge, 2004; Hider & Pymm, 2008). It was anticipated that

some query topics would be unclear or unknown to the researcher. Foreign language

queries were therefore translated using ‘Google Translate’ (Google, [2012?]) and queries

with unknown subjects entered either into Wikipedia (Wikipedia: The Free Encyclopedia,

[2012?]), chosen for its wide subject coverage and existing association with Europeana

(Europeana, [2012?]f), or Europeana itself (Europeana, [2012?]a). A consistent approach is

intended, so that queries remaining unclear are classified as such without consultation of

additional sources.


Chapter 4: Results

4.1 Filters Dataset

4.1.1 Filters and Query ‘Selections’

This dataset contains filter specifications for a sample of 110,691 Europeana queries from

early 2012 (Section 3.2.1), illustrating usage frequencies for different filters available

through the portal. The data include both filters available via results

pages for query refinement (Figure 1) and what are defined in this study as ‘query

selections’.

Figure 1: A screenshot of a Europeana results page following query submission (with query

‘faust’), showing filtering options down the left hand side (Europeana, 2012)

‘Selections’, as defined for this study, include descriptors like “who, what, where or when”

(Europeana, [2012?]d: n.p.) that can be entered either directly by users or generated by

selecting options from individual object pages within Europeana, the latter therefore

indicating browsing rather than search behaviour. Indeed, whilst not present in the Filters

Dataset, other datasets considered here (Section 3.2.1) include additional ‘selections’ like

‘europeana_collectionName’ that appear to arise via browsing when content is selected


from options available via Europeana’s ‘Explore’ function (Europeana, [2012?]e: n.p.), as

shown in Figure 2.

Figure 2: A screenshot of the Europeana Web portal Homepage (Europeana, [2012?]a),

showing the ‘Explore’ option along the top of the screen (Europeana, [2012?]e: n.p.)

‘Selections’ therefore appear to represent both user-generated (i.e. free-text entry) and

system-generated (i.e. browsing) queries (see, e.g. Nicholas & Clark, 2012). They are distinct

from filters, generating new queries rather than modifying existing queries. The ‘text’

selection nevertheless seems to arise from utilisation of Europeana’s ‘Refine your search’

option, whereby users can input new terms to search within the results set of an existing

query (Europeana, [2012?]d: n.p.), therefore combining free-text querying with query

refinement.

4.1.2 Filter Specification

In total, 35,684 (32.2%) Filters Dataset queries are either filtered or involve query

‘selections’, the majority (35,365, or 31.9%) of these filtered. Figure 3 shows the usage

proportions for different filters, excluding ‘selection’ options. The “-TYPE” filter – too small

to be visible on the chart – refers to 27 queries, 26 of which specify ‘Wikipedia’. The

definition of this filter is not entirely clear, but it may arise from the existing links between


Europeana and Wikipedia (see, e.g. Europeana, [2011?]a; Europeana, [2012?]f; Nicholas &

Clark, 2012).
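The proportions reported for filtered and ‘selected’ queries follow directly from the dataset counts. A minimal sketch in Python, using only the figures quoted in this section; the variable and function names are illustrative assumptions and do not describe the tooling used in the study:

```python
# Counts quoted in Section 4.1.2; all names here are illustrative only.
TOTAL_QUERIES = 110_691         # size of the Filters Dataset sample
FILTERED_OR_SELECTED = 35_684   # queries that are filtered or involve 'selections'
FILTERED = 35_365               # queries that are filtered


def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(percentage(FILTERED_OR_SELECTED, TOTAL_QUERIES))
print(percentage(FILTERED, TOTAL_QUERIES))
```

Rounding to one decimal place matches the precision used for percentages in this section.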

Figure 3: A pie chart showing usage proportions (%) of Europeana filters specified in the

Filters Dataset

‘TYPE’ (54%) is therefore clearly the most commonly specified filter, followed by

‘PROVIDER’ (16.2%) and ‘YEAR’ (13.5%). The ‘TYPE’ filter refers to “media type”

(Europeana, [2012?]d: n.p. See also: Europeana, [2012?]f), which can be further subdivided

as shown in Figure 4.

In this example, and others below, the data contains some fields with identical names bar

the addition of speech marks (e.g. Text and “Text”). Since these appear to represent the

same options, values are combined to create Tables (2-5) and Figure 4 below, identifiable

by the label ‘Corrected’ (or ‘C.’).
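The combination of fields that differ only by speech marks can be sketched as follows. This is an assumed approach rather than the procedure actually used to produce the ‘Corrected’ values; the function name and the example counts are hypothetical:

```python
from collections import Counter


def merge_quoted_labels(counts: dict[str, int]) -> Counter:
    """Combine counts whose labels differ only by surrounding speech marks."""
    merged: Counter = Counter()
    for label, count in counts.items():
        # Remove straight and curly double quotes plus stray whitespace.
        cleaned = label.strip().strip('"\u201c\u201d').strip()
        merged[cleaned] += count
    return merged


# Hypothetical counts, chosen only to illustrate the merge.
example = {'Text': 10, '"Text"': 5, 'Image': 7, '\u201cImage\u201d': 3}
print(merge_quoted_labels(example))
```

Only surrounding quotation marks are stripped, so labels that differ in any other respect remain separate.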

[Figure 3 values: LANGUAGE 7%; RIGHTS 1%; COUNTRY 8%; YEAR 14%; “-TYPE” 0%; PROVIDER 16%; TYPE 54%]

Figure 4: A pie chart showing the usage proportions (%) of media type options specified in

the Filters Dataset ‘TYPE’ filter

Figure 4 shows that the media type options ‘Image (C.)’ and ‘Text (C.)’ are clearly dominant, together representing almost 80% of total ‘TYPE’ filter usage. The least specified option is ‘3D’, with only 11 occurrences, representing less than 1% of ‘TYPE’ filter usage and not visible in Figure 4. The ‘Unknown’ option refers to 1045 (≈5%) occurrences that appear separately in the dataset but whose field is unnamed.

‘YEAR’ values specified in the Filters Dataset are aggregated by century to create Figure 5. It can be seen that the distribution is broadly positively skewed. Years within the 20th century are most commonly specified, with 30.5% of total ‘YEAR’ filter usage, followed by the 17th and 16th centuries, each with approximately 15% of total ‘YEAR’ specifications. A small number of examples (19/4783 ≈0.4%) specify future dates; for example, there are four occurrences of the year ‘5640’.
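Aggregation of ‘YEAR’ values by century can be sketched as below. The bucket boundaries (e.g. 1901-2000 for the 20th century), the cut-off year for ‘Future’ and the handling of non-positive values are assumptions made for illustration; the sample years are hypothetical apart from ‘5640’, which is quoted above:

```python
from collections import Counter

DATA_COLLECTION_YEAR = 2012  # assumption: later years are counted as 'Future'


def century_label(year: int) -> str:
    """Map a year (CE) to the century bucket used for aggregation."""
    if year > DATA_COLLECTION_YEAR:
        return "Future"
    if year < 1:
        return "Unknown"  # assumption: non-positive years are not interpretable
    return str((year - 1) // 100 + 1)  # e.g. 1901-2000 -> '20'


years = [1914, 1650, 1599, 2000, 2001, 5640]  # illustrative values only
print(Counter(century_label(year) for year in years))
```

Under these assumptions the year 2000 falls in the 20th century and 2001 in the 21st.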

[Figure 4 values: Sound (C.) 6%; Text (C.) 36%; Video (C.) 9%; Image (C.) 44%; 3D 0%; Unknown 5%]

Figure 5: A chart showing frequency of usage (aggregated by century) for the Filters Dataset

‘YEAR’ filter

Tables 2 and 3 show usage frequencies for the ‘LANGUAGE’ and ‘COUNTRY’ filters. Both

show only the ten most popular options; in total, there are 32 ‘LANGUAGE’ options (28

distinct, accounting for ‘Corrected’ options, including an ‘Unknown’ option) and 40

‘COUNTRY’ options (33 distinct, including an ‘Unknown’ option and amalgamating the

multi-country options ‘united kingdom’ and ‘uk’). These filters are utilised to a similar

degree, with 2447 and 2936 occurrences respectively. Languages are identified from their

abbreviations using the Library of Congress Codes for the Representation of Names of

Languages (Library of Congress, 2010).

[Figure 5 axes: Frequency (y-axis, 0 to 1600) by Century (x-axis: Future, 21st (to 2012), 20th, 19th, 18th, 17th, 16th, 15th, 14th, 13th, 12th, 11th, 10th, 9th, 8th, 7th, 6th, 5th, 4th, 3rd, 2nd, 1st, Unknown)]

Language   Language Name (quoted from: Library of Congress (2010: n.p.))   Frequency
fr (C.)    “French”                                                         530
de (C.)    “German”                                                         475
es (C.)    “Spanish; Castilian”                                             339
mul        “Multiple Languages”                                             203
pl         “Polish”                                                         144
en         “English”                                                        143
nl         “Dutch; Flemish”                                                 126
Unknown    Unknown                                                          75
it         “Italian”                                                        67
sl         “Slovenian”                                                      53

Table 2: The ten most popular options specified using the ‘LANGUAGE’ filter

Country or Country Group   Frequency
Germany                    830
Belgium                    249
Austria (C.)               249
France                     218
United Kingdom/UK (C.)     200
The Netherlands (C.)       170
Europe                     168
Spain (C.)                 167
Poland (C.)                106
Unknown                    71

Table 3: The ten most popular options specified using the ‘COUNTRY’ filter

Table 2 shows that French, German and Spanish are the most popular languages by a

significant margin; a ‘tail’ of less popular languages is visible even within this limited

selection. The most frequently specified countries (Table 3) are Germany, Belgium and

Austria. With the exception of the notably high figure for ‘Germany’, which is over three

times greater than that for ‘Belgium’, there appears to be less variation between country

specification frequencies compared to the language options.


Tables 2 and 3 show six clearly identifiable language/country pairs, summarised below

(Table 4), thus illustrating a reasonably strong relationship between usage patterns of these

filters in the Filters Dataset.

Language Name (quoted from:          Rank   Country or Country Group   Rank   Rank Difference
Library of Congress (2010: n.p.))                                             (Language – Country)
“French” (C.)                        1      France                     4      -3
“German” (C.)                        2      Germany                    1      1
“Spanish; Castilian” (C.)            3      Spain (C.)                 8      -5
“Polish”                             5      Poland (C.)                9      -4
“English”                            6      UK (C.)                    5      1
“Dutch; Flemish”                     7      The Netherlands (C.)       6      1

Table 4: Language/Country pairs and associated ranks from Tables 2 and 3
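The rank differences in Table 4 can be reproduced from the ranks in Tables 2 and 3. A minimal sketch, in which the dictionary names and the shortened language labels are illustrative assumptions:

```python
# Ranks transcribed from Tables 2 and 3 (1 = most frequently specified).
language_rank = {"French": 1, "German": 2, "Spanish": 3,
                 "Polish": 5, "English": 6, "Dutch": 7}
country_rank = {"France": 4, "Germany": 1, "Spain": 8,
                "Poland": 9, "UK": 5, "The Netherlands": 6}

# Language/country pairings as given in Table 4.
pairs = [("French", "France"), ("German", "Germany"), ("Spanish", "Spain"),
         ("Polish", "Poland"), ("English", "UK"), ("Dutch", "The Netherlands")]

# Rank difference = language rank minus country rank.
rank_difference = {
    language: language_rank[language] - country_rank[country]
    for language, country in pairs
}
print(rank_difference)
```

A negative difference indicates that the language is ranked higher (closer to 1) than its paired country.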

Interestingly, Table 4 shows no exactly corresponding ranks. There are particular

discrepancies between ‘Spanish/Spain’, ‘Polish/Poland’ and ‘French/France’, each with

language ranked higher than country. This potentially illustrates how, whilst interlinked,

usage of these filters is also likely to be influenced by factors like available content. This is

related in turn to ‘PROVIDER’ specifications, with ‘PROVIDER’ the second most popular

Filters Dataset filter (Figure 3). The ten most frequently specified providers are given in

Table 5.


Provider Name                                                              Frequency
Koninklijk Instituut voor het Kunstpatrimonium (KIK) [Brussel, België]     2257
moteur Collections                                                         501
Athena                                                                     282
The European Library (C.)                                                  255
Institut National de l’Audiovisuel                                         181
Europeana 1914-1918 (C.)                                                   133
Nationaal Archief                                                          131
Musée Royal de Mariemont                                                   101
Erfgoedplus.be                                                             96
Svenska litteratursällskapet i Finland                                     78

Table 5: The ten most popular options specified using the ‘PROVIDER’ filter

Table 5 shows particularly high specification of French and Belgian providers in the Filters

Dataset. For example, the Belgian ‘Koninklijk Instituut voor het Kunstpatrimonium (KIK)’ is

over four times more frequently specified than the next most popular provider, perhaps

impacting Belgium’s high position in Table 3. ‘PROVIDER’ specifications also illustrate how

Europeana brings together existing aggregators and portals like ‘Athena’ and ‘The

European Library’ (see, e.g. Cousins et al., 2008).

4.2 Popular Query Analysis

4.2.1 Dataset Characteristics

The Popular Queries dataset (Section 3.2.1) contains query text and frequencies for the 150

most frequent queries submitted to Europeana between 01/01/2012-30/06/2012. Query

text includes both free text and ‘selected’ queries. Indeed, the most popular query is ‘*:*’

(frequency = 301,419), which appears from the researcher’s exploration of Europeana to

arise following selection of certain named providers via the ‘By provider’ option of

Europeana’s ‘Explore’ function (Europeana, [2012?]e: n.p.).

The lowest query frequency is 880, giving a frequency range of 300,539. Excluding the ‘*:*’

query, which is potentially anomalous, the frequency range is 22,885, distributed as shown

in Figure 6.


Figure 6: Popular Queries dataset frequency distribution, excluding the top-ranked result

Figure 6 shows a clearly long-tailed and relatively smooth frequency distribution, even

within the most popular Europeana queries; the frequency range is 21,230 between the

query ranks 2-20 alone, compared with only 1655 between the remaining query ranks 20-

150.
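The frequency ranges quoted in this section are mutually consistent, as the following sketch illustrates. It uses only the reported figures; the variable names and the derived ‘head share’ are illustrative additions rather than values reported in the study:

```python
# Frequencies and ranges quoted in Section 4.2.1; names are illustrative.
top_frequency = 301_419         # '*:*', the top-ranked query
lowest_frequency = 880          # lowest frequency in the dataset
range_ranks_2_to_20 = 21_230    # reported range across ranks 2-20
range_ranks_20_to_150 = 1_655   # reported range across ranks 20-150

full_range = top_frequency - lowest_frequency
range_without_top = range_ranks_2_to_20 + range_ranks_20_to_150

# Share of the reduced range that falls within ranks 2-20 (derived value).
head_share = round(100 * range_ranks_2_to_20 / range_without_top, 1)

print(full_range, range_without_top, head_share)
```

The two partial ranges sum to the range reported once the ‘*:*’ query is excluded, with the great majority of that range concentrated in ranks 2-20.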

4.2.2 Query ‘Selections’ and Language

Based on the dataset of 150 queries (rather than absolute query frequencies), over half of

the popular queries (86/150 or 57.3%) involve query selections, rising to 58% when the

query ‘*:*’ is included. The numbers of queries that employ particular selections are shown

in Figure 7.

[Figure 6 axes: Frequency (y-axis, 0 to 25000) against Query Rank (x-axis, 0 to 160)]

Figure 7: Usage frequencies for different query ‘selections’ in the Popular Queries dataset

Figure 7 shows that the most common selections relate to Europeana providers and

collections. Of those selections that have potentially been entered as free text, ‘what’

clearly has the highest frequency. This is also combined in four cases with ‘dc_type’, where

‘dc’ refers to ‘Dublin Core’, an established and prominent metadata scheme (see, e.g.

Kirchoff et al., 2008).

Query languages are also considered, although language identification could be difficult.

Place, personal and collection names and named works are therefore excluded from the

analysis, alongside language-ambiguous queries (e.g. ‘synagoge’ = ‘Synagogue’, potentially

German, Dutch or French). However, this approach has the limitation of excluding some

less ambiguous queries like ‘wien’ (=‘Vienna’ in German) and ‘maria maddalena’ (=‘Mary

Magdalene’ in Italian), meaning that absolute language frequencies are likely to be

significantly higher than those shown in Figure 8.

[Figure 7 axes: Frequency (y-axis, 0 to 30) by ‘Selection’ Name (x-axis)]

Multiple counts (one per language, excluding names) are included for the multilingual

query ‘sprookjes OR fairy tales OR grimm OR Perrault OR "Contes des fees" OR "basn" OR

"fiaba"’.

Figure 8: Language frequencies for the Popular Queries dataset

The popularity of French and German fits well with Table 2 concerning the Filters Dataset,

although English queries are comparatively more frequent here. However, the datasets are

not directly comparable, since Figure 8 represents manual coding of query text rather than

analysis of filter usage.

4.3 Random Sample Query Analysis

4.3.1 Dataset Characteristics

Two Random Queries datasets (Section 3.2.1) each contain query text and frequencies for

150 random queries submitted to Europeana between 01/01/2012-30/06/2012. As above,

query text includes both free text and ‘selected’ queries. The frequency range for the first

sample is 229 and for the second sample 149, with frequency distributions as shown in

Figure 9.

[Figure 8 axes: Frequency (y-axis, 0 to 16) by Language (x-axis: English, German, French, Dutch, Spanish, Italian)]

Figure 9: Frequency distributions for both Random Queries datasets

Figure 9 shows relatively smooth and closely-corresponding frequency distributions for the

two samples, which both decline steeply and end in long tails of single-occurrence queries.

The overall distribution therefore has a similar pattern to Figure 6 for the Popular Queries

dataset, but on a much smaller frequency scale.

4.3.2 Query ‘Selections’ and Language

The Random Queries datasets contain far fewer ‘selected’ queries than the Popular Queries

dataset, totalling 12.7% (19/150) and 22% (33/150) respectively. The numbers of queries

with particular selections are shown in Figure 10.

[Figure 9 axes: Frequency (y-axis, 0 to 250) against Query Rank (x-axis, 0 to 200); surviving legend entry: ‘Frequency (Sample 1)’]

Figure 10: Usage frequencies for different query ‘selections’ in the Random Queries

datasets

Figure 10 shows that the most common query ‘selections’ for both datasets are ‘what’ and

‘who’, both potentially free-text selections; indeed, one Sample 2 query specifies ‘Quoi:’ (=

‘what’ in French), suggesting free-text input. Contrasting with the Popular Queries dataset

(Section 4.2.2), Europeana providers and collections are very rarely specified, for example

with ‘europeana_provider OR europeana_country’ (see: Figure 7) not occurring in either

random sample. However, other ‘selections’ occur amongst the random samples that are

not present in the Popular Queries dataset, including ‘when’, ‘subject’ and

‘europeana_rights’.

Query languages are classified as in Section 4.2.2, with the additional consideration of

queries (e.g. ‘Tractatus qui de varietate astronomiae intitulatur’ = Latin) that appear to

represent (albeit unknown) named works and are therefore also excluded from the analysis

(Figure 11).

[Figure 10 axes: Frequency (y-axis, 0 to 14) by ‘Selection’ Name (x-axis)]

Figure 11: Language frequencies for the Random Queries datasets

The total number of language-identifiable queries is similar for both datasets: 42 and 44

respectively, comparable to the 39 examples identified from the Popular Queries dataset

(Section 4.2.2). Similarly, the three most popular languages are English, French and German

in all three cases, although Sample 1 (Random Queries) contains slightly more German than

French queries. Both random query samples contain a greater variety of languages than the

frequent queries; for example, at least one Polish, Portuguese and Norwegian query occurs

in each dataset (Figure 11).

4.3.3 Query Classification: Final Scheme Refinement

The query classification scheme was refined based on query subjects/topics in the Random

Queries datasets. New categories emerging (e.g. ‘Sport’) were primarily Secondary or

Tertiary rather than Primary Categories. However, there was extensive restructuring, in

particular splitting some Primary Categories that had become overly large with the addition

of new Secondary/Tertiary Categories. For example, ‘The Arts’ was split into Primary

Categories ‘Arts and Design’, ‘Literature and Poetry’ and ‘Music, Film and Theatre’, whilst

new Primary Category ‘Lifestyle and Entertainment’ was separated from ‘Politics and

Society’.

Conversely, the distinction between online/physical collections was removed from ‘Named

Collections – Other Collections’, since it was felt that this was not meaningful given the


large number of collections with both a physical and an online presence. It was also decided

that ‘Object or Form Descriptors’ should refer only to formats specified using query

‘selections’ (e.g. ‘what:text’ from the Popular Queries dataset), recognising the difficulty of

determining whether non-specified queries like ‘film’ refer to subjects or desired results

formats. The aim is therefore to avoid overlap with subject categories, meaning that

queries classified as ‘Object or Form Descriptors’ do not receive additional subject

classifications.

The final query classification scheme developed for this study is summarised in Table 6.

Green highlights show a small number of terms drawn from queries themselves, either

directly or in translation (e.g. ‘Military Tribunals’ from the query ‘tribunal militar’), that are

considered particularly appropriate for representing classification scheme topics or

subjects.


Primary

Categories

Secondary Categories Tertiary Categories

Philosophy,

Mythology

and Religion

Philosophy

Mythology

Religion

Named Figures

Ideas and Concepts

Folk and Fairy Tales

Legends

Classical Philosophy, Mythology and Religion

Theology and Religious History

Named Religions and Religious Groups

Named Figures: Ministers and Officials

Named Figures: Religious Texts

Festivals and Ceremonies

Iconography and Objects

Religious Buildings, Locations and Communities

Place,

Civilisation

and Travel

Geographical Features

or Regions

Countries and

Settlements

Travel

Civilisation or Culture

Country

City: Capital

City: Other

Municipality, Town or Village

Specified Address

Island (Inhabited)

Region or Administrative Region

Maps and Travel Guides

Languages

Historical Place Names

Ancient and Classical Civilisation and Culture

Politics and

Society

Named Figures

Political Leaders and Politicians

Royalty and Nobility


News

Law and Crime

Amenities and Facilities

History and Social

Change

Organisations and

Societies

Civil Ceremonies and

Events

International Relations

Named Newspapers

Journalism

History of Crime

Copyright

Housing

Hospitals and Healthcare

Libraries

Schools and Education

Marriage

Political Agreements

Military and

Military

History

Named Figures

Military Engagements

Procedure and

Discipline

Military Objects

Military Leaders and Personnel

Prisoners of War

Historical Figures

Strategy and Tactics

Treaties and Agreements

World Wars

Military Tribunals

Military Records

Buildings, Locations and Bases


Transport

Weapons and Equipment

Lifestyle and

Entertainment

Entertainment and

Events

Transport

Sport

Computing

Fashion and Beauty

Advertising

Performances

Exhibitions

Arcades

Road

Rail

Air

Other

Named Sports and Sports Clubs

Sporting Events

Equipment

Social Media

Arts and

Design

Named Figures

Named Works or

Subjects

Artistic Periods, Styles

or Movements

Genres

Creators or Designers

Collectors

History of Art

Classical Art

Portrait

Landscape

Painting, Drawing and Illustration

Engraving and Printing

Photography


Stamps

Bookplates

Postcards

Ceramics, Enamel, Pottery and Glass

Sculpture and Figurines

Fashion, Clothing and Jewellery

Other

Literature and

Poetry

Named Figures

Named Works or

Subjects

Literary Periods, Styles

or Movements

Genres

Authors and Editors

Publishers

Classical Literature

Poetry

Literature (Fiction)

Literature (Non-Fiction)

Ephemera

Music, Film

and Theatre

Named Figures

Named Works or

Subjects

Periods, Styles or

Movements

Instruments and

Equipment

Genres

Creators or Composers

Performers

Other

Folk Music

Musical Instruments

Music


Film

Theatre

Architecture,

Buildings and

Structures

Named Figures

Architectural Periods,

Styles or Movements

Genres

Architects

Landscape Architecture

Castles, Palaces, Religious Buildings and

Monuments

Civic Buildings, Housing and Businesses

Engineering Structures

Sciences Named Figures

Genres

Historical Figures

Natural History and Biology (Non-Human)

Animal Husbandry and Food Science

Human Biology and Medicine

Archaeology

Anthropology

Geography and Cartography

Physics and Astronomy

Technology

Business and

Industry

Named Companies or

Manufactories

Named Products and

Advertising

Named Industries

Patents

Mining and Resource Extraction

Construction and Manufacturing Industries

Generic

Subjects

Person

Place


Object

Time

Other

Date

Named

Collections

Libraries and Archives

Museums and Galleries

Other Collections

Portals and Aggregators

Geographical

Designations

Object or

Form

Descriptors

Europeana Query

‘Selections’: Format

Ambiguous or

Unclear

Person

Place

Computing Functionality

or Search Feature

Other

Table 6: A summary of the final study classification scheme following refinement based on

analysis of Random Queries datasets

4.3.4 Classification Scheme Mapping

To enhance its practical applicability, mapping between the study classification scheme

(Table 6) and existing schemes is also considered. As noted in Section 3.2.2, exploratory

mapping of an earlier version of the scheme suggested that ‘Library of Congress Subject

Headings’ (Library of Congress, [2012?]) would be more suitable for mapping the

Secondary/Tertiary Categories than Primary Categories (Appendix 2). The effectiveness of


mapping between Primary Categories and the broader, top-level headings of a Web-based

scheme, namely ‘Yahoo! Directory’ (Yahoo, Inc., 2012), is therefore considered instead. As

before, Secondary and Tertiary headings are considered in relation to the ‘LCSH’ scheme

(Library of Congress, [2012?]). The resulting potential mapping terms are given in Appendix

4.

Mapping between Primary Categories and ‘Yahoo! Directory’ (Yahoo, Inc., 2012) headings is

of mixed effectiveness. Although some Primary Categories (e.g. ‘Sciences’, ‘Business and

Industry’) have clear equivalents in the Web scheme, the majority are either too broad (e.g.

‘Politics and Society’) or too narrow (e.g. ‘Literature and Poetry’) to map successfully. This

suggests that the character of cultural heritage information, at least as revealed through

Europeana querying patterns, is indeed distinct from that of general Web information

resources. However, mapping between these schemes may be more feasible at narrower

levels of classification, which are not considered here.

In contrast, mapping between Secondary/Tertiary Categories and ‘LCSH’ (Library of

Congress, [2012?]) is generally effective; indeed, the majority of categories have direct

equivalents. Discrepancies primarily occur where categories like ‘Named Figures’ remain

too broad for direct mapping to ‘LCSH’ (Library of Congress, [2012?]), which is not

considered surprising given the small scale of this study. As such, library-based schemes

may remain more suitable than general Web schemes as a basis for classifying cultural

heritage information in online environments.

4.4 Comparison of Frequent and Random Query Classification

Patterns

As noted in Section 2.5, frequent and rare queries often have different characteristics (see,

e.g. Gabrilovich et al., 2009). The study classification scheme is therefore utilised here to

compare the topics of popular versus random queries. Given the large differences between

query frequencies (Figures 6 and 9), it is considered feasible to approximate rare queries

with the Random Queries datasets. Classification uses the Primary Categories of the

scheme (Table 6), with results as shown in Table 7.


Primary Category                              Popular       Random          Random          Random Queries
                                              Queries (%)   Queries 1 (%)   Queries 2 (%)   (Mean %)
Philosophy, Mythology and Religion            3.53          7.56            8.61            8.09
Place, Civilisation and Travel                22.4          19.6            15.3            17.5
Politics and Society                          2.94          8.00            7.18            7.59
Military and Military History                 3.53          5.78            2.87            4.33
Lifestyle and Entertainment                   1.76          4.44            2.87            3.66
Arts and Design                               10.6          10.2            14.4            12.3
Literature and Poetry                         0.588         8.89            8.61            8.75
Music, Film and Theatre                       3.53          2.67            2.39            2.53
Architecture, Buildings and Structures        1.76          2.67            6.22            4.45
Sciences                                      2.35          5.78            3.83            4.81
Business and Industry                         1.76          2.22            2.87            2.55
Generic Subjects                              1.18          7.56            4.31            5.94
Collections, Organisations and Institutions   31.2          3.11            2.39            2.75
Object or Form Descriptors                    11.8          0.444           0.478           0.461
Ambiguous or Unclear                          1.18          11.1            17.7            14.4
TOTAL                                         100           100             100             100

Table 7: Categorisation percentages for queries in different study datasets using the
Primary Categories of the study classification scheme

The primary limitation of this comparison is the difficulty of accurate classification itself,

although it is felt that this approach does at least enable clear querying patterns to emerge.


Categories are not mutually exclusive, meaning that queries with multiple subjects/topics

receive one count per category. As such, the total number of categorisations per dataset

indicates the comparative complexity of the queries; for example, the 150 most popular

queries have 170 categorisations overall, whilst the random samples appear more complex,

with 225 and 209 categorisations respectively.
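The complexity comparison above can be expressed as a mean number of categorisations per query. A minimal sketch using the totals quoted in this section; the names are illustrative, and the per-query mean is a derived figure rather than one reported in the study:

```python
# Total categorisations per dataset, as quoted in Section 4.4.
categorisations = {"Popular": 170, "Random 1": 225, "Random 2": 209}
QUERIES_PER_DATASET = 150  # each dataset contains 150 queries

# Mean categorisations per query, rounded to two decimal places.
mean_categories_per_query = {
    name: round(total / QUERIES_PER_DATASET, 2)
    for name, total in categorisations.items()
}
print(mean_categories_per_query)
```

A higher mean indicates that queries in that dataset more often span multiple Primary Categories.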

Table 7 shows that classification percentages for the random samples are generally similar.

The largest discrepancies are for ‘Ambiguous or Unclear’ (6.6 percentage points), ‘Arts and Design’ (4.2 percentage points) and ‘Place, Civilisation and Travel’ (4.3 percentage points), the latter more frequent in Sample 1 and the others in Sample 2. These are also the most frequent

categories overall for the Random Queries datasets. For the Popular Queries, the most

common categories are ‘Collections, Organisations and Institutions’, ‘Place, Civilisation

and Travel’ and ‘Object or Form Descriptors’.

Between the popular and random (mean %) queries, the largest discrepancies are for

‘Collections, Organisations and Institutions’ (28.5 percentage points), ‘Ambiguous or Unclear’ (13.2 percentage points) and ‘Object or Form Descriptors’ (11.3 percentage points). This is likely to reflect the much lower

use of query ‘selections’ in the random samples (Section 4.3.2), which accounts for most of

the collection and form-based categorisations amongst the frequent queries. Comparative

rankings of classification categories for the different datasets are given in Table 8; cell

shading represents categories within the datasets that have equal ranks (i.e. equal

categorisation percentages).
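The comparison between the popular queries and the mean of the two random samples can be sketched from the percentages in Table 7. The names below are illustrative assumptions; because the table values are already rounded, recomputed differences may deviate slightly from those quoted in the text:

```python
# (popular %, random sample 1 %, random sample 2 %), transcribed from Table 7.
TABLE_7 = {
    "Philosophy, Mythology and Religion": (3.53, 7.56, 8.61),
    "Place, Civilisation and Travel": (22.4, 19.6, 15.3),
    "Politics and Society": (2.94, 8.00, 7.18),
    "Military and Military History": (3.53, 5.78, 2.87),
    "Lifestyle and Entertainment": (1.76, 4.44, 2.87),
    "Arts and Design": (10.6, 10.2, 14.4),
    "Literature and Poetry": (0.588, 8.89, 8.61),
    "Music, Film and Theatre": (3.53, 2.67, 2.39),
    "Architecture, Buildings and Structures": (1.76, 2.67, 6.22),
    "Sciences": (2.35, 5.78, 3.83),
    "Business and Industry": (1.76, 2.22, 2.87),
    "Generic Subjects": (1.18, 7.56, 4.31),
    "Collections, Organisations and Institutions": (31.2, 3.11, 2.39),
    "Object or Form Descriptors": (11.8, 0.444, 0.478),
    "Ambiguous or Unclear": (1.18, 11.1, 17.7),
}


def discrepancy(row: tuple[float, float, float]) -> float:
    """Absolute gap between the popular % and the mean of the random samples."""
    popular, random_1, random_2 = row
    return abs(popular - (random_1 + random_2) / 2)


# The three categories with the largest popular-versus-random gaps.
largest = sorted(TABLE_7, key=lambda name: discrepancy(TABLE_7[name]), reverse=True)[:3]
for name in largest:
    print(f"{name}: {discrepancy(TABLE_7[name]):.2f} percentage points")
```

Differences are expressed in percentage points, since they are gaps between two percentages rather than relative changes.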


Rank (1 = high)   Popular Queries                               Random Queries 1                              Random Queries 2
1                 Collections, Organisations and Institutions   Place, Civilisation and Travel                Ambiguous or Unclear
2                 Place, Civilisation and Travel                Ambiguous or Unclear                          Place, Civilisation and Travel
3                 Object or Form Descriptors                    Arts and Design                               Arts and Design
4                 Arts and Design                               Literature and Poetry                         Philosophy, Mythology and Religion
5                 Philosophy, Mythology and Religion            Politics and Society                          Literature and Poetry
6                 Military and Military History                 Philosophy, Mythology and Religion            Politics and Society
7                 Music, Film and Theatre                       Generic Subjects                              Architecture, Buildings and Structures
8                 Politics and Society                          Military and Military History                 Generic Subjects
9                 Sciences                                      Sciences                                      Sciences
10                Lifestyle and Entertainment                   Lifestyle and Entertainment                   Military and Military History
11                Architecture, Buildings and Structures        Collections, Organisations and Institutions   Lifestyle and Entertainment
12                Business and Industry                         Music, Film and Theatre                       Business and Industry
13                Generic Subjects                              Architecture, Buildings and Structures        Music, Film and Theatre
14                Ambiguous or Unclear                          Business and Industry                         Collections, Organisations and Institutions
15                Literature and Poetry                         Object or Form Descriptors                    Object or Form Descriptors

Table 8: Comparative ranks for queries in different datasets classified using the Primary
Categories of the study classification scheme


4.5 Queries Filtered by Media Type

As noted in Section 2.6, multimedia formats are a particularly distinctive feature of cultural

heritage information (see, e.g. Kirchoff et al., 2008). It is therefore considered interesting to

classify and compare some small samples of queries refined by Europeana’s media filtering

options (Europeana, [2012?]d). The dataset contains the 25 most popular queries specifying

each of four main options from 01/01/2012-30/06/2012 (Section 3.2.1), thus including some

overlap with other datasets.

Variation in query frequencies indicates the comparative popularity of different filtering

options. For example, the most frequent ‘Text’ query is ‘dagras’ (frequency 23,624), whilst

that filtered by ‘Sound’ is ‘*:*’, with a much lower frequency (1310). The latter is the same

top-ranked query as for the Popular Queries dataset (Section 4.2) and also the most

popular query for the other media types. Frequency distributions for the different filtering

options are shown in Figure 12.

Figure 12: Frequency distributions for the most popular queries specifying different

Europeana media-based filtering options (Europeana, [2012?]d)

Excluding the most frequent ‘Text’ query, which is potentially anomalous, the distributions

appear as shown in Figure 13.

[Figure 12 axes: Frequency (y-axis, 0 to 25000) against Query Rank (x-axis, 0 to 30); series: Image, Sound, Text, Video]

Figure 13: Frequency distributions for the most popular queries specifying different

Europeana media-based filtering options (Europeana, [2012?]d), excluding the top-ranked

‘Text’ query

Figure 13 shows that ‘Image’- and ‘Video’-filtered queries are most popular, whilst ‘Sound’

is specified much less frequently. Overall, the frequency distribution has a similar long-

tailed pattern to that of the larger datasets (Figures 6 and 9). However, all four distributions

also show ‘stepped’ components, particularly ‘Sound’, ‘Text’ and ‘Video’, whose ‘steps’

appear to overlap quite strongly at the higher query ranks.

The study classification scheme (Table 6) is also tested on these more narrowly-specified

queries, considering its applicability beyond the main datasets. The aim is to classify queries

without making any scheme alterations; queries that are difficult to classify are noted, thus

facilitating evaluation of the classification scheme (see: Section 5.2). Results are given in

Table 9.

[Figure 13 axes: Frequency (y-axis, 0 to 7000) against Query Rank (x-axis, 0 to 30); series: Image, Sound, Text, Video]

Primary Category                              Media type:   Media type:   Media type:   Media type:
                                              “Image”       “Sound”       “Text”        “Video”
                                              Queries (%)   Queries (%)   Queries (%)   Queries (%)
Philosophy, Mythology and Religion            6.45          9.09          8.33          3.57
Place, Civilisation and Travel                38.7          30.3          44.4          46.4
Politics and Society                          9.68          6.06          5.56          0.00
Military and Military History                 12.9          3.03          2.78          0.00
Lifestyle and Entertainment                   6.45          0.00          2.78          7.14
Arts and Design                               3.23          3.03          0.00          3.57
Literature and Poetry                         0.00          0.00          5.56          0.00
Music, Film and Theatre                       6.45          36.4          5.56          14.3
Architecture, Buildings and Structures        3.23          3.03          5.56          3.57
Sciences                                      0.00          0.00          5.56          3.57
Business and Industry                         0.00          0.00          5.56          3.57
Generic Subjects                              3.23          0.00          0.00          3.57
Collections, Organisations and Institutions   3.23          6.06          2.78          7.14
Object or Form Descriptors                    3.23          0.00          2.78          0.00
Ambiguous or Unclear                          3.23          3.03          2.78          3.57
TOTAL                                         100           100           100           100

Table 9: Categorisation percentages for popular Europeana queries filtered by media type
(Europeana, [2012?]d) using the Primary Categories of the study classification scheme

As in Section 4.4, comparing total categorisations indicates comparative query complexity;

‘Video’ queries appear least complex, with 28 categorisations for the 25 queries, whilst

‘Text’ queries appear most complex, with 36 categorisations. ‘Place, Civilisation and Travel’

is the most popular category for all options except ‘Sound’, for which it is second most

popular after ‘Music, Film and Theatre’, reflecting the significant number of music-related

queries like ‘beethoven’ amongst ‘Sound’ queries. The popularity of ‘Image’ specifications

for ‘Military and Military History’ queries is also noticeable, with queries including

‘weltkrieg’ (=‘world war’ in German) and ‘what:World War One’. Results from Table 9 are

presented graphically in Figure 14.
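
The two measures used in this comparison, total categorisations per query and the percentage of designations falling in each Primary Category, can be expressed as a short calculation. The sketch below is hedged accordingly: the function name `category_profile` and the three example category assignments are assumptions made for illustration, and do not reproduce the study’s coding decisions.

```python
from collections import Counter


def category_profile(classified):
    """Summarise a mapping of query -> list of Primary Categories.

    Returns (total designations, designations per query, percentage by category).
    """
    designations = Counter(
        category for categories in classified.values() for category in categories
    )
    total = sum(designations.values())
    per_query = total / len(classified)
    percentages = {
        category: round(100 * count / total, 2)
        for category, count in designations.items()
    }
    return total, per_query, percentages


# Illustrative assignments only; a query may receive more than one category.
illustrative = {
    "beethoven": ["Music, Film and Theatre"],
    "weltkrieg": ["Military and Military History"],
    "l'islam en france": [
        "Philosophy, Mythology and Religion",
        "Place, Civilisation and Travel",
    ],
}
total, per_query, percentages = category_profile(illustrative)
```

Applied to the totals reported above, the same ratio gives 28/25 = 1.12 categorisations per query for ‘Video’ and 36/25 = 1.44 for ‘Text’.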

Figure 14: Primary Category percentages for popular Europeana queries filtered by media

type (Europeana, [2012?]d)

Query ‘selections’ occur infrequently within this dataset, comprising (including ‘*:*’) six

‘Sound’, four ‘Image’ and ‘Text’, and one ‘Video’ query. Three of the ‘Sound’ selections,

which all concern music, involve an example that does not occur elsewhere in the dataset

(‘europeana_rights’), suggesting usage by a distinct user group specifying media types

when using Europeana for a specific purpose (see: Section 5.1).

[Chart: y-axis ‘Percentage of Total Category Designations’ (0 to 50), x-axis ‘Classification Scheme: Primary Categories’; series: Image, Sound, Text, Video]

Chapter 5: Discussion

5.1 Europeana Querying Patterns: Filters, Query ‘Selections’ and

Languages

Filters Dataset results (Section 4.1.2) indicate a predominance of filtering by media type,

showing strong concern amongst Europeana users for information format; this kind of

functionality is also noted by Cousins et al. (2008: 133) regarding ‘The European Library’

portal (see also: Hearst, 2009). The popularity of ‘Image’ filtering specifically (Figure 4)

accords with wider literature concerning cultural heritage information-seeking (see, e.g.

Kirchoff et al., 2008; Politou et al., 2004). Contrastingly, the low popularity of ‘3D’ filtering

perhaps reflects how 3D visualisations are still an emerging form of representation for

cultural heritage material online; for example, Meyer et al. (2007: 405) note their high

potential for portraying archaeological sites.

Analysis of the Media Types dataset (Section 4.5) nevertheless reveals different results:

‘Text’ queries are most popular by a significant margin (Figure 12). This result is potentially

skewed by an anomalously high top-ranked query ‘*:*’, although the same query does

appear amongst the other media types. Interestingly, despite an overall more sharply

declining frequency distribution for ‘Video’ queries (Figure 13), the top-ranked ‘Image’

query and ‘Video’ query have similar frequencies, potentially contrasting with the low

occurrence of the ‘Video’ option in the Filters Dataset (Figure 4), although the datasets are not

directly comparable. Indeed, whilst Nicholas and Clark (2012: 93) assert the popularity of

“video and sound” queries submitted to Europeana (see also: Europeana, [2011?]b),

‘Sound’ filtering in particular appears infrequent in both the Filters and Media Types

Datasets (Figure 4, Figure 13).

Additional specifications in the Filters Dataset correlate well with wider trends. For

example, CIBER Research Ltd. (2011: 8) note that Europeana has a particularly high number

of French users; Nicholas and Clark (2012: 92) similarly highlight “France and Germany”,

corroborating this study’s finding of high French and German ‘LANGUAGE’ filter

specifications (Table 2). These authors also note the comparatively low prominence of UK

Europeana usage (Nicholas & Clark, 2012: 92), supported by Filters Dataset findings (Tables

2, 3). English, again followed by French and German, does appear to be the most popular

language based on analysis of the Popular and Random Queries datasets (Figures 8, 11),

although this could reflect a limitation of the manual language coding strategy for these

datasets (Section 4.2.2).

It is considered likely that discrepancies between ‘LANGUAGE’ and ‘COUNTRY’ filtering

specifications (Table 4) reflect the latter’s greater dependence on providers (i.e. content

availability) rather than Europeana users (Section 4.1.2), for example with the high position

of ‘Belgium’ in Table 3 potentially reflecting the Belgian origin of the top-ranked provider

(Table 5). Providers can include sub-collections or projects, as exemplified by the query

‘europeana_provider:"Europeana 1914 - 1918" OR europeana_country:"Europeana 1914 -

1918"’, which refers to a Europeana initiative with high user participation concerning World

War One (see, e.g. Europeana, [2011?]b: 20). Bowler et al. (2011: 730) emphasise the

“value of story as an access method”, perhaps accounting for the presence of this query in

the Popular Queries dataset; indeed, ‘Europeana 1914-1918’ is also the sixth most popular

provider appearing in the Filters Dataset (Table 5).

Usage of the ‘YEAR’ filter is also reasonably high (Figure 3), whereas time-related queries

are quite infrequent, suggesting high user awareness of this filtering option. Where time-

based terms do occur in query text, they are often accompanied by other terms, as for the

query ‘school in 1800s’ (Random Queries dataset). Some query examples are therefore

quite specific, contrasting with Park et al.’s (2005: 215) assertion that “most Web users

want to get general information”. Nevertheless, other queries appear to support this

statement; for example, the most frequent ‘Video’ queries include ‘film’ (2nd), ‘filmy’ (9th),

‘video’ (16th) and ‘films’ (17th).

Although the most frequently occurring query languages (English, French and German) are

consistent across the Popular and Random Queries datasets, random query samples exhibit

a wider overall range of languages, including queries like ‘fabeloj’ (=‘stories’ in Esperanto)

(Section 4.3.2, Figure 11). A possible relationship between language and query type is also

apparent, with queries (almost exclusively in French) concerning Classical vases occurring

across the Popular and Random Queries datasets; these perhaps represent a distinct user

group, or particularly large amount of French-language material available on this topic (e.g.

‘un vase et un dieu ou hero’ = ‘a vase and a god or hero’, ‘Vase avec dieu grec’ = ‘vase with

a Greek god’, both in French).

Multilingual querying is rare, with only two clear examples that are both from the Popular

Queries dataset: ‘what: Einfamilienhaus/Villa’ (German/Ambiguous), which is potentially

either free-text or ‘selected’ via an object page, and the complex, free-text query

‘sprookjes OR fairy tales OR grimm OR Perrault OR "Contes des fees" OR "basn" OR

"fiaba"’ (Dutch, English, French, Unknown & Italian, plus names). Indeed, Purday (2009:

933) emphasises that multilingual search is a key area of concern for Europeana (see also:

Europeana, [2011?]a), reflecting the importance of language as a factor affecting the

accessibility of cultural heritage information generally (Agosti et al., 2012; Minelli et al.,

2007). The fairy-tale query, which is the 11th most frequent overall and also notable for its inclusion

of operators (e.g. ‘OR’), similarly occurs amongst the most frequent ‘Video’ (18th) and

‘Sound’ (22nd) queries, suggesting that acted or spoken fairy tales are a distinct and popular

type of cultural heritage material.

The presence of query ‘selections’ occasionally makes it difficult to identify querying

characteristics, in particular distinguishing searching and browsing behaviour (see: Section

4.1.1). Indeed, the most popular query overall (‘*:*’) appears to arise from provider-based

browsing (Section 4.2.1) and also occurs across the media type specifications (Section 4.5).

The importance of facilitating both search and browse is highlighted by several authors; this

study indicates that Europeana users do indeed utilise different approaches to search and

access content (see, e.g. Agosti et al., 2012; Jansen, 2009; Levene, 2006; Nicholas & Clark,

2012). For example, ‘who: giorgione?utm_source=blog’, from the Random Queries dataset,

appears to support the use of blogs as additional access points (see, e.g. CIBER Research

Ltd., 2011; Nicholas & Clark, 2012).

Although potentially limiting the study, since queries therefore do not always indicate free

text entered by users, analysis of ‘selections’ nevertheless provides an interesting

additional point of comparison between the different study datasets. For example,

collection/provider ‘selections’ occur much more frequently in the Popular Queries dataset

compared to the random samples (Sections 4.2.2, 4.3.2). This is reflected in the study

classification scheme, where the proportion of queries categorised as ‘Collections,

Organisations or Institutions’ is approximately ten times higher for popular versus random

queries (Table 7, Section 4.4). Indeed, ‘selection’ occurrences are significantly higher

overall amongst the frequent queries, which is likely to indicate greater browsing

behaviour; despite high interest in specific collections, users may not always have well-

defined information requirements, perhaps reflecting a distinct user group of potential

visitors (e.g. tourists) to associated physical collections.

In contrast, the ‘europeana_rights’ selection does not occur at all in the Popular Queries

dataset. Where it does occur, query construction is also quite complex, suggesting that

users are experienced and have quite specific search goals (e.g. ‘music AND

europeana_rights:*creative* AND NOT europeana_rights:*nc* AND NOT

europeana_rights:*nd*’). These queries are in fact amongst the most frequent ‘Sound’

queries in the Media Types dataset (Section 4.5), potentially representing a particular niche

user group; for example, CIBER Research Ltd. (2011: 4) highlight use of Europeana by “the

creative and information industries”. Overall, the distribution of ‘selection’ occurrences

therefore seems to confirm Nicholas and Clark’s (2012: 91) findings concerning different

Europeana user groups, in particular suggesting both a “consumer/leisure profile…[and] a

large academic following”.
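
The identification of ‘selections’ and operators in query text can be approximated mechanically, as sketched below. This is a rough heuristic under stated assumptions rather than a description of how Europeana parses queries: a ‘selection’ is taken to be any `field:` prefix at the start of a whitespace-delimited token, only the upper-case operators seen in the queries quoted in this study are counted, and the name `describe_query` is hypothetical.

```python
import re

# Assumption: a 'selection' appears as "field:" (or "*:") at the start of a token.
FIELD_PREFIX = re.compile(r"(?<!\S)(\w+|\*):")
# Assumption: only these upper-case operators are of interest.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def describe_query(query):
    """Return the field prefixes and Boolean operators found in a query string."""
    fields = FIELD_PREFIX.findall(query)
    operators = [token for token in query.split() if token in BOOLEAN_OPERATORS]
    return {"fields": fields, "operators": operators}


rights_query = (
    "music AND europeana_rights:*creative* "
    "AND NOT europeana_rights:*nc* AND NOT europeana_rights:*nd*"
)
summary = describe_query(rights_query)
```

A heuristic of this kind would over-count in some cases (for example, a URL pasted into the search box would register `http` as a field), so manual checking of flagged queries would remain necessary.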

Both the Popular and Random Queries datasets exhibit long-tailed query frequency

distributions (Figures 6, 9), matching the profiles of query log datasets from previous

studies (see, e.g. Gabrilovich et al., 2009; Jansen et al., 2011). Operator usage is also very

low, which is consistent with wider literature (see, e.g. Gabrilovich et al., 2009; Jansen &

Pooch, 2001). Additionally, one random query (‘navarre + evreux’ = place names) employs

an operator that does not appear to be supported by Europeana (Europeana, [2012?]d),

implying that users may be unfamiliar with available search options. Other querying

characteristics nevertheless vary significantly between the datasets, illustrating the validity

of considering both frequent and random (or rare) query samples; when aggregated, rarer

queries represent a large volume of data (Gabrilovich et al., 2009; Jansen & Spink, 2006;

Silverstein et al., 1999).

It is not considered surprising that a far higher proportion of random versus frequent

queries are later classified as ‘Ambiguous or Unclear’ (Table 7, Section 4.4), since these are

generally rare queries, which “are…difficult to classify” (Gabrilovich et al., 2009: 20).

Indeed, over half of the examples in each Random Queries dataset only occur once.

Furthermore, whilst Ross and Wolfram (2000: 951) suggest that one-term queries

specifically are not actually very common, these comprise over one fifth of each random

dataset.
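
The two proportions discussed here (queries occurring only once, and one-term queries) could be computed from a query-frequency table as follows. The sketch assumes that a ‘term’ is a whitespace-delimited token, which may differ from the counting rule applied in the study; the function name `sample_profile` and the frequencies attached to the example queries are invented for illustration.

```python
def sample_profile(frequencies):
    """Return (share of queries occurring once, share of one-term queries)."""
    total = len(frequencies)
    singletons = sum(1 for count in frequencies.values() if count == 1)
    # Assumption: terms are separated by whitespace.
    one_term = sum(1 for query in frequencies if len(query.split()) == 1)
    return singletons / total, one_term / total


# Query texts echo examples quoted in this study; the frequencies are invented.
hypothetical_sample = {
    "fabeloj": 1,
    "school in 1800s": 1,
    "navarre + evreux": 1,
    "film": 4,
}
singleton_share, one_term_share = sample_profile(hypothetical_sample)
```

The resulting shares can then be compared directly with the thresholds mentioned above (over one half and over one fifth respectively).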

5.2 Classification Scheme Development

Classification scheme development is considered highly relevant for Europeana, since

classification schemes and taxonomies can support information-seeking through both

search and browse (see, e.g. Chaudhry & Jiun, 2005; Hearst, 2009). The final scheme

developed for this study has fifteen Primary Categories and is hierarchical, with three levels

(Table 6, Section 4.3.3). These aspects are broadly consistent with existing schemes,

including those for general Web SEs (see, e.g. Chuang & Chien, 2003b in Agosti et al., 2012:

671; Spink et al., 2002; Ross & Wolfram, 2000).

Gabrilovich et al. (2009: 14), who utilise a more detailed scheme, note that granularity must

reflect the classification purpose (see also: Section 3.1.2). In this study, greater complexity

is considered inappropriate for the intended scheme application, potentially reducing its

practical value for Europeana users. For this reason, Tertiary Categories like ‘City (Capital)’

and ‘Literature (Fiction)’ are not sub-divided into lower-level, Quaternary Categories, which

it is felt could overcomplicate the scheme. Indeed, although they focus primarily on results

presentation, Cousins et al. (2008: 13) suggest that “[e]ven academic researchers” prefer

relatively simple approaches to online information-seeking.

Form descriptors represent a particular problem, with queries like ‘music’ potentially

referring to either a subject or desired results format (see: Section 4.3.3). As an example,

Library of Congress schemes can support both alternatives, through the “LC Subject

Headings” and “LC Genre/Form Terms” (Library of Congress, [2012?]: n.p.). However, there

is insufficient contextual information to make this distinction for most queries in this study.

In the final scheme, queries with format-based ‘selections’ like ‘what:pdf’ are therefore

classified as ‘Object or Form Descriptors’, whilst non-‘selected’ queries receive subject-

based classifications. Even so, without related metadata like subsequent queries or session

information, these distinctions may sometimes be inaccurate.
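
The decision rule described in this paragraph can be stated compactly as a sketch. The set `FORMAT_TERMS` and the function name `primary_route` are hypothetical: only ‘pdf’ is taken from the example above, and a fuller list of format terms would have to be compiled from the data.

```python
# Hypothetical list of format terms; only 'pdf' comes from the example in the text.
FORMAT_TERMS = {"pdf"}


def primary_route(query):
    """Route a query to form-based or subject-based classification."""
    prefix, separator, value = query.partition(":")
    # A format-based 'selection' is assumed to look like 'what:<format term>'.
    is_what_selection = bool(separator) and prefix.strip().lower() == "what"
    if is_what_selection and value.strip().lower() in FORMAT_TERMS:
        return "Object or Form Descriptors"
    return "subject-based"
```

Under this rule, ‘what:pdf’ is routed to ‘Object or Form Descriptors’, whereas the free-text query ‘music’ is treated as a subject, which is where the potential inaccuracy noted above arises.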

Mapping between schemes, as shown in Appendices 2 and 4, is considered important for

improving the practical usefulness of this study’s query classification scheme for content

organisation and resource discovery (see, e.g. Walsh, 2011). It is also recognised that, while

ensuring an outcome tailored specifically to Europeana queries, the primarily inductive and

single-researcher approach adopted here for scheme development could lead to undue

representation of the researcher’s own assumptions and opinions (see, e.g. Jansen et al.,

2009; Kurth, 1993; Walsh, 2011). Mapping is therefore intended to help overcome this

limitation; similarly, relevant terms like ‘housing’ are drawn from queries themselves, thus

incorporating users’ own terminology (Chaudhry & Jiun, 2005; Li et al., 2012). Indeed,

Cousins et al. (2008: 137) argue that Europeana aims “to provide a user driven portal”, also

suggesting how “social tagging” (Cousins et al., 2008: 131) approaches can complement

traditional forms of information provision (see also: Bowler et al., 2011; Europeana,

[2011?]a; Europeana, [2012?]g).

As noted in Section 4.3.4, the classification scheme’s Primary Categories do not map well to

either ‘LCSH’ (Library of Congress, [2012?]) or the Web-based ‘Yahoo! Directory’ (Yahoo!

Inc., 2012). The latter was chosen to mirror Jansen et al.’s (2011: 491) query classification

using ‘Google Directory’. However, the organisation of Primary Categories appears specific

to Europeana. This is considered likely to reflect the portal’s wide topic range; Minelli et al.

(2007: 4), for example, find that “the characteristics of queries appeared to be influenced

by the subject domain” within cultural heritage, whilst Hargittai (2002: 1242) similarly

highlights “topic-specific search strategies”.

Nevertheless, the combination of Primary Categories emerging from Europeana query log

data and Secondary/Tertiary Categories mapped to ‘LCSH’ (Library of Congress, [2012?]) is

considered advantageous for potentially facilitating both general and more complex search

and browse. Walsh (2011: 331), for example, emphasises that “LCSH…enhances both

precision and recall when searching multiple digital collections at the same time”, making it

highly applicable to a portal like Europeana. However, this kind of scheme can be overly

complex for novice users, whilst domain specificity is an additional concern (see, e.g.

Chaudhry & Jiun, 2005; Kirchoff et al., 2008; Li et al., 2012; Walsh, 2011). Given Europeana’s

diverse user groups (see, e.g. Europeana, [2012?]c; Nicholas & Clark, 2012), the study

scheme aims to enable detailed description through mapping and combining categories,

rather than the creation of complicated headings, particularly since users may not have

precise or well-expressed information requirements (Cousins et al., 2008; Walsh, 2011). The

scheme therefore draws on both library/archive (e.g. terminology) and general Web-based

(e.g. broad top-level headings) classification practices (see, e.g. Jansen et al., 2011; Walsh,

2011).

Classification of the relatively complex query ‘architecture drawing of st paul’s, deptford’,

for example, would involve multiple relatively broad categories, thus allowing different

points of access: ‘Arts and Design – Genres – Painting, Drawing and Illustration’; ‘Place,

Civilisation and Travel – Countries and Settlements – City (Capital)’; ‘Philosophy,

Mythology and Religion – Religion – Religious Buildings, Locations and Communities’ and

‘Architecture, Buildings and Structures – Castles, Palaces, Religious Buildings and

Monuments’.
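
The way in which several broad category paths yield multiple points of access for a single query can be illustrated with a simple data structure. In the sketch below, the category paths are those listed above, whereas the representation of each path as a tuple and the function name `access_points` are assumptions introduced for illustration.

```python
from collections import defaultdict

# One query mapped to several hierarchical category paths (Primary first).
classification = {
    "architecture drawing of st paul's, deptford": [
        ("Arts and Design", "Genres", "Painting, Drawing and Illustration"),
        ("Place, Civilisation and Travel", "Countries and Settlements", "City (Capital)"),
        ("Philosophy, Mythology and Religion", "Religion",
         "Religious Buildings, Locations and Communities"),
        ("Architecture, Buildings and Structures",
         "Castles, Palaces, Religious Buildings and Monuments"),
    ],
}


def access_points(classified):
    """Index queries under every level of each category path."""
    index = defaultdict(set)
    for query, paths in classified.items():
        for path in paths:
            # Each prefix of a path is treated as a browsable node.
            for depth in range(1, len(path) + 1):
                index[path[:depth]].add(query)
    return index


index = access_points(classification)
primary_nodes = sorted(node[0] for node in index if len(node) == 1)
```

Here the query becomes reachable from four different Primary Categories without any single heading needing to describe it in full.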

Classification of Media Types dataset queries partly tests the degree of saturation reached

during classification scheme development (Section 4.5). Overall, the scheme works well and

all queries can be classified; the structure appears appropriate and it is felt that saturation

has been reached for the Primary and Secondary Categories. However, additional Tertiary

Categories could provide useful further descriptive power, such as expansion of ‘Music,

Film and Theatre’ to include ‘Classical Music’ and ‘Religious Music’ (e.g. for the query

‘cantor’) and expansion of ‘Sciences’ to include ‘Psychology’ (e.g. for the query ‘mania’).

Future research could therefore focus on expanding the scheme’s range of Tertiary

Categories; nevertheless, manual query classification may be insufficient for achieving this

given the difficulty of classifying short and often ambiguous queries, as noted above

(Gabrilovich et al., 2009: 20).

Several authors highlight the popularity of place-related queries in Web SE studies (see, e.g.

Jansen et al., 2011; Ross & Wolfram, 2000; Spink et al., 2002). This study supports these

findings: ‘Place, Civilisation and Travel’ is either the first or second most popular category

for all study datasets classified (Sections 4.4, 4.5). Levene (2006: 243) similarly notes that

location-based queries may be longer than other types; the results of this study suggest

that this could result from the incorporation of place into multi-topic queries like ‘l'islam en

france’ (=‘Islam in France’ in French). In addition, Ross and Wolfram (2000: 954) find that

common Web topic categories can include “Pictures” and “Organizations”, which is

corroborated here (see, e.g. Figure 4, Table 8), although datasets in this study differ in their

lack of “Sexuality” queries (Ross & Wolfram, 2000: 954).

Europeana querying patterns therefore exhibit both similarities and differences vis-à-vis

general Web querying. This probably reflects the overall “cross-domain” (Purday, 2009:

919) character of the service combined with its primary heritage focus (Europeana,

[2012?]b).

Chapter 6: Conclusion

6.1 Fulfilment of Study Objectives (Section 1.3.2)

Chapter 2 outlines background literature concerning Web-based information-seeking and

cultural heritage information, whilst Chapter 3 provides additional practical and theoretical

context for query log analysis, therefore fulfilling Objective 1.

Query refinement through search filters is investigated in Chapter 4 for a Filters Dataset

(Section 4.1) and – focusing on the most popular filter – a Media Types Dataset (Section

4.5). Query ‘selections’ are also considered across the different study datasets, giving a

more nuanced view of query specification and refinement in Europeana and thus also

contributing to the fulfilment of Objective 2.

In Chapters 3 and 4, a Popular Queries dataset (150 queries) and two Random Queries

datasets (each 150 queries) are analysed, including subject-based query classification and

classification scheme development. The final classification scheme is given in Table 6,

Section 4.3.3. Dataset characteristics, including classification patterns, are compared and

contrasted in Chapters 4 and 5. Although saturation is not entirely reached for the

scheme’s Tertiary Categories, its overall structure and Primary/Secondary Categories are

considered largely complete based on classification of queries from the Media Types

Dataset (Section 5.2), with sufficient flexibility to easily incorporate new Tertiary Categories

as required.

It is therefore felt that study Objectives 3, 4 and 5 have also been completed successfully.

Based on the data shown in Tables 7-9 and Figure 14, it is also concluded that the study

hypothesis of significant difference between classification profiles of frequent and random

queries, plus those filtered by different media types, is supported; even so, the largest

differences in both cases appear to rest primarily on a small number of classification

categories.

The practical application of results presented in this study for online cultural heritage

information provision is explored below (Section 6.2), thus considering the final study

Objective 6.

6.2 Recommendations for Cultural Heritage Information Provision

Agosti et al. (2012: 671), focusing on Web portals, state that log analysis can be applied “to

improve…structure and presentation”. Several recommendations for providing Web-based

cultural heritage information have arisen from this study, primarily concerning Europeana

but with relevance for other information providers.

Concerning content organisation, the predominance of subject-based searching (Section

3.2.2) indicates that subject should be incorporated systematically into object metadata,

both facilitating search and ensuring that content structuring reflects primary information

usage. Classification scheme development and mapping (Sections 4.3.3, 4.3.4) suggest that,

for Europeana, the combination of a tailored classification scheme and incorporation of

aspects from existing schemes to enable interoperability is likely to be most effective, an

approach noted by Walsh (2011: 333-334).

Information presentation and search functionalities are additional concerns, with several

authors emphasising the importance of interface design (see, e.g. Hearst, 2009; Jansen et

al., 2007; Levene, 2006). Europeana’s simple “search box” (Purday, 2009: 926), drawing

primarily on general Web models, is felt to be very effective and filtering options appear

well-used (Section 4.1.2) (see also: Nicholas & Clark, 2012). However, Meyer et al. (2007:

401) highlight the importance of user awareness of search functionalities (see also: Cooper,

2001). In this study, many ‘selections’ amongst the frequent queries appear to indicate

browsing behaviour via Europeana’s ‘Explore’ option (Europeana, [2012?]e), the

prominence or visibility of which could therefore potentially be expanded compared to

other assistance options (Figure 1).

It is also felt that ‘Subject’ should be available as a main filtering option, perhaps based on a

scheme like this study’s Primary Categories (Table 6, Section 4.3.3), to further facilitate

browsing, complement existing specification options (see, e.g. Europeana, [2012?]d) and

allow users – particularly novice searchers – to easily resolve subject-ambiguous queries.

Concordia et al. (2010: 67) suggest the alternative approach of “contextual grouping of

results sets” in Europeana, locating this kind of information-seeking assistance at the

results rather than search stage (cf. Europeana, [2011?]b: 16).

6.3 Study Limitations and Areas for Future Research

This study has two primary limitations, centred on query classification and the presence of

query ‘selections’:

1. Query classification: focusing on query text and frequencies without further

contextual information from the query log or users made accurate classification

difficult and complicated the investigation of aspects like query language. Although

this study has aimed to overcome these difficulties by adopting consistent

approaches to classification, it could be enhanced by support from alternative data

sources (see: Section 3.1.1) and/or analysis of a larger query log dataset from a

longer time period.

2. Query ‘selections’: prefixes appearing in log data, defined here as query ‘selections’

(see: Section 4.1.1), were often unclear, with the difficulty of distinguishing search

and browse in particular increasing the complexity of drawing meaningful

conclusions from the study data. It would have been useful to ascertain from

Europeana staff exactly how ‘selections’ arise in the data, rather than relying on the

researcher’s potentially limited experimentation and personal experience of the

service.

The study could be extended in several different ways. For example, it could be repeated on

a broader scale by incorporating data like session information, interviews or surveys for

different user groups (see, e.g. Hearst, 2009; Jansen et al., 2007; Park et al., 2005).

Alternatively, a narrower study could focus solely on classifying initial queries without

query ‘selections’ or refinements, or consider aspects explored briefly here in more detail

(e.g. query languages or other filtering options). It would also be interesting to assess the

transferability of the study classification scheme beyond the Europeana case study to

queries from other cultural heritage organisations, or fields outside the cultural heritage

sphere.

WORD COUNT: 14,069

Programme of Study INFT03 MA Librarianship

Module Code INF6000 Dissertation

Student Registration Number 110134998

References

Agosti, M., Crivellari, F. & Di Nunzio, G. (2012). “Web log analysis: a review of a decade of

studies about information acquisition, inspection and interpretation of user interaction”.

Data Mining and Knowledge Discovery [Online], 24(3), 663-696.

http://www.springerlink.com/content/t36px9850w1u3877/?MUD=MP [Accessed 26 June

2012].

Bowler, L., Koshman, S., Oh, J.S., He, D., Callery, B.G., Bowker, G. & Cox, R.J. (2011). “Issues

in User-Centered Design in LIS”. Library Trends [Online], 59(4), 721-752.

http://dx.doi.org/10.1353/lib.2011.0013 [Accessed 10 June 2012].

Broder, A. (2002). “A taxonomy of web search”. ACM SIGIR Forum [Online], 36(2), 3-10.

http://dl.acm.org/citation.cfm?doid=792550.792552 [Accessed 7 June 2012].

Chaudhry, A.S. & Jiun, T.P. (2005). “Enhancing access to digital information resources on

heritage: A case of development of a taxonomy at the Integrated Museum and Archives

System in Singapore”. Journal of Documentation [Online], 61(6), 751-776.

http://dx.doi.org/10.1108/00220410510632077 [Accessed 25 February 2012].

CIBER Research Ltd. (2011). Europeana: Culture on the go [Online]. [Newbury, England:

CIBER Research Ltd.?]. Available from:

http://www.pro.europeana.eu/documents/858566/858665/Culture+on+the+Go [Accessed

5 June 2012].

Clough, P. (2009). Deliverable 4.1. TrebleCLEF Query Log Analysis Workshop Report [Online].

[Pisa: TrebleCLEF?]. Available from: http://ir.shef.ac.uk/cloughie/qlaw2009/D4.1-final.pdf

[Checked 17 March 2012].

Concordia, C., Gradmann, S. & Siebinga, S. (2010). “Not just another portal, not just another

digital library: A portrait of Europeana as an application program interface”. IFLA Journal

[Online], 36(1), 61-69. http://ifla.sagepub.com/content/36/1/61 [Accessed 14 June 2012].

Cooper, A. (2008). “A survey of query log privacy-enhancing techniques from a policy

perspective”. ACM Transactions on the Web (TWEB) [Online], 2(4), Article 19: 1-27.

http://dx.doi.org/10.1145/1409220.1409222 [Accessed 14 February 2012].

Cooper, M.D. (2001). “Usage patterns of a web-based library catalog”. Journal of the

American Society for Information Science and Technology [Online], 52(2), 137-148. Available

from: http://onlinelibrary.wiley.com [Accessed 8 June 2012].

Cousins, J., Chambers, S. & van der Meulen, E. (2008). “Uncovering cultural heritage

through collaboration”. International Journal on Digital Libraries [Online], 9(2), 125-138.

http://www.springerlink.com/content/q1067151j02j248n/fulltext.pdf [Accessed 15 August

2012].

Eirinaki, M. & Vazirgiannis, M. (2003). “Web mining for web personalization”. ACM

Transactions on Internet Technology (TOIT) [Online], 3(1), 1-27.


Eldredge, J.D. (2004). “Inventory of research methods for librarianship and informatics”.

Journal of the Medical Library Association [Online], 92(1), 83-90. Available from:

http://web.ebscohost.com [Accessed 7 February 2012].

Europeana. (2010). The Europeana Public Domain Charter [Online]. The Hague:

Europeana.eu c/o the Koninklijke Bibliotheek. Available from:

http://pro.europeana.eu/web/guest/publications [Accessed 6 June 2012].

Europeana. [2011?]a. Strategic Plan 2011-2015 [Online]. [The Hague: Europeana.eu c/o the

Koninklijke Bibliotheek?]. Available from: http://pro.europeana.eu/web/guest/publications

[Accessed 5 June 2012].

Europeana. [2011?]b. Business Plan 2012 [Online]. [The Hague: Europeana.eu c/o the

Koninklijke Bibliotheek?]. Available from: http://pro.europeana.eu/web/guest/publications


Europeana. (2012). faust-Europeana-Search results [Online]. The Hague: Europeana.eu c/o

the Koninklijke Bibliotheek. http://www.europeana.eu/portal/search.html?query=faust

[Accessed 31 August 2012].

Europeana. [2012?]a. Europeana – Homepage [Online]. The Hague: Europeana.eu c/o the

Koninklijke Bibliotheek. http://www.europeana.eu/portal/ [Accessed 27 February 2012].

Europeana. [2012?]b. The Europeana Foundation [Online]. The Hague: Europeana.eu c/o

the Koninklijke Bibliotheek. http://pro.europeana.eu/web/guest/foundation [Accessed 15

August 2012].

Europeana. [2012?]c. Facts and Figures [Online]. The Hague: Europeana.eu c/o the

Koninklijke Bibliotheek. http://pro.europeana.eu/web/guest/about/facts-figures [Accessed

16 May 2012].

Europeana. [2012?]d. Searching Europeana [Online]. The Hague: Europeana.eu c/o the

Koninklijke Bibliotheek. http://www.europeana.eu/portal/usingeuropeana_search.html


Europeana. [2012?]e. Exploring Europeana [Online]. The Hague: Europeana.eu c/o the

Koninklijke Bibliotheek. http://www.europeana.eu/portal/usingeuropeana_explore.html


Europeana. [2012?]f. Results in Europeana [Online]. The Hague: Europeana.eu c/o the

Koninklijke Bibliotheek. http://www.europeana.eu/portal/usingeuropeana_results.html


Europeana. [2012?]g. Using My Europeana [Online]. The Hague: Europeana.eu c/o the

Koninklijke Bibliotheek.

http://www.europeana.eu/portal/usingeuropeana_myeuropeana.html [Accessed 16

August 2012].

Gabrilovich, E., Broder, A., Fontoura, M., Joshi, A., Josifovski, V., Riedel, L. & Zhang, T.

(2009). “Classifying Search Queries Using the Web as a Source of Knowledge”. ACM

Transactions on The Web [Online], 3(2, Article 5), 1-28.

http://doi.acm.org/10.1145/1513876.1513877 [Accessed 27 June 2012].

Glaser, B.G. & Strauss, A.L. (1967). The Discovery of Grounded Theory: Strategies for

Qualitative Research. New York: Aldine de Gruyter (A Division of Walter de Gruyter, Inc.).

Google. [2012?]. Google Translate [Online]. [Mountain View, CA: Google Inc.?].

http://translate.google.com/ [Accessed 9 August 2012].

Hargittai, E. (2002). “Beyond logs and surveys: In-depth measures of people’s web use

skills”. Journal of the American Society for Information Science and Technology [Online],

53(14), 1239-1244. http://onlinelibrary.wiley.com/doi/10.1002/asi.10166/pdf [Accessed 10

June 2012].

Hearst, M.A. (2009). Search User Interfaces. Cambridge: Cambridge University Press.

Hider, P. & Pymm, B. (2008). “Empirical research methods reported in high-profile LIS

journal literature”. Library & Information Science Research [Online], 30(2), 108-114.

http://dx.doi.org/10.1016/j.lisr.2007.11.007 [Accessed 7 February 2012].

Jansen, B.J. (2006). “Search log analysis: What it is, what’s been done, how to do it”. Library

& Information Science Research [Online], 28(3), 407-432.

http://www.sciencedirect.com/science/article/pii/S0740818806000673 [Accessed 8 June

2012].

Jansen, B.J. (2009). Understanding User-Web Interactions via Web Analytics [Synthesis

Lectures on Information Concepts, Retrieval, and Services #6 (Series ed. Gary Marchionini)].

[San Rafael, California?]: Morgan & Claypool Publishers.

Jansen, B.J. & Pooch, U. (2001). “A review of Web searching studies and a framework for

future research”. Journal of the American Society for Information Science and Technology

[Online], 52(3), 235-246. Available from: http://onlinelibrary.wiley.com [Accessed 9 June

2012].

Jansen, B.J. & Spink, A. (2006). “How are we searching the World Wide Web? A comparison

of nine search engine transaction logs”. Information Processing and Management [Online],

42(1), 248-263. http://www.sciencedirect.com/science/article/pii/S0306457304001396


Jansen, B.J., Spink, A., Blakely, C. & Koshman, S. (2007). “Defining a Session on Web Search

Engines”. Journal of the American Society for Information Science and Technology [Online],

58(6), 862-871. http://onlinelibrary.wiley.com/doi/10.1002/asi.20564/pdf [Accessed 14

February 2012].

Jansen, B.J., Taksa, I. & Spink, A. (2009). “Chapter I. Research and Methodological

Foundations of Transaction Log Analysis”. In: Jansen, B.J., Spink, A. & Taksa, I. (eds.). Handbook of

Research on Web Log Analysis. pp. 1-16. Hershey, PA: Information Science Reference (An

imprint of IGI Global).

Jansen, B.J., Liu, Z., Weaver, C., Campbell, G. & Gregg, M. (2011). “Real time search on the

web: Queries, topics, and economic value”. Information Processing and Management

[Online], 47(4), 491-506.

http://www.sciencedirect.com/science/article/pii/S0306457311000082 [Accessed 27 June

2012].

Kirchhoff, T., Schweibenz, W. & Sieglerschmidt, J. (2008). “Archives, libraries, museums and

the spell of ubiquitous knowledge”. Archival Science [Online], 8(4), 251-266.

http://dx.doi.org/10.1007/s10502-009-9093-2 [Accessed 10 June 2012].

Kurth, M. (1993). “The limits and limitations of transaction log analysis”. Library Hi Tech

[Online], 11(2), 98-104.

http://www.emeraldinsight.com/journals.htm?issn=0737-8831&volume=11&issue=2&articleid=1676224&show=pdf

[Accessed 10 June 2012].

Levene, M. (2006). An Introduction to Search Engines and Web Navigation. Harlow:

Addison-Wesley (An imprint of Pearson Education Limited).

Li, L., Zhong, L., Xu, G. & Kitsuregawa, M. (2012). “A feature-free search query classification

approach using semantic distance”. Expert Systems with Applications [Online], 39(12),

10739-10748. http://www.sciencedirect.com/science/article/pii/S0957417412004642


Library of Congress. (2010). Codes for the Representation of Names of Languages [Online].

Washington, DC: The Library of Congress.

http://www.loc.gov/standards/iso639-2/php/code_list.php [Accessed 15 August 2012].

Library of Congress. [2012?]. Library of Congress Subject Headings [Online]. Washington,

DC: The Library of Congress. http://id.loc.gov/authorities/subjects.html [Accessed 30 July

2012].

Meyer, É., Grussenmeyer, P., Perrin, J.-P., Durand, A. & Drap, P. (2007). “A web information

system for the management and the dissemination of Cultural Heritage data”. Journal of

Cultural Heritage [Online], 8(4), 396-411. http://dx.doi.org/10.1016/j.culher.2007.07.003

[Accessed 28 March 2012].

Minelli, S.H., Marlow, J., Clough, P., Cigarran Recuero, J.M., Gonzalo, J., Oomen, J. &

Loschiavo, D. (2007). “Gathering requirements for multilingual search of audiovisual

material in cultural heritage”. In: Proceedings of Workshop on User Centricity – state of the

art (16th IST Mobile and Wireless Communications Summit). Budapest, Hungary, 1-5 July

2007. 5pp. [Brussels: Information Society Technologies?]. Available from:

http://ir.shef.ac.uk/cloughie/papers/mobilesummit2007-minelli.pdf [Checked 1 March

2012].

Nicholas, D. & Clark, D. (2012). “Evidence of user behaviour: deep log analysis”. In: Dobreva,

M., O’Dwyer, A. & Feliciati, P. (eds.). User Studies for Digital Library Development. pp. 85-

94. London: Facet Publishing.

Ott, M. & Pozzi, F. (2011). “Towards a new era for Cultural Heritage Education: Discussing

the role of ICT”. Computers in Human Behavior [Online], 27(4), 1365-1371.

http://dx.doi.org/10.1016/j.chb.2010.07.031 [Accessed 27 March 2012].

Park, S., Lee, J.H. & Bae, H.J. (2005). “End user searching: A Web log analysis of NAVER, a

Korean Web search engine”. Library & Information Science Research [Online], 27(2), 203-

221. http://dx.doi.org/10.1016/j.lisr.2005.01.013 [Accessed 27 March 2012].

Politou, E.A., Pavlidis, G.P. & Chamzas, C. (2004). “JPEG2000 and dissemination of cultural

heritage over the Internet”. IEEE Transactions on Image Processing [Online], 13(3), 293-301.

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=1278354


Purday, J. (2009). “Think culture: Europeana.eu from concept to construction”. The

Electronic Library [Online], 27(6), 919-937.

http://www.emeraldinsight.com/journals.htm?articleid=1827227&show=abstract

[Accessed 21 April 2012].

Purday, J. (2010). “Intellectual Property Issues and Europeana, Europe’s Digital Library,

Museum and Archive”. Legal Information Management [Online], 10(3), 174-180.

http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=7891425

[Accessed 21 April 2012].

Ross, N.C.M. & Wolfram, D. (2000). “End User Searching on the Internet: An Analysis of

Term Pair Topics Submitted to the Excite Search Engine”. Journal of the American Society

for Information Science [Online], 51(10), 949-958. Available from:

http://onlinelibrary.wiley.com [Accessed 9 June 2012].

Silverstein, C., Marais, H., Henzinger, M. & Moricz, M. (1999). “Analysis of a very large web

search engine query log”. ACM SIGIR Forum [Online], 33(1), 6-12.


Spink, A., Ozmutlu, S., Ozmutlu, H.C. & Jansen, B.J. (2002). “U.S. versus European web

searching trends”. ACM SIGIR Forum [Online], 36(2), 32-38.

http://dx.doi.org/10.1145/792550.792555 [Accessed 28 March 2012].

Voorbij, H. (2010). “The use of web statistics in cultural heritage institutions”. Performance

Measurement and Metrics [Online], 11(3), 266-279.

http://dx.doi.org/10.1108/14678041011098541 [Accessed 10 June 2012].

Walsh, J. (2011). “The use of Library of Congress Subject Headings in digital collections”.

Library Review [Online], 60(4), 328-343.

http://www.emeraldinsight.com/journals.htm?articleid=1923626&show=abstract


Wikipedia: The Free Encyclopedia. [2012?]. Wikipedia [Online]. San Francisco, CA:

Wikimedia Foundation Inc. http://www.wikipedia.org/ [Accessed 9 August 2012].

Yahoo! Inc. (2012). Yahoo! Directory [Online]. Sunnyvale, CA: Yahoo! Inc.

http://dir.yahoo.com/ [Accessed 22 August 2012].

Appendix

Appendix 1: A Summary of the Classification Scheme Developed for

Filters Dataset ‘text’ Query Refinements

Primary Categories Secondary Categories Tertiary Categories

Philosophy, Mythology and

Religion

Philosophy

Mythology

Religion

Ideas and Concepts

Ancient Greece and Rome

Abrahamic Religions

Place and Civilisation Country or Settlement

Region


City (Capital)

City (Other)

Town or Village

Society and Current Affairs Royalty and Nobility

Politics

Crime

Organisations and Events

Historical Figures

Journalism

Organisation or Institution

Named Event

Military and Military History Subjects

People

The Arts Visual Arts and Theatre

Music

People

People

Poetry and Literature People (Creators)

People (Fictional)

The Sciences People Historical Figures

Generic Subjects Place

Time

Person

Object or Form Descriptors Visual Arts (2D)

Object, Textile and Sculpture

Architecture

Written Word

Drawing and Painting

Photography

Stamps

Textile

Pottery and Ceramic

Ambiguous or Unclear Name

Computing or Search

Functionality

Other

Forename

Surname

Appendix 2: Results of Exploratory Mapping between Preliminary

Classification ‘Primary Category’ Terms and Library of Congress

Subject Headings (Library of Congress, [2012?])

Highlighted descriptors appear in search results for multiple preliminary classification

Primary Categories.

Preliminary Classification: Primary Categories | Potential Mapping: ‘Library of Congress Subject Headings’. Quoted from searches conducted at: Library of Congress ([2012?]: n.p.)

Philosophy, Mythology and Religion | “Philosophy”; “Philosophy, Ancient”; “Idea (Philosophy)”; “Philosophy, Modern”; “Philosophy and religion”; “Philosophy and social sciences”; “Philosophy in literature”; “Philosophy and civilization”; “Ethics”; “Mythology”; “Mythology, Classical”; “Mythology in literature”; “Religion”; “Theology”; “Religion and sociology”; “Religion and politics”; “Religion and civilization”; “Religion and culture”; “Religion and literature”; “Religion and religious literature”; “Philosophy and religion”; “Philosophy and religion in literature”; “Religions”

Place and Civilisation | “Philosophy and civilization”; “Place (Philosophy)”; “Names, Geographical”; “Civilization”; “Civilization in literature”

Society and Current Affairs | “War and society”; “Civil society”; “Politics and government”; “Politics and culture”; “Press and politics”; “Political activity”; “World politics”; “Mass media and world politics”; “Popular culture”; “Political science”

Military and Military History | “War and society”; “Military history”; “Military missions”; “Military paraphernalia”; “Military art and science”; “Military history, modern”; “Soldiers”; “Military campaigns”; “Military policy”; “Military life”; “Military history in literature”; “Combat”; “Armed Forces”

The Arts | “Arts”; “Arts and society”; “Decorative arts”; “Arts, Ancient”; “Arts, Modern”; “Arts, Classical”; “Arts and history”; “Graphic arts”; “Arts and society in literature”; “Arts in literature”; “Art”; “Performing arts and literature”; “Art criticism”; “Art and state”

The Sciences | “Science”; “Science and civilization”; “Science and state”; “Literature and science”; “Science and industry”; “Science and the arts”; “Historians of science”; “Philosophy and science”; “Science in literature”; “Science and civilization in literature”; “Research”; “Technology”; “Scientific literature”; “Natural history”; “Discoveries in science”

Generic Subjects | n/a

Object/Form Descriptors | Potential mapping to Library of Congress LC Genre/Form Terms (Library of Congress, [2012?]: n.p.).

Ambiguous or Unclear | “Anonymous persons”; “Anonymous writings”; “Anonymous art”; “Names, Personal”; “Human-computer interaction”

Appendix 3: A Summary of the Study Classification Scheme Following

Refinement based on Popular Queries Dataset Analysis

Primary Categories Secondary Categories Tertiary Categories

Philosophy, Mythology and

Religion

Philosophy

Mythology

Religion

Ideas and Concepts

Named Philosophers

Classical: Ancient Greece,

Rome and Egypt


Iconography

Religious Buildings

Historical Figures (Biblical)

Named Religions

Place and Civilisation Geographical Area

Countries and Settlements


Region (multi-country)

City (Capital)

City (Other)

Municipality, Town or

Village

Region (single country)

Island

Historical Place Names

Politics and Society Political Figures

Popular Culture

Politicians/Political Leaders


Historical Figures

Fashion

Journalism

Entertainment and Events

Local Government and

Facilities

Crime

International Relations

Local Amenities

Healthcare


Military and Military History Military Figures

Military Engagements

Military Objects

Military Leaders

Military Personnel

Prisoners of War

World War One

World War Two

Tactics and Strategy

Treaties and Agreements

The Arts Artists, Authors and Creators

Artistic Genres

Named Works

Painters and Illustrators

Photographers

Authors and Poets

Actors and Actresses

Architects

Designers

Visual Arts (2D)

Film and Theatre

Fashion and Design

Architecture

Music

Poetry and Literature

Other

Paintings

The Sciences Scientists and Scientific

Figures

Historical Figures

Scientific Genres

Architecture

Natural History

Archaeology

Business and Industry Named Figures

Genres

Named Companies or

Organisations

Engineering

Healthcare

Advertising

Patents

Collections, Organisations

and Institutions

Libraries and Archives

Museums and Galleries

Other Collections

Portals and Aggregators

Geographical Designations

Physical Collections

Online Collections

Europeana Collections

Generic Subjects Place

Time

Person

Object

Other

Object or Form Descriptors Visual Art Formats (2D) Drawing, Painting and

Visual Art Formats (3D)

Audio or Moving Image

Results Formats

Illustration

Printing

Design

Photography

Maps and Surveys

Stamps, Postcards and

Bookplates

Architecture

Design

Textiles

Pottery, Ceramic and

Glassware

Sculpture

Other Objects

Film

Query Selections

Ambiguous or Unclear Personal Names

Place Names

Computing Functionality or

Search Feature

Other

Forename

Surname

Appendix 4: Results of Potential Mapping between Final

Classification Scheme Primary, Secondary and Tertiary Category

terms and Existing Schemes

Final Classification: Primary Categories | Potential Mapping: Yahoo! Directory. Quoted from: Yahoo! Inc. (2012: n.p.)

Philosophy, Mythology and Religion | (“Arts & Humanities”); “Social Science”; “Society & Culture”

Place, Civilisation and Travel | “Recreation & Sports”; “Regional”

Politics and Society | “Education”; “Government”; “Health”; “News & Media”; “Society & Culture”

Military and Military History | (“Government”)

Lifestyle and Entertainment | (“Business & Economy”); “Computer & Internet”; “Entertainment”; “Recreation & Sports”

Arts and Design | “Arts & Humanities”

Literature and Poetry | “Arts & Humanities”

Music, Film and Theatre | “Arts & Humanities”; “Entertainment”

Architecture, Buildings and Structures | (No clear top-level category)

Sciences | “Science”; “Social Science”

Business and Industry | “Business & Economy”

Generic Subjects | n/a

Collections, Organisations and Institutions | “Arts & Humanities”

Object or Form Descriptors | n/a

Ambiguous or Unclear | n/a

Final Classification: Secondary Categories | Potential Mapping: ‘Library of Congress Subject Headings’. Quoted from searches conducted at: Library of Congress ([2012?]: n.p.) | Final Classification: Tertiary Categories | Potential Mapping: ‘Library of Congress Subject Headings’. Quoted from searches conducted at: Library of Congress ([2012?]: n.p.)

Philosophy,

Mythology and

Religion:

Philosophy

Mythology

Religion

“Philosophy”

“Mythology”

“Religion”

Named Figures

Ideas and Concepts


Legends

Classical Philosophy,

Mythology and

Religion

Theology and

Religious History

Named Religions

“Philosophers”

“Philosophers, Modern”

“Philosophers, Ancient”

“Ethics”

“Idea (Philosophy)”

“Tales”

“Fairy tales”

“Folklore”

“Legends”

“Mythology, Classical”

“Philosophy, Ancient”

“Theology”

“Religious History”

“Religions”

and Religious

Groups

Named Figures:

Ministers and

Officials

Named Figures:

Religious Texts

Festivals and

Ceremonies

Iconography and

Objects

Religious Buildings,

Locations and

Communities

“Religious institutions”

“Clergy”

“Associate clergy”

“Church musicians”

(n/a: too broad)

“Fasts and feasts”

“Rites and ceremonies”

“Idols and images”

“Religious articles”

“Religious facilities”

“Religious communities”

Place,

Civilisation and

Travel:

Geographical

Features or

Regions

Countries and

Settlements

“Geography”

“Physical

geography”

“Environmental

geography”

“Names,

Geographical”

“Human

settlements”

Country

City: Capital

(n/a: not an over-arching

subject)

“Capitals (Cities)”

Travel

Civilisation or

Culture

“Travel”

“Civilization”

“Culture”

City: Other

Municipality, Town

or Village

Specified Address

Island (Inhabited)

Region or

Administrative

Region

Maps and Travel

Guides

Languages

Historical Place

Names

Ancient and

Classical Civilisation

and Culture

“Cities and towns”

“Cities and towns”

“Villages”

“Street addresses”

“Islands”

“Regions”

“Regions (Administrative

and political divisions)”

“Maps”

“Tourist maps”

“Atlases”

“Guidebooks”

“Languages”

“Languages, Modern”

“Language and languages”

“Historic sites” (?)

“Civilization, Ancient”

“Civilization, Classical”

Politics and

Society:

Named Figures

(n/a: too broad)

Political Leaders and

Politicians

“Politicians”

News

Law and Crime

Amenities and

Facilities

History and

Social Change

“Foreign news”

“Press”

“Press coverage”

“Law”

“Crime”

(n/a: too broad)

“History”

“Social history”

“Social sciences and

history”

“Social change”


Named Newspapers

Journalism

History of Crime

Copyright

Housing

Hospitals and

Healthcare

Libraries

Schools and

Education

“Kings and rulers”

“Queens”

“Princesses”

“Princes”

“Royal houses”

“Nobility”

“Newspapers”

“Journalism”

“Crime—History”

“Copyright”

“Housing”

“Health facilities”

“Health facilities,

Proprietary”

“Hospitals”

“Hospitals, Proprietary”

“Libraries”

“Public services (Libraries)”

“Education”

“School facilities”

Organisations

and Societies

Civil

Ceremonies and

Events

International

Relations

“Societies, etc.”

“Associations,

institutions, etc.”

“Fraternal

organizations”

“Societies and

clubs”

“Societies”

“Clubs”

(n/a: no clear

equivalents)

“International

relations”

“Non-state actors

(International

relations)”

Marriage


“Marriage”

“Treaties”

“International obligations”

Military and

Military

History:

Named Figures

Military

Engagements

(n/a: too broad)

“Wars”

“Combat”

Military Leaders and

Personnel

Prisoners of War

Historical Figures

Strategy and Tactics

“Soldiers”

“Veterans”

“Prisoners of war”

“Ex-prisoners of war”

(n/a: no clear equivalents)

“Strategy”

“Defensive (Military

Procedure and

Discipline

Military Objects

“Battles”

“Military missions”

“Military discipline”

“Military

paraphernalia”

Treaties and

Agreements

World Wars

Military Tribunals

Military Records

Buildings, Locations

and Bases

Transport

Weapons and

Equipment

science)”

“Offensive (Military

science)”

“Tactics”

“Treaties”

“Armistices”

“World War, 1914-1918”

“World War, 1939-1945”

“Military courts”

“Courts-martial and courts

of inquiry”

“Military administration”

“Military bases”

“Transportation, Military”

“Vehicles, Military”

“Weapons”

“Military supplies”

Lifestyle and

Entertainment:

Entertainment

and Events

“Entertainment

events”

Performances

Exhibitions

Arcades

“Performances”

“Exhibitions”

“Arcades”

Transport

Sport

Computing

Fashion and

Beauty

Advertising

“Transportation”

“Vehicles”

“Sports”

“Computer

systems”

“Computers”

“Fashion”

“Beauty, Personal”

“Advertising”

Road

Rail

Air

Other

Named Sports and

Sports Clubs

Sporting Events

Equipment

Social Media

“Transportation,

Automotive”

“Automobiles”

“Railroads”

“Railroad trains”

“Aeronautics, Commercial”

“Airplanes”

(n/a)

“Sports”

“Athletic clubs”

“Sports administration”

“Hosting of sporting

events”

“Sporting goods”

“Social media”

“Online social networks”

Arts and

Design:

Named Figures

(n/a: too broad)

Creators or

Designers

“Artists”

“Designers”

Named Works

or Subjects

Artistic Periods,

Styles or

Movements

Genres

“Titles of works of

art”

“Art--Themes,

motives”

“Art genres”

“Art movements”

“Art genres”

History of Art

Classical Art

Portrait

Landscape

Painting, Drawing

and Illustration

Engraving and

Printing

Photography

Stamps

Bookplates

Postcards

Ceramics, Enamel,

“Art and history”

(plus by location e.g. “Art,

Italian—History”)

“Art, Classical”

“Art objects, Classical”

“Portraits”

“Landscapes in art”

“Painting”

“Drawing”

“Pictorial works”

“Illustrations”

“Engraving”

“Printing”

“Photography”

“Photography, artistic”

“Postage stamps”

“Bookplates”

“Postcards”

“Ceramics”

Pottery and Glass

Sculpture and

Figurines

Fashion, Clothing

and Jewellery

Other

“Decorative arts”

“Pottery”

“Enamel and enameling”

“Glass art”

“Art glass”

“Glassware”

“Sculpture”

“Small sculpture”

“Figurines”

“Fashion”

“Clothing and dress”

“Costume”

“Jewelry”

(n/a)

Literature and

Poetry:

Named Figures

Named Works

or Subjects

Literary

Periods, Styles

or Movements

(n/a: too broad)

“Titles of books”

“Literary form”

“Literary

movements”

Authors and Editors

Publishers

Classical Literature

“Authors”

“Poets”

“Editors”

“Publishers and

publishing”

“Authors and publishers”

“Classical literature”

Genres

“Style, Literary”

“Literary form”

Poetry

Literature (Fiction)

Literature (Non-

Fiction)

Ephemera

“Poetry”

“Fiction”

“Non-fiction…”

“Printed ephemera”

Music, Film and

Theatre:

Named Figures

Named Works

or Subjects

Periods, Styles

or Movements

Instruments

and Equipment

(n/a: too broad)

(n/a: no clear

equivalents)

“Popular music

genres”

“Film genres”

“Stage props”

“Motion pictures--

Creators or

Composers

Performers

Other

Folk Music

Musical Instruments

“Composers”

“Screenwriters”

“Dramatists”

“Entertainers”

“Actors”

“Male actors”

“Actresses”

“Musicians”

(n/a)

“Folk music”

“Musical instruments”

Genres

Setting and

scenery”

“Theaters--Stage-

setting and

scenery”

“Popular music

genres”

“Film genres”

Music

Film

Theatre

“Music”

“Motion pictures”

“Motion pictures and

television”

“Performing arts”

“Theater”

“Drama”

Architecture,

Buildings and

Structures:

Named Figures

Architectural

Periods, Styles

or Movements

(n/a: too broad)

(n/a: individual

examples e.g.

“International style

(Architecture)”,

“Modern

movement

(Architecture)”)

Architects

Landscape

Architecture

Castles, Palaces,

Religious Buildings

and Monuments

Civic Buildings,

Housing and

Businesses

“Architects”

“Landscape architecture”

“Castles”

“Palaces”

“Monuments”

“Religious facilities”

“Public buildings”

“Housing”

“Business enterprises”

“Industrial buildings”

Engineering

Structures

“Structural engineering”

Sciences:

Named Figures

Genres

“Scientists”

“Classification of

sciences”

Historical Figures

Natural History and

Biology (Non-

Human)

Animal Husbandry

and Food Science

Human Biology and

Medicine

Archaeology

Anthropology

Geography and

Cartography

Physics and

Astronomy

Technology

(n/a: no clear equivalents)

“Natural history”

“Zoology”

“Botany”

“Animal culture”

“Livestock”

“Domestic animals”

“Food”

“Nutrition”

“Human biology”

“Medicine”

“Archaeology”

“Anthropology”

“Geography”

“Cartography”

“Physical Sciences”

“Physics”

“Astronomy”

“Technology”

Business and

Industry:

Named

Companies or

Manufactories

Named

Products and

Advertising

Named

Industries

“Business

enterprises”

“Business names”

“Corporations”

“Industrial

buildings”

“Factories”

“Commercial

products”

“Brand name

products”

“Advertising”

“Branding

(Marketing)”

“Industries”

Patents

Mining and

Resource Extraction

Construction and

Manufacturing

Industries

“Patents”

“Mineral industries”

“Construction industry”

“Manufacturing industries”

Generic

Subjects:

Person

Place

Object

Time

(n/a: too broad)

(n/a: too broad)

(n/a: too broad)

“Time”

Date

“Chronology, Historical”

“Days”

“Months”

Other

(n/a)

“Year”

Named

Collections:

Libraries and

Archives

Museums and

Galleries

Other

Collections

Portals and

Aggregators

Geographical

Designations

“Libraries”

“Archives”

“Museums”

“Art museums”

(n/a: too broad)

“Web portals”

“Federated

searching”

“Names,

Geographical”

Object or Form

Descriptors:

Europeana

Query

Selections:

Format

(n/a: this study

focusing on

Europeana

functionality)

Ambiguous or

Unclear:

Person

“Anonymous

persons”

“Names, Personal”

Place

Computing

Functionality or

Search Feature

Other

“Human-computer

interaction”

“Anonymous

writings”

“Anonymous art”
