Example queries for Federated search

Jan Odijk

CLARIN Federated Search Workshop

Copenhagen, 24 Apr 2013

• Linguistic Problem
• Federated Search
• Structural Differences
• Search in Lexicons
• Search in Corpora
• Corpora+Lexicon search
• Iterative Corpus Search
• Search in micro-comparative databases?

Overview

• Inflection of attributively used adjectives
• Influence of number, gender, case, definiteness (strong/weak inflection), other factors?
• Exceptions to the main rules, such as
– het bijvoeglijk(?e) naamwoord, lit. the adjectival noun, ‘the adjective’
– het medisch(*e) onderzoek ‘the medical research’
– de medisch(*e) onderzoeker ‘the medical researcher’
– een competent(e) linguïst
• In these cases the main rules predict the –e suffix as the only option
• The exceptions are not (all) arbitrary; there are further subregularities, and I want to find out what these are.
• There are similar phenomena in many languages (Germanic, Romance, e.g. Brazilian Portuguese, Menuzzi 1994, ..)

Linguistic Problem

• Odijk, J. (1992). Uninflected Adjectives in Dutch. In R. Bok-Bennema & R. van Hout, Linguistics in the Netherlands 1992, pp. 197-208. Amsterdam: Benjamins.

• Odijk, J. (2012). De structuur van Phrasal Names. Nederlandse Taalkunde, 17(2), 292-298.

• Odijk, J. (2013). Comparative Linguistic Research in the CLARIN Infrastructure. Presentation to be held at Patterns of Macro- and Micro-Diversity in the Languages of Europe and the Middle East. Computational Issues in Studying Language Diversity: Storage, Analysis and Inference, Groningen, July 2013.

Linguistic Problem

• I want to search
1. For relevant examples in large annotated text corpora
2. For related examples in large text corpora selected on the basis of the results of (1)
3. For relevant examples in microcomparative databases (e.g. MIMORE)
4. For properties of relevant words in dictionaries
5. For synonyms/hyperonyms/hyponyms in semantic lexical databases (e.g. CORNETTO)

Broaden empirical Base

• A set of resources R has been selected by the researcher on the basis of metadata. The resources in R can be in different locations.
• A Federated Search Engine (FSE) enables search in the resources in R
• For each resource r in R there is a local search engine LSE_r
• A query q_fs can be formulated in an agreed-upon query language (SRU/CQL), e.g. via a Federated Search web application
• q_fs uses ISOCAT data categories (DCs)
• q_fs is sent to the LSE_r for each r in R, and translated there into the query language needed for LSE_r and into the DCs used in r
• Each LSE_r yields a result set for q_fs, in which it translates the DCs used into ISOCAT DCs, and sends it to the FSE, which combines/aggregates the result sets and prepares them for presentation to / saving by the user, possibly via the Federated Search web application
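The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CLARIN implementation: the query q_fs is modelled as a plain dictionary rather than an SRU/CQL string, and every name used here (`Resource`, `local_search`, `federated_search`, the DC names and the toy records) is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical model: a resource r with a DC mapping and some local records.
@dataclass
class Resource:
    name: str
    to_local: dict[str, str]          # shared (ISOCAT-style) DC -> DC used in r
    records: list[dict[str, str]]     # local data, keyed by the DCs used in r

def local_search(r: Resource, q_fs: dict[str, str]) -> list[dict[str, str]]:
    """Stand-in for LSE_r: translate q_fs, search, translate the hits back."""
    # Translate the shared DCs in the query into the DCs used in r.
    q_local = {r.to_local[dc]: value for dc, value in q_fs.items()}
    hits = [rec for rec in r.records
            if all(rec.get(k) == v for k, v in q_local.items())]
    # Translate the local DCs in each hit back into the shared DCs.
    to_shared = {local: shared for shared, local in r.to_local.items()}
    return [{to_shared.get(k, k): v for k, v in hit.items()} for hit in hits]

def federated_search(resources: list[Resource], q_fs: dict[str, str]):
    """Stand-in for the FSE: send q_fs to every LSE_r and aggregate."""
    combined = []
    for r in resources:
        for hit in local_search(r, q_fs):
            combined.append({**hit, "resource": r.name})
    return combined

# Two toy resources that describe the same token with different DC names.
R = [
    Resource("corpus_a",
             {"wordForm": "w", "partOfSpeech": "pos", "lemma": "lem"},
             [{"w": "is", "pos": "verb", "lem": "zijn"}]),
    Resource("corpus_b",
             {"wordForm": "form", "partOfSpeech": "PoS", "lemma": "lemma"},
             [{"form": "is", "PoS": "verb", "lemma": "zijn"},
              {"form": "huis", "PoS": "noun", "lemma": "huis"}]),
]
results = federated_search(R, {"lemma": "zijn"})
```

Each hit comes back in the shared vocabulary and carries the name of the resource it came from, so the aggregated list can be presented or saved uniformly.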

Federated Search

• For many resource formats used in NL there is NOT yet a systematic mapping of their DCs to ISOCAT DCs, e.g. TEI, CGN-format, FoLiA, EAF, GTB, WordNet, CELEX, Praat, …
• A special project should be started up for this
– Nationally for national formats (CGN, FoLiA, …)
– Internationally for generic formats (TEI, CELEX, WordNet, …)

Federated Search

• There are (sometimes trivial) structural differences between resources.

• Description of an occurrence of Dutch ‘is’:

Structural Differences

[Figure: the CGN description and the VU-DNC/FoLiA description of this occurrence]

• Do these two descriptions contain the same or overlapping information?

• ISOCAT alone will not help because there are differences in structure

• Will the LSEs deal with such structural differences?

• Or is something more general needed for this (and is this possible?)

• CGN pw element means the same as FoLiA w element:
• CGN w attribute of pw means the same as FoLiA t element in w
• CGN pos attribute of pw means the same as FoLiA class attribute of element pos (part of speech property name)
• CGN lem attribute of pw means the same as FoLiA class attribute of element lemma (lemma property name)
• Values inside the CGN pos attribute of pw are identical to and mean the same as values inside the FoLiA class attribute of element pos in element w (values from the CGN tagset)
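The correspondences listed above amount to a small structural transformation. The sketch below applies them with Python's standard `xml.etree.ElementTree`. It is deliberately simplified: real FoLiA documents also use an XML namespace, identifiers and set declarations, which are left out here, and the function name `cgn_pw_to_folia_w` and the sample tag value are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def cgn_pw_to_folia_w(pw: ET.Element) -> ET.Element:
    """Rebuild a CGN-style <pw> element as a simplified FoLiA-style <w>."""
    w = ET.Element("w")                      # CGN pw  -> FoLiA w
    t = ET.SubElement(w, "t")                # CGN @w  -> FoLiA t element in w
    t.text = pw.get("w")
    # CGN @pos -> class attribute of FoLiA pos (value copied unchanged,
    # since both sides use values from the CGN tagset).
    ET.SubElement(w, "pos", {"class": pw.get("pos")})
    # CGN @lem -> class attribute of FoLiA lemma.
    ET.SubElement(w, "lemma", {"class": pw.get("lem")})
    return w

# Illustrative input for an occurrence of Dutch 'is' (tag value is an example).
cgn = ET.fromstring('<pw w="is" pos="WW(pv,tgw,ev)" lem="zijn"/>')
folia = cgn_pw_to_folia_w(cgn)
print(ET.tostring(folia, encoding="unicode"))
```

The point of the sketch is that the information is the same on both sides, but it moves from attributes on one element to child elements with their own attributes, which a mapping of data categories alone does not capture.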

• WFT-GTB: Give me entries with PoS=noun of which the headword ends in “tsje”

• GTB, CELEX, CGN-lexicon: Give me entries with PoS=noun and with the headword ending in “tsje”, together with the source (=GTB, CELEX, or CGN-lexicon) in which it was found.
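As a rough illustration of the second query, the sketch below searches several lexicons at once and attaches the source to every hit. The entry layout, the function name `find_entries` and the entries themselves are invented for the example and are not taken from GTB, CELEX or the CGN-lexicon.

```python
# Toy stand-ins for the three lexicons; all entries are invented.
LEXICONS = {
    "GTB": [
        {"headword": "boatsje", "pos": "noun"},
        {"headword": "rinne", "pos": "verb"},
    ],
    "CELEX": [
        {"headword": "huis", "pos": "noun"},
    ],
    "CGN-lexicon": [
        {"headword": "boatsje", "pos": "noun"},
    ],
}

def find_entries(lexicons, pos, suffix):
    """Return matching entries from every lexicon, each tagged with its source."""
    return [
        {"headword": entry["headword"], "pos": entry["pos"], "source": source}
        for source, entries in lexicons.items()
        for entry in entries
        if entry["pos"] == pos and entry["headword"].endswith(suffix)
    ]

hits = find_entries(LEXICONS, "noun", "tsje")
for hit in hits:
    print(hit["headword"], hit["source"])
```

Keeping the source on each hit, rather than merging identical headwords, lets the user see which lexicons agree on an entry.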

Search in Lexicons

• Search in all resources where the language=nld
• For each resource with language=nld
  – For each word in [‘zeer’, ‘heel’, ‘erg’] with PoS=adj
    • For each sense of the word
      – For each synonym of the sense
        » For each lemma of the synonym
          • Return word, PoS, sense, synonym, lemma, ‘synonym’, resource.name
• And analogously with ‘synonym’ replaced by ‘immediate hyperonym’
• And analogously with ‘synonym’ replaced by ‘hyponym’ (incl. hyponyms of hyponyms)
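The nested "for each" steps above translate directly into nested loops. The sketch below assumes a hypothetical in-memory layout for a semantic lexicon (resource, word, sense, related sense, lemma); the layout, the identifiers and the single toy entry are invented and do not reflect the actual CORNETTO data model. The transitive case (hyponyms of hyponyms) would need a recursive walk and is not shown.

```python
# Hypothetical resources; only the first one is Dutch (language=nld).
RESOURCES = [
    {
        "name": "toy_semantic_lexicon",
        "language": "nld",
        "entries": {
            ("zeer", "adj"): [
                {"id": "zeer_1",
                 "relations": {"synonym": [{"id": "syn_a", "lemmas": ["pijnlijk"]}]}},
            ],
        },
    },
    {"name": "toy_english_lexicon", "language": "eng", "entries": {}},
]

def related_rows(resources, words, pos, relation):
    """Return (word, PoS, sense, related, lemma, relation, resource.name) rows."""
    rows = []
    for res in resources:
        if res["language"] != "nld":
            continue                                   # search Dutch resources only
        for word in words:
            for sense in res["entries"].get((word, pos), []):
                for related in sense["relations"].get(relation, []):
                    for lemma in related["lemmas"]:
                        rows.append((word, pos, sense["id"], related["id"],
                                     lemma, relation, res["name"]))
    return rows

rows = related_rows(RESOURCES, ["zeer", "heel", "erg"], "adj", "synonym")
```

Replacing the `relation` argument by another relation name gives the analogous hyperonym query without changing the loop structure.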

Search in Lexicons

• Question: Will federated search somehow smartly ‘know’ (e.g. from the metadata) that it has to search in lexicons only, and in fact only in lexicons that contain synonym information? Or will it waste time and effort by searching in all text corpora and in lexicons that do not have synonym information? Or is a smart choice of resources to search in left to the user?
• Similarly:
• Search in CGN: Give me all utterances that contain the word ‘zeer’ with PoS=ADJ spoken by a speaker with age<=7.
• (There are no speakers with age<=7 in CGN; will federated search smartly be able to see this from the metadata, or will it waste time searching?)
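One way such pruning could work, if the metadata recorded the relevant properties, is sketched below. Whether CLARIN metadata actually carries this information is exactly the open question on this slide; the metadata fields (`type`, `relations`, `min_speaker_age`), the function names and all values are hypothetical.

```python
# Invented metadata records; field names and values are assumptions.
RESOURCES = [
    {"name": "lexicon_with_synonyms", "type": "lexicon", "relations": {"synonym"}},
    {"name": "plain_lexicon", "type": "lexicon", "relations": set()},
    {"name": "spoken_corpus", "type": "corpus", "min_speaker_age": 18},
    {"name": "corpus_without_age_metadata", "type": "corpus"},
]

def resources_for_relation(resources, relation):
    """Keep only lexicons whose metadata says they contain the relation."""
    return [r["name"] for r in resources
            if r["type"] == "lexicon" and relation in r.get("relations", set())]

def may_have_speakers_up_to(resource, max_age):
    """False only if the metadata rules out speakers of age <= max_age."""
    min_age = resource.get("min_speaker_age")
    return min_age is None or min_age <= max_age

synonym_targets = resources_for_relation(RESOURCES, "synonym")
corpora_to_search = [r["name"] for r in RESOURCES
                     if r["type"] == "corpus" and may_have_speakers_up_to(r, 7)]
```

Resources without the relevant metadata cannot be ruled out and still have to be searched, which is why systematic metadata matters for this kind of optimisation.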

Question

• Search in CGN-corpus, VU-DNC, SONAR:
• Give me utterances that contain a subsequence of the form:
– A wordtoken with PoS='definite_determiner', immediately followed by
– A wordtoken with PoS=adjective with vorm=zonder-e, immediately followed by
– A wordtoken with PoS=noun
• (examples are 'het bijvoeglijk naamwoord', 'de gulden snede', 'het ingewikkelder probleem')
• Alternative: just return the subsequence
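The subsequence query above can be read as a list of per-token constraints matched against adjacent tokens. The sketch below shows that reading on a single toy utterance. The token layout (dictionaries with `word`, `pos`, `vorm`) and the function names are assumptions for illustration, not the query language or annotation format of any of the corpora mentioned.

```python
# One constraint per wordtoken, in the order in which they must appear.
PATTERN = [
    {"pos": "definite_determiner"},
    {"pos": "adjective", "vorm": "zonder-e"},
    {"pos": "noun"},
]

def matches(token, constraint):
    """A token matches if it has every property the constraint asks for."""
    return all(token.get(key) == value for key, value in constraint.items())

def find_subsequences(utterance, pattern):
    """Return every run of adjacent tokens that satisfies the pattern."""
    n = len(pattern)
    return [utterance[i:i + n]
            for i in range(len(utterance) - n + 1)
            if all(matches(tok, con)
                   for tok, con in zip(utterance[i:i + n], pattern))]

# Toy utterance: 'dat is het bijvoeglijk naamwoord'.
utterance = [
    {"word": "dat", "pos": "pronoun"},
    {"word": "is", "pos": "verb"},
    {"word": "het", "pos": "definite_determiner"},
    {"word": "bijvoeglijk", "pos": "adjective", "vorm": "zonder-e"},
    {"word": "naamwoord", "pos": "noun"},
]
found = find_subsequences(utterance, PATTERN)
```

Returning `found` corresponds to the alternative of returning just the subsequence; returning `utterance` whenever `found` is non-empty corresponds to returning the whole utterance.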

Search in Corpora

• The same as in the preceding example, but now
• the adjective should not end in two syllables that both contain a schwa (represented by a regular expression over the phonetic transcription) in its phonetic_transcription as found in the CGN-lexicon
• This excludes an example such as: 'het ingewikkelder probleem'
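The schwa condition above can be written as a regular expression once a transcription convention is fixed. The sketch below assumes a SAMPA-like notation in which `@` stands for schwa and `-` marks syllable boundaries; both the convention and the three transcriptions are approximations made up for this example and are not copied from the CGN-lexicon.

```python
import re

# Matches transcriptions whose last two syllables each contain a schwa ("@").
# "[^-]*@[^-]*" is one syllable containing a schwa; "-" separates syllables.
TWO_FINAL_SCHWA_SYLLABLES = re.compile(r"(?:^|-)[^-]*@[^-]*-[^-]*@[^-]*$")

# Toy lexicon: adjective -> assumed phonetic transcription.
PHONETIC = {
    "ingewikkelder": "IN-x@-wI-k@l-d@r",
    "bijvoeglijk": "bEi-vux-l@k",
    "gulden": "xYl-d@n",
}

def ends_in_two_schwa_syllables(transcription):
    return bool(TWO_FINAL_SCHWA_SYLLABLES.search(transcription))

def filter_adjectives(adjectives, lexicon):
    """Keep adjectives with a known transcription that passes the condition."""
    return [adj for adj in adjectives
            if adj in lexicon and not ends_in_two_schwa_syllables(lexicon[adj])]

kept = filter_adjectives(["bijvoeglijk", "gulden", "ingewikkelder"], PHONETIC)
```

Adjectives missing from the lexicon are dropped in this sketch; a real query would have to decide whether to keep or discard them.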

Corpus+Lexicon search

• In addition, return a value for an additional attribute with possible values eFormExists, eFormDoesNotExist, eFormExistenceUnknown.
• The value specifies whether, for the word with PoS=adjective, a form with property vorm=met-e exists, does not exist, or whether it is unknown whether such a form exists.
• Let wv be the value of the attribute word of the wordtoken with properties PoS=adjective, vorm=zonder-e. Look up the entry/ies for wv for which PoS=adjective in the CGN-lexicon and/or CELEX lexicon, and determine its lemma (=wl)
– if not found: result=eFormExistenceUnknown
– if found
• look up in CGN/CELEX an entry with PoS=adjective-code and lemma=wl and vorm=met-e
– if found: result=eFormExists (e.g. (het) bijvoeglijk (naamwoord))
– if not found: result=eFormDoesNotExist (e.g. (de) gulden (snede))
• This can be done in one very complicated query, or the queries might be put in a series where the results of the first query are filtered by the second query, etc.
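The lookup procedure above is a two-step decision, which the following sketch spells out. The lexicon is a toy list standing in for the CGN-lexicon/CELEX, and the entry layout and the function name `e_form_status` are assumptions; the three result values and the two examples are those named on the slide.

```python
# Toy lexicon entries; the layout is an assumption, not the CGN/CELEX format.
LEXICON = [
    {"word": "bijvoeglijk", "pos": "adjective", "lemma": "bijvoeglijk", "vorm": "zonder-e"},
    {"word": "bijvoeglijke", "pos": "adjective", "lemma": "bijvoeglijk", "vorm": "met-e"},
    {"word": "gulden", "pos": "adjective", "lemma": "gulden", "vorm": "zonder-e"},
]

def e_form_status(wv, lexicon):
    """Classify the adjective form wv by whether a met-e form is attested."""
    # Step 1: find the lemma(s) wl of the adjective entries for wv.
    lemmas = {entry["lemma"] for entry in lexicon
              if entry["word"] == wv and entry["pos"] == "adjective"}
    if not lemmas:
        return "eFormExistenceUnknown"
    # Step 2: look for an adjective entry with lemma wl and vorm=met-e.
    for entry in lexicon:
        if (entry["pos"] == "adjective" and entry["lemma"] in lemmas
                and entry["vorm"] == "met-e"):
            return "eFormExists"
    return "eFormDoesNotExist"

for adjective in ["bijvoeglijk", "gulden", "medisch"]:
    print(adjective, e_form_status(adjective, LEXICON))
```

Note that eFormDoesNotExist only means that no such form is listed in the lexicon consulted, so the quality of the answer depends on the coverage of that lexicon.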

• Each result of the previous query is (or contains) a sequence Det ADJ NOUN
• For each result found in the previous query,
– Give me utterances that contain a subsequence of the form:
– A wordtoken with PoS='definite_determiner', immediately followed by
– A wordtoken with PoS=adjective, with lemma=ADJ.lemma and with vorm=met-e, immediately followed by
– A wordtoken with PoS=noun with number=NOUN.number
• Alternative: just return the subsequences
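The iterative step above builds a new query out of each earlier result. The sketch below shows one way to express that: a first-round hit (Det ADJ NOUN) is turned into a second-round pattern, which is then matched against a toy corpus. The token layout, the function names and the corpus are hypothetical and serve only to illustrate the idea.

```python
def matches(token, constraint):
    return all(token.get(key) == value for key, value in constraint.items())

def find_subsequences(utterance, pattern):
    n = len(pattern)
    return [utterance[i:i + n]
            for i in range(len(utterance) - n + 1)
            if all(matches(tok, con)
                   for tok, con in zip(utterance[i:i + n], pattern))]

def follow_up_pattern(hit):
    """Build the second-round pattern from a first-round Det ADJ NOUN hit."""
    _det, adj, noun = hit
    return [
        {"pos": "definite_determiner"},
        {"pos": "adjective", "lemma": adj["lemma"], "vorm": "met-e"},
        {"pos": "noun", "number": noun["number"]},
    ]

# A first-round result: 'het bijvoeglijk naamwoord'.
first_hit = [
    {"word": "het", "pos": "definite_determiner"},
    {"word": "bijvoeglijk", "pos": "adjective", "lemma": "bijvoeglijk", "vorm": "zonder-e"},
    {"word": "naamwoord", "pos": "noun", "number": "sg"},
]

# Toy corpus for the second round.
det = {"word": "de", "pos": "definite_determiner"}
adj_e = {"word": "bijvoeglijke", "pos": "adjective", "lemma": "bijvoeglijk", "vorm": "met-e"}
corpus = [
    [det, adj_e, {"word": "bepaling", "pos": "noun", "number": "sg"}],
    [det, adj_e, {"word": "naamwoorden", "pos": "noun", "number": "pl"}],
]

pattern = follow_up_pattern(first_hit)
second_round = [seq for utt in corpus for seq in find_subsequences(utt, pattern)]
```

Only the utterance with a singular noun is returned, because the number constraint is copied from the noun in the first-round hit.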

Iterative Corpus Search

• Using the MIMORE search engine (MIMORE web app)
• Give me utterances that contain a subsequence of the form:
– A wordtoken with PoS='definite_determiner', immediately followed by
– A wordtoken with PoS=adjective with vorm=zonder-e, immediately followed by
– A wordtoken with PoS=noun
• Alternative: just return the subsequences

Search in MIMORE

• Odijk, J. (2011), "User Scenario Search", internal CLARIN-NL document, April 13, 2011. [docx]

• Odijk, J. (2011), "Linguistic Research in the CLARIN Infrastructure", presentation for the KNAW eHumanities Workshop, NIAS, Wassenaar, Mar 29, 2011 [ppt]. Abstract contained in eHumanities Brainstorm Booklet

• Odijk, J.E.J.M. (2012, October 23). Linguistic Research and the CLARIN Infrastructure. Utrecht, Digital Humanities Lecture. [ppt]

More Examples

Thanks for your attention!
