Basi di dati distribuite

Basi di dati distribuite Prof. M.T. PAZIENZA a.a. 2003-2004


Page 1: Basi di dati distribuite

Basi di dati distribuite

Prof. M.T. PAZIENZA, a.a. 2003-2004

Page 2: Basi di dati distribuite

INFORMATION EXTRACTION and QUESTION / ANSWERING

Page 3: Basi di dati distribuite

Information Extraction

Information Extraction (IE) generally refers to automatic approaches for locating important facts in large collections of documents. The aim is to highlight specific information that can be used to enrich other texts and documents: populating summaries, feeding reports, filling in forms, or storing information for further processing (e.g. data mining). The extracted information is usually structured in the form of “templates”.

Page 4: Basi di dati distribuite

Information Extraction

The process of Information Extraction consists of two major steps:

• To extract individual “facts” from the text of a document through local text analysis

• To integrate extracted facts producing larger facts or new facts (through inference)

Page 5: Basi di dati distribuite

Information Extraction

Short history (1)

IE originated in the natural language processing community under the MUC (Message Understanding Conference) series, which started in 1987 and was sponsored by DARPA. The conferences defined the task as follows: within a specific application domain and corpus, a template with the relevant information has to be filled in for every event of each foreseen class.

Page 6: Basi di dati distribuite

Information Extraction

Short history (2)

In 1995 further goals for IE were proposed:

• To identify processing tasks that are largely domain independent (e.g. NE, Named Entity Recognition)

• To focus on the portability of IE tasks to new event classes

• To add three new tasks: co-reference resolution, word-sense disambiguation, predicate-argument syntactic structuring

Page 7: Basi di dati distribuite

Information Extraction

Terminology

A template is a sort of linguistic pattern (a set of attribute-value pairs, the values being text strings) defined by experts to represent the structure of a specific event in a given domain. The template determines the final output format of the selected information.

The scenario is the specification of the particular events or relations to be extracted.
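
The idea of a template as a set of attribute-value pairs can be sketched as follows. This is only an illustration: the class name `ManagementChangeTemplate`, its slots, and the sample facts are hypothetical and do not come from the slides or from any MUC specification.

```python
from dataclasses import asdict, dataclass, fields
from typing import Dict, Optional


@dataclass
class ManagementChangeTemplate:
    """Hypothetical scenario template: one slot per attribute, values are text strings."""
    person: Optional[str] = None
    position: Optional[str] = None
    organization: Optional[str] = None
    date: Optional[str] = None


def fill_template(facts: Dict[str, str]) -> ManagementChangeTemplate:
    """Copy extracted facts into the template, ignoring facts that have no slot."""
    slot_names = {f.name for f in fields(ManagementChangeTemplate)}
    return ManagementChangeTemplate(
        **{name: value for name, value in facts.items() if name in slot_names}
    )


# Invented facts; "mood" has no slot in the template and is dropped.
facts = {"person": "John Smith", "position": "president",
         "organization": "Acme Corp.", "mood": "happy"}
template = fill_template(facts)
print(asdict(template))
```

Slots for which no fact was found stay empty (`None`), which mirrors a partially filled template.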

Page 8: Basi di dati distribuite

Information Extraction

General Architecture for an IE system

“An IE system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically” (J. Hobbs).

Page 9: Basi di dati distribuite

Information Extraction

General Architecture for an IE system

Each system can be characterized by its own set of modules, drawn from the following: text zoner, pre-processing, filter, preparser, parser, fragment combiner, semantic interpreter, lexical disambiguation, coreference resolution / discourse processing, template generator.
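
A cascade of modules can be sketched as a chain of functions, each taking the output of the previous one and adding structure to it. The sketch below is a toy under stated assumptions: only three of the modules are shown, the blank-line zoning rule, the trigger words, and the output format are invented for illustration, and real systems are far richer.

```python
from functools import reduce
from typing import Callable, Dict, List

Doc = Dict[str, object]          # shared state passed along the cascade
Module = Callable[[Doc], Doc]    # each module adds structure to the state


def text_zoner(doc: Doc) -> Doc:
    """Split the raw text into segments (assumption: blank lines separate them)."""
    segments = [s.strip() for s in str(doc["text"]).split("\n\n") if s.strip()]
    return {**doc, "segments": segments}


def keyword_filter(doc: Doc) -> Doc:
    """Drop segments that contain none of the (hypothetical) trigger words."""
    triggers = {"appointed", "resigned"}
    kept = [s for s in doc["segments"] if triggers & set(s.lower().split())]
    return {**doc, "segments": kept}


def template_generator(doc: Doc) -> Doc:
    """Emit one toy template per surviving segment."""
    return {**doc, "templates": [{"evidence": s} for s in doc["segments"]]}


def run_cascade(text: str, modules: List[Module]) -> Doc:
    """Apply the modules in order, each consuming the previous output."""
    return reduce(lambda state, module: module(state), modules, {"text": text})


result = run_cascade(
    "Weather was fine.\n\nMary Jones was appointed director.",
    [text_zoner, keyword_filter, template_generator],
)
print(result["templates"])
```

The filter step also illustrates the point of the quotation: the first segment is lost along the way, on the assumption that it is irrelevant.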

Page 10: Basi di dati distribuite

Information Extraction

Text zoner

This module turns a text into a set of text segments. At a minimum, it separates the formatted regions from the unformatted ones.

Page 11: Basi di dati distribuite

Information Extraction

Pre-processing

This module:

• locates sentence boundaries in the text, producing for each sentence a sequence of lexical items (words together with their possible POS tags); it also recognizes multiwords (through lexical lookup methods)

• recognizes and normalizes certain basic types that occur in the genre, such as dates, times, personal and company names, locations, currency amounts, and so on.
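
The two pre-processing steps above can be sketched with regular expressions. This is a deliberately naive illustration: the sentence-boundary rule (punctuation followed by a capital letter) would wrongly split after abbreviations such as "Dr.", and the only date format handled is "day Month year"; the function names and the ISO output format are assumptions, not part of the slides.

```python
import re
from typing import List

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Naive boundary: sentence-final punctuation, whitespace, then a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")
DATE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")


def split_sentences(text: str) -> List[str]:
    """Locate sentence boundaries and return the sentences."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def normalize_dates(sentence: str) -> str:
    """Rewrite dates such as '3 March 2004' in ISO form (2004-03-03)."""
    def to_iso(match: re.Match) -> str:
        day, month, year = match.groups()
        return f"{year}-{MONTHS[month]:02d}-{int(day):02d}"
    return DATE.sub(to_iso, sentence)


sentences = split_sentences("The board met on 3 March 2004. It approved the merger.")
normalized = [normalize_dates(s) for s in sentences]
print(normalized)
```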

Page 12: Basi di dati distribuite

Information Extraction

Filter

To speed up processing, this module uses superficial techniques to filter out, from the previously recognized sentences, those that are likely to be irrelevant. In any application, subsequent modules will be looking for patterns of words that signal relevant events. If a sentence has none of these words, there is no reason to process it further.

Page 13: Basi di dati distribuite

Information Extraction

Preparser

This module recognizes very common small-scale structures, simplifying the task of the parser. A few systems at this level recognize noun groups (noun phrases up through the head noun) as well as verb groups (verbs together with their auxiliaries). Appositives can be attached to their head nouns with high reliability (e.g. Prime Minister, President of the Republic, etc.).
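
Recognizing noun groups and verb groups can be sketched as grouping runs of POS tags. The sketch assumes the input is already POS-tagged with Penn-Treebank-style tags, and it treats any maximal run of determiner, adjective, or noun tags as a noun group, which is a simplification of "up through the head noun"; the tag sets and the `chunk` function are illustrative choices.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); Penn-style tags are an assumption

NOUN_GROUP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP"}
VERB_GROUP_TAGS = {"MD", "VB", "VBD", "VBN", "VBZ"}


def chunk(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Group adjacent tokens into noun groups (NG) and verb groups (VG)."""
    chunks: List[Tuple[str, List[str]]] = []
    for word, tag in tokens:
        if tag in NOUN_GROUP_TAGS:
            label = "NG"
        elif tag in VERB_GROUP_TAGS:
            label = "VG"
        else:
            label = "O"  # outside any group
        if chunks and label != "O" and chunks[-1][0] == label:
            chunks[-1][1].append(word)      # extend the current group
        else:
            chunks.append((label, [word]))  # start a new group
    return [(label, " ".join(words)) for label, words in chunks]


tagged = [("The", "DT"), ("new", "JJ"), ("president", "NN"),
          ("has", "VBZ"), ("been", "VBN"), ("appointed", "VBN"),
          ("by", "IN"), ("the", "DT"), ("board", "NN")]
groups = chunk(tagged)
print(groups)
```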

Page 14: Basi di dati distribuite

Information Extraction

Parser

This module takes a sequence of lexical items (fragments) and tries to produce a parse tree for the entire sentence.

More and more systems are abandoning full-sentence parsing in information extraction applications: since they are interested only in recognizing fragments, they just try to locate within the sentence the various patterns that are of interest for the application.

Page 15: Basi di dati distribuite

Information Extraction

Fragment combiner

This module indicates how to combine the previously obtained parse tree fragments.

Page 16: Basi di dati distribuite

Information Extraction

Semantic interpreter

This module translates the parse tree or parse tree fragments into any of: a semantic structure, a logical form, or an event frame. Often lexical disambiguation takes place at this level as well.

The method for semantic interpretation is function application or an equivalent process that matches predicates with their arguments.

Page 17: Basi di dati distribuite

Information Extraction

Lexical disambiguation

Lexical disambiguation translates a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates.

More generally, lexical disambiguation constrains the interpretation by the context in which the ambiguous word occurs, perhaps together with the “a priori” probabilities of each word sense.
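
Combining context with a priori sense probabilities can be sketched as a simple scoring rule. Everything here is an illustrative assumption: the two senses of "bank", their prior probabilities, the cue words, and the score (prior multiplied by one plus the number of cue words found in the context) are invented and do not describe any particular system.

```python
from typing import Dict, List, Set

# Hypothetical sense inventory for the word "bank".
PRIORS: Dict[str, float] = {"bank/finance": 0.7, "bank/river": 0.3}
CUES: Dict[str, Set[str]] = {
    "bank/finance": {"loan", "money", "account"},
    "bank/river": {"river", "water", "fishing"},
}


def disambiguate(context: List[str]) -> str:
    """Pick the sense maximizing prior * (1 + number of cue words in context)."""
    words = {w.lower() for w in context}

    def score(sense: str) -> float:
        return PRIORS[sense] * (1 + len(CUES[sense] & words))

    return max(PRIORS, key=score)


with_context = disambiguate("he sat on the bank of the river fishing".split())
without_context = disambiguate("the bank closed".split())
print(with_context, without_context)
```

With two river-related cue words the less frequent sense wins (0.3 × 3 against 0.7 × 1); with no cue words the choice falls back on the prior.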

Page 18: Basi di dati distribuite

Information Extraction

Coreference resolution / discourse processing

This module resolves:

• co-reference for basic entities such as pronouns, definite noun phrases, and anaphora

• the reference for more complex entities like events: an event may be identified with an event that was found previously, it may be a consequence of a previously found event, or it may fill a role in a previous event.

Page 19: Basi di dati distribuite

Information Extraction

Template generator

Semantic structures generated by the natural language processing modules are used to produce the template, as specified by the final user, only when the events pass the defined threshold of interest.

Page 20: Basi di dati distribuite

Information Extraction

There is also agreement on a number of features: named entity recognition, co-reference resolution, template production, scenario template production.

Page 21: Basi di dati distribuite

Information Extraction

Named entity recognition

It refers to the identification (inside the text) and extraction of named entities (NEs). NEs generally relate to domain concepts and are associated with semantic classes such as person, organization, place, date, amount, etc.

The accuracy of NE recognition is very high (more than 90%) and comparable with that of humans.
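
A very small NE recognizer can be sketched by combining a name list (gazetteer) with patterns for regular types such as dates and amounts. The gazetteer entries, the patterns, and the sample sentence are hypothetical, and this lookup approach is only a sketch; it is not the method behind the accuracy figure quoted above.

```python
import re
from typing import List, Tuple

# Hypothetical gazetteer: known names and their semantic class.
GAZETTEER = {"Mary Jones": "PERSON", "Acme Corp.": "ORGANIZATION", "Rome": "PLACE"}

MONTH = ("January|February|March|April|May|June|July|"
         "August|September|October|November|December")
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2} (?:" + MONTH + r") \d{4}\b")),
    ("AMOUNT", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
]


def recognize(text: str) -> List[Tuple[str, str]]:
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(0), label))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), label))
    return [(entity, label) for _, entity, label in sorted(found)]


entities = recognize("Mary Jones joined Acme Corp. in Rome on 3 March 2004 for $250,000.")
print(entities)
```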

Page 22: Basi di dati distribuite

Information Extraction

Co-reference resolution

It allows identifying identity relations between previously extracted NEs.

Anaphora resolution is widely used to recognize relevant information about either concepts (NEs) or events scattered throughout the text: this activity constitutes an important source of information, enabling the system to assign a statistical relevance to recognized events.

Page 23: Basi di dati distribuite

Information Extraction

Template production

As a result of the previous activities, an IE system becomes aware of NEs and their descriptions. This represents a first level of template (called TE, “Template Element”). The collection of TEs may be considered a basic knowledge base that the system accesses to get information on the main domain concepts, as they have been recognized in the text.

Page 24: Basi di dati distribuite

Information Extraction

Scenario template production

It results from a synthesis of several tasks, mainly the identification of Template Elements that are related to one another: it represents an event (scenario) related to the domain under analysis; the recognized values are used to fill in a scenario template.

Page 25: Basi di dati distribuite

Information Extraction

Adaptive IE systems

As an example, several big companies have millions of documents, stored in different parts of the world and available via intranets, in which the knowledge of their employees is stored.

Textual documents cannot be queried in a traditional fashion and therefore the stored knowledge can neither be used by automatic systems, nor be easily managed by humans.

Knowledge is difficult to capture, share and reuse among employees, reducing the company's efficiency and competitiveness.

Page 26: Basi di dati distribuite

Information Extraction

Adaptive IE systems

IE is well suited to support knowledge identification and extraction from Web documents, as it can support document analysis either in an automatic approach (unsupervised extraction of information) or in a semi-automatic one (e.g. as support for human annotators in locating relevant facts in documents, via information highlighting).

A machine-learning approach may be helpful.

Page 27: Basi di dati distribuite

Information Extraction

Adaptive IE systems

Machine learning (ML) techniques have been successfully applied to some lower-level NLP tasks.

NE recognition, chunking, and co-reference and anaphora resolution are interesting examples of such approaches.

Page 28: Basi di dati distribuite

Question / Answering

A Q/A system

• accepts questions in natural language form,

• searches for answers over a collection of documents,

• extracts the information relevant to the question,

• formulates concise answers.

Page 29: Basi di dati distribuite

Question / Answering

Short history

The Q/A tracks of the TREC conferences have supported the definition of a common approach to the matter.

Q/A systems are open domain, so their performance is tightly coupled with the complexity of the questions asked and the difficulty of answer extraction.

Page 30: Basi di dati distribuite

Question / Answering

Taxonomy of Q/A systems

1. Linguistic and knowledge resources

2. Natural language processing involved

3. Document processing

4. Reasoning methods

5. Whether or not the answer is explicitly stated in a document

6. Whether or not answer fusion is necessary

Page 31: Basi di dati distribuite

Question / Answering

Question classes

1. Q/A systems capable of processing factual questions

2. Q/A systems enabling simple reasoning mechanisms

3. Q/A systems capable of answer fusion from different documents

4. Interactive Q/A systems

5. Speculative questions

Page 32: Basi di dati distribuite

Question / Answering

Current approaches

1. Question analysis

2. Document collection processing

3. Candidate document selection

4. Candidate document analysis

5. Answer extraction

6. Response generation

Page 33: Basi di dati distribuite

Question / Answering

Question analysis

The question is analyzed for subsequent processing. It may be interpreted in the context of an ongoing dialogue and in the light of a model that the system has of the user. The user may be asked to clarify the question before processing.
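
One common form of question analysis, shown here only as a sketch, is to guess the expected answer type from the question word and to keep the content words as keywords for later stages. The mapping from question words to answer types, the stopword list, and the function name are hypothetical; dialogue context and user models are not covered.

```python
from typing import Dict, List

# Hypothetical mapping from question word to expected answer type.
ANSWER_TYPES = {"who": "PERSON", "where": "PLACE", "when": "DATE", "how many": "AMOUNT"}
STOPWORDS = {"the", "a", "an", "of", "is", "was", "did", "in",
             "who", "where", "when", "how", "many"}


def analyze_question(question: str) -> Dict[str, object]:
    """Return the expected answer type and the content keywords of a question."""
    text = question.lower().rstrip("?").strip()
    answer_type = next(
        (kind for cue, kind in ANSWER_TYPES.items() if text.startswith(cue)),
        "OTHER",
    )
    keywords: List[str] = [w for w in text.split() if w not in STOPWORDS]
    return {"answer_type": answer_type, "keywords": keywords}


analysis = analyze_question("Who was the first president of Acme?")
print(analysis)
```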

Page 34: Basi di dati distribuite

Question / Answering

Document collection processing

The reference document collection is the knowledge source for answering questions. It needs to be preprocessed.

Page 35: Basi di dati distribuite

Question / Answering

Candidate document selection

A subset of the document collection is selected, comprising those documents deemed most likely to contain an answer to the question.
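
Candidate selection can be sketched as ranking documents by keyword overlap with the question. This is a minimal stand-in for a real retrieval engine: the scoring (number of distinct question keywords in the document), the `top_k` parameter, and the sample collection are assumptions made for illustration.

```python
from typing import Dict, List


def select_candidates(keywords: List[str], collection: Dict[str, str],
                      top_k: int = 2) -> List[str]:
    """Return the ids of the documents sharing the most keywords with the question."""
    wanted = set(keywords)
    scored = [(len(wanted & set(text.lower().split())), doc_id)
              for doc_id, text in collection.items()]
    # Keep only documents with at least one keyword; best score first, ties by id.
    ranked = sorted((item for item in scored if item[0] > 0),
                    key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in ranked[:top_k]]


collection = {
    "d1": "acme appointed its first president in 1950",
    "d2": "the river bank was flooded",
    "d3": "the president spoke",
}
candidates = select_candidates(["first", "president", "acme"], collection)
print(candidates)
```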

Page 36: Basi di dati distribuite

Question / Answering

Candidate document analysis

Additional detailed analysis of the candidates selected at the preceding stage may be required.

Page 37: Basi di dati distribuite

Question / Answering

Answer extraction

Candidate answers are extracted from the documents and ranked in terms of probable correctness.
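
Extraction and ranking can be sketched as follows, under strong simplifying assumptions: candidates are multi-word capitalized names found by a toy pattern, and "probable correctness" is approximated by the number of question keywords occurring in the same sentence. The pattern, the scoring, and the sample sentences are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Toy candidate pattern: two or more consecutive capitalized words.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_answers(keywords: List[str], sentences: List[str]) -> List[Tuple[str, int]]:
    """Return (candidate, score) pairs, best first."""
    wanted = set(keywords)
    scores: Dict[str, int] = {}
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        score = len(wanted & words)  # keywords sharing the sentence with the candidate
        for candidate in CANDIDATE.findall(sentence):
            scores[candidate] = max(scores.get(candidate, 0), score)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


answers = extract_answers(
    ["first", "president", "acme"],
    ["Mary Jones was the first president of Acme.",
     "In 1960 Paul Brown visited Acme."],
)
print(answers)
```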

Page 38: Basi di dati distribuite

Question / Answering

Response generation

A response is returned to the user. It may be affected by the dialogue context and user model, if present, and may in turn lead to these being updated.