Practical Project of the 2006 Joint International Master’s Degree.
Agenda
- Introduction
- Technologies in use
- Architecture
- Demonstration
- Remaining issues
- Work packages for Semester II
- Questions & comments
Introduction
- Practical project during the course of studies
- Timeframe: two terms
- Topic: prototype of a semantic search engine using UIMA
- Objectives of the first semester
  - Study the UIMA framework and the OpenNLP library
  - Search for players, teams, matches and dates
  - Semantic search for goal events
  - Implement an executable prototype
Technologies in Use
- UIMA framework
- OpenNLP
- Java / JavaServer Pages
- Tomcat server
- Python (web crawler)
Architecture: Overview
[Diagram. Labels recovered from the slide:]
- Unstructured information (plain text)
- Converter (parser)
- Persistent search index
- UIMA framework (input, output, CAS)
  - NLP-Annotator 1 (OpenNLP): paragraph detection, sentence detection, word detection
  - Date & Time annotator
  - Player annotator
  - Match annotator
  - Goal-Event annotator
- User interface
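The annotator chain on this slide can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the UIMA API: `Cas`, `Annotation`, `word_annotator`, `capitalized_annotator` and `run_pipeline` are hypothetical names. It only shows the general idea that annotators run one after another and share their results through a common structure, which is the role the CAS plays in UIMA.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "Word"
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass
class Cas:
    """Hypothetical stand-in for a CAS: the text plus all annotations on it."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]

    def select(self, kind):
        return [a for a in self.annotations if a.kind == kind]


def word_annotator(cas):
    # Upstream annotator: marks every word with its offsets.
    for m in re.finditer(r"\w+", cas.text):
        cas.annotations.append(Annotation("Word", m.start(), m.end()))


def capitalized_annotator(cas):
    # Downstream annotator: reads Word annotations instead of the raw text.
    for word in cas.select("Word"):
        if cas.covered_text(word)[0].isupper():
            cas.annotations.append(Annotation("Capitalized", word.begin, word.end))


def run_pipeline(text, annotators):
    cas = Cas(text)
    for annotate in annotators:
        annotate(cas)
    return cas
```

Because the second annotator depends on the first, the order of the list passed to `run_pipeline` matters.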
Architecture: Webcrawler
- Web crawler used to preselect texts
- Implemented in Python
- Crawls approx. 2,500 pages in 20 minutes
- Currently keyword-based
- Transfer of results to Jimgle is still manual
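A keyword-based preselection of this kind might look roughly like the following sketch. The keyword set, the threshold and the function names are assumptions made for illustration; the slides do not describe the crawler's actual keywords or scoring, and page fetching is left out.

```python
import re

# Hypothetical keyword list; the project's real keywords are not given.
KEYWORDS = {"goal", "match", "world cup"}


def keyword_score(text, keywords=KEYWORDS):
    """Count whole-word occurrences of all keywords in the text."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(keyword) + r"\b", lowered))
        for keyword in keywords
    )


def preselect(pages, min_score=2):
    """Keep the URLs whose page text reaches the minimum keyword score.

    pages maps a URL to the plain text already extracted from that page.
    """
    return [url for url, text in pages.items() if keyword_score(text) >= min_score]
```

The word-boundary match keeps "goalkeeper" from counting as "goal"; whether that is desirable depends on the texts being collected.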
Architecture: NLP-Annotator
- Uses the OpenNLP tools & API
  - Rule-based approach
  - Tagging of paragraphs, sentences and words
  - Part-of-speech tagging
- Implemented in UIMA as a separate annotator
  - Results are used by subsequent annotators
  - Internal use only, not shown in the search index
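To make the sentence and word tagging concrete, here is a deliberately naive rule-based sketch. It does not call OpenNLP; the two regular expressions and the function names are assumptions, and the sentence rule will split wrongly at abbreviations such as "e.g.". The point is only that each annotation is a pair of character offsets into the original text.

```python
import re

# Naive rules (assumptions): a sentence runs up to the next ., ! or ?;
# a word is a run of word characters, optionally joined by hyphens.
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]*")
WORD_RE = re.compile(r"\w+(?:-\w+)*")


def detect_sentences(text):
    """Return (begin, end) offsets of sentences, without surrounding spaces."""
    spans = []
    for m in SENTENCE_RE.finditer(text):
        segment = m.group()
        stripped = segment.strip()
        if stripped:
            begin = m.start() + len(segment) - len(segment.lstrip())
            spans.append((begin, begin + len(stripped)))
    return spans


def detect_words(text, span):
    """Return (begin, end) offsets of the words inside one sentence span."""
    begin, end = span
    return [
        (begin + m.start(), begin + m.end())
        for m in WORD_RE.finditer(text[begin:end])
    ]
```

Downstream annotators can then work on these offsets rather than re-tokenizing the text themselves.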
Architecture: Player-Annotator
- Identification of players of the 2006 World Cup (WM2006)
- Rule-based implementation
- Uses the OpenNLP word annotations
- Matching against the player database (XML file)
- Considers last names and nicknames
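The matching against the player database could be sketched as follows. The XML layout (`<player>`, `<lastName>`, `<nickname>`), the sample entries and the function names are assumptions; the slides only state that the database is an XML file and that last names and nicknames are considered.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical schema and sample data; the real XML file is not shown.
PLAYERS_XML = """
<players>
  <player id="p1" team="Germany">
    <lastName>Klose</lastName>
    <nickname>Miro</nickname>
  </player>
  <player id="p2" team="Germany">
    <lastName>Schweinsteiger</lastName>
    <nickname>Schweini</nickname>
  </player>
</players>
"""


def load_lookup(xml_text):
    """Map every lower-cased last name and nickname to its player id."""
    lookup = {}
    for player in ET.fromstring(xml_text.strip()).iter("player"):
        names = [player.findtext("lastName")]
        names += [nick.text for nick in player.findall("nickname")]
        for name in names:
            if name:
                lookup[name.lower()] = player.get("id")
    return lookup


def annotate_players(text, lookup):
    """Return (begin, end, player_id) for every word found in the lookup."""
    return [
        (m.start(), m.end(), lookup[m.group().lower()])
        for m in re.finditer(r"\w+", text)
        if m.group().lower() in lookup
    ]
```

A single-word lookup like this cannot tell apart two players who share a last name; that would need additional context such as the team.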
Architecture: Date & Time-Annotator
- Identification of time and date information
- Uses the OpenNLP word annotations
- Currently a custom, rule-based implementation
- Detects standard-conformant time & date information
- Detection of relative or colloquial time information not implemented yet
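A rule-based detector for standard date and time notations might be sketched like this. Which formats the project actually treats as "standard" is not stated; the three patterns below (YYYY-MM-DD, DD.MM.YYYY and HH:MM) are assumptions, as are all names. Candidates are validated with `datetime.strptime`, so impossible values such as month 13 are rejected.

```python
import re
from datetime import datetime

# Assumed formats: (annotation kind, pattern, strptime format for validation).
PATTERNS = [
    ("Date", re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    ("Date", re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"), "%d.%m.%Y"),
    ("Time", re.compile(r"\b\d{1,2}:\d{2}\b"), "%H:%M"),
]


def annotate_dates_and_times(text):
    """Return sorted (begin, end, kind) tuples for valid dates and times."""
    found = []
    for kind, pattern, fmt in PATTERNS:
        for m in pattern.finditer(text):
            try:
                datetime.strptime(m.group(), fmt)  # reject e.g. 2006-13-40
            except ValueError:
                continue
            found.append((m.start(), m.end(), kind))
    return sorted(found)
```

Relative or colloquial expressions ("yesterday", "last Friday") are out of reach for patterns like these, which matches the open point noted on the slide.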
Architecture: Match-Annotator
- Identification of matches
- Based on 3 components
  - Detection of locality
  - Detection of participating teams
  - Detection of the match result
- Uses upstream annotators
  - OpenNLP word annotations
  - Player annotations
  - Date & time annotations
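How the three components could be combined is sketched below. The team and venue lists, the score pattern and all names are assumptions for illustration; the sketch works on raw text, whereas the project's annotator builds on the upstream annotations.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical lookup lists; the project's real data sources are not shown.
TEAMS = ("Germany", "Costa Rica", "Italy", "France")
VENUES = ("Munich", "Berlin", "Dortmund")
SCORE_RE = re.compile(r"\b(\d{1,2}):(\d{1,2})\b")


@dataclass
class Match:
    teams: Tuple[str, str]
    score: Optional[Tuple[int, int]]
    venue: Optional[str]


def find_mentions(text, names):
    """Return the names that occur in the text, in order of first occurrence."""
    hits = [(text.find(name), name) for name in names if name in text]
    return [name for _, name in sorted(hits)]


def annotate_match(sentence):
    """Combine team, result and locality detection for one sentence."""
    teams = find_mentions(sentence, TEAMS)
    if len(teams) != 2:
        return None  # a match needs exactly two participating teams
    score = SCORE_RE.search(sentence)
    venues = find_mentions(sentence, VENUES)
    return Match(
        teams=(teams[0], teams[1]),
        score=(int(score.group(1)), int(score.group(2))) if score else None,
        venue=venues[0] if venues else None,
    )
```

Note that the score pattern would also accept a kick-off time such as 18:00. Excluding spans already marked by the date and time annotator is one reason to rely on upstream annotations.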
Architecture: Goal-Event Annotator
- Descriptions of goals are too complex for rule-based detection
- Therefore: machine learning
  - Uses the OpenNLP library
  - Based on statistical information about sentences
  - Comprehensive training necessary
- Implemented as an OpenNLP component
- Integrated into UIMA via wrapper classes
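The idea of classifying sentences from word statistics can be illustrated with a small Naive Bayes classifier. This is a stand-in technique chosen because it fits in a few lines of plain Python; it is not the OpenNLP model used in the project, and the training sentences are invented toy data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of sentences
        self.vocab = set()

    def train(self, examples):
        for sentence, label in examples:
            self.label_counts[label] += 1
            for word in tokenize(sentence):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(sentence):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data; a real model needs far more examples.
TRAIN = [
    ("Klose scored with a header in the second half", "goal"),
    ("Podolski shot the ball into the net", "goal"),
    ("He scored the equalizer from close range", "goal"),
    ("The match takes place in Berlin", "other"),
    ("The coach announced the squad on Monday", "other"),
    ("Tickets for the stadium are sold out", "other"),
]

model = NaiveBayes()
model.train(TRAIN)
```

With only six training sentences the classifier reacts to single words such as "scored", which illustrates why the slide calls for comprehensive training.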
Architecture: Persistent Indexing
- Functionality
  - Import of all files in a specific directory
  - Annotation of all available texts
  - Compilation of XML files with the CAS data of every source text
  - Subsequent creation of a search index
- Provision of index files for the web server
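The import-and-index step can be sketched as a simple inverted index written to disk. This is an assumption-laden simplification: it indexes plain words from `.txt` files and stores JSON, whereas the project writes XML files with CAS data and uses its own index format. All function names are hypothetical.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def build_index(directory):
    """Map every lower-cased word to the sorted list of files containing it."""
    index = defaultdict(set)
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(path.name)
    return {term: sorted(files) for term, files in index.items()}


def save_index(index, index_file):
    """Persist the index so that a web server can load it later."""
    Path(index_file).write_text(json.dumps(index, sort_keys=True), encoding="utf-8")


def search(index, *terms):
    """Return the files that contain all given terms (logical AND)."""
    hits = [set(index.get(term.lower(), ())) for term in terms]
    return sorted(set.intersection(*hits)) if hits else []
```

Indexing annotation types instead of plain words (for example, a "Player" entry per document) would be the step that turns this into a semantic index.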
Architecture: Graphical User Interface
- Linux server with Tomcat installation
- Simple operation via web-based GUI
- Search queries are handled by JavaServer Pages
- Requests are processed by JavaBeans
Demonstration: Search engine
Open Issues: Further proceeding…?
- Search for attributes, e.g. Player AND Germany (currently only via OmniFind)
- Automate processing of search engine results
- Further training of the components
- Usability improvements at front end and back end
New scenarios… for the second semester
- Automated analysis of e-mails
  - Search for phone numbers
  - Search for customer contacts of employees
  - Find employees with specific skills
  - Find links & relations between employees
- Competitive analysis
  - Compare own products with those of competitors
  - Find out about customer opinions in internet portals
- Further ideas??
Ideas… for the second semester
- Natural-language search queries
- Design templates for customizable annotators
- Machine learning for the web crawler
- Mark annotations in the search results
- Automated processing of search results
- Implement more annotators via OpenNLP
- Provide annotators as web services
- Further ideas??
JIMGLE: JIM Master-Project
Questions?
Suggestions?
Thanks for your attention…