© T. Horton1 Object-oriented Models for Text and Document Processing Dr. Tom Horton Dept. of...
-
Upload
austen-fox -
Category
Documents
-
view
217 -
download
2
Transcript of © T. Horton1 Object-oriented Models for Text and Document Processing Dr. Tom Horton Dept. of...
© T. Horton 1
Object-oriented Models forText and Document Processing
Dr. Tom Horton
Dept. of Computer Science and Engineering
Florida Atlantic University
Boca Raton, FL 33431
Email: [email protected]
© T. Horton 2
My Areas of Interest
• CS and Software Engineering Education
• Software Engineering and Development
• Humanities Computing
© T. Horton 3
Software Engineering and Development
• My interests in these topics are primarily:– How to teach such things better– How to apply them to certain problem areas
• Object-oriented analysis and design– Stuff taught in CS494 (patterns, architecture)– SW architecture (particularly in problem domains)– GUI development, testing– HCI and usability (CS305)
• including evaluation experiments)• Pen and ink in tablet PCs
© T. Horton 4
CS and SW Engin. Education
• Important: Needs to be an education research oriented project, which means:– create something new– try it out experimentally– evaluate results
• Doing this with real subjects can:– require lots of advance planning– be time-consuming– be messy
© T. Horton 5
CS and SW Engin. Education
• Possible topic areas:– How do beginning students learn how to code, or
use tools?– How do more experienced students learn how to
design, debug, problem-solve?– Usability of tools (like IDEs, debuggers, design
tools)– Creating tools or environments to support
education• collecting feedback on student learning, etc.
© T. Horton 6
Humanities Computing:An “Old” User Community
• 1964 Literary Data Processing Conference– Papers on corpus preparation, stylistics,
dictionaries– Most common software tool: concordance
program• Joseph Raben’s research problem:
– Given two texts, find pairs of sentences that contain verbal echoes
– Shelley’s Prometheus Unbound heavily influenced by the language of Milton’s Paradise Lost
– Raben’s papers emphasize algorithm
© T. Horton 7
Humanities Computing
• Applying software, algorithms, etc. to problems of interest to humanities scholars– In particular, literary text analysis– Tools for finding things, features, “chunks”, or
showing relationships between texts• Data mining and text mining• Information visualization• XML processing, databases of features from
documents structured in XML
• Working with real users, real problems here at UVa
© T. Horton 8
Overview
• Background: text processing, humanities computing• Domain analysis and engineering• Applying DA to text processing• A common architecture
– Regions– Object model
• Some benefits– Relation to XML solutions– Query visualization
• Conclusions
© T. Horton 9
Background and Assumptions
• More and better software tools are possible• Modern markup makes tool development harder
– So many new possibilities for using markup! How?– Processing SGML has not been simple
• New developments will help:– XML, XSL, Document Object Model (DOM)– Java, Unicode, SGML/XML tool support
• Perhaps it’s the time for tools to catch up with markup
© T. Horton 10
My Interests
• Improving our ability to develop new text processing and analysis software tools for humanities users– Based on my viewpoint as a software engineer– Includes study of user requirements, designs,
frameworks, reusable components• We could develop a family of software tools that:
– satisfy common core requirements in the same way
– share common core concepts and approaches– are based on a general model of text processing– that use a flexible software architecture
© T. Horton 11
Background: Text Processing and Humanities Computing
• Users: scholars studying texts, linguists, etc. for the purpose of – preparing editions,– carrying out stylistic or authorship analyses,– finding relationships between multiple texts,– finding parts of a text (perhaps in a large corpus)
with certain traits,• Characteristics:
– Texts often not modern English– Mark-up such as SGML or XML very important
© T. Horton 12
An “Old” User Community
• 1964 Literary Data Processing Conference– Papers on corpus preparation, stylistics,
dictionaries– Most common software tool: concordance
program• Joseph Raben’s research problem:
– Given two texts, find pairs of sentences that contain verbal echoes
– Shelley’s Prometheus Unbound heavily influenced by the language of Milton’s Paradise Lost
– Raben’s papers emphasize algorithm
© T. Horton 13
Software Tools Needed
• Few software tools have been developed– The user community recognizes this a major
problem– Example: No program exists to solve Raben’s
problem• Reasons:
– Community is dispersed and is not resource rich– Supporting multi-lingual definitions of alphabets,
collation sequence, etc. and their output– Recently: SGML markup adds complexity– Users need good user interfaces
© T. Horton 14
Existing Software Tools
• Concordance programs– Define text characteristics– Find and group words, in variety of formats
• Text retrieval systems– Centered on data repository– Possibly client-server (e.g. Sara)– Interactive (not batch) so promotes text
exploration– Work at finer level of text granularity than
traditional information retrieval systems
© T. Horton 15
Existing Software Tools (2)
• Collation programs– Process and group two or more related texts
© T. Horton 16
Domain Engineering and Text Processing
• Use domain-specific approach for software development for the text analysis domain.
• Develop common definitions of requirements, objects, desired features, etc.
• We should work to develop requirements, architectures and components that are– reusable– not limited to one environment or architectural
approach
© T. Horton 17
Domain Engineering
• Domain engineering is an approach to support systematic reuse when developing multiple systems in a product family
• Domain: a problem space for a family of applications with similar requirements
• Domain boundary: separates one domain from another (possibly related domain)
• Subdomain: a subset of a domain that describes related components or assets
© T. Horton 18
Software Reuse
• Three levels of reuse– ad hoc; opportunistic; and, systematic
• Systematic reuse:– well-planned; cost-effective; part of an
organization’s process– requires infrastructure, asset management,
culture-change, standards, policies• Question: Can we hope for systematic reuse in the
text processing domain? Probably not:– No “central” development organization
© T. Horton 19
Domain Engineering Model
DomainAnalysis
DomainDesign
DomainRepository
DomainImplementation
© T. Horton 20
Domain Analysis
• A “metalevel” version of conventional requirements analysis– Define a domain engineering lifecycle similar to
software engineering, but with the goal of developing reusable requirements, designs, components, etc.
– Output of domain analysis is used in the next stage to “design” reusable architectures and components that can be used in many systems.
• Similar in concept to knowledge engineering in expert systems development
© T. Horton 21
Domain Analysis Activities
• Find and agree upon:– domain definition and boundary
– common objects, functions, features
– user characteristics
– scenarios to determine what are common interactions
• Outputs: Domain model– common vocabulary– generic requirements that apply to more than one
system– combinations and trade-offs of features
• Form: text, UML diagrams, etc.
© T. Horton 22
Approaches to Domain Analysis
• Two views of domain analysis:– Study the domain top-down, as an abstract
problem.– Study existing products and systems.
• For second approach:– Analyze existing software applications– What do they have in common?– Why are there differences? “Accidental” or
necessary?
© T. Horton 23
How to Use the Domain Model
• As a shared language for describing existing systems (or new systems)
• To define product requirements for new systems to be developed– derived requirements
• To design a reusable domain architecture and smaller reusable components
© T. Horton 24
Applying All This to Text Processing
• Broad categories of common requirements:
– Definition and description of words and text structure.
– Initial text processing, e.g. recognizing tokens, mark-up
– Defining and selecting context or regions within a text
– Retrieval or search.
– Document organization
– Quantitative results, e.g. counts, statistics.
– User-interface issues (displaying context; displaying tagged text; export of information)
• The following slides give examples of reusable components and systems they might produce.
© T. Horton 25
How is Mark-up Used?
• Highest level of description (abstraction):– (A) To select a subdocument(s), or restrict an
action to part of a document.– (B) To identify a location in a text for output.
• Reference identifiers in KWIC, etc.
– (C) To navigate within a text, or reference next entity, parent, etc.
• Get next sentence, or get chapter number for current sentence.
© T. Horton 26
How Mark-up is Used? (cont’d)
– (D) To contain a word-level alternative for some part of the text for processing.
• E.g., in TEI: Bob likes <corr sic=“it’s”> its </corr> name.
– (E) To qualify a word to distinguish it from words with same orthographic representation.
• E.g. <mentioned lang=fra>Savoir-faire</mentioned> is French for know-how. <name>Mark</name> put his mark there.
© T. Horton 27
SW Requirements To Support These Needs
• Store and navigate an SGML/XML element tree (Needs A, C)
• Given a location, find set of elements its stored in (B, C)
• Retrieve elements based on attribute values (Need A)– E.g. all elements of any type with lang=fra
© T. Horton 28
SW Requirements (cont’d)
• For word extraction from PCdata:– Choose among alternatives for <corr>, <del>, etc.– Distinguish words with same string values but with
mark-up for name, foreign, etc.
© T. Horton 29
Reusable Requirements
• (R1) Standards for word, character set, alphabet, etc.• (R2) Definitions and requirements for selecting a
region or sub-document in a text.• (R3) Search, defined in terms of abstract concepts
relating to words, regions, mark-up, etc.
© T. Horton 30
Reusable Designs, Components, Interfaces, Etc.
• (D1) SGML or XML front-end components.• (D2) Data structure models for R1, word and
alphabet standards.• (D3) Fuzzy word matching algorithm and
implementation.• (D4) An API for a text-base query language (XML
XPATH)• (D5) A design and reusable code for a text database
management system (Berkeley DB)• (D6) An API for document structure manipulation
(XML DOM).
© T. Horton 31
Programs and Tools That Could Result
• Repository-centered systems
– New TACT: a information-retrieval like system
– WordCluster: find parts of a document with concentrations of words from one or more categories or themes.
– PageTurner: a program to manipulate and “conditionally” display a text according to its mark-up
• Pipeline (or pipe and filter) systems:
– New concordance-like programs
– TextComp: an implementation of Raben’s solution for finding passages in a pair of texts that echo each other
• Others might include: Collate, a SGML Tag Editor
© T. Horton 32
A Region-based Architecture
• Used by sgrep, a command-line search (or grep) that:– models “things” in files as regions (AKA spans)– similar to TIPSTER Info. Retr. architecture (GATE)
• Why isn’t a markup-centered approach enough?– The DOM for XML provides a model and an API
for software to manipulate structured documents.– Query languages are coming: XQL or ???
• We need both: A markup-centered approach does not completely address:– finer-grained features not marked up– non-hierarchical features
© T. Horton 33
Definitions
• sgrep queries and results defined as:– region: a chunk of text defined by a starting and
ending byte-position in a file– region sets: a collection of regions
• Region sets can be combined in powerful ways– sgrep finds nested regions– given two region sets, finds subset of one that is
contained in another– given two region-sets, finds subset of one that
contains some member of the other– sets can be merged etc.
© T. Horton 34
Example: Regions and Region Sets
“honor”:Text:
(1) All Occurrences of “honor”:
DIV1Text:
(2) All Occurrences of DIV1 elements:
Text:
(3) All Occurrences of “honor” in a DIV1 element:
“honor”:
© T. Horton 35
Region Examples
• Region sets:– all words in a text; all characters; all syllables– all occurrences of a given token– all DIV1 elements– all elements that have attribute with a given value
• Queries or subtext selection. Examples: Let’s find:– Speeches by Hamlet in Acts 4 and 5,...– choosing only those marked-up as verse,...– choosing only those with the word “honor”
• This example illustrates selection of a document or subdocument, a core user requirement
© T. Horton 36
Benefits of Using Regions
• Provides a general model of things in texts– Markup such as SGML elements can be modeled
as regions– sgrep’s model of regions uses a concept of
nesting or inclusion, which is naturally useful• A good model can be quite powerful
– Spreadsheets: cells, rows, columns, ranges
© T. Horton 37
A General Model for Text Processing
• Text Object (TO): A thing that a user wants to identify and manipulate in a text.Examples: markup elements, tokens, syllables, user-defined regions (e.g. an “analogy”)
• Text Object Occurrence (TOO): an occurrence of a TO, represented by a region. (Or a region set if non-contiguous.) Examples:– the particular word-token found on line 3– the 2nd DIV1 markup element in the file– the next-to-last syllable of a given word– a region selected by the user using the mouse
© T. Horton 38
A General Model (continued)
• Text Object Occurrence List (TOO-List): a Text Object and a set of occurrences (TOOs) associated with it. Examples:– All word tokens– All occurrences of “the”– All pronouns– All occurrences of “he”– All syllables– All markup elements– All DIV1 elements– All elements with LANG attribute value = “DE”
© T. Horton 39
Manipulating TOO-Lists
• New lists can be found using sgrep-like operations– Occurrences of “the” in DIV1 elements:– All DIV1 elements where element attribute
ATTR="y"– Occurrences of syllable "-er" in pronouns.– DIV1 elements that contain pronouns.– Pronouns containing "er" spoken by Hamlet.– Occurrences of “the” or words with "-er" in that
chunk of text the user selected with the mouse and in DIV1 elements that are not in German.
• Or any combination of these!
© T. Horton 40
Where do TOO-Lists Come From?
• TOO-Lists can be found by software components:– An XML parser or sgrep could find all elements– A tokenizer could find all word-tokens– Markup tags could be used to identify all pronouns– A linguistic tool or an exhaustive dictionary could
be used to create a list of syllable occurrences– A user could select a set of regions by XML query,
mouse-selection, etc.• Some TOO-Lists are stored in a tool’s database• Some are calculated on the fly in response to
queries, operations
© T. Horton 41
Benefits: Supports Core Needs
• Does it support core user requirements?– Selection of documents and part of documents– Identify occurrences of text features– Allow queries and other operations on text
features within a document selection– Support common operations other than search:
• Identify, name and store• Output• Visualization• Number and count occurrences. Operations based on
sequencing. (E.g. find “next” word.)• Other processing or analysis. Linguistic, etc.
© T. Horton 42
Common View of Markup, Words, Etc.
• Markup and words (or other objects) have a common representation.
– Search for any Text Object (TO) is handled the same by the software
– Query language for the user may not be simple!
– There’s no difference between:• Finding markup unit containing a particular word
E.g. All occurrences of “love” in a DIV1
• Finding word containing a particular markup tagE.g. All word-tokens that have CORR tags
• A common representation simplifies software development, even if user perceives differences.
© T. Horton 43
Multifile Documents and Corpora
• sgrep finds patterns in files– A region is a part of a file
• We need more:– SGML/XML allows more than one file to comprise
a single document– Corpora and Digital Libraries
• Extend a region to include a document or file identifier:– A triple not a pair: (docID, start, end)– Requires a relative ordering of docIDs for multifile
documents
© T. Horton 44
Non-contiguous Text Objects
• We may wish to define a Text Object (TO) that cannot be represented by one single span. Examples:– Occurrence of verb forms in German– Some document-defined element– User-defined objects
• A Text Object Occurrence becomes a region set instead of a region.
• Modify primitive region operations like includes, contained in, etc. to work on two region sets
© T. Horton 45
Integrating Markup- and Region-Centered Approaches
• Clearly hierarchical structure of SGML/XML documents is powerful and necessary
• The Document Object Model (DOM) for XML– A model for accessing XML structures by software– Defines APIs for accessing nodes, subtrees, etc.– XML parsers can build a DOM– Built-on-the-fly and temporary vs. persistent
• Query languages for XML documents– Retrieves a node, a subtree, or a set of nodes.– Not yet standard. Proposals, like XSL.
© T. Horton 46
Integrating the Two Approaches
• Region view can address aspects not well-addressed in a DOM:– Objects at a finer grain of detail than XML node– Objects defined without markup
• DOM nodes correspond to regions– A DOM implementation could include region-info– Or, a separate index could be created to map DOM
node-IDs to regions• Thus, a software tool could have two integrated
databases reflecting both views– Allowing XSL-like queries by the user– Allowing sgrep-like manipulations too
© T. Horton 47
Overlapping Hierarchies
• Region-based view not limited by hierarchical structures
• Example: consider use of milestones to indicate verses in a play where speeches are marked-up using tags.
– A verse may span several speeches
– A verse may not start or end at a speech boundary
• Perhaps we wish to do some operation on word tokens by speech and also by verse (e.g. word-list of all words, find first word, etc.)
• Regions used to represent words, and also
– SPEECH elements in SGML/XML
– Text between verse-begin and verse-end milestone tags
© T. Horton 48
Example: Overlapping Hierarchies
Words:
Verses:
Lines
Text:
Words:
Text:
Words in lines:
Words in verses:
© T. Horton 49
Numbering or Sequencing Elements
• Relative position of Text Objects is needed:– Find all words after “of”.– Find occurrences of “in the”.– Select first and second speeches of the first two
scenes of each act in a play.– Select successive 500-word blocks.– Find the first noun (per markup) in each sentence.– Find the first verb (per markup) after each noun
(per markup).– Find the next syllable after “er” (if any) in words.
© T. Horton 50
Problem: How to Number Things?
• How to handle markup boundaries?– Is the next word after the last word in a sentence
the next word?– What if there’s a chapter break?
• Intervening markup elements. E.g. a footnote.• In critical editions, how to handle tags for SIC/CORR,
REG/ORIG, GAP/ADD/DEL etc.
© T. Horton 51
Sequencing of TOs using Region Sets
• Goal: Create a sequence of Text Objects (TOs)• Define a Sequence Context to be the set of “units”
inside of which we wish to number.– Each “unit” might be a contiguous region, or it might
be a region set– Thus the Sequence Context is a sequence of region
sets• To sequence occurrences of a given TO, process its
region-set against the Sequence Context– Ignore TO Occurrences not in the context– Optionally restart counter for each region-set in the
context
© T. Horton 52
Example: Sequencing TOs
TOs:
Context 2:
Context 1
Text:
TOs:
Text:
Numbered TOs in Some Context, Ignoring Some:
1 2 1 2 3 1 2
1 2 3 1 2 3 1 2
Numbered TOs in Another Context:
© T. Horton 53
Visualization
• Previous slide shows a simple visualization of TOs inside of regions
• Potential for a general model of visualization of TOs (words, markup elements etc.) in relation to other TOs (regions, markup elements, etc.)
• Examples:– Show presence or absence of some TO in a
selected region– Counts of how many occurrences in each region
• Given a region-based database of TO Occurrences, we have a dynamic and flexible way to visualize all TOs known to the tool
© T. Horton 54
Tilebars
• Marti Hearst developed tilebars as a method for visualizing query results
• Dimensions:– documents,– terms (text objects),– segments (regions)
• Shading of each cell shows strength of occurrence of a term
• Adjacent shaded cells show co-occurrence of terms
© T. Horton 55
Tilebars (example)
© T. Horton 56
Conclusions
• Regions and region sets show promise as a general model for representing all types of Text Objects– Integrates markup and other document features
such as word tokens• Supports selection of documents and subdocuments
– Apply operations (display, count, remember, etc) to a selection
• Should integrate with markup-based document models (e.g. the XML DOM)
• Supports numbering/sequencing and visualization techniques.
© T. Horton 57
What’s Next?
• A proof-of-concept application to show the usefulness of the approach
• Development of software components based on regions to support the “middle-level” of software architectures
• How to best store region-based data for use by software applications
• Develop a model for distributed text-bases based on Text Objects, TOO-Lists, etc.– Implement this for existing texts in an existing
Digital Library
© T. Horton 58
Educational Research in CS/SE
• We have a CS Education Group here at UVa– Teaching Faculty: Horton, Milner, Anderson, Prey– Also, others like Knight, Evans, Cohoon
• Education topics are suitable for a Senior Thesis!
© T. Horton 59
Education Projects
• SW Engineering Education (Horton and Knight)– Share across the Net with other universities
• CS201 camera client/server system• CS340 Robot systems?
– Web Repository for educational artifacts• MySQL and PHP
– Winding down for now, but future could include:• Case studies for use in CS202, CS340, etc.• Pair-programming (like XP) in CS101?• New robot-like project suitable for CS340?
© T. Horton 60
Projects (cont’d)
• Next Generation Computer Labs (Knight, Horton, Milner, Anderson)– Focuses on CS101 and CS201– Do they have to be in lab? If so, what should they
be doing as a group with TAs?– Can we use technology better?
• Netmeeting, Instant Messaging, Web discussion boards• Collaborative learning: small groups, using a Web board
etc.
– Studio Labs (like for CS340) in CS201?– Evaluate effectiveness (surveys, interviews, etc.)
© T. Horton 61
Projects (cont’d)
• Fourth-year course/project that does:– Larger scale system development– Distributed computing
• Client/server• Database
– Integration of custom code with existing components
• Database, middle-ware, COTS components
– Uses .NET, SOAP/XML, etc.
• What would this course or project look like?