© T. Horton1 Object-oriented Models for Text and Document Processing Dr. Tom Horton Dept. of...

© T. Horton 1

Object-oriented Models forText and Document Processing

Dr. Tom Horton

Dept. of Computer Science and Engineering

Florida Atlantic University

Boca Raton, FL 33431

Email: [email protected]

© T. Horton 2

My Areas of Interest

• CS and Software Engineering Education

• Software Engineering and Development

• Humanities Computing

© T. Horton 3

Software Engineering and Development

• My interests in these topics are primarily:– How to teach such things better– How to apply them to certain problem areas

• Object-oriented analysis and design– Stuff taught in CS494 (patterns, architecture)– SW architecture (particularly in problem domains)– GUI development, testing– HCI and usability (CS305)

• including evaluation experiments)• Pen and ink in tablet PCs

© T. Horton 4

CS and SW Engin. Education

• Important: Needs to be an education research oriented project, which means:– create something new– try it out experimentally– evaluate results

• Doing this with real subjects can:– require lots of advance planning– be time-consuming– be messy

© T. Horton 5

CS and SW Engin. Education

• Possible topic areas:– How do beginning students learn how to code, or

use tools?– How do more experienced students learn how to

design, debug, problem-solve?– Usability of tools (like IDEs, debuggers, design

tools)– Creating tools or environments to support

education• collecting feedback on student learning, etc.

© T. Horton 6

Humanities Computing:An “Old” User Community

• 1964 Literary Data Processing Conference– Papers on corpus preparation, stylistics,

dictionaries– Most common software tool: concordance

program• Joseph Raben’s research problem:

– Given two texts, find pairs of sentences that contain verbal echoes

– Shelley’s Prometheus Unbound heavily influenced by the language of Milton’s Paradise Lost

– Raben’s papers emphasize algorithm

© T. Horton 7

Humanities Computing

• Applying software, algorithms, etc. to problems of interest to humanities scholars– In particular, literary text analysis– Tools for finding things, features, “chunks”, or

showing relationships between texts• Data mining and text mining• Information visualization• XML processing, databases of features from

documents structured in XML

• Working with real users, real problems here at UVa

© T. Horton 8

Overview

• Background: text processing, humanities computing• Domain analysis and engineering• Applying DA to text processing• A common architecture

– Regions– Object model

• Some benefits– Relation to XML solutions– Query visualization

• Conclusions

© T. Horton 9

Background and Assumptions

• More and better software tools are possible• Modern markup makes tool development harder

– So many new possibilities for using markup! How?– Processing SGML has not been simple

• New developments will help:– XML, XSL, Document Object Model (DOM)– Java, Unicode, SGML/XML tool support

• Perhaps it’s the time for tools to catch up with markup

© T. Horton 10

My Interests

• Improving our ability to develop new text processing and analysis software tools for humanities users– Based on my viewpoint as a software engineer– Includes study of user requirements, designs,

frameworks, reusable components• We could develop a family of software tools that:

– satisfy common core requirements in the same way

– share common core concepts and approaches– are based on a general model of text processing– that use a flexible software architecture

© T. Horton 11

Background: Text Processing and Humanities Computing

• Users: scholars studying texts, linguists, etc. for the purpose of – preparing editions,– carrying out stylistic or authorship analyses,– finding relationships between multiple texts,– finding parts of a text (perhaps in a large corpus)

with certain traits,• Characteristics:

– Texts often not modern English– Mark-up such as SGML or XML very important

© T. Horton 12

An “Old” User Community

• 1964 Literary Data Processing Conference– Papers on corpus preparation, stylistics,

dictionaries– Most common software tool: concordance

program• Joseph Raben’s research problem:

– Given two texts, find pairs of sentences that contain verbal echoes

– Shelley’s Prometheus Unbound heavily influenced by the language of Milton’s Paradise Lost

– Raben’s papers emphasize algorithm

© T. Horton 13

Software Tools Needed

• Few software tools have been developed– The user community recognizes this a major

problem– Example: No program exists to solve Raben’s

problem• Reasons:

– Community is dispersed and is not resource rich– Supporting multi-lingual definitions of alphabets,

collation sequence, etc. and their output– Recently: SGML markup adds complexity– Users need good user interfaces

© T. Horton 14

Existing Software Tools

• Concordance programs– Define text characteristics– Find and group words, in variety of formats

• Text retrieval systems– Centered on data repository– Possibly client-server (e.g. Sara)– Interactive (not batch) so promotes text

exploration– Work at finer level of text granularity than

traditional information retrieval systems

© T. Horton 15

Existing Software Tools (2)

• Collation programs– Process and group two or more related texts

© T. Horton 16

Domain Engineering and Text Processing

• Use domain-specific approach for software development for the text analysis domain.

• Develop common definitions of requirements, objects, desired features, etc.

• We should work to develop requirements, architectures and components that are– reusable– not limited to one environment or architectural

approach

© T. Horton 17

Domain Engineering

• Domain engineering is an approach to support systematic reuse when developing multiple systems in a product family

• Domain: a problem space for a family of applications with similar requirements

• Domain boundary: separates one domain from another (possibly related domain)

• Subdomain: a subset of a domain that describes related components or assets

© T. Horton 18

Software Reuse

• Three levels of reuse– ad hoc; opportunistic; and, systematic

• Systematic reuse:– well-planned; cost-effective; part of an

organization’s process– requires infrastructure, asset management,

culture-change, standards, policies• Question: Can we hope for systematic reuse in the

text processing domain? Probably not:– No “central” development organization

© T. Horton 19

Domain Engineering Model

DomainAnalysis

DomainDesign

DomainRepository

DomainImplementation

© T. Horton 20

Domain Analysis

• A “metalevel” version of conventional requirements analysis– Define a domain engineering lifecycle similar to

software engineering, but with the goal of developing reusable requirements, designs, components, etc.

– Output of domain analysis is used in the next stage to “design” reusable architectures and components that can be used in many systems.

• Similar in concept to knowledge engineering in expert systems development

© T. Horton 21

Domain Analysis Activities

• Find and agree upon:– domain definition and boundary

– common objects, functions, features

– user characteristics

– scenarios to determine what are common interactions

• Outputs: Domain model– common vocabulary– generic requirements that apply to more than one

system– combinations and trade-offs of features

• Form: text, UML diagrams, etc.

© T. Horton 22

Approaches to Domain Analysis

• Two views of domain analysis:– Study the domain top-down, as an abstract

problem.– Study existing products and systems.

• For second approach:– Analyze existing software applications– What do they have in common?– Why are there differences? “Accidental” or

necessary?

© T. Horton 23

How to Use the Domain Model

• As a shared language for describing existing systems (or new systems)

• To define product requirements for new systems to be developed– derived requirements

• To design a reusable domain architecture and smaller reusable components

© T. Horton 24

Applying All This to Text Processing

• Broad categories of common requirements:

– Definition and description of words and text structure.

– Initial text processing, e.g. recognizing tokens, mark-up

– Defining and selecting context or regions within a text

– Retrieval or search.

– Document organization

– Quantitative results, e.g. counts, statistics.

– User-interface issues (displaying context; displaying tagged text; export of information)

• The following slides give examples of reusable components and systems they might produce.

© T. Horton 25

How is Mark-up Used?

• Highest level of description (abstraction):– (A) To select a subdocument(s), or restrict an

action to part of a document.– (B) To identify a location in a text for output.

• Reference identifiers in KWIC, etc.

– (C) To navigate within a text, or reference next entity, parent, etc.

• Get next sentence, or get chapter number for current sentence.

© T. Horton 26

How Mark-up is Used? (cont’d)

– (D) To contain a word-level alternative for some part of the text for processing.

• E.g., in TEI: Bob likes <corr sic=“it’s”> its </corr> name.

– (E) To qualify a word to distinguish it from words with same orthographic representation.

• E.g. <mentioned lang=fra>Savoir-faire</mentioned> is French for know-how. <name>Mark</name> put his mark there.

© T. Horton 27

SW Requirements To Support These Needs

• Store and navigate an SGML/XML element tree (Needs A, C)

• Given a location, find set of elements its stored in (B, C)

• Retrieve elements based on attribute values (Need A)– E.g. all elements of any type with lang=fra

© T. Horton 28

SW Requirements (cont’d)

• For word extraction from PCdata:– Choose among alternatives for <corr>, <del>, etc.– Distinguish words with same string values but with

mark-up for name, foreign, etc.

© T. Horton 29

Reusable Requirements

• (R1) Standards for word, character set, alphabet, etc.• (R2) Definitions and requirements for selecting a

region or sub-document in a text.• (R3) Search, defined in terms of abstract concepts

relating to words, regions, mark-up, etc.

© T. Horton 30

Reusable Designs, Components, Interfaces, Etc.

• (D1) SGML or XML front-end components.• (D2) Data structure models for R1, word and

alphabet standards.• (D3) Fuzzy word matching algorithm and

implementation.• (D4) An API for a text-base query language (XML

XPATH)• (D5) A design and reusable code for a text database

management system (Berkeley DB)• (D6) An API for document structure manipulation

(XML DOM).

© T. Horton 31

Programs and Tools That Could Result

• Repository-centered systems

– New TACT: a information-retrieval like system

– WordCluster: find parts of a document with concentrations of words from one or more categories or themes.

– PageTurner: a program to manipulate and “conditionally” display a text according to its mark-up

• Pipeline (or pipe and filter) systems:

– New concordance-like programs

– TextComp: an implementation of Raben’s solution for finding passages in a pair of texts that echo each other

• Others might include: Collate, a SGML Tag Editor

© T. Horton 32

A Region-based Architecture

• Used by sgrep, a command-line search (or grep) that:– models “things” in files as regions (AKA spans)– similar to TIPSTER Info. Retr. architecture (GATE)

• Why isn’t a markup-centered approach enough?– The DOM for XML provides a model and an API

for software to manipulate structured documents.– Query languages are coming: XQL or ???

• We need both: A markup-centered approach does not completely address:– finer-grained features not marked up– non-hierarchical features

© T. Horton 33

Definitions

• sgrep queries and results defined as:– region: a chunk of text defined by a starting and

ending byte-position in a file– region sets: a collection of regions

• Region sets can be combined in powerful ways– sgrep finds nested regions– given two region sets, finds subset of one that is

contained in another– given two region-sets, finds subset of one that

contains some member of the other– sets can be merged etc.

© T. Horton 34

Example: Regions and Region Sets

“honor”:Text:

(1) All Occurrences of “honor”:

DIV1Text:

(2) All Occurrences of DIV1 elements:

Text:

(3) All Occurrences of “honor” in a DIV1 element:

“honor”:

© T. Horton 35

Region Examples

• Region sets:– all words in a text; all characters; all syllables– all occurrences of a given token– all DIV1 elements– all elements that have attribute with a given value

• Queries or subtext selection. Examples: Let’s find:– Speeches by Hamlet in Acts 4 and 5,...– choosing only those marked-up as verse,...– choosing only those with the word “honor”

• This example illustrates selection of a document or subdocument, a core user requirement

© T. Horton 36

Benefits of Using Regions

• Provides a general model of things in texts– Markup such as SGML elements can be modeled

as regions– sgrep’s model of regions uses a concept of

nesting or inclusion, which is naturally useful• A good model can be quite powerful

– Spreadsheets: cells, rows, columns, ranges

© T. Horton 37

A General Model for Text Processing

• Text Object (TO): A thing that a user wants to identify and manipulate in a text.Examples: markup elements, tokens, syllables, user-defined regions (e.g. an “analogy”)

• Text Object Occurrence (TOO): an occurrence of a TO, represented by a region. (Or a region set if non-contiguous.) Examples:– the particular word-token found on line 3– the 2nd DIV1 markup element in the file– the next-to-last syllable of a given word– a region selected by the user using the mouse

© T. Horton 38

A General Model (continued)

• Text Object Occurrence List (TOO-List): a Text Object and a set of occurrences (TOOs) associated with it. Examples:– All word tokens– All occurrences of “the”– All pronouns– All occurrences of “he”– All syllables– All markup elements– All DIV1 elements– All elements with LANG attribute value = “DE”

© T. Horton 39

Manipulating TOO-Lists

• New lists can be found using sgrep-like operations– Occurrences of “the” in DIV1 elements:– All DIV1 elements where element attribute

ATTR="y"– Occurrences of syllable "-er" in pronouns.– DIV1 elements that contain pronouns.– Pronouns containing "er" spoken by Hamlet.– Occurrences of “the” or words with "-er" in that

chunk of text the user selected with the mouse and in DIV1 elements that are not in German.

• Or any combination of these!

© T. Horton 40

Where do TOO-Lists Come From?

• TOO-Lists can be found by software components:– An XML parser or sgrep could find all elements– A tokenizer could find all word-tokens– Markup tags could be used to identify all pronouns– A linguistic tool or an exhaustive dictionary could

be used to create a list of syllable occurrences– A user could select a set of regions by XML query,

mouse-selection, etc.• Some TOO-Lists are stored in a tool’s database• Some are calculated on the fly in response to

queries, operations

© T. Horton 41

Benefits: Supports Core Needs

• Does it support core user requirements?– Selection of documents and part of documents– Identify occurrences of text features– Allow queries and other operations on text

features within a document selection– Support common operations other than search:

• Identify, name and store• Output• Visualization• Number and count occurrences. Operations based on

sequencing. (E.g. find “next” word.)• Other processing or analysis. Linguistic, etc.

© T. Horton 42

Common View of Markup, Words, Etc.

• Markup and words (or other objects) have a common representation.

– Search for any Text Object (TO) is handled the same by the software

– Query language for the user may not be simple!

– There’s no difference between:• Finding markup unit containing a particular word

E.g. All occurrences of “love” in a DIV1

• Finding word containing a particular markup tagE.g. All word-tokens that have CORR tags

• A common representation simplifies software development, even if user perceives differences.

© T. Horton 43

Multifile Documents and Corpora

• sgrep finds patterns in files– A region is a part of a file

• We need more:– SGML/XML allows more than one file to comprise

a single document– Corpora and Digital Libraries

• Extend a region to include a document or file identifier:– A triple not a pair: (docID, start, end)– Requires a relative ordering of docIDs for multifile

documents

© T. Horton 44

Non-contiguous Text Objects

• We may wish to define a Text Object (TO) that cannot be represented by one single span. Examples:– Occurrence of verb forms in German– Some document-defined element– User-defined objects

• A Text Object Occurrence becomes a region set instead of a region.

• Modify primitive region operations like includes, contained in, etc. to work on two region sets

© T. Horton 45

Integrating Markup- and Region-Centered Approaches

• Clearly hierarchical structure of SGML/XML documents is powerful and necessary

• The Document Object Model (DOM) for XML– A model for accessing XML structures by software– Defines APIs for accessing nodes, subtrees, etc.– XML parsers can build a DOM– Built-on-the-fly and temporary vs. persistent

• Query languages for XML documents– Retrieves a node, a subtree, or a set of nodes.– Not yet standard. Proposals, like XSL.

© T. Horton 46

Integrating the Two Approaches

• Region view can address aspects not well-addressed in a DOM:– Objects at a finer grain of detail than XML node– Objects defined without markup

• DOM nodes correspond to regions– A DOM implementation could include region-info– Or, a separate index could be created to map DOM

node-IDs to regions• Thus, a software tool could have two integrated

databases reflecting both views– Allowing XSL-like queries by the user– Allowing sgrep-like manipulations too

© T. Horton 47

Overlapping Hierarchies

• Region-based view not limited by hierarchical structures

• Example: consider use of milestones to indicate verses in a play where speeches are marked-up using tags.

– A verse may span several speeches

– A verse may not start or end at a speech boundary

• Perhaps we wish to do some operation on word tokens by speech and also by verse (e.g. word-list of all words, find first word, etc.)

• Regions used to represent words, and also

– SPEECH elements in SGML/XML

– Text between verse-begin and verse-end milestone tags

© T. Horton 48

Example: Overlapping Hierarchies

Words:

Verses:

Lines

Text:

Words:

Text:

Words in lines:

Words in verses:

© T. Horton 49

Numbering or Sequencing Elements

• Relative position of Text Objects is needed:– Find all words after “of”.– Find occurrences of “in the”.– Select first and second speeches of the first two

scenes of each act in a play.– Select successive 500-word blocks.– Find the first noun (per markup) in each sentence.– Find the first verb (per markup) after each noun

(per markup).– Find the next syllable after “er” (if any) in words.

© T. Horton 50

Problem: How to Number Things?

• How to handle markup boundaries?– Is the next word after the last word in a sentence

the next word?– What if there’s a chapter break?

• Intervening markup elements. E.g. a footnote.• In critical editions, how to handle tags for SIC/CORR,

REG/ORIG, GAP/ADD/DEL etc.

© T. Horton 51

Sequencing of TOs using Region Sets

• Goal: Create a sequence of Text Objects (TOs)• Define a Sequence Context to be the set of “units”

inside of which we wish to number.– Each “unit” might be a contiguous region, or it might

be a region set– Thus the Sequence Context is a sequence of region

sets• To sequence occurrences of a given TO, process its

region-set against the Sequence Context– Ignore TO Occurrences not in the context– Optionally restart counter for each region-set in the

context

© T. Horton 52

Example: Sequencing TOs

TOs:

Context 2:

Context 1

Text:

TOs:

Text:

Numbered TOs in Some Context, Ignoring Some:

1 2 1 2 3 1 2

1 2 3 1 2 3 1 2

Numbered TOs in Another Context:

© T. Horton 53

Visualization

• Previous slide shows a simple visualization of TOs inside of regions

• Potential for a general model of visualization of TOs (words, markup elements etc.) in relation to other TOs (regions, markup elements, etc.)

• Examples:– Show presence or absence of some TO in a

selected region– Counts of how many occurrences in each region

• Given a region-based database of TO Occurrences, we have a dynamic and flexible way to visualize all TOs known to the tool

© T. Horton 54

Tilebars

• Marti Hearst developed tilebars as a method for visualizing query results

• Dimensions:– documents,– terms (text objects),– segments (regions)

• Shading of each cell shows strength of occurrence of a term

• Adjacent shaded cells show co-occurrence of terms

© T. Horton 56

Conclusions

• Regions and region sets show promise as a general model for representing all types of Text Objects– Integrates markup and other document features

such as word tokens• Supports selection of documents and subdocuments

– Apply operations (display, count, remember, etc) to a selection

• Should integrate with markup-based document models (e.g. the XML DOM)

• Supports numbering/sequencing and visualization techniques.

© T. Horton 57

What’s Next?

• A proof-of-concept application to show the usefulness of the approach

• Development of software components based on regions to support the “middle-level” of software architectures

• How to best store region-based data for use by software applications

• Develop a model for distributed text-bases based on Text Objects, TOO-Lists, etc.– Implement this for existing texts in an existing

Digital Library

© T. Horton 58

Educational Research in CS/SE

• We have a CS Education Group here at UVa– Teaching Faculty: Horton, Milner, Anderson, Prey– Also, others like Knight, Evans, Cohoon

• Education topics are suitable for a Senior Thesis!

© T. Horton 59

Education Projects

• SW Engineering Education (Horton and Knight)– Share across the Net with other universities

• CS201 camera client/server system• CS340 Robot systems?

– Web Repository for educational artifacts• MySQL and PHP

– Winding down for now, but future could include:• Case studies for use in CS202, CS340, etc.• Pair-programming (like XP) in CS101?• New robot-like project suitable for CS340?

© T. Horton 60

Projects (cont’d)

• Next Generation Computer Labs (Knight, Horton, Milner, Anderson)– Focuses on CS101 and CS201– Do they have to be in lab? If so, what should they

be doing as a group with TAs?– Can we use technology better?

• Netmeeting, Instant Messaging, Web discussion boards• Collaborative learning: small groups, using a Web board

etc.

– Studio Labs (like for CS340) in CS201?– Evaluate effectiveness (surveys, interviews, etc.)

© T. Horton 61

Projects (cont’d)

• Fourth-year course/project that does:– Larger scale system development– Distributed computing

• Client/server• Database

– Integration of custom code with existing components

• Database, middle-ware, COTS components

– Uses .NET, SOAP/XML, etc.

• What would this course or project look like?

© T. Horton1 Object-oriented Models for Text and Document Processing Dr. Tom Horton Dept. of...

Documents

Transcript of © T. Horton1 Object-oriented Models for Text and Document Processing Dr. Tom Horton Dept. of...