Text into Data
Prof Alvarado
MDST 3705
19 February 2013
Business
• Quiz 1 graded
– Let me know if you have questions
• Readings
– Apologies for mis-posting!
Review
• Last week, we took a 30,000 foot view of the use of databases in the digital humanities
– We found that databases are everywhere
• Databases form the foundation of all projects
– Even if a database management system is not used
• Relational databases are sophisticated and mature choices for foundations
Overview
• We began this course by looking at code as language
– Code structured like natural language
– Code implies, models, and creates a world
• We then looked at the opposite process – looking at language, and the products of culture, as code
– We called this “reverse engineering”
• Today we continue this and look specifically at text
What do you remember when you read a book?
We remember scenes, images, plot lines, values, etc.
We sometimes remember verbatim passages
We don’t normally remember the words
We get much of our culture through books (and other "cultural models" in Colby's words)
Like cigarettes, books are a delivery mechanism
(not of nicotine, but of culture)
Colby's theory
TEXTS
CULTURE
If texts contain cultural meanings . . .
How do we get to them?
How do we represent them?
Models of Text
Competing Approaches
• A common approach to modeling text is to use XML
– XML is like HTML, but more general
– It allows you to mark up a text
• XML assumes a text is like a tree
– An “ordered hierarchy of content objects”
• XML was also specifically designed to work with text
XML looks like this
Notice how the element names reference units, not layout or style
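To make the tree idea concrete, here is a minimal sketch in Python. The markup is hypothetical: the element names (poem, stanza, line) and the sample text are invented for illustration and are not taken from the slide. It parses the XML and prints the hierarchy of units.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; element names and content are illustrative only.
SAMPLE = """<poem>
  <stanza n="1">
    <line n="1">First line of the stanza</line>
    <line n="2">Second line of the stanza</line>
  </stanza>
  <stanza n="2">
    <line n="3">A line in the next stanza</line>
  </stanza>
</poem>"""

def outline(element, depth=0):
    """Return the tree as a list of element names, indented by depth."""
    rows = ["  " * depth + element.tag]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(SAMPLE)
tree_outline = outline(root)
print("\n".join(tree_outline))
```

Every element sits inside exactly one parent, which is what "ordered hierarchy of content objects" means in practice.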
Text as Tree
XML turns out to be very useful for defining the physical or logical structure of a text, but not for figures and meanings
Texts are actually more like networks
This image shows three "figures" in the text of an Old French poem. Note how they do not "nest" neatly into the structure of the text, but instead cross-cut it.
It is hard to model this kind of data with XML.
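A small sketch of why cross-cutting is a problem for a tree, assuming we describe every unit as a span of word positions. The spans and names below are invented. Two spans fit in one tree only if they are disjoint or one contains the other; a figure that starts in one line and ends in the next satisfies neither condition.

```python
# Spans are (first_word, last_word) positions, inclusive. All values are invented.
lines = {"line 1": (1, 8), "line 2": (9, 16)}
figure = ("hypothetical figure", (6, 11))  # starts in line 1, ends in line 2

def fits_in_tree(a, b):
    """True if two spans are disjoint or one fully contains the other."""
    (a1, a2), (b1, b2) = a, b
    disjoint = a2 < b1 or b2 < a1
    nested = (a1 <= b1 and b2 <= a2) or (b1 <= a1 and a2 <= b2)
    return disjoint or nested

# Lines whose boundaries the figure crosses, so it cannot be nested with them.
crossings = [name for name, span in lines.items()
             if not fits_in_tree(span, figure[1])]
print(crossings)
```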
Relational databases are a better choice for this since they are more abstract
The problem is, what data model to use?
How do you model text in a relational database?
Liu and Smith argue for a radical model, in which text is parsed at the word level
Each word gets its own record
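A minimal sketch of the one-record-per-word idea, using Python's built-in sqlite3 module. The table and column names are assumptions made for illustration; they are not Liu and Smith's schema, and the two sample lines are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word (
        word_id  INTEGER PRIMARY KEY,  -- running position in the whole text
        line_no  INTEGER NOT NULL,     -- which line the word belongs to
        position INTEGER NOT NULL,     -- position of the word within its line
        token    TEXT NOT NULL         -- the word itself
    )
""")

# Placeholder text; any list of lines would do.
text = ["a text is a signal", "culture is a transmitter"]

rows = []
for line_no, line in enumerate(text, start=1):
    for position, token in enumerate(line.split(), start=1):
        rows.append((line_no, position, token))
conn.executemany(
    "INSERT INTO word (line_no, position, token) VALUES (?, ?, ?)", rows
)

word_count = conn.execute("SELECT COUNT(*) FROM word").fetchone()[0]
print(word_count)
```

Splitting on whitespace is the crudest possible parser; a real project would also have to decide how to treat punctuation, elision, and spelling variants.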
The Princeton Charrette Project used a database-driven application called Figura
It was designed to represent the critical edition of an Old French poem along with the figural annotations of the text made by scholars
A “figure” is a figure of speech or rhetorical device, like rhyming or the use of chiasmus
The database stored information about grammar, manuscript images, figures, and other data that had been accumulated over the years prior to building the database
At the heart of the database is the text model that links figures to text
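One way such a link between figures and text could look is sketched below. This is an assumed, simplified schema, not Figura's actual design: a figure table, a word table, and a linking table that says which words belong to which figure. Because membership is stored per word, figures can overlap freely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word   (word_id INTEGER PRIMARY KEY, token TEXT NOT NULL);
    CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT NOT NULL);
    -- Linking table: one row per (figure, word) membership.
    CREATE TABLE figure_word (
        figure_id INTEGER NOT NULL REFERENCES figure(figure_id),
        word_id   INTEGER NOT NULL REFERENCES word(word_id),
        PRIMARY KEY (figure_id, word_id)
    );
""")

# Invented data: six placeholder words and two overlapping figures.
tokens = "one two three four five six".split()
conn.executemany("INSERT INTO word (word_id, token) VALUES (?, ?)",
                 enumerate(tokens, start=1))
conn.executemany("INSERT INTO figure VALUES (?, ?)",
                 [(1, "chiasmus"), (2, "rhyme")])
conn.executemany("INSERT INTO figure_word VALUES (?, ?)",
                 [(1, 2), (1, 3), (1, 4), (1, 5), (2, 4), (2, 6)])

# Words that take part in more than one figure.
shared = conn.execute("""
    SELECT w.token, COUNT(*) AS n_figures
    FROM figure_word fw
    JOIN word w ON w.word_id = fw.word_id
    GROUP BY w.word_id
    HAVING COUNT(*) > 1
""").fetchall()
print(shared)
```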
In my model and in Liu & Smith’s, the text becomes a database
The readable text is just a query
As is the index, table of contents, etc.
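The claim that the readable text and the index are both queries can be sketched as follows. The word table and its columns are assumptions for illustration, and the sample lines are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word (word_id INTEGER PRIMARY KEY, line_no INTEGER, token TEXT)"
)
source = ["a text is a signal", "culture is a transmitter"]  # placeholder text
conn.executemany(
    "INSERT INTO word (line_no, token) VALUES (?, ?)",
    [(n, tok) for n, line in enumerate(source, start=1) for tok in line.split()],
)

# The "readable text": all words in order, regrouped into lines.
rows = conn.execute("SELECT line_no, token FROM word ORDER BY word_id").fetchall()
by_line = {}
for line_no, token in rows:
    by_line.setdefault(line_no, []).append(token)
reading_text = [" ".join(tokens) for _, tokens in sorted(by_line.items())]

# An "index": each distinct word paired with every line it occurs in.
index = conn.execute(
    "SELECT DISTINCT token, line_no FROM word ORDER BY token, line_no"
).fetchall()

print(reading_text)
print(index)
```

Both views come from the same table; only the query differs.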
The database of words and figures can be read by a program to generate a visually rich and interactive edition on the web
But it can also be used to discover patterns in the text not visible to the reader
It can help us discover the cultural patterns that are “delivered” by the text to our brains
The results of a query showing the relationship between proper nouns (agents) and figure types
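A query of this kind might look roughly like the sketch below. The schema, the is_proper_noun flag, and all of the data (including the placeholder names AgentA and AgentB) are invented; the actual query and results on the slide are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (word_id INTEGER PRIMARY KEY, token TEXT, is_proper_noun INTEGER);
    CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT);
    CREATE TABLE figure_word (figure_id INTEGER, word_id INTEGER);

    -- Invented data: placeholder agents and figure memberships.
    INSERT INTO word VALUES
        (1, 'AgentA', 1), (2, 'filler', 0), (3, 'AgentB', 1),
        (4, 'AgentA', 1), (5, 'AgentA', 1);
    INSERT INTO figure VALUES (1, 'chiasmus'), (2, 'rhyme'), (3, 'chiasmus');
    INSERT INTO figure_word VALUES (1, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5);
""")

# Count how often each proper noun occurs inside each type of figure.
crosstab = conn.execute("""
    SELECT w.token, f.figure_type, COUNT(*) AS n
    FROM word w
    JOIN figure_word fw ON fw.word_id = w.word_id
    JOIN figure f ON f.figure_id = fw.figure_id
    WHERE w.is_proper_noun = 1
    GROUP BY w.token, f.figure_type
    ORDER BY w.token, f.figure_type
""").fetchall()

for agent, figure_type, n in crosstab:
    print(agent, figure_type, n)
```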
A structural reading of the data
Form and content are interwoven, each reinforcing the other
Form – the delivery system – is used to transmit the meaningful content, the stuff that remains in your brain after reading or hearing the story
This is a "hypergraph" of the same data, also easily generated from the database by code
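As a rough illustration of how code could turn such data into a hypergraph: in a hypergraph, one edge can join any number of nodes, so each figure can be treated as a hyperedge connecting all the agents it contains. The (figure, agent) pairs below are invented and stand in for the rows a database query might return.

```python
from collections import defaultdict

# (figure_id, agent) pairs, as a query might return them. Invented data.
memberships = [
    (1, "AgentA"), (1, "AgentB"),
    (2, "AgentB"), (2, "AgentC"), (2, "AgentA"),
]

# Hyperedges: each figure joins a whole set of agents at once.
hyperedges = defaultdict(set)
for figure_id, agent in memberships:
    hyperedges[figure_id].add(agent)

# Incidence: for each agent, the figures it takes part in.
incidence = defaultdict(set)
for figure_id, agents in hyperedges.items():
    for agent in agents:
        incidence[agent].add(figure_id)

for figure_id, agents in sorted(hyperedges.items()):
    print(figure_id, sorted(agents))
```

Drawing the graph would need a plotting library; the sketch stops at the data structure.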
Text is like this
http://anthonyflo.tumblr.com/post/7590868323/photographer-and-self-described-geek-of-maps
A text is a signal
Culture is a transmitter