Mdst3705 2013-02-19-text-into-data

Text into Data Prof Alvarado MDST 3705 19 February 2013



Page 1: Mdst3705 2013-02-19-text-into-data

Text into Data

Prof Alvarado

MDST 3705

19 February 2013

Page 2: Mdst3705 2013-02-19-text-into-data

Business

• Quiz 1 graded
  – Let me know if you have questions
• Readings
  – Apologies for mis-posting!

Page 3: Mdst3705 2013-02-19-text-into-data

Review

• Last week, we took a 30,000-foot view of the use of databases in the digital humanities
  – We found that databases are everywhere
• Databases form the foundation of all projects
  – Even if a database management system is not used
• Relational databases are sophisticated and mature choices for foundations

Page 4: Mdst3705 2013-02-19-text-into-data

Overview

• We began this course by looking at code as language
  – Code structured like natural language
  – Code implies, models, and creates a world
• We then looked at the opposite process – looking at language, and the products of culture, as code
  – We called this “reverse engineering”
• Today we continue this and look specifically at text

Page 5: Mdst3705 2013-02-19-text-into-data

What do you remember when you read a book?

Page 6: Mdst3705 2013-02-19-text-into-data

We remember scenes, images, plot lines, values, etc.

We sometimes remember verbatim passages

We don’t normally remember the words

Page 7: Mdst3705 2013-02-19-text-into-data

We get much of our culture through books (and other "cultural models" in Colby's words)

Page 8: Mdst3705 2013-02-19-text-into-data

Like cigarettes, books are a delivery mechanism

(not of nicotine, but of culture)

Page 9: Mdst3705 2013-02-19-text-into-data

Colby's theory

TEXTS

CULTURE

Page 10: Mdst3705 2013-02-19-text-into-data

If texts contain cultural meanings . . .

How do we get to them?

How do we represent them?

Page 11: Mdst3705 2013-02-19-text-into-data

Models of Text

Page 12: Mdst3705 2013-02-19-text-into-data

Competing Approaches

• A common approach to modeling text is to use XML
  – XML is like HTML, but more general
  – It allows you to mark up a text
• XML assumes a text is like a tree
  – An “ordered hierarchy of content objects”
• XML was also specifically designed to work with text

Page 13: Mdst3705 2013-02-19-text-into-data

XML looks like this

Notice how the element names reference units, not layout or style
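
The slide image is not reproduced in this transcript. As a stand-in, here is a minimal sketch of the idea in Python, using the standard library's ElementTree parser. The markup and the element names (poem, stanza, line) are invented for illustration and are not taken from the slide. The point is that the names describe units of the text, and that parsing yields an ordered tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names are illustrative assumptions.
# Note that they name units of text (poem, stanza, line), not fonts or layout.
SAMPLE = """
<poem title="Example">
  <stanza n="1">
    <line n="1">First line of the poem</line>
    <line n="2">Second line of the poem</line>
  </stanza>
  <stanza n="2">
    <line n="3">Third line of the poem</line>
  </stanza>
</poem>
"""

def outline(element, depth=0):
    """Return one indented row per element, in document order."""
    label = element.tag
    if "n" in element.attrib:
        label += " " + element.attrib["n"]
    rows = ["  " * depth + label]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(SAMPLE)
tree_rows = outline(root)
print("\n".join(tree_rows))
```

Every element sits inside exactly one parent, which is what "text as tree" means in practice.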

Page 14: Mdst3705 2013-02-19-text-into-data

Text as Tree

Page 15: Mdst3705 2013-02-19-text-into-data

XML turns out to be very useful for defining the physical or logical structure of a text, but not for figures and meanings

Texts are actually more like networks

Page 16: Mdst3705 2013-02-19-text-into-data

This image shows three "figures" in the text of an Old French poem. Note how they do not "nest" neatly into the structure of the text, but instead cross-cut it.

It is hard to model this kind of data with XML.
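
One way to see the difficulty: two units can live in the same tree only if they are disjoint or one sits wholly inside the other. The sketch below treats units as inclusive ranges of word offsets and classifies a pair of ranges. The offsets are hypothetical and do not come from the poem.

```python
def relation(a, b):
    """Classify two inclusive (start, end) word-offset spans."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start or b_end < a_start:
        return "disjoint"
    a_contains_b = a_start <= b_start and b_end <= a_end
    b_contains_a = b_start <= a_start and a_end <= b_end
    if a_contains_b or b_contains_a:
        return "nested"
    # Partial overlap: neither span can be a child of the other in a tree.
    return "crossing"

# Hypothetical spans, invented for illustration.
verse_line_1 = (0, 7)
verse_line_2 = (8, 15)
figure = (5, 10)  # starts inside line 1 and ends inside line 2

print(relation(verse_line_1, verse_line_2))
print(relation(verse_line_1, figure))
```

A "crossing" pair is exactly the case a single XML hierarchy cannot express directly.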

Page 17: Mdst3705 2013-02-19-text-into-data

Relational databases are a better choice for this since they are more abstract

The problem is, what data model to use?

How do you model text in a relational database?

Page 18: Mdst3705 2013-02-19-text-into-data

Liu and Smith argue for a radical model, in which text is parsed at the word level

Each word gets its own record
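
A minimal sketch of what "one record per word" could look like, using Python's built-in sqlite3 module. The table name, column names, and sample sentences are assumptions made for illustration. They are not Liu and Smith's actual schema.

```python
import sqlite3

# Invented sample text; not drawn from any edition.
TEXT = [
    "the knight rode to the bridge",
    "the queen watched from the tower",
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word (
        word_id  INTEGER PRIMARY KEY,  -- assigned automatically
        line_no  INTEGER NOT NULL,     -- which line the word occurs in
        position INTEGER NOT NULL,     -- order of the word within the line
        form     TEXT NOT NULL         -- the word as written
    )
""")

# Split each line on whitespace and store every token as its own row.
rows = [
    (line_no, position, form)
    for line_no, line in enumerate(TEXT, start=1)
    for position, form in enumerate(line.split(), start=1)
]
conn.executemany(
    "INSERT INTO word (line_no, position, form) VALUES (?, ?, ?)", rows
)

word_count = conn.execute("SELECT COUNT(*) FROM word").fetchone()[0]
the_count = conn.execute(
    "SELECT COUNT(*) FROM word WHERE form = 'the'"
).fetchone()[0]
print(word_count, the_count)
```

Once words are rows, any other information (grammar, figures, images) can point at them by word_id.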

Page 19: Mdst3705 2013-02-19-text-into-data
Page 20: Mdst3705 2013-02-19-text-into-data

The Princeton Charrette Project used a database-driven application called Figura

It was designed to represent the critical edition of an Old French poem along with the figural annotations of the text made by scholars

A “figure” is a figure of speech or rhetorical device, like rhyming or the use of chiasmus

Page 21: Mdst3705 2013-02-19-text-into-data
Page 22: Mdst3705 2013-02-19-text-into-data
Page 23: Mdst3705 2013-02-19-text-into-data

The database stored information about grammar, manuscript images, figures, and other data that had been accumulated over the years prior to building the database

Page 24: Mdst3705 2013-02-19-text-into-data

At the heart of the database is the text model that links figures to text
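
The diagram for this slide is not in the transcript. A hedged sketch of one plausible shape for such a model: a word table, a figure table, and a linking table between them. All table names, column names, and data here are hypothetical, not Figura's actual schema. Because a figure is linked to individual words, it can span lines freely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (
        word_id  INTEGER PRIMARY KEY,
        line_no  INTEGER NOT NULL,
        position INTEGER NOT NULL,
        form     TEXT NOT NULL
    );
    CREATE TABLE figure (
        figure_id   INTEGER PRIMARY KEY,
        figure_type TEXT NOT NULL
    );
    -- Many-to-many link: a figure covers many words, a word can be in many figures.
    CREATE TABLE figure_word (
        figure_id INTEGER NOT NULL REFERENCES figure(figure_id),
        word_id   INTEGER NOT NULL REFERENCES word(word_id),
        PRIMARY KEY (figure_id, word_id)
    );
""")

# Invented two-line sample: "the knight rode / rode the knight".
conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?)", [
    (1, 1, 1, "the"), (2, 1, 2, "knight"), (3, 1, 3, "rode"),
    (4, 2, 1, "rode"), (5, 2, 2, "the"), (6, 2, 3, "knight"),
])
conn.executemany("INSERT INTO figure VALUES (?, ?)", [
    (1, "chiasmus"), (2, "repetition"),
])
conn.executemany("INSERT INTO figure_word VALUES (?, ?)", [
    (1, 2), (1, 3), (1, 4), (1, 6),  # chiasmus: knight rode / rode knight
    (2, 1), (2, 5),                  # repetition: the ... the
])

links = conn.execute("""
    SELECT f.figure_type, w.line_no, w.form
    FROM figure AS f
    JOIN figure_word AS fw ON fw.figure_id = f.figure_id
    JOIN word AS w ON w.word_id = fw.word_id
    ORDER BY f.figure_id, w.word_id
""").fetchall()

# Collect the lines each figure touches; both figures cross the line boundary.
lines_per_figure = {}
for figure_type, line_no, form in links:
    lines_per_figure.setdefault(figure_type, set()).add(line_no)
print(lines_per_figure)
```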

Page 25: Mdst3705 2013-02-19-text-into-data

In my model and in Liu & Smith’s, the text becomes a database

The readable text is just a query

As is the index, table of contents, etc.
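
To make "the readable text is just a query" concrete, here is a small sketch under the same assumptions as a word-per-record table (the schema and sample sentences are invented). One query rebuilds the running text, and another builds a word index.

```python
import sqlite3

# Invented sample text and hypothetical schema.
SAMPLE = [
    "the knight rode to the bridge",
    "the queen watched from the tower",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word (word_id INTEGER PRIMARY KEY, "
    "line_no INTEGER, position INTEGER, form TEXT)"
)
conn.executemany(
    "INSERT INTO word (line_no, position, form) VALUES (?, ?, ?)",
    [
        (line_no, position, form)
        for line_no, line in enumerate(SAMPLE, start=1)
        for position, form in enumerate(line.split(), start=1)
    ],
)

def readable_text(connection):
    """Rebuild the lines of the text from the word records."""
    lines = {}
    query = "SELECT line_no, form FROM word ORDER BY line_no, position"
    for line_no, form in connection.execute(query):
        lines.setdefault(line_no, []).append(form)
    return [" ".join(forms) for _, forms in sorted(lines.items())]

def word_index(connection):
    """Map each word form to the sorted line numbers where it occurs."""
    entries = {}
    query = "SELECT DISTINCT form, line_no FROM word ORDER BY form, line_no"
    for form, line_no in connection.execute(query):
        entries.setdefault(form, []).append(line_no)
    return entries

text_lines = readable_text(conn)
index = word_index(conn)
print(text_lines)
print(index["the"])
```

The text and its index are two views generated from the same records.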

Page 26: Mdst3705 2013-02-19-text-into-data
Page 27: Mdst3705 2013-02-19-text-into-data

The database of words and figures can be read by a program to generate a visually rich and interactive edition on the web

Page 28: Mdst3705 2013-02-19-text-into-data
Page 29: Mdst3705 2013-02-19-text-into-data

But it can also be used to discover patterns in the text not visible to the reader

It can help us discover the cultural patterns that are “delivered” by the text to our brains

Page 30: Mdst3705 2013-02-19-text-into-data

The results of a query showing the relationship between proper nouns (agents) and figure types
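
The result table itself is not in the transcript. The sketch below only shows the general shape such a query might take: count how often each proper noun falls inside each type of figure. The schema, the part-of-speech labels, the placeholder agent names, and all of the counts are invented. They are not results from the Charrette data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE word (word_id INTEGER PRIMARY KEY, form TEXT, pos TEXT);
    CREATE TABLE figure (figure_id INTEGER PRIMARY KEY, figure_type TEXT);
    CREATE TABLE figure_word (figure_id INTEGER, word_id INTEGER);
""")

# Placeholder data: "AgentA" and "AgentB" stand in for proper nouns.
conn.executemany("INSERT INTO word VALUES (?, ?, ?)", [
    (1, "AgentA", "proper_noun"),
    (2, "rode", "verb"),
    (3, "AgentB", "proper_noun"),
    (4, "AgentA", "proper_noun"),
])
conn.executemany("INSERT INTO figure VALUES (?, ?)", [
    (1, "chiasmus"), (2, "rhyme"), (3, "chiasmus"),
])
conn.executemany("INSERT INTO figure_word VALUES (?, ?)", [
    (1, 1), (1, 2), (2, 3), (2, 4), (3, 4),
])

# Cross-tabulate proper nouns against the types of figure that contain them.
counts = conn.execute("""
    SELECT w.form, f.figure_type, COUNT(*) AS n
    FROM word AS w
    JOIN figure_word AS fw ON fw.word_id = w.word_id
    JOIN figure AS f ON f.figure_id = fw.figure_id
    WHERE w.pos = 'proper_noun'
    GROUP BY w.form, f.figure_type
    ORDER BY w.form, f.figure_type
""").fetchall()

for form, figure_type, n in counts:
    print(form, figure_type, n)
```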

Page 31: Mdst3705 2013-02-19-text-into-data

A structural reading of the data

Page 32: Mdst3705 2013-02-19-text-into-data
Page 33: Mdst3705 2013-02-19-text-into-data

Form and content are interwoven, each reinforcing the other

Form – the delivery system – is used to transmit the meaningful content, the stuff that remains in your brain after reading or hearing the story

Page 34: Mdst3705 2013-02-19-text-into-data

This is a "hypergraph" of the same data, also easily generated from the database by code
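
The image is not in the transcript, so the exact construction is unknown. As a rough sketch of the idea, a hypergraph lets one edge connect any number of nodes. Here each figure type is treated as a hyperedge joining every agent that appears in a figure of that type. The (agent, figure type) rows are placeholders, and in practice they would come from a database query.

```python
from collections import defaultdict

# Placeholder (agent, figure_type) rows, invented for illustration.
ROWS = [
    ("AgentA", "chiasmus"),
    ("AgentA", "rhyme"),
    ("AgentB", "rhyme"),
    ("AgentC", "rhyme"),
    ("AgentC", "chiasmus"),
]

def build_hypergraph(pairs):
    """Group agents under each figure type; each group is one hyperedge."""
    edges = defaultdict(set)
    for agent, figure_type in pairs:
        edges[figure_type].add(agent)
    return dict(edges)

hypergraph = build_hypergraph(ROWS)
for figure_type, agents in sorted(hypergraph.items()):
    print(figure_type, sorted(agents))
```

A drawing tool can then render each hyperedge as a region or hub linking all of its agents.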

Page 35: Mdst3705 2013-02-19-text-into-data
Page 36: Mdst3705 2013-02-19-text-into-data

Text is like this

http://anthonyflo.tumblr.com/post/7590868323/photographer-and-self-described-geek-of-maps

Page 37: Mdst3705 2013-02-19-text-into-data

A text is a signal

Culture is a transmitter