Digital Text and Data Processing
Week 1
Course background
□ Future of reading?
□ Understanding “Machine reading”:
    □ Text analysis tools
    □ Visualisation tools
Scale
□ Differences between machine reading and human reading
Images taken from textarc.org and from Google App store, Javelin for Android
Text Mining
□ “a collection of methods used to find patterns and create intelligence from unstructured text data” (1)
□ Information is found “not among formalised database records, but in the unstructured textual data” (2)
□ Related to data mining
(1) Francis, Louise. “Taming Text: An Introduction to Text Mining.” Casualty Actuarial Society Forum Winter (2006), p. 51
(2) Feldman, Ronen. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007, p. 1
Difficulties of natural language
□ Information is often implicit
□ Homonyms and synonyms
□ Computers do not have access to the meaning of the text
□ Spelling changes over time and may vary according to region
I trod on grass made green by summer's rain,
Through the fast-falling rain and high-wrought sea
'Tis like a wondrous strain that sweeps
And suddenly my brain became as sand
She mixed; some impulse made my heart refrain
were found where the rainbow quenches its points upon the earth
Rain rain rains rain’s Rain’s Rain. rain. Rain! ‘rain’
The outworn creeds again believed,
Hatred, despair, and fear and vain belief
Because I am a Priest do you believe
imagine, while asserting what it believes to be true …
The pleasure of believing what we see
long-believing courage, and the systematic efforts of generations of
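The sample lines above show why a plain character search is not the same as a word search: the letters "rain" also occur inside "strain", "brain", "refrain" and "rainbow". The following is a minimal sketch of that difference, assuming the lines from the slide are held in an array. The helper name count_matches is hypothetical, and the \b word-boundary notation is only introduced later in this session.

```perl
use strict;
use warnings;

# Sample lines copied from the slide above.
my @lines = (
    "I trod on grass made green by summer's rain,",
    "Through the fast-falling rain and high-wrought sea",
    "'Tis like a wondrous strain that sweeps",
    "And suddenly my brain became as sand",
    "She mixed; some impulse made my heart refrain",
    "were found where the rainbow quenches its points upon the earth",
);

# count_matches is a hypothetical helper: it counts the lines
# that match a compiled pattern.
sub count_matches {
    my ( $pattern, @text ) = @_;
    my $count = grep { $_ =~ $pattern } @text;   # grep in scalar context counts
    return $count;
}

my $naive = count_matches( qr/rain/,       @lines );   # letters anywhere
my $word  = count_matches( qr/\brain\b/i,  @lines );   # whole word only

print "Lines containing the letters 'rain': $naive\n";
print "Lines containing the word 'rain':    $word\n";
```

The same problem appears in the other direction for "believe", "believed", "believes" and "believing": a search for one form misses the others, which is what lemmatisation (Week 3) is meant to address.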
Two stages in text mining
□ Data creation
□ Data analysis
Weekly Programme
Cluster 1: Data creation
□ W1: Introduction to the course and introduction to the Perl programming language
□ W2: Regular expressions, word segmentation, frequency lists, types and tokens
□ W3: Natural language processing: Part of Speech tagging, lemmatisation
□ W4: Exploration of existing text mining tools
Weekly Programme
Cluster 2: Data analysis
□ W5: Introduction to the R package
□ W6: Multivariate analysis: Principal Component Analysis, clustering techniques
□ W7: Visualisation
□ W8: Conclusion: What type of knowledge can we create?
Course evaluation
□ 5 assignments (2 points to be earned for each)
□ Final essay (ca. 3,000 words)
    □ Report of your individual research project
    □ Critical reflection on the merits of text mining:
        □ What sort of knowledge can be produced?
        □ How does this type of research relate to traditional scholarship?
        □ Main obstacles or challenges?
        □ Is the creation of a text analysis tool a legitimate scholarly activity in the humanities?
Introduction to programming
□ Programming languages: used to give instructions to a computer
□ There is a gap between human language and machine language
□ Digital information is information represented as combinations of 1s and 0s, e.g.: A = 01000001 (ASCII)
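To see how a character corresponds to a pattern of 1s and 0s, the sketch below prints the 8-bit ASCII code of a letter. In ASCII, uppercase A has code 65 and lowercase a has code 97, so the two letters have different bit patterns. The helper name to_bits is hypothetical; ord and sprintf are built-in Perl functions.

```perl
use strict;
use warnings;

# to_bits is a hypothetical helper: it returns the 8-bit binary
# string for the character code of a single ASCII character.
sub to_bits {
    my ($char) = @_;
    return sprintf( "%08b", ord($char) );
}

for my $char ( "A", "a" ) {
    print "$char = ", to_bits($char), "\n";
}
```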
□ Low-level programming languages: assembly language, e.g. ADD X1 Y1
□ Higher-level programming languages: translated into machine language by a compiler or an interpreter
Human programmer → Programming language, e.g. Perl → Language processor → Machine language (0101100101010) → Computer
The Perl programming language
□ Open source
□ Developed by the linguist Larry Wall
□ Easy to learn; Code is often easy to read
□ Developed specifically for text processing
Getting started
1. Create a working directory on your computer
2. Open a code editor and type the following lines:
use strict ;
use warnings ;
print "It works!" ;
3. Modify the .bat file that is provided
Today’s exercise
Create an application in Perl that reads a machine-readable version of Shelley’s Collected Poems (the file is provided) and prints all lines that contain a given keyword.
(suggestions: “fire” , “rain” , “moon”, “storm”, “time”)
Variables
□ Scalar variables are always preceded by a dollar sign
$keyword
□ Variables can be assigned a value with a specific data type (‘string’ or ‘number’)
my $keyword = "time" ;
my $number = 10 ;
□ Three types of variables: scalar ($), array (@), hash (%)
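The three variable types can be illustrated with a short sketch. All names and values below are assumptions chosen for illustration; the keywords are borrowed from the suggestions in today's exercise.

```perl
use strict;
use warnings;

# All names and values here are illustrative assumptions.
my $keyword  = "rain";                        # scalar: a single value
my @keywords = ( "fire", "rain", "moon" );    # array: an ordered list
my %counts   = ( fire => 0, rain => 0 );      # hash: key/value pairs

$counts{$keyword}++;                          # update one hash entry

print "Second keyword: $keywords[1]\n";       # array indices start at 0
print "Number of keywords: ", scalar(@keywords), "\n";
print "Count for $keyword: $counts{$keyword}\n";
```

Note that a single element of an array or hash is itself a scalar, which is why it is written with a dollar sign.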
Strings
□ Can be created with single quotes and with double quotes
□ In the case of double quotes, the contents of the string will be interpreted: variables and escape characters are replaced by their values.
□ For instance, you can then use “escape characters” in your string:
"\n" new line
"\t" tab
"\a" alarm bell
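The difference between the two kinds of quotes is easiest to see side by side. This is a small sketch with an assumed variable $keyword: the double-quoted string expands the variable and the escape characters, while the single-quoted string keeps every character literally.

```perl
use strict;
use warnings;

my $keyword = "time";

my $double = "Keyword:\t$keyword\n";   # escapes and variable are interpreted
my $single = 'Keyword:\t$keyword\n';   # kept literally, backslashes included

print $double;
print $single, "\n";

# The lengths differ because \t and \n count as one character each
# only when they are interpreted.
print "Length of double-quoted string: ", length($double), "\n";
print "Length of single-quoted string: ", length($single), "\n";
```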
Statements
□ Perl statements can be compared to sentences.
□ Perl statements end in a semi-colon!
print "Now this makes a statement!" ;
Exercise
Print a string that looks as follows:
This is the first line.
This is the second line.
This line contains a tab.
Also try to use the "\a" escape character in your string.
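Reading a file
Is done as follows:
# Open the file for reading; stop with an error message if that fails
open ( IN , "<" , "shelley.txt" ) or die "Cannot open shelley.txt: $!" ;
while ( <IN> ) {
print $_ ;    # $_ holds the line that was just read
}
close ( IN ) ;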
Exercise
Create a Perl application which can read the text file “shelley.txt” and which can print all the lines.
Control keywords
if ( <condition> ) {
<first block of code>
} elsif ( <second condition> ) {
<second block of code>
} else {
<last block of code ; default option>
}
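A concrete instance of this pattern might look as follows. The helper describe_length and its threshold of 40 characters are hypothetical, chosen only to show the three branches.

```perl
use strict;
use warnings;

# describe_length is a hypothetical helper; the threshold of 40
# characters is arbitrary and only serves the illustration.
sub describe_length {
    my ($line) = @_;
    if ( length($line) == 0 ) {
        return "empty";      # first condition is true
    } elsif ( length($line) < 40 ) {
        return "short";      # first is false, second is true
    } else {
        return "long";       # default option: no condition matched
    }
}

print describe_length("Hatred, despair, and fear and vain belief"), "\n";
print describe_length("rain"), "\n";
```

Only one block is ever executed: Perl tests the conditions from top to bottom and stops at the first one that is true.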
Regular expressions
□ The pattern is given within two forward slashes
□ Use the =~ operator to test if a given string contains the regex.
□ Example:
$keyword =~ /rain/
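Combined with an if statement, the match operator can decide whether something is printed. A minimal sketch, using one of the Shelley lines quoted earlier; the helper name mentions_rain is hypothetical.

```perl
use strict;
use warnings;

# mentions_rain is a hypothetical helper: it returns 1 if the text
# contains the characters "rain" anywhere, and 0 otherwise.
sub mentions_rain {
    my ($text) = @_;
    return $text =~ /rain/ ? 1 : 0;
}

my $line = "Through the fast-falling rain and high-wrought sea";

if ( mentions_rain($line) ) {
    print "Match: $line\n";
} else {
    print "No match\n";
}
```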
Exercise
You should now be able to do the exercise that was discussed earlier
Regular expressions (2)
□ If you place “i” directly after the second forward slash, the comparison will take place in a case insensitive manner.
□ \b can be used in regular expressions to represent word boundaries
if ( $keyword =~ /\btime\b/i ) {
}
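The keyword does not have to be typed into the pattern itself: a variable can be placed between the slashes. The sketch below is one possible approach, not a model answer. The helper matching_lines is hypothetical and the sample strings are invented; \Q and \E are standard Perl and make any punctuation in the keyword match literally.

```perl
use strict;
use warnings;

# matching_lines is a hypothetical helper: it returns the lines that
# contain $keyword as a whole word, ignoring case.
sub matching_lines {
    my ( $keyword, @lines ) = @_;
    return grep { /\b\Q$keyword\E\b/i } @lines;
}

# Invented sample strings, modelled on the word forms shown earlier.
my @sample = ( "Rain!", "rain's", "rains", "strain", "Time and rain" );

my @hits = matching_lines( "rain", @sample );
print "$_\n" for @hits;
```

Because of the word boundaries, "rains" and "strain" are skipped, while the capitalised "Rain!" is found thanks to the i flag.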
Additional exercises
□ Create a program that can count the total number of lines in the file “shelley.txt”
□ Create a program that can calculate the length of each line, using the length() function
length( $line ) ;
□ Calculate the average line length (in characters) for the entire file.
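One way to approach these exercises is to keep two running totals while reading the file. The sketch below is an assumption-laden illustration rather than the expected solution: a short in-memory text stands in for "shelley.txt" so that the code runs without the file, and the helper line_stats is hypothetical.

```perl
use strict;
use warnings;

# A short in-memory text stands in for "shelley.txt" (assumption,
# so that the sketch runs without the course file).
my $text = "The outworn creeds again believed,\n"
         . "The pleasure of believing what we see\n";

# line_stats is a hypothetical helper: it returns the number of lines
# and the average line length in characters.
sub line_stats {
    my ($fh) = @_;
    my ( $count, $total ) = ( 0, 0 );
    while ( my $line = <$fh> ) {
        chomp $line;                 # do not count the newline
        $count++;
        $total += length($line);
    }
    my $average = $count ? $total / $count : 0;   # avoid dividing by zero
    return ( $count, $average );
}

open( my $fh, '<', \$text ) or die "Cannot open in-memory text: $!";
my ( $count, $average ) = line_stats($fh);
close($fh);

printf "Lines: %d, average length: %.1f characters\n", $count, $average;
```

To run this against the real file, the open call would take the file name "shelley.txt" instead of the reference to $text.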