Digital Text and Data Processing Introduction to R.
Objectivity of DH Research
□ Tools themselves are often based on specific assumptions or subjective decisions
□ There is subjectivity in the way in which tools are used
□ Reproducible results
□ Rockwell & Ramsay, in “Developing Things”: a tool is a theory
Willard McCarty, Humanities Computing (Palgrave, 2005)
"The point of all modelling exercises, as of scholarly research generally, is the process seen in and by means of a developing product, not the definitive achievement" (p. 22).
Models, "however finely perfected, are better understood as temporary states in a process of coming to know rather than fixed structures of knowledge" (p. 27).
-> Clash between the scholar’s tacit and intuitive knowledge and the computer’s need for consistency and explicitness
Two stages in text mining
□ Data creation
□ Data analysis
Types of analyses
□ Finding distinctive vocabulary
□ Finding stylistic or grammatical differences and similarities
□ Examining topics or themes
□ Clustering texts on the basis of quantifiable aspects
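Reading a directory
my @files;
# $dir is assumed to hold the path of the directory to read
opendir( my $dh, $dir ) or die "Can't open directory $dir: $!";
while ( my $file = readdir($dh) ) {
    # keep only file names that end in .txt
    push( @files, $file ) if $file =~ /\.txt$/;
}
closedir($dh);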
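For comparison, the same task can be sketched in R. This is a minimal sketch, not part of the original slides: the helper name list_txt_files is made up for illustration, and it relies only on the base function list.files().

```r
# Hypothetical helper: return the names of all .txt files in a directory
list_txt_files <- function(dir) {
  # pattern is a regular expression; "\\.txt$" matches names ending in .txt
  list.files(path = dir, pattern = "\\.txt$")
}

# Example call on the current working directory
files <- list_txt_files(".")
print(files)
```

Unlike the Perl loop, no explicit open and close of the directory is needed; list.files() returns a character vector directly.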
Inverse document frequency
For an application, see Stephen Ramsay, Algorithmic Criticism
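The formula shown on this slide is not preserved in the transcript. As a reminder, one common definition is idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents that contain term t. The sketch below assumes that definition (natural logarithm, no smoothing); other variants exist, and the toy corpus is invented purely for illustration.

```r
# Toy corpus (hypothetical): each document is a vector of tokens
docs <- list(
  doc1 = c("the", "whale", "sea"),
  doc2 = c("the", "sea", "ship"),
  doc3 = c("the", "garden")
)

n_docs <- length(docs)
vocabulary <- sort(unique(unlist(docs)))

# Document frequency: in how many documents does each term occur?
doc_freq <- sapply(vocabulary, function(term) {
  sum(sapply(docs, function(tokens) term %in% tokens))
})

# Inverse document frequency: terms found in every document score 0
idf <- log(n_docs / doc_freq)
print(idf)
```

A term such as "the", which occurs in all three documents, receives an idf of 0, while a term confined to one document receives the highest value.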
□ Both a programme and a programming language
□ Successor to “S”
□ “a free software environment for statistical computing and graphics”
□ The capabilities of R can be extended via external “packages”
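A minimal sketch of how a package is typically checked for and loaded. The package name "tm" is only an example chosen for illustration, and the helper is_installed is hypothetical; requireNamespace(), library() and install.packages() are standard R functions.

```r
# Hypothetical helper: is a package available on this machine?
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg <- "tm"  # example package name, an assumption for this sketch

if (is_installed(pkg)) {
  # Load the package so that its functions can be called directly
  library(pkg, character.only = TRUE)
} else {
  message("Not installed yet; try install.packages(\"", pkg, "\")")
}
```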
Variables in R
□ Any combination of alphanumerical characters, underscore and dot
□ Unlike Perl, they do not begin with a $
□ The first character cannot be a digit or an underscore
□ The second character cannot be a digit if the first character is a dot
Allowed:          Not allowed:
data              3rdDataSet
my.data           .4thData.set
my_2ndDataSet
.myCsv
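The naming rules can be checked from within R. The sketch below uses the example names from this slide and the base function make.names(), which rewrites any string that is not a valid name; comparing its output with the input is one possible way to test validity, not the only one.

```r
candidates <- c("data", "my.data", "my_2ndDataSet", ".myCsv",
                "3rdDataSet", ".4thData.set")

# make.names() leaves valid names untouched and repairs invalid ones,
# so a name is valid exactly when it comes back unchanged
is_valid <- make.names(candidates) == candidates

print(data.frame(name = candidates, valid = is_valid))
```

The first four names are reported as valid and the last two as invalid, in line with the two columns on the slide.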
Vectors
□ A collection of indexed values
□ Can be created using the c() function, or by supplying a range
□ N.B. The assignment operator in R is <-
□ Examples:
x <- c(4, 5, 3, 7)
y <- 1:30
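A short sketch of what "indexed" means in practice, reusing the two vectors from the examples. The variable names on the left are chosen for illustration only.

```r
x <- c(4, 5, 3, 7)
y <- 1:30

first   <- x[1]       # 4: indexing starts at 1, not at 0
middle  <- x[2:3]     # 5 3: a range selects several elements at once
n       <- length(y)  # 30
doubled <- x * 2      # 8 10 6 14: arithmetic is applied to every element

print(doubled)
```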
Data frame
□ A collection of vectors, all of the same length
□ Each column of the table is stored in R as a vector.
     V1   V2   V3
R1    3    4    5
R2    1   21    8
R3   23    5    6
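The table on this slide can be built by hand with data.frame(), one vector per column. A minimal sketch; the variable name df is arbitrary.

```r
# Each argument is one column; row.names supplies the labels R1 to R3
df <- data.frame(
  V1 = c(3, 1, 23),
  V2 = c(4, 21, 5),
  V3 = c(5, 8, 6),
  row.names = c("R1", "R2", "R3")
)

print(df)
dim(df)            # number of rows, number of columns
is.vector(df$V2)   # a single column is an ordinary vector
```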
Comma Separated Values
i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356
N.B. The first row has one column fewer than the other rows
Reading data
□ Use the read.csv function, with parameter header = TRUE
□ The CSV file will be represented as a data frame
□ Because the first line has one value fewer, the values on the first line will be used as colnames, and the first value of each subsequent line as rownames
data <- read.csv("data.csv", header = TRUE)
colnames(data)
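To try this without a file on disk, the CSV content can be passed to read.csv() through its text argument. A minimal sketch using the three-novel sample; the variable name csv_text is chosen for illustration.

```r
csv_text <- "i,you,he
Emma,160416,3178,1994
Persuasion,77431,1284,918
PrideAndPrejudice,121812,2068,1356"

# The header has three fields and the data lines four, so R treats
# the first field of each data line as the row name
data <- read.csv(text = csv_text, header = TRUE)

colnames(data)
rownames(data)
```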
Data frame columns
□ Can be accessed using the $ operator
data <- read.csv("data.csv", header = TRUE)
data$you
Calculations
□ max(), min(), mean(), sd()
y <- data$you
max(y)
sd(y)
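A self-contained sketch of these calculations. The data frame is typed in by hand from the CSV sample so that no file is needed; the names on the left of each assignment are illustrative.

```r
data <- data.frame(
  i   = c(160416, 77431, 121812),
  you = c(3178, 1284, 2068),
  he  = c(1994, 918, 1356),
  row.names = c("Emma", "Persuasion", "PrideAndPrejudice")
)

y <- data$you        # one column, extracted as a vector

highest <- max(y)    # 3178
lowest  <- min(y)    # 1284
average <- mean(y)   # arithmetic mean of the three counts
spread  <- sd(y)     # sample standard deviation

print(c(highest = highest, lowest = lowest, average = average, spread = spread))
```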
Exercise
□ Run the program “typeToken.pl”
□ Use the file “ratio.csv” that is created by this program.
□ Print a list of all the texts that have been read
□ Calculate the average number of tokens
□ Calculate the total number of tokens in the full corpus
□ Identify the lowest number in the column “types”
□ Identify the highest number in the column “ratio”
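One possible approach, sketched under assumptions: the layout of “ratio.csv” is not shown in these slides, so the code assumes columns named tokens, types and ratio, with the text names as row names. The helper summarise_corpus and the placeholder values are invented for illustration and are not real output of typeToken.pl.

```r
# Hypothetical helper: collect the answers to the five questions
summarise_corpus <- function(d) {
  list(
    texts         = rownames(d),     # all texts that have been read
    mean_tokens   = mean(d$tokens),  # average number of tokens
    total_tokens  = sum(d$tokens),   # tokens in the full corpus
    fewest_types  = min(d$types),    # lowest value in column "types"
    highest_ratio = max(d$ratio)     # highest value in column "ratio"
  )
}

# Placeholder data standing in for read.csv("ratio.csv")
d <- data.frame(
  tokens = c(1000, 2000, 3000),
  types  = c(300, 500, 600),
  ratio  = c(0.30, 0.25, 0.20),
  row.names = c("textA", "textB", "textC")
)

result <- summarise_corpus(d)
print(result)
```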
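Subsetting and sorting
d <- read.csv("data.csv")
cell <- d[1, 2]              # the value in row 1, column 2
row2 <- d[2, ]               # all columns of row 2
od <- d[order(d$ratio), ]    # rows sorted by ascending ratio; assumes a column named ratio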
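A self-contained sketch of the same operations on the three-novel sample. Because that sample has no ratio column, the rows are sorted on the column you instead; the variable names are illustrative.

```r
d <- data.frame(
  i   = c(160416, 77431, 121812),
  you = c(3178, 1284, 2068),
  he  = c(1994, 918, 1356),
  row.names = c("Emma", "Persuasion", "PrideAndPrejudice")
)

cell <- d[1, 2]   # row 1, column 2: the "you" count for Emma
row2 <- d[2, ]    # the whole second row, still a data frame

# order() returns the row positions in sorted order
by_you      <- d[order(d$you), ]                     # ascending
by_you_desc <- d[order(d$you, decreasing = TRUE), ]  # descending

print(by_you)
```

Assigning each result to a new name keeps the original data frame d intact, so later subsetting steps still have a table to work on.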
Quantitative and Qualitative
□ Qualitative data (categorical)
    □ Nominal scale (unordered scale), e.g. eye colour, marital status
    □ Ordinal scale (ordered scale), e.g. educational level
□ Quantitative data
    □ Interval (scale with no mathematical zero)
    □ Ratio (multipliable scale), e.g. age
Source: Seminar Basic Statistics, Laura Bettens
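In R, qualitative data are usually stored as factors and quantitative data as numeric vectors. A minimal sketch; the category labels and values are invented for illustration.

```r
# Nominal: categories without an inherent order
eye_colour <- factor(c("brown", "blue", "brown", "green"))

# Ordinal: categories with an explicit order, lowest level first
education <- factor(
  c("secondary", "primary", "tertiary"),
  levels  = c("primary", "secondary", "tertiary"),
  ordered = TRUE
)

# Ratio: ordinary numbers, so arithmetic is meaningful
age <- c(23, 41, 35)

table(eye_colour)             # counts per category
education[1] > education[2]   # comparisons work only for ordered factors
mean(age)
```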
Diagrams
□ Two quantitative variables can be visualised in a variety of ways (e.g. line chart, pie chart)
□ A combination of one qualitative variable and one quantitative variable is best presented using a bar chart or a dot chart
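A minimal sketch of these chart types with base R graphics, using the counts from the CSV sample. The titles and axis labels are illustrative choices.

```r
# Counts per novel, stored as named vectors
you <- c(Emma = 3178, Persuasion = 1284, PrideAndPrejudice = 2068)
he  <- c(Emma = 1994, Persuasion = 918, PrideAndPrejudice = 1356)

# One qualitative variable (the novel) and one quantitative variable
mids <- barplot(you, main = "Occurrences of 'you'", ylab = "Count")
dotchart(you, xlab = "Count")

# Two quantitative variables plotted against each other
plot(you, he, xlab = "Occurrences of 'you'", ylab = "Occurrences of 'he'")
```

barplot() also returns the horizontal positions of the bars, which is useful when labels have to be added afterwards.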