Data Science Academy Student Demo Day: Michael Blecher, The Importance of Cleaning, Organizing and...

The Importance of Cleaning, Organizing and Analyzing Data in Research

Michael Blecher

Why Data is Important To Researchers

What makes research scientific is that it relies on statistics and data.

Researchers do not speculate. For example, they can’t just claim something like colder temperatures make people smarter.

To turn hypotheses and theories into working truths, large sets of data have to be collected (i.e. by running the experiment) and analyzed.

Only with such empirical evidence can we establish whether our theories have validity.

Thus, there may be nothing more important for people interested in research than having a background in statistics and knowing how to manipulate data (the latter is where R comes in!)

The Plan

Background into the psychology experiment I programmed

Amazon Mechanical Turk and what data often looks like in academia.

Making sense of the data and cleaning it

Organizing the data

Analyzing the data

Making the data look pretty!

My Experiment-The Famous DRM Paradigm

When people are asked to remember a list of strongly related words, they often falsely remember a word that semantically embodies all the previously presented words (the category word).

Ex: If I asked you to remember the words mad, fear, hate, rage and temper (related words), most of you would likely misremember seeing the word anger as well, which is considered the category word for this list (there are 24 lists altogether, each with a different category word and set of related words).

Hence, this experiment had four conditions, formed by crossing two factors: whether a word was presented before (pre-lure) or after (post-lure) the category word it is associated with, and the type of word it was (related or category).

Ex: If anger, which is a category word, was presented after all its related words were presented, it would be considered a post-lure category word, while words like mad would be viewed as pre-lure related words.

What Data Often Looks Like

The experiment was run through Amazon Mechanical Turk, an internet site that allows people to participate in psychological experiments (among other things).

And now the fun begins! Looking at the data.

Here is the data in a more organized fashion, but this was done manually.

One goal is going to be using R to make our raw data look like that Excel file.

raw data.txt

Where to Start

Before anything, we have to set the working directory to the folder where our data is:

setwd("C:/Users/Michael/Documents/R datascience notes")

Next, let’s create a set of empty variables (vectors). They won’t contain anything at the moment, but each one can hold multiple items at once.

Country<-c()

Gender<-c()

Subject<-c()

Age<-c()

Handedness<-c()

Many more variables in the actual R file!

Creating a Loop

One of the most important things about programming experiments is knowing ahead of time how you plan to extract the data. For instance, we ensured that certain punctuation (a colon in this case) would appear before the data of each trial in the experiment was recorded. A colon also appeared before the demographic information was recorded.

Now we can make a loop

for (subnum in 5:55) #50 participants and only data for subjects 5-55 was recorded

{

test<-scan(file="raw data.txt", what = "character", sep = " ", skip=(0+subnum), nlines=1) #skip and nlines let us select which line we want read

data<-unlist(strsplit(test,split=":")) #splits the data by the colon; unlist turns the pieces into a single vector, so we can extract any piece we like with an index

if (length(data)==172)…{ #ensures that we only extract a participant’s data when they completed the entire experiment (there were 172 trials).

Loop (continued)

As you will see, our variable data contains many different pieces of data. Elements 13-172 contain all the info that was recorded for every trial.

For instance:

[14] "concert,music,Related,Pre-Lure,9,1398816317238,1398816319063,old,4,1,1"

On trial 14, the word presented was concert, the category word that is associated with this word is music. Thus, it was a Related Pre-Lure word, etc.

So now we can split these things up even more with this code:

CurrentTrial<-unlist(strsplit(data[i],split=","))

Now each piece of each trial has its own index.
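The two splitting steps can be tried on a single line without the raw file. This is only a minimal sketch: the trial string is the one shown above, but the "demographics" field and the snake_case variable names are made-up placeholders, not part of the actual R file.

```r
# Hypothetical raw line: fields are separated by colons, and the values
# inside one trial are separated by commas. "demographics" is a placeholder.
raw_line <- "demographics:concert,music,Related,Pre-Lure,9,1398816317238,1398816319063,old,4,1,1"

# Split on the colon; strsplit() returns a list, so unlist() flattens it
fields <- unlist(strsplit(raw_line, split = ":"))

# Split one trial on the comma so that each value gets its own index
current_trial <- unlist(strsplit(fields[2], split = ","))

related_word  <- current_trial[1]  # the word that was presented
category_word <- current_trial[2]  # the category word it belongs to
word_type     <- current_trial[3]  # Related or Category
phase         <- current_trial[4]  # Pre-Lure or Post-Lure
```

Each value can then be appended to the matching variable, just as the real loop does.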

Putting Data into our variables

And now we can put each piece of our data into the correct variables we made earlier.

Ex:

CategoryWord<-c(CategoryWord,CurrentTrial[2])

RelatedWord<-c(RelatedWord,CurrentTrial[1]) etc.

And we can also do the same with demographic information.

for (i in 12:12){

CurrentTrial2<-unlist(strsplit(data[i],split=","))

Country<-c(Country,CurrentTrial2[2])

Age<-c(Age,CurrentTrial2[4])

}

Making data.frames

First, before we forget, we should change certain variables so R reads them as numeric. Ex: Memory<-as.numeric(Memory)

Next, let’s make some neat data.frames

One is a simple truncated version and the other is a longer version.

AllData<-data.frame(Subject,Phase,Type,Memory,Confidence)

AllData2<-data.frame(Subject,Phase,Type,Memory,Confidence,CategoryWord,RelatedWord, OnsetTime, ResponseTime)

Adding columns to our data

Still, so much of our data is missing. For instance, the times we have recorded do not yet give us a true reaction time. Instead, we only have the exact time when the stimulus was presented and when a button was clicked.

So let’s add a new column to our data.frame that holds the reaction time, computed as the difference between the two time-related variables we do have.

Ex: library(dplyr)

AllData3<-mutate(AllData2,Reaction_Time=(ResponseTime-OnsetTime))

Let’s also make it into a separate variable:

Reaction_Time<-AllData3$Reaction_Time
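The same subtraction can be checked by hand in base R. This is a sketch under one assumption: that the two long numbers in the example trial are the onset and response timestamps, in that order. The data frame name toy is made up.

```r
# Toy data frame holding the two timestamps from the example trial
# (assumed to be OnsetTime and ResponseTime, in that order).
toy <- data.frame(
  OnsetTime    = as.numeric("1398816317238"),
  ResponseTime = as.numeric("1398816319063")
)

# Base R equivalent of the mutate() call: one new column with the difference
toy$Reaction_Time <- toy$ResponseTime - toy$OnsetTime
```

Note that the time columns have to be numeric first (as with Memory earlier), otherwise the subtraction fails on character values.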

Putting it in Excel

Our data as a CSV file that Excel can open:

write.table(AllData3, file='data.csv', sep=",",col.names=TRUE,row.names=FALSE)

Doesn’t look half bad but so much more work to go

For starters, let’s make the data more concise by calculating the means for each subject by type and phase:

AllData<-aggregate(Memory~Subject*Phase*Type,AllData,mean)

Can also use this code from reshape2:

library(reshape2)

organize_data<-dcast(AllData,Subject*Phase*Type~.,value.var='Memory')
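To see what the aggregation step does, here is a small sketch on invented numbers (the toy data frame is not from the experiment). It uses + in the formula, which is the more conventional way to list grouping variables for aggregate().

```r
# Invented trial-level data: two subjects, two trials in each condition
toy <- data.frame(
  Subject = c(1, 1, 1, 1, 2, 2, 2, 2),
  Phase   = rep(c("Pre-Lure", "Pre-Lure", "Post-Lure", "Post-Lure"), times = 2),
  Type    = rep(c("Related", "Related", "Category", "Category"), times = 2),
  Memory  = c(0, 1, 1, 1, 0, 0, 1, 0)
)

# One row per Subject/Phase/Type combination, holding the mean of Memory
toy_means <- aggregate(Memory ~ Subject + Phase + Type, data = toy, FUN = mean)
```

Eight trial rows collapse into four rows, one per subject and condition.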

Splitting Data

We can even split the data into smaller subsets, for example one piece per subject.

Ex:

library(plyr) #dlply comes from the plyr package

Averag4 <- dlply(.data=AllData, .variables='Subject')

The code below labels the last column of organize_data as Memory.

names(organize_data)[4]="Memory"

Next comes merging the demographic information into our data. We start by changing our data structure so that each combination of Phase and Type gets its own column.

Ex:

clean_data<-dcast(organize_data, Subject ~ Phase + Type,value.var="Memory")
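For readers without reshape2, the long-to-wide step in the dcast call above can be approximated in base R. This is a sketch on invented numbers, and it swaps in tapply() as a stand-in technique; the result is a matrix rather than a data.frame.

```r
# Invented long-format data: one row per subject and condition
long <- data.frame(
  Subject = c(1, 1, 2, 2),
  Phase   = c("Pre-Lure", "Post-Lure", "Pre-Lure", "Post-Lure"),
  Type    = c("Related", "Category", "Related", "Category"),
  Memory  = c(0.5, 1, 0, 0.5)
)

# Build one label per Phase/Type combination, like the dcast column names
condition <- paste(long$Phase, long$Type, sep = "_")

# Rows are subjects, columns are conditions, cells are mean Memory scores
wide <- tapply(long$Memory, list(Subject = long$Subject, Condition = condition), mean)
```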

Subsetting Data

Now we are going to combine all those demographic variables we made earlier with the most recent data frame.

demographic_info<-data.frame(Country,Vision,Handedness,Gender,clean_data)

Now we can find pretty much anything in our data. Ex: we can get all memory scores in the Post-Lure Category condition for participants who were male.

Ex: demographic_info$Post.Lure_Category[demographic_info$Gender=="Male"]

If we examine the country participants were from, we will see that many participants were from the United States, but not all of them wrote that exact term in their questionnaire (some wrote U.S., America, etc.)

Recoding Data

demographic_info$Country[demographic_info$Country=="usa"|

demographic_info$Country=="USA"|

demographic_info$Country=="America"|

demographic_info$Country=="United States of America"|

demographic_info$Country=="US"|demographic_info$Country=="United states"|

demographic_info$Country=="us"]<-"United States"

The code above makes the adjustment so that, when we examine the frequencies, people who wrote America will not appear to be from a different country than those who wrote the United States.
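A shorter way to write the same recoding is with %in%. This is a sketch on a made-up vector, assuming the country answers are stored as character strings rather than as a factor; the names country and us_spellings are hypothetical.

```r
# Made-up answers, including a missing value
country <- c("usa", "America", "Canada", "US", "United States", NA)

# All spellings that should count as the United States
us_spellings <- c("usa", "USA", "America", "United States of America",
                  "US", "United states", "us")

# %in% returns FALSE (not NA) for missing values, so NA entries are left alone
country[country %in% us_spellings] <- "United States"

# Frequencies after recoding
country_counts <- table(country)
```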

Analyzing our data

Let’s do a 2-way ANOVA to see if there is actually an interaction effect!

aov.out<-aov(Memory~Phase*Type,AllData) #We use * to get the interaction effect as well as the effect of the two independent variables.

print(summary(aov.out),digits=10)

print(model.tables(aov.out,"means"),digits= 3) #will give us summary tables of means for the analysis we ran.

Can also use this code to get the same table of means:

bar=tapply(AllData$Memory,list(AllData$Type,AllData$Phase),mean)

And it turns out our interaction was significant.
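If the p-value for the interaction is needed as a number rather than read off the printed table, it can be pulled out of the summary object. The sketch below runs on randomly generated toy data, so its result says nothing about the real experiment; the names toy, toy_aov and interaction_p are made up.

```r
# Toy data: 10 subjects in each of the four conditions, random scores
set.seed(1)
toy <- expand.grid(
  Subject = 1:10,
  Phase   = c("Pre-Lure", "Post-Lure"),
  Type    = c("Related", "Category")
)
toy$Memory <- runif(nrow(toy))

# Same model form as above: two main effects plus their interaction
toy_aov <- aov(Memory ~ Phase * Type, data = toy)

# summary() returns a list; its first element is the ANOVA table
tab <- summary(toy_aov)[[1]]
rownames(tab) <- trimws(rownames(tab))  # row names are padded with spaces

# p-value for the Phase x Type interaction
interaction_p <- tab["Phase:Type", "Pr(>F)"]
```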

Showing the Interaction Graphically

And we can make different graphs from this code too!

Ex: library(sciplot)

lineplot.CI(x.factor=AllData$Type,response=AllData$Memory,group=AllData$Phase,

trace.label="Phase",xlab="Type",ylab="Percentage of Wrong Answers",

main="Interaction Effect")

Or if you want a bar graph-ex:

graph2 = barplot(bar, beside=T, ylim=c(0.2,1),xpd=FALSE,

space=c(.1,.8), main="TIP Effect in DRM Paradigm",

xlab="Phase", ylab="Percentage of Wrong Answers", legend =T,

args.legend = list(x="topleft"), col=c("red","blue"))

Applying this to other parts of the data

We can use the same code to find the difference in reaction times between the four conditions and run an ANOVA on that to see if there’s a significant difference in reaction time.

We can also analyze demographic information-first let’s again create isolated variables to represent certain columns that we made in our demographic data.

Now, let’s create a new data.frame but with a structure that looks more familiar to us.

Ex: demographic_info2=data.frame(Gender,Post.Lure_Category,Pre.Lure_Category, Pre.Lure_Related,Post.Lure_Related)

Organizing the data

by_gender<-melt(data=demographic_info2,id.vars="Gender")

dcast(data=by_gender, formula=Gender~variable, value.var='value',fun.aggregate=mean)

The code above is exactly what we need to find means by gender.

In fact, we can use this code to merge all this data into a really organized package.

demographic_info3=data.frame(Gender, Country, Handedness,Vision,

Post.Lure_Category,Pre.Lure_Category,Pre.Lure_Related,Post.Lure_Related)

by_gender2<-melt(data=demographic_info3,id.vars=c("Gender","Country","Handedness","Vision"))

Running an ANCOVA

However, if we want to run an ANCOVA and examine whether we should control for gender, we should merge the demographic data with the original data.frame we created, so that type and phase are separate variables again.

Ex:

combine_everything<-data.frame(by_gender2$Gender,by_gender2$Country,

by_gender2$Handedness,by_gender2$Vision,AllData)

ANCOVA<-aov(Memory~Phase*Type+by_gender2.Gender,combine_everything)

print(summary(ANCOVA),digits=10)

Can even examine which was the most luring category word!

bark=tapply(AllData2$Memory,list(AllData2$CategoryWord),mean)

###Will get the mean for each category word

library(plyr)

organize_bark<-ddply(AllData2, "CategoryWord", summarise, mean =mean(Memory))

##Can then make this data into a nice data.frame using summarise.

organize_bark <- organize_bark[order(organize_bark$mean, decreasing=TRUE),]

##Can make it where it’s sorted too
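The same "mean per category word, sorted" result can be sketched in base R without plyr. The data here is invented, and aggregate() is swapped in for ddply().

```r
# Invented data: two category words, two responses each
toy <- data.frame(
  CategoryWord = c("anger", "anger", "music", "music"),
  Memory       = c(1, 1, 0, 1)
)

# Mean Memory score for each category word
word_means <- aggregate(Memory ~ CategoryWord, data = toy, FUN = mean)

# Sort so the most luring category word comes first
word_means <- word_means[order(word_means$Memory, decreasing = TRUE), ]
```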

And we can graph this as well!

Conclusion

R makes the process of cleaning, organizing and analyzing data quite enjoyable.

The fact that we can take raw data that looks like gibberish and easily translate it into something coherent is quite amazing.

R allows us to subset our data in various ways and explore whether certain categorical independent variables (e.g. gender) are influencing the effects our investigation revolves around.

R gives us the ability to run statistical analyses to see if our results are actually significant.

R can easily depict our findings through graphs, facilitating a communal understanding of the results among researchers.
