## Data Wrangling

### notes

df is the name of the data frame we are analyzing.

df[1,]        # row 1 (leaving the column index blank means every column)
df[,1]        # column 1
df[1:4,]      # rows 1 through 4 and all columns
df[,4:5]      # columns 4 to 5 and all rows
df[1:3,6:7]   # rows 1 to 3 and columns 6 to 7

df[df$name == "luke skywalker",]   # return all columns for the rows where name equals "luke skywalker"
df[df$height > 180,]               # find all the rows with height greater than 180
df[1,2] <- 173                     # go to that particular cell and edit what is in it
dim(df)                            # returns the number of rows and columns
df <- cbind(df, random_number = runif(dim(df)[1], 0, 1))  # dim(df)[1] gives the number of rows, so runif generates 87 numbers between 0 and 1

new_df <- rbind(df[1:2,], df[5:6,])  # takes the first two rows and adds rows 5-6, combining the four rows into a new data frame

df$species <- as.factor(df$species)  # treat the column species as a factor and put it back into itself
levels(df$species)                   # list the levels of the factor

levels(df$species)[3] <- "hello"  # renames the third level, which gets applied to all listings in the df

  1. names of all characters that are taller than 80, shorter than 140, and are female.

df[df$height > 80 & df$height < 140 & df$gender == "female", ]$name

  1. characters with tatooine as their home world

df[df$homeworld == "tatooine",]           # all rows whose homeworld is tatooine
dim(df[df$homeworld == "tatooine",])[1]   # counts the rows, but this count still includes NA rows

tatooine <- df[df$homeworld == "tatooine",]
tatooine[is.na(tatooine$name) == FALSE,]           # keep only the rows where name is not NA
dim(tatooine[is.na(tatooine$name) == FALSE,])[1]   # count the remaining rows

  1. selecting different variables within the data frame

library(dplyr)

new_df <- df %>%
  filter(height > 100) %>%
  group_by(homeworld) %>%
  summarise(mean_birth_year = mean(birth_year, na.rm = TRUE))  # for each homeworld, create a column called mean_birth_year holding the mean birth year (ignoring NAs)

new_df <- df %>%
  filter(height > 120,   # keeps only rows with height greater than 120
         height < 180)   # and height less than 180

new_df <- df %>% filter(gender != "male")  # keep rows where gender is not equal to male

new_df <- df %>%
  group_by(hair_color, eye_color) %>%
  summarise(counts = length(name))  # how many of each eye and hair color

### mutate

Use mutate to change or add a column.

new_df <- df %>% mutate(hm_ratio = height/mass)  # adds a new column of height divided by mass
new_df <- df %>% select(name, height, mass)      # a new df that only has name, height, and mass

new_df <- df %>%
  select(name, films) %>%
  group_by(name) %>%
  summarise(num_films = length(unlist(films)))  # e.g., count how many films each character appears in

DATA INPUT ASSIGNMENT

Flanker task

In a flanker task, participants identify a central stimulus as quickly and accurately as possible, while ignoring distracting stimuli presented on the left or right of the central stimulus (the flankers).

For example, the stimulus could be "HHH", and the correct response would be H. This is called a compatible (or congruent) stimulus because the flanking Hs are the same as the central stimulus. Alternatively, the stimulus could be "HSH", and the correct response would be S. This is called an incompatible (or incongruent) stimulus because the flanking Hs are different from the central stimulus.
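
For reference, one way to pull out the central letter of a stimulus string in R (this is the letter the accuracy coding later in the assignment compares against):

substr("HSH", 2, 2)   # returns "S", the correct response for the incompatible stimulus "HSH"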

The data for this assignment come from a flanker task where participants responded to many flanker stimuli over several trials.

Load the data and libraries you will use

I will help you with some sample code that compiles all of the text files into a single long-format data frame.

The data is contained in this .zip file: FlankerData.zip

The code chunk below assumes that you have unzipped the data into a folder (here called data_wrangle_assignment) inside your R project folder.

library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)


# get the file names
file_names <- list.files(path="data_wrangle_assignment")

# create headers for each column
the_headers <- c("stimulus","congruency","proportion",
                 "block","condition","dualtask","unknown",
                 "stimulus_onset","response_time","response","subject")
# Load data
# create empty dataframe
all_data<-data.frame()

# loop to add each file to the dataframe
for(i in file_names){
  one_subject <- fread(paste("data_wrangle_assignment/",i, sep=""))  # paste builds the path "data_wrangle_assignment/<file name>"; i takes each file name in turn, so every file gets read, not just one
  names(one_subject) <- the_headers
  one_subject$subject <- rep(i,dim(one_subject)[1])                  # label every row with this subject's file name
  one_subject <- cbind(one_subject, trial= 1:dim(one_subject)[1])    # add a trial counter for this subject
  all_data <- rbind(all_data,one_subject)                            # stack this subject onto the growing data frame
}
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<SSS C HPC B3 C1 DA 1
## 1429545653881 1429545654464 s 1429545654485 1429545655705 s >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<SDS I LPC B3 C1 DA 1
## 1429721027215 1429721034673 d 1429721034693 1429721037336 d >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<HJH I LPC B3 C1 DA 2
## 1429720989538 1429720993779 j 1429720993785 1429720995236 h >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<DSD I LPC B3 C1 DA 1
## 1429806083269 1429806083950 s 1429806083972 1429806085767 s >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<SDS I HPC B3 C1 DA 1
## 1429806313480 1429806314183 d 1429806314205 1429806316071 d >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<JHJ I HPC B3 C1 DA 2
## 1429806117326 1429806118655 h 1429806118674 1429806119735 j >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<DDD C HPC B3 C1 DA 2
## 1430150133059 1430150133679 d 1430150133696 1430150135632 d >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<HJH I HPC B3 C1 DA 1
## 1430150314437 1430150315598 j 1430150315619 1430150317838 h >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<DDD C HPC B3 C1 DA 1
## 1430150145985 1430150146826 d 1430150146834 1430150147850 d >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<SSS C LPC B3 C1 DA 2
## 1430150564606 1430150566266 s 1430150566279 1430150568347 s >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<JHJ I LPC B3 C1 DA 2
## 1430241499262 1430241500273 h 1430241500282 1430241501330 j >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<HHH C HPC B3 C1 DA 2
## 1429636904389 1429636905985 h 1429636905998 1429636906777 h >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<JHJ I HPC B3 C1 DA 1
## 1430241629273 1430241630523 h 1430241630539 1430241632971 h >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<SSS C HPC B3 C1 DA 2
## 1430241437362 1430241438250 s 1430241438269 1430241440674 j >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<HHH C HPC B3 C1 DA 2
## 1430241452455 1430241453338 h 1430241453358 1430241455091 h >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<SSS C HPC B3 C1 DA 2
## 1429636722104 1429636723280 s 1429636723285 1429636725896 s >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<HJH I HPC B3 C1 DA 2
## 1429636735659 1429636737614 j 1429636737632 1429636739711 h >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<SSS C HPC B3 C1 DA 2
## 1429636772535 1429636774150 s 1429636774156 1429636774959 s >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<DSD I LPC B3 C1 DA 1
## 1429640807139 1429640808174 s 1429640808193 1429640810247 s >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<JHJ I LPC B3 C1 DA 2
## 1429641048574 1429641052721 h 1429641052737 1429641054609 j >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<HHH C HPC B3 C1 DA 1
## 1429720958417 1429720960347 h 1429720960357 1429720966822 h >>
## Warning in fread(paste("data_wrangle_assignment/", i, sep = "")): Stopped
## early on line 193. Expected 11 fields but found 14. Consider fill=TRUE
## and comment.char=. First discarded non-empty line: <<DDD C HPC B3 C1 DA 1
## 1429721150782 1429721151699 d 1429721151714 1429721152547 d >>

Pre-processing

Get accuracy for each trial

A correct response occurs when the letter in the response column is the same as the letter in the middle position of the item in the stimulus column. Create an accuracy column that codes whether the response was correct or incorrect on each trial (the coding can be TRUE/FALSE, 0/1, or some other scheme that identifies correct vs incorrect responses).

center_letter <- unlist(lapply(strsplit(all_data$stimulus,""),
                               FUN=function(x)unlist(x)[2]))
                # strsplit breaks each stimulus into single letters; lapply applies the function to each element of that list, and the function unlists each element and takes the second letter (the center letter)

all_data <- cbind(all_data,center_letter)

all_data <- all_data %>% 
              mutate(response = tolower(response),
                     center_letter = tolower(center_letter),
                     accuracy = response==center_letter)

# loop version

stimulus <- all_data$stimulus
center_letter <- c()

for (i in stimulus){
  answer <- unlist(strsplit(i, split = ""))[2]
  center_letter <- c(center_letter, answer)  # append the center letter (not the whole stimulus)
}
center_letter <- tolower(center_letter)
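
A more compact hedged alternative for the same accuracy coding, using substr instead of strsplit; the name alt_data is just illustrative and assumes the same all_data columns as above:

alt_data <- all_data %>%
  mutate(center_letter = tolower(substr(stimulus, 2, 2)),   # second character of each stimulus
         accuracy = tolower(response) == center_letter)     # TRUE when the response matches the center letter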

Get Reaction time on each trial

The stimulus_onset column gives a computer timestamp in milliseconds indicating when the stimulus was presented. The response_time column is a timestamp in milliseconds for the response. The difference between the two (response_time - stimulus_onset) is the reaction time in milliseconds. Add a column that calculates the reaction time on each trial.

tip: notice that the numbers in response_time and stimulus_onset have the class integer64. Unfortunately, ggplot does not play nicely with integers in this format, so you will need to make sure your RT column is in the class integer or numeric.

all_data <- all_data %>% 
  mutate(RT = as.integer(response_time - stimulus_onset))

Checks

Check how many trials each subject completed in the congruent and incongruent conditions, the mean accuracy for each subject in each congruency condition, and the mean RT for each subject in each congruency condition.

check_subjects <- all_data %>% # the pipe passes the data frame into each of the steps below
                  group_by(subject,congruency) %>%
                  summarise(num_trials = length(RT),
                            mean_RT = mean(RT),
                            acc = mean(accuracy))
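
A hedged sketch for displaying this check as a table, using knitr::kable (which is also used at the end of this document):

knitr::kable(check_subjects)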

Exclusion

It is common to exclude Reaction times that are very slow. There are many methods and procedures for excluding outlying reaction times. To keep it simple, exclude all RTs that are longer than 2000 ms

restricted <- all_data %>%
              filter(RT < 2000)
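
The assignment only requires the simple 2000 ms cutoff above. As an aside, here is a hedged sketch of one of the other common procedures, a per-subject cutoff based on each subject's own mean and standard deviation; the 2.5 SD criterion and the name restricted_sd are illustrative assumptions:

restricted_sd <- all_data %>%
  group_by(subject) %>%
  filter(RT < mean(RT) + 2.5 * sd(RT)) %>%   # keep RTs below each subject's mean + 2.5 SDs
  ungroup()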

Analysis

Reaction Time analysis

  1. Get the individual subject mean reaction times for correct congruent and incongruent trials.
subject_meanRTs <- restricted %>%
                    filter(accuracy == TRUE) %>%       # correct trials only
                    group_by(subject,congruency) %>%
                    summarise(mean_RT = mean(RT))      # one mean RT per subject in each congruency condition
  1. Get the overall mean RTs and SEMs (standard errors of the mean) for the congruent and incongruent conditions. Make a table and graph.
overall_RT <- subject_meanRTs %>%
        group_by(congruency) %>%
        summarise(group_mean_RT = mean(mean_RT),
                  SEM = sd(mean_RT)/sqrt(length(mean_RT)))  # mean and SEM computed across the subject means in each condition
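
A hedged sketch of the requested table and graph, built from the overall_RT summary above (the column names come from that summarise call; knitr and ggplot2 are already used elsewhere in this document):

knitr::kable(overall_RT)   # table of condition means and SEMs

ggplot(overall_RT, aes(x = congruency, y = group_mean_RT)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = group_mean_RT - SEM,
                    ymax = group_mean_RT + SEM),
                width = 0.2)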
  1. Compute the flanker effect for each subject, taking the difference between their mean incongruent and congruent RT. Then plot the mean flanker effect, along with the SEM of the mean flanker effect

tip: Not all problems have an easy solution in dplyr; this is one of them. You may have an easier time using logical indexing of the dataframe to solve this part.

Logical indexing means putting a TRUE/FALSE condition inside the square brackets to take only the rows you want.

C <- subject_meanRTs[ subject_meanRTs$congruency == "C", ]$mean_RT   # each subject's congruent mean RT
I <- subject_meanRTs[ subject_meanRTs$congruency == "I", ]$mean_RT   # each subject's incongruent mean RT

flanker_scores <- I - C                                          # flanker effect for each subject
flanker_mean <- mean(flanker_scores)                             # mean flanker effect
flanker_SEM <- sd(flanker_scores)/sqrt(length(flanker_scores))   # SEM of the flanker effect

flanker_df <- data.frame(dv = "flanker_effect",
                         flanker_mean,
                         SEM = flanker_SEM)

ggplot(flanker_df, aes(x = dv, y = flanker_mean)) +
  geom_bar(stat = "identity",
           position = "dodge") +                    # bar shows the mean flanker effect
  geom_errorbar(aes(ymin = flanker_mean - SEM,
                    ymax = flanker_mean + SEM),
                width = 0.2) +                      # error bars show the SEM
  coord_cartesian(ylim = c(30, 50))                 # zoom the y axis

Exploratory analysis

Multiple questions may often be asked of data, especially questions that may not have been of original interest to the researchers.

In flanker experiments, like this one, it is well known that the flanker effect is modulated by the nature of the previous trial. Specifically, the flanker effect on trial n (the current trial), is larger when the previous trial (trial n-1) involved a congruent item, compared to an incongruent item.

Transform the data to conduct a sequence analysis. The dataframe should already include a factor (column) for the congruency level of trial n. Make another column that codes for the congruency level of trial n-1 (the previous trial). This creates a 2x2 design with trial n congruency x trial n-1 congruency.

First get the subject means for each condition, then create a table and plot for the overall means and SEMs in each condition. This should include:

  1. trial n congruent : trial n-1 congruent
  2. trial n incongruent : trial n-1 congruent
  3. trial n congruent : trial n-1 incongruent
  4. trial n incongruent : trial n-1 incongruent

tip: be careful, note that the first trial in each experiment can not be included, because it had no preceding trial
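
Before the loop version below, here is a hedged alternative sketch using dplyr::lag(), grouped by subject so the lag never crosses from one subject's trials into the next; it assumes the restricted data frame created in the exclusion step, and the name prev_dplyr is just illustrative:

prev_dplyr <- restricted %>%
  group_by(subject) %>%
  mutate(prev_congruency = lag(congruency)) %>%   # congruency of the previous row within each subject
  ungroup() %>%
  filter(!is.na(prev_congruency))                 # each subject's first trial has no previous trial, so drop it
# the accuracy filter and the group_by/summarise steps would then follow as in the pipeline below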


prev_congruency <- 0

# walk through the rows and record the congruency of the previous trial (trial n-1)
for (i in 2:length(restricted$subject)) {
  if (restricted$subject[i]==restricted$subject[i-1]) {
    prev_congruency[i] <- restricted$congruency[i-1]
  }
  else {
    prev_congruency[i] <- 0   # the first trial for a new subject has no preceding trial
  }
}

prev_analysis <- restricted %>%
  mutate(prev_congruency = prev_congruency) %>%
  filter(prev_congruency != 0, accuracy == TRUE) %>%   # drop first trials and incorrect trials
  group_by(subject,congruency,prev_congruency) %>%
  dplyr::summarise(mean_reaction = mean(RT))           # subject mean RT in each of the 2x2 cells

prev_analysis <- prev_analysis %>%
  group_by(congruency,prev_congruency) %>%
  dplyr::summarise(mean_react = mean(mean_reaction),
                   standard_error = sd(mean_reaction)/(sqrt(length(mean_reaction))))

knitr::kable(prev_analysis)
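
A hedged sketch of a plot for the four sequence conditions, using the prev_analysis summary above (the column names follow the summarise call above):

ggplot(prev_analysis, aes(x = congruency, y = mean_react,
                          group = prev_congruency,
                          fill = prev_congruency)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymin = mean_react - standard_error,
                    ymax = mean_react + standard_error),
                width = 0.2, position = position_dodge(width = 0.9))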