2019-05-20

Analysis of a Likert survey in R


Which university departments have a more favorable view of Wikipedia usage?

Executive Summary: This is an analysis of a Likert survey on university faculty perceptions and practices of using Wikipedia as a teaching resource. In particular, departments are clustered by how favorably they view Wikipedia usage according to five Likert-scale questions. Categorical Principal Component Analysis was used to aggregate the questions, and then a Kruskal-Wallis test and a Dunn post hoc test were run on the first principal component to cluster the departments. Two groups are clearly distinguished from each other: Sciences, Engineering & Architecture and Arts & Humanities view Wikipedia more favorably than Health Sciences and Law & Politics.



The data was obtained from the UCI Machine Learning Repository.


Six columns will be considered: DOMAIN, which identifies each respondent's school within a single university, and five separate Likert questions on a 5-point scale, ranging from “Strongly Disagree” to “Strongly Agree.” The five questions were designed to measure the Use Behavior of faculty members. The questions themselves and the distribution of their scores can be seen in the graphic below.

library(likert);library(Gifi);library(psych);library(dunn.test);library(FSA);library(tidyverse)
# Preprocessing. DOMAIN (Department) and 5 Use questions will be used.

wiki <- read_delim("wiki4HE.csv", delim=';')

wiki[wiki=='?'] <- NA

wiki$DOMAIN <- factor(wiki$DOMAIN, labels=c('Arts & Humanities', 'Sciences', 'Health Sciences', 'Engineering & Architecture', 'Law & Politics', 'other'))

wiki[,11:53] <- lapply(wiki[,11:53],ordered)
# Convert questions to likert data and plot

wiki_qs <- cbind(wiki[,substr(names(wiki), 1,2) == 'Us'])

names(wiki_qs) <- c(
  Use1="I use Wikipedia to develop my teaching materials",
  Use2="I use Wikipedia as a platform to develop educational activities with students",
  Use3="I recommend my students to use Wikipedia",
  Use4="I recommend my colleagues to use Wikipedia",
  Use5="I agree my students can use Wikipedia in my courses")

wiki_likert_qs <- likert(wiki_qs)

plot(wiki_likert_qs)

# Make plot of questions by department.

the_nas <- which(is.na(wiki$DOMAIN))

wiki_likert <- likert(wiki_qs[-the_nas,], grouping=wiki$DOMAIN[-the_nas])

plot(wiki_likert)


Testing whether departments answer a single question differently is straightforward, since a chi-square test for independence can be used. One way to analyze all five questions at once is to use Categorical Principal Component Analysis (CATPCA) to reduce the questions to a single latent variable and then test that variable.
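As a minimal sketch of the single-question case, the chi-square test for independence can be run on a department-by-answer contingency table with base R's `chisq.test()`. The data here are simulated, and the variable names (`dept`, `answer`) are hypothetical; they are not taken from the survey.

```r
# Simulated data (NOT the survey): 3 departments, one 5-point Likert item.
set.seed(1)
dept <- factor(sample(c("Sciences", "Law & Politics", "Health Sciences"),
                      300, replace = TRUE))
answer <- factor(sample(1:5, 300, replace = TRUE), levels = 1:5, ordered = TRUE)

# Cross-tabulate department by answer, then test for independence.
tab <- table(dept, answer)
res <- chisq.test(tab)

# Degrees of freedom are (rows - 1) * (columns - 1) = (3 - 1) * (5 - 1) = 8.
res$parameter
res$p.value
```

With the real data, the same call would take something like `table(wiki$DOMAIN, wiki$Use1)`, although sparse cells may trigger a warning about the chi-square approximation.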

The first principal component (PC1) that CATPCA returns captures 66% of the variance. Its loadings are all negative and of broadly similar magnitude, so a higher PC1 score corresponds to lower agreement across all five questions. PC1 can therefore be read as a measure of how unfavorably a faculty member perceives Wikipedia.
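The sign of a principal component is arbitrary, which is why the direction of PC1 has to be read off the loadings. The sketch below illustrates this on simulated data, using ordinary PCA (`prcomp()`) on the numeric item codes as a stand-in for CATPCA. The data, the `items` matrix and the `score` variable are all assumptions made for illustration.

```r
# Simulated data (NOT the survey): 200 respondents, 5 correlated 1-5 items.
set.seed(42)
latent <- rnorm(200)
items <- sapply(1:5, function(i) {
  raw <- latent + rnorm(200, sd = 0.7)
  as.integer(cut(raw, breaks = c(-Inf, -1, -0.3, 0.3, 1, Inf)))
})
colnames(items) <- paste0("Use", 1:5)

# Ordinary PCA on standardized numeric codes (a stand-in for princals()).
pca <- prcomp(items, scale. = TRUE)
score <- pca$x[, 1]

# Orient the component so that a higher score means more agreement.
if (cor(score, rowMeans(items)) < 0) score <- -score
cor(score, rowMeans(items))
```

In the analysis here PC1 is left in its original orientation, so lower scores indicate a more favorable view.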

# Categorical PCA of the Use questions, to check that they measure the same overarching construct. The first PC captures 66% of the variance.
# wiki is a tibble (read_delim output); princals() needs a plain data frame.
data_for_pca <- as.data.frame(na.omit(wiki[c("DOMAIN","Use1","Use2","Use3","Use4","Use5")]))

use_pca <- princals(data_for_pca[,-1])

summary(use_pca)
## 
## Loadings (cutoff = 0.1):
##      Comp1  Comp2 
## Use1 -0.809  0.264
## Use2 -0.707  0.608
## Use3 -0.912 -0.174
## Use4 -0.880 -0.146
## Use5 -0.739 -0.482
## 
## Importance (Variance Accounted For):
##                  Comp1   Comp2
## Eigenvalues     3.3070  0.7229
## VAF            66.1399 14.4584
## Cumulative VAF 66.1400 80.6000
plot(use_pca, "screeplot")
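The 66% figure can be checked by hand from the summary above. As a sketch, assuming each of the five items contributes one unit of variance (which is consistent with the printed numbers), the variance accounted for is the eigenvalue divided by the number of items:

```r
# Eigenvalues copied from the princals() summary above (rounded to 4 decimals).
eigenvalues <- c(Comp1 = 3.3070, Comp2 = 0.7229)
n_items <- 5  # Use1 to Use5

# VAF (%) = eigenvalue / number of items * 100
vaf <- 100 * eigenvalues / n_items
round(vaf, 2)

# Cumulative share for the first two components
cumsum(vaf)
```

Small differences from the printed VAF in the last decimals come from using the rounded eigenvalues.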


A Kruskal-Wallis test is chosen because the distribution of PC1 is not normal. It rejects the null hypothesis that there is no difference between schools. To cluster the schools based on PC1, a post hoc Dunn's test with a Bonferroni correction is used. It shows three overlapping groups, two of which are completely distinguishable from each other. The first is group A, which includes Sciences, Engineering & Architecture and Arts & Humanities. Because the PC1 loadings are negative, their lower mean PC1 scores indicate that they view Wikipedia use more favorably. The second is group C, which is composed of Health Sciences and Law & Politics. These schools have higher mean PC1 scores, indicating less favorable views on Wikipedia usage. Group B (Arts & Humanities, other and Health Sciences) overlaps with both.
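To make the multiple-comparison step concrete: six departments give `choose(6, 2)` = 15 pairwise comparisons, and a Bonferroni correction multiplies each unadjusted p-value by that count, capped at 1. The p-values in this sketch are made up purely to show the arithmetic; they are not results from the survey.

```r
n_groups <- 6
n_pairs <- choose(n_groups, 2)  # number of pairwise comparisons
n_pairs

# Hypothetical unadjusted p-values, for illustration only.
p_unadj <- c(0.0004, 0.003, 0.02, 0.2)

# Bonferroni by hand: multiply by the number of comparisons, cap at 1.
p_adj <- pmin(1, p_unadj * n_pairs)

# The same adjustment via base R's p.adjust().
p_adj_builtin <- p.adjust(p_unadj, method = "bonferroni", n = n_pairs)
all.equal(p_adj, p_adj_builtin)
```

Bonferroni is conservative, so pairs that are only borderline different tend to end up sharing a group letter.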

pc1 <- use_pca$objectscores[,1]

for_kw <- data.frame(DOMAIN=data_for_pca$DOMAIN, pc=pc1)

kruskal.test(pc ~ DOMAIN, data = for_kw)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  pc by DOMAIN
## Kruskal-Wallis chi-squared = 54.391, df = 5, p-value = 1.742e-10
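As a quick check of the output above, the Kruskal-Wallis statistic is compared against a chi-square distribution with k - 1 degrees of freedom, where k = 6 departments. The sketch below recomputes the p-value from the rounded statistic, so tiny differences in the last digits are expected.

```r
statistic <- 54.391   # Kruskal-Wallis chi-squared from the output above
k <- 6                # number of DOMAIN levels
deg_free <- k - 1     # matches df = 5 in the output

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(statistic, df = deg_free, lower.tail = FALSE)
p_value
```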
# post hoc test

dt <- dunnTest(for_kw$pc, for_kw$DOMAIN, method="bonferroni")
# Create table that visualizes clusters.
domain_means <- aggregate(pc ~ DOMAIN, for_kw, mean)

colnames(domain_means) <- c("Department","mean pc")
domain_means$Department <- as.character(domain_means$Department)

pairwise_visual <- data.frame(Department=c("Sciences","Engineering & Architecture","Arts & Humanities","other","Health Sciences","Law & Politics"),
                              group1=c("A","A","A"," "," "," "),
                              group2=c(" "," ","B","B","B"," "),
                              group3=c(" "," "," "," ","C","C"))
pairwise_visual$Department <- as.character(pairwise_visual$Department)

inner_join(pairwise_visual, domain_means, by="Department")
##                   Department group1 group2 group3     mean pc
## 1                   Sciences      A               -0.38869250
## 2 Engineering & Architecture      A               -0.35978208
## 3          Arts & Humanities      A      B        -0.07066003
## 4                      other             B         0.06102902
## 5            Health Sciences             B      C  0.23014760
## 6             Law & Politics                    C  0.47618655