Thursday 13 March 2014

Diagnosing normality in R: QQ Plots and Shapiro-Wilk

I've become a teaching assistance for a 3rd year 'Stats for Psychologists' course in Australia. I thought I might share a little visualization to help my students intuit skew, normality, QQ-plots, and the Shapiro-Wilks test versus the Kolmogorov-Smirnov test. See the R code below the graphic output: I generate random data from a Gamma distribution (left) and a Normal distribution, to show how QQ-plots compare to the sample histograms, and under what sample sizes the tests of normality have any power to catch skew.

Highlights:
- as sample size decreases (heading down the graphic), both the KS-test and the Shapiro-Wilk loose their power to 'detect normality' (i.e., p values > 0.05)
- Shapiro-Wilk is more powerful than the more general KS-test.
- for QQ plots, one should focus on systematic trends along the inner-quantiles and ignore large deviations in the tails. The tails for the samples drawn from the Normal distribution (Right) can be wonky, but nonetheless the Shapiro-Wilk tests are not significant (as they shouldn't be).
- Subplotting in R is terrible (i.e., the nice histograms in the upper corners of the QQ plots): if you look at my code, most of the nastiest that detracts from the simple qqnorm and qqline functions is to make the inset histogram (on a positive note, I learned out to finely control where plots are positioned, rather than relying on the par(mfrow) command.
- Histograms are very difficult to interpret at low sample sizes. Herein lies the beauty of QQ-plots to show subtle patterns.
- I haven't shown a follow up simulation, but the KS-test is very sensitive to bimodal distributions of 2 mixtures of Normal distributions. The Shapiro-Wilk is not!

Left, different samples from a Gamma distribution (shape=5,scale=5), and Right, draws from a Normal distribution with the same mean and variance as the draws from the Gamma distribution.


Here is some R code to run the above simulation and get a feel for the QQ-plots and tests.

# SETUP SIMULATION PARAMETERS
N <- 100# sample size
shp_ <- 5; scale_ <- 5 # shape and scale parameters for Gamma distr
dat.skew <- as.numeric(scale(rgamma(n=N,shp_,scale_))) # random skewed data
dat.norm <- rnorm(N,mean(dat.skew),sqrt(var(dat.skew)))# random normal data
d <- list(skew=dat.skew, norm=dat.norm)
pltreg <- matrix(c(0,0.5,2/3,1, # x1,x2,y1,y2
0.5,1,2/3,1,
0,0.5,1/3,2/3,
0.5,1,1/3,2/3,
0,0.5,0,1/3,
0.5,1,0,1/3
),nrow=6,ncol=4,byrow=T)# arduous way to handle subplotting

# LOOP THROUGH DIFFERENT SAMPLE SIZES, PLOT 3x2
j <- 0 # counter to track which subplot we're on
invisible(par()) # make blank plot canvas
for(subn in c(1,0.3,0.15)){ # loop through different samples sizes
for(i in c("skew","norm")){ # loop through Gamma and Normal data
j <- j+1 # tracking plot region
sub.d <- sort(sample(d[[i]],round(N*subn))) # take subsample
par(fig=pltreg[j,],mar=c(4,4,1,1),new=TRUE*(j>1),mgp=c(1.2,0.5,0),bty="l")# pain in the ass way to position plots along the 3x2 canvas layout.
qqnorm(sub.d, datax=TRUE, col="forestgreen", pch=19,main=c(skew=paste("QQ (N=",round(N*subn)," from Gamma)",sep=""),norm=paste("QQ (N=",round(N*subn)," from Normal)",sep=""))[i]) # MAIN FUNCTION
qqline(sub.d,datax=TRUE) # 1:1 line
sw_p <- round(shapiro.test(sub.d)$p.value,3) # test of normality
ks_p <- round(ks.test(x=sub.d,y="pnorm")$p.value,3) # ''
text(0.8*diff(range(sub.d))+min(sub.d),y=0,labels=paste("N:",round(N*subn),"\nShapiro-Wilk:",sw_p,"\n","KS-test:",ks_p),pos=1)
par(fig=c(pltreg[j,c(1,1,4,4)]-c(0,-0.25,0.33/1.5,0)),new=TRUE)# setup inset histogram
hist(sub.d,col="grey",breaks=15,main="",xlab="",ylab="",xaxt="n",yaxt="n") # histogram
} # inner loop through two distributions
} # loop through sub-setting sample


Made using R v3, ESS and EMACS24 on Bodhilinux 2.4

No comments:

Post a Comment