Saturday 2 August 2014

remote connect R, SSH, EMACS, ESS + persistent sessions

I'm heading to Duke University on an exchange to study statistics, and hope to maintain access to my high-performance desktop at Murdoch University in Australia (aka 'The Legend'), which runs Ubuntu. After googling around a couple of options (and thanks in particular to this helpful exchange with Dirk Eddelbuettel at stackoverflow), I've set up a means to keep R processes running persistently over ssh, even after logging out. I thank the Great Spaghetti Monster that I decided to learn EMACS! (See below for an alternative way to create persistent R sessions without using EMACS.)

The following instructions assume that the remote computer runs Linux, is reachable from the outside world at an IP address, has an ssh server installed and running (I installed openssh-server), and of course has a version of EMACS 24 installed.
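A quick way to sanity-check those prerequisites on the remote Ubuntu machine before you leave it behind (a sketch for Ubuntu of that vintage; substitute your distro's equivalents):

```shell
# Install the ssh server if it isn't there already
sudo apt-get install openssh-server
# Confirm the ssh daemon is up
sudo service ssh status
# Note the machine's IP address (the one you'll ssh to)
hostname -I
```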

On the local machine, open a terminal and enter the command:

ssh -X user@hostipaddress

... then go through prompts for passwords and passkeys.
On the first log-in, activate the emacs daemon and start the emacs buffers:

emacs --daemon
emacsclient -nw


The daemon is what allows processes to keep running after disconnection and lets you resume the session when you log back in. The second line opens EMACS with its buffers displayed in the ssh terminal. One can then proceed to open R scripts and start R sessions, etc., using familiar commands such as M-x R.
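As an aside, if you juggle several projects on the same machine, the daemon can be given an explicit socket name and addressed by it (the name 'rwork' here is just my illustration, nothing required by emacs):

```shell
# Start a named daemon, then attach a terminal client to that socket
emacs --daemon=rwork
emacsclient -nw -s rwork
```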

Now, you can simply exit EMACS (C-x C-c) and log out of the remote computer (logout), but the buffers and processes will persist! To reconnect, simply re-establish ssh access to the remote computer and redo the emacsclient -nw.

ssh -X user@hostipaddress
emacsclient -nw


... R should still be running :) For some reason, my C-SPC and C-S-SPC keybindings do not work (though all my other customized keybindings do), so be prepared to use C-c C-n and C-c C-j.

Reconnecting a remote R session (without EMACS)

For those of you who don't use EMACS, I found tmux to be an easy alternative for disconnecting from and reconnecting to R sessions (this assumes tmux is installed on the remote computer). While in a remote ssh session, simply start a tmux session by typing the command "tmux", then type "R" to launch an R instance. When you want to disconnect, type "CTRL-b d" to detach from tmux (while the R session nonetheless persists), then you're free to log out and close the remote connection. Later, when you want to reconnect to R, re-establish the ssh connection and type "tmux attach" to re-enter tmux. Your R session will still be there, chugging along as if nothing happened.
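The whole round trip looks like this (user@hostipaddress is a placeholder, and the session name 'rwork' is my own choice; a bare "tmux" and "tmux attach" work just as well if you only keep one session):

```shell
# First visit: start a named tmux session and launch R inside it
ssh user@hostipaddress
tmux new -s rwork
R
# ... work in R, then detach with CTRL-b d and log out ...

# Later: ssh back in and reattach; R is still running
ssh user@hostipaddress
tmux attach -t rwork
```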

Sunday 1 June 2014

Planned contrasts and ANCOVA in Excel / LibreOffice Calc

I just finished TA'ing Advanced Quantitative Methods in Psychology at Murdoch University, and I'd like to share an Excel/Calc file (note it's an .ods) to help students visualize ANCOVA/regression. Users can tweak regression coefficients to minimize the Sum of Squared Residuals and/or find the lines of best fit. For the math enthusiasts, it also demos the underlying matrix multiplication needed to solve the problem analytically. The experiment: we are interested in whether treatments of sucrose (8%, 32%, 64%) affect rats' speed at completing a maze (DV) while controlling for the rats' age (aka, the 'covariate' in ANCOVA).

One should never 'do stats' in Excel (so I also provide R code below), but this is a great, quick way to visualize several statistics concepts that many non-statisticians may not understand, including:
- how a mix of continuous and categorical variables can be visualized as different parallel lines,
- how categorical variables are represented as indicator variables in the model matrix,
- how planned contrasts determine the type of indicator columns (the default in the .ods file uses the 'first' treatment as the reference treatment, like in R, but users can code their own. Try the Helmert!),
- how to solve least squares regression coefficients via matrix multiplication
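For reference, the matrix solution that both the worksheet and the R code below implement is the standard normal-equations result:

```latex
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y, \qquad
\mathrm{SE}(\hat{\beta}_j) = \sqrt{\left[(X^{\top}X)^{-1}\right]_{jj}\,\mathrm{MS}_r},
\qquad \mathrm{MS}_r = \frac{\sum_i (y_i-\hat{y}_i)^2}{n-p}
```

where X is the model matrix, y is the vector of maze speeds, and p is the number of columns of X.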

If I were a teacher in an intro stats course, I'd ask students to reproduce the above. One operation I couldn't figure out was how to calculate the S.E. of the regression coefficients: this requires more matrix operations than Excel offers.

# ANCOVA in R
# data: DV is speed to complete maze
# categorical variables: 3 sucrose treatments
# covariate: rat AGE
library(car)
d<-data.frame(speed=c(1.4,2,3.2,1.4,2.3,4,5,4.7,6.2,3.1,3.2,4,4.5,6.4,4.4,4.1,5.8,6.6,6.5,5.9,5.9,3,5.9,5.6), trt=factor(rep(c(0,1,2),each=8),labels=c("sucrose8","sucrose32","sucrose64")), age=c(12,5,8,7,11,2,3,4,7,5,12,12,8,4,3,2,5,8,12,2,7,11,4,3))
# calculate regression coefficients and statistics manually
# then compare to worksheet and R
X <- model.matrix(~age+trt,data=d) # model matrix
betas<-solve(t(X)%*%X)%*%(t(X)%*%d$speed) # regression coefficients
SSm<-sum(((X%*%betas) - mean(d$speed))**2) # explained variation
SSr<- sum(((X%*%betas) - d$speed)**2) # residual sum of squares
df_m<-ncol(X)-1; df_r<-nrow(X)-ncol(X) # degrees of freedom
MSm<-SSm/df_m # Mean Square of Model
MSr <-SSr/df_r # means square error / estimate of variance
F.stat <- MSm/MSr

# standard error of regression coefficients
betas.se<- sqrt(diag(solve(t(X)%*%X)*MSr)) # MSr is an estimate of sigma^2
betas.t <- betas/betas.se # chance of seeing this large of effect ~ t distr
betas.p <- 2*(1-pt(q=abs(betas.t),df=df_r)) # p-values of regression coefs


# redo above, but with R's linear model function
m1 <- lm(speed~age+trt,d)
F.stat1<-summary(m1)$fstatistic # compare F.stat and F.stat1
coef(m1) # compare with betas
summary(m1)$coefficients # compare with betas.se,t,and p

# try with different planned contrasts:
# use Helmert contrasts to test whether 32% differs from 8%, and
# whether 64% differs from the average of 8% and 32%
m.helm <- lm(speed~ age+trt, data=d, contrasts = list(trt = "contr.helmert"))

Tuesday 18 March 2014

lmms: powerful Industrial music studio

Last Tuesday I saw Nine Inch Nails live in Perth (with a guest appearance by How to Destroy Angels!), and they inspired me to finally compose synth / industrial music. This is surprisingly possible when you have a GNU/Linux distro and the industry-grade software lmms, Linux MultiMedia Studio, something akin to the commercial Pro Tools or FL Studio. You can engineer your own sounds and compose your own music with a large collection of instruments, effects, and synthesizers. For many years I have played around with fragments of melodies and drumbeats, but here is my first "song". With a dose of creativity and OCD, this shows what you can do in a mere 24 hours.



For the uninitiated, the lingo and layout of lmms, Ardour3, or other Digital Audio Workstations can be very intimidating: I hardly know where to start, nor how to give advice. Thanks to the rise of Dubstep, there are hundreds of very helpful youtube tutorials such as: http://www.youtube.com/watch?v=naQW2RkpfYY. I like this one because it is a nice visual intro to detuning and combining saw-tooth sound waves via an 'oscillator' to make a sound (instrument?) recognizable in, say, Skrillex.

Composing music with a DAW has revolutionized how I listen to songs: I can't help but listen for the subtle signatures of familiar saw-waves or moog-filters or peculiar sounds and then try to recreate them. It feels great, like becoming an active participant rather than just a passive consumer.

Thursday 13 March 2014

Diagnosing normality in R: QQ Plots and Shapiro-Wilk

I've become a teaching assistant for a 3rd year 'Stats for Psychologists' course in Australia. I thought I might share a little visualization to help my students intuit skew, normality, QQ-plots, and the Shapiro-Wilk test versus the Kolmogorov-Smirnov test. See the R code below the graphic output: I generate random data from a Gamma distribution (left) and a Normal distribution (right), to show how QQ-plots compare to the sample histograms, and under what sample sizes the tests of normality have any power to catch skew.

Highlights:
- as sample size decreases (heading down the graphic), both the KS-test and the Shapiro-Wilk lose their power to detect non-normality (i.e., p values > 0.05 despite the skewed parent distribution)
- Shapiro-Wilk is more powerful than the more general KS-test.
- for QQ plots, one should focus on systematic trends along the inner-quantiles and ignore large deviations in the tails. The tails for the samples drawn from the Normal distribution (Right) can be wonky, but nonetheless the Shapiro-Wilk tests are not significant (as they shouldn't be).
- Subplotting in R is terrible (i.e., the nice histograms in the upper corners of the QQ plots): if you look at my code, most of the nastiness that detracts from the simple qqnorm and qqline functions is there to make the inset histograms (on a positive note, I learned how to finely control where plots are positioned, rather than relying on the par(mfrow) command).
- Histograms are very difficult to interpret at low sample sizes. Herein lies the beauty of QQ-plots to show subtle patterns.
- I haven't shown a follow-up simulation, but the KS-test is very sensitive to bimodal distributions made from mixtures of 2 Normal distributions. The Shapiro-Wilk is not!

Left, different samples from a Gamma distribution (shape=5,scale=5), and Right, draws from a Normal distribution with the same mean and variance as the draws from the Gamma distribution.


Here is some R code to run the above simulation and get a feel for the QQ-plots and tests.

# SETUP SIMULATION PARAMETERS
N <- 100 # sample size
shp_ <- 5; scale_ <- 5 # shape and scale parameters for Gamma distr
dat.skew <- as.numeric(scale(rgamma(n=N,shape=shp_,scale=scale_))) # random skewed data; args named because rgamma's 3rd positional argument is rate, not scale
dat.norm <- rnorm(N,mean(dat.skew),sqrt(var(dat.skew)))# random normal data
d <- list(skew=dat.skew, norm=dat.norm)
pltreg <- matrix(c(0,0.5,2/3,1, # x1,x2,y1,y2
0.5,1,2/3,1,
0,0.5,1/3,2/3,
0.5,1,1/3,2/3,
0,0.5,0,1/3,
0.5,1,0,1/3
),nrow=6,ncol=4,byrow=T)# arduous way to handle subplotting

# LOOP THROUGH DIFFERENT SAMPLE SIZES, PLOT 3x2
j <- 0 # counter to track which subplot we're on
for(subn in c(1,0.3,0.15)){ # loop through different samples sizes
for(i in c("skew","norm")){ # loop through Gamma and Normal data
j <- j+1 # tracking plot region
sub.d <- sort(sample(d[[i]],round(N*subn))) # take subsample
par(fig=pltreg[j,],mar=c(4,4,1,1),new=(j>1),mgp=c(1.2,0.5,0),bty="l")# pain in the ass way to position plots along the 3x2 canvas layout.
qqnorm(sub.d, datax=TRUE, col="forestgreen", pch=19,main=c(skew=paste("QQ (N=",round(N*subn)," from Gamma)",sep=""),norm=paste("QQ (N=",round(N*subn)," from Normal)",sep=""))[i]) # MAIN FUNCTION
qqline(sub.d,datax=TRUE) # 1:1 line
sw_p <- round(shapiro.test(sub.d)$p.value,3) # test of normality
ks_p <- round(ks.test(x=sub.d,y="pnorm")$p.value,3) # ''
text(0.8*diff(range(sub.d))+min(sub.d),y=0,labels=paste("N:",round(N*subn),"\nShapiro-Wilk:",sw_p,"\n","KS-test:",ks_p),pos=1)
par(fig=c(pltreg[j,c(1,1,4,4)]-c(0,-0.25,0.33/1.5,0)),new=TRUE)# setup inset histogram
hist(sub.d,col="grey",breaks=15,main="",xlab="",ylab="",xaxt="n",yaxt="n") # histogram
} # inner loop through two distributions
} # loop through sub-setting sample


Made using R v3, ESS and EMACS24 on Bodhilinux 2.4

Wednesday 29 January 2014

Japan for skiing - Epic 'Pow'

2014 is great so far: I managed to dodge the scorching Australian summer and went skiing in Japan for 3 weeks -- on a budget! I volunteer about 2-8 hours/day in hospitality work in exchange for a free ski pass, gear, room and board! I'm staying at Appi-Kogen in Iwate Prefecture, northern Honshu, perhaps the snowiest place I've ever been (see map). Not only have I realized one of my dreams of being a ski-bum, but the work I do is as interesting as (or more than) being a tourist: I get to experience a different side of the ever-so-fascinating Japanese culture, especially the dynamic between worker and employer. I've heard that the austere worker-boss relationship descends from the Samurai tradition, with the expectation that one be willing to die for one's master, etc. I've never been an obsequious person, so I've struggled a bit to adapt to the workplace, but I've learned a lot in the process, about myself and about Japan.
5 reasons Japan is awesome.
Aside from the skiing, some of my favourite cultural tidbits include:

i) Humility. For example, when you present a gift or food offering to another person, you should ask forgiveness in the process for presenting to them such a terrible gift/food. This even applies to introducing someone to your spouse ("here is my terrible wife").

ii) Soba bars. Imagine ordering and paying for a meal from a vending machine outside a restaurant, receiving a coupon, and then entering to sit alone on a long bench along the wall with other quiet business-attired Japanese men. Yum!

iii) Mochi-panda. An adorable 'hello-kitty' kid's character and multitudinous consumer item, except it's a panda, and it's frowning, because it's about to be eaten.

iv) dancing to Nine-inch-Nails on a TV Dance competition show

v) 'epic Pow' as they say in Australia. "Lots of snow."