Tuesday 18 June 2013

Zotero and R: automatically find relevant scientific articles with the Microsoft Academic Search API

Zotero is a powerful scientific-article manager and part of my 'cannot-live-without' toolbox for research. Its one drawback compared to ReadCube or Mendeley is the lack of a 'find relevant articles' engine to smartly expand your library. But, unlike those rivals, Zotero is an open-source project, and is naturally hackable and open to exploration with programs such as R. To make up for the deficiency, I made a script that i) gives R access to your Zotero database; ii) loops through your articles and finds them online via the Microsoft Academic Search (MSA) API; iii) retrieves their references and 'cited by' articles; and iv) outputs HTML files with links to Google Scholar, dx.doi.org, and the publisher's website, to make it as easy as possible to get highly relevant articles into Zotero.

The principle is simple: find the articles that are most frequently cited by the articles already in your Zotero database (and which you haven't yet read), and find other articles that cite the articles in your database. Rather than using keyword-similarity algorithms, the script simply assumes that authors who think and read similarly to you probably know what's relevant.
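Put differently, the core of the method is nothing more than a citation tally. As a toy illustration (with made-up MSA publication IDs), counting how often each ID shows up across the reference lists of your own articles, and dropping the ones you already have, yields the ranked suggestions:

# toy illustration of the tallying principle, with made-up MSA publication IDs
reference_lists <- list(myarticle_1 = c(101, 102, 103),
                        myarticle_2 = c(102, 103, 104),
                        myarticle_3 = c(103, 105))
already_have <- c(101, 104)
tally <- table(unlist(reference_lists))            # how often each ID is cited
tally <- tally[!(names(tally) %in% already_have)]  # drop articles already in the library
sort(tally, decreasing = TRUE)                     # article 103 is the top suggestion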

Below is the code, which can be pasted into your R terminal. There is one source-code file, "SQL_zotero_query.txt", that should be downloaded into your working directory (download from here; thank you Royce). The sections which need to be customized are the locations of: a) your Zotero database folder (which contains zotero.sqlite) and b) the folder in which to save the output HTML files. If you use this frequently, you should also get your own free MSA API key (mine is provided below but allows only a limited number of queries).

Enjoy! Please send me any suggestions and questions!

# Microsoft Academic Search API (please get your own)
apikey <- "88cab0fa-2dd9-4eef-9d07-67d4e0a5c933"


options(gsubfn.engine = "R")
library(sqldf)
library(RSQLite)
library(rjson)


# WORKING DIRECTORIES AND WHERE TO FIND ZOTERO, etc
stem <- ifelse(Sys.info()['sysname']=="Linux","//home/rob","C:/Users/Rob") # your user directory
msadir <- paste(stem,"/Documents/",sep="") # working directory
setwd(msadir)
zot_db <- paste(stem,"/Documents/Literature/zotero.sqlite.bak",sep="") # original Zotero database (backup copy) to duplicate
file.copy(from=zot_db, to=msadir, overwrite=TRUE) # work on a copy so the live Zotero database is never touched
db <- paste(msadir,"/zotero.sqlite.bak",sep="")


# SQL command
SQLquery_txt_fname <- "SQL_zotero_query.txt" # file containing the SQL command (see link above)
SQLquery_txt <- readLines(SQLquery_txt_fname)
# CONNECT TO ZOTERO DATABASE AND QUERY
conn <- dbConnect(SQLite(), dbname = db) # connect to the copied Zotero database
dbListTables(conn)#list tables in zotero database
res2 <- dbSendQuery(conn, statement = paste(SQLquery_txt,collapse=" ") )
zd <- fetch(res2, n=-1)#
dbClearResult(res2)
# FILTER to only journal articles
zd <- zd[which(zd$TYPE %in% c("journalArticle")),]
# remove TAGS and abstract
zd <- zd[,-(grep("TAG_",names(zd)))]
zd <- zd[,-(grep("ABSTRACT",names(zd)))]


# CLEAN UP UGLY CHARACTERS
zd$TITLE <- gsub('<i>',"",zd$TITLE)   # strip italics markup
zd$TITLE <- gsub('</i>',"",zd$TITLE)
zd$TITLE <- gsub(':',"",zd$TITLE)     # remove colons
zd$TITLE <- gsub(',',"",zd$TITLE)     # remove commas
zd$TITLE <- gsub('-'," ",zd$TITLE)    # replace hyphens with spaces
# DONE PREPARING THE DATA BASE


# DEFINE SPECIAL FUNCTIONS FOR THE NEXT STEPS
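# NOTE: rmas$d[[8]][[5]] below is a positional shortcut into the parsed JSON; it appears
# to be the same list as rmas$d$Publication$Result (the publication records returned by
# MSA), with each element lx being one publication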
gettits <- function(rmas){ lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx){ ti <- lx$Title
gsub('<i>',"", gsub('</i>',"", gsub(':',"",gsub(',',"",gsub('-'," ",ti)))))
}) } # get the titles of the returned objects, cleaned the same way as the Zotero titles
getDOI <- function(rmas){lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx){ ret <- lx[["DOI"]]; ifelse(is.null(ret),NA,ret)}) } # get the DOI of the returned objects (NA if missing, so list positions stay aligned)
getMSID <- function(rmas){lapply( rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) lx[["ID"]]) } # MAS internal ID
getauths <- function(rmas){ lapply(rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx) {
ret<-unlist(lapply(lx[["Author"]], function(al) al[["LastName"]])); ifelse(is.null(ret),NA,ret)}
)} # last names of authors
getyear <- function(rmas){ lapply(rmas$d[[8]][[5]][1:length(rmas$d[[8]][[5]])], function(lx){ ret <- lx[["Year"]]; ifelse(is.null(ret),NA,ret)})} # publication year (NA if missing)
geturl <- function(rmas){ lapply(rmas$d[[8]][[5]],function(lx){ret<-lx$FullVersionURL; if(is.null(ret) | length(ret)==0){ret<-NA}; ret[1]})}
getpub <- function(rmas){ lapply(rmas$d[[8]][[5]],function(lx){ret<-c(lx$Journal[c("FullName","ShortName")],NA); ret[[which(!sapply(ret,is.null))[1]]]})} # journal name: full name if present, otherwise short name, otherwise NA
# fuzzy word-overlap score between two strings (0 = no overlap, 1 = all words match), via agrep
fuzzycor <- function(s1,s2){
sv <- list(one= unlist(strsplit(s1," ")),two=unlist(strsplit(s2," ")))
sv$one <- sv$one[sv$one!=""]; sv$two <- sv$two[sv$two!=""]
find1in2 <- sum(sapply(sv[[1]],function(ss) length(agrep(ss,s2,ignore.case=TRUE))>0))
find2in1 <- sum(sapply(sv[[2]],function(ss) length(agrep(ss,s1,ignore.case=TRUE))>0))
(find1in2+find2in1)/(length(sv[[1]])+length(sv[[2]]))}
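# a quick (hypothetical) sanity check of the fuzzy matcher, using two made-up titles;
# word sets that largely overlap should score close to 1, unrelated titles close to 0:
# fuzzycor("Effects of grazing on grassland diversity",
#          "Grazing effects on diversity of grasslands")   # ~1
# fuzzycor("Effects of grazing on grassland diversity",
#          "Bayesian inference for capture recapture models")   # ~0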


# STEP one: match local zotero holdings with the MSAR numbers
authcol <- grep("AUTHOR_[[:digit:]]{1}_LAST",names(zd))
iddbf <- paste(msadir,"msID_db.csv",sep="") # local file of MSA id's
iddb <- data.frame(ITEMID=-1,msID=NA,ztitle=NA,mstitle=NA,zdoi=NA,msdoi=NA)


for(i in 1:nrow(zd)){

msID <- NULL # a handler for the final matching option in retrieval
ctit <- cdoi <- mj <- NA


# zotero title
ztit <-zd$TITLE[i]
zauths <-zd[i,authcol][which(!is.na(zd[i,authcol]))]
zdoi <- zd[i,"DOI"]
zyear <- strsplit(zd$DATE[i],split=" |-" ,perl=TRUE)[[1]][1] # first element is the year
print(paste("MSA query for '",substring(ztit,1,50),"..."))
textcall <- paste("http://academic.research.microsoft.com/json.svc/search?AppId=",apikey,"&TitleQuery=",gsub(" ","+",ztit),"&ResultObjects=publication&PublicationContent=AllInfo&StartIdx=1&EndIdx=5",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))
close(con)


# check if the resource is in MSA
if(length(rmas$d$Publication$Result)==0){

msID <-NA; ctit<-"not_found"; cdoi<-NA; mj <- 1
print(paste("no MSA results for '",substring(ztit,1,50),"..."))
} else {
print(paste("found MSA results for '",substring(ztit,1,50),"..."))
# authors
ctit <- gettits(rmas)
cauths <- getauths(rmas)


# first try matching by the doi
if(!is.na(zdoi)){
cdoi <- unlist(getDOI(rmas))
mj <- which(cdoi == zdoi)[1]
if(!is.na(mj)){
msID <- getMSID(rmas)[[mj]]
}
}
if(is.null(msID)){
# title correlation
Rtit <- unlist(lapply(ctit, function(mstit,ztit) fuzzycor(mstit,ztit), ztit=ztit))
# fraction of the Zotero authors found in the MSA author listing
Rauths <- unlist(lapply(cauths, function(msauths,zauths){
mean(sapply(zauths, function(auth){ length(agrep(auth, msauths,ignore.case=TRUE))>0}))},
zauths=zauths))
# year agreement (1 if the years match, 0 otherwise)
Ryear <- unlist(lapply(getyear(rmas), function(cyr){ 1*(cyr==zyear)}))
Rs <- data.frame(do.call('cbind', list(Rtit,Rauths,Ryear)))
mincrit <- apply(Rs, 1, function(rw) all(rw > 0.8)) # a candidate must score high on all three criteria
keep <- which(mincrit) # keep the original row positions, so mj still indexes into the MSA results
if(length(keep)!=0){
mj <- keep[which.max(rowMeans(Rs[keep,,drop=FALSE]))]
msID <- getMSID(rmas)[[mj]]
} else {
msID <- NA
}
} #is.null(msID)
} # resource found
iddb <- rbind(iddb, data.frame(ITEMID=zd$ITEMID[i], msID=msID,ztitle=as.character(ztit),mstitle=as.character(ctit[mj]),zdoi=zdoi,msdoi=cdoi[mj]))
}
iddb2 <- iddb[which(!is.na(iddb$msID)),] # remove not found
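# (optional, not part of the original workflow) cache the matched IDs to the local file
# defined above as iddbf, so step one need not be repeated on every run:
# write.csv(iddb2, iddbf, row.names=FALSE)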


# STEP TWO: query the MS id and learn all the "cited" and "cited by" references
# loop through found MSA records
refs <- list(cites=as.list(rep(NA,nrow(iddb2))),cited=as.list(rep(NA,nrow(iddb2))))
names(refs$cited) <- names(refs$cites) <- iddb2$ITEMID
# make a db to store the journal results
pdb <- data.frame(msID=NULL, cited=NULL, title=NULL, author1=NULL, year=NULL, journal=NULL, doi=NULL,url=NULL)


for(i in 1:nrow(iddb2)){

# find other works which cite the current article
msID <- iddb2[i,"msID"]
textcall <- paste("http://academic.research.microsoft.com/json.svc/search?AppId=",apikey,"&PublicationID=",msID,"&ReferenceType=Citation&PublicationContent=AllInfo&StartIdx=1&EndIdx=50&OrderType=CitationCount",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))
close(con)
refs$cites[[i]] <- unlist(getMSID(rmas))
if(!is.null(unlist(getMSID(rmas)))){
pdb <- rbind(pdb,
data.frame(msID=unlist(getMSID(rmas)),
cited=unlist(lapply(rmas$d[[8]][[5]],function(lx){lx$CitationCount})),
title=unlist(gettits(rmas)),
author1=unlist(lapply(getauths(rmas), function(lx2) lx2[[1]])),
year=unlist(getyear(rmas)),
journal=unlist(getpub(rmas)),
doi=unlist(getDOI(rmas)),
url=unlist(geturl(rmas))
))
}
# find other works which are cited by the current article
msID <- iddb2[i,"msID"]
textcall <- paste("http://academic.research.microsoft.com/json.svc/search?AppId=",apikey,"&PublicationID=",msID,"&ReferenceType=Reference&PublicationContent=AllInfoIdx=1&EndIdx=50&OrderType=CitationCount",sep="")
con<- url(description=textcall)
rmas <- fromJSON(readLines(con))
close(con)
refs$cited[[i]] <- unlist(getMSID(rmas))
if(!is.null(unlist(getMSID(rmas)))){
pdb <- rbind(pdb,
data.frame(msID=unlist(getMSID(rmas)),
cited=unlist(lapply(rmas$d[[8]][[5]],function(lx){lx$CitationCount})),
title=unlist(gettits(rmas)),
author1=unlist(lapply(getauths(rmas), function(lx2) lx2[[1]])),
year=unlist(getyear(rmas)),
journal=unlist(getpub(rmas)),
doi=unlist(getDOI(rmas)),
url=unlist(geturl(rmas))
))
}
}
names(refs$cited) <- names(refs$cites) <- iddb2$ITEMID
pdb <- unique(pdb)


# STEP THREE: tally results
citetally <- list(cites=NA,cited=NA) # storage for results (back/forwards citations)
for(i in 1:length(refs)){
tpt <- table(unlist(refs[[i]]))
tpt <- tpt[order(tpt,decreasing=TRUE)]
alreadyhave <- names(tpt)[which(names(tpt) %in% iddb2$msID)]
tpt <- tpt[!(names(tpt) %in% alreadyhave)] # drop works already in the Zotero library
tptdb <- data.frame(msID=names(tpt),zcount=as.numeric(tpt))
citetally[[i]] <- sqldf("SELECT pdb.*,tptdb.zcount as 'zcount' FROM tptdb LEFT JOIN pdb ON tptdb.msID=pdb.msID")
rm(tpt,tptdb,alreadyhave)
}


# STEP FOUR A: save final output as CSV
write.csv(citetally[[1]], paste(msadir,"most_citedby.csv",sep=""),row.names=FALSE)
write.csv(citetally[[2]], paste(msadir,"most_cited.csv",sep=""),row.names=FALSE)


# STEP FOUR B: save final output as html to open links
# best way to get things into firefox :)
sink(paste(msadir,"mostcitedby.html",sep=""))
cat(paste("<html><body>zcount is the number of articles in your database cited by the focal article<br><table><tr><th>",
paste(names(citetally[[1]])[c(9,2,4,5,3,7,8)],collapse="</th><th>"),"</th></tr>"))
htmltxt <- apply(citetally[[1]],1,function(x){
paste("<tr><td>",x[9],"</td><td>",x[2],"</td><td>",x[4],"</td><td>",x[5],
paste("</td><td><a href='http://scholar.google.com/scholar?hl=en&q=",
gsub(" ","+",x[3]),collapse="",sep=""),
"' target='_blank'>",x[3],"</a></td><td><a href='http://dx.doi.org/",x[7],"' target='_blank'>",x[7],
"</a></td><td><a href='",x[8],"' target='_blank'>website</a></td></tr>\n",sep="")})
cat(paste(htmltxt,collapse="",sep=""))
cat("</table></body></html>")
sink()
# 2nd webpage
sink(paste(msadir,"mostcited.html",sep=""))
cat(paste("<html><body>zcount is the number of articles in your database which cite the focal article<table><tr><th>",
paste(names(citetally[[1]])[c(9,2,4,5,3,7,8)],collapse="</th><th>"),"</th></tr>"))
htmltxt <- apply(citetally[[2]],1,function(x){
paste("<tr><td>",x[9],"</td><td>",x[2],"</td><td>",x[4],"</td><td>",x[5],
paste("</td><td><a href='http://scholar.google.com.proxy.lib.uiowa.edu/scholar?hl=en&q=",
gsub(" ","+",x[3]),collapse="",sep=""),
"' target='_blank'>",x[3],"</a></td><td><a href='http://dx.doi.org/",x[7],"' target='_blank'>",x[7],
"</a></td><td><a href='",x[8],"' target='_blank'>website</a></td></tr>\n",sep="")})
cat(paste(htmltxt,collapse="",sep=""))
cat("</table></body></html>")
sink()

And some example output...


Why R? The above script just serves as a one-stop shop for SQL and JSON processing. On the side, I also use R's wonderful visualization tools and matrix-processing facilities to play around with authors and keywords (see the sketch below). But really, the above script could probably run more efficiently in Python or Java.
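As a minimal sketch of that kind of side-analysis (assuming the zd data frame from the script above is still in memory, with its AUTHOR_*_LAST columns from the SQL query), the following tallies and plots the most frequent author last names in the library:

# tally and plot the most frequent author last names in the Zotero library
authcols <- grep("AUTHOR_[[:digit:]]+_LAST", names(zd))
auth_counts <- sort(table(unlist(zd[, authcols])), decreasing = TRUE)
barplot(head(auth_counts, 20), las = 2, cex.names = 0.7,
        main = "Most frequent authors in my Zotero library")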

A special thanks to the post by Royce Kimmons at http://royce.kimmons.me/tutorials/zotero_to_excel for the SQL command to access Zotero databases.

BTW, in case you're wondering why I'm mixing two open-source projects with a Microsoft product: other free online tools such as CiteULike or CiteSeerX do NOT provide the needed forward-citation or backward-citation information, neither through an API nor through web scraping. I'd love to hear about alternatives.
