Here we are interested in identifying similar researchers based on the content they have published. We will be using PubMed articles by researchers from the University of New Mexico College of Pharmacy and School of Medicine. To build the dataset, run each of the following PubMed searches:
"college of pharmacy"[AD] AND "new mexico"[AD]
and save results to .csv
"school of medicine"[AD] AND "new mexico"[AD]
and save results to .csv
# load required packages
require(data.table)
require(dplyr)
require(tidyr)
require(tm)
require(NLP)
require(ggplot2)
require(ape)
require(reshape2)
options(repr.plot.width = 10, repr.plot.height=10)
After each article set has been downloaded from PubMed, the two files can be read and combined into a single table containing the authors and article titles. In the code below we also create a separate column named affiliation, which records the school the author is affiliated with.
# read in the data containing authors and article titles for both the college of pharmacy and school of medicine
cop = fread("cop.csv")
som = fread("som.csv")
# label author affiliation
cop$affiliation = 'Pharmacy'
som$affiliation = 'Medicine'
# combine into single dataframe
pubmed = rbind(cop, som)
head(pubmed, 1)
In the output above, we can see that the first article has 6 authors associated with it. We will need to standardize the author names by converting them to upper case and then split the string on ',' to separate them.
# Each article contains multiple authors
pubmed = mutate(pubmed, Description = strsplit(toupper(Description), ","))
Finally, we will need to split the list apart so that each author sits on a separate row. This can be done using the unnest() function from the tidyr package. Once we are finished with this step we should have a dataframe in which each row is one author of one article.
pubmed = pubmed %>% unnest(Description)
head(pubmed,2)
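The split-then-unnest step can be tried on a small example. This is a minimal sketch on made-up data: the two toy rows and the names toy and toy_long are assumptions for illustration, and only the column names Title and Description are taken from the code above.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): one row per article, authors in one comma-separated string
toy = data.frame(
  Title = c("Paper A", "Paper B"),
  Description = c("Smith J, Doe A", "doe a"),
  stringsAsFactors = FALSE
)

# Upper-case the author string and split on commas to get a list column
toy = mutate(toy, Description = strsplit(toupper(Description), ","))

# Expand the list column so each author gets its own row
toy_long = toy %>% unnest(Description)

# Splitting on "," leaves a leading space on later authors, so trim it
toy_long$Description = trimws(toy_long$Description)
print(toy_long)
```

The first article contributes two rows and the second contributes one, and the two spellings of the second author collapse to the same upper-case name.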
In the output above we can see that the author list has been split apart and each row now contains one of the authors of the article along with the title and other metadata. Next we will need to group by the author column (Description) and append all of the titles together into a single string. This can be done using the dplyr package to group the data and a simple user-defined function, clean.text, which converts the text to upper case, removes every character that is not a letter, space, or hyphen, and drops very short words.
clean.text = function(x, min=1, max=3){
  # upper-case, then drop everything except letters, spaces, and hyphens
  s = gsub('[^A-Za-z -]', '', toupper(x))
  # replace "-" with " "
  s = gsub("-", " ", s)
  # remove all words that are between min and max characters in length
  gsub(sprintf("\\b[A-Za-z]{%d,%d}\\b", min, max), "", s)
}
clean_names = function(x) trimws(gsub("[^A-Za-z ]", "", toupper(x)))
# group by author, concatenate the title text into a single string, and keep the top
# 200 authors by total number of publications
pubmed = pubmed %>%
  mutate(Author = clean_names(Description)) %>%
  group_by(Author) %>%
  summarize(title.text = clean.text(paste(Title, collapse = ' ')),
            pub.num = n(),
            affiliation = names(sort(table(affiliation), decreasing=TRUE)[1])) %>% # keep the most common affiliation
  filter(Author != "ET AL") %>%
  arrange(desc(pub.num)) %>%
  head(200)
head(pubmed %>% select(Author, title.text),1)
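To see what the title cleaning does, it can help to run it on a single title. The sketch below repeats the cleaning function so that it runs on its own, with the word-length bounds taken from the min and max arguments; the example title is made up.

```r
# Same cleaning logic as above, with the word-length bounds taken from the arguments
clean.text = function(x, min = 1, max = 3) {
  s = gsub("[^A-Za-z -]", "", toupper(x))  # keep letters, spaces, and hyphens only
  s = gsub("-", " ", s)                    # treat hyphens as word separators
  gsub(sprintf("\\b[A-Za-z]{%d,%d}\\b", min, max), "", s)  # drop short words
}

# Hypothetical title, used only for illustration
cleaned = clean.text("Beta-blockers and the risk of stroke: a 10-year study")

# Split on runs of spaces to list the words that survive the cleaning
words_left = strsplit(trimws(cleaned), " +")[[1]]
print(words_left)
```

Digits and punctuation are removed, the hyphenated term is split into two words, and short words such as AND, THE, OF, and A are dropped. Note that removing words of up to three letters also discards short but meaningful abbreviations (HIV, for example), which is a trade-off of this simple filter.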
We now have a data frame where each row represents a unique author along with a concatenation of every title they have authored. From this we can build an author-term matrix using the tm package. The matrix contains the term frequencies for each author, that is, how often each term was used in that author's article titles.
# create the corpus using the concatenated article title text for each author
corpus = VCorpus(VectorSource(pubmed$title.text))
# remove stopwords
mycorp = tm_map(corpus, removeWords, stopwords('english'))
# create the author term matrix
dtm = DocumentTermMatrix(mycorp)
rownames(dtm) = pubmed$Author
Term frequency-inverse document frequency (TF-IDF) is a measure that reflects how important a word is to a particular document. It is essentially the number of times a word appears in a document, offset by how common the word is across the corpus as a whole. In our case, it is the number of times an author uses a word in the titles of their articles, offset by how many other authors use the same word in their titles.
# compute the tf-idf matrix
tfidf = weightTfIdf(dtm, normalize = T)
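The weighting can be worked through by hand on a tiny matrix. The sketch below uses base R only and assumes the formulation I understand weightTfIdf to use when normalize is TRUE: term counts divided by the author's total term count, multiplied by log2 of (number of authors / number of authors using the term). The counts and the names counts, tf, df, idf, and tfidf_toy are made up for illustration.

```r
# Hypothetical author-term counts: rows are authors, columns are title terms
counts = matrix(c(2, 1, 0,
                  0, 1, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("AUTHOR A", "AUTHOR B"),
                                c("ARSENITE", "STUDY", "STENT")))

tf  = counts / rowSums(counts)       # term frequency, normalized by each author's total
df  = colSums(counts > 0)            # number of authors who use each term
idf = log2(nrow(counts) / df)        # inverse document frequency
tfidf_toy = sweep(tf, 2, idf, `*`)   # multiply each term column by its idf

print(round(tfidf_toy, 3))
```

STUDY is used by both authors, so its idf is log2(2/2) = 0 and it receives zero weight for everyone, while ARSENITE and STENT each appear for only one author and keep their full normalized frequency.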
Since the TF-IDF weight reflects a word's importance to a particular document (or in this case, author), we can list the top 10 terms (ranked by TF-IDF weight) for each author to see which terms are most descriptive of their research.
# Find the top 10 terms by TF-IDF for each author
top.terms = apply(tfidf, 1, function(x) colnames(tfidf)[order(x, decreasing = T)[1:10]])
top.terms[,1:5]
From the output above we can see that Dr. Jim Liu's research seems to focus on arsenite, ischemia, and the brain, whereas Dr. Dennis Raisch's research is focused on adverse events and reactions. This seems appropriate given Dr. Raisch's experience in mining adverse drug events from the FDA Adverse Event Reporting System.
Now that we've computed the TF-IDF weights for each word and each author, we can use hierarchical clustering to group authors by the similarity of their title terms. Authors who use similar words in the titles of their papers will tend to cluster together, which lets us see interesting patterns such as which authors are publishing on similar topics. One thing to note is that when there are many more variables (i.e. terms) than documents (i.e. authors), nonsensical groups may emerge. That is the situation here (3,959 different terms against a few hundred authors), so the clusters are best read as suggestive rather than definitive.
hc = hclust(dist(tfidf))
color = pubmed$affiliation
color[color=="Pharmacy"] = "#64706c"
color[color=="Medicine"] = "#935347"
svg('unigrams.svg', width=10, height=10)
plot(as.phylo(hc), main="PubMed Author Similarity by Unigrams",
tip.color = color, type="fan", cex=0.7, label.offset = 0.05)
dev.off()
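The clustering step itself can be illustrated on a handful of made-up weight vectors. This is a sketch only: the matrix toy and its values are hypothetical, and it relies on the defaults of dist (Euclidean distance) and hclust (complete linkage), which are also what the call above uses.

```r
# Hypothetical TF-IDF weights for four authors over three terms
toy = matrix(c(0.9, 0.1, 0.0,
               0.9, 0.1, 0.0,
               0.0, 0.2, 0.8,
               0.1, 0.1, 0.8),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("A1", "A2", "B1", "B2"), c("T1", "T2", "T3")))

d = dist(toy)                    # Euclidean distance between author rows
hc_toy = hclust(d)               # agglomerative clustering, complete linkage
groups = cutree(hc_toy, k = 2)   # cut the tree into two groups
print(groups)

# Authors with identical weight vectors are at distance zero
print(as.matrix(d)["A1", "A2"])
```

A1 and A2 have the same weights, so they merge first at height zero, and cutting the tree into two groups separates the A authors from the B authors.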
From the dendrogram above we can see clear groups emerge. Some groups cluster quite tightly while others are more loosely connected. Some authors sit at zero distance from one another, which means their word vectors are identical; this suggests that these authors co-occur on the same papers. In this next section we will write a function to extract the top 10 terms for a particular author.
# function for returning top 10 terms for particular author
get_top_terms = function(author, data, n = 10){
colnames(data)[order(as.matrix(data[author,]), decreasing=T)[1:n]]
}
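As a quick check of how the lookup behaves, the function can be run against a small named matrix. The matrix toy_weights and its values are hypothetical; the function body is repeated so the sketch runs on its own.

```r
# Same lookup as above: order one author's row by weight and return the term names
get_top_terms = function(author, data, n = 10){
  colnames(data)[order(as.matrix(data[author, ]), decreasing = TRUE)[1:n]]
}

# Hypothetical weights for two authors over three terms
toy_weights = matrix(c(0.1, 0.7, 0.2,
                       0.5, 0.0, 0.4),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("A", "B"), c("T1", "T2", "T3")))

top_a = get_top_terms("A", toy_weights, n = 2)
print(top_a)
```

The author is looked up by row name, so the name must match the cleaned, upper-case form exactly. If n is larger than the number of terms, the extra positions come back as NA.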
Let's look at the top terms for some tightly clustered authors, for example GARVER WS and JELINEK D.
get_top_terms('GARVER WS', tfidf)
get_top_terms('JELINEK D', tfidf)
As expected, these authors have identical word vectors, suggesting that they have similar research interests. In fact, it is likely that these two authors are co-authors.
get_top_terms('ANTONCULVER H', tfidf)
get_top_terms('ABEN KK', tfidf)
get_top_terms('PAI MP', tfidf)
get_top_terms('MERCIER RC', tfidf)
It's clear that the above authors frequently publish on infectious disease.
In the first example we broke the text apart into individual words; however, it may be useful to retain some of the context. This can be accomplished by using a bigram tokenizer, which splits the text into bigrams (chunks of two consecutive words). We can use the NLP package to create a bigram tokenizer.
bigram = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse=" "), use.names=F)
corpus = VCorpus(VectorSource(pubmed$title.text))
my_corpus = tm_map(corpus, removeWords, stopwords('english'))
# build the matrix from the stopword-filtered corpus created on the previous line
dtm = DocumentTermMatrix(my_corpus, control=list(tokenize = bigram))
tfidf = weightTfIdf(dtm)
rownames(tfidf) = pubmed$Author
# Find the top 10 terms by TF-IDF for each author
top.terms = apply(tfidf, 1, function(x) colnames(tfidf)[order(x, decreasing = T)[1:10]])
top.terms[,1:5]
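To make the tokenization concrete, here is a small sketch of what a bigram split produces for one cleaned title string. It uses ngrams() from the NLP package on a plain character vector; the title string and the names tokens and bigrams are made up for illustration.

```r
library(NLP)

# Hypothetical cleaned title text
title = "METHICILLIN RESISTANT STAPHYLOCOCCUS AUREUS"

# Split into words, then build overlapping pairs of consecutive words
tokens = strsplit(title, " +")[[1]]
bigrams = unname(vapply(ngrams(tokens, 2), paste, character(1), collapse = " "))
print(bigrams)
```

Four words yield three overlapping bigrams, so a phrase such as STAPHYLOCOCCUS AUREUS is kept together as a single term instead of being counted as two unrelated words.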
hc = hclust(dist(tfidf))
color = pubmed$affiliation
color[color=="Pharmacy"] = "#64706c"
color[color=="Medicine"] = "#935347"
svg('bigrams.svg', width=10, height=10)
par(mai=c(1,1,1,1))
plot(as.phylo(hc), main="PubMed Author Similarity by Bigrams",
tip.color = color, type="fan", cex=0.7, label.offset = 0.05)
dev.off()
For the most part, the groups revealed by the bigram model are similar to those in the unigram model; however, with the bigram model some of the groups appear to cluster more tightly. Below we will look at the top bigrams for some of these authors.
get_top_terms('MERCIER RC', tfidf)
get_top_terms('PAI MP', tfidf)
Here we see that the top bigrams are similar to those from the unigram model in that they are all related to infectious disease; however, we gain a little more information using the bigram model. For instance, we see that Mercier appears to focus on methicillin resistance and Staphylococcus aureus, as well as vancomycin and piperacillin-tazobactam.
get_top_terms('ANDERSON JR', tfidf)
get_top_terms('NAWARSKAS JJ', tfidf)
It's no surprise that the cardiovascular pharmacists are publishing papers related to hypertension, heart failure, stents, and statins.
Now that we have successfully clustered authors into similar groups, we can potentially see which authors would collaborate well together. For example, let's look at Rey GM and Shah VO. I happen to know that Dr. Rey (College of Pharmacy) and Dr. Shah (School of Medicine) both specialize in diabetes. In addition, these two authors cluster near each other in the dendrogram. A quick review of the dataset shows that they published a paper together, "Comparison of the fatty acid composition of the serum phospholipids of controls, prediabetics and adults with type 2 diabetes", so they have collaborated in the past. What other examples of interdisciplinary collaboration are apparent in the graph? I will leave that exercise to some of the more intrepid readers.
There are some limitations to this method. The most notable is that nonsensical groups can start to emerge when there are many more features (terms) than there are authors. In our case we have only 200 authors compared to 3,644 terms, so this caveat applies and the groupings should be interpreted with caution. Another issue is how author affiliation was assigned. In this project I assigned each author to either "Pharmacy" or "Medicine" depending on which dataset the name appeared in most frequently. For example, if an author had 4 papers in the College of Pharmacy search results and only 2 in the School of Medicine results, they received the Pharmacy affiliation. This may misclassify some authors. A further limitation is that the name used to identify an author can change from paper to paper. For example, I may publish one paper as Bernauer ML and the next as Bernauer M. No effort was made to correct for this, and as a result some authors may have been split across more than one name.
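One possible way to reduce the name-variant problem, which was not done in this analysis, is to group on a coarser key such as the surname plus the first initial. The helper name_key below is hypothetical and assumes names are already in the cleaned "SURNAME INITIALS" form used above.

```r
# Hypothetical helper (not part of the analysis above): reduce "SURNAME INITIALS"
# to "SURNAME" plus the first initial only
name_key = function(x) sub("^(.*) ([A-Z])[A-Z]*$", "\\1 \\2", x)

authors = c("BERNAUER ML", "BERNAUER M", "RAISCH DW")
keys = name_key(authors)
print(keys)
```

Both spellings of the first author now share one key. The trade-off is that two different people with the same surname and first initial would also be merged, so a key like this would still need a manual review before being used for grouping.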