PubMed: Finding Similar Authors¶

Here we are interested in identifying similar researchers based on content they have published. We will be using PubMed articles by researchers from the University of New Mexico College of Pharmacy and School of Medicine.

Methods¶

1. Download article data from PubMed:
   1. From pubmed.gov, find articles published by the College of Pharmacy using the search query "college of pharmacy"[AD] AND "new mexico"[AD] and save the results to .csv.
   2. From pubmed.gov, find articles published by the School of Medicine using the search query "school of medicine"[AD] AND "new mexico"[AD] and save the results to .csv.
2. For each author, append the title of every article they are associated with to a single string.
3. Compute an Author-Term matrix which contains the term frequencies for each author as computed from article titles.
4. Create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix to identify author-specific keywords.
5. Use hierarchical clustering to cluster authors based on keyword similarity.

# load required packages
require(data.table)
require(dplyr)
require(tidyr)
require(tm)
require(NLP)
require(ggplot2)
require(ape)
require(reshape2)
options(repr.plot.width = 10, repr.plot.height=10)

Creating the dataset¶

After each article set has been downloaded from PubMed, the two files can be read and combined into a single table containing the authors and article titles. In the code below we also create a separate column named affiliation that records which school the author is affiliated with.

# read in the data containing authors and article titles for both the college of pharmacy and school of medicine
cop = fread("cop.csv")
som = fread("som.csv")

# label author affiliation
cop$affiliation = 'Pharmacy'
som$affiliation = 'Medicine'

# combine into a single data frame
pubmed = rbind(cop, som)
head(pubmed, 1)

In the output, we can see that the first article has 6 authors associated with it. We will need to standardize the author names by converting them to upper case and then split the author list on ',' in order to separate them.

# Each article contains multiple authors
pubmed = mutate(pubmed, Description = strsplit(toupper(Description), ","))

Finally, we need to split the list apart so that each author is separated from the rest. This can be done using the unnest() function from the tidyr package. Once we are finished with this step, we should have a data frame in which each row is one author of one article.

pubmed = pubmed %>% unnest(Description)
head(pubmed,2)
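To make the reshaping concrete, here is a minimal base R sketch of what the strsplit() and unnest() steps do together. The two-article data frame and its author names are hypothetical, and the sketch trims whitespace immediately, whereas the pipeline above leaves that to a later cleaning step.

```r
# hypothetical toy data: one row per article, authors in one comma-separated string
articles <- data.frame(
  Title = c("Paper one", "Paper two"),
  Description = c("Smith J, Doe A, Lee K", "Doe A"),
  stringsAsFactors = FALSE
)

# upper-case and split each author string on ",", as in the mutate() call above
author_list <- strsplit(toupper(articles$Description), ",")

# repeat each article row once per author, then attach the trimmed author names
row_index <- rep(seq_len(nrow(articles)), lengths(author_list))
long <- articles[row_index, "Title", drop = FALSE]
long$Description <- trimws(unlist(author_list))
rownames(long) <- NULL

print(long)
```

The first article contributes three rows and the second contributes one, which is the same one-row-per-author shape that unnest() produces.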

In the output of this step we can see that the author list has been split apart: each row now contains one author of the article along with the title and other metadata. Next we need to group by the author column (Description) and append all of the titles together into a single string. This can be done using the dplyr package to group the data and a simple user-defined function, clean.text, which converts the text to upper case, removes every character that is not a letter or whitespace, and drops very short words.

clean.text = function(x, min=1, max=3){
    # upper-case, then remove everything except letters, spaces, and hyphens
    s = gsub('[^A-Za-z -]', '', toupper(x))
    # replace "-" with " "
    s = gsub("-", " ", s)
    # remove all words whose length is between min and max characters
    gsub(sprintf("\\b[A-Za-z]{%d,%d}\\b", min, max), "", s)
}

clean_names = function(x) trimws(gsub("[^A-Za-z ]", "", toupper(x)))
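As a quick check of what the title cleaning does, the sketch below applies the same three steps to one made-up title. The function name clean_text_sketch and the example title are hypothetical, and perl = TRUE is an assumption added here for predictable word-boundary handling; it is not part of the original function.

```r
# hypothetical stand-in for clean.text; min and max bound the word lengths to drop
clean_text_sketch <- function(x, min = 1, max = 3) {
  # upper-case, then keep only letters, spaces, and hyphens
  s <- gsub("[^A-Z -]", "", toupper(x))
  # hyphens become spaces so hyphenated words split into their parts
  s <- gsub("-", " ", s)
  # drop words whose length falls between min and max characters
  pattern <- sprintf("\\b[A-Z]{%d,%d}\\b", min, max)
  gsub(pattern, "", s, perl = TRUE)
}

title <- "Beta-blockers and the risk of stroke: a meta-analysis"
cleaned <- clean_text_sketch(title)

# the removed words leave extra spaces behind, so split on runs of spaces
tokens <- strsplit(trimws(cleaned), " +")[[1]]
print(tokens)
```

Short function words such as AND, THE, OF and A disappear, which removes many stopwords before the tm package is involved at all.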

# group by author, concatenate the title text into a single string, and keep the
# top 200 authors based on total number of publications
pubmed = pubmed %>%
  mutate(Author = clean_names(Description)) %>%
  group_by(Author) %>%
  summarize(title.text = clean.text(paste(Title, collapse = ' ')),
            pub.num = n(),
            affiliation = names(sort(table(affiliation), decreasing=TRUE)[1])) %>% # keep the most common affiliation
  filter(Author != "ET AL") %>%
  arrange(desc(pub.num)) %>%
  head(200)

head(pubmed %>% select(Author, title.text),1)
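The affiliation rule inside summarize() is a simple majority vote over the rows belonging to one author. A minimal sketch, using a hypothetical author who appears four times in the Pharmacy results and twice in the Medicine results:

```r
# hypothetical affiliation labels for the six rows of a single author
affiliation <- c("Pharmacy", "Pharmacy", "Medicine", "Pharmacy", "Pharmacy", "Medicine")

# count how often each label occurs, then take the name of the largest count
counts <- table(affiliation)
majority <- names(sort(counts, decreasing = TRUE))[1]

print(counts)
print(majority)
```

An author with an equal number of rows in both sets still receives exactly one label, so ties are worth inspecting by hand if the affiliation colors matter.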

We now have a data frame where each row represents a unique author along with a concatenation of every title they have authored. Next we need to create the Author-Term matrix, which can be done using the tm package.

Create the Author-Term Matrix¶

Now that we have a data frame containing authors and concatenated titles, we can easily make an author-term matrix which contains the term frequencies for each author according to how frequently the terms were used in article titles.

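# create the corpus using the concatenated article title text for each author
corpus = VCorpus(VectorSource(pubmed$title.text))
# lower-case the text first: removeWords is case-sensitive and the stopword list is lower case
corpus = tm_map(corpus, content_transformer(tolower))
# remove stopwords
mycorp = tm_map(corpus, removeWords, stopwords('english'))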

# create the author term matrix 
dtm = DocumentTermMatrix(mycorp)
rownames(dtm) = pubmed$Author
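For readers who want to see what an author-term matrix looks like without the tm machinery, here is a minimal base R sketch. The two authors and their title strings are hypothetical, and the sketch skips the stopword removal and other defaults that DocumentTermMatrix applies.

```r
# hypothetical cleaned title strings, one per author
titles <- c("AUTHOR A" = "arsenite zinc arsenite ischemia",
            "AUTHOR B" = "adverse drug events drug")

# split each string into words and collect the shared vocabulary
tokens <- strsplit(titles, " +")
vocab <- sort(unique(unlist(tokens)))

# count how often each vocabulary word occurs for each author
counts <- vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
)

# rows are authors, columns are terms
atm <- t(counts)
colnames(atm) <- vocab
print(atm)
```

Each row is one author's term-frequency vector; these rows are what get weighted and compared in the following sections.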

Create Term Frequency Inverse Document Frequency Matrix¶

The term frequency-inverse document frequency (TF-IDF) is a measure that reflects how important a word is to a particular document. It is essentially the number of times a word appears in a document, offset by how common the word is across the corpus as a whole. In our case, it is the number of times an author uses a word in the titles of their articles, offset by how many other authors use the same word in theirs.

# compute the tf-idf matrix
tfidf = weightTfIdf(dtm, normalize = T)
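The weighting can be checked by hand on a toy matrix. The sketch below assumes the formula described in the tm documentation for weightTfIdf with normalization: each count is divided by the document's total count, then multiplied by log2 of the number of documents over the number of documents containing the term. The two-author count matrix is hypothetical.

```r
# hypothetical author-term counts: rows are authors, columns are terms
counts <- matrix(
  c(2, 1, 0,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("AUTHOR A", "AUTHOR B"), c("arsenite", "zinc", "cancer"))
)

# term frequency, normalized by each author's total word count
tf <- counts / rowSums(counts)

# inverse document frequency: log2(number of authors / authors using the term)
doc_freq <- colSums(counts > 0)
idf <- log2(nrow(counts) / doc_freq)

# multiply every column of tf by the idf of its term
tfidf_toy <- sweep(tf, 2, idf, "*")
print(round(tfidf_toy, 3))
```

The term "zinc" is used by both authors, so its weight drops to zero, while "arsenite" and "cancer" each identify a single author.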

Since the TF-IDF weight reflects a word's importance to a particular document (or in this case, author), we can list the top 10 terms (ranked by TF-IDF weight) for each author to see which terms are most descriptive of their research.

# Find the top 10 terms by TF-IDF for each author
top.terms = apply(tfidf, 1, function(x) colnames(tfidf)[order(x, decreasing = T)[1:10]])
top.terms[,1:5]

From the output we can see that Dr. Jim Liu's research seems to focus on arsenite, ischemia, and the brain, whereas Dr. Dennis Raisch's top terms include "adverse", "events", and "reactions". This seems appropriate given Dr. Raisch's experience in mining adverse drug events from the FDA Adverse Event Reporting System.

Cluster Authors by TFIDF Weighted Article Terms¶

Now that we've computed the TF-IDF weights for each word and each author, we can use hierarchical clustering to group authors based on title term similarity. Authors who use similar words in the titles of their papers will tend to cluster together, which lets us see interesting patterns such as which authors are publishing on similar topics. One thing to note is that when there are many more variables (i.e. terms) than documents (i.e. authors), nonsensical groups may emerge. In our case we keep the 200 most-published authors and have several thousand distinct terms, so this caveat applies and the groups should be checked against each author's top terms, as we do below.

hc = hclust(dist(tfidf))
color = pubmed$affiliation
color[color=="Pharmacy"] = "#64706c"
color[color=="Medicine"] = "#935347"

svg('unigrams.svg', width=10, height=10)
plot(as.phylo(hc), main="PubMed Author Similarity by Unigrams", 
     tip.color = color, type="fan", cex=0.7, label.offset = 0.05)
dev.off()

[Figure: unigrams.svg, fan dendrogram titled "PubMed Author Similarity by Unigrams"]
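The clustering step can be seen in miniature with a few hand-made weight vectors. This is a sketch with hypothetical authors A to D; it uses the same hclust(dist(...)) call with its default settings, and adds cutree() to turn the tree into explicit group labels, which the analysis above does not do.

```r
# hypothetical TF-IDF style weights: A and B are identical, C and D nearly so
weights <- rbind(
  A = c(1, 0, 0),
  B = c(1, 0, 0),
  C = c(0, 1, 1),
  D = c(0, 1, 0.9)
)

# Euclidean distances between rows, then agglomerative clustering
hc_toy <- hclust(dist(weights))

# cut the tree into two groups to read off the cluster membership
groups <- cutree(hc_toy, k = 2)
print(groups)
print(hc_toy$height)
```

Authors with identical vectors merge at height zero, which is how pairs of constant co-authors show up in the dendrogram.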

Inspecting the Groups¶

From the dendrogram above we can see clear groups emerge. Some groups cluster tightly while others are more loosely connected. Some authors sit at zero distance from each other, meaning their word vectors are identical, which suggests that these authors appear together on the same papers. In this next section we will write a function to extract the top 10 terms for a particular author.

# function for returning top 10 terms for particular author
get_top_terms = function(author, data, n = 10){
    colnames(data)[order(as.matrix(data[author,]), decreasing=T)[1:n]]
}
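As a small sanity check, the same function can be run on a plain numeric matrix. The weights and author labels below are hypothetical; the function body is the one defined above.

```r
# same function as above, repeated so the example runs on its own
get_top_terms <- function(author, data, n = 10) {
  colnames(data)[order(as.matrix(data[author, ]), decreasing = TRUE)[1:n]]
}

# hypothetical weights for two authors over four terms
toy <- matrix(
  c(0.1, 0.5, 0.0, 0.3,
    0.4, 0.0, 0.2, 0.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("AUTHOR A", "AUTHOR B"),
                  c("zinc", "arsenite", "stroke", "brain"))
)

top_a <- get_top_terms("AUTHOR A", toy, n = 2)
top_b <- get_top_terms("AUTHOR B", toy, n = 2)
print(top_a)
print(top_b)
```

Asking for more terms than there are columns would pad the result with NA, so n should not exceed the vocabulary size.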

Let's look at the top terms for some tightly clustered authors, for example GARVER WS and JELINEK D.

GARVER WS and JELINEK D¶

get_top_terms('GARVER WS', tfidf)

get_top_terms('JELINEK D', tfidf)

As expected, these authors have identical word vectors, suggesting that they have similar research interests. In fact, it is likely that these two authors are co-authors.

ANTONCULVER H and ABEN KK¶

get_top_terms('ANTONCULVER H', tfidf)
get_top_terms('ABEN KK', tfidf)

PAI MP and MERCIER RC¶

get_top_terms('PAI MP', tfidf)
get_top_terms('MERCIER RC', tfidf)

It's clear that the above authors frequently publish on infectious disease.

Bigram Clustering¶

In the first example we broke the text apart into individual words; however, it may be useful to retain some of the context. This can be accomplished by using a bigram tokenizer, which splits the text into bigrams (chunks of two consecutive words). We can use the NLP package to create a bigram tokenizer.

bigram = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse=" "), use.names=F)

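corpus = VCorpus(VectorSource(pubmed$title.text))
# lower-case first so that the lower-case stopword list actually matches
corpus = tm_map(corpus, content_transformer(tolower))
my_corpus = tm_map(corpus, removeWords, stopwords('english'))
# build the matrix from the stopword-filtered corpus, not the raw one
dtm = DocumentTermMatrix(my_corpus, control=list(tokenize = bigram))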
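To see what bigram tokenization produces, here is a minimal base R sketch that pairs each word with its successor. The function name bigrams_sketch and the example phrase are hypothetical, and it does not use the NLP package's ngrams() that the tokenizer above relies on.

```r
# hypothetical helper: split on spaces, then paste each word to the next one
bigrams_sketch <- function(x) {
  w <- strsplit(trimws(x), " +")[[1]]
  # a single word (or empty string) has no bigrams
  if (length(w) < 2) return(character(0))
  paste(w[-length(w)], w[-1])
}

phrase <- "methicillin resistant staphylococcus aureus"
pairs <- bigrams_sketch(phrase)
print(pairs)
```

A string of k words yields k - 1 overlapping bigrams, so the bigram vocabulary is usually much larger and sparser than the unigram one.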

Compute TFIDF Weights on Bigrams¶

tfidf = weightTfIdf(dtm)
rownames(tfidf) = pubmed$Author

Inspect the Top 10 Bigrams from the Top 5 Authors¶

# Find the top 10 terms by TF-IDF for each author
top.terms = apply(tfidf, 1, function(x) colnames(tfidf)[order(x, decreasing = T)[1:10]])
top.terms[,1:5]

Clustering Authors by Bigrams¶

hc = hclust(dist(tfidf))
color = pubmed$affiliation
color[color=="Pharmacy"] = "#64706c"
color[color=="Medicine"] = "#935347"

svg('bigrams.svg', width=10, height=10)
par(mai=c(1,1,1,1))
plot(as.phylo(hc), main="PubMed Author Similarity by Bigrams", 
     tip.color = color, type="fan", cex=0.7, label.offset = 0.05)
dev.off()

[Figure: bigrams.svg, fan dendrogram titled "PubMed Author Similarity by Bigrams"]

For the most part the groups revealed by the bigram model are similar to those in the unigram model; however, with the bigram model some of the groups appear more tightly clustered. Below we will look at the top bigrams for some of these authors.

Inspecting Top Bigrams¶

get_top_terms('MERCIER RC', tfidf)
get_top_terms('PAI MP', tfidf)

Here we see that the top bigrams are similar to those from the unigram model in that they all relate to infectious disease; however, we gain a little more information using the bigram model. For instance, we see that Mercier appears to focus on methicillin resistance and staphylococcus aureus, as well as vancomycin and piperacillin-tazobactam.

get_top_terms('ANDERSON JR', tfidf)
get_top_terms('NAWARSKAS JJ', tfidf)

It's no surprise that the cardiovascular pharmacists are publishing papers related to hypertension, heart failure, stents, and statins.

Opportunities for Collaboration¶

Now that we have successfully clustered authors into similar groups, we can potentially see which authors would collaborate well together. For example, let's look at Rey GM and Shah VO. I happen to know that both Dr. Rey (College of Pharmacy) and Dr. Shah (School of Medicine) specialize in diabetes. In addition, these authors cluster near each other in the dendrogram. A quick review of the dataset shows that they published a paper together, "Comparison of the fatty acid composition of the serum phospholipids of controls, prediabetics and adults with type 2 diabetes", suggesting that they have collaborated in the past. What other examples of interdisciplinary collaboration are apparent in the graph? I will leave that exercise to some of the more intrepid readers.

Limitations¶

This method has some limitations. Most notably, nonsensical groups can emerge when there are many more features (terms) than authors. In our case we have only 200 authors compared to 3,644 terms, so the clusters should be interpreted with this in mind. Another issue is how author affiliation was assigned. In this project I assigned each author to either "Pharmacy" or "Medicine" depending on which dataset the name appeared in most frequently. For example, if an author appeared 4 times in the College of Pharmacy search results and only 2 times in the School of Medicine results, they received the Pharmacy affiliation. This may misclassify some authors. A further limitation is that the name used to identify an author can change from paper to paper. For example, I may publish one paper as Bernauer ML and the next as Bernauer M. No effort was made to correct for this, so some authors may have been split across several names rather than grouped correctly.
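One possible way to reduce the name-variation problem, which was not done in this analysis, is to group authors on a coarser key such as surname plus first initial. The sketch below is an assumption-laden illustration: the helper author_key is hypothetical, it expects names in "SURNAME INITIALS" form, and it will wrongly merge different people who share a surname and first initial.

```r
# hypothetical helper: reduce "SURNAME INITIALS" to "SURNAME" plus the first initial
author_key <- function(x) {
  vapply(strsplit(trimws(x), " +"), function(parts) {
    # names without a separate initials token are returned unchanged
    if (length(parts) < 2) return(paste(parts, collapse = " "))
    surname <- paste(parts[-length(parts)], collapse = " ")
    first_initial <- substr(parts[length(parts)], 1, 1)
    paste(surname, first_initial)
  }, character(1))
}

authors <- c("BERNAUER ML", "BERNAUER M", "LIU KJ")
keys <- author_key(authors)
print(keys)
```

Grouping on such a key instead of the full cleaned name trades fewer split authors for more accidental merges, so the result would need a manual check.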

Output of head(pubmed, 2) after unnest():
	Title	URL	Details	ShortDetails	Resource	Type	Identifiers	Db	EntrezUID	Properties	affiliation	Description
1	'Ethical responsibility' or 'a whole can of worms': differences in opinion on incidental finding review and disclosure in neuroimaging research from focus group discussions with participants, parents, IRB members, investigators, physicians and community members.	/pubmed/26063579	J Med Ethics. 2015 Oct;41(10):841-7. doi: 10.1136/medethics-2014-102552. Epub 2015 Jun 10.	J Med Ethics. 2015	PubMed	citation	PMID:26063579	pubmed	26063579	create date:2015/06/13 | first author:Cole C	Pharmacy	COLE C
2	'Ethical responsibility' or 'a whole can of worms': differences in opinion on incidental finding review and disclosure in neuroimaging research from focus group discussions with participants, parents, IRB members, investigators, physicians and community members.	/pubmed/26063579	J Med Ethics. 2015 Oct;41(10):841-7. doi: 10.1136/medethics-2014-102552. Epub 2015 Jun 10.	J Med Ethics. 2015	PubMed	citation	PMID:26063579	pubmed	26063579	create date:2015/06/13 | first author:Cole C	Pharmacy	PETREE LE

Top 10 unigrams by TF-IDF weight for the first five authors (output of top.terms[,1:5]):
LIU KJ	BURCHIEL SW	HUDSON LG	GLEW RH	RAISCH DW
arsenite	aromatic	epidermal	fatty	adverse
ischemia	polycyclic	arsenite	nigeria	methods
cerebral	hydrocarbons	factor	sickle	reactions
hyperoxia	benzoapyrene	growth	nigerian	drug
normobaric	cell	receptor	northern	prescribing
zinc	calcium	ovarian	serum	pediatric
brain	following	cells	children	food
focal	cells	arsenic	fulani	events
ischemic	dimethylbenzaanthracene	zinc	disease	oncology
oxygen	line	keratinocytes	acids	administrations

Top 10 bigrams by TF-IDF weight for the first five authors (output of top.terms[,1:5]):
LIU KJ	BURCHIEL SW	HUDSON LG	GLEW RH	RAISCH DW
normobaric hyperoxia	aromatic hydrocarbons	epidermal growth	sickle cell	food drug
focal cerebral	polycyclic aromatic	factor receptor	northern nigeria	drug administrations
cerebral ischemia	cell line	growth factor	cell disease	adverse drug
ischemic stroke	mammary epithelial	cancer cells	fatty acids	cost effectiveness
electron paramagnetic	human mammary	ultraviolet radiation	fatty acid	educational methods
paramagnetic resonance	following oral	ovarian cancer	nigerian children	influencing prescribing
tissue oxygenation	rhesus monkeys	polyadp ribose	with sickle	methods influencing
ischemia reperfusion	human cells	contributions epidermal	acid composition	model methods
matrix metalloproteinase	intracellular calcium	zinc finger	serum phospholipids	part review
oxygen species	flow cytometry	monomethylarsonous acid	nigeria serum	payment methods