PubMed: Finding Similar Authors

Here we are interested in identifying similar researchers based on content they have published. We will be using PubMed articles by researchers from the University of New Mexico College of Pharmacy and School of Medicine.

Methods

  1. Download article data from PubMed
    1. From pubmed.gov find articles published by the College of Pharmacy using the following search query "college of pharmacy"[AD] AND "new mexico"[AD] and save results to .csv
    2. From pubmed.gov find articles published by the School of Medicine using the following search query "school of medicine"[AD] AND "new mexico"[AD] and save results to .csv
  2. For each author, append each article title they were associated with to a single string
  3. Compute an Author-Term matrix which contains the term frequencies for each author as computed from article titles
  4. Create a Term Frequency-Inverse Document Frequency (TF-IDF) matrix to identify author-specific keywords
  5. Use hierarchical clustering to cluster authors based on keyword similarity
In [2]:
# load required packages
require(data.table)
require(dplyr)
require(tidyr)
require(tm)
require(NLP)
require(ggplot2)
require(ape)
require(reshape2)
options(repr.plot.width = 10, repr.plot.height=10)

Creating the dataset

After each article set is downloaded from PubMed, it can be read in and combined into a single table containing the authors and article titles. In the code below we also create a separate column named affiliation that records which school each author is affiliated with.

In [3]:
# read in the data containing authors and article titles for both the college of pharmacy and school of medicine
cop = fread("cop.csv")
som = fread("som.csv")

# label author affiliation
cop$affiliation = 'Pharmacy'
som$affiliation = 'Medicine'

# combine into a single data frame
pubmed = rbind(cop, som)
head(pubmed, 1)
Out[3]:
Row 1
  Title:        'Ethical responsibility' or 'a whole can of worms': differences in opinion on incidental finding review and disclosure in neuroimaging research from focus group discussions with participants, parents, IRB members, investigators, physicians and community members.
  URL:          /pubmed/26063579
  Description:  Cole C, Petree LE, Phillips JP, Shoemaker JM, Holdsworth M, Helitzer DL.
  Details:      J Med Ethics. 2015 Oct;41(10):841-7. doi: 10.1136/medethics-2014-102552. Epub 2015 Jun 10.
  ShortDetails: J Med Ethics. 2015
  Resource:     PubMed
  Type:         citation
  Identifiers:  PMID:26063579
  Db:           pubmed
  EntrezUID:    26063579
  Properties:   create date:2015/06/13 | first author:Cole C
  affiliation:  Pharmacy

In the output above, we can see that the first article has 6 authors associated with it. We will need to standardize the author names by converting them to upper case and then split on ',' in order to separate them.

In [4]:
# Each article contains multiple authors
pubmed = mutate(pubmed, Description = strsplit(toupper(Description), ","))

Finally, we need to split the list apart so that each author is separated from the rest. This can be done using the unnest() function from the tidyr package. Once this step is finished, we should have a data frame in which each row corresponds to one author of one article.

In [5]:
pubmed = pubmed %>% unnest(Description)
head(pubmed,2)
Out[5]:
Row 1
  Title:        'Ethical responsibility' or 'a whole can of worms': differences in opinion on incidental finding review and disclosure in neuroimaging research from focus group discussions with participants, parents, IRB members, investigators, physicians and community members.
  URL:          /pubmed/26063579
  Details:      J Med Ethics. 2015 Oct;41(10):841-7. doi: 10.1136/medethics-2014-102552. Epub 2015 Jun 10.
  ShortDetails: J Med Ethics. 2015
  Resource:     PubMed
  Type:         citation
  Identifiers:  PMID:26063579
  Db:           pubmed
  EntrezUID:    26063579
  Properties:   create date:2015/06/13 | first author:Cole C
  affiliation:  Pharmacy
  Description:  COLE C
Row 2
  Title:        'Ethical responsibility' or 'a whole can of worms': differences in opinion on incidental finding review and disclosure in neuroimaging research from focus group discussions with participants, parents, IRB members, investigators, physicians and community members.
  URL:          /pubmed/26063579
  Details:      J Med Ethics. 2015 Oct;41(10):841-7. doi: 10.1136/medethics-2014-102552. Epub 2015 Jun 10.
  ShortDetails: J Med Ethics. 2015
  Resource:     PubMed
  Type:         citation
  Identifiers:  PMID:26063579
  Db:           pubmed
  EntrezUID:    26063579
  Properties:   create date:2015/06/13 | first author:Cole C
  affiliation:  Pharmacy
  Description:  PETREE LE
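
The split-and-unnest steps above can also be sketched in base R on made-up input. The data frame `toy`, the helper `split_authors`, and all values below are illustrative assumptions, not part of the PubMed export, and the trimming is done here in one pass rather than later as in the notebook:

```r
# Toy stand-in for the PubMed export; titles and names are illustrative only
toy <- data.frame(
  Title = c("Paper A", "Paper B"),
  Description = c("Cole C, Petree LE, Phillips JP.", "Petree LE, Cole C."),
  stringsAsFactors = FALSE
)

# Upper-case, split on commas, then strip punctuation and surrounding spaces
split_authors <- function(x) {
  parts <- strsplit(toupper(x), ",", fixed = TRUE)
  lapply(parts, function(p) trimws(gsub("[^A-Z ]", "", p)))
}

authors <- split_authors(toy$Description)

# One row per (title, author) pair, similar in spirit to tidyr::unnest()
long <- data.frame(
  Title = rep(toy$Title, lengths(authors)),
  Author = unlist(authors),
  stringsAsFactors = FALSE
)
print(long)
```

Each title is repeated once per author, which is the shape the grouping step in the next cell relies on.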

In the output above we can see that the author list has been split apart and each row now contains one of the authors of the article along with the title and other metadata. Next we need to group by the author column (Description) and append all of the titles together into a single string. This can be done using the dplyr package to group the data and a simple user-defined function, clean.text, which converts the text to upper case, removes everything except letters and whitespace, and drops very short words.

In [6]:
clean.text = function(x, min=1, max=3){
    # upper-case, then remove everything except letters, spaces and hyphens
    s = gsub('[^A-Za-z -]', '', toupper(x))
    # replace "-" with " "
    s = gsub("-", " ", s)
    # remove all words that are between min and max characters in length
    gsub(sprintf("\\b[A-Za-z]{%d,%d}\\b", min, max), "", s)
}

clean_names = function(x) trimws(gsub("[^A-Za-z ]", "", toupper(x)))

# group by author and concat title text into single text string and select the top
# 200 authors based on total number of publications
pubmed = pubmed %>%
  mutate(Author = clean_names(Description)) %>%
  group_by(Author) %>%
  summarize(title.text = clean.text(paste(Title, collapse = ' ')),
            pub.num = n(),
            affiliation = names(sort(table(affiliation), decreasing=T)[1])) %>% # keep the most common affiliation
  filter(Author != "ET AL") %>%
  arrange(desc(pub.num)) %>%
  head(200)

head(pubmed %>% select(Author, title.text),1)
Out[6]:
  Author  title.text
1 LIU KJ  APPLICATION VIVO BRAIN RESEARCH MONITORING TISSUE OXYGENATION BLOOD FLOW OXIDATIVE STRESS ARSENITE BINDING INDUCED ZINC LOSS FROM PARP EQUIVALENT ZINC DEFICIENCY REDUCING PARP ACTIVITY LEADING INHIBITION REPAIR ARSENITE CAUSES DAMAGE KERATINOCYTES GENERATION HYDROXYL RADICALS ARSENITE INTERACTS SELECTIVELY WITH ZINC FINGER PROTEINS CONTAINING MOTIFS ARSENITE INTERACTS WITH DIBENZODEFPCHRYSENE LEVELS SUPPRESS BONE MARROW LYMPHOID PROGENITORS MICE ARSENITE SELECTIVELY INHIBITS MOUSE BONE MARROW LYMPHOID PROGENITOR CELL DEVELOPMENT VIVO VITRO SUPPRESSES HUMORAL IMMUNITY VIVO MEDIATES INHIBITION NITRIC OXIDE LIPOPOLYSACCHARIDE INDUCED MATRIX METALLOPROTEINASE EXPRESSION CULTURED ASTROCYTES BENZOAPYRENE QUINONES INCREASE CELL PROLIFERATION GENERATE REACTIVE OXYGEN SPECIES TRANSACTIVATE EPIDERMAL GROWTH FACTOR RECEPTOR BREAST EPITHELIAL CELLS CEREBRAL TISSUE OXYGENATION OXIDATIVE BRAIN INJURY DURING ISCHEMIA REPERFUSION COMPARISON NITROXIDE LABILE ESTERS DELIVERING ELECTRON PARAMAGNETIC RESONANCE PROBES INTO MOUSE BRAIN CONFERENCE SUMMARY RECENT ADVANCES CONFERENCE METAL TOXICITY CARCINOGENESIS CONTRIBUTIONS REACTIVE OXYGEN SPECIES MITOGEN ACTIVATED PROTEIN KINASE SIGNALING ARSENITE STIMULATED HEMEOXYGENASE PRODUCTION CORRIGENDUM IMMUNOTOXICITY BIODISTRIBUTION ANALYSIS ARSENIC TRIOXIDE MICE FOLLOWING WEEK INHALATION EXPOSURE TOXICOL APPL PHARMACOL DIFFERENTIAL BINDING MONOMETHYLARSONOUS ACID COMPARED ARSENITE ARSENIC TRIOXIDE WITH ZINC FINGER PEPTIDES PROTEINS DIFFERENTIAL EXPRESSION TISSUE INHIBITOR METALLOPROTEINASES CULTURED ASTROCYTES NEURONS REGULATES ACTIVATION MATRIX METALLOPROTEINASE DIRECT VISUALIZATION MOUSE BRAIN OXYGEN DISTRIBUTION ELECTRON PARAMAGNETIC RESONANCE IMAGING APPLICATION FOCAL CEREBRAL ISCHEMIA DIRECT VISUALIZATION TRAPPED ERYTHROCYTES BRAIN AFTER FOCAL ISCHEMIA REPERFUSION DOES NORMOBARIC HYPEROXIA INCREASE OXIDATIVE STRESS ACUTE ISCHEMIC STROKE CRITICAL REVIEW LITERATURE DUAL ACTIONS INVOLVED ARSENITE INDUCED OXIDATIVE DAMAGE EBSELEN
INDUCED GLIOMA CELL DEATH OXYGEN GLUCOSE DEPRIVATION EFFECT PHENYLEPHRINE PRETREATMENT EXPRESSIONS AQUAPORIN TERMINAL KINASE IRRADIATED SUBMANDIBULAR GLAND EFFECTS GLUCOSE CONCENTRATION REDOX STATUS PRIMARY CORTICAL NEURONS UNDER HYPOXIA ELECTRON PARAMAGNETIC RESONANCE GUIDED NORMOBARIC HYPEROXIA TREATMENT PROTECTS BRAIN MAINTAINING PENUMBRAL OXYGENATION MODEL TRANSIENT FOCAL CEREBRAL ISCHEMIA ENHANCED PRODUCTION REDOX SIGNALING WITH COMBINED ARSENITE EXPOSURE CONTRIBUTION NADPH OXIDASE ENVIRONMENTALLY RELEVANT CONCENTRATIONS ARSENITE MONOMETHYLARSONOUS ACID INHIBIT STAT CYTOKINE SIGNALING PATHWAYS MOUSE CDCD DOUBLE NEGATIVE THYMUS CELLS ENVIRONMENTALLY RELEVANT CONCENTRATIONS ARSENITE INDUCE DOSE DEPENDENT DIFFERENTIAL GENOTOXICITY THROUGH POLYADP RIBOSE POLYMERASE INHIBITION OXIDATIVE STRESS MOUSE THYMUS CELLS EVALUATION SPIN TRAPPING AGENTS TRAPPING CONDITIONS DETECTION CELL GENERATED REACTIVE OXYGEN SPECIES EXTENDED NORMOBARIC HYPEROXIA THERAPY YIELDS GREATER NEUROPROTECTION FOCAL TRANSIENT ISCHEMIA REPERFUSION RATS GENERATION HYDROGEN PEROXIDE DURING BRIEF OXYGEN GLUCOSE DEPRIVATION INDUCES PRECONDITIONING NEURONAL PROTECTION PRIMARY CULTURED NEURONS GLUCOSE REGULATES ALPHA EXPRESSION PRIMARY CORTICAL NEURONS RESPONSE HYPOXIA THROUGH MAINTAINING CELLULAR REDOX STATUS HYDROETHIDINE DETECTION SUPEROXIDE PRODUCTION DURING LITHIUM PILOCARPINE MODEL STATUS EPILEPTICUS HYDROXYL RADICAL FORMATION GREATER STRIATAL CORE THAN PENUMBRA MODEL ISCHEMIC STROKE IMMUNOTOXICITY BIODISTRIBUTION ANALYSIS ARSENIC TRIOXIDE MICE FOLLOWING WEEK INHALATION EXPOSURE VIVO EVIDENCE METHAMPHETAMINE INDUCED ATTENUATION BRAIN TISSUE OXYGENATION MEASURED OXIMETRY VIVO REDUCTION CHROMIUM RELATED FREE RADICAL GENERATION INDUCTION HEME OXYGENASE ARSENITE INHIBITS CYTOKINE INDUCED MONOCYTE ADHESION HUMAN ENDOTHELIAL CELLS INORGANIC ARSENIC COMPOUNDS CAUSE OXIDATIVE DAMAGE PROTEIN INDUCING GENERATION HUMAN KERATINOCYTES INTERSTITIAL ISCHEMIC PENUMBRA CORE DIFFERENTIALLY AFFECTED FOLLOWING 
TRANSIENT FOCAL CEREBRAL ISCHEMIA RATS CONCENTRATION ARSENITE EXACERBATES INDUCED STRAND BREAKS INHIBITING PARP ACTIVITY DOSE SYNERGISTIC IMMUNOSUPPRESSION DEPENDENT ANTIBODY RESPONSES POLYCYCLIC AROMATIC HYDROCARBONS ARSENIC CBLJ MURINE SPLEEN CELLS MONOMETHYLARSONOUS ACID INHIBITS SIGNALING MOUSE CELLS NITRIC OXIDE INTERACTS WITH CAVEOLIN FACILITATE AUTOPHAGY LYSOSOME MEDIATED CLAUDIN DEGRADATION OXYGEN GLUCOSE DEPRIVATION TREATED ENDOTHELIAL CELLS NORMOBARIC HYPEROXIA ATTENUATES EARLY BLOOD BRAIN BARRIER DISRUPTION INHIBITING MEDIATED OCCLUDIN DEGRADATION FOCAL CEREBRAL ISCHEMIA NORMOBARIC HYPEROXIA COMBINED WITH MINOCYCLINE PROVIDES GREATER NEUROPROTECTION THAN EITHER ALONE TRANSIENT FOCAL CEREBRAL ISCHEMIA NORMOBARIC HYPEROXIA DELAYS ATTENUATES EARLY NITRIC OXIDE PRODUCTION FOCAL CEREBRAL ISCHEMIC RATS NORMOBARIC HYPEROXIA INHIBITS NADPH OXIDASE MEDIATED MATRIX METALLOPROTEINASE INDUCTION CEREBRAL MICROVESSELS EXPERIMENTAL STROKE NORMOBARIC HYPEROXIA REDUCES NEUROVASCULAR COMPLICATIONS ASSOCIATED WITH DELAYED TISSUE PLASMINOGEN ACTIVATOR TREATMENT MODEL FOCAL CEREBRAL ISCHEMIA APPLICATION HYDROXYBENZOIC ACID TRAPPING AGENT STUDY HYDROXYL RADICAL GENERATION DURING CEREBRAL ISCHEMIA REPERFUSION OXIDATIVE MECHANISM ARSENIC TOXICITY CARCINOGENESIS OXIDATIVE STRESS APOPTOSIS METAL INDUCED CARCINOGENESIS PEROXYNITRITE DECOMPOSITION CATALYST REDUCES DELAYED THROMBOLYSIS INDUCED HEMORRHAGIC TRANSFORMATION ISCHEMIA REPERFUSED BRAINS POLYADP RIBOSE CONTRIBUTES ASSOCIATION BETWEEN POLYADP RIBOSE POLYMERASE XERODERMA PIGMENTOSUM COMPLEMENTATION GROUP NUCLEOTIDE EXCISION REPAIR POLYADP RIBOSE POLYMERASE INHIBITION ARSENITE PROMOTES SURVIVAL CELLS WITH UNREPAIRED LESIONS INDUCED EXPOSURE REACTION BASED FLUORESCENT PROBE ENABLING DETECTION ENDOGENOUS LABILE IMAGING INDUCED FLUX LIVING CELLS ELEVATED ISCHEMIC STROKE REDUCTION ARSENITE ENHANCED ULTRAVIOLET RADIATION INDUCED DAMAGE SUPPLEMENTAL ZINC REDUCTION ZINC ACCUMULATION MITOCHONDRIA CONTRIBUTES DECREASED CEREBRAL 
ISCHEMIC INJURY NORMOBARIC HYPEROXIA TREATMENT EXPERIMENTAL STROKE MODEL SELECTIVE SENSITIZATION ZINC FINGER PROTEIN OXIDATION REACTIVE OXYGEN SPECIES THROUGH ARSENIC BINDING SPATIOTEMPORAL EVOLUTION BLOOD BRAIN BARRIER DAMAGE TISSUE INFARCTION WITHIN FIRST AFTER ISCHEMIA ONSET TISSUE OXYGEN REDUCED WHITE MATTER SPONTANEOUSLY HYPERTENSIVE STROKE PRONE RATS LONGITUDINAL STUDY WITH ELECTRON PARAMAGNETIC RESONANCE ACETOXYMETHOXYCARBONYL TETRAMETHYL PYRROLIDINYLOXYL OXIMETRY PROBE POTENTIAL VIVO MEASUREMENT TISSUE OXYGENATION MOUSE BRAIN XANTHINE OXIDASE ACTIVATES MATRIX METALLOPROTEINASE CULTURED VASCULAR SMOOTH MUSCLE CELLS THROUGH FREE RADICAL MECHANISMS EXPRESSION REDUCTION ENHANCES FREE ZINC ACCUMULATION ASTROCYTES AFTER ISCHEMIC STROKE
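
To make the grouping step easier to follow, here is a minimal base R sketch of the same idea on a made-up table: concatenate titles per author, count publications, and keep the most frequent affiliation. The data frame `long`, the helper `most_common`, and every value are assumptions for illustration; the notebook itself uses dplyr for this.

```r
# Toy long-format table: one row per (author, title); values are illustrative
long <- data.frame(
  Author = c("COLE C", "COLE C", "COLE C", "PETREE LE"),
  Title = c("ZINC LOSS", "BRAIN OXYGEN", "DRUG EVENTS", "ZINC LOSS"),
  affiliation = c("Pharmacy", "Medicine", "Pharmacy", "Medicine"),
  stringsAsFactors = FALSE
)

# Most frequent value in a character vector
most_common <- function(x) names(sort(table(x), decreasing = TRUE))[1]

# Split the table by author and summarise each piece
by_author <- split(long, long$Author)
authors <- data.frame(
  Author = names(by_author),
  title.text = sapply(by_author, function(d) paste(d$Title, collapse = " ")),
  pub.num = sapply(by_author, nrow),
  affiliation = sapply(by_author, function(d) most_common(d$affiliation)),
  stringsAsFactors = FALSE, row.names = NULL
)

# Most prolific authors first
authors <- authors[order(-authors$pub.num), ]
print(authors)
```

In this toy data COLE C appears twice under Pharmacy and once under Medicine, so the majority rule labels that author Pharmacy.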

We now have a data frame where each row represents a unique author along with a concatenation of every title they have authored. Next we need to create the Author-Term matrix, which can be done using the tm package.

Create the Author-Term Matrix

Now that we have a data frame containing authors and concatenated titles, we can easily make an author-term matrix which contains the term frequencies for each author according to how frequently the terms were used in article titles.

In [7]:
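# create the corpus using the concatenated article title text for each author
corpus = VCorpus(VectorSource(pubmed$title.text))
# remove stopwords (removeWords is case sensitive and title.text is upper case,
# so the lower-case stop word list only matches if the text is lower-cased first)
mycorp = tm_map(corpus, removeWords, stopwords('english'))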
In [8]:
# create the author term matrix 
dtm = DocumentTermMatrix(mycorp)
rownames(dtm) = pubmed$Author

Create Term Frequency Inverse Document Frequency Matrix

The term frequency-inverse document frequency (TF-IDF) is a measure that reflects how important a word is to a particular document. It is essentially the number of times a word appears in a document, offset by how common the word is across the corpus in general. In our case, it is the number of times an author uses a word in the titles of their articles, offset by how many other authors use the same word in their titles.

In [9]:
# compute the tf-idf matrix
tfidf = weightTfIdf(dtm, normalize = T)
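
As a worked illustration of the weighting, the sketch below computes TF-IDF by hand on a tiny made-up count matrix. It uses the normalized form that, as far as I understand it, weightTfIdf applies (term count divided by the author's total, times log2 of the number of authors over the number of authors using the term); treat that as an assumption about tm's exact formula. The authors, terms, and counts are invented.

```r
# Toy author-term counts; authors and terms are made up for illustration
counts <- matrix(
  c(2, 1, 0,
    0, 1, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("AUTHOR A", "AUTHOR B"),
                  c("arsenite", "cells", "vancomycin"))
)

# Term frequency: each count divided by the author's total term count
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of authors / authors using the term)
idf <- log2(nrow(counts) / colSums(counts > 0))

# Weight each column of tf by that term's idf
tfidf_toy <- sweep(tf, 2, idf, `*`)
print(round(tfidf_toy, 3))
```

The term "cells" is used by both authors, so its weight drops to zero, while "arsenite" and "vancomycin" each remain distinctive for one author.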

Since the TF-IDF weight reflects a word's importance to a particular document (or in this case, author), we can list the top 10 terms (ranked by TF-IDF weight) for each author to see which terms are most descriptive of their research.

In [10]:
# Find the top 10 terms by TF-IDF for each author
top.terms = apply(tfidf, 1, function(x) colnames(tfidf)[order(x, decreasing = T)[1:10]])
top.terms[,1:5]
Out[10]:
LIU KJ      BURCHIEL SW              HUDSON LG      GLEW RH   RAISCH DW
arsenite    aromatic                 epidermal      fatty     adverse
ischemia    polycyclic               arsenite       nigeria   methods
cerebral    hydrocarbons             factor         sickle    reactions
hyperoxia   benzoapyrene             growth         nigerian  drug
normobaric  cell                     receptor       northern  prescribing
zinc        calcium                  ovarian        serum     pediatric
brain       following                cells          children  food
focal       cells                    arsenic        fulani    events
ischemic    dimethylbenzaanthracene  zinc           disease   oncology
oxygen      line                     keratinocytes  acids     administrations

From the output above we can see that Dr. Jim Liu's research seems to focus on arsenite, ischemia, and the brain, whereas Dr. Dennis Raisch's top terms include adverse, events, and reactions. This seems appropriate given Dr. Raisch's experience in mining adverse drug events from the FDA Adverse Event Reporting System.

Cluster Authors by TFIDF Weighted Article Terms

Now that we've computed the TF-IDF weights for each word for each author, we can use hierarchical clustering to group authors based on title term similarity. Authors who use similar words in the titles of their papers will tend to cluster together. This will allow us to see interesting patterns, such as which authors are publishing on similar topics. One thing to note is that if there are many more variables (i.e. terms) than documents (i.e. authors), nonsensical groups may emerge. In our case the 200 authors retained above are described by several thousand distinct terms, so terms far outnumber authors and the clusters should be read with that caveat in mind.

In [11]:
hc = hclust(dist(tfidf))
color = pubmed$affiliation
color[color=="Pharmacy"] = "#64706c"
color[color=="Medicine"] = "#935347"
In [12]:
svg('unigrams.svg', width=10, height=10)
plot(as.phylo(hc), main="PubMed Author Similarity by Unigrams", 
     tip.color = color, type="fan", cex=0.7, label.offset = 0.05)
dev.off()
Out[12]:
pdf: 2

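[Figure: unigrams.svg, fan dendrogram "PubMed Author Similarity by Unigrams", tip labels colored by affiliation]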
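
For readers who want to see the clustering step in isolation, here is a small sketch on invented weights using only base R (dist, hclust, cutree). The matrix `w` and its row and column names are assumptions for illustration; cutting the tree into two groups is also just for the demo, since the notebook inspects the full dendrogram instead.

```r
# Toy TF-IDF-like weights; rows are made-up authors, columns made-up terms
w <- matrix(
  c(0.9, 0.1, 0.0,
    0.9, 0.1, 0.0,
    0.0, 0.2, 0.8,
    0.1, 0.0, 0.9),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A1", "A2", "B1", "B2"), c("t1", "t2", "t3"))
)

d <- dist(w)                     # Euclidean distance between author rows
hc_toy <- hclust(d)              # complete linkage by default
groups <- cutree(hc_toy, k = 2)  # cut the tree into two clusters
print(groups)

# Authors with identical rows sit at distance zero
print(as.matrix(d)["A1", "A2"])
```

A1 and A2 have identical rows, so they merge first at height zero, which is the same situation as authors who only ever appear together on the same papers.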

Inspecting the Groups

From the dendrogram above we can see clear groups emerge. Some groups cluster pretty tightly while other groups are less tightly connected. Some authors sit at zero distance from each other, which means their word vectors are identical. This suggests that these authors appear together on the same papers. In this next section we will write a function to extract the top 10 terms for a particular author.

In [13]:
# function for returning top 10 terms for particular author
get_top_terms = function(author, data, n = 10){
    colnames(data)[order(as.matrix(data[author,]), decreasing=T)[1:n]]
}

Let's look at the top terms for some tightly clustered authors, for example GARVER WS and JELINEK D.

GARVER WS and JELINEK D

In [14]:
get_top_terms('GARVER WS', tfidf)
Out[14]:
  1. 'niemann'
  2. 'pick'
  3. 'diet'
  4. 'weight'
  5. 'dosage'
  6. 'cblj'
  7. 'interaction'
  8. 'decreased'
  9. 'gene'
  10. 'confirmation'
In [15]:
get_top_terms('JELINEK D', tfidf)
Out[15]:
  1. 'niemann'
  2. 'pick'
  3. 'diet'
  4. 'weight'
  5. 'dosage'
  6. 'cblj'
  7. 'interaction'
  8. 'decreased'
  9. 'gene'
  10. 'confirmation'

As expected, these authors have identical word vectors, suggesting that they have similar research interests. In fact, it is likely that these two authors are co-authors.
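
One way to check the co-authorship guess directly is to compare the sets of papers two authors appear on. The sketch below does this on a made-up table; the data frame `long`, the column `pmid`, and the helper `shared_papers` are hypothetical names. It assumes a one-row-per-author-per-paper table like the one produced by the unnest step is still available (in the notebook, `pubmed` is overwritten by the per-author summary).

```r
# Toy long-format table: one row per (author, PubMed ID); values are illustrative
long <- data.frame(
  Author = c("AUTHOR A", "AUTHOR B", "AUTHOR A", "AUTHOR B", "AUTHOR C"),
  pmid = c(101, 101, 102, 102, 103)
)

# Papers two authors share, and the shared fraction of their combined papers
# (the Jaccard index of the two paper sets)
shared_papers <- function(a, b, data) {
  pa <- unique(data$pmid[data$Author == a])
  pb <- unique(data$pmid[data$Author == b])
  list(shared = intersect(pa, pb),
       jaccard = length(intersect(pa, pb)) / length(union(pa, pb)))
}

print(shared_papers("AUTHOR A", "AUTHOR B", long))
print(shared_papers("AUTHOR A", "AUTHOR C", long))
```

A Jaccard index of 1 means the two authors appear on exactly the same papers, which would explain identical word vectors.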

ANTONCULVER H and ABEN KK

In [16]:
get_top_terms('ANTONCULVER H', tfidf)
get_top_terms('ABEN KK', tfidf)
Out[16]:
  1. 'ovarian'
  2. 'cancer'
  3. 'epithelial'
  4. 'risk'
  5. 'common'
  6. 'gene'
  7. 'serous'
  8. 'genes'
  9. 'network'
  10. 'variants'
Out[16]:
  1. 'ovarian'
  2. 'cancer'
  3. 'epithelial'
  4. 'risk'
  5. 'common'
  6. 'gene'
  7. 'serous'
  8. 'genes'
  9. 'network'
  10. 'variants'

PAI MP and MERCIER RC

In [17]:
get_top_terms('PAI MP', tfidf)
get_top_terms('MERCIER RC', tfidf)
Out[17]:
  1. 'candida'
  2. 'vancomycin'
  3. 'bloodstream'
  4. 'antifungal'
  5. 'antimicrobial'
  6. 'dosing'
  7. 'hemodialysis'
  8. 'receiving'
  9. 'obese'
  10. 'combinations'
Out[17]:
  1. 'vancomycin'
  2. 'aureus'
  3. 'staphylococcus'
  4. 'resistant'
  5. 'methicillin'
  6. 'against'
  7. 'antimicrobial'
  8. 'daptomycin'
  9. 'piperacillin'
  10. 'tazobactam'

It's clear that the above authors frequently publish on infectious disease.

Bigram Clustering

In the first example we broke the text apart into individual words; however, it may be useful to retain some of the context. This can be accomplished by using a bigram tokenizer, which splits the text into bigrams (chunks of two consecutive words). We can use the NLP package to create a bigram tokenizer.

In [18]:
bigram = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse=" "), use.names=F)
In [19]:
corpus = VCorpus(VectorSource(pubmed$title.text))
my_corpus = tm_map(corpus, removeWords, stopwords('english'))
# build the matrix from the stop-word-filtered corpus created on the line above
dtm = DocumentTermMatrix(my_corpus, control=list(tokenize = bigram))
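
To show what the tokenizer produces, here is a base R stand-in that pairs each word with the one that follows it. The function `bigrams_of` is a hypothetical helper written for this illustration and is not the NLP::ngrams implementation used above.

```r
# Base R stand-in for the bigram tokenizer (not the NLP::ngrams implementation)
bigrams_of <- function(text) {
  w <- strsplit(trimws(text), " +")[[1]]       # split on runs of spaces
  if (length(w) < 2) return(character(0))      # no bigram without two words
  paste(head(w, -1), tail(w, -1))              # pair each word with the next
}

print(bigrams_of("FOCAL CEREBRAL ISCHEMIA REPERFUSION"))
```

Because short words were already removed from title.text, some bigrams join words that were not adjacent in the original title.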

Compute TFIDF Weights on Bigrams

In [20]:
tfidf = weightTfIdf(dtm)
rownames(tfidf) = pubmed$Author

Inspect the Top 10 Bigrams from the Top 5 Authors

In [21]:
# Find the top 10 terms by TF-IDF for each author
top.terms = apply(tfidf, 1, function(x) colnames(tfidf)[order(x, decreasing = T)[1:10]])
top.terms[,1:5]
Out[21]:
LIU KJ                    BURCHIEL SW            HUDSON LG                GLEW RH              RAISCH DW
normobaric hyperoxia      aromatic hydrocarbons  epidermal growth         sickle cell          food drug
focal cerebral            polycyclic aromatic    factor receptor          northern nigeria     drug administrations
cerebral ischemia         cell line              growth factor            cell disease         adverse drug
ischemic stroke           mammary epithelial     cancer cells             fatty acids          cost effectiveness
electron paramagnetic     human mammary          ultraviolet radiation    fatty acid           educational methods
paramagnetic resonance    following oral         ovarian cancer           nigerian children    influencing prescribing
tissue oxygenation        rhesus monkeys         polyadp ribose           with sickle          methods influencing
ischemia reperfusion      human cells            contributions epidermal  acid composition     model methods
matrix metalloproteinase  intracellular calcium  zinc finger              serum phospholipids  part review
oxygen species            flow cytometry         monomethylarsonous acid  nigeria serum        payment methods

Clustering Authors by Bigrams

In [22]:
hc = hclust(dist(tfidf))
color = pubmed$affiliation
color[color=="Pharmacy"] = "#64706c"
color[color=="Medicine"] = "#935347"
In [23]:
svg('bigrams.svg', width=10, height=10)
par(mai=c(1,1,1,1))
plot(as.phylo(hc), main="PubMed Author Similarity by Bigrams", 
     tip.color = color, type="fan", cex=0.7, label.offset = 0.05)
dev.off()
Out[23]:
pdf: 2

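[Figure: bigrams.svg, fan dendrogram "PubMed Author Similarity by Bigrams", tip labels colored by affiliation]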

For the most part the groups revealed by the bigram model are similar to those in the unigram model; however, with the bigram model some of the groups appear more tightly clustered. Below we will look at the top bigrams for some of these authors.

Inspecting Top Bigrams

In [24]:
get_top_terms('MERCIER RC', tfidf)
get_top_terms('PAI MP', tfidf)
Out[24]:
  1. 'staphylococcus aureus'
  2. 'methicillin resistant'
  3. 'resistant staphylococcus'
  4. 'aureus isolates'
  5. 'vancomycin treatment'
  6. 'against methicillin'
  7. 'piperacillin tazobactam'
  8. 'vancomycin intermediate'
  9. 'patients receiving'
  10. 'activity tigecycline'
Out[24]:
  1. 'bloodstream isolates'
  2. 'susceptibility testing'
  3. 'against simulated'
  4. 'antifungal combinations'
  5. 'combinations against'
  6. 'endocardial vegetations'
  7. 'simulated candida'
  8. 'single dose'
  9. 'morbidly obese'
  10. 'patients receiving'

Here we see that the top bigrams are similar to those from the unigram model in that they are all related to infectious disease; however, we gain a little more information using the bigram model. For instance, we see that Mercier appears to focus on methicillin resistance and Staphylococcus aureus, as well as vancomycin and piperacillin-tazobactam.

In [25]:
get_top_terms('ANDERSON JR', tfidf)
get_top_terms('NAWARSKAS JJ', tfidf)
Out[25]:
  1. 'arterial hypertension'
  2. 'directive guidance'
  3. 'pharmacist directive'
  4. 'purdue pharmacist'
  5. 'pulmonary arterial'
  6. 'chronic stable'
  7. 'guidance scale'
  8. 'heart failure'
  9. 'properties purdue'
  10. 'psychometric properties'
Out[25]:
  1. 'eluting stent'
  2. 'paclitaxel eluting'
  3. 'agent treatment'
  4. 'arterial hypertension'
  5. 'chronic stable'
  6. 'effect statins'
  7. 'fractures elderly'
  8. 'hormone therapy'
  9. 'intervention part'
  10. 'postmenopausal hormone'

It's no surprise that the cardiovascular pharmacists are publishing papers related to hypertension, heart failure, stents, and statins.

Opportunities for Collaboration

Now that we have successfully clustered authors into similar groups, we can potentially see which authors would collaborate well together. For example, let's look at Rey GM and Shah VO. I happen to know that both Dr. Rey (College of Pharmacy) and Dr. Shah (School of Medicine) specialize in diabetes. In addition, these authors cluster near each other in the dendrogram. A quick review of the dataset shows that they published a paper together, "Comparison of the fatty acid composition of the serum phospholipids of controls, prediabetics and adults with type 2 diabetes", suggesting these authors have collaborated in the past. What other examples of interdisciplinary collaboration are apparent in the graph? I will leave that exercise to some of the more intrepid readers.

Limitations

There are some limitations to this method. Most notably, nonsensical groups can emerge when there are many more features (terms) than authors. In our case we have only 200 authors compared to 3,644 terms, so this caveat applies and the clusters are best checked against each author's top terms, as was done above. Another issue is how author affiliation was assigned. In this project I assigned each author to either "Pharmacy" or "Medicine" depending on which dataset the name appeared in most frequently. For example, if an author had 4 papers in the College of Pharmacy search results and only 2 in the School of Medicine results, they received the Pharmacy affiliation. This may misclassify some authors. Another limitation is that the name used to identify an author can change from paper to paper. For example, I may publish one paper as Bernauer ML and the next as Bernauer M. No effort was made to correct for this, and as a result some authors may have been split across more than one name.
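
The name-variant problem could be reduced with a simple normalization pass. The sketch below is one possible approach and was not part of the analysis above: `name_key` is a hypothetical helper that reduces each name to surname plus first initial. This merges "Bernauer ML" with "Bernauer M", at the cost of also merging different people who share a surname and first initial.

```r
# Hypothetical helper: reduce "SURNAME INITIALS" to "SURNAME" plus first initial
name_key <- function(author) {
  parts <- strsplit(trimws(toupper(author)), " +")
  sapply(parts, function(p) {
    # a single token has no initials to shorten
    if (length(p) < 2) return(paste(p, collapse = ""))
    surname <- paste(p[-length(p)], collapse = " ")
    paste(surname, substr(p[length(p)], 1, 1))
  })
}

variants <- c("Bernauer ML", "Bernauer M", "Liu KJ")
print(name_key(variants))

# Group the original spellings under their shared key
print(split(variants, name_key(variants)))
```

Grouping on such a key before concatenating titles would pool the two Bernauer spellings into one row; whether that trade-off is acceptable depends on how common shared surnames are in the dataset.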