Retrieving Similar Publications: UNM College of Pharmacy and School of Medicine

Created on January 31, 2016 by Michael L. Bernauer

Goal: Find similar papers using Title and Abstract text

In this notebook we will use the pandas, NLTK, NumPy, and scikit-learn libraries to find similar articles indexed in PubMed using k-Nearest Neighbors.

Steps:

  1. Find the important keywords of each document using tf-idf
  2. Fit a k-Nearest Neighbors model on the tf-idf matrix to find similar papers

Cleaning:

  • Replace \n, \x0c, and similar control characters with white-space
  • Apply unicode
  • Make everything lower case

Note: This notebook borrows extensively from this one by Amir Amini

In [1]:
import pandas as pd
import sklearn
import numpy as np
import nltk
import re
import time
import codecs
from Bio import Medline

Downloading paper abstracts

First we must download the text data that we are interested in. To do this we will use articles indexed in pubmed.gov. For this notebook we are interested only in articles published from the University of New Mexico College of Pharmacy and School of Medicine. PubMed allows the use of field tags to restrict a search to certain institutions; [AD] matches the affiliation field. Retrieve articles affiliated with UNM CoP and SoM by using the following search string: "university of new mexico"[AD] AND ("pharmacy"[AD] OR "medicine"[AD])

Steps:
  1. Navigate to pubmed.gov
  2. Enter "university of new mexico"[AD] AND ("pharmacy"[AD] OR "medicine"[AD]) into the search box
  3. Click 'Send to:' and choose 'File' and 'Format: MEDLINE'
  4. Click 'Create File'

As of January 31, 2016 there were a total of 5,584 articles matching these search criteria. A file called pubmed_result.txt should have been saved to your computer. This file contains all of the articles matching the search criteria in MEDLINE format.
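
For orientation, a MEDLINE-format file is a sequence of records separated by blank lines. Each field starts with a tag padded to four characters and followed by "- ", and wrapped text continues on lines indented by six spaces. The sketch below reads that layout in plain Python to show roughly what Bio.Medline does in the next cell. The sample records are invented, and parse_medline is a hypothetical helper for illustration, not part of Biopython.

```python
# Minimal sketch of a MEDLINE-format reader (illustration only; the notebook
# itself uses Bio.Medline). The sample text below is made up.
SAMPLE = """PMID- 00000001
TI  - An example title that is long enough to wrap onto
      a second line.
AB  - An example abstract.
AU  - Doe J
AU  - Roe R

PMID- 00000002
TI  - A second example title.
AU  - Poe E
"""

REPEATED_TAGS = {"AU"}  # fields that may occur several times per record


def parse_medline(text):
    records, current, last_tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():                          # blank line ends a record
            if current:
                records.append(current)
            current, last_tag = {}, None
        elif line.startswith("      ") and last_tag:  # continuation line
            if last_tag in REPEATED_TAGS:
                current[last_tag][-1] += " " + line.strip()
            else:
                current[last_tag] += " " + line.strip()
        else:
            tag, value = line[:4].strip(), line[6:].strip()
            if tag in REPEATED_TAGS:
                current.setdefault(tag, []).append(value)
            else:
                current[tag] = value
            last_tag = tag
    if current:                                       # no trailing blank line
        records.append(current)
    return records


records = parse_medline(SAMPLE)
# Keep only records that have a title, authors, and an abstract,
# mirroring the filter in read_medline_data().
complete = [r for r in records if all(k in r for k in ("TI", "AU", "AB"))]
```

Here the second sample record has no AB field, so only the first one survives the filter. The same thing happens to real PubMed records without an abstract.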

Let's import the article data

In [2]:
# Function that uses the Medline module from
# the Biopython library to parse and read MEDLINE
# formatted files. Results are stored in a Pandas
# DataFrame. Records missing a title, author list,
# or abstract are skipped.
def read_medline_data(filename):
    rows = []
    with open(filename, 'r') as handle:
        for rec in Medline.parse(handle):
            try:
                rows.append([rec["TI"], rec["AU"], rec["AB"]])
            except KeyError:
                # Record lacks one of the required fields
                continue
    # Build the DataFrame once instead of appending row by row
    return pd.DataFrame(rows, columns=["title", "authors", "abstract"])
In [3]:
# Read in MEDLINE formatted text
papers = read_medline_data("data/pubmed_result.txt")
In [4]:
# Show the top few papers
papers.head()
Out[4]:
title authors abstract
0 Discrepancy between Measured Serum Total Carbo... [Kim Y, Massie L, Murata GH, Tzamaloukas AH] Large differences between the concentrations o...
1 Longitudinal assessment of local and global fu... [Meier TB, Bellgowan PS, Mayer AR] Growing evidence suggests that sports-related ...
2 Changes in the Practice of Obstetrics and Gyne... [Rayburn WF, Tracy EE] A projected shortage of obstetrician-gynecolog...
3 Association Between Indoor Tanning and Melanom... [Lazovich D, Isaksson Vogel R, Weinstock MA, N... Importance: In the United States and Minnesota...
4 Biomass smoke exposure and chronic lung disease. [Assad NA, Kapoor V, Sood A] PURPOSE OF REVIEW: Approximately 3 billion peo...

Let's look at an example Abstract prior to cleaning

In [5]:
print "Title: ", papers['title'][0]
print 
print "Abstract: ", papers['abstract'][0]
Title:  Discrepancy between Measured Serum Total Carbon Dioxide Content and Bicarbonate Concentration Calculated from Arterial Blood Gases.

Abstract:  Large differences between the concentrations of serum total carbon dioxide (TCO2) and blood gas bicarbonate (HCO3 (-)) were observed in two consecutive simultaneously drawn sets of samples of serum and arterial blood gases in a patient who presented with severe carbon dioxide retention and profound acidemia. These differences could not be explained by the effect of the high partial pressure of carbon dioxide on TCO2, by variations in the dissociation constant of the carbonic acid/bicarbonate system or by faults caused by the algorithms of the blood gas apparatus that calculate HCO3 (-). A recalculation using the Henderson-Hasselbach equation revealed arterial blood gas HCO3 (-) values close to the corresponding serum TCO2 values and clarified the diagnosis of the acid-base disorder, which had been placed in doubt by the large differences between the reported TCO2 and HCO3 (-) values. Human error in the calculation of HCO3 (-) was identified as the source of these differences. Recalculation of blood gas HCO3 (-) should be the first step in identifying the source of large differences between serum TCO2 and blood gas HCO3 (-).

Clean Abstract and Title

  1. Replace \n and \x0c with white-space
  2. Convert to unicode
  3. Make everything lower case
In [6]:
# Function that cleans text by replacing '\x0c' and '\n' characters
# with spaces, collapsing every run of non-alpha characters into a
# single space, and finally converting everything to lower case
def clean_text(text):
    for ch in ['\x0c', '\n']:
        # str.replace returns a new string, so keep the result
        text = text.replace(ch, ' ')
    cleaned = re.sub('[^a-zA-Z]+', ' ', text)
    return cleaned.lower()
In [7]:
# Create a column for cleaned Abstract and cleaned Title
papers['clean_abstract'] = papers['abstract'].apply(clean_text)
papers['clean_title'] = papers['title'].apply(clean_text)

Top entries in df including the newly added cleaned abstract column

In [8]:
papers.head()
Out[8]:
title authors abstract clean_abstract clean_title
0 Discrepancy between Measured Serum Total Carbo... [Kim Y, Massie L, Murata GH, Tzamaloukas AH] Large differences between the concentrations o... large differences between the concentrations o... discrepancy between measured serum total carbo...
1 Longitudinal assessment of local and global fu... [Meier TB, Bellgowan PS, Mayer AR] Growing evidence suggests that sports-related ... growing evidence suggests that sports related ... longitudinal assessment of local and global fu...
2 Changes in the Practice of Obstetrics and Gyne... [Rayburn WF, Tracy EE] A projected shortage of obstetrician-gynecolog... a projected shortage of obstetrician gynecolog... changes in the practice of obstetrics and gyne...
3 Association Between Indoor Tanning and Melanom... [Lazovich D, Isaksson Vogel R, Weinstock MA, N... Importance: In the United States and Minnesota... importance in the united states and minnesota ... association between indoor tanning and melanom...
4 Biomass smoke exposure and chronic lung disease. [Assad NA, Kapoor V, Sood A] PURPOSE OF REVIEW: Approximately 3 billion peo... purpose of review approximately billion people... biomass smoke exposure and chronic lung disease

Now let's look at the cleaned abstract

In [9]:
print "Title: ", papers['clean_title'][0]
print
print "Abstract: ", papers['clean_abstract'][0]
Title:  discrepancy between measured serum total carbon dioxide content and bicarbonate concentration calculated from arterial blood gases 

Abstract:  large differences between the concentrations of serum total carbon dioxide tco and blood gas bicarbonate hco were observed in two consecutive simultaneously drawn sets of samples of serum and arterial blood gases in a patient who presented with severe carbon dioxide retention and profound acidemia these differences could not be explained by the effect of the high partial pressure of carbon dioxide on tco by variations in the dissociation constant of the carbonic acid bicarbonate system or by faults caused by the algorithms of the blood gas apparatus that calculate hco a recalculation using the henderson hasselbach equation revealed arterial blood gas hco values close to the corresponding serum tco values and clarified the diagnosis of the acid base disorder which had been placed in doubt by the large differences between the reported tco and hco values human error in the calculation of hco was identified as the source of these differences recalculation of blood gas hco should be the first step in identifying the source of large differences between serum tco and blood gas hco 

We can see that our clean_text() function successfully removed all non-alpha characters, and converted everything to lower case.
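
One side effect is worth noting: because the regular expression keeps only the letters a-z and A-Z, digits and symbols are dropped as well, which is why "TCO2" becomes "tco" and "HCO3 (-)" becomes "hco" in the output above. A minimal sketch of the same substitution on a short, made-up string:

```python
import re

# Same pattern as clean_text(): every run of non-letters becomes one space.
sample = "Serum TCO2\nand HCO3 (-) differed."
cleaned = re.sub('[^a-zA-Z]+', ' ', sample).lower()
print(cleaned)
```

If chemical formulas or doses should count towards similarity, a pattern such as '[^a-zA-Z0-9]+' would keep the digits. That is a design alternative, not what this notebook does.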

Build tf-idf matrix based on Abstract and Title

  • Use NLTK word_tokenize() and PorterStemmer() to tokenize and stem document Title and Abstract
In [10]:
# Function that takes text, tokenizes it and 
# returns list of stemmed tokens
def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    stemmer = nltk.stem.porter.PorterStemmer()
    return [i for i in [stemmer.stem(t) for t in tokens] if len(i) > 2]

Create a tf-idf vectorizer using sklearn TfidfVectorizer

  1. First we create the vectorizer, specifying the parameters:
    • max_df is the maximum allowable document frequency for a token. It is set to 0.50, so terms that appear in more than 50% of documents are dropped.
    • min_df is the minimum allowable document frequency for a token and is set to 0 to include all terms, even those that appear in only one document.
    • max_features sets the maximum number of features allowed and is set to an arbitrarily large number (i.e. 200,000) so that, in practice, no features are discarded.
    • stop_words specifies the words/tokens to remove from the corpus; 'english' selects scikit-learn's built-in English stop-word list.
    • use_idf enables reweighting each feature by its inverse document frequency when set to True.
    • tokenizer specifies which tokenizer to use. We want to tokenize and stem, so we pass it the tokenize_and_stem() function we created above, which keeps only stems longer than two characters. The default tokenizer keeps tokens of two or more alphanumeric characters and does no stemming.
  2. We then fit the vectorizer to our cleaned text using vectorizer.fit_transform().
  3. The output is an n × m matrix, where n is the number of documents in our corpus and m is the number of features.
  4. We can inspect the features using vectorizer.get_feature_names() (renamed get_feature_names_out() in recent scikit-learn releases).
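
To make the weighting concrete, the sketch below computes TF-IDF by hand for a three-document toy corpus in plain Python. It assumes the formula scikit-learn documents for its default settings (smooth_idf=True, norm='l2'): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each document vector to unit length. The corpus and the helper name tfidf_rows are made up for illustration, and stop-word removal and stemming are left out.

```python
import math
from collections import Counter

docs = [
    "warfarin dose bleeding risk",
    "warfarin therapy duration risk",
    "concussion athletes brain imaging",
]


def tfidf_rows(docs, max_df=1.0):
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # max_df as a proportion: drop terms found in more than max_df of documents.
    vocab = sorted(t for t, c in df.items() if c <= max_df * n)
    idf = {t: math.log((1.0 + n) / (1.0 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])  # L2-normalise each document
    return vocab, rows


vocab, rows = tfidf_rows(docs, max_df=0.5)
```

With max_df=0.5, "warfarin" and "risk" occur in two of the three documents and are therefore removed from the vocabulary, which is the effect the max_df bullet describes.
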
In [11]:
# Import the TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# Create vectorizer for Abstracts. max_df is set to 0.5 so that terms
# appearing in more than 50% of the documents (i.e. very common terms) are dropped
abs_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, max_features=200000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)

# Create vectorizer for Titles, with the same settings
title_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, max_features=200000,
               stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)
In [12]:
# Compute TF-IDF weights for Abstracts
tfidf_weights_abs = abs_tfidf_vectorizer.fit_transform(papers['clean_abstract'])
/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:2641: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)
In [13]:
# Compute TF-IDF weights for Title
tfidf_weights_title = title_tfidf_vectorizer.fit_transform(papers['clean_title'])
In [14]:
# Get feature names for Abstract and Title models
tfidf_features_title = title_tfidf_vectorizer.get_feature_names()
tfidf_features_abs = abs_tfidf_vectorizer.get_feature_names()

Write function to get the top-k features associated with a document

In [15]:
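# Function for returning the top_k features of an Abstract
# or Title
def get_top_features(rownum, weights, features, top_k=10):
    # Densify only the requested row rather than the whole matrix
    weight_vec = weights[rownum].toarray().ravel()
    # Sort indices by weight, highest first, and keep the first top_k
    top_idx = np.argsort(weight_vec)[::-1][:top_k]
    return [features[i] for i in top_idx]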
In [16]:
# Top k features of Abstract 1
get_top_features(1, tfidf_weights_abs, tfidf_features_abs)
Out[16]:
[u'concuss',
 u'reho',
 u'athlet',
 u'post',
 u'connect',
 u'src',
 u'gbc',
 u'healthi',
 u'time',
 u'month']
In [17]:
# Top k features of Title 1
get_top_features(1, tfidf_weights_title, tfidf_features_title)
Out[17]:
[u'global',
 u'concuss',
 u'sport',
 u'connect',
 u'longitudin',
 u'local',
 u'follow',
 u'assess',
 u'relat',
 u'function']
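
The selection step in get_top_features() is a sort-and-slice: order the feature indices by weight, highest first, and keep the first top_k. Below is a dependency-free sketch of the same idea on a made-up weight vector. It uses heapq.nlargest, which tracks only the k best entries instead of sorting the whole vector; the names and numbers are illustrative and not taken from the notebook.

```python
import heapq

features = ["athlet", "brain", "concuss", "month", "time"]
weights = [0.41, 0.00, 0.73, 0.12, 0.19]   # one document's TF-IDF row (invented)


def top_features(weights, features, top_k=3):
    # Indices of the top_k largest weights, highest first.
    top_idx = heapq.nlargest(top_k, range(len(weights)), key=weights.__getitem__)
    return [features[i] for i in top_idx]


top = top_features(weights, features)
```

Note that when a document has fewer than top_k non-zero weights, which is likely for short titles, either approach pads the result with zero-weight features.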

Build Nearest Neighbors model using Abstract and Title TF-IDF matrices

In [18]:
# Build model to return 5 closest neighbors
from sklearn.neighbors import NearestNeighbors

# Create the k-NN model using k=5
nn_abs = NearestNeighbors(n_neighbors=5, algorithm='auto')
nn_title = NearestNeighbors(n_neighbors=5, algorithm='auto')

# Fit the models to the TF-IDF weights matrix
nn_fitted_abs = nn_abs.fit(tfidf_weights_abs)
nn_fitted_title = nn_title.fit(tfidf_weights_title)
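
NearestNeighbors is created here without a metric argument, so it uses its default, Euclidean distance. Because TfidfVectorizer L2-normalises each row by default, Euclidean distance and cosine similarity give the same neighbour ranking: for unit vectors a and b, ||a - b||^2 = 2 - 2 (a . b). The plain-Python sketch below checks this on made-up unit vectors with a brute-force search. It illustrates the idea only and is not scikit-learn's implementation.

```python
import math


def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))   # valid because rows are unit length


# Invented "TF-IDF" rows, normalised to unit length.
rows = [unit(v) for v in ([3, 1, 0, 0], [2, 2, 0, 0], [0, 1, 3, 0], [0, 0, 1, 4])]
query = rows[0]

# Brute-force ranking: smallest distance first vs. largest cosine first.
by_distance = sorted(range(len(rows)), key=lambda i: euclidean(query, rows[i]))
by_cosine = sorted(range(len(rows)), key=lambda i: -cosine(query, rows[i]))
```

Both rankings put the query row first, at distance 0. For the same reason, the paper used as the query appears as the first of the 5 neighbours in the results that follow.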

Write a function to return the top-k nearest papers

In [19]:
def find_nearest_papers(row, kNNmodel, tfidf_weights, tfidf_features, papers):
    keywords = get_top_features(row, tfidf_weights, tfidf_features)
    # kneighbors returns (distances, indices); only the indices are needed
    _, idx = kNNmodel.kneighbors(tfidf_weights[row,:])
    idx = list(idx[0])
    # idx holds row positions, so use positional indexing
    return {'papers':papers.iloc[idx], 'keywords':keywords}

Return papers based on Abstract similarity

Now that we have a function to return similar papers, we can use it to find papers with similar abstracts. We can return Authors, Title, or Abstract of similar matches

In [20]:
find_nearest_papers(1, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)['papers']
Out[20]:
title authors abstract clean_abstract clean_title
1 Longitudinal assessment of local and global fu... [Meier TB, Bellgowan PS, Mayer AR] Growing evidence suggests that sports-related ... growing evidence suggests that sports related ... longitudinal assessment of local and global fu...
59 Longitudinal assessment of white matter abnorm... [Meier TB, Bergamino M, Bellgowan PS, Teague T... There is great interest in developing physiolo... there is great interest in developing physiolo... longitudinal assessment of white matter abnorm...
359 Thinner Cortex in Collegiate Football Players ... [Meier TB, Bellgowan PS, Bergamino M, Ling JM,... Emerging evidence suggests that a history of s... emerging evidence suggests that a history of s... thinner cortex in collegiate football players ...
252 Mood symptoms correlate with kynurenine pathwa... [Singh R, Savitz J, Teague TK, Polanski DW, Ma... OBJECTIVE: An imbalance of neuroactive kynuren... objective an imbalance of neuroactive kynureni... mood symptoms correlate with kynurenine pathwa...
518 Recovery of cerebral blood flow following spor... [Meier TB, Bellgowan PS, Singh R, Kuplicki R, ... IMPORTANCE: Animal models suggest that reduced... importance animal models suggest that reduced ... recovery of cerebral blood flow following spor...

Return papers based on Title similarity

The same function can be used to find papers with similar Titles by passing it the Title model, weights, and features.

In [21]:
find_nearest_papers(1, nn_fitted_title, tfidf_weights_title, tfidf_features_title, papers)['papers']
Out[21]:
title authors abstract clean_abstract clean_title
1 Longitudinal assessment of local and global fu... [Meier TB, Bellgowan PS, Mayer AR] Growing evidence suggests that sports-related ... growing evidence suggests that sports related ... longitudinal assessment of local and global fu...
59 Longitudinal assessment of white matter abnorm... [Meier TB, Bergamino M, Bellgowan PS, Teague T... There is great interest in developing physiolo... there is great interest in developing physiolo... longitudinal assessment of white matter abnorm...
518 Recovery of cerebral blood flow following spor... [Meier TB, Bellgowan PS, Singh R, Kuplicki R, ... IMPORTANCE: Animal models suggest that reduced... importance animal models suggest that reduced ... recovery of cerebral blood flow following spor...
252 Mood symptoms correlate with kynurenine pathwa... [Singh R, Savitz J, Teague TK, Polanski DW, Ma... OBJECTIVE: An imbalance of neuroactive kynuren... objective an imbalance of neuroactive kynureni... mood symptoms correlate with kynurenine pathwa...
30 The Role of Nutritional Supplements in Sports ... [Ashbaugh A, McGrew C] There has been considerable research conducted... there has been considerable research conducted... the role of nutritional supplements in sports ...
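
The two result lists can be compared directly. Using the row indices shown in Out[20] and Out[21], the sketch below makes a quick, informal check of how far the Abstract and Title models agree for paper 1. Jaccard overlap is just one simple choice of measure and is not something the notebook itself computes.

```python
# Row indices copied from Out[20] (Abstract model) and Out[21] (Title model).
abstract_neighbors = [1, 59, 359, 252, 518]
title_neighbors = [1, 59, 518, 252, 30]

shared = set(abstract_neighbors) & set(title_neighbors)
union = set(abstract_neighbors) | set(title_neighbors)
jaccard = len(shared) / float(len(union))

print(sorted(shared))      # papers returned by both models
print(round(jaccard, 2))
```

Four of the five neighbours are shared, and one of those four is the query paper itself.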

Let's find articles similar to "Guidance for the practical management of the direct oral anticoagulants (DOACs) in VTE treatment" using Abstract similarity

In [22]:
title = "Guidance for the practical management of the direct oral anticoagulants (DOACs) in VTE treatment."
papers[papers['title']==title]
Out[22]:
title authors abstract clean_abstract clean_title
14 Guidance for the practical management of the d... [Burnett AE, Mahan CE, Vazquez SR, Oertel LB, ... Venous thromboembolism (VTE) is a serious medi... venous thromboembolism vte is a serious medica... guidance for the practical management of the d...

From the output above we see that this paper has index 14. We will use this as the paper ID in our function.

Let's show the keywords (the highest-weighted TF-IDF terms) of the query abstract

In [23]:
nearest_papers = find_nearest_papers(14, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)
for i in nearest_papers['keywords']: print "Keywords: ", i
Keywords:  doac
Keywords:  vte
Keywords:  guidanc
Keywords:  anticoagul
Keywords:  manuscript
Keywords:  statement
Keywords:  forum
Keywords:  question
Keywords:  treatment
Keywords:  thromboembol

Now let's show the Abstracts that were retrieved. Note that the search compares full TF-IDF vectors; the keywords above are only the highest-weighted terms of the query.

In [24]:
# Show the abstracts of similar papers
for i in nearest_papers['papers']['abstract']: print "Abstract: "+i+"\n"
Abstract: Venous thromboembolism (VTE) is a serious medical condition associated with significant morbidity and mortality, and an incidence that is expected to double in the next forty years. The advent of direct oral anticoagulants (DOACs) has catalyzed significant changes in the therapeutic landscape of VTE treatment. As such, it is imperative that clinicians become familiar with and appropriately implement new treatment paradigms. This manuscript, initiated by the Anticoagulation Forum, provides clinical guidance for VTE treatment with the DOACs. When possible, guidance statements are supported by existing published evidence and guidelines. In instances where evidence or guidelines are lacking, guidance statements represent the consensus opinion of all authors of this manuscript and are endorsed by the Board of Directors of the Anticoagulation Forum.The authors of this manuscript first developed a list of pivotal practical questions related to real-world clinical scenarios involving the use of DOACs for VTE treatment. We then performed a PubMed search for topics and key words including, but not limited to, apixaban, antidote, bridging, cancer, care transitions, dabigatran, direct oral anticoagulant, deep vein thrombosis, edoxaban, interactions, measurement, perioperative, pregnancy, pulmonary embolism, reversal, rivaroxaban, switching, \thrombophilia, venous thromboembolism, and warfarin to answer these questions. Non- English publications and publications > 10 years old were excluded. In an effort to provide practical information about the use of DOACs for VTE treatment, answers to each question are provided in the form of guidance statements, with the intent of high utility and applicability for frontline clinicians across a multitude of care settings.

Abstract: In principle, the answer to this question is obvious: "as long as the risk of continued therapy is outweighed by the benefit." In practice, determining an individual patient's risk of recurrent venous thromboembolism (VTE) without warfarin or other vitamin K antagonists is difficult. However, there are many factors (both intrinsic and environmental) that can alter the risk of VTE recurrence. This paper will discuss evidence and considerations (including the issue of bleeding risk) that may be relevant to decisions about duration of anticoagulant therapy for patients with VTE.

Abstract: This review describes recent evidence relevant to the treatment of deep vein thrombosis (DVT) and pulmonary embolism (PE). Because venous thromboembolism (VTE) is a spectrum of disease that includes both of these disorders, many of the therapeutic options are common to both. At the time of diagnosis, effective treatment options for patients with VTE include unfractionated heparin, low molecular weight heparins (e.g., dalteparin, enoxaparin, tinzaparin), and pentasaccharides (e.g., fondaparinux). Many patients with VTE, especially DVT, can receive most or all of their initial treatment as outpatients. Other treatment strategies such as vena caval filter placement and mechanical (or chemical) clot dissolution are discussed briefly. Anticoagulation with warfarin (or other oral vitamin K antagonists) is a highly effective strategy for the long-term prevention of VTE recurrence in most patients. In addition to presenting evidence relevant to the optimal duration of warfarin therapy, we highlight circumstances under which extended therapy with a parenteral agent such as a low molecular weight heparin might be preferable.

Abstract: Anticoagulant drugs are the foundation of therapy for patients with VTE. While effective therapeutic agents, anticoagulants can also result in hemorrhage and other side effects. Thus, anticoagulant therapy selection should be guided by the risks, benefits and pharmacologic characteristics of each agent for each patient. Safe use of anticoagulants requires not only an in-depth knowledge of their pharmacologic properties but also a comprehensive approach to patient management and education. This paper will summarize the key pharmacologic properties of the anticoagulant agents used in the treatment of patients with VTE.

Abstract: Stroke and venous thromboembolism (VTE) have a large impact on the United States (US) healthcare system. It is estimated that up to 1.7million new and recurrent stroke and VTE events are occurring in the US on an annual basis with the combined cost approaching over $200billion per year. A significant amount of stroke and VTE are preventable from appropriate antithrombotic use in at-risk patients and the Center for Medicaid and Medicare Services, the Joint Commission, the National Quality Forum and other key quality and regulatory entities have prioritized minimizing the impact of morbidity, mortality and avoidable costs related to these diseases. This review provides a brief history, overview, and update for the development of quality measures, quality systems, and regulatory and policy changes as related to stroke and VTE within the US healthcare system.

We see that all of the retrieved Abstracts are similar in topic: each concerns VTE, anticoagulation, or thromboembolism.

Conclusion

Applying k-Nearest Neighbors to a TF-IDF weights matrix appears effective at returning similar articles, judging by inspection of the examples above. The parameters used to build the TF-IDF and k-Nearest Neighbors models were chosen somewhat arbitrarily. It is reasonable to expect that retrieval quality could be improved if more time were invested in tuning them. Nevertheless, the values chosen in this notebook give sensible results.