Created on January 31, 2016 by Michael L. Bernauer
In this notebook we will use the Pandas, NLTK, NumPy, and scikit-learn libraries to find similar articles published in PubMed using k-Nearest Neighbors.
Fitting a k-NN model on TF-IDF weights to find similar papers.
Note: This notebook borrows extensively from this one by Amir Amini.
import pandas as pd
import sklearn
import numpy as np
import nltk
import re
import time
import codecs
from Bio import Medline
First we must download the text data that we are interested in using. To do this we will use articles indexed in pubmed.gov. For this notebook we are interested only in articles published from the University of New Mexico College of Pharmacy and School of Medicine. PubMed allows the use of filters/keywords to restrict your search to certain institutions. Retrieve articles affiliated with the UNM CoP and SoM by entering the following search string into the search box:

"university of new mexico"[AD] AND ("pharmacy"[AD] OR "medicine"[AD])

As of January 31, 2016 there were a total of 5,584 articles matching this search. A file called pubmed_result.txt should have been saved to your computer. This file contains all of the articles matching the search criteria in MEDLINE format.
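If you would rather fetch the records programmatically than download them through the web interface, Biopython's Entrez module can run the same query. The sketch below is only illustrative and is not part of the original workflow; the email address and output path are placeholders, and very large result sets may need to be fetched in batches.
from Bio import Entrez

# NCBI asks you to identify yourself; replace with your own email address
Entrez.email = "your.name@example.com"

query = '"university of new mexico"[AD] AND ("pharmacy"[AD] OR "medicine"[AD])'

# Run the search and keep the result set on NCBI's history server
search = Entrez.read(Entrez.esearch(db="pubmed", term=query, usehistory="y"))

# Fetch the matching records in MEDLINE format and save them to disk
handle = Entrez.efetch(db="pubmed", rettype="medline", retmode="text",
                       webenv=search["WebEnv"], query_key=search["QueryKey"],
                       retmax=int(search["Count"]))
with open("data/pubmed_result.txt", "w") as out:
    out.write(handle.read())
handle.close()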
# Function that uses the Medline module from
# the Biopython library to parse and read MEDLINE
# formatted files. Results are stored in a Pandas
# DataFrame
def read_medline_data(filename):
    recs = Medline.parse(open(filename, 'r'))
    text = pd.DataFrame(columns=["title", "authors", "abstract"])
    for rec in recs:
        try:
            abstr = rec["AB"]
            title = rec["TI"]
            auths = rec["AU"]
            text = text.append(pd.DataFrame([[title, auths, abstr]],
                                             columns=['title', 'authors', 'abstract']),
                               ignore_index=True)
        except KeyError:
            # Skip records missing a title, author list, or abstract
            continue
    return text
# Read in MEDLINE formatted text
papers = read_medline_data("data/pubmed_result.txt")
# Show the top few papers
papers.head()
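Because read_medline_data() silently skips any record that is missing a title, author list, or abstract, it is worth checking how many of the downloaded records actually made it into the DataFrame (the exact count will depend on your download):
# Number of records that had a title, authors, and an abstract
print "Number of parsed papers: ", len(papers)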
print "Title: ", papers['title'][0]
print
print "Abstract: ", papers['abstract'][0]
# Function that cleans text by removing '\x0c' and '\n' characters
# as well as all non-alpha characters and finally converts everything
# to lower case
def clean_text(text):
    stop_chars = ['\x0c', '\n']
    for i in stop_chars:
        text = text.replace(i, ' ')
    cleaned = re.sub('[^a-zA-Z]+', ' ', text)
    return cleaned.lower()
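As a quick illustration on a made-up string, runs of digits, punctuation, and whitespace all collapse to single spaces:
# Made-up example string; digits and punctuation are replaced by spaces
print clean_text("Influenza A (H1N1)\nvaccine: 2009-2010")
# prints: influenza a h n vaccine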
# Create a column for cleaned Abstract and cleaned Title
papers['clean_abstract'] = papers['abstract'].apply(clean_text)
papers['clean_title'] = papers['title'].apply(clean_text)
The papers DataFrame, including the newly added cleaned abstract column:
papers.head()
print "Title: ", papers['clean_title'][0]
print
print "Abstract: ", papers['clean_abstract'][0]
We can see that our clean_text() function successfully removed all non-alpha characters and converted everything to lower case.
Use word_tokenize() and PorterStemmer() to tokenize and stem the document Title and Abstract.
# Function that takes text, tokenizes it and
# returns a list of stemmed tokens
def tokenize_and_stem(text):
    tokens = nltk.word_tokenize(text)
    stemmer = nltk.stem.porter.PorterStemmer()
    # Stem each token and keep only stems longer than two characters
    return [i for i in [stemmer.stem(t) for t in tokens] if len(i) > 2]
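As a quick sanity check, we can run the function on a short made-up phrase; each word is reduced to its Porter stem and stems of two characters or fewer are dropped:
# Made-up example phrase; output is a list of Porter stems longer than two characters
print tokenize_and_stem("studies of anticoagulation therapies in elderly patients")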
Compute TF-IDF weights using the sklearn TfidfVectorizer.

- max_df is the maximum allowable document frequency for a token; it is set to 0.50 so that only terms appearing in less than 50% of documents are included.
- min_df is the minimum allowable document frequency for a token; it is set to 0 to include all terms, even those that appear in only one document.
- max_features sets the maximum number of features allowed; it is set to an arbitrarily large number (i.e. 200,000) to ensure we capture at least as many features as we need.
- stop_words specifies the words/tokens to remove from the corpus.
- use_idf, when set to True, enables reweighting each feature by its inverse document frequency.
- tokenizer specifies which tokenizer to use. We want to tokenize and stem, so we pass it the tokenize_and_stem() function we created above, which stems each token and keeps only stems greater than two characters in length (the default tokenizer does not stem).

We then fit the vectorizer to our cleaned text using vectorizer.fit_transform(). This returns an n x m matrix, where n is the number of documents in our corpus and m is the number of features. The feature names can be retrieved with vectorizer.get_feature_names().
# Import the TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
# Create vectorizer for Abstracts, max_df is set to 0.5, we only want
# to include terms that appear in less than 50% of the documents (i.e. exclude very common terms)
abs_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, max_features=200000,
stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)
# Create vectorizer for Title, max_df is set to 0.5, we only want
# to include terms that appear in less than 50% of the documents (i.e. exclude very common terms)
title_tfidf_vectorizer = TfidfVectorizer(max_df=0.5, min_df=0, max_features=200000,
stop_words='english', use_idf=True, tokenizer=tokenize_and_stem)
# Compute TF-IDF weights for Abstracts
tfidf_weights_abs = abs_tfidf_vectorizer.fit_transform(papers['clean_abstract'])
# Compute TF-IDF weights for Title
tfidf_weights_title = title_tfidf_vectorizer.fit_transform(papers['clean_title'])
# Get feature names for Abstract and Title models
tfidf_features_title = title_tfidf_vectorizer.get_feature_names()
tfidf_features_abs = abs_tfidf_vectorizer.get_feature_names()
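To confirm the n x m shape described above, we can inspect the abstract weights matrix and its feature list (the exact numbers will depend on your download):
# Rows = documents, columns = stemmed features
print "Abstract TF-IDF matrix shape: ", tfidf_weights_abs.shape
print "Number of abstract features: ", len(tfidf_features_abs)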
# Function for returning the top_k features of an Abstract
# or Title
def get_top_features(rownum, weights, features, top_k=10):
    weight_vec = weights.toarray()[rownum, :]
    # Indices of the top_k largest TF-IDF weights, in descending order
    top_idx = np.argsort(weight_vec)[::-1][:top_k]
    return [features[i] for i in top_idx]
# Top k features of Abstract 1
get_top_features(1, tfidf_weights_abs, tfidf_features_abs)
# Top k features of Title 1
get_top_features(1, tfidf_weights_title, tfidf_features_title)
# Build model to return 5 closest neighbors
from sklearn.neighbors import NearestNeighbors
# Create the k-NN model using k=5
nn_abs = NearestNeighbors(n_neighbors=5, algorithm='auto')
nn_title = NearestNeighbors(n_neighbors=5, algorithm='auto')
# Fit the models to the TF-IDF weights matrix
nn_fitted_abs = nn_abs.fit(tfidf_weights_abs)
nn_fitted_title = nn_title.fit(tfidf_weights_title)
def find_nearest_papers(row, kNNmodel, tfidf_weights, tfidf_features, papers):
    keywords = get_top_features(row, tfidf_weights, tfidf_features)
    dist, idx = kNNmodel.kneighbors(tfidf_weights[row, :])
    idx = list(idx[0])
    return {'papers': papers.ix[idx], 'keywords': keywords}
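Note that because each query row is itself part of the fitted matrix, the first neighbor returned is normally the query paper itself (at distance 0), so only the remaining neighbors are truly "similar" papers. You can see this by looking at the raw output of kneighbors():
# The first index returned is the query paper itself, with distance ~0
dist, idx = nn_fitted_abs.kneighbors(tfidf_weights_abs[1, :])
print "Distances: ", dist
print "Indices: ", idx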
Now that we have a function to return similar papers, we can use it to find papers with similar abstracts. We can return Authors, Title, or Abstract of similar matches
find_nearest_papers(1, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)['papers']
Now that we have a function to return similar papers, we can use it to find papers with similar Titles. We can return Authors, Title, or Abstract of similar matches
find_nearest_papers(1, nn_fitted_title, tfidf_weights_title, tfidf_features_title, papers)['papers']
title = "Guidance for the practical management of the direct oral anticoagulants (DOACs) in VTE treatment."
papers[papers['title']==title]
From the output above we see that this paper is indexed by the number 14; we will use this as the paper ID in our function.
nearest_papers = find_nearest_papers(14, nn_fitted_abs, tfidf_weights_abs, tfidf_features_abs, papers)
for i in nearest_papers['keywords']: print "Keywords: ", i
# Show the abstracts of similar papers
for i in nearest_papers['papers']['abstract']: print "Abstract: "+i+"\n"
We see that all of the retrieved Abstracts are pretty similar in that they are all about VTE, anticoagulation, thromboembolism, etc.
Applying k-Nearest Neighbors to the TF-IDF weights matrix seems to be pretty effective at returning similar articles. The parameters chosen to build the TF-IDF and k-Nearest Neighbors models were somewhat arbitrary, so it would be reasonable to assume that the accuracy of document retrieval could be improved if more time were invested in selecting optimal tuning parameters. Nevertheless, the parameters chosen in this notebook seem to work pretty well.