A naive Bayes classifier takes a collection of features (e.g. the words from a text) and assigns a label based on the maximum a posteriori probability. The naive assumption is that the features are independent of one another. A classic application of the naive Bayes classifier is document filtering and spam detection: an email is split into features, the conditional probability of each feature given a particular category (i.e. spam/not-spam) is calculated, and the label with the highest probability is assigned. It is easy to imagine other scenarios where this method may be useful. In this notebook we will use naive Bayes classification to build a simple search engine, find similar documents, and determine authorship of scientific abstracts indexed in PubMed.
First, we need to calculate the prior probability of each author in the corpus, $P(author)$. This can be done by dividing the total number of terms contributed by a particular author by the total number of terms in the corpus.
$$P(author) = \frac{terms_{author}}{terms_{corpus}}$$Next we compute the conditional probability of a particular term given an author, $P(term|author)$. This is done by dividing the number of times an author used a particular word by the total number of words associated with that author.
$$P(term | author) = \frac{TermCount_{author}}{TotalWordCount_{author}}$$Now, given an unclassified document, we can determine the joint probability of the terms for each author
$$P(document|author) = \prod\limits_{i=1}^{i=n}P(term_{i}|author)$$We've shown how to calculate $P(document|author)$, but what we really need is $P(author|document)$. We can use Bayes' theorem to flip the probability around so that we end up with $P(author|document)$. Bayes' theorem is as follows
$$P(A|B) = P(B|A) \times \frac{P(A)}{P(B)}$$which is analogous to
$$P(author | document) = P(document | author) \times \frac{P(author)}{P(document)}$$We can see that in order to compute $P(author|document)$ we must first determine $P(document|author)$ and $P(author)$, both of which were described above. $P(document)$ is simply the joint probability of the terms appearing together in a single document; it does not depend on the author, so it will not influence our ranking of authors and can safely be omitted. Consequently, our algorithm will not return a true probability; however, it will still allow us to assign the label that is most likely associated with the given text. The calculation is as follows
$$P(author|document) \approx P(document|author) \times P(author)$$In the following section we will implement the naive Bayes classifier and use it to assign authorship to scientific abstracts indexed in PubMed. We will then look at ways in which we can use the algorithm to retrieve abstracts that match a particular query, as well as a way to find similar abstracts.
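Before diving into the implementation, here is a small worked example with made-up numbers: two hypothetical authors A and B and a two-term query. Suppose $P(A) = 0.6$, $P(term_1|A) = 0.02$, $P(term_2|A) = 0.01$ and $P(B) = 0.4$, $P(term_1|B) = 0.05$, $P(term_2|B) = 0.04$. Then $P(A|document) \approx 0.6 \times 0.02 \times 0.01 = 1.2 \times 10^{-4}$ and $P(B|document) \approx 0.4 \times 0.05 \times 0.04 = 8.0 \times 10^{-4}$, so author B is ranked first even though A has the larger prior, because B uses the query terms more often.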
require(dplyr)
require(tidyr)
require(data.table)
The data we will be using in this notebook comes from abstracts published in PubMed. The abstracts were downloaded from PubMed as Medline files. The authors, abstract and journal title were then extracted using this python script and saved as a .csv file which can be downloaded here. Let's read in the data and take a look
pubmed = fread('cop.csv', header=F)
colnames(pubmed) = c('author', 'abstract', 'journal')
head(pubmed,2)
Above we can see the dataset pubmed contains a column for the author, the abstract and the journal the article appeared in. We now need a way to break the abstract apart into features (terms). Let's remove punctuation from the abstracts, convert the words to lowercase and split the text on whitespace using clean_text() and tokenize(). clean_text() converts the text to lowercase and removes everything except letters and spaces, while tokenize() splits the document on whitespace
# functions for cleaning and tokenizing
clean_text = function(text) gsub('[^A-Za-z ]', '', tolower(text))
tokenize = function(text) strsplit(gsub(' {1,}', ' ', text), ' ')
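As a quick sanity check on these helpers, here is what they produce for a made-up sentence (output shown as comments):
# example: punctuation is stripped, text is lowercased, then split on whitespace
tokenize(clean_text('Methicillin-Resistant S. aureus (MRSA) infections!'))
# [[1]]
# [1] "methicillinresistant" "s" "aureus" "mrsa" "infections"
Note that the hyphen is simply dropped rather than replaced with a space, so hyphenated words become single tokens.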
# tokenize the abstract
pubmed$abstract = sapply(pubmed$abstract, function(x) tokenize(clean_text(x)))
head(pubmed)
Next we reformat the dataset by unnesting the abstract terms and authors. This creates a dataframe where each row contains a single author-term occurrence
tmp = pubmed %>% unnest(abstract)
head(tmp, 1)
Now that we have a properly formatted dataset we need to calculate the values necessary to compute the probabilities: $P(author)$ for each author and $P(term|author)$ for each author-term pair. Once we have derived the quantities above we can easily compute $P(author | terms)$ according to the formula below:
$$P(author|terms) \propto P(author) \times \prod \limits_{i=1}^{i=n} P(term_{i} | author)$$where $\prod \limits_{i=1}^{i=n} P(term_{i} | author)$ is the joint conditional probability of all the terms associated with the author. For example, suppose Author A only contributed the following words to the corpus: i love coffee. Then the joint conditional probability of the document "i love coffee" would be calculated as follows
$$P(i|A) \times P(love|A) \times P(coffee|A) = \frac{1}{3} \times \frac{1}{3} \times \frac{1}{3} = \frac{1}{27}$$
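The same toy calculation in R (Author A's three-word corpus here is made up purely for illustration):
# hypothetical corpus contributed by Author A
author_a_terms = c('i', 'love', 'coffee')
# P(term | Author A) for each term
p_term_given_a = table(author_a_terms) / length(author_a_terms)
# joint conditional probability of the document 'i love coffee'
prod(p_term_given_a[c('i', 'love', 'coffee')])
# [1] 0.03703704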
# relabel the class and feature columns
train_data = tmp %>% mutate(class = author, feature = abstract) %>% select(class, feature)
# compute P(author), P(term|author) and finally log(P(author|term))
get_prob_data = function(train_data){
  # total number of terms in the corpus
  total_feats = dim(train_data)[1]
  train_data %>% group_by(class) %>%
    # P(author) = author's term count / total term count in the corpus
    mutate(total_class_feats = n(),
           p_class = total_class_feats/total_feats) %>%
    group_by(class, feature) %>%
    # log(P(author) * P(term|author)) for every observed author-term pair
    summarize(log_prob = log(mean(p_class*(n()/total_class_feats))))
}
prob_data = get_prob_data(train_data)
head(prob_data, 3)
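To get a feel for the table, we can pull out the highest scoring terms for a single author; Mercier RC, an author who appears later in this notebook, is used here purely as an example:
# top terms for one author, ranked by log probability
prob_data %>% filter(class == 'Mercier RC') %>% arrange(desc(log_prob)) %>% head(5)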
Because we are computing the joint probability of potentially hundreds of terms, the resulting value can be very small and we run the risk of arriving at a number that is too small to be accurately represented in R (an underflow error). To avoid this we take the log of each probability and sum the logs together to generate the joint log probability. Also, because of the way we are storing the author-term pairs (i.e. we only store the terms that have actually been observed with an author), we need a way to include probabilities for terms that have not been observed with the author. In other words, we need a way to penalize an author's score when a query term has not been observed with that author. To do this, we will assign a small fixed pseudo-probability to each unobserved query term and add its log to the author's score once per missing term.
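A quick illustration of the underflow problem (the numbers are arbitrary): multiplying many tiny probabilities collapses to zero, while summing their logs stays representable.
# the product of many small probabilities underflows to zero
prod(rep(1e-20, 20))
# [1] 0
# the sum of their logs does not
sum(log(rep(1e-20, 20)))
# [1] -921.034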
# function for returning the ranked classes for a text
naive_bayes = function(text, data, k=10, pseudo_prob = 1e-10){
  # log of the pseudo-probability assigned to query terms never observed with an author
  pseudo_prob = log(pseudo_prob)
  # clean and tokenize the query text
  tokens = tokenize(clean_text(text))[[1]]
  n = length(tokens)
  # sum the log probabilities of the matched terms, penalize each unmatched term,
  # and return the k highest scoring authors
  filter(data, feature %in% tokens) %>%
    group_by(class) %>%
    summarize(score = sum(log_prob) + pseudo_prob*(n-n())) %>%
    arrange(desc(score)) %>%
    head(k)
}
Now that we have a function that will return a ranked list of authors given a text we can test it out
naive_bayes('methicillin-resistant staphylococcus aureus', prob_data, k=5)
We can see that given the words methicillin-resistant staphylococcus aureus the most likely authors include Mercier, Kollef, Meadows, Lodise and Hall. Let's inspect a few of the abstracts contributed by Mercier
filter(pubmed, author == "Mercier RC")
It appears that there are 7 abstracts from Mercier. Abstracts 2, 3, 4, 5 and 7 appear to mention MRSA or Staphylococcus. Let's try a few more phrases
naive_bayes('biomarkers of disease', prob_data, k=5)
naive_bayes('VTE prophylaxis', prob_data, k=5)
filter(pubmed, author == 'Spyropoulos AC')
naive_bayes('adverse event database', prob_data, k=5)
naive_bayes('adverse event reporting system', prob_data, k=5)