Naive Bayes Classifier

A naive Bayes classifier takes a collection of features (e.g. words from a text) and assigns a label based on the maximum a posteriori probability. The "naive" assumption is that the features are independent of one another. A classic application of the naive Bayes classifier is document filtering and spam detection: an email is split into features, the conditional probability of each feature given a particular category (i.e. spam/not-spam) is calculated, and the label with the highest probability is assigned. It is easy to imagine other scenarios where this method may be useful. In this notebook we will use naive Bayes classification to build a simple search engine, find similar documents and determine authorship of scientific abstracts indexed in PubMed.
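To make the decision rule concrete, here is a minimal toy sketch in R (not the classifier we build below; the words and probabilities are made up) that scores a two-word "email" against spam/not-spam categories and picks the label with the highest score:

In [ ]:
# toy example with made-up probabilities; illustrates the MAP decision rule only
p_class = c(spam = 0.4, ham = 0.6)                      # P(category)
p_word  = list(spam = c(win = 0.05,  prize = 0.04),     # P(word | category)
               ham  = c(win = 0.002, prize = 0.001))
email  = c("win", "prize")                              # features of the document
scores = sapply(names(p_class),
                function(cl) p_class[[cl]] * prod(p_word[[cl]][email]))
names(which.max(scores))                                # "spam" has the highest score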

Building a Naive Bayes Classifier

First, we need to calculate the prior probability of each author in the corpus, $P(author)$. This is the number of terms in the corpus contributed by a particular author divided by the total number of terms in the corpus.

$$P(author) = \frac{terms_{author}}{terms_{corpus}}$$

Next we compute the conditional probability of a particular term given an author, $P(term|author)$. This is the number of times the author used that term divided by the total number of terms associated with that author.

$$P(term | author) = \frac{TermCount_{author}}{TotalWordCount_{author}}$$
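For example (the numbers here are hypothetical), if an author contributed 200 of the 10,000 terms in the corpus and used the word vancomycin 10 times, then

$$P(author) = \frac{200}{10000} = 0.02, \qquad P(vancomycin \mid author) = \frac{10}{200} = 0.05$$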

Now, given an unclassified document, we can determine the joint probability of the terms for each author

$$P(document|author) = \prod\limits_{i=1}^{i=n}P(term_{i}|author)$$

We've shown how to calculate $P(document|author)$, but what we really need is $P(author|document)$. We can use Bayes' theorem to flip the probability around. Bayes' theorem is as follows

$$P(A|B) = P(B|A) \times \frac{P(A)}{P(B)}$$

which is analogous to

$$P(author | document) = P(document | author) \times \frac{P(author)}{P(document)}$$

We can see that in order to compute $P(author|document)$ we must first determine $P(document|author)$ and $P(author)$, both of which were described above. $P(document)$ is simply the joint probability of the terms appearing together in a single document; it is the same for every author, so it will not influence our ranking and we can safely omit it. Consequently, our algorithm will not return a true probability, but it will still allow us to assign the label that is most likely associated with the given text. The calculation is as follows

$$P(author|document) \approx P(document|author) \times P(author)$$
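Dropping $P(document)$ is safe for ranking because it is identical for every author and cancels when we compare scores: for any two authors $A$ and $B$,

$$P(A \mid document) > P(B \mid document) \iff P(document \mid A) \times P(A) > P(document \mid B) \times P(B)$$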

In the following section we will implement the naive Bayes classifier and use it to assign authorship to scientific abstracts indexed in PubMed. We will then look at ways to use the algorithm to retrieve abstracts that match a particular query, as well as a way to find similar abstracts.

In [2]:
require(dplyr)
require(tidyr)
require(data.table)

Build the Dataset

The data we will be using in this notebook comes from abstracts published in PubMed. The abstracts were downloaded from PubMed as Medline files. The author, abstract and journal were then extracted using this python script and saved as a .csv file which can be downloaded here. Let's read in the data and take a look

In [3]:
pubmed = fread('cop.csv', header=F)
colnames(pubmed) = c('author', 'abstract', 'journal')
head(pubmed,2)
Out[3]:
author | abstract | journal
1 | Phillips JP | BACKGROUND: Although incidental findings (IF) are commonly encountered in neuroimaging research, there is no consensus regarding what to do with them. Whether researchers are obligated to review scans for IF, or if such findings should be disclosed to research participants at all, is controversial. Objective data are required to inform reasonable research policy; unfortunately, such data are lacking in the published literature. This manuscript summarizes the development of a radiology review and disclosure system in place at a neuroimaging research institute and its impact on key stakeholders. METHODS: The evolution of a universal radiology review system is described, from inception to its current status. Financial information is reviewed, and stakeholder impact is characterized through surveys and interviews. RESULTS: Consistent with prior reports, 34% of research participants had an incidental finding identified, of which 2.5% required urgent medical attention. A total of 87% of research participants wanted their magnetic resonance imaging (MRI) results regardless of clinical significance and 91% considered getting an MRI report a benefit of study participation. A total of 63% of participants who were encouraged to see a doctor about their incidental finding actually followed up with a physician. Reasons provided for not following-up included already knowing the finding existed (14%), not being able to afford seeing a physician (29%), or being reassured after speaking with the institute's Medical Director (43%). Of those participants who followed the recommendation to see a physician, nine (38%) required further diagnostic testing. No participants, including those who pursued further testing, regretted receiving their MRI report, although two participants expressed concern about the excessive personal cost. The current cost of the radiology review system is about $23 per scan. CONCLUSIONS: It is possible to provide universal radiology review of research scans through a system that is cost-effective, minimizes investigator burden, and does not overwhelm local healthcare resources. | Brain and behavior
2 | Anderson JR | Proprotein convertase subtilisin kexin type 9 (PCSK9) inhibitors are novel agents indicated for the treatment of hyperlipidemia. Inhibition of PCSK9 produces an increase in surface low-density lipoprotein (LDL)-receptors and increases removal of LDL from the circulation. Alirocumab (Praluent; Sanofi/Regeneron; Bridgewater, NJ) and evolocumab (Repatha; Amgen; Thousand Oaks, CA) are currently available and approved for use in patients with heterozygous familial hypercholesterolemia, homozygous familial hypercholesterolemia, and clinical atherosclerotic cardiovascular disease. Bococizumab (RN316; Pfizer; New York, NY) is currently being studied in similar indications, with an estimated approval date in late 2016. The pharmacodynamic effects of PCSK9 inhibitors have been extensively studied in various patient populations. They have been shown to produce significant reductions in LDL and are well-tolerated in clinical studies, but they are very costly when compared to statins, the current mainstay of hyperlipidemia treatment. Clinical outcome studies are underway, but not yet available; however, meta-analyses have pointed to a reduction in cardiovascular death and cardiovascular events with the use of PCSK9 inhibitors. This review will discuss the novel mechanism of action of PCSK9 inhibitors, the present results of clinical studies, and the clinical considerations of these agents in current therapy. | Cardiology in review

Above we can see that the dataset pubmed contains columns for the author, the abstract and the journal title. We now need a way to break the abstract apart into features (terms). Let's remove punctuation from the abstracts, convert words to lowercase and split the text on whitespace using clean_text() and tokenize(): clean_text() converts the text to lowercase and removes all non-alphabetic, non-space characters, and tokenize() splits the document on whitespace.

In [4]:
# functions for cleaning and tokenizing
clean_text = function(text) gsub('[^A-Za-z ]', '', tolower(text))
tokenize = function(text) strsplit(gsub(' {1,}', ' ', text), ' ')
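As a quick check of what these helpers do, here is a hypothetical string run through both (expected results shown as comments):

In [ ]:
clean_text("Hello, World! 123")                # "hello world " -- lowercased, punctuation and digits removed
tokenize(clean_text("Hello, World! 123"))[[1]] # c("hello", "world")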
In [5]:
# tokenize the abstract
pubmed$abstract = sapply(pubmed$abstract, function(x) tokenize(clean_text(x)))
head(pubmed)
Out[5]:
author | abstract | journal
1 | Phillips JP | background,although,incidental,findings,if,are, | Brain and behavior
2 | Anderson JR | proprotein,convertase,subtilisin,kexin,type,pcsk, | Cardiology in review
3 | Wortman SB | hepatitis,c,virus,hcv,is,the, | Pharmacotherapy
4 | Mapel DW | this,review,identifies,and,evaluates,the, | Expert review of pharmacoeconomics & outcomes research
5 | Burchiel SW | development,of,blood,cells,through,hematopoiesis, | Current protocols in toxicology / editorial board, Mahin D. Maines (editor-in-chief) ... [et al.]
6 | Feng CJ | cancer,cells,are,more,susceptible,to, | Journal of inorganic biochemistry

Next we reshape the dataset by unnesting the abstract terms. This creates a dataframe where each row contains a single author-term occurrence

In [6]:
tmp = pubmed %>% unnest(abstract)
head(tmp, 1)
Out[6]:
author | journal | abstract
1 | Phillips JP | Brain and behavior | background

Now that we have a properly formatted dataset, we need to calculate the values necessary to compute the probabilities. Steps to compute the probability:

  1. Calculate $P(term | author)$
    • For a particular term, count the number of times the author mentioned that term and divide by the total number of words associated with the author
  2. Calculate $P(author)$
    • Calculate the total number of terms associated with the author and divide by the total number of terms in the corpus.

Once we have derived the quantities above, we can easily compute $P(author | terms)$ according to the formula below:

$$P(author|terms) \propto P(author) \times \prod \limits_{i=1}^{i=n} P(term_{i} | author)$$

where $\prod \limits_{i=1}^{i=n} P(term_{i} | author)$ is the joint conditional probability of all terms associated with the author. For example, suppose Author A contributed only the following words to the corpus: i love coffee. The joint conditional probability would then be calculated as follows

$$P(terms \mid AuthorA) = \prod \limits_{i=1}^{i=n} P(term_{i} \mid AuthorA) = P(i \mid AuthorA) \times P(love \mid AuthorA) \times P(coffee \mid AuthorA)$$
In [7]:
# relabel the class and feature columns
train_data = tmp %>% mutate(class = author, feature = abstract) %>% select(class, feature)
In [8]:
# compute P(author), P(term|author) and finally log(P(author|term))
get_prob_data = function(train_data){
    total_feats = dim(train_data)[1]
    train_data %>% group_by(class) %>%
        mutate(total_class_feats = n(),
               p_class = total_class_feats/total_feats) %>%
        group_by(class, feature) %>%
        summarize(log_prob = log(mean(p_class*(n()/total_class_feats))))
}
In [9]:
prob_data = get_prob_data(train_data)
head(prob_data, 3)
Out[9]:
class | feature | log_prob
1 | Afshari CA | ahnonresponsive | -11.68877
2 | Afshari CA | ahresponsive | -11.68877
3 | Afshari CA | an | -11.68877
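Each log_prob above is $\log(P(author) \times P(term|author))$. As a quick sanity check (assuming the pipeline above ran as written), exponentiating and summing over an author's terms should recover that author's prior, $P(author)$:

In [ ]:
# sum of exp(log_prob) over an author's terms = P(author) * sum(P(term|author)) = P(author)
prob_data %>% group_by(class) %>% summarize(p_author = sum(exp(log_prob))) %>% head(3)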

Because we are computing the joint probability of potentially hundreds of terms, the result may be very small and we run the risk of arriving at a number that is too small to be accurately represented in R (an underflow error). To avoid this we take the log of each probability and sum the logs to obtain the joint log probability. Also, because of the way we are storing the author-term pairs (i.e. we only store the terms that have actually been observed with an author), we need a way to include probabilities for terms that have not been observed with the author. In other words, we need a way to penalize an author's score when a query term has not been observed with that author. To do this, we will simply assign unobserved terms a small fixed pseudo-probability (the pseudo_prob argument below)
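To see why working in log space matters, consider a toy illustration (arbitrary numbers, not from the corpus): multiplying many small probabilities directly underflows to zero, while summing their logs stays perfectly representable.

In [ ]:
# 20 probabilities of 1e-20: the direct product underflows to 0,
# but the sum of logs is an ordinary double
prod(rep(1e-20, 20))       # 0
sum(log(rep(1e-20, 20)))   # about -921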

In [10]:
# function for returning the ranked classes for a text
naive_bayes = function(text, data, k=10, pseudo_prob = 1e-10){
    pseudo_prob = log(pseudo_prob)               # log pseudo-probability for unseen terms
    tokens = tokenize(clean_text(text))[[1]]     # clean and tokenize the query
    n = length(tokens)                           # total number of query terms
    filter(data, feature %in% tokens) %>%
        group_by(class) %>%
        # sum the log probabilities of the matched terms, then penalize each
        # query term that was never observed with the class
        summarize(score = sum(log_prob) + pseudo_prob*(n-n())) %>%
        arrange(desc(score)) %>%
        head(k)
}

Now that we have a function that returns a ranked list of authors for a given text, we can test it out

Document Filtering and Recommendations using Naive Bayes Classifier

Document Recommendations

Now that we have constructed our classifier, we can use it to retrieve articles matching a given query

In [11]:
naive_bayes('methicillin-resistant staphylococcus aureus', prob_data, k=5)
Out[11]:
class | score
1 | Mercier RC | -29.49035
2 | Kollef M | -33.27454
3 | Meadows C | -33.45687
4 | Lodise T | -35.0663
5 | Hall PR | -42.02136

We can see that given the query methicillin-resistant staphylococcus aureus the most likely authors include Mercier, Kollef, Meadows, Lodise and Hall. Let's inspect a few of the abstracts contributed by Mercier

In [14]:
filter(pubmed, author == "Mercier RC")
Out[14]:
author | abstract | journal
1 | Mercier RC | background,trimethoprimsulfamethoxazole,tmpsmx,is,the,recommended, | Journal of managed care & specialty pharmacy
2 | Mercier RC | purpose,synergy,between,betalactams,and,vancomycin, | Clinical therapeutics
3 | Mercier RC | vancomycin,van,is,often,used,to, | Antimicrobial agents and chemotherapy
4 | Mercier RC | vancomycin,with,piperacillintazobactam,is,used,as, | Antimicrobial agents and chemotherapy
5 | Mercier RC | background,therapeutic,use,of,vancomycin,is, | The Journal of antimicrobial chemotherapy
6 | Mercier RC | the,present,study,characterized,the,singledose, | Antimicrobial agents and chemotherapy
7 | Mercier RC | background,hemodialysis,vascular,access,infections,are, | American journal of nephrology

It appears that there are 7 abstracts from Mercier. Abstracts 2, 3, 4, 5 and 7 appear to mention MRSA or Staphylococcus. Let's try a few more phrases

In [15]:
naive_bayes('biomarkers of disease', prob_data, k=5)
Out[15]:
class | score
1 | Walker MK | -27.27643
2 | Campen MJ | -29.55287
3 | Casas JP | -30.62365
4 | Deming P | -32.17593
5 | Sood R | -33.12039
In [16]:
naive_bayes('VTE prophylaxis', prob_data, k=5)
Out[16]:
class | score
1 | Spyropoulos AC | -19.12904
2 | Ansell J | -33.10518
3 | Mahan CE | -33.32832
4 | Vo-Nguyen T | -33.32832
5 | Wittkowsky A | -34.02147
In [17]:
filter(pubmed, author == 'Spyropoulos AC')
Out[17]:
author | abstract | journal
1 | Spyropoulos AC | healthcare,reform,is,upon,the,united, | Thrombosis and haemostasis
2 | Spyropoulos AC | two,concepts,relating,to,venous,thromboembolism, | Clinical and applied thrombosis/hemostasis : official journal of the International Academy of Clinical and Applied Thrombosis/Hemostasis
3 | Spyropoulos AC | preventable,venous,thromboembolism,vte,and,appropriate, | Thrombosis and haemostasis
4 | Spyropoulos AC | advances,in,antithrombotic,therapy,began,when, | Thrombosis research
In [21]:
naive_bayes('adverse event database', prob_data, k=5)
Out[21]:
class | score
1 | Holodniy M | -33.27454
2 | Raisch DW | -34.37316
3 | Greenwald BM | -43.07118
4 | West DP | -43.69534
5 | Koster SA | -45.01709
In [41]:
naive_bayes('adverse event reporting system', prob_data, k=5)
Out[41]:
class | score
1 | Expert opinion on drug safety | -40.39896
2 | The Annals of pharmacotherapy | -41.05129
3 | Pharmacotherapy | -43.57702
4 | Anti-cancer drugs | -46.06192
5 | Pediatrics | -54.50864