A naive Bayes classifier takes a collection of features (e.g. the words from a text) and assigns a label based on the maximum a posteriori probability. The naive assumption is that the features are independent of one another. A classic application of the naive Bayes classifier is document filtering and spam detection: an email is split into features, the conditional probability of each feature given a particular category (i.e. spam/not-spam) is calculated, and the label with the highest probability is assigned. It is easy to imagine other scenarios where this method may be useful. In this notebook we will use naive Bayes classification to build a simple search engine, find similar documents, and determine authorship of scientific abstracts indexed in PubMed.
First, we need to calculate the prior probability of each author in the corpus, $P(author)$. This can be done by dividing the total number of terms contributed by a particular author by the total number of terms in the corpus.
$$P(author) = \frac{terms_{author}}{terms_{corpus}}$$Next we compute the conditional probability of a particular term given an author, $P(term|author)$. This is done by dividing the number of times an author used a particular word by the total number of words associated with that author.
$$P(term | author) = \frac{TermCount_{author}}{TotalWordCount_{author}}$$Now, given an unclassified document, we can determine the joint probability of the terms for each author
$$P(document|author) = \prod\limits_{i=1}^{i=n}P(term_{i}|author)$$We've shown how to calculate $P(document|author)$, but what we really need is $P(author|document)$. We can use Bayes' theorem to flip the probability around so that we end up with $P(author|document)$. Bayes' theorem is as follows
$$P(A|B) = P(B|A) \times \frac{P(A)}{P(B)}$$which is analogous to
$$P(author | document) = P(document | author) \times \frac{P(author)}{P(document)}$$We can see that in order to compute $P(author|document)$ we must first determine $P(document|author)$ and $P(author)$, both of which were described above. $P(document)$ is simply the joint probability of the terms appearing together in a single document; it does not depend on the author, so it will not influence our ranking of authors and can safely be omitted. Consequently, our algorithm will not return a true probability; however, it will still allow us to assign the label that is most likely associated with the given text. The calculation is as follows
$$P(author|document) \approx P(document|author) \times P(author)$$In the following section we will implement the naive Bayes classifier and use it to assign authorship to scientific abstracts indexed in PubMed. We will then look at ways in which we can use the algorithm to retrieve abstracts that match a particular query, as well as a way to find similar abstracts.
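Before diving into the implementation, here is a small worked example with made-up numbers: two hypothetical authors A and B and a two-term query. Suppose $P(A) = 0.6$, $P(term_1|A) = 0.02$, $P(term_2|A) = 0.01$ and $P(B) = 0.4$, $P(term_1|B) = 0.05$, $P(term_2|B) = 0.04$. Then $P(A|document) \approx 0.6 \times 0.02 \times 0.01 = 1.2 \times 10^{-4}$ and $P(B|document) \approx 0.4 \times 0.05 \times 0.04 = 8.0 \times 10^{-4}$, so author B is ranked first even though A has the larger prior, because B uses the query terms more often.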
require(dplyr)
require(tidyr)
require(data.table)
The data we will be using in this notebook comes from abstracts published in PubMed. The abstracts were downloaded from PubMed as Medline files. The authors, abstract and journal title were then extracted using this python script and saved as a .csv file which can be downloaded here. Let's read in the data and take a look
pubmed = fread('cop.csv', header=F)
colnames(pubmed) = c('author', 'abstract', 'journal')
head(pubmed,2)
Above we can see the dataset pubmed contains a column for the author, the abstract and the journal the article appeared in. We now need a way to break the abstract apart into features (terms). Let's remove punctuation from the abstracts, convert the words to lowercase and split the text on whitespace using clean_text() and tokenize(). clean_text() converts the text to lowercase and removes everything except letters and spaces, while tokenize() splits the document on whitespace
# functions for cleaning and tokenizing
clean_text = function(text) gsub('[^A-Za-z ]', '', tolower(text))
tokenize = function(text) strsplit(gsub(' {1,}', ' ', text), ' ')
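As a quick sanity check on these helpers, here is what they produce for a made-up sentence (output shown as comments):
# example: punctuation is stripped, text is lowercased, then split on whitespace
tokenize(clean_text('Methicillin-Resistant S. aureus (MRSA) infections!'))
# [[1]]
# [1] "methicillinresistant" "s" "aureus" "mrsa" "infections"
Note that the hyphen is simply dropped rather than replaced with a space, so hyphenated words become single tokens.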
# tokenize the abstract
pubmed$abstract = sapply(pubmed$abstract, function(x) tokenize(clean_text(x)))
head(pubmed)
Next we reformat the dataset by unnesting the abstract terms and authors. This creates a dataframe where each row contains a single author-term occurrence
tmp = pubmed %>% unnest(abstract)
head(tmp, 1)
Now that we have a properly formatted dataset we need to calculate the values necessary to compute the probabilities: $P(author)$ for each author and $P(term|author)$ for each author-term pair. Once we have derived the quantities above we can easily compute $P(author | terms)$ according to the formula below:
$$P(author|terms) \propto P(author) \times \prod \limits_{i=1}^{i=n} P(term_{i} | author)$$where $\prod \limits_{i=1}^{i=n} P(term_{i} | author)$ is the joint conditional probability of all the terms associated with the author. For example, suppose Author A only contributed the following words to the corpus: i love coffee. Then the joint conditional probability of the document "i love coffee" would be calculated as follows
$$P(i|A) \times P(love|A) \times P(coffee|A) = \frac{1}{3} \times \frac{1}{3} \times \frac{1}{3} = \frac{1}{27}$$
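The same toy calculation in R (Author A's three-word corpus here is made up purely for illustration):
# hypothetical corpus contributed by Author A
author_a_terms = c('i', 'love', 'coffee')
# P(term | Author A) for each term
p_term_given_a = table(author_a_terms) / length(author_a_terms)
# joint conditional probability of the document 'i love coffee'
prod(p_term_given_a[c('i', 'love', 'coffee')])
# [1] 0.03703704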
# relabel the class and feature columns
train_data = tmp %>% mutate(class = author, feature = abstract) %>% select(class, feature)
# compute P(author), P(term|author) and finally log(P(author|term))
get_prob_data = function(train_data){
  # total number of terms in the corpus
  total_feats = dim(train_data)[1]
  train_data %>% group_by(class) %>%
    # P(author) = author's term count / total term count in the corpus
    mutate(total_class_feats = n(),
           p_class = total_class_feats/total_feats) %>%
    group_by(class, feature) %>%
    # log(P(author) * P(term|author)) for every observed author-term pair
    summarize(log_prob = log(mean(p_class*(n()/total_class_feats))))
}
prob_data = get_prob_data(train_data)
head(prob_data, 3)
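To get a feel for the table, we can pull out the highest scoring terms for a single author; Mercier RC, an author who appears later in this notebook, is used here purely as an example:
# top terms for one author, ranked by log probability
prob_data %>% filter(class == 'Mercier RC') %>% arrange(desc(log_prob)) %>% head(5)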
Because we are computing the joint probability of potentially hundreds of terms, the resulting value can be very small and we run the risk of arriving at a number that is too small to be accurately represented in R (an underflow error). To avoid this we take the log of each probability and sum the logs together to generate the joint log probability. Also, because of the way we are storing the author-term pairs (i.e. we only store the terms that have actually been observed with an author), we need a way to include probabilities for terms that have not been observed with the author. In other words, we need a way to penalize an author's score when a query term has not been observed with that author. To do this, we will assign a small fixed pseudo-probability to each unobserved query term and add its log to the author's score once per missing term.
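A quick illustration of the underflow problem (the numbers are arbitrary): multiplying many tiny probabilities collapses to zero, while summing their logs stays representable.
# the product of many small probabilities underflows to zero
prod(rep(1e-20, 20))
# [1] 0
# the sum of their logs does not
sum(log(rep(1e-20, 20)))
# [1] -921.034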
# function for returning the ranked classes for a text
naive_bayes = function(text, data, k=10, pseudo_prob = 1e-10){
  # log of the pseudo-probability assigned to query terms never observed with an author
  pseudo_prob = log(pseudo_prob)
  # clean and tokenize the query text
  tokens = tokenize(clean_text(text))[[1]]
  n = length(tokens)
  # sum the log probabilities of the matched terms, penalize each unmatched term,
  # and return the k highest scoring authors
  filter(data, feature %in% tokens) %>%
    group_by(class) %>%
    summarize(score = sum(log_prob) + pseudo_prob*(n-n())) %>%
    arrange(desc(score)) %>%
    head(k)
}
Now that we have a function that will return a ranked list of authors given a text we can test it out
naive_bayes('methicillin-resistant staphylococcus aureus', prob_data, k=5)
We can see that given the words methicillin-resistant staphylococcus aureus the most likely authors include Mercier, Kollef, Meadows, Lodise and Hall. Let's inspect a few of the abstracts contributed by Mercier
filter(pubmed, author == "Mercier RC")
It appears that there are 7 abstracts from Mercier. Abstracts 2, 3, 4, 5 and 7 appear to mention MRSA or Staphylococcus. Let's try a few more phrases
naive_bayes('biomarkers of disease', prob_data, k=5)
naive_bayes('VTE prophylaxis', prob_data, k=5)
filter(pubmed, author == 'Spyropoulos AC')
naive_bayes('adverse event database', prob_data, k=5)
naive_bayes('adverse event reporting system', prob_data, k=5)