Automated Readability Index and Naive Bayes Classification

by Michael L. Bernauer

The Automated Readability Index (ARI) is a measure designed to reflect the readability of a text. It does this by providing the estimated U.S. grade level needed to fully understand the text. The formula is as follows:

$$ARI = 4.71 \times \frac{nchars}{nwords} + 0.5 \times \frac{nwords}{nsents} - 21.43$$

Where nchars is the number of letters and digits, nwords is the number of words (which we will approximate by counting spaces), and nsents is the number of sentences. We can think of $\frac{nchars}{nwords}$ as the average word length and $\frac{nwords}{nsents}$ as the average sentence length. From the formula we can see that the estimated grade level is proportional to both the average word length and the average sentence length. This makes sense, as longer words and longer sentences usually correspond to more complex sentence structure.
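
To see the formula in action, we can plug in some made-up counts (45 characters, 10 words, and 2 sentences are purely illustrative values here, not taken from any real text):

```r
# Hypothetical counts for a short, simple passage
n_chars <- 45   # letters and digits
n_words <- 10
n_sents <- 2
# average word length 4.5, average sentence length 5
4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43  # 2.265
```

A short passage made of small words and short sentences lands around a 2nd grade level, as we'd expect.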

Let's start by implementing this function in code:

Implementing a Function to Compute ARI

In [83]:
require(dplyr)
require(ggplot2)
require(data.table)
require(reshape2)
require(tidyr)
require(tm)
In [84]:
# function for computing the automated readability index
automated_readability_index = function(n_chars, n_words, n_sents) {
    4.71*(n_chars/n_words) + 0.5*(n_words/n_sents) - 21.43
}

Before we can calculate the ARI we must first be able to compute n_chars, n_words, and n_sents. As mentioned above, n_words is the number of spaces in a text, which can easily be counted using regular expressions. n_chars is the number of alphanumeric characters, which can be counted using the regular expression pattern [A-Za-z0-9]; this matches all upper- and lowercase letters as well as the digits 0-9. Finally, let's assume sentences are terminated by the characters . ! and ? followed by a space ' '. We can determine the number of sentences by counting the number of sentence-terminating strings and adding 1. Let's write a function that will extract n_chars, n_words, and n_sents from a text.

In [85]:
# function for obtaining counts required to compute ARI
extract_counts = function(text){
    n_words = length(gregexpr(" ", text)[[1]])
    n_chars = length(gregexpr("[A-Za-z0-9]", text)[[1]])
    n_sents = length(gregexpr("\\. |\\? |! ", text)[[1]]) + 1
    list(n_chars = n_chars, n_words = n_words, n_sents = n_sents)
}

In the function above, 1 is added to n_sents; try to think about why this may be. Now that we have our function to extract text statistics, let's add a little more abstraction by creating a function that uses both the extract_counts() function and the automated_readability_index() function to return the ARI for a body of text.

In [5]:
# function for computing readability
get_ARI = function(x){
    counts = extract_counts(x)
    automated_readability_index(counts$n_chars, counts$n_words, counts$n_sents)
}
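
As a quick smoke test, we can run get_ARI() on a made-up two-sentence passage (the snippet below repeats the definitions from the cells above so it runs on its own):

```r
automated_readability_index = function(n_chars, n_words, n_sents) {
    4.71*(n_chars/n_words) + 0.5*(n_words/n_sents) - 21.43
}

extract_counts = function(text){
    n_words = length(gregexpr(" ", text)[[1]])
    n_chars = length(gregexpr("[A-Za-z0-9]", text)[[1]])
    n_sents = length(gregexpr("\\. |\\? |! ", text)[[1]]) + 1
    list(n_chars = n_chars, n_words = n_words, n_sents = n_sents)
}

get_ARI = function(x){
    counts = extract_counts(x)
    automated_readability_index(counts$n_chars, counts$n_words, counts$n_sents)
}

# 31 alphanumeric characters, 7 spaces, 1 ". " terminator + 1 = 2 sentences
get_ARI("The quick brown fox jumps. It runs away.")  # ~1.18
```

A simple two-sentence example comes out around a 1st grade reading level, which seems reasonable.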

Now might be a good time to load some text to test our ARI function on. We will be testing the ARI function on a set of article abstracts obtained from PubMed. I've downloaded two separate Medline files, one containing articles from the UNM School of Medicine and the other containing articles from the UNM College of Pharmacy. I converted the Medline files to .csv using this Python script, which extracts the authors, title, and abstract of each article.

In [6]:
# Load pubmed authors, abstract and titles
cop = fread('cop.csv', header=F)
som = fread('som.csv', header=F)
column_names = c('authors', 'abstract', 'title')
colnames(cop) = column_names
colnames(som) = column_names
head(cop,2)
Out[6]:
authors | abstract | title
1Shoemaker JM,Cole C,Petree LE,Helitzer DL,Holdsworth MT,Gluck JP,Phillips JPBACKGROUND: Although incidental findings (IF) are commonly encountered in neuroimaging research, there is no consensus regarding what to do with them. Whether researchers are obligated to review scans for IF, or if such findings should be disclosed to research participants at all, is controversial. Objective data are required to inform reasonable research policy; unfortunately, such data are lacking in the published literature. This manuscript summarizes the development of a radiology review and disclosure system in place at a neuroimaging research institute and its impact on key stakeholders. METHODS: The evolution of a universal radiology review system is described, from inception to its current status. Financial information is reviewed, and stakeholder impact is characterized through surveys and interviews. RESULTS: Consistent with prior reports, 34% of research participants had an incidental finding identified, of which 2.5% required urgent medical attention. A total of 87% of research participants wanted their magnetic resonance imaging (MRI) results regardless of clinical significance and 91% considered getting an MRI report a benefit of study participation. A total of 63% of participants who were encouraged to see a doctor about their incidental finding actually followed up with a physician. Reasons provided for not following-up included already knowing the finding existed (14%), not being able to afford seeing a physician (29%), or being reassured after speaking with the institute's Medical Director (43%). Of those participants who followed the recommendation to see a physician, nine (38%) required further diagnostic testing. No participants, including those who pursued further testing, regretted receiving their MRI report, although two participants expressed concern about the excessive personal cost. The current cost of the radiology review system is about $23 per scan. 
CONCLUSIONS: It is possible to provide universal radiology review of research scans through a system that is cost-effective, minimizes investigator burden, and does not overwhelm local healthcare resources.Evolution of universal review and disclosure of MRI reports to research participants.
2Sible AM,Nawarskas JJ,Anderson JRProprotein convertase subtilisin kexin type 9 (PCSK9) inhibitors are novel agents indicated for the treatment of hyperlipidemia. Inhibition of PCSK9 produces an increase in surface low-density lipoprotein (LDL)-receptors and increases removal of LDL from the circulation. Alirocumab (Praluent; Sanofi/Regeneron; Bridgewater, NJ) and evolocumab (Repatha ; Amgen; Thousand Oaks, CA) are currently available and approved for use in patients with heterozygous familial hypercholesterolemia, homozygous familial hypercholesterolemia, and clinical atherosclerotic cardiovascular disease. Bococizumab (RN316; Pfizer; New York, NY) is currently being studied in similar indications, with an estimated approval date in late 2016. The pharmacodynamic effects of PCSK9 inhibitors have been extensively studied in various patient populations. They have been shown to produce significant reductions in LDL and are well-tolerated in clinical studies, but they are very costly when compared to statins, the current mainstay of hyperlipidemia treatment. Clinical outcome studies are underway, but not yet available; however, meta-analyses have pointed to a reduction in cardiovascular death and cardiovascular events with the use of PCSK9 inhibitors. This review will discuss the novel mechanism of action of PCSK9 inhibitors, the present results of clinical studies, and the clinical considerations of these agents in current therapy.PCSK9 Inhibitors: An Innovative Approach to Treating Hyperlipidemia.

Let's use the mutate function from the dplyr package to compute the ARI for each abstract and then combine the College of Pharmacy and School of Medicine results into one dataframe.

In [86]:
# calculate ARI for each abstract in cop and som datasets
cop = cop %>% group_by(abstract) %>% mutate(ARI = get_ARI(abstract), program = "UNM College of Pharmacy")
som = som %>% group_by(abstract) %>% mutate(ARI = get_ARI(abstract), program = "UNM School of Medicine")
# combine into single dataframe
df = rbind(cop, som)

Now let's compare the mean ARI between the COP and the SOM.

In [87]:
# compute mean ARI for cop and som
mean(cop$ARI)
mean(som$ARI)
Out[87]:
17.7362889937058
Out[87]:
17.3491226747394

We can see that the mean ARI is very similar between the two programs. Let's plot the histograms to get a sense of the ARI distributions.

In [92]:
# change the figure height in the notebook
options(repr.plot.height=4)
# create ARI histograms for both programs
#svg('ari-histogram.svg', width=6, height=4)
ggplot(df, aes(x=ARI, fill=program)) +
  geom_histogram(data=filter(df, program=="UNM School of Medicine"), alpha=0.5, bins=50) +
  geom_histogram(data=filter(df, program=="UNM College of Pharmacy"), alpha=0.5, bins=50) +
  geom_vline(xintercept = mean(cop$ARI), linetype=3, size=0.75, color='red') +
  geom_vline(xintercept = mean(som$ARI), linetype=3, size=0.75, color='steelblue') +
  scale_fill_manual("", breaks=c("UNM School of Medicine", "UNM College of Pharmacy"), values = c("red", "steelblue")) +
  theme_light() +
  theme(plot.title=element_text(size=11),
        axis.title.x = element_text(size=10),
        legend.key = element_blank()) +
  labs(title="Automated Readability Index: PubMed Abstracts",
       y="",
       x="Estimated Grade Level (ARI)")
#dev.off()

We see that the Automated Readability Index distributions for the College of Pharmacy and the School of Medicine are virtually identical, and both require a fairly high reading level to be understood (COP: 17.74, SOM: 17.35).

In the example above we've seen how we can use an algorithm to estimate the reading level of a body of text. In the next notebook, we will implement a Naive Bayes classifier to predict author attribution for a particular text. Bayes classifiers are often used for text classification; a classic example is the classification of emails as either 'spam' or 'not spam'. As we will see next, the Naive Bayes classifier can be used to classify a piece of text as either belonging to an author or not.
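
As a small preview (a toy sketch with made-up training texts, not the implementation we'll build next), the core idea is to score a new text against per-class word frequencies, assuming words are independent given the class:

```r
# Toy training data: two "documents" per class (hypothetical examples)
train = list(
    spam = c("win money now", "free money offer"),
    ham  = c("meeting at noon", "project update attached")
)
# word frequency table per class
word_counts = lapply(train, function(docs) table(unlist(strsplit(docs, " "))))

# score each class by the sum of log word likelihoods (Laplace smoothing
# handles words a class has never seen) and return the best-scoring class
classify = function(text, counts, smooth = 1) {
    words = unlist(strsplit(text, " "))
    vocab = unique(unlist(lapply(counts, names)))
    scores = sapply(names(counts), function(cls) {
        tab = counts[[cls]]
        n = as.numeric(tab[words])
        n[is.na(n)] = 0
        sum(log((n + smooth) / (sum(tab) + smooth * length(vocab))))
    })
    names(which.max(scores))
}

classify("free money", word_counts)  # "spam"
```

This sketch ignores class priors (both classes have the same number of training documents here); the full classifier in the next notebook will work with real abstracts rather than toy strings.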