The Automated Readability Index (ARI) is a measure designed to reflect the readability of a text. It does this by providing the estimated grade level needed to fully understand the text. The formula is as follows:
$$ARI = 4.71 \times \frac{nchars}{nwords} + 0.5 \times \frac{nwords}{nsents} - 21.43$$
where `nchars` is the number of letters and numbers, `nwords` is the number of words (which we will approximate by counting spaces), and `nsents` is the number of sentences. We can think of $\frac{nchars}{nwords}$ as the average word length and $\frac{nwords}{nsents}$ as the average sentence length. From the formula we can see that the estimated grade level increases with both the average word length and the average sentence length. This makes sense, as longer words and longer sentences usually correspond to more complex text.
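For example, a text averaging 5 characters per word and 20 words per sentence would have $ARI = 4.71 \times 5 + 0.5 \times 20 - 21.43 \approx 12.1$, i.e. roughly a twelfth-grade reading level.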
Let's start by implementing this function in code:
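# load required packages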
require(dplyr)
require(ggplot2)
require(data.table)
require(reshape2)
require(tidyr)
require(tm)
# function for computing the automated readability index
automated_readability_index = function(n_chars, n_words, n_sents) {
  4.71*(n_chars/n_words) + 0.5*(n_words/n_sents) - 21.43
}
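As a quick sanity check, we can plug in the averages from the worked example above (say 100 characters across 20 words in a single sentence):
automated_readability_index(100, 20, 1)
# [1] 12.12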
Before we can calculate the ARI we must first be able to compute `n_chars`, `n_words`, and `n_sents`. As mentioned above, we approximate `n_words` by the number of spaces in the text, which can easily be counted using regular expressions. `n_chars` is the number of alphanumeric characters, which can be counted using the regular expression `[A-Za-z0-9]`; this matches all upper and lower case letters as well as the digits 0-9. Finally, let's assume sentences are terminated by one of the characters `.`, `!`, or `?` followed by a space `' '`. We can then determine the number of sentences by counting the sentence-terminating strings and adding 1. Let's write a function that extracts `n_chars`, `n_words`, and `n_sents` from a text.
# function for obtaining counts required to compute ARI
extract_counts = function(text){
  # words: approximated by the number of spaces
  n_words = length(gregexpr(" ", text)[[1]])
  # characters: count of alphanumeric characters
  n_chars = length(gregexpr("[A-Za-z0-9]", text)[[1]])
  # sentences: count of sentence-terminating strings, plus 1
  n_sents = length(gregexpr("\\. |\\? |! ", text)[[1]]) + 1
  list(n_chars = n_chars, n_words = n_words, n_sents = n_sents)
}
In the function above, 1 is added to `n_sents`; try to think about why this might be. Now that we have our function for extracting text statistics, let's add a little more abstraction by creating a function that uses both the `extract_counts()` function and the `automated_readability_index()` function to return the ARI for a body of text.
# function for computing readability
get_ARI = function(x){
  counts = extract_counts(x)
  automated_readability_index(counts$n_chars, counts$n_words, counts$n_sents)
}
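Let's make sure everything works on a small example (the sample string below is my own, chosen so the counts are easy to verify by hand):
extract_counts("The cat sat. The dog ran!")
# $n_chars
# [1] 18
#
# $n_words
# [1] 5
#
# $n_sents
# [1] 2

get_ARI("The cat sat. The dog ran!")
# [1] -3.224
Working through `n_sents` by hand here also answers the question above. Note that the ARI can fall below 1 for trivially simple text; the scale is only meaningful for realistic prose.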
Now might be a good time to load some text to test our ARI function on. We will be testing the ARI function on a set of article abstracts obtained from PubMed. I've downloaded two separate Medline files, one containing articles from the UNM School of Medicine and the other containing articles from the UNM College of Pharmacy. I converted the Medline files to .csv using this Python script, which extracts the authors, title, and abstract of each article.
# Load pubmed authors, abstract and titles
cop = fread('cop.csv', header=F)
som = fread('som.csv', header=F)
column_names = c('authors', 'abstract', 'title')
colnames(cop) = column_names
colnames(som) = column_names
head(cop,2)
Let's use the `mutate` function from the `dplyr` package to compute the ARI for each abstract, and then combine the College of Pharmacy and School of Medicine results into one dataframe. Note that `get_ARI()` is not vectorized, so we group by abstract to ensure it is applied to one abstract at a time.
# calculate ARI for each abstract in cop and som datasets
cop = cop %>% group_by(abstract) %>% mutate(ARI = get_ARI(abstract), program = "UNM College of Pharmacy")
som = som %>% group_by(abstract) %>% mutate(ARI = get_ARI(abstract), program = "UNM School of Medicine")
# combine into single dataframe
df = rbind(cop, som)
Now let's compare the mean ARI between the COP and the SOM:
# compute mean ARI for cop and som
mean(cop$ARI)
mean(som$ARI)
We can see that the mean ARI is very similar between the two programs. Let's plot histograms to get a sense of the ARI distributions:
# change the figure height in the notebook
options(repr.plot.height=4)
# create ARI histograms for both programs
#svg('ari-histogram.svg', width=6, height=4)
ggplot(df, aes(x=ARI, fill=program)) +
  geom_histogram(data=filter(df, program=="UNM School of Medicine"), alpha=0.5, bins=50) +
  geom_histogram(data=filter(df, program=="UNM College of Pharmacy"), alpha=0.5, bins=50) +
  # dotted lines mark each program's mean ARI, colored to match its histogram
  geom_vline(xintercept = mean(cop$ARI), linetype=3, size=0.75, color='steelblue') +
  geom_vline(xintercept = mean(som$ARI), linetype=3, size=0.75, color='red') +
  scale_fill_manual("", breaks=c("UNM School of Medicine", "UNM College of Pharmacy"), values = c("red", "steelblue")) +
  theme_light() +
  theme(plot.title=element_text(size=11),
        axis.title.x = element_text(size=10),
        legend.key = element_blank()) +
  labs(title="Automated Readability Index: PubMed Abstracts",
       y="",
       x="Estimated Grade Level (ARI)")
#dev.off()
We see that the Automated Readability Index distributions for the College of Pharmacy and the School of Medicine are virtually identical, and both sets of abstracts require a fairly high reading level to be understood (COP: 17.74, SOM: 17.35).
In the example above we've seen how we can use an algorithm to estimate the reading level of a body of text. In the next notebook, we will implement a Naive Bayes classifier to predict author attribution for a particular text. Bayes classifiers are often used for text classification; a classic example is the classification of emails as either 'spam' or 'not spam'. As we will see, the Naive Bayes classifier can be used to classify a piece of text as either belonging to an author or not.