In this notebook we are going to attempt to reproduce this blog post using data from the Centers for Medicare and Medicaid Services (CMS) and Python.
We will first start by downloading the Public Use File which is available from here[http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/PartD_Prescriber_PUF_NPI_DRUG_13.zip] the file is quite large (489MB) so make sure you have enough space on your machine. We will then use Pandas which is python library useful for manipulating data. Pandas is similar to R in the sense that has a similar data frame object that is used for storing tables and conducting a variety of data munging tasks such as transformations and computing aggregate statistics using the 'Split-Apply-Combine' method. Finally, we will use a variety of different plotting libraries to visualize our results.
# import libraries
import pandas as pd
Because the file is large (~23 million rows x 19 columns) we will only select a few key columns to reduce the amount of RAM required to hold the dataframe. The data comes aggregated by NPI and Drug, thus each NPI may appear multiple times, once for each drug that provide has written a prescription for.
columns = ['NPI',
'NPPES_PROVIDER_STATE',
'DRUG_NAME',
'GENERIC_NAME',
'SPECIALTY_DESC',
'TOTAL_CLAIM_COUNT',
'TOTAL_DRUG_COST',
'BENE_COUNT']
# load the data for select columns using the 'c' engine for speed
df = pd.read_csv("PARTD_PRESCRIBER_PUF_NPI_DRUG_13.tab", sep='\t', usecols=columns, engine='c')
Lets get some info on our dataframe (i.e. size, datatypes)
df.info()
We see there are 23,650,520 rows and 8 columns of data which take up about 1.6GB of RAM. We also see that the data types are a mix if integers, floats and 'objects'.
Alright, lets first start by computing the most popular drugs according to the total number of claims for that drug. For this example we will use the generic name of the drug
method = {'TOTAL_CLAIM_COUNT':'sum'}
top_claims = df.groupby('GENERIC_NAME',
as_index=False).agg(method).sort_values(by='TOTAL_CLAIM_COUNT',
ascending=False).head(50)
top_claims.head(10)
We see that the top drugs based on total number of claims include statins, $\beta$-blockers, calcium-channel blockers, proton pump inhibitors, diuretics, anit-hyperglycemics and pain relievers, all of which you would expect given the burden these diseases on the US population
Now lets compute the most costly drugs by total spend. For this we'll be grouping on both DRUG_NAME
which includes the brand name if brand name was prescribed.
method = {'TOTAL_DRUG_COST':'sum'}
top_brands = df.groupby(['DRUG_NAME',
'GENERIC_NAME'],
as_index=False).agg(method).sort_values(by='TOTAL_DRUG_COST',
ascending=False).head(50)
top_brands.head(10)
We see that the top drugs by total spend again include proton-pump inhibitors, statins, bronchodilators, antihyperglycemic drugs and even some antipsychotics and antidepressants. From this table alone, it is impossible to tell whether or not this is due to the price of the medications or because they are prescribed much more frequently than other medications.
A beneficiary is someone who's enrolled in Medicare Part D and who is recieving these medications as a result of their prescription benefits. Lets rank the drugs by total beneficiary count
method = {'BENE_COUNT':'sum'}
top_bene = df.groupby(['GENERIC_NAME'],
as_index=False).agg(method).sort_values(by='BENE_COUNT',
ascending=False).head(50)
top_bene.head(10)
Hydrocodone/APAP tops the list and exceeds lisinopril in total number of beneficiaries by an order of magnitude. This appears reasonable as Hydrocodone/APAP is not necessarily restrictied to handful of disease stats like lisinopril which is primarily used in patients with hypertension and/or kidney failure. As a result, far more people in the US are eligable to recieve a prescription for Hydrocodone/APAP than lisinopril as almost everyone experiences mild-to-moderate pain from time to time.
Since we are simply aggregating stats for categorical variables, an appropriate way to visualize these valuels would be to plot them using a bar-plot. In the next few sections we will plot the aggregate stats we computed earlier to provide a visual aid to better understand them. For this section we will be using a library called ggplot
which is also implemented in R. ggplot draws extensively from 'grammar of graphics' which establishes principles for building visually compelling figures.
ggplot allows figures to be built incrementally by adding on layers. The first function ggplot()
establishes mappings from the data to aesthetic
parameters. geom_bar()
tells ggplot we wan to use a barplot, theme_bw()
adds the black and white theme to our figure while theme()
exposes some of the formatting paramters. Finally, we add labels to our figure using the labs()
function.
from ggplot import *
# plot figures in Jupyter notebook
%matplotlib inline
p = ggplot(aes(x='factor(GENERIC_NAME)', y='TOTAL_CLAIM_COUNT'), data=top_claims.head(40)) + \
geom_bar(stat='identity', fill='steelblue') + \
theme_bw()+ theme(axis_text_x = element_text(angle=45, hjust=1)) + \
labs(title="CMS: Top Drug by Total Number of Claims", x=" ", y="Total claim count")
p
p = ggplot(aes(x='DRUG_NAME', y='TOTAL_DRUG_COST'), data=top_brands.head(40)) + \
geom_bar(stat='identity', fill='steelblue') + \
theme_bw() + theme(axis_text_x = element_text(angle=45, hjust=1)) + \
labs(title="CMS: Top Drugs by Total Spend", x=" ", y="Total spend")
p
p = ggplot(aes(x='GENERIC_NAME', y='BENE_COUNT'), data=top_bene.head(40)) + \
geom_bar(stat='identity', fill='steelblue') + \
theme_bw() + theme(axis_text_x = element_text(angle=45, hjust=1)) + \
labs(title="CMS: Top Drugs by Total Beneficiary Count", x=" ", y="Total beneficiary count")
p
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="GENERIC_NAME", y="TOTAL_CLAIM_COUNT", data=top_claims, color="steelblue")
plt.title("CMS: Top Drugs by Total Claim Count")
plt.xlabel("")
plt.ylabel("Total claim count")
ax = bp.set_xticklabels(top_claims.GENERIC_NAME, rotation=45, ha='right', size=10)
sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="DRUG_NAME", y="TOTAL_DRUG_COST", data=top_brands, color="steelblue")
plt.title("CMS: Top Drugs by Total Spend")
plt.xlabel("")
plt.ylabel("Total drug spend")
ax = bp.set_xticklabels(top_brands.DRUG_NAME, rotation=45, ha='right', size=10)
sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="GENERIC_NAME", y="BENE_COUNT", data=top_bene, color="steelblue")
plt.title("CMS: Top Drugs by Beneficiary Count")
plt.xlabel("")
plt.ylabel("Total beneficiary count")
ax = bp.set_xticklabels(top_bene.GENERIC_NAME, rotation=45, ha='right', size=10)
import numpy as np
plt.style.use('seaborn-notebook')
col = plt.rcParams['axes.color_cycle'][0]
fig, ax = plt.subplots(1, 1, figsize=(13, 8))
n = np.arange(0, top_claims.shape[0], 1)
plt.bar(n, top_claims.TOTAL_CLAIM_COUNT, color=col)
ax.set_xticks(n+0.75)
ax.set_xticklabels(top_claims.GENERIC_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Number of Claims')
plt.xlabel('')
plt.ylabel('Total number of claims')
plt.show()
fig, ax = plt.subplots(1, 1, figsize=(13, 8))
n = np.arange(0, top_brands.shape[0], 1)
plt.bar(n, top_brands.TOTAL_DRUG_COST, color=col)
ax.set_xticks(n+0.75)
ax.set_xticklabels(top_brands.DRUG_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Spend')
plt.xlabel('')
plt.ylabel('Total spend')
plt.show()
fig, ax = plt.subplots(1,1,figsize=(13,8))
n = np.arange(0, top_bene.shape[0], 1)
plt.bar(n, top_bene.BENE_COUNT, color=col)
ax.set_xticks(n + 0.75)
ax.set_xticklabels(top_bene.GENERIC_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Beneficiary Count')
plt.xlabel('')
plt.ylabel('Total beneficiary count')
plt.show()
In this next section we will use scatterplots to visualize the relationships between some of the aggregate stats grouped by drug.
# total claims vs total cost
aggfnc = {'TOTAL_CLAIM_COUNT':'sum', 'TOTAL_DRUG_COST':'sum', 'BENE_COUNT':'sum'}
claims_cost = df.groupby('GENERIC_NAME', as_index=False).agg(aggfnc)
#plt.style.use('seaborn-notebook')
#col = plt.rcParams['axes.color_cycle'][0]
plt.figure(figsize=(16,5))
plt.subplots_adjust(wspace=0.3)
plt.suptitle('Scatter Plots: Aggregate Statistics', fontsize=14, fontweight='bold')
plt.subplot(131)
plt.scatter(claims_cost.TOTAL_CLAIM_COUNT, claims_cost.TOTAL_DRUG_COST, color = col, alpha=0.5)
plt.title('Drug Cost vs Drug Claim', size=10)
plt.ylabel('Total drug cost')
plt.ylim()
plt.xlim()
plt.grid(True)
plt.xlabel('Claim count')
plt.subplot(132)
plt.scatter(claims_cost.BENE_COUNT, claims_cost.TOTAL_CLAIM_COUNT, color = col, alpha=0.5)
plt.title('Claim Count vs Beneficiary Count', size=10)
plt.ylabel('Total claim count')
plt.grid(True)
plt.xlabel('Beneficiary count')
plt.subplot(133)
plt.scatter(claims_cost.BENE_COUNT, claims_cost.TOTAL_DRUG_COST, color=col, alpha=0.5)
plt.title('Drug Cost vs Benificiary Count', size=10)
plt.xlabel('Beneficiary count')
plt.ylabel('Total drug cost')
plt.show()
Although we only used two different libraries you can see that several options exist for plotting and visualizing data in python. Additional libraries that we didn't touch on here include Bokeh and Plotly. Plotly is primarily used for creating interactive figures and the API is quite different from what we've seen here, however the plots themselves are extremely tunable. Each of these libraries have their strengths and weaknesses. Libraries such as ggplot
, and seaborn
appear to be wrappers arround matplotlib
as a result I believe they may have some adventages when generating more complex figures; for example creating facets and adding splines or regressions.