Data Summarization in Python

In this notebook we are going to attempt to reproduce this blog post using data from the Centers for Medicare and Medicaid Services (CMS) and Python.

We will first start by downloading the Public Use File which is available from here[http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/PartD_Prescriber_PUF_NPI_DRUG_13.zip] the file is quite large (489MB) so make sure you have enough space on your machine. We will then use Pandas which is python library useful for manipulating data. Pandas is similar to R in the sense that has a similar data frame object that is used for storing tables and conducting a variety of data munging tasks such as transformations and computing aggregate statistics using the 'Split-Apply-Combine' method. Finally, we will use a variety of different plotting libraries to visualize our results.

Objectives:

  1. Compute top drugs by total number of claims
  2. Compute top most costly drugs
  3. Compute top drugs by total beneficiary count
In [1]:
# import libraries
import pandas as pd

Loading the data

Because the file is large (~23 million rows x 19 columns) we will only select a few key columns to reduce the amount of RAM required to hold the dataframe. The data comes aggregated by NPI and Drug, thus each NPI may appear multiple times, once for each drug that provide has written a prescription for.

In [2]:
columns = ['NPI', 
           'NPPES_PROVIDER_STATE', 
           'DRUG_NAME', 
           'GENERIC_NAME', 
           'SPECIALTY_DESC', 
           'TOTAL_CLAIM_COUNT',
           'TOTAL_DRUG_COST', 
           'BENE_COUNT']

# load the data for select columns using the 'c' engine for speed
df = pd.read_csv("PARTD_PRESCRIBER_PUF_NPI_DRUG_13.tab", sep='\t', usecols=columns, engine='c')

Lets get some info on our dataframe (i.e. size, datatypes)

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23650520 entries, 0 to 23650519
Data columns (total 8 columns):
NPI                     int64
NPPES_PROVIDER_STATE    object
SPECIALTY_DESC          object
DRUG_NAME               object
GENERIC_NAME            object
BENE_COUNT              float64
TOTAL_CLAIM_COUNT       int64
TOTAL_DRUG_COST         float64
dtypes: float64(2), int64(2), object(4)
memory usage: 1.6+ GB

We see there are 23,650,520 rows and 8 columns of data which take up about 1.6GB of RAM. We also see that the data types are a mix if integers, floats and 'objects'.

Top drugs by total number of claims

Alright, lets first start by computing the most popular drugs according to the total number of claims for that drug. For this example we will use the generic name of the drug

In [4]:
method = {'TOTAL_CLAIM_COUNT':'sum'}
top_claims = df.groupby('GENERIC_NAME', 
                        as_index=False).agg(method).sort_values(by='TOTAL_CLAIM_COUNT', 
                                                                ascending=False).head(50)
In [5]:
top_claims.head(10)
Out[5]:
GENERIC_NAME TOTAL_CLAIM_COUNT
819 LEVOTHYROXINE SODIUM 40040908
837 LISINOPRIL 36213182
1347 SIMVASTATIN 36185092
78 AMLODIPINE BESYLATE 33951483
678 HYDROCODONE/ACETAMINOPHEN 33521291
1068 OMEPRAZOLE 31500544
112 ATORVASTATIN CALCIUM 26205342
620 FUROSEMIDE 25707232
887 METFORMIN HCL 25065694
924 METOPROLOL TARTRATE 20368244

We see that the top drugs based on total number of claims include statins, $\beta$-blockers, calcium-channel blockers, proton pump inhibitors, diuretics, anit-hyperglycemics and pain relievers, all of which you would expect given the burden these diseases on the US population

Top drugs by total spend

Now lets compute the most costly drugs by total spend. For this we'll be grouping on both DRUG_NAME which includes the brand name if brand name was prescribed.

In [6]:
method = {'TOTAL_DRUG_COST':'sum'}
top_brands = df.groupby(['DRUG_NAME',
                         'GENERIC_NAME'],
                        as_index=False).agg(method).sort_values(by='TOTAL_DRUG_COST', 
                                                                ascending=False).head(50)
In [7]:
top_brands.head(10)
Out[7]:
DRUG_NAME GENERIC_NAME TOTAL_DRUG_COST
1752 NEXIUM ESOMEPRAZOLE MAGNESIUM 2.320291e+09
615 CRESTOR ROSUVASTATIN CALCIUM 2.089238e+09
50 ADVAIR DISKUS FLUTICASONE/SALMETEROL 2.069282e+09
4 ABILIFY ARIPIPRAZOLE 1.878588e+09
2335 SPIRIVA TIOTROPIUM BROMIDE 1.778776e+09
636 CYMBALTA DULOXETINE HCL 1.777237e+09
1707 NAMENDA MEMANTINE HCL 1.452010e+09
1314 JANUVIA SITAGLIPTIN PHOSPHATE 1.308959e+09
1385 LANTUS SOLOSTAR INSULIN GLARGINE,HUM.REC.ANLOG 1.210243e+09
2200 REVLIMID LENALIDOMIDE 1.182224e+09

We see that the top drugs by total spend again include proton-pump inhibitors, statins, bronchodilators, antihyperglycemic drugs and even some antipsychotics and antidepressants. From this table alone, it is impossible to tell whether or not this is due to the price of the medications or because they are prescribed much more frequently than other medications.

Top drugs by total beneficiary count

A beneficiary is someone who's enrolled in Medicare Part D and who is recieving these medications as a result of their prescription benefits. Lets rank the drugs by total beneficiary count

In [8]:
method = {'BENE_COUNT':'sum'}
top_bene = df.groupby(['GENERIC_NAME'],
                     as_index=False).agg(method).sort_values(by='BENE_COUNT',
                                                            ascending=False).head(50)
In [9]:
top_bene.head(10)
Out[9]:
GENERIC_NAME BENE_COUNT
678 HYDROCODONE/ACETAMINOPHEN 10844363
837 LISINOPRIL 8096692
1347 SIMVASTATIN 7810499
819 LEVOTHYROXINE SODIUM 7408339
78 AMLODIPINE BESYLATE 7048423
1068 OMEPRAZOLE 6902028
112 ATORVASTATIN CALCIUM 5689020
620 FUROSEMIDE 5609265
887 METFORMIN HCL 4953388
125 AZITHROMYCIN 4835918

Hydrocodone/APAP tops the list and exceeds lisinopril in total number of beneficiaries by an order of magnitude. This appears reasonable as Hydrocodone/APAP is not necessarily restrictied to handful of disease stats like lisinopril which is primarily used in patients with hypertension and/or kidney failure. As a result, far more people in the US are eligable to recieve a prescription for Hydrocodone/APAP than lisinopril as almost everyone experiences mild-to-moderate pain from time to time.

Visualizing the results:

ggplot

Since we are simply aggregating stats for categorical variables, an appropriate way to visualize these valuels would be to plot them using a bar-plot. In the next few sections we will plot the aggregate stats we computed earlier to provide a visual aid to better understand them. For this section we will be using a library called ggplot which is also implemented in R. ggplot draws extensively from 'grammar of graphics' which establishes principles for building visually compelling figures.

Top drugs by total number of claims

ggplot allows figures to be built incrementally by adding on layers. The first function ggplot() establishes mappings from the data to aesthetic parameters. geom_bar() tells ggplot we wan to use a barplot, theme_bw() adds the black and white theme to our figure while theme() exposes some of the formatting paramters. Finally, we add labels to our figure using the labs() function.

In [10]:
from ggplot import *
# plot figures in Jupyter notebook
%matplotlib inline

p = ggplot(aes(x='factor(GENERIC_NAME)', y='TOTAL_CLAIM_COUNT'), data=top_claims.head(40)) + \
    geom_bar(stat='identity', fill='steelblue') + \
    theme_bw()+ theme(axis_text_x = element_text(angle=45, hjust=1)) + \
    labs(title="CMS: Top Drug by Total Number of Claims", x=" ", y="Total claim count")
p
/usr/local/lib/python2.7/dist-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
Out[10]:
<ggplot: (8728230478909)>

Top drugs by total spend

In [11]:
p = ggplot(aes(x='DRUG_NAME', y='TOTAL_DRUG_COST'), data=top_brands.head(40)) + \
    geom_bar(stat='identity', fill='steelblue') + \
    theme_bw() + theme(axis_text_x = element_text(angle=45, hjust=1)) + \
    labs(title="CMS: Top Drugs by Total Spend", x=" ", y="Total spend")
p
Out[11]:
<ggplot: (8728246488425)>

Top drugs by total beneficiary count

In [12]:
p = ggplot(aes(x='GENERIC_NAME', y='BENE_COUNT'), data=top_bene.head(40)) + \
    geom_bar(stat='identity', fill='steelblue') + \
    theme_bw() + theme(axis_text_x = element_text(angle=45, hjust=1)) + \
    labs(title="CMS: Top Drugs by Total Beneficiary Count", x=" ", y="Total beneficiary count")
p
Out[12]:
<ggplot: (8728252189733)>

Seaborn

Top drugs by total number of claims

In [13]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="GENERIC_NAME", y="TOTAL_CLAIM_COUNT", data=top_claims, color="steelblue")
plt.title("CMS: Top Drugs by Total Claim Count")
plt.xlabel("")
plt.ylabel("Total claim count")
ax = bp.set_xticklabels(top_claims.GENERIC_NAME, rotation=45, ha='right', size=10)
/usr/local/lib/python2.7/dist-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

Top drugs by total spend

In [14]:
sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="DRUG_NAME", y="TOTAL_DRUG_COST", data=top_brands, color="steelblue")
plt.title("CMS: Top Drugs by Total Spend")
plt.xlabel("")
plt.ylabel("Total drug spend")
ax = bp.set_xticklabels(top_brands.DRUG_NAME, rotation=45, ha='right', size=10)

Top drugs by total beneficiary count

In [15]:
sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="GENERIC_NAME", y="BENE_COUNT", data=top_bene, color="steelblue")
plt.title("CMS: Top Drugs by Beneficiary Count")
plt.xlabel("")
plt.ylabel("Total beneficiary count")
ax = bp.set_xticklabels(top_bene.GENERIC_NAME, rotation=45, ha='right', size=10)

Matplotlib

Top drugs by total number of claims

In [16]:
import numpy as np
plt.style.use('seaborn-notebook')
col = plt.rcParams['axes.color_cycle'][0]
fig, ax = plt.subplots(1, 1, figsize=(13, 8))
n = np.arange(0, top_claims.shape[0], 1)
plt.bar(n, top_claims.TOTAL_CLAIM_COUNT, color=col)
ax.set_xticks(n+0.75)
ax.set_xticklabels(top_claims.GENERIC_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Number of Claims')
plt.xlabel('')
plt.ylabel('Total number of claims')
plt.show()
/usr/local/lib/python2.7/dist-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Top drugs by total spend

In [17]:
fig, ax = plt.subplots(1, 1, figsize=(13, 8))
n = np.arange(0, top_brands.shape[0], 1)
plt.bar(n, top_brands.TOTAL_DRUG_COST, color=col)
ax.set_xticks(n+0.75)
ax.set_xticklabels(top_brands.DRUG_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Spend')
plt.xlabel('')
plt.ylabel('Total spend')
plt.show()
In [18]:
fig, ax = plt.subplots(1,1,figsize=(13,8))
n = np.arange(0, top_bene.shape[0], 1)
plt.bar(n, top_bene.BENE_COUNT, color=col)
ax.set_xticks(n + 0.75)
ax.set_xticklabels(top_bene.GENERIC_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Beneficiary Count')
plt.xlabel('')
plt.ylabel('Total beneficiary count')
plt.show()

Scatter Plots

In this next section we will use scatterplots to visualize the relationships between some of the aggregate stats grouped by drug.

In [19]:
# total claims vs total cost
aggfnc = {'TOTAL_CLAIM_COUNT':'sum', 'TOTAL_DRUG_COST':'sum', 'BENE_COUNT':'sum'}
claims_cost = df.groupby('GENERIC_NAME', as_index=False).agg(aggfnc)
In [21]:
#plt.style.use('seaborn-notebook')
#col = plt.rcParams['axes.color_cycle'][0]
plt.figure(figsize=(16,5))
plt.subplots_adjust(wspace=0.3)
plt.suptitle('Scatter Plots: Aggregate Statistics', fontsize=14, fontweight='bold')

plt.subplot(131)
plt.scatter(claims_cost.TOTAL_CLAIM_COUNT, claims_cost.TOTAL_DRUG_COST, color = col, alpha=0.5)
plt.title('Drug Cost vs Drug Claim', size=10)
plt.ylabel('Total drug cost')
plt.ylim()
plt.xlim()
plt.grid(True)
plt.xlabel('Claim count')

plt.subplot(132)
plt.scatter(claims_cost.BENE_COUNT, claims_cost.TOTAL_CLAIM_COUNT, color = col, alpha=0.5)
plt.title('Claim Count vs Beneficiary Count', size=10)
plt.ylabel('Total claim count')
plt.grid(True)
plt.xlabel('Beneficiary count')

plt.subplot(133)
plt.scatter(claims_cost.BENE_COUNT, claims_cost.TOTAL_DRUG_COST, color=col, alpha=0.5)
plt.title('Drug Cost vs Benificiary Count', size=10)
plt.xlabel('Beneficiary count')
plt.ylabel('Total drug cost')
plt.show()

Conclusion

Although we only used two different libraries you can see that several options exist for plotting and visualizing data in python. Additional libraries that we didn't touch on here include Bokeh and Plotly. Plotly is primarily used for creating interactive figures and the API is quite different from what we've seen here, however the plots themselves are extremely tunable. Each of these libraries have their strengths and weaknesses. Libraries such as ggplot, and seaborn appear to be wrappers arround matplotlib as a result I believe they may have some adventages when generating more complex figures; for example creating facets and adding splines or regressions.