Data Summarization in Python¶

In this notebook we are going to attempt to reproduce this blog post using data from the Centers for Medicare and Medicaid Services (CMS) and Python.

We will first start by downloading the Public Use File which is available from here[http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/PartD_Prescriber_PUF_NPI_DRUG_13.zip] the file is quite large (489MB) so make sure you have enough space on your machine. We will then use Pandas which is python library useful for manipulating data. Pandas is similar to R in the sense that has a similar data frame object that is used for storing tables and conducting a variety of data munging tasks such as transformations and computing aggregate statistics using the 'Split-Apply-Combine' method. Finally, we will use a variety of different plotting libraries to visualize our results.

Objectives:¶

Compute top drugs by total number of claims
Compute top most costly drugs
Compute top drugs by total beneficiary count

# import libraries
import pandas as pd

Loading the data¶

Because the file is large (~23 million rows x 19 columns) we will only select a few key columns to reduce the amount of RAM required to hold the dataframe. The data comes aggregated by NPI and Drug, thus each NPI may appear multiple times, once for each drug that provide has written a prescription for.

columns = ['NPI', 
           'NPPES_PROVIDER_STATE', 
           'DRUG_NAME', 
           'GENERIC_NAME', 
           'SPECIALTY_DESC', 
           'TOTAL_CLAIM_COUNT',
           'TOTAL_DRUG_COST', 
           'BENE_COUNT']

# load the data for select columns using the 'c' engine for speed
df = pd.read_csv("PARTD_PRESCRIBER_PUF_NPI_DRUG_13.tab", sep='\t', usecols=columns, engine='c')

Lets get some info on our dataframe (i.e. size, datatypes)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23650520 entries, 0 to 23650519
Data columns (total 8 columns):
NPI                     int64
NPPES_PROVIDER_STATE    object
SPECIALTY_DESC          object
DRUG_NAME               object
GENERIC_NAME            object
BENE_COUNT              float64
TOTAL_CLAIM_COUNT       int64
TOTAL_DRUG_COST         float64
dtypes: float64(2), int64(2), object(4)
memory usage: 1.6+ GB

We see there are 23,650,520 rows and 8 columns of data which take up about 1.6GB of RAM. We also see that the data types are a mix if integers, floats and 'objects'.

Top drugs by total number of claims¶

Alright, lets first start by computing the most popular drugs according to the total number of claims for that drug. For this example we will use the generic name of the drug

method = {'TOTAL_CLAIM_COUNT':'sum'}
top_claims = df.groupby('GENERIC_NAME', 
                        as_index=False).agg(method).sort_values(by='TOTAL_CLAIM_COUNT', 
                                                                ascending=False).head(50)

top_claims.head(10)

We see that the top drugs based on total number of claims include statins, $\beta$-blockers, calcium-channel blockers, proton pump inhibitors, diuretics, anit-hyperglycemics and pain relievers, all of which you would expect given the burden these diseases on the US population

Top drugs by total spend¶

Now lets compute the most costly drugs by total spend. For this we'll be grouping on both DRUG_NAME which includes the brand name if brand name was prescribed.

method = {'TOTAL_DRUG_COST':'sum'}
top_brands = df.groupby(['DRUG_NAME',
                         'GENERIC_NAME'],
                        as_index=False).agg(method).sort_values(by='TOTAL_DRUG_COST', 
                                                                ascending=False).head(50)

top_brands.head(10)

We see that the top drugs by total spend again include proton-pump inhibitors, statins, bronchodilators, antihyperglycemic drugs and even some antipsychotics and antidepressants. From this table alone, it is impossible to tell whether or not this is due to the price of the medications or because they are prescribed much more frequently than other medications.

Top drugs by total beneficiary count¶

A beneficiary is someone who's enrolled in Medicare Part D and who is recieving these medications as a result of their prescription benefits. Lets rank the drugs by total beneficiary count

method = {'BENE_COUNT':'sum'}
top_bene = df.groupby(['GENERIC_NAME'],
                     as_index=False).agg(method).sort_values(by='BENE_COUNT',
                                                            ascending=False).head(50)

top_bene.head(10)

Hydrocodone/APAP tops the list and exceeds lisinopril in total number of beneficiaries by an order of magnitude. This appears reasonable as Hydrocodone/APAP is not necessarily restrictied to handful of disease stats like lisinopril which is primarily used in patients with hypertension and/or kidney failure. As a result, far more people in the US are eligable to recieve a prescription for Hydrocodone/APAP than lisinopril as almost everyone experiences mild-to-moderate pain from time to time.

Visualizing the results:¶

ggplot¶

Since we are simply aggregating stats for categorical variables, an appropriate way to visualize these valuels would be to plot them using a bar-plot. In the next few sections we will plot the aggregate stats we computed earlier to provide a visual aid to better understand them. For this section we will be using a library called ggplot which is also implemented in R. ggplot draws extensively from 'grammar of graphics' which establishes principles for building visually compelling figures.

Top drugs by total number of claims¶

ggplot allows figures to be built incrementally by adding on layers. The first function ggplot() establishes mappings from the data to aesthetic parameters. geom_bar() tells ggplot we wan to use a barplot, theme_bw() adds the black and white theme to our figure while theme() exposes some of the formatting paramters. Finally, we add labels to our figure using the labs() function.

from ggplot import *
# plot figures in Jupyter notebook
%matplotlib inline

p = ggplot(aes(x='factor(GENERIC_NAME)', y='TOTAL_CLAIM_COUNT'), data=top_claims.head(40)) + \
    geom_bar(stat='identity', fill='steelblue') + \
    theme_bw()+ theme(axis_text_x = element_text(angle=45, hjust=1)) + \
    labs(title="CMS: Top Drug by Total Number of Claims", x=" ", y="Total claim count")
p

/usr/local/lib/python2.7/dist-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

<ggplot: (8728230478909)>

Top drugs by total spend¶

p = ggplot(aes(x='DRUG_NAME', y='TOTAL_DRUG_COST'), data=top_brands.head(40)) + \
    geom_bar(stat='identity', fill='steelblue') + \
    theme_bw() + theme(axis_text_x = element_text(angle=45, hjust=1)) + \
    labs(title="CMS: Top Drugs by Total Spend", x=" ", y="Total spend")
p

<ggplot: (8728246488425)>

Top drugs by total beneficiary count¶

p = ggplot(aes(x='GENERIC_NAME', y='BENE_COUNT'), data=top_bene.head(40)) + \
    geom_bar(stat='identity', fill='steelblue') + \
    theme_bw() + theme(axis_text_x = element_text(angle=45, hjust=1)) + \
    labs(title="CMS: Top Drugs by Total Beneficiary Count", x=" ", y="Total beneficiary count")
p

<ggplot: (8728252189733)>

Seaborn¶

Top drugs by total number of claims¶

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="GENERIC_NAME", y="TOTAL_CLAIM_COUNT", data=top_claims, color="steelblue")
plt.title("CMS: Top Drugs by Total Claim Count")
plt.xlabel("")
plt.ylabel("Total claim count")
ax = bp.set_xticklabels(top_claims.GENERIC_NAME, rotation=45, ha='right', size=10)

/usr/local/lib/python2.7/dist-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

Top drugs by total spend¶

sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="DRUG_NAME", y="TOTAL_DRUG_COST", data=top_brands, color="steelblue")
plt.title("CMS: Top Drugs by Total Spend")
plt.xlabel("")
plt.ylabel("Total drug spend")
ax = bp.set_xticklabels(top_brands.DRUG_NAME, rotation=45, ha='right', size=10)

Top drugs by total beneficiary count¶

sns.set_style('whitegrid')
plt.figure(figsize=(13,8))
bp = sns.barplot(x="GENERIC_NAME", y="BENE_COUNT", data=top_bene, color="steelblue")
plt.title("CMS: Top Drugs by Beneficiary Count")
plt.xlabel("")
plt.ylabel("Total beneficiary count")
ax = bp.set_xticklabels(top_bene.GENERIC_NAME, rotation=45, ha='right', size=10)

Matplotlib¶

Top drugs by total number of claims¶

import numpy as np
plt.style.use('seaborn-notebook')
col = plt.rcParams['axes.color_cycle'][0]
fig, ax = plt.subplots(1, 1, figsize=(13, 8))
n = np.arange(0, top_claims.shape[0], 1)
plt.bar(n, top_claims.TOTAL_CLAIM_COUNT, color=col)
ax.set_xticks(n+0.75)
ax.set_xticklabels(top_claims.GENERIC_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Number of Claims')
plt.xlabel('')
plt.ylabel('Total number of claims')
plt.show()

/usr/local/lib/python2.7/dist-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Top drugs by total spend¶

fig, ax = plt.subplots(1, 1, figsize=(13, 8))
n = np.arange(0, top_brands.shape[0], 1)
plt.bar(n, top_brands.TOTAL_DRUG_COST, color=col)
ax.set_xticks(n+0.75)
ax.set_xticklabels(top_brands.DRUG_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Spend')
plt.xlabel('')
plt.ylabel('Total spend')
plt.show()

fig, ax = plt.subplots(1,1,figsize=(13,8))
n = np.arange(0, top_bene.shape[0], 1)
plt.bar(n, top_bene.BENE_COUNT, color=col)
ax.set_xticks(n + 0.75)
ax.set_xticklabels(top_bene.GENERIC_NAME, rotation=45, ha='right')
plt.title('CMS: Top Drugs by Total Beneficiary Count')
plt.xlabel('')
plt.ylabel('Total beneficiary count')
plt.show()

Scatter Plots¶

In this next section we will use scatterplots to visualize the relationships between some of the aggregate stats grouped by drug.

# total claims vs total cost
aggfnc = {'TOTAL_CLAIM_COUNT':'sum', 'TOTAL_DRUG_COST':'sum', 'BENE_COUNT':'sum'}
claims_cost = df.groupby('GENERIC_NAME', as_index=False).agg(aggfnc)

#plt.style.use('seaborn-notebook')
#col = plt.rcParams['axes.color_cycle'][0]
plt.figure(figsize=(16,5))
plt.subplots_adjust(wspace=0.3)
plt.suptitle('Scatter Plots: Aggregate Statistics', fontsize=14, fontweight='bold')

plt.subplot(131)
plt.scatter(claims_cost.TOTAL_CLAIM_COUNT, claims_cost.TOTAL_DRUG_COST, color = col, alpha=0.5)
plt.title('Drug Cost vs Drug Claim', size=10)
plt.ylabel('Total drug cost')
plt.ylim()
plt.xlim()
plt.grid(True)
plt.xlabel('Claim count')

plt.subplot(132)
plt.scatter(claims_cost.BENE_COUNT, claims_cost.TOTAL_CLAIM_COUNT, color = col, alpha=0.5)
plt.title('Claim Count vs Beneficiary Count', size=10)
plt.ylabel('Total claim count')
plt.grid(True)
plt.xlabel('Beneficiary count')

plt.subplot(133)
plt.scatter(claims_cost.BENE_COUNT, claims_cost.TOTAL_DRUG_COST, color=col, alpha=0.5)
plt.title('Drug Cost vs Benificiary Count', size=10)
plt.xlabel('Beneficiary count')
plt.ylabel('Total drug cost')
plt.show()

Conclusion¶

Although we only used two different libraries you can see that several options exist for plotting and visualizing data in python. Additional libraries that we didn't touch on here include Bokeh and Plotly. Plotly is primarily used for creating interactive figures and the API is quite different from what we've seen here, however the plots themselves are extremely tunable. Each of these libraries have their strengths and weaknesses. Libraries such as ggplot, and seaborn appear to be wrappers arround matplotlib as a result I believe they may have some adventages when generating more complex figures; for example creating facets and adding splines or regressions.

	GENERIC_NAME	TOTAL_CLAIM_COUNT
819	LEVOTHYROXINE SODIUM	40040908
837	LISINOPRIL	36213182
1347	SIMVASTATIN	36185092
78	AMLODIPINE BESYLATE	33951483
678	HYDROCODONE/ACETAMINOPHEN	33521291
1068	OMEPRAZOLE	31500544
112	ATORVASTATIN CALCIUM	26205342
620	FUROSEMIDE	25707232
887	METFORMIN HCL	25065694
924	METOPROLOL TARTRATE	20368244

	DRUG_NAME	GENERIC_NAME	TOTAL_DRUG_COST
1752	NEXIUM	ESOMEPRAZOLE MAGNESIUM	2.320291e+09
615	CRESTOR	ROSUVASTATIN CALCIUM	2.089238e+09
50	ADVAIR DISKUS	FLUTICASONE/SALMETEROL	2.069282e+09
4	ABILIFY	ARIPIPRAZOLE	1.878588e+09
2335	SPIRIVA	TIOTROPIUM BROMIDE	1.778776e+09
636	CYMBALTA	DULOXETINE HCL	1.777237e+09
1707	NAMENDA	MEMANTINE HCL	1.452010e+09
1314	JANUVIA	SITAGLIPTIN PHOSPHATE	1.308959e+09
1385	LANTUS SOLOSTAR	INSULIN GLARGINE,HUM.REC.ANLOG	1.210243e+09
2200	REVLIMID	LENALIDOMIDE	1.182224e+09

	GENERIC_NAME	BENE_COUNT
678	HYDROCODONE/ACETAMINOPHEN	10844363
837	LISINOPRIL	8096692
1347	SIMVASTATIN	7810499
819	LEVOTHYROXINE SODIUM	7408339
78	AMLODIPINE BESYLATE	7048423
1068	OMEPRAZOLE	6902028
112	ATORVASTATIN CALCIUM	5689020
620	FUROSEMIDE	5609265
887	METFORMIN HCL	4953388
125	AZITHROMYCIN	4835918