Data Cleaning and Analysis

Data Cleaning and Analysis Assignment

Brandon Hill


I want to find look at data on the cost of Emergency Room visits in the United States. I want to look at this data over time and by location. 

Finding the data: 

“Costs of Emergency Department Visits in the United States, 2017”

Source: Collected by The Healthcare Cost and Utilization Project (HCUP) from the 2017 Nationwide Emergency Department Sample (NEDS)

Total Emergency Department (ED) charges are converted to costs using HCUP Cost-to-Charge Ratios based on hospital accounting reports from the Centers for mEdicare & Medicaid Services (CMS).


From 1996 – to 2013, Emergency Department visits have increased at a higher rate than inpatient care. More than 50 percent of inpatient stays in 2017 included evidence of ED services prior to admission. 

Why are both ED visits and inpatient care growing during this time period? Is that growth rate consistent through 2022?   

Trends in the cost of ED services by patient and hospital characteristics have not been examined. 

Does this mean there is no data on cost disparity on demographic lines like race or income? 

13% of all ED visits in the 2017 NEDS are missing information about ED charges, for which costs are estimated using the average cost. 

Data contains: Aggregate costs in billions of dollars and percentage of total aggregate costs, average cost per visit and number of visits by millions and percentage of total visits. This is broken down into the following categories, each of which is a different way of sorting the same data: age group, sex, primary expected payer, community-level income, location of patient’s residence, ED discharge disposition (whether the patient was released, transferred or admitted).  

Cleaning the data

Data was imported to a Google Sheet using the Table Capture plugin. Before cleaning, this is what the data looked like: 

To move the data into Tableau, I need to remove the title (row 1) and the caption (row 34), as well as condense rows 2 and 3 into one row that labels the data for the categories in that row. I did this by splitting B3 and C3 into “aggregate cost (billions)” and “aggregate cost (percentage).” I did the same for total number of visits in millions and by percentage. I considered further condensing the subcategory labels, such as ‘age group” and “primary expected payer” since there is no data for those rows, but determined this was unnecessary it would be pretty straightforward for me to sort the right categories and I could properly indicate the category in the visualization. 

Analyzing the data

Since the data includes total numbers and percentage of totals for both aggregate cost and number of visits, I think it makes the most sense to view the data in two ways. Firstly, by looking at total costs and total numbers and bar graphs, and secondly by looking at the percentages for each category by a pie chart. 

With my first bar chart for total costs, I wonder if there is a better way to visually represent the particular groupings that should be compared to each other. For example, it doesn’t do much to compare the cost of routine discharges to the total costs that females incur in ED visits. Rather than displaying all of the information on the visualization at once, I would prefer a drop-down menu that would allow you to select and display only one set of patient dispositions at a time. Next, I want to normalize the data by dividing the total costs by the number of visits. Conveniently, this number is already provided so I don’t have to make a calculated field. 

I chose to include the total in this chart because since each category is normalized, the total (or average of the averages) wouldn’t majorly throw off the scale. It actually provides a useful number to determine outliers and see which categories are more or less costly than others. However, instead of displaying this as a bar at the top, it would be more useful as a line going down the chart and through the bars. 

After completing these two visualizations, there are some observations that can be made about the data. While the cost of ED visits that result in the transferring of the patient, only make up a small amount of the costs of any category, when normalized we can see that ED visits that result in a transfer are actually bar far the most expensive to the patient on average. They just don’t occur as often (only 3 percent of total visits) and so don’t add up a large portion of the total costs (6.2). This makes me realize that the ratio of percent of the total aggregate cost to the percentage of total visits could be a useful statistic to determine which demographics are experiencing proportionately higher and lower costs per visit. Would this data serve any purpose that’s not already covered by showing the average cost in dollars? 

Now on to the pie chart. Because we’re no longer working with money, and instead of working with percentages of total numbers, I will not be using green. 

Right off the bat, I’m struggling with making Tableau differentiate between the subcategories of patient characteristics. This could be solved by creating a different sheet for each pie chart but there must be some way to group each sub category into its own pie chart. 

Overall observations after analysis for each patient catagory: 

Age Group: The 18-44 age group takes up a much higher portion of total ED visits, but are on average much cheaper than ED visits for older age groups. 

Sex: Women as a demographic experience much higher numbers of ED visits. I wonder if this has to do with ED visits associated with pregnancy and childbirth. However, the average cost of visits by sex is pretty much consistent. 

Primary Expected Payer: Unsurprisingly, those without insurance make up a much smaller proportion of the total number of ED visits. In order to determine if this is due to most people having coverage, this data would need to be normalized using data on uninsured people as a proportion of the population. Also, Medicare patients have the highest average cost, which at first made me consider how healthcare businesses might take advantage of charging the government, but when you also consider the higher average cost for older patients, and that older patients are more likely to be on Medicare, then the number seems like less of an outlier. 

Community-level income: The most noticeable trend here is that low-income people visit the ED more than twice as often as the highest income. There are many stories that could be written exploring this. Dietary affordability, the likelihood of dangerous jobs or dangerous neighborhoods, stress factors and more. 

Location of Patient’s Residence: Surprisingly, average costs are pretty consistent no matter where you live. While over 50 percent of visits occur in large metropolitan areas, this isn’t exactly shocking because they have much higher populations. It would be interesting to normalize this data by population to see whether location actually affects the rate of ED visits. 

ED discharge disposition: A vast majority of visits result in a routine discharge (80.9 percent). Inpatient discharges seem to be cheaper, which may indicate that healthcare professionals pretty quickly determine whether an ED patient needs to be admitted and charges are transferred to inpatient charges. Also outlying, the average cost of transfers and other dispositions (which includes home care, death and discharged against medical advice) likely indicates that more serious ED visits are more costly.  

Leave a Reply

Your email address will not be published.