Analyzing the Used Car Market in 2021

Automate the shopping process with Python scripts & web scraping

Graham Sahagian
Geek Culture


Picture of an antique car
Photo by Chris Haws on Unsplash

It’s no secret that consumer goods prices have been climbing across the board in recent months. Everything from raw materials to real estate has seen massive price hikes in 2021. Used cars are no exception. Consumers have little choice but to buy at these exorbitant prices or delay their purchase and risk paying even more later. I myself am in the market for a new used car (a bit of an oxymoron, I know) and decided to inform the purchase process a bit by automating my search and using data to my advantage.

from BLS

As a disclaimer, this article is for educational purposes and not meant as an endorsement for web scraping or unauthorized data collection.

Getting Started

If you’re looking to buy a used car without much knowledge of them, this article is for you — especially if you like using data to inform your decisions.

In order to get a lot of accurate used car data quickly, we can use a simple web scraper in Python to pull listings from CarFax.com and dump them into an Excel file. The primary goal of the scraper is to quickly gather organized data that we can then analyze to find key insights into the local used car market. We can then use this information to shop for cars faster and more effectively, selecting specific cars for further research.

Thankfully, the data from CarFax is structured as JSON, which makes it relatively easy to access and means the data itself is clean. The repo for the scraper is on GitHub. To copy the repo, just run:

$ git clone https://github.com/grsahagian/carfax-scraper

Follow the instructions in the README to get the scraper running on your system. You may have to replace some of the headers, as the scraper is only set up for use with Google Chrome on macOS.
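To make the "JSON in, spreadsheet out" step concrete, here is a minimal sketch of flattening nested listing data into rows. It is not the repo's code: the payload, the field names (`listPrice`, `dealer`, and so on) and the helper names are all hypothetical, and it writes CSV (which Excel opens directly) rather than a native Excel file so that it only needs the standard library.

```python
import csv
import io
import json

# Hypothetical payload: these field names are illustrative assumptions,
# not the actual schema returned by CarFax.
SAMPLE_JSON = """
{"listings": [
  {"make": "Toyota", "model": "Highlander", "year": 2018,
   "mileage": 42000, "listPrice": 33500, "dealer": {"city": "Springfield"}},
  {"make": "Hyundai", "model": "Elantra", "year": 2019,
   "mileage": 28000, "listPrice": 17200, "dealer": {"city": "Riverton"}}
]}
"""

FIELDS = ["make", "model", "year", "mileage", "list_price", "city"]


def flatten_listing(raw):
    """Map one nested listing dict to a flat row with fixed columns."""
    return {
        "make": raw.get("make"),
        "model": raw.get("model"),
        "year": raw.get("year"),
        "mileage": raw.get("mileage"),
        "list_price": raw.get("listPrice"),
        "city": raw.get("dealer", {}).get("city"),
    }


def listings_to_csv(payload):
    """Return CSV text for every listing in a JSON payload string."""
    rows = [flatten_listing(item) for item in json.loads(payload)["listings"]]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = listings_to_csv(SAMPLE_JSON)
print(csv_text)
```

Using `.get()` with defaults means a listing that is missing a field produces an empty cell instead of crashing the whole run, which tends to matter when scraping thousands of listings.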

Visualizing the Data

Here’s the fun part. Now we visualize our data to see what trends stick out, see where we might need to drill down further, and finally draw some conclusions regarding the local car market.

First off, let's visualize the distribution of our dataset with a few histograms. The following graphs show the distribution of our 7,168 used cars by make & model, mileage, list price, and model year. I’m primarily concerned with how mileage and model year influence list price. Along with the car's make and model, these two factors are the primary value drivers of a used car.

Note: all data and subsequent trends were gathered from Carfax.com; prices and trends from other sources may differ significantly.

Not a whole lot of surprises here. There are a lot of Honda and Toyota sedans (part of the reason I picked these models) in the area.

In terms of mileage, the distribution centers around the 21,000 to 28,000 range with a long tail ending at approximately 346,000 miles. I’m surprised anyone would list a car with >300,000 miles on it; not sure how much longer a vehicle can run past that point.

This histogram is interesting in that there is a peak in the $8,000 to $10,000 bin, followed by a trough in the $10,000 to $12,000 bin. The distribution centers around the $18,000 bin, again with a positive skew.
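The binning behind a histogram like this is simple enough to sketch in a few lines. The example below assumes $2,000-wide bins, as the bin labels above suggest. The prices are made up for illustration and the function name is hypothetical; the article's actual charts were built in Tableau.

```python
from collections import Counter

BIN_WIDTH = 2000  # dollars per bin, assumed from the bin labels in the text


def price_histogram(prices, bin_width=BIN_WIDTH):
    """Count listings per price bin, keyed by each bin's lower edge."""
    counts = Counter((price // bin_width) * bin_width for price in prices)
    return dict(sorted(counts.items()))


# Made-up prices for illustration only, not taken from the scraped dataset.
sample_prices = [8500, 9200, 9900, 11000, 17500, 18200, 18900, 19500, 26000]
histogram = price_histogram(sample_prices)

# Print a rough text histogram, one row per non-empty bin.
for lower_edge, count in histogram.items():
    print(f"${lower_edge:,} to ${lower_edge + BIN_WIDTH:,}: {'#' * count}")
```

A peak followed by a trough, like the one described above, shows up here as a tall bin next to a short one.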

Figure 4

There’s a significant increase in the number of cars from 2018 and 2019 compared to previous years (each has more than double the listings of the next highest year). This time the distribution shows a negative skew, which we could have inferred without a visualization given the nature of depreciating asset sales. One possible explanation for the drastic change in available cars from 2017 to 2018 is increased demand in the used car market from the onset of the COVID-19 global pandemic. Demand for consumer goods spiked as people began quarantining, spending time outside, and purchasing goods instead of going on vacations. The distribution would likely look smoother without this external factor.

Below, the relationship between mileage and list price is graphed with two specific car makes & models picked out (Figure X & Y). Each point in the graph is a used car listing; clicking on a point shows the car’s make, model, mileage, year, and a link to the listing on CarFax.

We could have removed the rest of the dataset and viewed only the data for each specific car make & model. However, I think leaving the rest in provides a better picture of how each data subset (car make & model) relates to the market as a whole.

Analysis Continued:

I noticed there was significant deviation in price for a given mileage. The variance may be due to a number of factors external to the online listing, but I noted anecdotally that older cars, regardless of mileage, were cheaper than their newer counterparts.

In order to test my hypothesis, I took the same price-by-mileage graph and colored each listing by year with a gradient: older models are lighter and newer models are darker. In order to see the trend, I selected a single car make & model to display at a time. Below are the graphs for the Toyota Highlander and Hyundai Elantra.

Toyota Highlander

The relationship appears linear, so I fit the graph with a linear regression line to show more clearly how increased mileage influences a car's list price. Below is the equation for the line above. The R-squared and p-value tell us that there is a strong, statistically significant relationship between the independent variable (mileage) and the dependent variable (list price): mileage alone explains roughly 77% of the variation in list price. Some of the remaining variance, as discussed, can be attributed to the year the car was built.

Y = -0.196X + 41866
# R-Squared: 0.771
# P-Value < 0.0001

As soon as a Toyota Highlander is purchased off the lot and resold, it’s worth just shy of $42,000; for every 10,000 miles it's driven, the car’s value drops by just under $2,000 (according to Carfax listings circa September 2021).
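For readers who want to reproduce a fit like this on their own scraped data, here is a minimal sketch of ordinary least squares written out by hand. It is only an illustration: the article does not say which tool computed the line (it may well have been Tableau), the function name is hypothetical, and the mileage and price pairs are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and co-deviation of x with y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variance in y explained by the line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Synthetic (mileage, list price) pairs, not from the scraped dataset.
mileage = [10000, 30000, 50000, 70000, 90000, 110000]
price = [40500, 35200, 32800, 27900, 24600, 20100]

slope, intercept, r_squared = fit_line(mileage, price)
print(f"Y = {slope:.3f}X + {intercept:.0f}")
print(f"R-Squared: {r_squared:.3f}")
```

The slope is in dollars per mile, so multiplying it by 10,000 gives the implied loss in value per 10,000 miles driven.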

Hyundai Elantra

The linear regression equation for the Hyundai Elantra:

Y = -0.098X + 19974
# R-Squared: 0.703
# P-Value < 0.0001

When a Hyundai Elantra is purchased new and resold, its value starts at around $20,000 and decreases by $980 for every 10,000 miles driven.

In absolute dollar terms, the Elantra depreciates at half the rate of the Highlander ($0.098 versus $0.196 per mile), a pretty significant advantage there.
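As a quick worked example, the two regression equations can be plugged into a few lines of Python to compare the models at a given mileage. The coefficients are copied from the equations above; the function names and the 80,000-mile example are just illustrative choices.

```python
# Coefficients copied from the two regression equations in this article.
MODELS = {
    "Toyota Highlander": {"slope": -0.196, "intercept": 41866},
    "Hyundai Elantra": {"slope": -0.098, "intercept": 19974},
}


def predicted_price(model, mileage):
    """List price implied by the regression line at a given mileage."""
    coeffs = MODELS[model]
    return coeffs["slope"] * mileage + coeffs["intercept"]


def loss_per_10k(model):
    """Dollar loss per 10,000 miles, and that loss as a share of the intercept."""
    coeffs = MODELS[model]
    dollars = -coeffs["slope"] * 10_000
    return dollars, dollars / coeffs["intercept"]


for name in MODELS:
    dollars, share = loss_per_10k(name)
    print(
        f"{name}: ${predicted_price(name, 80_000):,.0f} at 80,000 miles, "
        f"loses ${dollars:,.0f} ({share:.1%} of intercept) per 10,000 miles"
    )
```

At 80,000 miles the lines imply roughly $26,186 for the Highlander and $12,134 for the Elantra. One caveat worth keeping in mind: measured as a share of each model's intercept, the two lose a similar fraction of value per 10,000 miles (about 4.7% and 4.9% respectively), so the Elantra's advantage is in dollars rather than in percentage terms.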

The regression slope coefficient, indicating the implied depreciation, is a useful signal of how the market values a car model’s overall reliability and longevity.

Another interesting piece of information I was able to draw from the data was the relationship between year and price. While the relationship between mileage and price appears to be linear, the relationship between year and price appears to be exponential (see graph below).

In other words, a car experiences the most depreciation immediately following its release, subsequently depreciating at a decreasing rate as it approaches scrap value.
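One common way to check an exponential relationship like this is to fit a straight line to the logarithm of price against the car's age. The sketch below does that on synthetic data. The starting price of $30,000 and the 15% decay rate are invented for the example, not estimated from the Carfax listings, and the function name is hypothetical.

```python
import math


def fit_exponential(ages, prices):
    """Fit price = initial * exp(-rate * age) by least squares on ln(price)."""
    logs = [math.log(p) for p in prices]
    n = len(ages)
    mean_age = sum(ages) / n
    mean_log = sum(logs) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (v - mean_log) for a, v in zip(ages, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_age
    # Convert the log-space line back to the exponential's parameters.
    return math.exp(intercept), -slope


# Synthetic data: a car that starts at $30,000 and decays 15% per year
# (continuous rate). Both numbers are made up for illustration.
ages = [0, 1, 2, 3, 4, 5, 6]
prices = [30000 * math.exp(-0.15 * age) for age in ages]

initial, rate = fit_exponential(ages, prices)

# Dollars lost during each successive year of the car's life.
yearly_loss = [
    initial * (math.exp(-rate * age) - math.exp(-rate * (age + 1)))
    for age in ages
]
for age, loss in zip(ages, yearly_loss):
    print(f"year {age} to {age + 1}: loses ${loss:,.0f}")
```

The printed losses shrink every year, which is the "most depreciation up front" pattern described above.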

Other Considerations

Qualitative factors such as the overall condition of the car & prior accidents can also significantly influence the car’s value. However, these factors will have to be taken into account at a later stage in the vetting process, as they aren’t immediately accessible from the car’s listing page. Some models also have bad years, and those batches will be more problem-prone than other years of the same model; this is another factor to be wary of.

Another element beyond the scope of this analysis is the various costs tacked on to the car after the fact: processing fees, tax and title, and insurance. These additional costs can easily bump a car's price up by $2,000 to $4,000.

Key Takeaways

Our data and subsequent analysis provide us with an excellent benchmark to work from when purchasing a used car of any make & model. Given any used car listing, one can assess its price against the corresponding linear regression function we found earlier (for said car make & model).

Our purchase process now looks like this:

(1) Browse and select a car make and model according to cost tolerance and feature preferences.

(2) Select car(s) within that make & model according to mileage preference (for me: 60,000 to 100,000 miles) and look for newer cars (by model year) close to or, preferably, underneath the regression line.

(3) Inquire with the car dealership to find out the overall condition and quality of said used car. Repeat steps 1 to 3 until you find a few gems and give them a test drive.
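Step 2 of the process above can be sketched as a small filter. This is only an illustration under stated assumptions: the listings and their fields are hypothetical, the `tolerance` parameter is my own addition to capture "close to" the line, and the coefficients are the Toyota Highlander values from earlier in the article.

```python
# Slope and intercept from the Toyota Highlander regression in this article.
HIGHLANDER = (-0.196, 41866)


def screen(listings, slope, intercept,
           min_miles=60_000, max_miles=100_000, tolerance=0):
    """Keep listings inside the mileage window whose price is at most
    `tolerance` dollars above the regression line, newest model year first."""
    picks = []
    for car in listings:
        if not min_miles <= car["mileage"] <= max_miles:
            continue
        expected = slope * car["mileage"] + intercept
        discount = expected - car["price"]  # positive means below the line
        if discount >= -tolerance:
            picks.append({**car, "discount": round(discount)})
    return sorted(picks, key=lambda c: (-c["year"], -c["discount"]))


# Hypothetical listings, for illustration only.
listings = [
    {"id": "A", "year": 2017, "mileage": 65_000, "price": 27_000},
    {"id": "B", "year": 2016, "mileage": 80_000, "price": 28_000},
    {"id": "C", "year": 2018, "mileage": 45_000, "price": 25_000},
    {"id": "D", "year": 2015, "mileage": 95_000, "price": 21_000},
]

picks = screen(listings, *HIGHLANDER)
for car in picks:
    print(f"{car['id']}: {car['year']}, {car['mileage']:,} miles, "
          f"${car['price']:,} (${car['discount']:,} below the line)")
```

Here listing B is dropped for sitting above the line and listing C for falling outside the mileage window. Whatever survives the filter still needs the condition check in step 3.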

All unannotated graphs created by the author

Tools Used

  • Python 3.8 for data collection & organization
  • Tableau for visualization
  • MS Excel for data storage and exploration

Additional References & Resources

*Special thanks to Mike on GitHub for help developing the web scraper
