Motivation
A contractor friend mentioned that one of the hardest parts of his job is estimating whether a renovation will pay off: specifically, what a house will be worth after the work is done, given what it looked like before. Formal after-repair value (ARV) appraisals are slow and expensive; gut feel is unreliable. The goal of this project is to build a data-driven model that gives a fast estimate of after-renovation value from the house's features before and after renovation.
The Ames, Iowa housing dataset is a natural fit: it's publicly available, covers residential sales from 2006 to 2010, and includes roughly 80 features spanning everything from lot size and neighborhood to garage quality, basement finish, and kitchen condition. It's a well-studied dataset in the ML community, but using it to answer a real business question, for someone who will actually use the output, makes it more interesting than a textbook exercise.
Exploratory Data Analysis
Before modeling, I ran an in-depth EDA to understand the data's structure and quirks. Three findings will directly shape the modeling approach:
- Categorical sparsity: Many categorical features (e.g., pool quality, alley type, fence type) have the vast majority of observations in a single category. Straightforward one-hot encoding would create near-zero-variance columns; target encoding or grouping rare categories will be necessary.
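- Extreme skew in continuous features: Features like pool area, lot frontage, and low-season sale months have heavily right-skewed distributions. A few extreme values could dominate the fit of a linear model; tree-based models split on thresholds, so they depend only on the ordering of a feature's values and are far less sensitive to how large the outliers are.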
- Complex feature interactions: Garage quality and garage size interact in ways that affect price differently than either feature does independently. A large garage in poor condition may actually hurt value; a small garage in excellent condition may help it. Purely additive models will miss this unless the interaction term is added by hand, whereas tree-based models can learn it through successive splits on the two features.
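To make the first two findings concrete, here is a minimal sketch in plain Python. The helper names (`group_rare`, `skewness`), the 15% threshold, and the toy values are all assumptions made up for illustration; they are not actual Ames rows or the preprocessing I have settled on. It collapses rare categories into a single bucket and compares the skewness of a feature before and after a log transform, which is the kind of check that matters if a linear baseline is fit alongside the tree models.

```python
import math
from collections import Counter


def group_rare(values, min_share=0.05, other="Other"):
    """Replace categories whose share of rows is below min_share with `other`."""
    counts = Counter(values)
    n = len(values)
    keep = {cat for cat, c in counts.items() if c / n >= min_share}
    return [v if v in keep else other for v in values]


def skewness(xs):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Toy data invented for illustration (hypothetical, not real Ames records).
fence = ["NoFence"] * 17 + ["MnPrv"] * 2 + ["GdWo"]
lot_frontage = [50, 60, 60, 65, 70, 70, 75, 80, 85, 313]

# Categories covering less than 15% of rows are merged into "Other".
grouped = group_rare(fence, min_share=0.15)

# log1p compresses the long right tail while keeping the ordering of values.
logged = [math.log1p(x) for x in lot_frontage]

print(Counter(grouped))
print(f"raw skew: {skewness(lot_frontage):.2f}, log skew: {skewness(logged):.2f}")
```

On the toy lot-frontage values the log transform lowers the skewness but does not remove it, because a single extreme lot still stands apart. That is one reason the threshold-based splits of tree models are attractive here.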
What's Next
The modeling phase will build on the EDA findings:
- Model selection: Start with gradient-boosted trees (XGBoost / LightGBM) to handle skew, sparsity, and interactions without extensive preprocessing
- Feature engineering: Construct renovation delta features — the change in condition ratings, square footage additions, or material upgrades between pre- and post-renovation snapshots
- Target adjustment: Adjust the sale price target for inflation and for the housing downturn around the 2008 financial crisis, so that predictions learned from 2006 to 2010 sale prices can be translated into today's terms
- Validation: Test whether historical Ames relationships hold in today's market; validate model outputs against real before/after renovation data from my contractor friend
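The feature engineering and target adjustment steps above could look roughly like the sketch below. Everything named here is hypothetical: the `Snapshot` fields, the `renovation_deltas` and `to_base_year` helpers, and especially the `PRICE_INDEX` levels, which are placeholders rather than real index data. A real version would need an actual house price or inflation index that extends to the present.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical record of a house at one point in time."""
    overall_cond: int        # condition rating, higher is better
    kitchen_qual: int        # ordinal-encoded quality, e.g. 1 (poor) to 5 (excellent)
    living_area_sqft: float  # above-ground living area


def renovation_deltas(before: Snapshot, after: Snapshot) -> dict:
    """Post-renovation value minus pre-renovation value for each field."""
    return {
        "delta_overall_cond": after.overall_cond - before.overall_cond,
        "delta_kitchen_qual": after.kitchen_qual - before.kitchen_qual,
        "delta_living_area_sqft": after.living_area_sqft - before.living_area_sqft,
    }


def to_base_year(price: float, sale_year: int, base_year: int, index: dict) -> float:
    """Rescale a nominal price by the ratio of two price-index levels."""
    return price * index[base_year] / index[sale_year]


# Placeholder index levels, NOT real data; swap in a real price index.
PRICE_INDEX = {2006: 100.0, 2008: 95.0, 2010: 90.0}

before = Snapshot(overall_cond=5, kitchen_qual=2, living_area_sqft=1200.0)
after = Snapshot(overall_cond=7, kitchen_qual=4, living_area_sqft=1450.0)

deltas = renovation_deltas(before, after)
adjusted = to_base_year(180_000.0, sale_year=2006, base_year=2010, index=PRICE_INDEX)

print(deltas)
print(adjusted)
```

Putting every sale price on a common base year before training means the model learns relationships between features and value, not the market cycle; the same ratio can then be applied once more to move a prediction into current prices.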