Improving Property Tax Assessments

skills

R

Machine Learning

HTML, CSS, & JavaScript

An automated valuation model that updates Philadelphia's outdated assessment approach to support more accurate property taxation and real estate analysis.

Oct 2025

about

This project develops a hedonic price model to estimate the fair market value of residential properties in Philadelphia, Pennsylvania, utilizing sales data from 2023–2024. While real estate valuation often relies heavily on simple structural attributes (e.g. square footage), this analysis quantifies the intangible value of location. The primary objective is to build a robust Ordinary Least Squares (OLS) regression workflow that disentangles the marginal price effects of physical characteristics from spatial amenities (transit access, park proximity) and socioeconomic neighborhood contexts, providing a tool for mass appraisal and market analysis.

The analysis is conducted in R, utilizing the sf package for geospatial vector processing and caret for model validation.

Data Engineering: The dataset fuses 26,344 property sales records from OpenDataPhilly with socioeconomic data from the US Census (ACS 2023) and amenity data from OpenStreetMap and SEPTA.

Spatial Featurization: Beyond simple distances, the workflow employs k-Nearest Neighbors (kNN) to calculate amenity density (restaurants, banks, pharmacies) and generates buffer zones to count transit stops within a 0.5-mile radius.
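The two spatial features described above can be sketched in a few lines. The project itself worked in R with sf; the Python below is only an illustrative stand-in, and the coordinates, the function names, and the choice of k are all hypothetical rather than taken from the project.

```python
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def stops_within(home, stops, radius_miles=0.5):
    """Buffer-style feature: count the stops inside the radius around a home."""
    return sum(1 for s in stops if haversine_miles(*home, *s) <= radius_miles)

def mean_knn_distance(home, amenities, k=3):
    """kNN-style density feature: mean distance (miles) to the k nearest
    amenities. Smaller values mean amenities are more densely clustered."""
    if not amenities:
        raise ValueError("need at least one amenity")
    nearest = sorted(haversine_miles(*home, *a) for a in amenities)[:k]
    return sum(nearest) / len(nearest)

# Hypothetical points near Center City (not real sales or stop locations).
home = (39.9526, -75.1652)
stops = [
    (39.9526, -75.1602),  # about a quarter mile east
    (39.9626, -75.1652),  # about 0.7 miles north, outside the buffer
    (39.9496, -75.1652),  # about 0.2 miles south
]

n_stops = stops_within(home, stops)           # transit stops within 0.5 miles
avg_dist = mean_knn_distance(home, stops, k=2)  # mean distance to 2 nearest
```

In a real pipeline the points would be projected to a local coordinate system and the distances computed with a spatial index; the brute-force loop here is just the simplest correct form of the idea.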

Statistical Transformations: To address the highly right-skewed distribution of sale prices (Median: $250,000; Max: $15.4M), the target variable and key continuous predictors (square footage, distances) were log-transformed to normalize residuals and linearize relationships.
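To make the effect of the log transform concrete, the sketch below compares the skewness of a small set of prices before and after taking logs. The seven prices are made up for illustration (only the median and maximum echo the figures quoted above), and the project's own R code is not shown in this write-up, so this Python is an assumption-laden stand-in.

```python
import math

def skewness(values):
    """Moment-based sample skewness: third central moment over variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical sale prices with one luxury outlier.
sale_prices = [120_000, 180_000, 250_000, 250_000, 320_000, 450_000, 15_400_000]
log_prices = [math.log(p) for p in sale_prices]

raw_skew = skewness(sale_prices)
log_skew = skewness(log_prices)  # smaller: the outlier is pulled toward the rest

def to_dollars(log_prediction):
    """Convert a prediction made on the log scale back to dollars."""
    return math.exp(log_prediction)
```

Because the model is fit on log price, its predictions must be exponentiated before they can be compared with sale prices in dollars. This naive back-transform tends to understate the mean price when residuals are roughly normal on the log scale, which is worth keeping in mind when reading dollar-denominated error metrics.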

Modeling Strategy: The analysis progressed through four iterations, culminating in a model that includes structural features, census demographics, spatial amenities, interaction terms, and neighborhood fixed effects to capture localized submarkets.

The final model achieved a cross-validated R² of 0.825, explaining approximately 83% of the variance in home prices.
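As a rough illustration of what a cross-validated R² measures, the sketch below runs k-fold cross-validation on a one-predictor OLS fit. It is a simplified Python stand-in, not the project's caret workflow: the data are synthetic, the fold assignment is a plain round-robin, and R² is pooled over all held-out predictions, which may differ from how caret aggregates its resamples.

```python
def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(actual, predicted):
    """1 - SS_residual / SS_total."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def k_fold_r2(xs, ys, k=5):
    """Fit on k-1 folds, predict the held-out fold, pool the predictions."""
    n = len(xs)
    preds = [0.0] * n
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        intercept, slope = fit_simple_ols([xs[i] for i in train],
                                          [ys[i] for i in train])
        for i in test:
            preds[i] = intercept + slope * xs[i]
    return r_squared(ys, preds)

# Synthetic data: a linear trend plus a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [3.0 + 2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

cv_r2 = k_fold_r2(xs, ys, k=5)
```

The point of scoring only held-out rows is that every prediction comes from a model that never saw that sale, so the figure reflects out-of-sample performance rather than fit to the training data.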

Predictive Accuracy: The model yielded a mean absolute error (MAE) of $72,328. The root mean squared error (RMSE) was substantially higher at $137,219; because RMSE penalizes large misses more heavily than MAE, the gap indicates that the model struggles to predict high-value luxury outliers accurately.
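The gap between MAE and RMSE is easier to see on a toy example. The four sales below are invented for illustration and the Python is a stand-in for the project's R code: three homes are missed by $10,000 each and one luxury home is missed by $600,000.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: every dollar of error counts equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: squaring makes large misses dominate."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical sales: the last one is a luxury outlier with a large miss.
actual = [200_000, 250_000, 300_000, 2_000_000]
predicted = [210_000, 240_000, 310_000, 1_400_000]

mae_all = mae(actual, predicted)
rmse_all = rmse(actual, predicted)

# Dropping the outlier: all errors are the same size, so MAE equals RMSE.
mae_typical = mae(actual[:3], predicted[:3])
rmse_typical = rmse(actual[:3], predicted[:3])
```

When errors are all the same size the two metrics coincide, so a large RMSE-to-MAE ratio is a quick signal that a handful of sales account for much of the error.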

Key Drivers: Stepwise selection identified square footage, number of bathrooms, median household income, and fireplaces as the strongest predictors of value.

Spatial Residuals: Diagnostic mapping revealed geographic patterns in prediction error. The model significantly under-predicted values in University City (likely due to tax-exempt land influence from UPenn/Drexel) and over-predicted values in disinvested neighborhoods like Parkside and Wynnefield.

Diagnostics: Residual plots showed a funnel shape (heteroskedasticity): prediction error grows as property value increases. This pattern is common in real estate modeling, and modeling the luxury segment separately might improve future performance.

This workflow serves as a foundational tool for automated valuation models (AVMs) used by tax assessors and real estate developers.

Mass Appraisal: By identifying neighborhoods where the model consistently over-predicts value (e.g. Parkside), city officials can flag areas at risk of regressive taxation, where lower-income homeowners may be taxed on assessed values that exceed their homes' actual market worth.
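One simple way to surface such neighborhoods is to average the signed percentage error by neighborhood. The sketch below assumes that approach; it is written in Python as a stand-in for the project's R code, and the sale prices and predictions are invented for illustration, not the project's results.

```python
from collections import defaultdict

def mean_pct_error_by_group(records):
    """Average signed error, (predicted - actual) / actual, per neighborhood.
    Positive means the model over-predicts there on average."""
    errors = defaultdict(list)
    for neighborhood, actual, predicted in records:
        errors[neighborhood].append((predicted - actual) / actual)
    return {name: sum(errs) / len(errs) for name, errs in errors.items()}

def flag_over_predicted(records, threshold=0.10):
    """Neighborhoods whose mean over-prediction exceeds the threshold."""
    means = mean_pct_error_by_group(records)
    return sorted(name for name, err in means.items() if err > threshold)

# Hypothetical (neighborhood, sale price, predicted price) rows.
records = [
    ("Parkside", 100_000, 120_000),
    ("Parkside", 150_000, 165_000),
    ("University City", 400_000, 320_000),
    ("University City", 500_000, 450_000),
]

by_neighborhood = mean_pct_error_by_group(records)
flagged = flag_over_predicted(records)
```

The 10% threshold is an arbitrary placeholder; in practice the cutoff, and whether to use the mean or median error, would be a policy choice for the assessor.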

Site Selection: The significant coefficients for kNN amenity density and SEPTA accessibility quantify the premium buyers place on walkability and transit, providing data-driven insights for developers scouting locations for mixed-use projects.

Pitch Slides

Role

Tess Vu

Alex Stauffer

Joshua Rigsby

Jun Luu

Mark Deng

Client

Philadelphia Office of Property Assessment

Keywords

AVM, Big Data, Hedonic Price Model, OLS, Practical Applications, Regression, Urban Analytics