House Prices - Advanced Regression Techniques

This notebook comes from my personal work on a Kaggle competition

Data exploration

In [94]:
import numpy as np
import pandas as pd
import sklearn
In [95]:
df_train = pd.read_csv("data/train.csv").drop(columns=["Id"])
df_test = pd.read_csv("data/test.csv").set_index("Id")
features_num = df_train.select_dtypes(include=np.number).columns
features_cat = df_train.columns.difference(features_num)
features_num = features_num.drop('SalePrice')
In [96]:
corr = df_train.corr().query("SalePrice > 0.5")  # most relevant features
corr.loc[["SalePrice"], corr.index].sort_values(by="SalePrice", axis=1, ascending=False)
Out[96]:
SalePrice OverallQual GrLivArea GarageCars GarageArea TotalBsmtSF 1stFlrSF FullBath TotRmsAbvGrd YearBuilt YearRemodAdd
SalePrice 1.0 0.790982 0.708624 0.640409 0.623431 0.613581 0.605852 0.560664 0.533723 0.522897 0.507101
In [69]:
df_train[corr.index].hist(bins=30, figsize=(20, 10));

Features preprocessing

Deal with skewed data

The SalePrice histogram looks right-skewed. Indeed, the Shapiro-Wilk rejects the null hypothesis that the data is normally distributed:

In [97]:
import scipy 

scipy.stats.shapiro(df_train["SalePrice"]) # p-value < 0.05
Out[97]:
ShapiroResult(statistic=0.869671642780304, pvalue=3.206247534576162e-33)
In [98]:
print(df_train[features_num].skew().sort_values(ascending=False).to_frame().rename(columns=lambda x:"Skewness")[:5])
features_skewed = (df_train[features_num].skew() > 0.75).index
df_train[features_skewed] = np.log1p(df_train[features_skewed]) # log transform to reduce skewness 
df_test[features_skewed] = np.log1p(df_test[features_skewed])
               Skewness
MiscVal       24.476794
PoolArea      14.828374
LotArea       12.207688
3SsnPorch     10.304342
LowQualFinSF   9.011341
In [99]:
pd.DataFrame({"price": df_train["SalePrice"], "log(price + 1)": np.log1p(df_train["SalePrice"])}) \
  .hist(figsize=(12, 4));
df_train["SalePrice"] = np.log1p(df_train["SalePrice"])  # reduce skewness

Normalize data

In [100]:
from sklearn import preprocessing

std = preprocessing.RobustScaler()
df_train[features_num] = std.fit_transform(df_train[features_num])
df_test[features_num] = std.transform(df_test[features_num])

Fill NA

In [101]:
print(df_train[features_num].isnull().values.sum())
df_train[features_num] = df_train[features_num].fillna(df_train[features_num].mean())
print(df_train[features_num].isnull().values.sum())
df_test[features_num] = df_test[features_num].fillna(df_test[features_num].mean())
348
0

Convert categorical variables

In [102]:
train = pd.concat([df_train[features_num], pd.get_dummies(df_train[features_cat])], axis=1)
test = pd.concat([df_test[features_num], pd.get_dummies(df_test[features_cat])], axis=1)
In [103]:
col = train.columns.difference(test.columns); col  # some categorical values does not appear in the test set
Out[103]:
Index(['Condition2_RRAe', 'Condition2_RRAn', 'Condition2_RRNn',
       'Electrical_Mix', 'Exterior1st_ImStucc', 'Exterior1st_Stone',
       'Exterior2nd_Other', 'GarageQual_Ex', 'Heating_Floor', 'Heating_OthW',
       'HouseStyle_2.5Fin', 'MiscFeature_TenC', 'PoolQC_Fa',
       'RoofMatl_ClyTile', 'RoofMatl_Membran', 'RoofMatl_Metal',
       'RoofMatl_Roll', 'Utilities_NoSeWa'],
      dtype='object')
In [104]:
test[train.columns.difference(test.columns)] = 0
train_Y = df_train["SalePrice"]

Utility functions

In [105]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

def scorer(estimator, X, y):
    return mean_squared_error(estimator.predict(X), y)**.5

def error(estimator, n_splits=5):
    return cross_val_score(estimator, train, train_Y, 
                           scoring=scorer, cv=n_splits).mean()

def submit(estimator):
    e = estimator.fit(train, df_train["SalePrice"])
    pd.DataFrame({"Id": df_test.index, "SalePrice": np.expm1(e.predict(test) )}) \
      .to_csv("submission.csv", index=False)
    error = scorer(estimator, train, df_train["SalePrice"])
    print(f"Error on train set: {error}")

Linear regression

Simple linear regression

In [106]:
print(f"CV Error: {error(sklearn.linear_model.LinearRegression())}")
submit(sklearn.linear_model.LinearRegression())  # overfitting a lot
CV Error: 298646942.7842525
Error on train set: 0.09243823419181545

Ridge regression

In [107]:
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"

alphas = [1, 3, 5, 7, 8, 9, 10, 12, 20, 50]
errors = [error(sklearn.linear_model.Ridge(alpha=a)).mean() for a in alphas]
fig = px.line(pd.DataFrame({"alpha": alphas, "error": errors}).set_index("alpha"))
fig.show()