# Regression with scikit-learn

Ha Khanh Nguyen (hknguyen)

## 1. Linear Regression

• Linear regression is probably the most well-known machine learning method. Most of you have likely already seen an example of simple linear regression previously.
• We won't go into the details of what a linear regression model is and what it does. If you're interested, please read here.
• What we will discuss is HOW to perform linear regression in Python with scikit-learn.

### 1.1 Example: Predicting Housing Prices in Boston

• scikit-learn provides a number of famous datasets to "play with" while getting to know the methods. You can view all of them here.
• In this example, we will look at the Boston Housing Dataset.
• The returned object is a dictionary-like object with A LOT of information:
• 'data': the predictor/feature matrix
• 'target': the response variable
• 'feature_names': the names of the features
• 'DESCR': the dataset description
• As always, let's take a look at this data:
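A minimal sketch of loading and inspecting the data. Note that load_boston was removed in scikit-learn 1.2, so running this requires an older version:

```python
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

boston = load_boston()
print(boston.keys())            # 'data', 'target', 'feature_names', 'DESCR', ...
print(boston['feature_names'])  # CRIM, ZN, INDUS, CHAS, NOX, RM, ...
print(boston['data'].shape)     # (506, 13): 506 neighborhoods, 13 features
print(boston['target'][:5])     # median home values, in thousands of dollars
```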

### 1.2 Fitting a Linear Regression Model

• For classification models, model.score() returns the accuracy of the model (the fraction of observations whose labels we predicted correctly).
• For regression models, model.score() returns the coefficient of determination $R^2$, the proportion of the variability in the response variable explained by the model.
• $R^2$ is one of the metrics we can use for model evaluation.
• Another metric we can use is called RMSE (root mean squared error).
• Even though the function name is mean_squared_error, when called with squared=False it returns the RMSE.
• Notes: $R^2$ is a proportion, so its value is at most 1 (and between 0 and 1 for a linear model scored on its own training data). RMSE, on the other hand, satisfies RMSE $\geq 0$, and it is in the units of the response variable. Here the response is recorded in thousands of dollars, so an RMSE of about 5.09 corresponds to roughly 5,090 US dollars. Both the fit and the two metrics are sketched below.
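A minimal sketch of fitting the model and computing both metrics, assuming the boston object loaded above. The train/test split and random_state=0 are illustrative choices, and squared=False requires scikit-learn 0.22 or later:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hold out a test set so the metrics reflect unseen data
X_train, X_test, y_train, y_test = train_test_split(
    boston['data'], boston['target'], random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

# Coefficient of determination R^2 on the test set
print(model.score(X_test, y_test))

# squared=False makes mean_squared_error return the ROOT mean squared error
print(mean_squared_error(y_test, model.predict(X_test), squared=False))
```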

### 1.3 Model Parameters

• With linear regression, the estimated coefficients in the model are very important! The sketch below shows how to retrieve them, together with the estimated intercept $\hat{\beta}_0$, from the LinearRegression object.
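A sketch, assuming the model fitted above. coef_ holds one estimated slope per feature, and intercept_ holds $\hat{\beta}_0$:

```python
# One estimated coefficient per feature, in the same order as feature_names
for name, coef in zip(boston['feature_names'], model.coef_):
    print(f'{name:>8}: {coef: .4f}')

# The estimated intercept (beta_0 hat)
print('intercept:', model.intercept_)
```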

### 1.4 Linear Regressions with Selective Features

• In modeling, sometimes using fewer features produces better results!
• Variable selection is a well-studied topic. We can't cover it in this unit, but what we will do is learn how to fit a linear regression model using only a selection of features (instead of all of them, like above).
• Based on the plots we did above, let's say we decide to exclude ZN, INDUS, TAX, and B from our model!
• Fit the linear regression model with all features except ZN, INDUS, TAX, and B, as sketched below.
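One way to do this is sketched below, dropping the excluded columns by their positions in boston['feature_names']; the exclude set and the _sel variable names are illustrative:

```python
# Indices of the columns we keep
exclude = {'ZN', 'INDUS', 'TAX', 'B'}
keep = [i for i, name in enumerate(boston['feature_names']) if name not in exclude]

X_train_sel = X_train[:, keep]
X_test_sel = X_test[:, keep]

model_sel = LinearRegression().fit(X_train_sel, y_train)
print(model_sel.score(X_test_sel, y_test))   # R^2 with the reduced feature set
```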

### 1.5 statsmodels Ordinary Least Squares

• "statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration." - statsmodels.org
• The function calls and output resemble those of R!
• You might ask: what about the estimated intercept $\hat{\beta}_0$? By default, statsmodels does not include a constant term in the model, but we can add one using the statsmodels.tools.add_constant() function.
• Here, $R^2$ is provided in the model summary! What about RMSE? It is not printed, but we can compute it from the residuals.
• Even better, if you want to fit your model the "R way", statsmodels can do that too!
• First, we need a DataFrame with the features AND the response variable!
• Now, let's fit the model (see the sketch after this list)!
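A sketch of both statsmodels APIs, assuming the boston object from above. MEDV is the target's name in the dataset description; the three-feature formula is an illustrative choice:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# A DataFrame with the features AND the response variable
df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
df['MEDV'] = boston['target']

# Array-style API: add the constant column explicitly to get an intercept
X = sm.add_constant(df.drop(columns='MEDV'))
ols = sm.OLS(df['MEDV'], X).fit()
print(ols.summary())                 # R-like summary table, including R^2

# RMSE is not in the summary, but it follows from the residuals
print(np.sqrt(np.mean(ols.resid ** 2)))

# Formula ("R-way") API: the intercept is added automatically
ols_f = smf.ols('MEDV ~ CRIM + RM + LSTAT', data=df).fit()
print(ols_f.params)
```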

## 2. $k$-Nearest Neighbors

• We have seen $k$-nearest neighbors at work with a classification problem.
• For regression, it works similarly: find the $k$ nearest points in the training data to the new observation, then compute the average of the response variable over those points.
• Let's use a simple dataset for this example so we can visualize it better (see the sketch after this list):
• mglearn is a package provided by the authors of Introduction to Machine Learning with Python.
• They provide a number of datasets and functions to produce example plots.
• Now let's fit the $k$-nearest neighbors regressor to the Boston housing dataset!
• Maybe the model will perform better with fewer features?
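A sketch reusing the earlier split. make_wave and plot_knn_regression come from mglearn, and n_neighbors=3 is an arbitrary illustrative choice:

```python
import mglearn
from sklearn.neighbors import KNeighborsRegressor

# A simple 1-D dataset that is easy to visualize
X_wave, y_wave = mglearn.datasets.make_wave(n_samples=40)
mglearn.plots.plot_knn_regression(n_neighbors=3)   # example plot from mglearn

# k-NN regression on the Boston housing data
knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(knn.score(X_test, y_test))                   # R^2 on the test set

# ...and on the reduced feature set from Section 1.4
knn_sel = KNeighborsRegressor(n_neighbors=3).fit(X_train_sel, y_train)
print(knn_sel.score(X_test_sel, y_test))
```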

## 3. Decision Trees

• Decision trees are best known for classification tasks, but it turns out they can be used for regression too!
• One thing to note is that the DecisionTreeRegressor (and all other tree-based regression models) is not able to extrapolate, that is, make predictions outside the range of the training data.
• Maybe the model will perform better with fewer features?
• A 100% $R^2$ is likely a sign of overfitting! We should try limiting the depth of the tree (see the sketch after this list).
• Now, there are A LOT of decisions to make when building a Decision Tree model! You will learn about them in an ML course. It's a much more complicated process than what we just did above!
• See the documentation of DecisionTreeRegressor here.
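A sketch, again reusing the earlier split; max_depth=4 and random_state=0 are arbitrary illustrative choices:

```python
from sklearn.tree import DecisionTreeRegressor

# An unrestricted tree can memorize the training set
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(tree.score(X_train, y_train))   # 1.0: a perfect (overfit) training R^2
print(tree.score(X_test, y_test))     # typically much lower on the test set

# Limiting the depth regularizes the tree
tree_small = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)
print(tree_small.score(X_test, y_test))
```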

### Feature importance in trees

• The DecisionTreeRegressor object (like DecisionTreeClassifier) has the feature_importances_ attribute, which gives us an idea of which features are influential in our model.
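A sketch using the depth-limited tree from above; the importances are non-negative and sum to 1:

```python
# Larger values mean the feature contributed more to the tree's splits
for name, imp in zip(boston['feature_names'], tree_small.feature_importances_):
    print(f'{name:>8}: {imp:.3f}')
```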
