# Classification with scikit-learn

Ha Khanh Nguyen (hknguyen)

## 1. k-Nearest Neighbors

• We have already seen an example of a $k$-nearest neighbors classifier in Lecture 14.1.
• Let's look at a different example this time!
• The images below correspond to $k$-nearest neighbors classifiers fitted to a very simple dataset with 2 features.
• $k$-nearest neighbors classifier with $k = 1$:
• $k$-nearest neighbors classifier with $k = 3$:
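To make the effect of $k$ concrete, here is a minimal sketch that fits both classifiers to a tiny 2-feature dataset. The six points and their labels are made up for illustration and are not the dataset shown in the images.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 6 points, 2 features, binary labels.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [1.2, 1.2]])
y = np.array([0, 0, 0, 1, 1, 1])

# Same data, two choices of k.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
knn3 = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# The single closest point to (1, 1) is (1.2, 1.2), which has label 1,
# but two of its three closest points have label 0.
new_point = np.array([[1.0, 1.0]])
pred_k1 = knn1.predict(new_point)[0]
pred_k3 = knn3.predict(new_point)[0]
print(pred_k1, pred_k3)
```

With $k = 1$ the prediction follows the single nearest point, while with $k = 3$ it follows the majority vote among the three nearest points, so the two classifiers can disagree on the same observation.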

### 1.1 Example: UCI Heart Disease Dataset

• By now, you're probably tired of this dataset! But it's a classic dataset for classification, so of course we have to take a look at it!
• Source: UCI Heart Disease Data Set.
• The response variable here would be num, the diagnosis.
• In this problem, we only care whether the patient has heart disease or not, so we will set every value of num greater than 0 to 1.
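One possible way to recode the response is sketched below. The DataFrame name `heart` and the four rows are hypothetical stand-ins; only the column name num comes from the dataset description above.

```python
import pandas as pd

# Hypothetical rows; the real dataset has more columns and observations.
heart = pd.DataFrame({"age": [63, 67, 41, 56], "num": [0, 2, 0, 3]})

# Any value greater than 0 becomes 1 (disease present); 0 stays 0.
heart["num"] = (heart["num"] > 0).astype(int)
print(heart["num"].tolist())
```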

### 1.2 Exploratory Data Analysis

• In other words, let's plot this data!
• A number of variables in this dataset are categorical: sex, cp, fbs, restecg, exang, slope, ca, and thal.
• For categorical variables, we will need to create dummy variables (indicator variables).
• What are dummy variables?
• A dummy variable is a numeric variable that represents categorical data, such as gender, race, etc.
• Note that sex, fbs, and exang are already dummy variables.
• The only possible values are 0 or 1.
• Categorical is a pandas dtype that we didn't cover previously.
• If you're familiar with R, this is roughly the equivalent of as.factor().
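As a rough sketch of how dummy variables can be created, the example below converts one categorical column to the pandas category dtype and expands it with `pd.get_dummies()`. The DataFrame `heart` and its values are hypothetical, and this is one common approach rather than necessarily the one used in the lecture.

```python
import pandas as pd

# Hypothetical values for one categorical column (cp) and one numeric column.
heart = pd.DataFrame({"cp": [1, 4, 3, 4], "age": [63, 67, 41, 56]})

# Convert to the category dtype, similar in spirit to as.factor() in R.
heart["cp"] = heart["cp"].astype("category")

# Create one 0/1 indicator column per level of cp.
dummies = pd.get_dummies(heart, columns=["cp"], dtype=int)
print(dummies)
```

Each level of cp that appears in the data gets its own indicator column, named by joining the original column name and the level with an underscore.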

### 1.3 Fit the Model

• Now, we're ready to fit the $k$-nearest neighbors model to the data!
• But first, let's divide the data into training and testing data!
• Then, initialize the model with the number of neighbors.
• Now, fit!
• Compute accuracy:
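The steps above can be sketched as follows. Synthetic data from `make_classification()` stands in for the heart disease features and response, and the split size, `random_state` values, and `n_neighbors=5` are illustrative assumptions rather than the lecture's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared features (X) and binary response (y).
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# Divide the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Initialize the model with the number of neighbors, then fit.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Accuracy: the proportion of test observations labeled correctly.
accuracy = knn.score(X_test, y_test)
print(accuracy)
```

Accuracy is measured on the testing data, which the model did not see during fitting, so it gives a fairer picture than accuracy on the training data.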

## 2. Logistic Regression

• This is a very famous classification method! The math behind it is not trivial, so we will not cover it here.
• Let's just quickly fit this model to the heart disease dataset and see how it does!
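A minimal sketch of such a fit is shown below. Synthetic data again stands in for the heart disease data, and `max_iter=1000` is an assumption chosen to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features (X) and binary response (y).
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit logistic regression and measure accuracy on the testing data.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
logreg_accuracy = logreg.score(X_test, y_test)

# Estimated class probabilities for the first three test observations.
probabilities = logreg.predict_proba(X_test[:3])
print(logreg_accuracy)
print(probabilities)
```

Unlike $k$-nearest neighbors, logistic regression also gives an estimated probability for each class through `predict_proba()`, with one row per observation and one column per class.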

## 3. Decision Tree

• A decision tree gives a "solution" that is easy to understand: you can usually draw a flow chart that shows how each observation is labeled.
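To illustrate the flow chart idea, the sketch below fits a small tree and prints its rules as text with `export_text()`. The six observations and the feature names `age` and `flag` are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: the label matches the second feature exactly.
X = [[25, 0], [40, 1], [60, 1], [35, 0], [70, 0], [50, 1]]
y = [0, 1, 1, 0, 0, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Text version of the flow chart: one line per split or leaf.
rules = export_text(tree, feature_names=["age", "flag"])
print(rules)

# Follow the rules for a new observation.
tree_pred = tree.predict([[30, 1]])[0]
```

Reading the printed rules from top to bottom is the same as following the flow chart: check the condition at each split, then move down the matching branch until a class label is reached.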

### Feature importance in trees

• The DecisionTreeClassifier object has the feature_importances_ attribute, which gives us an idea of which features are influential in our model.
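Here is a small sketch of reading this attribute after fitting. The data and the feature names `age` and `flag` are hypothetical, chosen so that only one feature is needed to separate the classes.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label matches the second feature exactly.
feature_names = ["age", "flag"]
X = [[25, 0], [40, 1], [60, 1], [35, 0], [70, 0], [50, 1]]
y = [0, 1, 1, 0, 0, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# One importance value per feature, in the same order as the columns of X.
importances = dict(zip(feature_names, tree.feature_importances_))
print(importances)
```

The importances are normalized to sum to 1, and a feature that the tree never splits on gets an importance of 0.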

References: