*Ha Khanh Nguyen (hknguyen)*

- "Machine learning (ML) is about learning functions from data." - David Dalpiaz, STAT 432.
- **The purpose of this unit in STAT 430 is NOT to teach you machine learning!** We focus instead on teaching you HOW to perform machine learning methods in Python.
- To learn the details of machine learning methods (why they work, how they work, etc.), STAT 432 is highly recommended!

- Machine learning methods are divided into groups based on the *tasks* they accomplish:
  - Supervised learning
  - Unsupervised learning

- In supervised learning, we want to "predict" a specific response variable (also called the target or outcome variable).
- Supervised learning is divided into:
  - **Regression**: the response variable is *numeric*
  - **Classification**: the response variable is *categorical*

- In the regression task, we want to predict a **numeric** response variable. The non-response variables (also known as features or predictors) can be either categorical or numeric.
- In the example below, `x1`, `x2`, and `x3` are features and `y` is the response variable.

| x1 | x2    | x3    | y     |
|----|-------|-------|-------|
| A  | -0.66 | 0.48  | 14.09 |
| A  | 1.55  | 0.97  | 2.92  |
| A  | -1.19 | -0.81 | 15.00 |
| A  | 0.15  | 0.28  | 9.29  |
| B  | -1.09 | -0.16 | 17.57 |
| B  | 1.61  | 1.94  | 2.12  |
| B  | 0.04  | 1.72  | 8.92  |
| A  | 1.31  | 0.36  | 4.40  |
| C  | 0.98  | 0.30  | 4.40  |
| C  | 0.88  | -0.39 | 4.52  |

- Classification is similar to regression, except it considers **categorical** response variables.

| x1 | x2    | x3    | y |
|----|-------|-------|---|
| Q  | -0.66 | 0.48  | B |
| Q  | 1.55  | 0.97  | C |
| Q  | -1.19 | -0.81 | B |
| Q  | 0.15  | 0.28  | A |
| P  | -1.09 | -0.16 | B |
| P  | 1.61  | 1.94  | B |
| P  | 0.04  | 1.72  | C |
| P  | 1.31  | 0.36  | C |
| Q  | 0.98  | 0.30  | B |
| P  | 0.88  | -0.39 | B |

- Unsupervised learning is a very broad task that is rather difficult to define. Essentially, it is learning without a response variable. To get a better idea about what unsupervised learning is, consider some specific tasks.

- Clustering is essentially the task of grouping the observations of a dataset.

- Density estimation: estimating the distribution that generated the data.
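As a rough sketch of the density-estimation idea (using made-up bimodal data rather than the course dataset; the sample and grid below are illustrative assumptions), SciPy's `gaussian_kde` estimates a smooth density from a sample:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up bimodal sample, similar in spirit to eruption durations.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(2.0, 0.3, 100),
                         rng.normal(4.5, 0.4, 100)])

# Estimate the density and evaluate it on a grid of points.
kde = gaussian_kde(sample)
grid = np.linspace(1, 6, 200)
density = kde(grid)
```

The estimated curve should show two bumps, one near each mode of the sample.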

In [1]:

```
import pandas as pd
faithful = pd.read_csv('https://stat430.hknguyen.org/files/datasets/faithful.csv')
```

In [2]:

```
import seaborn as sns
sns.set_theme()
sns.scatterplot(data=faithful, x='waiting', y='eruptions')
```

Out[2]:

<AxesSubplot:xlabel='waiting', ylabel='eruptions'>
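The two groups visible in a scatterplot like the one above can be recovered automatically by a clustering algorithm. As a hedged sketch (not part of the original notes), here is scikit-learn's `KMeans` on synthetic two-group data; the synthetic data and parameter choices are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with two well-separated groups
# (stands in for the faithful waiting/eruptions data).
rng = np.random.default_rng(0)
group1 = rng.normal(loc=[55, 2.0], scale=0.5, size=(100, 2))
group2 = rng.normal(loc=[80, 4.5], scale=0.5, size=(100, 2))
X = np.vstack([group1, group2])

# Fit k-means with 2 clusters and label every observation.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```

Note that no response variable is involved: the algorithm only sees the features, which is what makes this unsupervised.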

- Model evaluation is a HUGE topic. We will not have the time to cover it in this class.
- One of the most fundamental pieces of model evaluation is the distinction between **training data** and **testing data**.
  - In order to **learn a function** from the data, we need to "build" the model based on some data. The data used to build the model is called the **training data**.
  - We want to use our model/function to predict/label new observations. But how will we know whether our model is accurate or not? That means we also need some data to test our model! That is called the **testing data**!
- However, the testing data cannot be the same as the training data. So often, at the beginning of a project, we divide the data into training and testing sets.
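To make the idea concrete, here is a minimal sketch (the toy data is made up for illustration) of dividing a dataset with scikit-learn's `train_test_split`, which by default holds out 25% of the rows for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 observations with 2 features each, and a label per row.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# By default, 25% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(len(X_train), len(X_test))  # 6 training rows, 2 testing rows
```

Setting `random_state` makes the (otherwise random) split reproducible.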

- Dataset: Iris Data Set
- Goal: use the measurements of an Iris' petal length/width and sepal length/width to decide if the Iris is *Setosa*, *Versicolor*, or *Virginica*.
- The end goal is a function (also called a model) that can be used to label a new observation.

- Before we can perform any of these cool methods, we first need to install the required packages/libraries:

  `conda install scikit-learn`

- If that doesn't work, try `pip install scikit-learn`.

- The first step is always getting to know the data you're working with.

In [3]:

```
iris = pd.read_csv('https://stat430.hknguyen.org/files/datasets/iris.csv')
```

In [4]:

```
iris
```

Out[4]:

|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species   |
|-----|--------------|-------------|--------------|-------------|-----------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | setosa    |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | setosa    |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | setosa    |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | setosa    |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | setosa    |
| ... | ...          | ...         | ...          | ...         | ...       |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | virginica |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | virginica |

150 rows × 5 columns

- It's highly recommended that you plot these variables to get an idea of what the data looks like and the variables' association with one another.

In [5]:

```
sns.pairplot(data=iris, hue='Species')
```

Out[5]:

<seaborn.axisgrid.PairGrid at 0x7ffe2d1f0130>

- There are many classification algorithms in scikit-learn that we could use. We will start with the $k$-nearest neighbors classifier since it is very intuitive!
- **$k$-nearest neighbors**: to make a prediction for a new data point, the algorithm finds the **$k$ points** in the training set that are closest to the new point. Then it assigns the most common label among these $k$ training points to the new data point.
- We will start small with $k = 1$.
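The $k = 1$ case can be sketched from scratch in a few lines (a hedged illustration with made-up points, not how scikit-learn implements it internally): find the single training point closest to the new point and copy its label.

```python
import numpy as np

def one_nn_predict(X_train, y_train, x_new):
    """Label x_new with the label of its nearest training point
    (Euclidean distance) -- the k = 1 special case of k-NN."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(distances)]

# Made-up 2-D training points with two labels.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.3, 4.9]])
y_train = np.array(['a', 'a', 'b', 'b'])

print(one_nn_predict(X_train, y_train, np.array([1.1, 0.9])))  # a
print(one_nn_predict(X_train, y_train, np.array([5.1, 5.0])))  # b
```

scikit-learn's `KNeighborsClassifier` below does the same thing, with faster neighbor searches and support for any $k$.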

In [6]:

```
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
```

- Before we "train" the model on the data, we need to split the data into training and testing data:

In [7]:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.iloc[:, 0:4], iris['Species'], random_state=0)
```

In [8]:

```
knn.fit(X_train, y_train)
```

Out[8]:

KNeighborsClassifier(n_neighbors=1)

- Now we can use the model `knn` to make predictions on new data!

In [9]:

```
# assume this is a new observation
import numpy as np
X_new = np.array([[5, 2.9, 1, 0.2]])
```

In [10]:

```
knn.predict(X_new)
```

Out[10]:

array(['setosa'], dtype=object)

- Note that the `predict()` function takes **an array of observations (not a single observation!)**.
  - Try supplying a single observation and you will get an error! (Definitely try it out though!)
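Here is a self-contained sketch of that error (the tiny training set is made up just to have a fitted model): a 1-D observation raises a `ValueError`, and reshaping it into a 1-row 2-D array fixes it.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set, just to have a fitted model.
X_train = np.array([[1.0, 1.0], [5.0, 5.0]])
y_train = np.array(['a', 'b'])
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

x_new = np.array([1.2, 0.8])   # a single observation (1-D)
try:
    knn.predict(x_new)         # 1-D input is rejected
except ValueError as e:
    print('ValueError:', e)

print(knn.predict(x_new.reshape(1, -1)))  # 2-D: one row, works
```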

- scikit-learn loves NumPy (many things in Python feel the same way), but it works with plain Python lists as well.

In [11]:

```
X_new = [[5, 2.9, 1, 0.2]]
knn.predict(X_new)
```

Out[11]:

array(['setosa'], dtype=object)

- Using the testing data, we can estimate the accuracy of our model, meaning the percentage of observations our model labels correctly.

In [12]:

```
y_pred = knn.predict(X_test)
```

In [13]:

```
y_pred
```

Out[13]:

array(['virginica', 'versicolor', 'setosa', 'virginica', 'setosa', 'virginica', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'setosa', 'versicolor', 'versicolor', 'setosa', 'setosa', 'virginica', 'versicolor', 'setosa', 'setosa', 'virginica', 'setosa', 'setosa', 'versicolor', 'versicolor', 'setosa', 'virginica', 'versicolor', 'setosa', 'virginica', 'virginica', 'versicolor', 'setosa', 'virginica'], dtype=object)

In [14]:

```
# computing accuracy
np.mean(y_pred == y_test)
```

Out[14]:

0.9736842105263158

- We can also use the `score()` method of the `knn` object, which will compute the test set accuracy for us:

In [15]:

```
knn.score(X_test, y_test)
```

Out[15]:

0.9736842105263158

**References:**

- David Dalpiaz, *Basics of Statistical Learning*.
- Andreas C. Müller & Sarah Guido, *Introduction to Machine Learning with Python: A Guide for Data Scientists*.