# An Extremely Brief Intro to Machine Learning in Python

Ha Khanh Nguyen (hknguyen)

## 1. What is Machine Learning?

• "Machine learning (ML) is about learning functions from data." - David Dalpiaz, STAT 432.
• The purpose of this unit in STAT 430 is NOT teaching you machine learning!!
• We focus instead on teaching you HOW to perform machine learning methods in Python.
• To learn the details of machine learning methods (why it works, how it works, etc.), STAT 432 is highly recommended!

## 2. Types of Machine Learning

• Machine learning methods are divided into groups based on the tasks they accomplish:
• Supervised learning
• Unsupervised learning

### 2.1 Supervised Learning

• In supervised learning, we want to “predict” a specific response variable (target or outcome variable).
• Divided into:
• Regression: the response variable is numeric
• Classification: the response variable is categorical

#### 2.1.1 Regression

• In the regression task, we want to predict numeric response variables. The non-response variables (also known as features or predictors) can be either categorical or numeric.
• In this example, x1, x2, and x3 are features and y is the response variable.
| x1 | x2    | x3    | y     |
|----|-------|-------|-------|
| A  | -0.66 | 0.48  | 14.09 |
| A  | 1.55  | 0.97  | 2.92  |
| A  | -1.19 | -0.81 | 15.00 |
| A  | 0.15  | 0.28  | 9.29  |
| B  | -1.09 | -0.16 | 17.57 |
| B  | 1.61  | 1.94  | 2.12  |
| B  | 0.04  | 1.72  | 8.92  |
| A  | 1.31  | 0.36  | 4.40  |
| C  | 0.98  | 0.30  | 4.40  |
| C  | 0.88  | -0.39 | 4.52  |
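As a sketch of how a regression task like this might be set up in Python, the snippet below fits scikit-learn's LinearRegression to the table above; one-hot encoding x1 with pandas get_dummies is one common way to handle a categorical feature (the choice of linear regression and the encoding are illustrative, not part of the original notes):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# The small regression table from above: x1 is categorical, y is numeric
df = pd.DataFrame({
    "x1": ["A", "A", "A", "A", "B", "B", "B", "A", "C", "C"],
    "x2": [-0.66, 1.55, -1.19, 0.15, -1.09, 1.61, 0.04, 1.31, 0.98, 0.88],
    "x3": [0.48, 0.97, -0.81, 0.28, -0.16, 1.94, 1.72, 0.36, 0.30, -0.39],
    "y":  [14.09, 2.92, 15.00, 9.29, 17.57, 2.12, 8.92, 4.40, 4.40, 4.52],
})

# Categorical features must be encoded numerically, e.g. as one-hot/dummy columns
X = pd.get_dummies(df[["x1", "x2", "x3"]], columns=["x1"])
y = df["y"]

model = LinearRegression().fit(X, y)
print(model.predict(X)[:3])  # in-sample predictions for the first 3 rows
```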

#### 2.1.2 Classification

• Classification is similar to regression, except it considers categorical response variables.
| x1 | x2    | x3    | y |
|----|-------|-------|---|
| Q  | -0.66 | 0.48  | B |
| Q  | 1.55  | 0.97  | C |
| Q  | -1.19 | -0.81 | B |
| Q  | 0.15  | 0.28  | A |
| P  | -1.09 | -0.16 | B |
| P  | 1.61  | 1.94  | B |
| P  | 0.04  | 1.72  | C |
| P  | 1.31  | 0.36  | C |
| Q  | 0.98  | 0.30  | B |
| P  | 0.88  | -0.39 | B |

### 2.2 Unsupervised Learning

• Unsupervised learning is a broad task that is difficult to define precisely. Essentially, it is learning without a response variable. To get a better idea of what unsupervised learning is, consider some specific tasks.

#### 2.2.1 Clustering

• Clustering is the task of grouping the observations of a dataset so that similar observations end up in the same group.
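As a concrete sketch, scikit-learn's KMeans can cluster data with no response variable at all; the two synthetic "blobs" of points below are made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs of 2-D points: features only, no response variable
X = np.vstack([
    rng.normal(0.0, 0.5, (20, 2)),  # blob centered near (0, 0)
    rng.normal(5.0, 0.5, (20, 2)),  # blob centered near (5, 5)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # each observation is assigned a cluster label (0 or 1)
```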

#### 2.2.2 Density Estimation

• Density estimation is the task of estimating the probability distribution that generated the observed data.
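A minimal sketch using scikit-learn's KernelDensity (the sample data and bandwidth below are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (200, 1))  # samples drawn from a standard normal

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples() returns log-densities; the estimated density near 0
# should be much higher than near 3 for standard-normal data
log_dens = kde.score_samples(np.array([[0.0], [3.0]]))
print(np.exp(log_dens))
```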

## 3. Evaluating the Functions/Models

• Model evaluation is a HUGE topic. We will not have the time to cover it in this class.
• One of the most fundamental pieces of model evaluation is training data and testing data.
• In order to learn a function from the data, we need to "build" the model based on some data. The data used to build the model is called the training data.
• We also want to use our model/function to predict/label new observations. But how will we know whether our model is accurate? That means we also need some data to test our model! That is called the testing data!
• However, the testing data cannot be the same as the training data. So, at the beginning of a project, we often split the data into a training set and a testing set.
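As a minimal sketch, scikit-learn's train_test_split is the standard helper for this split (the toy arrays here are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 observations, 2 features
y = np.arange(10)                 # 10 response values

# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)
```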

## 4. First Application: Classifying Iris Species

• Dataset: Iris Data Set
• Goal: use the measurements of an iris's petal length/width and sepal length/width to decide whether it is a Setosa, Versicolor, or Virginica.
• The end goal is a function (also called a model) that can be used to label a new observation.

### 4.1 Setting up your system for ML

• Before we can perform any of these cool methods, we first need to install the required packages/libraries:

```shell
conda install scikit-learn
```

• If that doesn't work, try:

```shell
pip install scikit-learn
```

### 4.2 Take a look at the data

• The first step is always getting to know the data you're working with.
• It's highly recommended that you plot these variables to get an idea of what the data looks like and the variables' association with one another.
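For the Iris data specifically, scikit-learn ships the dataset, so one way to take that first look (before plotting) is:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # sepal/petal length and width (in cm)
print(iris.target_names)   # the three species we want to predict
print(iris.data.shape)     # 150 observations, 4 features
```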

### 4.3 Building your first model: k-Nearest Neighbors

• There are many classification algorithms in scikit-learn that we could use. We will start with the $k$-nearest neighbors classifier (algorithm) since it is very intuitive!
• $k$-nearest neighbors: to make a prediction for a new data point, the algorithm finds the $k$ points in the training set that are closest to the new point. Then it assigns the most common label among these $k$ training points to the new data point.
• We will start small with $k = 1$.
• Before we "train" the model on the data, we need to split the data into training and testing data:
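A minimal sketch of those steps with scikit-learn might look like the following (random_state=0 is an arbitrary choice that just makes the split reproducible):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Split into training and testing data (default: 25% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Build a k-nearest neighbors classifier with k = 1 and train it
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
```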

### 4.4 Making predictions

• Now we can use the model knn to make predictions on new data!
• Note that the predict() function takes an array of observations (not a single observation!).
• Try supplying a single observation and you will get an error! (Definitely try it out though!)
• scikit-learn loves NumPy (many things in Python feel the same way), but it works with plain Python lists as well.
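Putting these notes together, a small sketch (training on the full dataset here just for brevity; the new flower's measurements are made up):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

# predict() expects a 2-D array: one row per observation, even for a single one
new_flower = np.array([[5.0, 2.9, 1.0, 0.2]])
pred = knn.predict(new_flower)

print(iris.target_names[pred])  # map the numeric label back to a species name
```

Passing a plain nested list, e.g. `knn.predict([[5.0, 2.9, 1.0, 0.2]])`, works too; passing the bare 1-D list `[5.0, 2.9, 1.0, 0.2]` is what triggers the error.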

### 4.5 Evaluating the model

• Using the testing data, we can estimate the accuracy of our model, meaning the percentage of observations our model labels correctly.
• We can also use the score() method of the knn object, which will compute the test set accuracy for us:
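A sketch of both evaluations (the split and k = 1 mirror the setup above; random_state=0 is an arbitrary reproducibility choice):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Accuracy "by hand": the fraction of test labels predicted correctly...
y_pred = knn.predict(X_test)
print(np.mean(y_pred == y_test))

# ...or let score() compute the test set accuracy for us
print(knn.score(X_test, y_test))
```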
