2 Modeling 2: Intro to classifiers with Python using sklearn
In class we'll spend some time learning about using logistic regression for binary classification problems, i.e., when our response variable has two possible outcomes (e.g., a customer defaults on a loan or does not). We'll also explore other simple classification approaches such as k-Nearest Neighbors and basic classification trees. Trees, forests, and their many variants have proven to be some of the most robust and effective techniques for classification problems.
2.1 Readings
- ISLR - Sec 3.5 (kNN), Sec 4.1-4.3 (Classification, logistic regression), Ch 8 (trees)
- PDSH - Ch 5 p331-375, p433-445, p462-470
Also, the scikit-learn documentation has useful info on logistic regression, tree-based models, and nearest neighbor models.
2.2 Downloads and other resources
2.3 Activities
We will work through a number of Jupyter notebooks (and Quarto documents) as we learn to build basic classifiers using both Python and R. Everything is available in the Downloads file above.
2.3.1 Intro to classification problems and the k-Nearest Neighbor technique
In this first part we'll:
- get a sense of what classification problems are all about,
- get our first look at the very famous Iris dataset,
- use a simple, model-free technique known as k-Nearest Neighbors to try to classify Iris species from a few physical measurements.
Work your way through this notebook: knn.ipynb
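To give you a feel for what's in the notebook, here's a minimal sketch of fitting a kNN classifier to the Iris data with scikit-learn. The specific choices (k = 5, a 75/25 train/test split) are just illustrative defaults, not necessarily what the notebook uses.

```python
# Minimal kNN sketch on the Iris data (illustrative settings, not the notebook's).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the data for testing; stratify to preserve class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Classify each test flower by majority vote among its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(f"test accuracy: {knn.score(X_test, y_test):.3f}")
```

Note there's no "model" being estimated here; prediction is just a lookup of nearby training points, which is why k is the main knob to tune.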
2.3.2 Logistic regression
Logistic regression is a variant of multiple linear regression in which the response variable is binary (two possible outcomes). It is a commonly used technique for binary classification problems. It's definitely more "mathy" than kNN. I'll try to help you develop some intuition and understanding of this technique without getting too deeply into the math/stat itself. See the Explore section at the bottom of this page for some good resources on the underlying math and stat of logistic regression.
Work your way through: logistic_regression.ipynb.
- You'll start with a short introduction to the problem, the data, and an ill-advised multiple linear regression model.
- Next, we'll review the main ideas of the logistic regression model. We'll create a null model and then use statsmodels, a Python statistics library, to build our first few models. We'll see some of the challenges in interpreting statistical models (and this is a tiny model).
- Finally, we'll learn how to use scikit-learn (sklearn, for short) to build, fit, and assess logistic regression models.
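The sklearn portion of that workflow can be sketched in a few lines. This example uses the built-in breast cancer dataset as a stand-in binary classification problem (the notebook uses its own data), and standardizes features in a pipeline so the solver converges cleanly.

```python
# Sketch of a sklearn logistic regression workflow on a stand-in binary dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardizing first helps the lbfgs solver converge on unscaled features.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

# Assess: overall accuracy plus a confusion matrix of the test predictions.
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
print(confusion_matrix(y_test, clf.predict(X_test)))
```

One difference from statsmodels worth knowing: sklearn's `LogisticRegression` applies L2 regularization by default (controlled by `C`), so its coefficients won't exactly match an unregularized statsmodels fit.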
2.3.3 Decision trees
Now on to learning about decision trees and variants such as random forests. Unfortunately, the sklearn package still doesn't support easy use of categorical variables in its tree-based models. So, we'll go back to R (and tidymodels) for learning about tree-based methods. This actually underscores something I feel pretty strongly about: the Python vs. R debate is misguided. Both tools have their place and I use both frequently. Get comfortable switching between them.
On a related note, it's gotten even easier to use them both. For example, within a Jupyter notebook using a Python kernel, it's possible to run R code using a Python package known as rpy2. Similarly, within RStudio or its successor, Positron, you can mix and match R and Python code chunks. This works via an R package known as reticulate. Exploring one or more of these packages would be an interesting addition to (or could be an entire) final project.
You'll use trees.qmd with these screencasts.
We'll start with a short introduction to the problem and the data, then do some data partitioning and a bit of data exploration. Then we'll build a simple decision tree. So, how do decision trees decide how to create their branches? We'll take a very brief look at this and point you to some resources to go deeper if you want.
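Even though the screencasts use R for this topic, it may help to see the same idea in sklearn on purely numeric data (where sklearn's categorical-variable limitation doesn't bite). This sketch fits a shallow tree to Iris and prints its branching rules; sklearn uses Gini impurity by default to choose splits.

```python
# Sketch: a shallow sklearn decision tree on numeric data (Iris).
# The screencasts use R/tidymodels; this is just the Python analogue.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# max_depth=2 keeps the tree small enough to read; splits minimize Gini impurity.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned branching rules as plain text.
print(export_text(tree, feature_names=iris.feature_names))
```

Reading the printed rules is a good way to build intuition: each branch is a simple threshold question ("is petal width below some cutoff?"), chosen because it best separates the classes at that node.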
Now we'll look at some more advanced tree based models.
2.4 Explore (OPTIONAL)
2.4.1 StatQuest YouTube Channel - Josh Starmer
- StatQuest: Confusion matrix
- StatQuest: Sensitivity and specificity
- StatQuest: Maximum likelihood
- StatQuest: Odds and Log(odds)
- StatQuest: Logistic regression - there are a bunch of follow-on videos covering various details of logistic regression
- StatQuest: Random Forests: Part 1 - Building, using and evaluation
2.4.2 Other topics
Predictive analytics at Target: the ethics of data analytics
Kappa statistic defined in plain English - Kappa is a statistic used (among other things) to measure how much better a classifier does than a random-choice model, while accounting for the underlying prevalence of the classes in the data.
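A tiny made-up example shows why kappa matters on imbalanced data: a "classifier" that always predicts the majority class can score high accuracy while being useless, and kappa exposes that.

```python
# Toy illustration (made-up labels): accuracy vs. Cohen's kappa on imbalanced data.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]   # 80% of cases are class 1
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # a lazy model that always says "1"

print(accuracy_score(y_true, y_pred))      # looks decent: 0.8
print(cohen_kappa_score(y_true, y_pred))   # kappa is 0: no better than chance
```

Because chance agreement with the majority class is already 80% here, kappa correctly scores this model at zero.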