In this tutorial, you will learn how to perform logistic regression very easily. We’ll use the Titanic dataset. You will learn the following:
- How to import CSV data
- How to convert categorical data to numeric
- How to perform classification using the Decision Tree Classifier
- How to use the Random Forest Classifier
- How to use the Gradient Boosting Classifier
- How to examine the Confusion Matrix
You may want to practice Logistic Regression
Import the Titanic dataset using the code below. Download the Titanic Dataset here.
# Import the necessary modules
import pandas as pd
import numpy as np
import seaborn as sb
Read the dataset into a pandas dataframe, df
# Read the dataset into a dataframe
df = pd.read_csv('D:/data/titanic.csv', sep='\t', engine='python')
Drop the Name, Ticket and Cabin Columns
If you view the dataset properties using df.info(), you will see that these columns are not numeric. So we’ll drop them.
# Drop some columns which are not relevant to the analysis (they are not numeric)
cols_to_drop = ['Name', 'Ticket', 'Cabin']
df = df.drop(cols_to_drop, axis=1)
Get information about the dataset
Use the info() function to get details about the data types of the dataset
Use seaborn’s heatmap() function to check which values are null
Both are given below
df.info()
sb.heatmap(df.isnull())
Interpolate for missing values
This means you can deduce the missing values by interpolating from the existing values.
For example, if we have the series 1, 3, 4, ?, 6, 8, … what is the missing value?
Simply put, it is the midpoint between 4 and 6, so the result is (4 + 6) / 2 = 5.
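As a quick check (using a toy series of my own, not the Titanic data), pandas' interpolate() fills the gap in exactly this way:

# Toy example: linear interpolation fills the missing value
s = pd.Series([1, 3, 4, np.nan, 6, 8])
print(s.interpolate())   # the NaN becomes (4 + 6) / 2 = 5.0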
To interpolate missing values for Age, use the code below
# To replace missing values with interpolated values, for example Age
df['Age'] = df['Age'].interpolate()
Drop records with missing values
# Drop all rows with missing data
df = df.dropna()
Convert categorical values to numeric
Now if you view the dataset properties using df.info(), you will see that the Sex and Embarked columns are not numeric. We will now convert them to numeric. There is a separate tutorial on converting a categorical column to numeric here.
This takes three steps:
First create dummy variables from the categorical columns
# First, create dummy columns from the Embarked and Sex columns
EmbarkedColumnDummy = pd.get_dummies(df['Embarked'])
SexColumnDummy = pd.get_dummies(df['Sex'])
Note: the get_dummies() function converts categorical variables into dummy indicator variables.
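As a small illustration (a toy column of my own, not the actual Sex column), get_dummies() turns each category into its own indicator column:

# Toy example: each category becomes its own indicator column
pd.get_dummies(pd.Series(['male', 'female', 'male']))
# => a DataFrame with 'female' and 'male' columns, one indicator value per row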
Second, we add these dummy columns to the original dataset
df = pd.concat((df, EmbarkedColumnDummy, SexColumnDummy), axis=1)
Third, drop the original categorical columns
# Drop the redundant columns thus converted
df = df.drop(['Sex', 'Embarked'], axis=1)
Now you can use df.info() and df.head() to check the dataset
You can also do a heatmap to check for null values.
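If you prefer a numeric summary over the heatmap, a quick alternative (not from the original tutorial) is to count the missing values per column:

# Count missing values in each column (should all be zero at this point)
print(df.isnull().sum())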
Separate the Features and the Classes
We will now separate the features (X) from the classes (y). The class is the target variable we want to predict; in this case, it is the ‘Survived’ column.
# Separate the dataframe into X and y data
X = df.values
y = df['Survived'].values
# Delete the Survived column from X
X = np.delete(X, 1, axis=1)
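Note that np.delete(X, 1, axis=1) assumes Survived is the second column. A slightly more robust sketch (my own variant, not the original code) drops the target by name instead:

# Alternative: build X by dropping the target column by name
X = df.drop('Survived', axis=1).values
y = df['Survived'].values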
Split the Dataset into Training and Test Datasets
# Split the dataset into 70% Training and 30% Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
Perform Classification Using Decision Tree Classifier
# Using simple Decision Tree classifier
from sklearn import tree
dt_clf = tree.DecisionTreeClassifier(max_depth=5)
dt_clf.fit(X_train, y_train)
dt_clf.score(X_test, y_test)
Output: 0.8157894736842105
A score of 81.6% is pretty good for our classifier! So we’ve done a good job. Let’s try another classifier
See the Confusion Matrix
Before we go on to the next classifier, let’s see the confusion matrix of this classifier, so we know how many true positives, false positives, and so on it produced.
First we obtain y_pred, then we import confusion_matrix from sklearn.metrics. The code is given below
y_pred = dt_clf.predict(X_test)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
The confusion matrix produced is:
array([[31,  4],
       [ 6,  6]], dtype=int64)
- True negatives: 31 (upper left)
- False positives: 4 (upper right)
- False negatives: 6 (lower left)
- True positives: 6 (lower right)
Note that scikit-learn’s confusion_matrix puts the actual classes on the rows and the predicted classes on the columns, so with ‘Survived’ = 1 as the positive class, the true negatives appear in the upper left and the true positives in the lower right.
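If you find the raw array hard to read, one option (my addition, not in the original) is to wrap it in a labelled DataFrame:

# Label the confusion matrix: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm,
                     index=['Actual: Not Survived', 'Actual: Survived'],
                     columns=['Predicted: Not Survived', 'Predicted: Survived'])
print(cm_df)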
Perform Classification Using Random Forest Classifier
Now, we will see if we can get better results using the Random Forest Classifier.
from sklearn import ensemble
rf_clf = ensemble.RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train)
rf_clf.score(X_test, y_test)
Output: 0.7368421052631579
Wow! It seems this classifier did not do as well, with a score of 73.7%. Let’s now try one more classifier.
Perform Classification Using Gradient Boosting Classifier
This classifier is available in the ensemble module, which we have already imported, so we don’t need to import anything new.
gb_clf = ensemble.GradientBoostingClassifier()
gb_clf.fit(X_train, y_train)
gb_clf.score(X_test, y_test)
Output: 0.6842105263157895
Hmm! Not so good! Let’s try to improve it by tuning.
Tune the Classifier
We can try to improve the Gradient Boosting Classifier by tuning one of its hyperparameters. Here, we set n_estimators to 50. The code is given below.
# Let's tune this Gradient booster.
gb_clf = ensemble.GradientBoostingClassifier(n_estimators=50)
gb_clf.fit(X_train, y_train)
gb_clf.score(X_test, y_test)
Output: 0.7105263157894737
You can see that after tuning we have a better score than we did before tuning.
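Rather than trying single values by hand, a more systematic approach (a sketch of my own, not part of the original tutorial; the parameter values are just illustrative) is a small grid search over a few hyperparameters:

# A minimal grid search over a few Gradient Boosting hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.05, 0.1],
}
grid = GridSearchCV(ensemble.GradientBoostingClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))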
I’m going to stop here for now, and hopefully this has been informative for you. If so, leave me a comment below. You could also watch the video lesson and subscribe to my channel.
I am confused. You did not use logistic regression in this video; you made use of other techniques. Are these techniques also logistic regression?
Thanks for asking this question, Stephen. Actually, this is a challenge faced by many: relating Logistic Regression to classification. Logistic Regression is itself a classification model, so it solves the same kind of problem as the classifiers we built here; the Decision Tree, Random Forest and Gradient Boosting models are simply other techniques for performing the same classification task.
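For completeness, here is a minimal sketch (my addition, not from the video) of fitting an actual Logistic Regression model on the same split, so you can compare its score with the classifiers above:

# Fit a plain Logistic Regression model on the same train/test split
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(X_train, y_train)
lr_clf.score(X_test, y_test)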