Logistic Regression in Python with the Titanic Dataset

How to Build a Classifier in Python

In this tutorial, you will learn how to perform classification in Python using the Titanic dataset. You will learn the following:

  • Import CSV data
  • Convert categorical data to binary
  • Perform classification using a Decision Tree Classifier
  • Use a Random Forest Classifier
  • Use a Gradient Boosting Classifier
  • Examine the confusion matrix

You may want to practice Logistic Regression

Import the Titanic dataset using the code below. Download the Titanic Dataset here.

# Import the necessary modules
import pandas as pd
import numpy as np
import seaborn as sb

 

Read the dataset into a pandas dataframe, df

# Read the dataset into a dataframe
df = pd.read_csv('D:/data/titanic.csv', sep='\t', engine='python')
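If you want to try read_csv() before downloading the file, a minimal sketch is to read a few tab-separated rows from an in-memory string. The sample rows below are made up for illustration and are not taken from the Titanic file.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample (not the real Titanic data)
sample = "PassengerId\tSurvived\tAge\n1\t0\t22\n2\t1\t38\n"

# sep='\t' tells pandas that the columns are separated by tabs
demo_df = pd.read_csv(io.StringIO(sample), sep='\t')
print(demo_df.shape)
print(demo_df.head())
```

The same call works with a file path in place of the StringIO object.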

 

Drop the Name, Ticket and Cabin Columns

If you view the dataset properties using df.info(), you will see that these columns are not numeric. So we’ll drop them.

# Drop columns that are not relevant to the analysis (they are not numeric)
cols_to_drop = ['Name', 'Ticket', 'Cabin']
df = df.drop(cols_to_drop, axis=1)

 

Get information about the dataset

Use the info() function to get details about the data types of the dataset

Use seaborn’s heatmap() function to check which values are null

Both are given below

df.info()
sb.heatmap(df.isnull())
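A heatmap shows where the nulls are, but it can also help to see the exact count per column. A minimal sketch, using a small made-up dataframe (the values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries
demo_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Pclass': [3, 1, 3, 3],
})

# isnull() marks missing cells as True; sum() counts them per column
null_counts = demo_df.isnull().sum()
print(null_counts)
```

On the Titanic dataframe you would call df.isnull().sum() in the same way.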

 

Interpolate for missing values

Interpolation means estimating a missing value from the existing values around it.

For example, given the series 1, 3, 4, ?, 6, 8, ..., what is the missing value?

Simply put, it is the midpoint between 4 and 6, so the result is (4+6)/2 = 5.

To interpolate missing values for Age, use the code below

# To replace missing values with interpolated values, for example Age
df['Age'] = df['Age'].interpolate()
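To check the worked example, here is a small sketch that runs interpolate() on the series 1, 3, 4, ?, 6, 8. By default pandas uses linear interpolation, so a single gap is filled with the midpoint of its neighbours.

```python
import numpy as np
import pandas as pd

# The example series, with NaN standing in for the missing value
s = pd.Series([1, 3, 4, np.nan, 6, 8])

# Default method is linear interpolation between neighbouring values
filled = s.interpolate()
print(filled.tolist())  # [1.0, 3.0, 4.0, 5.0, 6.0, 8.0]
```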

 

Drop records with missing values

# Drop all rows with missing data
df = df.dropna()

 

Convert categorical values to numeric

Now if you view the dataset properties using df.info(), you will see that the Sex and Embarked columns are not numeric. We will now convert them to numeric. There is a separate tutorial on Converting categorical column to numeric here.

This takes three steps:

First create dummy variables from the categorical columns

# First, create dummy columns from the Embarked and Sex columns
EmbarkedColumnDummy = pd.get_dummies(df['Embarked'])
SexColumnDummy = pd.get_dummies(df['Sex'])

Note: the get_dummies() function converts categorical variables into dummy indicator variables.

 

Second, we add these dummy columns to the original dataset

df = pd.concat((df, EmbarkedColumnDummy, SexColumnDummy), axis=1)

 

Third, drop the original categorical columns

# Drop the redundant columns thus converted
df = df.drop(['Sex','Embarked'],axis=1)
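As an alternative to the three steps above, get_dummies() can encode and drop the original columns in one call when you pass the columns argument. This is only a sketch on a small made-up dataframe. Note that the new columns are prefixed with the original column name (for example Sex_male), unlike the unprefixed names produced by the three-step version.

```python
import pandas as pd

# Hypothetical rows for illustration (not the real Titanic data)
demo_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
    'Age': [22.0, 38.0, 26.0],
})

# columns= encodes the listed columns and removes the originals in one call;
# dtype=int gives 0/1 values instead of True/False
encoded = pd.get_dummies(demo_df, columns=['Sex', 'Embarked'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```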

 

Now you can use df.info() and df.head() to check the dataset

You can also do a heatmap to check for null values.

 

Separate the Features and the Classes

We now separate the features (X) and the classes (y). The classes are the target variable we want to predict. In this case, it is the ‘Survived’ column.

# Separate the dataframe into X (features) and y (target) data
y = df['Survived'].values

# Drop the Survived column by name so that X holds only the features
X = df.drop('Survived', axis=1).values

 

Split the Dataset into Training and Test Datasets

# Split the dataset into 70% Training and 30% Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=0)
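To see what test_size=0.3 does in practice, here is a minimal sketch on a made-up array of 10 rows (the data is an assumption for illustration). With 10 samples, 3 go to the test set and 7 to the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# random_state fixes the shuffle so the split is the same on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)
```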

 

Perform Classification Using Decision Tree Classifier

# Using simple Decision Tree classifier
from sklearn import tree
dt_clf = tree.DecisionTreeClassifier(max_depth=5)
dt_clf.fit(X_train, y_train)
dt_clf.score(X_test, y_test)

Output: 0.8157894736842105

A score of 81.6% is pretty good for our classifier! So we’ve done a good job. Let’s try another classifier

 

See the Confusion Matrix

Before we go to the next classifier, let’s see the confusion matrix of this classifier, so we know how many true positives, false positives, etc. there are.

First we obtain y_pred, then we import confusion_matrix from sklearn.metrics. The code is given below

y_pred = dt_clf.predict(X_test)

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

 

The confusion matrix produced is:

array([[31,  4],
       [ 6,  6]], dtype=int64)
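
scikit-learn puts the actual classes in rows and the predicted classes in columns, with class 0 (did not survive) first. So we have:

  • True negatives: 31 (upper left)
  • False positives: 4 (upper right)
  • False negatives: 6 (lower left)
  • True positives: 6 (lower right)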
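
To read the four cells without guessing their positions, you can flatten the matrix with ravel(). For a binary problem scikit-learn returns them in the order true negatives, false positives, false negatives, true positives. The sketch below first checks this on a tiny made-up example, then computes precision and recall from the matrix shown above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Tiny made-up labels to show the layout: rows are actual, columns are predicted
y_true_demo = [0, 0, 1, 1]
y_pred_demo = [0, 1, 1, 1]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)  # 1 1 0 2

# The matrix reported in this tutorial
cm = np.array([[31, 4], [6, 6]])
tn_t, fp_t, fn_t, tp_t = cm.ravel()

# Precision: of the passengers predicted to survive, how many actually did
precision = tp_t / (tp_t + fp_t)
# Recall: of the passengers who actually survived, how many were found
recall = tp_t / (tp_t + fn_t)
print(precision, recall)  # 0.6 0.5
```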

 

Perform Classification Using Random Forest Classifier

Now, let’s see if we can get better results using the Random Forest Classifier.

from sklearn import ensemble
rf_clf = ensemble.RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train)
rf_clf.score(X_test, y_test)

Output: 0.7368421052631579

Wow! It seems this classifier did not do as well, with a score of 73.7%. Let’s now try one more classifier.
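
Keep in mind that a random forest is built with randomness, so its score can change from run to run unless you fix the seed with random_state. A minimal sketch, using synthetic data from make_classification (an assumption for illustration, not the Titanic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, used only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

def forest_score(seed):
    """Train a small forest with the given seed and return its test accuracy."""
    clf = RandomForestClassifier(n_estimators=20, random_state=seed)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# The same seed on the same data gives the same score every time
score_a = forest_score(42)
score_b = forest_score(42)
print(score_a == score_b)  # True
```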

 

Perform Classification Using Gradient Boosting Classifier

This classifier is available in the ensemble module which we already imported. So we don’t need to import anything

gb_clf = ensemble.GradientBoostingClassifier()
gb_clf.fit(X_train, y_train)
gb_clf.score(X_test, y_test)

Output: 0.6842105263157895

Hmm! Not so good! Let’s try to improve it by tuning.

 

Tune the Classifier

We can try to improve the gradient boosting classifier by setting a hyperparameter. Here, we set n_estimators to 50 (the default is 100). The code is given below.

# Let's tune this gradient boosting classifier
gb_clf = ensemble.GradientBoostingClassifier(n_estimators=50)
gb_clf.fit(X_train,y_train)
gb_clf.score(X_test, y_test)

Output: 0.7105263157894737

 

You can see that after tuning, the score improved from 68.4% to 71.1%.
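
Trying one value by hand works, but you may prefer to compare several values of n_estimators and score each with cross-validation, which is less sensitive to one particular train/test split. The sketch below is one possible approach, not the method used in this tutorial. It runs on synthetic data from make_classification, and the candidate values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, used only to keep the example self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Arbitrary candidate values for n_estimators
candidates = [25, 50, 100]
cv_scores = {}
for n in candidates:
    clf = GradientBoostingClassifier(n_estimators=n, random_state=0)
    # Mean accuracy over 5 cross-validation folds
    cv_scores[n] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()

# Pick the candidate with the highest mean accuracy
best_n = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Best n_estimators:", best_n)
```

With the Titanic data you would pass X_train and y_train instead of the synthetic arrays, and keep the test set for a final check.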

I’m going to stop here for now, and hopefully this has been informative for you. If so, leave me a comment below. You could also watch the video lesson and subscribe to my channel.

https://www.youtube.com/kindsonthegenius