How to use chi-square feature selection in Python

Nowadays, most information available in the form of text documents is already electronic (softcopy). The volume of stored text will grow enormously in the years ahead. What needs to be done is to classify the documents within a document collection (corpus) into categories that match their content. Classifying documents is difficult with an ordinary query: the results are not specific enough and can lead to a flood of irrelevant documents. Text Mining is a specialized field of Data Mining that provides solutions to problems such as processing, grouping, and analyzing large amounts of unstructured text. Feature selection is a way of improving the learning algorithm used to classify documents into particular categories by finding relevant patterns. Chi Squared is one of the methods used for the Feature Selection process, while the document classification method used is the Naïve Bayes Classifier (NBC), which is applied to solve problems related to the classification process.

Universitas Esa Unggul

Author:

Tony Nathan Setiawan

Download:

  • Abstract
  • Penggunaan Metode Chi Squared Untuk Proses Feature Selection pada Klasifikasi Dokumen Teks

In my previous two articles, I talked about how to measure correlations between the various columns in your dataset and how to detect multicollinearity between them:

Statistics in Python — Understanding Variance, Covariance, and Correlation

Understand the relationships between your data and know the difference between Pearson Correlation Coefficient and…

towardsdatascience.com

Statistics in Python — Collinearity and Multicollinearity

Understand how to discover multicollinearity in your dataset

towardsdatascience.com

However, these techniques are useful when the variables you are trying to compare are continuous. How do you compare them if your variables are categorical? In this article, I will explain how you can test two categorical columns in your dataset to determine if they are dependent on each other (i.e. correlated). We will use a statistical test known as chi-square (commonly written as χ2).

Before we start our discussion on chi-square, here is a quick summary of the test methods that can be used for testing the various types of variables:
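Broadly, the appropriate test depends on the types of the two variables being compared:

  • Continuous vs continuous: Pearson’s or Spearman’s correlation
  • Continuous vs categorical: ANOVA (or the point-biserial correlation)
  • Categorical vs categorical: the chi-square test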

Using the chi-square statistic to determine if two categorical variables are correlated

The chi-square (χ2) statistic is a way to check the relationship between two categorical nominal variables.

Nominal variables contain values that have no intrinsic ordering. Examples of nominal variables are sex, race, eye color, skin color, etc. Ordinal variables, on the other hand, contain values that have an ordering. Examples of ordinal variables are grade, education level, economic status, etc.

The key idea behind the chi-square test is to compare the observed values in your data to the expected values and see if they are related or not. In particular, it is a useful way to check if two categorical nominal variables are correlated. This is particularly important in machine learning where you only want features that are correlated to the target to be used for training.

There are two types of chi-square tests:

  • Chi-Square Goodness of Fit Test — test if one variable is likely to come from a given distribution.
  • Chi-Square Test of Independence — test if two variables might be correlated or not.

Check out //www.jmp.com/en_us/statistics-knowledge-portal/chi-square-test.html for a more detailed discussion of the above two chi-square tests.

When comparing to see if two categorical variables are correlated, you will use the Chi-Square Test of Independence.

Steps to Performing a Chi-Square Test

To use the chi-square test, you need to perform the following steps:

  1. Define your null hypothesis and alternate hypothesis. They are:
  • H₀ (Null Hypothesis) — that the 2 categorical variables being compared are independent of each other.
  • H₁ (Alternate Hypothesis) — that the 2 categorical variables being compared are dependent on each other.

2. Decide on the α value. This is the risk that you are willing to take in drawing the wrong conclusion. As an example, say you set α=0.05 when testing for independence. This means you are undertaking a 5% risk of concluding that two variables are independent when in reality they are not.

3. Calculate the chi-square score from the two categorical variables and use it to obtain the p-value. A low p-value means there is a high correlation between your two categorical variables (they are dependent on each other). The p-value tells you whether your test results are significant or not.

In a chi-square analysis, the p-value is the probability of obtaining a chi-square as large or larger than that in the current experiment and yet the data will still support the hypothesis. It is the probability of deviations from what was expected being due to mere chance. In general a p-value of 0.05 or greater is considered critical, anything less means the deviations are significant and the hypothesis being tested must be rejected.

Source: //passel2.unl.edu/view/lesson/9beaa382bf7e/8

To calculate the p-value, you need two pieces of information:

  • Degrees of freedom — for the test of independence, this is (number of rows - 1) * (number of columns - 1) of the contingency table
  • Chi-square score.

If the p-value obtained is:

  • < 0.05 (the α value you have chosen) you reject the H₀ (Null Hypothesis) and accept the H₁ (Alternate Hypothesis). This means the two categorical variables are dependent.
  • > 0.05 you accept the H₀ (Null Hypothesis) and reject the H₁ (Alternate Hypothesis). This means the two categorical variables are independent.

In the case of feature selection for machine learning, you would want the feature that is being compared to the target to have a low p-value (less than 0.05), as this means that the feature is dependent on (correlated to) the target.

With the chi-square score that is calculated, you can also use it to refer to a chi-square table to see if your score falls within the rejection region or the acceptance region.
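If you prefer not to read the chi-square table by hand, the critical value can also be looked up programmatically. The snippet below is a small sketch that assumes SciPy is installed; for α = 0.05 and 1 degree of freedom it returns roughly 3.84:

from scipy import stats

alpha = 0.05   # chosen significance level
dof = 1        # degrees of freedom of the contingency table

# Critical chi-square value: scores larger than this fall in the rejection region
critical_value = stats.chi2.ppf(1 - alpha, dof)
print(critical_value)   # about 3.84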

All the steps above sound a little vague, and the best way to really understand how chi-square works is to look at an example. In the next section, I will use the Titanic dataset and apply the chi-square test on a few of the features to see if they are correlated to the target.

Using chi-square test on the Titanic dataset

A good way to understand a new topic is to go through the concepts using an example. For this, I am going to use the classic Titanic dataset (//www.kaggle.com/tedllh/titanic-train).

The Titanic dataset is often used in machine learning to demonstrate how to build a machine learning model and use it to make predictions. In particular, the dataset contains several features (Pclass, Sex, Age, Embarked, etc) and one target (Survived). Several features in the dataset are categorical variables:

  • Pclass — the class of cabin that the passenger was in
  • Sex — the sex of the passenger
  • Embarked — the port of embarkation
  • Survived — whether the passenger survived the disaster

Because this article explores the relationships between categorical features and the target, we are only interested in the columns that contain categorical values.

Loading the Dataset

Now that you have obtained the dataset, let’s load it up in a Pandas DataFrame:

import pandas as pd
import numpy as np
df = pd.read_csv('titanic_train.csv')
df.sample(5)

Image by author

Data Cleansing and Feature Engineering

There are some columns that are not really useful and hence we will proceed to drop them. Also, there are some missing values so let’s drop all those rows with empty values:

df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Fare', 'Cabin'], inplace=True)
df.dropna(inplace=True)
df

Image by author

We will also add one more column named Alone, based on the Parch (parents or children) and SibSp (siblings or spouse) columns. The idea we want to explore is whether being alone affects the survivability of the passenger. So Alone is 1 if both Parch and SibSp are 0, else it is 0:

df['Alone'] = (df['Parch'] + df['SibSp']).apply(lambda x: 1 if x == 0 else 0)
df

Image by author

Visualizing the correlations between features and target

Now that the data is cleaned, let’s try to visualize how the sex of passengers is related to their survival of the accident:

import seaborn as sns
sns.barplot(x='Sex', y='Survived', data=df, ci=None)

The Sex column contains nominal data (i.e. ranking is not important).

Image by author

From the above figure, you can see that of all the female passengers, more than 70% survived, while of all the men, about 20% survived. It seems there is a very strong relationship between the Sex and Survived features. We will use the chi-square test to confirm this later on.

How about Pclass and Survived? Are they related?

sns.barplot(x='Pclass', y='Survived', data=df, ci=None)

Image by author

Perhaps unsurprisingly, it shows that the higher the Pclass that the passenger was in, the higher the survival rate of the passenger.

The next feature of interest is whether the place of embarkation determines who survives and who doesn’t:

sns.barplot(x='Embarked', y='Survived', data=df, ci=None)

Image by author

From the chart it seems like more people who embarked from C (Cherbourg) survived.

C = Cherbourg; Q = Queenstown; S = Southampton

You also want to know whether being alone on the trip affects one’s chances of survival:

ax = sns.barplot(x='Alone', y='Survived', data=df, ci=None)
ax.set_xticklabels(['Not Alone','Alone'])

Image by author

You can see that passengers travelling with their family had a higher chance of survival.

Visualizing the correlations between each feature

Now that we have visualized the relationships between the categorical features and the target (Survived), let’s visualize the relationships between the features themselves. Before you can do that, you need to convert the label values in the Sex and Embarked columns to numeric. To do that, you can make use of the LabelEncoder class in sklearn:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

le = preprocessing.LabelEncoder()

le.fit(df['Sex'])
df['Sex'] = le.transform(df['Sex'])
sex_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(sex_labels)

le.fit(df['Embarked'])
df['Embarked'] = le.transform(df['Embarked'])
embarked_labels = dict(zip(le.classes_, le.transform(le.classes_)))
print(embarked_labels)

The above code snippet label-encodes the Sex and Embarked columns. The output shows the mappings of the values for each column, which is very useful later when performing predictions:

{'female': 0, 'male': 1}
{'C': 0, 'Q': 1, 'S': 2}

The following statements show the relationship between Embarked and Sex:

ax = sns.barplot(x='Embarked', y='Sex', data=df, ci=None)
ax.set_xticklabels(embarked_labels.keys())

Image by author

Seems like more males boarded at Southampton (S) than at Queenstown (Q) or Cherbourg (C).

How about Embarked and Alone?

ax = sns.barplot(x='Embarked', y='Alone', data=df, ci=None)
ax.set_xticklabels(embarked_labels.keys())

Image by author

Seems like a large proportion of those who embarked from Queenstown are alone.

And finally, let’s see the relationship between Sex and Alone:

ax = sns.barplot(x='Sex', y='Alone', data=df, ci=None)
ax.set_xticklabels(sex_labels.keys())

Image by author

As you can see, more males than females were alone on the trip.

Defining the Hypotheses

You now define your null hypothesis and alternate hypothesis. As explained earlier, they are:

  • H₀ (Null Hypothesis) — that the 2 categorical variables to be compared are independent of each other.
  • H₁ (Alternate Hypothesis) — that the 2 categorical variables being compared are dependent on each other.

And you draw your conclusions based on the following p-value conditions:

  • p < 0.05 — this means the two categorical variables are correlated.
  • p > 0.05 — this means the two categorical variables are not correlated.

Calculating χ2 manually

Let’s manually go through the steps of calculating the χ2 value. The first step is to create a contingency table, using the Sex and Survived columns as an example:

Image by author

The contingency table above displays the frequency distribution of the two categorical columns — Sex and Survived.

The degrees of freedom are calculated next as (number of rows - 1) * (number of columns - 1). In this example, the degrees of freedom are (2 - 1) * (2 - 1) = 1.

Once the contingency table is created, sum up all the rows and columns, like this:

Image by author

The above are your Observed values.

Next, you are going to calculate the Expected values. Here is how they are calculated:

  • For each cell, replace the observed value with the product of its row total and its column total, divided by the grand total.
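In other words, for each cell: Expected = (row total * column total) / grand total.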

The following figure shows how the first value is calculated:

Image by author

The next figure shows how the second value is calculated:

Image by author

Here is the result for the Expected values:

Image by author

Then, calculate the chi-square value for each cell using the formula for χ2:

Image by author
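Written out, the formula is χ2 = Σ (Observed - Expected)^2 / Expected, summed over every cell of the table.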

Applying this formula to the Observed and Expected values, you get the chi-square values:

Image by author

The chi-square score is the grand total of the chi-square values:

Image by author

You can use the following websites to verify if the numbers are correct:

  • Chi-Square Calculator — //www.mathsisfun.com/data/chi-square-calculator.html

The steps above can be implemented in Python as a chi2_by_hand() function along the following lines:

def chi2_by_hand(df, col1, col2):
    # Create the contingency table (the Observed values)
    observed = pd.crosstab(df[col1], df[col2])

    # Degrees of freedom = (number of rows - 1) * (number of columns - 1)
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

    # Expected value for each cell = (row total * column total) / grand total
    row_totals = observed.sum(axis=1).values.reshape(-1, 1)
    col_totals = observed.sum(axis=0).values.reshape(1, -1)
    grand_total = observed.values.sum()
    expected = (row_totals @ col_totals) / grand_total

    # Chi-square score = sum over all cells of (Observed - Expected)^2 / Expected
    chi_square = (((observed.values - expected) ** 2) / expected).sum()

    return chi_square, dof

The chi2_by_hand() function takes in three arguments — the dataframe containing all your columns, followed by two strings containing the names of the two columns you are comparing. It returns a tuple — the chi-square score and the degrees of freedom.

Let’s now test the above function using the Titanic dataset. First, let’s compare the Sex and the Survived columns:

chi_score, dof = chi2_by_hand(df, 'Sex', 'Survived')
print(f'Chi-square score: {chi_score}, degrees of freedom: {dof}')

You will see a chi-square score of approximately 205, with 1 degree of freedom.

Using the chi-square score, you can now decide if you will accept or reject the null hypothesis using the chi-square distribution curve:

Image by author

The x-axis represents the χ2 score. The area that is to the right of the critical chi-square region is known as the rejection region. Area to the left of it is known as the acceptance region. If the chi-square score that you have obtained falls in the acceptance region, the null hypothesis is accepted; else the alternate hypothesis is accepted.

So how do you obtain the critical chi-square region? For this, you have to check the chi-square table:

Table from //people.smp.uq.edu.au/YoniNazarathy/stat_models_B_course_spring_07/distributions/chisqtab.pdf; annotations by author

You can check out the Chi-Square Table at //www.mathsisfun.com/data/chi-square-table.html

This is how you use the chi-square table. With your α set to be 0.05, and 1 degree of freedom, the critical chi-square value is 3.84 (refer to the table above). Putting this value into the chi-square distribution curve, you can conclude that:

Image by author
  • As the calculated chi-square value (205) is greater than 3.84, it therefore falls in the rejection region, and hence the null hypothesis is rejected and the alternate hypothesis is accepted.
  • Recall our alternate hypothesis: H₁ (Alternate Hypothesis) — that the 2 categorical variables being compared are dependent on each other.

This means that the Sex and Survived columns are dependent on each other.

As practice, you can use the chi2_by_hand() function on the other features.

Calculating the p-value

The previous section shows how you can accept or reject the null hypothesis by examining the chi-square score and comparing it with the chi-square distribution curve.

An alternative way to accept or reject the null hypothesis is by using the p-value. Remember, the p-value can be calculated using the chi-square score and the degrees of freedom.

For simplicity, we shall not go into the details of how to calculate the p-value by hand.

In Python, you can calculate the p-value using the stats module’s sf() function:

from scipy import stats

# p-value = area under the chi-square distribution to the right of the chi-square score
p = stats.chi2.sf(chi_score, dof)

You can now call the chi2_by_hand() function to get the chi-square score and the degrees of freedom, and then compute the p-value:

chi_score, dof = chi2_by_hand(df, 'Sex', 'Survived')
p = stats.chi2.sf(chi_score, dof)
print(f'Chi-square score: {chi_score}, degrees of freedom: {dof}, p-value: {p}')

The above code prints a p-value that is vanishingly small, far below 0.05.

As a quick recap, you accept or reject the hypotheses and form your conclusion based on the following p-value conditions:

  • p < 0.05 — this means the two categorical variables are correlated.
  • p > 0.05 — this means the two categorical variables are not correlated.

And since p < 0.05 — this means the two categorical variables are correlated.
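As a sanity check, the same numbers can be obtained directly from SciPy’s chi2_contingency() function, which takes the contingency table and returns the chi-square score, p-value, degrees of freedom, and expected frequencies. Here is a small sketch (correction=False turns off Yates’ continuity correction so that the score matches the uncorrected by-hand calculation):

from scipy.stats import chi2_contingency

# Observed frequencies for Sex vs Survived
observed = pd.crosstab(df['Sex'], df['Survived'])

# correction=False so the score matches the uncorrected by-hand calculation
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p)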

Trying out the other features

Let’s try out the other categorical columns that contain nominal values:

for feature in ['Embarked', 'Alone']:
    chi_score, dof = chi2_by_hand(df, feature, 'Survived')
    p = stats.chi2.sf(chi_score, dof)
    print(f'{feature} vs Survived: chi-square = {chi_score}, dof = {dof}, p-value = {p}')

Since the p-values for both Embarked and Alone are < 0.05, you can conclude that both the Embarked and Alone features are correlated to the Survived target, and should be included for training in your model.
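If you want to automate this kind of screening across many features at once, scikit-learn’s SelectKBest can be combined with its chi2 scoring function. The sketch below is one way to do it; note that sklearn’s chi2 treats the (non-negative) encoded feature values as counts, so its scores are not identical to the contingency-table test above, although it is a common shortcut for chi-square-based feature selection:

from sklearn.feature_selection import SelectKBest, chi2

# Features must be non-negative for sklearn's chi2 score function
X = df[['Pclass', 'Sex', 'Embarked', 'Alone']]
y = df['Survived']

# Keep the k features with the highest chi-square scores
selector = SelectKBest(score_func=chi2, k=2)
selector.fit(X, y)

print(dict(zip(X.columns, selector.scores_)))
print('Selected:', list(X.columns[selector.get_support()]))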

Summary

In this article, I have gone through a brief explanation of how the chi-square statistics test works, and how you can apply it to the Titanic dataset. A few notes of caution would be useful here:

  1. While the Pearson’s coefficient and Spearman’s rank coefficient measure the strength of an association between two variables, the chi-square test measures the significance of the association between two variables. What it tells you is whether the relationship you found in the sample is likely to exist in the population, or whether it is likely due to chance arising from sampling error.
  2. The chi-square test is sensitive to small frequencies in your contingency table. Generally, if a cell in your contingency table has a frequency of 5 or less, the chi-square test can lead to wrong conclusions. Also, the chi-square test should not be used if the sample size is less than 50.

I hope you now have a better understanding of how chi-square works and how it can be used for feature selection in machine learning. See you in my next article!

What is feature selection for classification?

Feature selection is one of the important and frequently used techniques in the pre-processing stage. It reduces the number of features involved in determining a target class value. The features that are discarded are usually irrelevant features and redundant data.

Why is feature selection important?

Feature selection is very important in preprocessing because it makes it possible to remove irrelevant features from the dataset, reducing processing time and also improving the accuracy rate [7].
