CrossTab Sparsity for Classification

classification
metric
feature selection
Can our metric help us make a classification problem work better?
Author

Jitin Kapila

Published

January 3, 2023

Cross Roads where everyone meets!

Introduction: A Journey into Data

Picture this: you’re standing on the icy shores of Antarctica, the wind whipping around you as you watch a colony of Palmer Penguins waddling about, oblivious to the data detective work you’re about to embark on. As a data science architect, you’re not just an observer; you’re a sleuth armed with algorithms and insights, ready to unravel the mysteries hidden within data. Today, we’ll transform raw numbers into powerful narratives using CrossTab Sparsity as our guiding compass. This blog post will demonstrate how this metric can sharpen classification tasks, shedding light on some fascinating datasets: the charming Palmer Penguins, the more sobering Obesity and Credit Card data, and more.

The Power of CrossTab Sparsity

What is CrossTab Sparsity?

CrossTab Sparsity isn’t just a fancy term that sounds good at dinner parties; it’s a statistical measure that helps us peer into the intricate relationships between categorical variables. Imagine it as a magnifying glass that reveals how different categories interact within a contingency table. Understanding these interactions is crucial in classification tasks, where the right features can make or break your model (and your day).
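To make this concrete, here is a minimal toy sketch of the idea, not the actual crosstab_sparsity implementation used later in this post: build a contingency table between a feature and the target, then score the fraction of empty cells. The helper name cell_sparsity and the zero-cell scoring rule are illustrative assumptions.

A toy sketch of the idea
import pandas as pd

def cell_sparsity(feature: pd.Series, target: pd.Series) -> float:
    # Toy stand-in for a crosstab-sparsity score: the share of empty cells
    # in the feature-vs-target contingency table.
    table = pd.crosstab(feature, target)          # counts per (category, class)
    return (table == 0).sum().sum() / table.size  # fraction of zero cells

# A feature whose categories line up with the classes yields a sparse table
toy = pd.DataFrame({'habit': ['a', 'a', 'b', 'b', 'c', 'c'],
                    'label': ['x', 'x', 'y', 'y', 'z', 'z']})
print(cell_sparsity(toy['habit'], toy['label']))  # 6 of 9 cells empty -> ~0.67

The intuition: the sparser the table, the more cleanly a feature’s categories partition the target classes, which is exactly the kind of signal a classifier can exploit.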

Why Does It Matter?

In the world of data science, especially in classification, selecting relevant features is like picking the right ingredients for a gourmet meal—get it wrong, and you might end up with something unpalatable. CrossTab Sparsity helps us achieve this by:

  • Highlighting Relationships: It’s like having a friend who always points out when two people are meant to be together—understanding how features interact with the target variable.
  • Streamlining Models: Reducing complexity by focusing on significant features means less time spent untangling spaghetti code.
  • Enhancing Interpretability: Making models easier to understand and explain to stakeholders is like translating tech jargon into plain English—everyone appreciates that!

Data Overview: Our Data at Work

The Datasets

Data 1: Estimation of Obesity Levels Based On Eating Habits and Physical Condition

A little bit about the data: This dataset, shared on 8/26/2019, looks at obesity levels in people from Mexico, Peru, and Colombia based on their eating habits and physical condition. It includes 2,111 records with 16 features and classifies individuals into obesity levels ranging from insufficient weight to obesity type III. Most of the data (77%) was generated synthetically, while the rest (23%) was collected directly from users online.

Data 2: Predict Students’ Dropout and Academic Success

A little bit about the data: This dataset, shared on 12/12/2021, looks at factors like students’ backgrounds, academic paths, and socio-economic status to predict whether they’ll drop out or succeed in their studies. With 4,424 records across 36 features, it covers students from different undergraduate programs. The goal is to use machine learning to spot at-risk students early, so schools can offer support. The data has been cleaned and has no missing values. It’s a classification task with three outcomes: dropout, still enrolled, or graduated.

Key Features:

  • Multiclass: Both datasets pose multiclass problems, with NObeyesdad and Target as the respective target columns.
  • Mixed Data Types: A good mix of categorical and continuous variables is available for use.
  • Sizeable: More than 2,000 rows are available for testing.

Exploratory Data Analysis (EDA): Setting the Stage

Before we dive into model creation, let’s explore our dataset through some quick EDA. Think of this as getting to know your guests before inviting them to a party.

EDA for Obesity Data

Here’s a brief code snippet to perform essential EDA on the Obesity dataset:

Loading data and generating basic descriptives
# Core imports used throughout this post
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score, f1_score)

# Load the Obesity data
raw_df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')
target = 'NObeyesdad'

# Load Students data

# Load Credit data
# raw_data = sm.datasets.get_rdataset("credit_data",'modeldata')
# raw_df = raw_data.data
# target = 'Status'

# # Load Palmer penguins data
# raw_data = sm.datasets.get_rdataset("penguins",'palmerpenguins')
# raw_df = raw_data.data
# target = 'species'


# # Load Credit data
# raw_data = sm.datasets.get_rdataset("CreditCard",'AER')
# raw_df = raw_data.data
# target = 'card'


# Setting things up for all the next steps
raw_df[target] = raw_df[target].astype('category')
print('No of data points available to work:', raw_df.shape)
display(raw_df.head())


# Summary statistics
display(raw_df.describe())
No of data points available to work: (2111, 17)
Gender Age Height Weight Famil_Hist_Owt FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
0 Female 21.0 1.62 64.0 yes no 2.0 3.0 Sometimes no 2.0 no 0.0 1.0 no Public_Transportation Normal_Weight
1 Female 21.0 1.52 56.0 yes no 3.0 3.0 Sometimes yes 3.0 yes 3.0 0.0 Sometimes Public_Transportation Normal_Weight
2 Male 23.0 1.80 77.0 yes no 2.0 3.0 Sometimes no 2.0 no 2.0 1.0 Frequently Public_Transportation Normal_Weight
3 Male 27.0 1.80 87.0 no no 3.0 3.0 Sometimes no 2.0 no 2.0 0.0 Frequently Walking Overweight_Level_I
4 Male 22.0 1.78 89.8 no no 2.0 1.0 Sometimes no 2.0 no 0.0 0.0 Sometimes Public_Transportation Overweight_Level_II
Age Height Weight FCVC NCP CH2O FAF TUE
count 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000
mean 24.312600 1.701677 86.586058 2.419043 2.685628 2.008011 1.010298 0.657866
std 6.345968 0.093305 26.191172 0.533927 0.778039 0.612953 0.850592 0.608927
min 14.000000 1.450000 39.000000 1.000000 1.000000 1.000000 0.000000 0.000000
25% 19.947192 1.630000 65.473343 2.000000 2.658738 1.584812 0.124505 0.000000
50% 22.777890 1.700499 83.000000 2.385502 3.000000 2.000000 1.000000 0.625350
75% 26.000000 1.768464 107.430682 3.000000 3.000000 2.477420 1.666678 1.000000
max 61.000000 1.980000 173.000000 3.000000 4.000000 3.000000 3.000000 2.000000

Target distribution

Target and Correlation
# Visualize target data distribution
plt.figure(figsize=(4, 3))
sns.countplot(data=raw_df, x=target, hue=target, palette='Set2')
plt.title(f'Distribution of {target} levels')
plt.xticks(rotation=45)
plt.show()

# Heatmap to check for correlations between numeric variables
corr = raw_df.corr('kendall', numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Kendall Correlation Heatmap')
plt.show()

EDA code
# Visualize the distribution of numerical variables
sns.pairplot(raw_df, hue=target, corner=True)
plt.show()




# Getting categorical columns
categorical_columns = raw_df.select_dtypes(include='object').columns

# Plot categorical variables with respect to the target variable
for col in categorical_columns:
    plt.figure(figsize=(12, 5))
    sns.countplot(data=raw_df,x=col, hue=target)
    plt.title(f"Countplot of {col} with respect to {target}")
    plt.show()

Model Creation: Establishing a Baseline

With our exploratory analysis complete, we’re ready to create our baseline model using logistic regression with Statsmodels. This initial model will serve as our reference point—like setting up a benchmark for your favorite video game.

Splitting data and training a default Multinomial Logit model on our data
data_df = raw_df.dropna().reset_index(drop=True)
data_df[target] = data_df[target].cat.codes
# X = data_df[['bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g']] 

data_df_test = data_df.sample(frac=0.1,random_state=3)
data_df_train = data_df.drop(data_df_test.index)

# Fit a multinomial logistic regression model using the formula API
# This essentially boils down to pairwise logistic regressions against a base class
logit_model = sm.MNLogit.from_formula(
    f"{target} ~ {' + '.join([col for col in data_df_train.columns if col != target])}", 
    data=data_df_train
).fit_regularized()
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.17057119619320013
            Iterations: 485
            Function evaluations: 639
            Gradient evaluations: 485
Display summary
display(logit_model.summary())
MNLogit Regression Results
Dep. Variable: NObeyesdad No. Observations: 1900
Model: MNLogit Df Residuals: 1756
Method: MLE Df Model: 138
Date: Sun, 07 Dec 2025 Pseudo R-squ.: 0.9122
Time: 16:56:59 Log-Likelihood: -324.09
converged: True LL-Null: -3691.8
Covariance Type: nonrobust LLR p-value: 0.000
NObeyesdad=1 coef std err z P>|z| [0.025 0.975]
Intercept -11.2903 3.25e+05 -3.48e-05 1.000 -6.36e+05 6.36e+05
Gender[T.Male] -3.4851 0.817 -4.268 0.000 -5.085 -1.885
Famil_Hist_Owt[T.yes] -0.8162 0.655 -1.246 0.213 -2.100 0.468
FAVC[T.yes] 0.2636 0.785 0.336 0.737 -1.275 1.802
CAEC[T.Frequently] -8.2402 2.312 -3.564 0.000 -12.771 -3.709
CAEC[T.Sometimes] -6.2226 2.232 -2.787 0.005 -10.598 -1.847
CAEC[T.no] -8.5977 2.889 -2.976 0.003 -14.260 -2.935
SMOKE[T.yes] 4.4919 3.115 1.442 0.149 -1.614 10.598
SCC[T.yes] -0.7294 1.447 -0.504 0.614 -3.565 2.106
CALC[T.Frequently] -12.6192 3.25e+05 -3.89e-05 1.000 -6.36e+05 6.36e+05
CALC[T.Sometimes] -13.2985 3.25e+05 -4.1e-05 1.000 -6.36e+05 6.36e+05
CALC[T.no] -14.1585 3.25e+05 -4.36e-05 1.000 -6.36e+05 6.36e+05
MTRANS[T.Bike] 15.8909 2489.580 0.006 0.995 -4863.596 4895.378
MTRANS[T.Motorbike] 3.9944 47.659 0.084 0.933 -89.416 97.405
MTRANS[T.Public_Transportation] 4.4914 0.995 4.514 0.000 2.541 6.441
MTRANS[T.Walking] 4.3554 1.502 2.900 0.004 1.412 7.299
Age 0.3721 0.097 3.833 0.000 0.182 0.562
Height -14.4208 4.118 -3.502 0.000 -22.492 -6.349
Weight 1.0786 0.146 7.378 0.000 0.792 1.365
FCVC -0.7754 0.429 -1.806 0.071 -1.617 0.066
NCP -1.7094 0.491 -3.480 0.001 -2.672 -0.747
CH2O -1.7291 0.578 -2.992 0.003 -2.862 -0.596
FAF -0.1924 0.280 -0.688 0.491 -0.740 0.356
TUE -0.9320 0.456 -2.043 0.041 -1.826 -0.038
NObeyesdad=2 coef std err z P>|z| [0.025 0.975]
Intercept 17.4309 nan nan nan nan nan
Gender[T.Male] -14.0384 1.983 -7.079 0.000 -17.925 -10.151
Famil_Hist_Owt[T.yes] 2.0527 1.717 1.195 0.232 -1.313 5.418
FAVC[T.yes] 0.9668 1.752 0.552 0.581 -2.468 4.401
CAEC[T.Frequently] -10.0052 4.352 -2.299 0.021 -18.534 -1.476
CAEC[T.Sometimes] -1.0074 3.427 -0.294 0.769 -7.724 5.709
CAEC[T.no] -0.4896 894.479 -0.001 1.000 -1753.637 1752.658
SMOKE[T.yes] 8.1410 4.013 2.029 0.042 0.277 16.005
SCC[T.yes] -7.6940 152.983 -0.050 0.960 -307.535 292.147
CALC[T.Frequently] -2.4516 nan nan nan nan nan
CALC[T.Sometimes] -7.5316 nan nan nan nan nan
CALC[T.no] -7.2301 nan nan nan nan nan
MTRANS[T.Bike] -11.9350 8.09e+07 -1.47e-07 1.000 -1.59e+08 1.59e+08
MTRANS[T.Motorbike] 10.9226 48.493 0.225 0.822 -84.123 105.968
MTRANS[T.Public_Transportation] 11.1756 1.750 6.387 0.000 7.746 14.605
MTRANS[T.Walking] 1.7281 2.759 0.626 0.531 -3.679 7.135
Age 0.8111 0.132 6.139 0.000 0.552 1.070
Height -184.0385 14.746 -12.481 0.000 -212.939 -155.138
Weight 3.9438 0.288 13.688 0.000 3.379 4.508
FCVC 0.8915 1.014 0.879 0.379 -1.095 2.878
NCP -1.1415 0.711 -1.605 0.109 -2.536 0.253
CH2O -1.5390 0.876 -1.756 0.079 -3.256 0.179
FAF -1.5295 0.591 -2.586 0.010 -2.689 -0.370
TUE -0.5710 0.840 -0.680 0.497 -2.217 1.075
NObeyesdad=3 coef std err z P>|z| [0.025 0.975]
Intercept -138.5068 1.47e+07 -9.41e-06 1.000 -2.89e+07 2.89e+07
Gender[T.Male] -16.6365 8.279 -2.010 0.044 -32.863 -0.410
Famil_Hist_Owt[T.yes] 2.3538 11.601 0.203 0.839 -20.384 25.092
FAVC[T.yes] -8.7785 5.476 -1.603 0.109 -19.512 1.955
CAEC[T.Frequently] -71.7022 nan nan nan nan nan
CAEC[T.Sometimes] -3.9034 4.734 -0.824 0.410 -13.183 5.376
CAEC[T.no] 7.7265 895.063 0.009 0.993 -1746.566 1762.019
SMOKE[T.yes] 3.5306 19.342 0.183 0.855 -34.379 41.440
SCC[T.yes] -19.4879 154.607 -0.126 0.900 -322.512 283.536
CALC[T.Frequently] -43.6020 1.48e+07 -2.95e-06 1.000 -2.9e+07 2.9e+07
CALC[T.Sometimes] -45.7496 1.47e+07 -3.11e-06 1.000 -2.88e+07 2.88e+07
CALC[T.no] -28.2183 1.43e+07 -1.97e-06 1.000 -2.81e+07 2.81e+07
MTRANS[T.Bike] 0.0376 nan nan nan nan nan
MTRANS[T.Motorbike] -2.3812 1.05e+11 -2.27e-11 1.000 -2.06e+11 2.06e+11
MTRANS[T.Public_Transportation] 22.5234 6.664 3.380 0.001 9.463 35.584
MTRANS[T.Walking] -5.3334 33.279 -0.160 0.873 -70.560 59.893
Age 2.5106 0.964 2.605 0.009 0.621 4.400
Height -278.9439 44.201 -6.311 0.000 -365.576 -192.312
Weight 7.1539 1.394 5.132 0.000 4.422 9.886
FCVC 4.1064 3.285 1.250 0.211 -2.333 10.546
NCP -1.5637 2.424 -0.645 0.519 -6.315 3.187
CH2O -13.4088 5.560 -2.412 0.016 -24.306 -2.511
FAF -9.8534 4.356 -2.262 0.024 -18.390 -1.316
TUE -5.6951 3.292 -1.730 0.084 -12.147 0.757
NObeyesdad=4 coef std err z P>|z| [0.025 0.975]
Intercept -87.3214 nan nan nan nan nan
Gender[T.Male] -200.3037 5.41e+07 -3.7e-06 1.000 -1.06e+08 1.06e+08
Famil_Hist_Owt[T.yes] -30.9252 nan nan nan nan nan
FAVC[T.yes] -53.1818 3.98e+07 -1.34e-06 1.000 -7.8e+07 7.8e+07
CAEC[T.Frequently] -28.5483 nan nan nan nan nan
CAEC[T.Sometimes] -21.5821 5.38e+07 -4.01e-07 1.000 -1.05e+08 1.05e+08
CAEC[T.no] -2.2000 4.62e+29 -4.76e-30 1.000 -9.06e+29 9.06e+29
SMOKE[T.yes] -6.0944 nan nan nan nan nan
SCC[T.yes] -12.3054 nan nan nan nan nan
CALC[T.Frequently] -6.2460 nan nan nan nan nan
CALC[T.Sometimes] -37.2004 2.12e+08 -1.76e-07 1.000 -4.15e+08 4.15e+08
CALC[T.no] -64.5032 nan nan nan nan nan
MTRANS[T.Bike] -0.2989 1.92e+53 -1.56e-54 1.000 -3.76e+53 3.76e+53
MTRANS[T.Motorbike] -0.2031 nan nan nan nan nan
MTRANS[T.Public_Transportation] -57.6854 7.04e+07 -8.2e-07 1.000 -1.38e+08 1.38e+08
MTRANS[T.Walking] -7.4464 2.03e+15 -3.66e-15 1.000 -3.98e+15 3.98e+15
Age -9.3747 103.246 -0.091 0.928 -211.733 192.984
Height -174.4727 592.866 -0.294 0.769 -1336.469 987.523
Weight 8.7405 35.222 0.248 0.804 -60.293 77.774
FCVC 49.0613 3.02e+04 0.002 0.999 -5.91e+04 5.92e+04
NCP 2.3650 4572.743 0.001 1.000 -8960.047 8964.777
CH2O -18.5809 34.347 -0.541 0.589 -85.900 48.738
FAF -65.1761 262.887 -0.248 0.804 -580.424 450.072
TUE -44.3721 285.217 -0.156 0.876 -603.387 514.643
NObeyesdad=5 coef std err z P>|z| [0.025 0.975]
Intercept -12.5683 3.25e+05 -3.87e-05 1.000 -6.36e+05 6.36e+05
Gender[T.Male] -6.8149 1.085 -6.282 0.000 -8.941 -4.689
Famil_Hist_Owt[T.yes] -0.5822 0.790 -0.737 0.461 -2.130 0.966
FAVC[T.yes] 2.6008 0.978 2.660 0.008 0.684 4.517
CAEC[T.Frequently] -7.2298 2.507 -2.884 0.004 -12.143 -2.316
CAEC[T.Sometimes] -2.8197 2.413 -1.168 0.243 -7.550 1.910
CAEC[T.no] -3.8181 3.143 -1.215 0.224 -9.977 2.341
SMOKE[T.yes] 3.1451 3.296 0.954 0.340 -3.314 9.604
SCC[T.yes] 2.1647 1.617 1.339 0.181 -1.004 5.334
CALC[T.Frequently] -9.0315 3.25e+05 -2.78e-05 1.000 -6.36e+05 6.36e+05
CALC[T.Sometimes] -9.1446 3.25e+05 -2.82e-05 1.000 -6.36e+05 6.36e+05
CALC[T.no] -10.7708 3.25e+05 -3.32e-05 1.000 -6.36e+05 6.36e+05
MTRANS[T.Bike] 19.0425 2489.581 0.008 0.994 -4860.446 4898.531
MTRANS[T.Motorbike] 1.6235 47.716 0.034 0.973 -91.899 95.146
MTRANS[T.Public_Transportation] 5.9777 1.209 4.946 0.000 3.609 8.346
MTRANS[T.Walking] 4.3596 1.776 2.454 0.014 0.878 7.841
Age 0.4878 0.106 4.597 0.000 0.280 0.696
Height -50.0157 6.721 -7.442 0.000 -63.188 -36.844
Weight 1.7920 0.168 10.651 0.000 1.462 2.122
FCVC -0.8369 0.601 -1.393 0.164 -2.014 0.341
NCP -1.4453 0.554 -2.608 0.009 -2.531 -0.359
CH2O -1.7648 0.679 -2.601 0.009 -3.095 -0.435
FAF -0.5613 0.374 -1.499 0.134 -1.295 0.172
TUE -0.7982 0.555 -1.439 0.150 -1.886 0.289
NObeyesdad=6 coef std err z P>|z| [0.025 0.975]
Intercept -2.1693 6.28e+06 -3.45e-07 1.000 -1.23e+07 1.23e+07
Gender[T.Male] -6.6857 1.207 -5.537 0.000 -9.052 -4.319
Famil_Hist_Owt[T.yes] 1.9296 1.076 1.793 0.073 -0.179 4.038
FAVC[T.yes] -0.4617 1.141 -0.405 0.686 -2.698 1.775
CAEC[T.Frequently] -5.5324 3.264 -1.695 0.090 -11.930 0.866
CAEC[T.Sometimes] 0.7854 3.044 0.258 0.796 -5.181 6.752
CAEC[T.no] 1.7141 3.934 0.436 0.663 -5.997 9.426
SMOKE[T.yes] 7.0398 3.570 1.972 0.049 0.043 14.036
SCC[T.yes] 1.3664 2.012 0.679 0.497 -2.577 5.309
CALC[T.Frequently] -2.1001 6.28e+06 -3.34e-07 1.000 -1.23e+07 1.23e+07
CALC[T.Sometimes] -4.6772 6.28e+06 -7.45e-07 1.000 -1.23e+07 1.23e+07
CALC[T.no] -4.1972 6.28e+06 -6.68e-07 1.000 -1.23e+07 1.23e+07
MTRANS[T.Bike] -21.8420 6.54e+09 -3.34e-09 1.000 -1.28e+10 1.28e+10
MTRANS[T.Motorbike] 3.2252 47.781 0.068 0.946 -90.423 96.873
MTRANS[T.Public_Transportation] 8.8055 1.416 6.219 0.000 6.030 11.581
MTRANS[T.Walking] 1.2540 2.256 0.556 0.578 -3.168 5.676
Age 0.7030 0.116 6.086 0.000 0.477 0.929
Height -104.6838 9.021 -11.605 0.000 -122.364 -87.003
Weight 2.6259 0.190 13.819 0.000 2.253 2.998
FCVC 0.1776 0.764 0.232 0.816 -1.320 1.675
NCP -1.8276 0.608 -3.007 0.003 -3.019 -0.636
CH2O -1.8930 0.757 -2.502 0.012 -3.376 -0.410
FAF -1.0280 0.438 -2.347 0.019 -1.887 -0.169
TUE 0.1282 0.670 0.191 0.848 -1.186 1.442

Evaluating Model Performance

To gauge our models’ effectiveness, we’ll employ various metrics such as accuracy, precision, recall, and F1-score. A confusion matrix will help visualize how well our models perform in classifying outcomes—think of it as a report card for your model!

Evaluating the Logit model
# Predict on test data
base_preds = logit_model.predict(data_df_test).idxmax(axis=1)
y_test = data_df_test[target]

# Evaluate the model
accuracy_orig = accuracy_score(y_test, base_preds)
report_orig = classification_report(y_test, base_preds)

print("Accuracy:", accuracy_orig)
print("Classification Report:")
print(report_orig)
Accuracy: 0.909952606635071
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.86      0.89        29
           1       0.86      0.83      0.84        29
           2       0.95      0.91      0.93        45
           3       0.94      0.97      0.95        31
           4       1.00      0.96      0.98        27
           5       0.83      0.90      0.86        21
           6       0.84      0.93      0.89        29

    accuracy                           0.91       211
   macro avg       0.91      0.91      0.91       211
weighted avg       0.91      0.91      0.91       211
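Before hunting for improvements, the confusion matrix promised above gives a quick visual report card. Here is a minimal sketch that renders one for the baseline predictions, reusing y_test and base_preds from the evaluation cell:

Confusion matrix sketch
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, base_preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.title('Baseline model confusion matrix')
plt.show()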

Looking for Some Improvements!

Feature Selection Using CrossTab Sparsity

Now comes the exciting part—using CrossTab Sparsity to refine our feature selection process! It’s like cleaning up your closet and only keeping the clothes that spark joy (thank you, Marie Kondo). 1

1 This is based on work in Unique Metric for Health Analysis with Optimization of Clustering Activity and Cross Comparison of Results from Different Approach. Paper Link

Code is here!

Standard Steps for Feature Selection

  1. Calculate CrossTab Sparsity: For each feature against the target variable.
  2. Select Features: Based on sparsity scores that indicate significant interactions with the target variable (a quick sketch of this step follows the list).
  3. Recreate Models: Train new models using only the selected features—less is often more!
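As a quick sketch of step 2 (the actual selection happens in the training cell further below): assuming the scores frame returned by crosstab_sparsity carries 'Columns' and 'metric' columns, as the plotting code below suggests, and that a higher metric is better, picking the top k features could look like this:

Top-k selection sketch
# Rank features by the CrossTab Sparsity metric and keep the top k.
# That higher metric values are better is an assumption of this sketch.
k = 6
ranked = results['scores'].sort_values('metric', ascending=False)
top_features = ranked['Columns'].head(k).tolist()
print(top_features)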

Here we go!!!

Doing what needs to be Done Code ;)
sns.set_style("white")
sns.set_context("paper")
# Calculating CrossTab sparsity for each column
results = crosstab_sparsity(data_df_train.iloc[:,:-1],data_df_train[target],numeric_bin='decile')

# Reshaping the scores to long format for plotting
df_long = pd.melt(results['scores'], id_vars=['Columns'], value_vars=['seggregation', 'explaination', 'metric'],
                  var_name='Metric', value_name='values')

# Adding jitter: small random noise to 'Columns' (x-axis)
# df_long['values_jittered'] = df_long['Value'] + np.random.uniform(-0.1, 0.1, size=len(df_long))

# Create a seaborn scatter plot with a clean palette and some transparency
plt.figure(figsize=(12, 5))
sns.scatterplot(x='Columns', y='values', hue='Metric', style='Metric',
        data=df_long, s=100, alpha=0.7, palette='deep')

# Title and labels
plt.title('Metrics by Columns', fontsize=16)
plt.xticks(rotation=45) 
plt.xlabel('Columns', fontsize=10)
plt.ylabel('Value', fontsize=10)

# Display the legend with a translucent frame for readability
plt.legend(title='Metric', loc='upper right', fancybox=True, framealpha=0.5)

# Show the plot
plt.tight_layout()
plt.show()
CSP calculated with decile for breaks!

Scores for 7 groups(s) is : 140.96057955229762

And drum rolls, please!!!

Using just the top 6 variables (the .loc[:5] slice below is label-inclusive) we get similar or better overall accuracy. This greatly simplifies the model and makes clear why some variables are not useful for modeling.

And finally training and evaluating with drum rolls
# Note: .loc[:5] is label-based and inclusive, so this keeps the top 6 scored columns
logit_model_rev = sm.MNLogit.from_formula(
    f"{target} ~ {' + '.join(results['scores'].loc[:5, 'Columns'].values)}",
    data=data_df_train
).fit_regularized()

# Predict on test data
challenger_preds = logit_model_rev.predict(data_df_test).idxmax(axis=1)
y_test = data_df_test[target]

# Evaluate the model
accuracy_new = accuracy_score(y_test, challenger_preds)
report_new = classification_report(y_test, challenger_preds)

print("Accuracy:", accuracy_new)
print("Classification Report:")
print(report_new)
Singular matrix E in LSQ subproblem    (Exit mode 5)
            Current function value: nan
            Iterations: 470
            Function evaluations: 1227
            Gradient evaluations: 470
Accuracy: 0.9383886255924171
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95        29
           1       0.93      0.93      0.93        29
           2       0.96      1.00      0.98        45
           3       0.93      0.90      0.92        31
           4       0.93      0.93      0.93        27
           5       0.90      0.90      0.90        21
           6       0.96      0.90      0.93        29

    accuracy                           0.94       211
   macro avg       0.94      0.93      0.93       211
weighted avg       0.94      0.94      0.94       211
/home/jitin/Documents/applications/perceptions/.venv/lib/python3.12/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
Code
display(logit_model_rev.summary())
MNLogit Regression Results
Dep. Variable: NObeyesdad No. Observations: 1900
Model: MNLogit Df Residuals: 1858
Method: MLE Df Model: 36
Date: Sun, 07 Dec 2025 Pseudo R-squ.: nan
Time: 16:57:01 Log-Likelihood: nan
converged: False LL-Null: -3691.8
Covariance Type: nonrobust LLR p-value: nan
NObeyesdad=1 coef std err z P>|z| [0.025 0.975]
Intercept 58.1248 nan nan nan nan nan
TUE 0.1130 nan nan nan nan nan
CH2O -0.8634 nan nan nan nan nan
FAF 0.1425 nan nan nan nan nan
Age 0.0579 nan nan nan nan nan
Height -76.5735 nan nan nan nan nan
Weight 1.3337 nan nan nan nan nan
NObeyesdad=2 coef std err z P>|z| [0.025 0.975]
Intercept 328.4616 nan nan nan nan nan
TUE 2.2275 nan nan nan nan nan
CH2O -1.4150 nan nan nan nan nan
FAF -1.3585 nan nan nan nan nan
Age 0.1537 nan nan nan nan nan
Height -426.3945 nan nan nan nan nan
Weight 5.3584 nan nan nan nan nan
NObeyesdad=3 coef std err z P>|z| [0.025 0.975]
Intercept 306.6447 nan nan nan nan nan
TUE -7.8630 nan nan nan nan nan
CH2O -21.0118 nan nan nan nan nan
FAF -11.3624 nan nan nan nan nan
Age 2.4017 nan nan nan nan nan
Height -710.3867 nan nan nan nan nan
Weight 10.1072 nan nan nan nan nan
NObeyesdad=4 coef std err z P>|z| [0.025 0.975]
Intercept 352.4249 nan nan nan nan nan
TUE -9.2469 nan nan nan nan nan
CH2O -20.6780 nan nan nan nan nan
FAF -14.7525 nan nan nan nan nan
Age 2.1487 nan nan nan nan nan
Height -758.2318 nan nan nan nan nan
Weight 10.5011 nan nan nan nan nan
NObeyesdad=5 coef std err z P>|z| [0.025 0.975]
Intercept 126.2892 nan nan nan nan nan
TUE 0.5832 nan nan nan nan nan
CH2O -0.8764 nan nan nan nan nan
FAF -0.1920 nan nan nan nan nan
Age 0.0719 nan nan nan nan nan
Height -160.2982 nan nan nan nan nan
Weight 2.3663 nan nan nan nan nan
NObeyesdad=6 coef std err z P>|z| [0.025 0.975]
Intercept 207.3760 nan nan nan nan nan
TUE 1.6561 nan nan nan nan nan
CH2O -0.6583 nan nan nan nan nan
FAF -0.1243 nan nan nan nan nan
Age 0.1042 nan nan nan nan nan
Height -266.6050 nan nan nan nan nan
Weight 3.6160 nan nan nan nan nan

Impact on Model Accuracy

After applying feature selection based on CrossTab Sparsity, we’ll compare the accuracy of our new models against our baseline models. This comparison will reveal how effectively CrossTab Sparsity enhances classification performance.

Results and Discussion: Unveiling Insights

Model Comparison Table

After implementing CrossTab Sparsity in our feature selection process, let’s take a look at the results:

Comparison Code
metrics = {
    "Metric": ["Accuracy", "Precision", "Recall", "F1-Score"],
    "Baseline Model with all Parameters": [
        accuracy_score(y_test, base_preds),
        precision_score(y_test, base_preds, average='weighted'),
        recall_score(y_test, base_preds, average='weighted'),
        f1_score(y_test, base_preds, average='weighted'),
    ],
    "Challenger Model with only 5 Variables": [
        accuracy_score(y_test, challenger_preds),
        precision_score(y_test, challenger_preds, average='weighted'),
        recall_score(y_test, challenger_preds, average='weighted'),
        f1_score(y_test, challenger_preds, average='weighted'),
    ]
}
display(pd.DataFrame(metrics).round(4).set_index('Metric').T)
Metric Accuracy Precision Recall F1-Score
Baseline Model with all Parameters 0.9100 0.9123 0.9100 0.9103
Challenger Model with only 6 Variables 0.9384 0.9384 0.9384 0.9381

Insights Gained

Through this analysis, several key insights emerge:

Similar (indeed better) accuracy after cutting the feature count from 16 to 6, i.e. a 62.5% reduction.
  1. Feature Interactions Matter: The selected features based on CrossTab Sparsity significantly improved model accuracy—like finding out which ingredients make your favorite dish even better!
  2. Simplicity is Key: By focusing on relevant features, we enhance accuracy while simplifying model interpretation—because nobody likes unnecessary complexity.
  3. Real-World Applications: These findings have practical implications in fields such as healthcare and education, where classification plays a critical role, helping us make better decisions for the people behind the data.

Conclusion: The Road Ahead

In conclusion, this blog has illustrated how CrossTab Sparsity can be a game-changer in classification tasks using the Obesity dataset. By leveraging this metric for feature selection, we achieved notable improvements in model performance—proof that sometimes less really is more!

Future Work: Expanding Horizons

As we look ahead, there are exciting avenues to explore:

  • Investigating regression problems using CrossTab Sparsity.
  • Comparing its effectiveness with other feature selection methods such as Recursive Feature Elimination (RFE).

By continuing this journey into data science, we not only enhance our technical skills but also contribute valuable insights that can drive meaningful change in various industries.
