CrossTab Sparsity for Classification

Can our metric help us make a classification problem work better?
classification
metric
feature selection
Author

Jitin Kapila

Published

January 3, 2023

Cross Roads where everyone meets!

Introduction: A Journey into Data

Picture this: you’re standing on the icy shores of Antarctica, the wind whipping around you as you watch a colony of Palmer Penguins waddling about, oblivious to the data detective work you’re about to embark on. As a data science architect, you’re not just an observer; you’re a sleuth armed with algorithms and insights, ready to unravel the mysteries hidden within data. Today, we’ll transform raw numbers into powerful narratives using CrossTab Sparsity as our guiding compass. This blog post will demonstrate how this metric can revolutionize classification tasks, shedding light on some fascinating datasets: the charming Palmer Penguins, plus more serious data on obesity, student outcomes, and credit cards.

The Power of CrossTab Sparsity

What is CrossTab Sparsity?

CrossTab Sparsity isn’t just a fancy term that sounds good at dinner parties; it’s a statistical measure that helps us peer into the intricate relationships between categorical variables. Imagine it as a magnifying glass that reveals how different categories interact within a contingency table. Understanding these interactions is crucial in classification tasks, where the right features can make or break your model (and your day).
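
The exact formulation lives in the paper linked later in this post, but the intuition is easy to sketch. As a rough, illustrative stand-in (not the paper’s metric), you can score a feature by how sparse its contingency table against the target is: the more cleanly the feature’s categories separate the classes, the more empty cells the table has.

A toy sketch of the idea (illustrative only)
# A minimal sketch, NOT the exact metric from the paper: score a feature
# by the fraction of empty cells in its contingency table vs the target.
import pandas as pd

def crosstab_sparsity_sketch(feature: pd.Series, target: pd.Series) -> float:
    """Fraction of zero cells in the feature-vs-target contingency table."""
    table = pd.crosstab(feature, target)
    return (table == 0).to_numpy().mean()

# Toy example: categories that line up one-to-one with the target classes
# give a mostly-zero table, i.e. a high sparsity score.
toy = pd.DataFrame({
    'feature': ['a', 'a', 'b', 'b', 'c', 'c'],
    'target':  ['x', 'x', 'y', 'y', 'z', 'z'],
})
print(crosstab_sparsity_sketch(toy['feature'], toy['target']))  # ~0.667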

Why Does It Matter?

In the world of data science, especially in classification, selecting relevant features is like picking the right ingredients for a gourmet meal—get it wrong, and you might end up with something unpalatable. CrossTab Sparsity helps us achieve this by:

  • Highlighting Relationships: it shows how features interact with the target variable, like a friend who always points out when two people are meant to be together.
  • Streamlining Models: Reducing complexity by focusing on significant features means less time spent untangling spaghetti code.
  • Enhancing Interpretability: Making models easier to understand and explain to stakeholders is like translating tech jargon into plain English—everyone appreciates that!

Data Overview: Our Data at Work

The Datasets

Data 1: Estimation of Obesity Levels Based On Eating Habits and Physical Condition

A little bit about the data: This dataset, shared on 8/26/2019, looks at obesity levels in people from Mexico, Peru, and Colombia based on their eating habits and physical condition. It includes 2,111 records with 16 features and classifies individuals into obesity levels ranging from insufficient weight to obesity type III. Most of the data (77%) was generated synthetically, while the rest (23%) was collected directly from users online.

Data 2: Predict Students’ Dropout and Academic Success

A little bit about the data: This dataset, shared on 12/12/2021, looks at factors like students’ backgrounds, academic path, and socio-economic status to predict whether they’ll drop out or succeed in their studies. With 4,424 records across 36 features, it covers students from different undergraduate programs. The goal is to use machine learning to spot at-risk students early so schools can offer support. The data has been cleaned and has no missing values. It’s a classification task with three outcomes: dropout, still enrolled, or graduated.

Key Features:

  • Multiclass: Both datasets pose multiclass problems, with targets in the NObeyesdad and Target columns respectively.
  • Mixed Data Types: A good mix of categorical and continuous variables is available for use.
  • Sizeable: More than 2,000 rows are available for testing.

Exploratory Data Analysis (EDA): Setting the Stage

Before we dive into model creation, let’s explore our dataset through some quick EDA. Think of this as getting to know your guests before inviting them to a party.

EDA for Obesity Data

Here’s a brief code snippet to perform essential EDA on the Obesity dataset:

Loading data and generating basic descriptives
# Core imports used throughout this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from IPython.display import display

# Load the Obesity data
raw_df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')
target = 'NObeyesdad'

# Load Students data

# Load Credit data
# raw_data = sm.datasets.get_rdataset("credit_data",'modeldata')
# raw_df = raw_data.data
# target = 'Status'

# # Load Palmer penguins data
# raw_data = sm.datasets.get_rdataset("penguins",'palmerpenguins')
# raw_df = raw_data.data
# target = 'species'


# # Load Credit data
# raw_data = sm.datasets.get_rdataset("CreditCard",'AER')
# raw_df = raw_data.data
# target = 'card'


# Setting things up for all the next steps
raw_df[target] = raw_df[target].astype('category')
print('No of data points available to work:', raw_df.shape)
display(raw_df.head())


# Summary statistics
display(raw_df.describe())
No of data points available to work: (2111, 17)
Gender Age Height Weight Famil_Hist_Owt FAVC FCVC NCP CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS NObeyesdad
0 Female 21.0 1.62 64.0 yes no 2.0 3.0 Sometimes no 2.0 no 0.0 1.0 no Public_Transportation Normal_Weight
1 Female 21.0 1.52 56.0 yes no 3.0 3.0 Sometimes yes 3.0 yes 3.0 0.0 Sometimes Public_Transportation Normal_Weight
2 Male 23.0 1.80 77.0 yes no 2.0 3.0 Sometimes no 2.0 no 2.0 1.0 Frequently Public_Transportation Normal_Weight
3 Male 27.0 1.80 87.0 no no 3.0 3.0 Sometimes no 2.0 no 2.0 0.0 Frequently Walking Overweight_Level_I
4 Male 22.0 1.78 89.8 no no 2.0 1.0 Sometimes no 2.0 no 0.0 0.0 Sometimes Public_Transportation Overweight_Level_II
Age Height Weight FCVC NCP CH2O FAF TUE
count 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000 2111.000000
mean 24.312600 1.701677 86.586058 2.419043 2.685628 2.008011 1.010298 0.657866
std 6.345968 0.093305 26.191172 0.533927 0.778039 0.612953 0.850592 0.608927
min 14.000000 1.450000 39.000000 1.000000 1.000000 1.000000 0.000000 0.000000
25% 19.947192 1.630000 65.473343 2.000000 2.658738 1.584812 0.124505 0.000000
50% 22.777890 1.700499 83.000000 2.385502 3.000000 2.000000 1.000000 0.625350
75% 26.000000 1.768464 107.430682 3.000000 3.000000 2.477420 1.666678 1.000000
max 61.000000 1.980000 173.000000 3.000000 4.000000 3.000000 3.000000 2.000000

Target distribution

Target and Correlation
# Visualize target data distribution
plt.figure(figsize=(4, 3))
sns.countplot(data=raw_df, x=target, hue=target, palette='Set2')
plt.title(f'Distribution of {target} levels')
plt.xticks(rotation=45)
plt.show()

# Heatmap to check for correlations between numeric variables
corr = raw_df.corr('kendall',numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Kendall Correlation Heatmap')
plt.show()

EDA code
# Visualize the distribution of numerical variables
sns.pairplot(raw_df, hue=target, corner=True)
plt.show()




# Getting categorical columns
categorical_columns = raw_df.select_dtypes(include='object').columns

# Plot categorical variables with respect to the target variable
for col in categorical_columns:
    plt.figure(figsize=(12, 5))
    sns.countplot(data=raw_df,x=col, hue=target)
    plt.title(f"Countplot of {col} with respect to {target}")
    plt.show()

Model Creation: Establishing a Baseline

With our exploratory analysis complete, we’re ready to create our baseline model using logistic regression with Statsmodels. This initial model will serve as our reference point—like setting up a benchmark for your favorite video game.

Splitting data and training a default Multinomial Logit model on our data
data_df = raw_df.dropna().reset_index(drop=True)
data_df[target] = data_df[target].cat.codes
# X = data_df[['bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g']]  # (for the penguins data)

data_df_test = data_df.sample(frac=0.1, random_state=3)
data_df_train = data_df.drop(data_df_test.index)

# Using the MN logistic regression model via the formula API
# This essentially boils down to pairwise logistic regressions against a reference class
logit_model = sm.MNLogit.from_formula(
    f"{target} ~ {' + '.join([col for col in data_df_train.columns if col != target])}",
    data=data_df_train
).fit_regularized()
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.17113347578942742
            Iterations: 489
            Function evaluations: 670
            Gradient evaluations: 489
Display summary
display(logit_model.summary())
MNLogit Regression Results
Dep. Variable: NObeyesdad No. Observations: 1900
Model: MNLogit Df Residuals: 1756
Method: MLE Df Model: 138
Date: Thu, 27 Feb 2025 Pseudo R-squ.: 0.9119
Time: 03:18:53 Log-Likelihood: -325.15
converged: True LL-Null: -3691.8
Covariance Type: nonrobust LLR p-value: 0.000
NObeyesdad=1 coef std err z P>|z| [0.025 0.975]
Intercept -11.3439 3.45e+05 -3.29e-05 1.000 -6.76e+05 6.76e+05
Gender[T.Male] -3.4606 0.819 -4.224 0.000 -5.066 -1.855
Famil_Hist_Owt[T.yes] -0.8874 0.658 -1.349 0.177 -2.177 0.402
FAVC[T.yes] 0.2631 0.782 0.337 0.736 -1.269 1.795
CAEC[T.Frequently] -8.2219 2.342 -3.511 0.000 -12.811 -3.632
CAEC[T.Sometimes] -6.2475 2.265 -2.758 0.006 -10.687 -1.808
CAEC[T.no] -8.6341 2.916 -2.961 0.003 -14.349 -2.919
SMOKE[T.yes] 4.5048 3.105 1.451 0.147 -1.582 10.591
SCC[T.yes] -0.7063 1.458 -0.484 0.628 -3.565 2.152
CALC[T.Frequently] -12.6173 3.45e+05 -3.66e-05 1.000 -6.76e+05 6.76e+05
CALC[T.Sometimes] -13.3244 3.45e+05 -3.86e-05 1.000 -6.76e+05 6.76e+05
CALC[T.no] -14.1980 3.45e+05 -4.12e-05 1.000 -6.76e+05 6.76e+05
MTRANS[T.Bike] 15.8821 2529.381 0.006 0.995 -4941.614 4973.378
MTRANS[T.Motorbike] 4.0050 47.345 0.085 0.933 -88.790 96.800
MTRANS[T.Public_Transportation] 4.5116 1.001 4.505 0.000 2.549 6.474
MTRANS[T.Walking] 4.3989 1.507 2.918 0.004 1.445 7.353
Age 0.3779 0.098 3.858 0.000 0.186 0.570
Height -14.4182 4.123 -3.497 0.000 -22.499 -6.338
Weight 1.0784 0.146 7.384 0.000 0.792 1.365
FCVC -0.7676 0.428 -1.793 0.073 -1.607 0.072
NCP -1.7199 0.489 -3.516 0.000 -2.679 -0.761
CH2O -1.7255 0.578 -2.985 0.003 -2.859 -0.592
FAF -0.1753 0.281 -0.624 0.533 -0.726 0.375
TUE -0.9735 0.458 -2.124 0.034 -1.872 -0.075
NObeyesdad=2 coef std err z P>|z| [0.025 0.975]
Intercept 17.3832 nan nan nan nan nan
Gender[T.Male] -13.9964 1.976 -7.083 0.000 -17.869 -10.123
Famil_Hist_Owt[T.yes] 2.0850 1.721 1.212 0.226 -1.288 5.458
FAVC[T.yes] 1.0223 1.765 0.579 0.562 -2.437 4.482
CAEC[T.Frequently] -10.0658 4.392 -2.292 0.022 -18.674 -1.458
CAEC[T.Sometimes] -1.0233 3.443 -0.297 0.766 -7.771 5.724
CAEC[T.no] -0.4821 977.119 -0.000 1.000 -1915.601 1914.637
SMOKE[T.yes] 8.1449 4.011 2.030 0.042 0.283 16.007
SCC[T.yes] -7.6939 155.443 -0.049 0.961 -312.356 296.968
CALC[T.Frequently] -2.4712 nan nan nan nan nan
CALC[T.Sometimes] -7.5357 nan nan nan nan nan
CALC[T.no] -7.2634 nan nan nan nan nan
MTRANS[T.Bike] -11.9360 1.16e+08 -1.03e-07 1.000 -2.27e+08 2.27e+08
MTRANS[T.Motorbike] 10.9302 48.258 0.226 0.821 -83.653 105.513
MTRANS[T.Public_Transportation] 11.2094 1.756 6.383 0.000 7.767 14.651
MTRANS[T.Walking] 1.7141 2.758 0.622 0.534 -3.691 7.119
Age 0.8105 0.133 6.108 0.000 0.550 1.071
Height -184.0655 14.785 -12.450 0.000 -213.043 -155.088
Weight 3.9430 0.288 13.681 0.000 3.378 4.508
FCVC 0.8899 1.009 0.882 0.378 -1.088 2.867
NCP -1.1103 0.710 -1.564 0.118 -2.502 0.281
CH2O -1.5409 0.877 -1.757 0.079 -3.259 0.178
FAF -1.4599 0.593 -2.461 0.014 -2.622 -0.297
TUE -0.5909 0.840 -0.704 0.482 -2.237 1.055
NObeyesdad=3 coef std err z P>|z| [0.025 0.975]
Intercept -138.5283 nan nan nan nan nan
Gender[T.Male] -16.6646 8.382 -1.988 0.047 -33.094 -0.235
Famil_Hist_Owt[T.yes] 2.3697 11.592 0.204 0.838 -20.350 25.090
FAVC[T.yes] -8.7847 5.440 -1.615 0.106 -19.447 1.878
CAEC[T.Frequently] -71.7139 2.13e+08 -3.37e-07 1.000 -4.17e+08 4.17e+08
CAEC[T.Sometimes] -3.9355 4.749 -0.829 0.407 -13.244 5.373
CAEC[T.no] 7.7274 977.625 0.008 0.994 -1908.382 1923.836
SMOKE[T.yes] 3.5336 19.117 0.185 0.853 -33.935 41.002
SCC[T.yes] -19.4881 156.920 -0.124 0.901 -327.046 288.070
CALC[T.Frequently] -43.6047 nan nan nan nan nan
CALC[T.Sometimes] -45.7392 nan nan nan nan nan
CALC[T.no] -28.2608 nan nan nan nan nan
MTRANS[T.Bike] 0.0374 nan nan nan nan nan
MTRANS[T.Motorbike] -2.3922 1.05e+11 -2.28e-11 1.000 -2.05e+11 2.05e+11
MTRANS[T.Public_Transportation] 22.6192 6.634 3.410 0.001 9.618 35.621
MTRANS[T.Walking] -5.3362 34.114 -0.156 0.876 -72.198 61.526
Age 2.5098 0.960 2.615 0.009 0.629 4.391
Height -278.8861 44.172 -6.314 0.000 -365.461 -192.311
Weight 7.1526 1.391 5.141 0.000 4.426 9.879
FCVC 4.1479 3.269 1.269 0.204 -2.258 10.554
NCP -1.5833 2.388 -0.663 0.507 -6.264 3.098
CH2O -13.3811 5.527 -2.421 0.015 -24.213 -2.549
FAF -9.8066 4.355 -2.252 0.024 -18.342 -1.271
TUE -5.7061 3.289 -1.735 0.083 -12.152 0.739
NObeyesdad=4 coef std err z P>|z| [0.025 0.975]
Intercept -87.3253 6.43e+07 -1.36e-06 1.000 -1.26e+08 1.26e+08
Gender[T.Male] -200.2957 4.25e+07 -4.71e-06 1.000 -8.33e+07 8.33e+07
Famil_Hist_Owt[T.yes] -30.9113 nan nan nan nan nan
FAVC[T.yes] -53.1787 nan nan nan nan nan
CAEC[T.Frequently] -28.5507 2.16e+08 -1.32e-07 1.000 -4.23e+08 4.23e+08
CAEC[T.Sometimes] -21.5727 4.19e+07 -5.15e-07 1.000 -8.21e+07 8.21e+07
CAEC[T.no] -2.1999 1.31e+29 -1.69e-29 1.000 -2.56e+29 2.56e+29
SMOKE[T.yes] -6.0935 9.24e+08 -6.59e-09 1.000 -1.81e+09 1.81e+09
SCC[T.yes] -12.3062 nan nan nan nan nan
CALC[T.Frequently] -6.2458 1.59e+10 -3.93e-10 1.000 -3.12e+10 3.12e+10
CALC[T.Sometimes] -37.1969 nan nan nan nan nan
CALC[T.no] -64.5072 nan nan nan nan nan
MTRANS[T.Bike] -0.2989 1.92e+53 -1.56e-54 1.000 -3.76e+53 3.76e+53
MTRANS[T.Motorbike] -0.2031 3.86e+35 -5.26e-37 1.000 -7.57e+35 7.57e+35
MTRANS[T.Public_Transportation] -57.6929 5.78e+07 -9.98e-07 1.000 -1.13e+08 1.13e+08
MTRANS[T.Walking] -7.4454 2.11e+15 -3.52e-15 1.000 -4.14e+15 4.14e+15
Age -9.3711 100.732 -0.093 0.926 -206.803 188.061
Height -174.4791 585.777 -0.298 0.766 -1322.581 973.623
Weight 8.7401 34.352 0.254 0.799 -58.588 76.068
FCVC 49.0843 3.05e+04 0.002 0.999 -5.98e+04 5.99e+04
NCP 2.3456 4587.346 0.001 1.000 -8988.688 8993.379
CH2O -18.5876 33.678 -0.552 0.581 -84.595 47.420
FAF -65.1863 257.967 -0.253 0.801 -570.792 440.420
TUE -44.3687 279.949 -0.158 0.874 -593.058 504.321
NObeyesdad=5 coef std err z P>|z| [0.025 0.975]
Intercept -12.5582 3.45e+05 -3.64e-05 1.000 -6.76e+05 6.76e+05
Gender[T.Male] -6.8927 1.091 -6.319 0.000 -9.031 -4.755
Famil_Hist_Owt[T.yes] -0.5826 0.791 -0.736 0.462 -2.134 0.969
FAVC[T.yes] 2.6029 0.975 2.670 0.008 0.692 4.514
CAEC[T.Frequently] -7.2782 2.533 -2.873 0.004 -12.243 -2.314
CAEC[T.Sometimes] -2.8841 2.442 -1.181 0.238 -7.671 1.903
CAEC[T.no] -3.8084 3.166 -1.203 0.229 -10.013 2.397
SMOKE[T.yes] 3.1147 3.291 0.947 0.344 -3.335 9.565
SCC[T.yes] 2.1332 1.626 1.312 0.190 -1.054 5.320
CALC[T.Frequently] -9.0218 3.45e+05 -2.61e-05 1.000 -6.76e+05 6.76e+05
CALC[T.Sometimes] -9.1622 3.45e+05 -2.66e-05 1.000 -6.76e+05 6.76e+05
CALC[T.no] -10.7609 3.45e+05 -3.12e-05 1.000 -6.76e+05 6.76e+05
MTRANS[T.Bike] 19.0539 2529.381 0.008 0.994 -4938.442 4976.550
MTRANS[T.Motorbike] 1.6649 47.401 0.035 0.972 -91.240 94.570
MTRANS[T.Public_Transportation] 6.0083 1.212 4.956 0.000 3.632 8.385
MTRANS[T.Walking] 4.3751 1.779 2.460 0.014 0.889 7.861
Age 0.4896 0.107 4.589 0.000 0.281 0.699
Height -49.9784 6.729 -7.427 0.000 -63.167 -36.790
Weight 1.7920 0.168 10.650 0.000 1.462 2.122
FCVC -0.8144 0.599 -1.359 0.174 -1.989 0.360
NCP -1.4253 0.552 -2.580 0.010 -2.508 -0.343
CH2O -1.8250 0.678 -2.690 0.007 -3.155 -0.495
FAF -0.5296 0.375 -1.412 0.158 -1.265 0.206
TUE -0.8409 0.557 -1.510 0.131 -1.932 0.250
NObeyesdad=6 coef std err z P>|z| [0.025 0.975]
Intercept -2.1495 1.51e+06 -1.42e-06 1.000 -2.96e+06 2.96e+06
Gender[T.Male] -6.7717 1.213 -5.583 0.000 -9.149 -4.395
Famil_Hist_Owt[T.yes] 1.9277 1.078 1.789 0.074 -0.185 4.040
FAVC[T.yes] -0.4390 1.141 -0.385 0.700 -2.676 1.798
CAEC[T.Frequently] -5.4475 3.295 -1.653 0.098 -11.906 1.011
CAEC[T.Sometimes] 0.8345 3.075 0.271 0.786 -5.192 6.861
CAEC[T.no] 1.6818 3.972 0.423 0.672 -6.103 9.466
SMOKE[T.yes] 7.0586 3.567 1.979 0.048 0.068 14.049
SCC[T.yes] 1.3350 2.021 0.661 0.509 -2.625 5.295
CALC[T.Frequently] -2.1230 1.51e+06 -1.41e-06 1.000 -2.96e+06 2.96e+06
CALC[T.Sometimes] -4.6506 1.51e+06 -3.08e-06 1.000 -2.96e+06 2.96e+06
CALC[T.no] -4.1703 1.51e+06 -2.76e-06 1.000 -2.96e+06 2.96e+06
MTRANS[T.Bike] -21.8443 6.4e+09 -3.41e-09 1.000 -1.26e+10 1.26e+10
MTRANS[T.Motorbike] 3.1683 47.467 0.067 0.947 -89.865 96.202
MTRANS[T.Public_Transportation] 8.7749 1.423 6.165 0.000 5.985 11.564
MTRANS[T.Walking] 1.2621 2.258 0.559 0.576 -3.163 5.687
Age 0.6974 0.116 6.002 0.000 0.470 0.925
Height -104.7093 9.038 -11.585 0.000 -122.424 -86.995
Weight 2.6268 0.190 13.821 0.000 2.254 2.999
FCVC 0.2192 0.764 0.287 0.774 -1.278 1.716
NCP -1.8144 0.606 -2.992 0.003 -3.003 -0.626
CH2O -1.9110 0.757 -2.525 0.012 -3.394 -0.428
FAF -0.9928 0.439 -2.264 0.024 -1.852 -0.133
TUE 0.0701 0.671 0.104 0.917 -1.246 1.386

Evaluating Model Performance

To gauge our models’ effectiveness, we’ll employ various metrics such as accuracy, precision, recall, and F1-score. A confusion matrix will help visualize how well our models perform in classifying outcomes—think of it as a report card for your model!

Evaluating the Logit model
# Metrics from scikit-learn
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score, f1_score)

# Predict on test data
base_preds = logit_model.predict(data_df_test).idxmax(axis=1)
y_test = data_df_test[target]

# Evaluate the model
accuracy_orig = accuracy_score(y_test, base_preds)
report_orig = classification_report(y_test, base_preds)

print("Accuracy:", accuracy_orig)
print("Classification Report:")
print(report_orig)
Accuracy: 0.909952606635071
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.86      0.89        29
           1       0.86      0.83      0.84        29
           2       0.95      0.91      0.93        45
           3       0.94      0.97      0.95        31
           4       1.00      0.96      0.98        27
           5       0.83      0.90      0.86        21
           6       0.84      0.93      0.89        29

    accuracy                           0.91       211
   macro avg       0.91      0.91      0.91       211
weighted avg       0.91      0.91      0.91       211
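
The report above gives the per-class numbers; to also draw the confusion matrix mentioned earlier, a short sketch like the following works (it reuses y_test and base_preds from the evaluation block, and confusion_matrix is scikit-learn’s):

Sketching the confusion matrix
# Visualize the baseline model's confusion matrix as a heatmap
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, base_preds)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted class')
plt.ylabel('True class')
plt.title('Baseline model confusion matrix')
plt.show()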

Looking for some Improvements!

Feature Selection Using CrossTab Sparsity

Now comes the exciting part—using CrossTab Sparsity to refine our feature selection process! It’s like cleaning up your closet and only keeping the clothes that spark joy (thank you, Marie Kondo).

1 This is based on work in Unique Metric for Health Analysis with Optimization of Clustering Activity and Cross Comparison of Results from Different Approach. Paper Link

Code is here!

Standard Steps for Feature Selection

  1. Calculate CrossTab Sparsity: For each feature against the target variable.
  2. Select Features: Based on sparsity scores that indicate significant interactions with the target variable.
  3. Recreate Models: Train new models using only the selected features—less is often more!

Here we go!!!

Doing what needs to be done ;)
sns.set_style("white")
sns.set_context("paper")
# Calculating CrossTab sparsity for each column
# (crosstab_sparsity is the author's implementation of the metric; see the paper linked above)
results = crosstab_sparsity(data_df_train.iloc[:,:-1], data_df_train[target], numeric_bin='decile')

# presenting results for consumption
df_long = pd.melt(results['scores'], id_vars=['Columns'], value_vars=['seggregation', 'explaination', 'metric'],
                  var_name='Metric', value_name='values')

# Adding jitter: small random noise to 'Columns' (x-axis)
# df_long['values_jittered'] = df_long['Value'] + np.random.uniform(-0.1, 0.1, size=len(df_long))

# Create a seaborn scatter plot with jitter, more professional color palette, and transparency
plt.figure(figsize=(12, 5))
sns.scatterplot(x='Columns', y='values', hue='Metric', style='Metric',
        data=df_long, s=100, alpha=0.7, palette='deep')

# Title and labels
plt.title('Metrics by Columns', fontsize=16)
plt.xticks(rotation=45) 
plt.xlabel('Columns', fontsize=10)
plt.ylabel('Value', fontsize=10)

# Display legend outside the plot for better readability
plt.legend(title='Metric', loc='upper right', fancybox=True, framealpha=0.5)

# Show the plot
plt.tight_layout()
plt.show()
CSP calculated with decile for breaks!

Scores for 7 groups(s) is : 140.96057955229762
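
A quick note on the numeric_bin='decile' argument: numeric columns have to be discretized before a contingency table makes sense. The author’s function handles this internally; as an illustration of what decile binning looks like, something along these lines would do it (pd.qcut is pandas’ quantile binning):

Illustrative decile binning for a numeric column
# Sketch only: bin a numeric column into (up to) 10 quantile buckets
# before crosstabbing it against the target.
binned_age = pd.qcut(data_df_train['Age'], q=10, duplicates='drop')
display(pd.crosstab(binned_age, data_df_train[target]))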

And drum rolls, please!!!

Using just the top six variables we get similar or better overall accuracy. This greatly simplifies the model and makes it clear why some variables are not useful for modeling.

And finally, training and evaluating (drum rolls!)
# Note: .loc[:5] is label-based and inclusive, so this selects the six
# top-scoring columns from the results table
logit_model_rev = sm.MNLogit.from_formula(
    f"{target} ~ {' + '.join(results['scores'].loc[:5,'Columns'].values)}",
    data=data_df_train
).fit_regularized()

# Predict on test data
challenger_preds = logit_model_rev.predict(data_df_test).idxmax(axis=1)
y_test = data_df_test[target]

# Evaluate the model
accuracy_new = accuracy_score(y_test, challenger_preds)
report_new = classification_report(y_test, challenger_preds)

print("Accuracy:", accuracy_new)
print("Classification Report:")
print(report_new)
Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.174380345428068
            Iterations: 417
            Function evaluations: 662
            Gradient evaluations: 417
Accuracy: 0.9383886255924171
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95        29
           1       0.93      0.93      0.93        29
           2       0.96      1.00      0.98        45
           3       0.93      0.90      0.92        31
           4       0.93      0.93      0.93        27
           5       0.90      0.90      0.90        21
           6       0.96      0.90      0.93        29

    accuracy                           0.94       211
   macro avg       0.94      0.93      0.93       211
weighted avg       0.94      0.94      0.94       211
Code
display(logit_model_rev.summary())
MNLogit Regression Results
Dep. Variable: NObeyesdad No. Observations: 1900
Model: MNLogit Df Residuals: 1858
Method: MLE Df Model: 36
Date: Thu, 27 Feb 2025 Pseudo R-squ.: 0.9103
Time: 03:18:54 Log-Likelihood: -331.32
converged: True LL-Null: -3691.8
Covariance Type: nonrobust LLR p-value: 0.000
NObeyesdad=1 coef std err z P>|z| [0.025 0.975]
Intercept 58.1248 9.361 6.209 0.000 39.778 76.472
TUE 0.1130 0.445 0.254 0.799 -0.759 0.985
CH2O -0.8634 0.609 -1.419 0.156 -2.056 0.329
FAF 0.1425 0.334 0.426 0.670 -0.513 0.798
Age 0.0579 0.077 0.754 0.451 -0.093 0.208
Height -76.5735 10.536 -7.268 0.000 -97.224 -55.923
Weight 1.3337 0.176 7.566 0.000 0.988 1.679
NObeyesdad=2 coef std err z P>|z| [0.025 0.975]
Intercept 328.4616 25.112 13.080 0.000 279.242 377.681
TUE 2.2275 0.870 2.560 0.010 0.522 3.933
CH2O -1.4150 0.984 -1.439 0.150 -3.343 0.513
FAF -1.3585 0.747 -1.820 0.069 -2.822 0.105
Age 0.1537 0.097 1.591 0.112 -0.036 0.343
Height -426.3945 30.970 -13.768 0.000 -487.095 -365.694
Weight 5.3584 0.372 14.386 0.000 4.628 6.088
NObeyesdad=3 coef std err z P>|z| [0.025 0.975]
Intercept 306.6447 33.046 9.279 0.000 241.876 371.413
TUE -7.8630 5.691 -1.382 0.167 -19.017 3.291
CH2O -21.0118 11.508 -1.826 0.068 -43.567 1.543
FAF -11.3624 5.759 -1.973 0.048 -22.650 -0.075
Age 2.4017 1.260 1.905 0.057 -0.069 4.872
Height -710.3867 156.303 -4.545 0.000 -1016.734 -404.039
Weight 10.1072 2.588 3.905 0.000 5.034 15.180
NObeyesdad=4 coef std err z P>|z| [0.025 0.975]
Intercept 352.4249 33.573 10.497 0.000 286.623 418.227
TUE -9.2469 5.711 -1.619 0.105 -20.440 1.946
CH2O -20.6780 11.516 -1.796 0.073 -43.250 1.894
FAF -14.7525 5.794 -2.546 0.011 -26.108 -3.397
Age 2.1487 1.262 1.703 0.089 -0.325 4.622
Height -758.2318 156.401 -4.848 0.000 -1064.772 -451.692
Weight 10.5011 2.589 4.056 0.000 5.427 15.575
NObeyesdad=5 coef std err z P>|z| [0.025 0.975]
Intercept 126.2892 12.539 10.072 0.000 101.713 150.865
TUE 0.5832 0.541 1.077 0.281 -0.478 1.645
CH2O -0.8764 0.706 -1.242 0.214 -2.260 0.507
FAF -0.1920 0.403 -0.476 0.634 -0.983 0.599
Age 0.0719 0.082 0.874 0.382 -0.089 0.233
Height -160.2982 14.026 -11.429 0.000 -187.788 -132.808
Weight 2.3663 0.208 11.397 0.000 1.959 2.773
NObeyesdad=6 coef std err z P>|z| [0.025 0.975]
Intercept 207.3760 15.374 13.489 0.000 177.244 237.508
TUE 1.6561 0.646 2.564 0.010 0.390 2.922
CH2O -0.6583 0.773 -0.851 0.395 -2.174 0.857
FAF -0.1243 0.485 -0.256 0.798 -1.076 0.827
Age 0.1042 0.087 1.197 0.231 -0.066 0.275
Height -266.6050 17.598 -15.150 0.000 -301.097 -232.113
Weight 3.6160 0.241 15.026 0.000 3.144 4.088

Impact on Model Accuracy

After applying feature selection based on CrossTab Sparsity, we’ll compare the accuracy of our new models against our baseline models. This comparison will reveal how effectively CrossTab Sparsity enhances classification performance.

Results and Discussion: Unveiling Insights

Model Comparison Table

After implementing CrossTab Sparsity in our feature selection process, let’s take a look at the results:

Comparison Code
metrics = {
    "Metric": ["Accuracy", "Precision", "Recall", "F1-Score"],
    "Baseline Model with all Parameters": [
        accuracy_score(y_test, base_preds),
        precision_score(y_test, base_preds, average='weighted'),
        recall_score(y_test, base_preds, average='weighted'),
        f1_score(y_test, base_preds, average='weighted'),
    ],
    "Challenger Model with only 5 Variables": [
        accuracy_score(y_test, challenger_preds),
        precision_score(y_test, challenger_preds, average='weighted'),
        recall_score(y_test, challenger_preds, average='weighted'),
        f1_score(y_test, challenger_preds, average='weighted'),
    ]
}
display(pd.DataFrame(metrics).round(4).set_index('Metric').T)
Metric Accuracy Precision Recall F1-Score
Baseline Model with all Parameters 0.9100 0.9123 0.9100 0.9103
Challenger Model with only 6 Variables 0.9384 0.9384 0.9384 0.9381

Insights Gained

Through this analysis, several key insights emerge:

Accuracy held steady (in fact improved) while the feature count dropped from 16 to 6, a 62.5% reduction.
  1. Feature Interactions Matter: The selected features based on CrossTab Sparsity significantly improved model accuracy—like finding out which ingredients make your favorite dish even better!
  2. Simplicity is Key: By focusing on relevant features, we enhance accuracy while simplifying model interpretation—because nobody likes unnecessary complexity.
  3. Real-World Applications: These findings have practical implications in fields such as public health and education, where classification plays a critical role, helping institutions make better decisions.

Conclusion: The Road Ahead

In conclusion, this blog has illustrated how CrossTab Sparsity can be a game-changer in classification tasks using the Obesity dataset. By leveraging this metric for feature selection, we achieved notable improvements in model performance—proof that sometimes less really is more!

Future Work: Expanding Horizons

As we look ahead, there are exciting avenues to explore:

  • Investigating regression problems using CrossTab Sparsity.
  • Comparing its effectiveness with other feature selection methods, such as Recursive Feature Elimination (RFE).

By continuing this journey into data science, we not only enhance our technical skills but also contribute valuable insights that can drive meaningful change in various industries.