All practitioners of data science eventually hit one giant task to do with data, and you know it well: it's EDA - Exploratory Data Analysis. The term EDA¹ was coined by Tukey himself in his seminal work. But do you think that EDA didn't exist before that?
¹ Emerson, J. D., & Hoaglin, D. C. (1983). Stem-and-leaf displays. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding Robust and Exploratory Data Analysis (pp. 7–32). New York: Wiley.
Well, glad you thought about it. Before that, everyone was doing what is called Hypothesis Testing. Yes, before EDA the race was mostly to fit the data and make the most unbiased and robust estimates. But remember one thing: when you talk about Hypothesis Testing, it was, and largely still is, tied to RCTs (Randomized Controlled Trials), a.k.a. Randomized Clinical Trials, which are the gold standard of data.
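As a minimal illustration of what classical hypothesis testing looks like in code (a sketch on synthetic data, not from any real trial), here is a two-group comparison with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# two synthetic "treatment arms" with slightly different means
control = rng.normal(loc=10.0, scale=2.0, size=200)
treated = rng.normal(loc=11.0, scale=2.0, size=200)

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```

A small p-value here only tells you the group means differ; whether that difference is causal depends on how the data was generated, which is exactly the RCT-vs-observational point below.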
More on RCTs and ODs
Now, let me not hijack the discussion into what RCTs and Observational Data (ODs) are, as that is more a matter of philosophical reasoning than of data quality. Essentially, what we are trying to find out is whether we can, using stats, identify interesting patterns in data.
The only difference with RCT data is that we tend to believe these interesting patterns coincide with some sort of 'cause-effect' relationship. But due to the biased nature of ODs, we certainly can't conclude this from observational data, and hence can only find interesting patterns.
Let's move on. The big question is: for whatever reason you are doing Hypothesis Testing, you are doing it to find something interesting. And that something interesting is usually found by using post-hoc tests. There is a variety of post-hoc tests available, but the one that is best known, and hence most easily found implemented, is Tukey's HSD.
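For orientation before the full pipeline, Tukey's HSD is also available directly in statsmodels. A minimal, self-contained sketch on synthetic groups (the group names and numbers here are illustrative, not from the car data):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# three synthetic groups; group "c" has a clearly shifted mean
values = np.concatenate([
    rng.normal(5, 1, 50),   # group a
    rng.normal(5, 1, 50),   # group b
    rng.normal(7, 1, 50),   # group c
])
groups = np.repeat(['a', 'b', 'c'], 50)

# all pairwise comparisons with family-wise error control
hsd = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(hsd.summary())
```

The comparisons involving group "c" should come out significant, while a vs b should not.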
So let's jump directly to how to follow this procedure. We'll be using bioinfokit for this, as it is a much simpler wrapper around what's implemented in statsmodels.
What are the results
Pheww… that's too much code, right? But it will save you a lot of time in real life. In practice, you would write your code as the 3 steps below:
Code
# import libraries
import pandas as pd

# Getting car data from UCI
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data',
                 sep='\s+', header=None,
                 names=['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                        'acceleration', 'model_year', 'origin', 'car_name'])
df.head()

# Syntax to do anova with validating the assumptions, doing the test and a post-hoc
results = do_anova_test(df=df, res_var='mpg', xfac_var='cylinders',
                        anova_model='mpg ~ C(cylinders)+C(origin)+C(cylinders):C(origin)',
                        ss_typ=3, result_full=True)
# Numbers are clumsy for most. Making a more interpretable plot of the above results.
plot_hsd(results.tukeyhsd.sort_values('Diff'),
         title="Tukey HSD results: ANOVA of MPG ~ Cylinders")
Results from plot_hsd
Tukey’s HSD comparison based on Anova Results
Plots look good with ‘p-values’.
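The markers next to the p-values in the labels follow the usual significance-star convention. The helper inside plot_hsd maps p-values roughly like this (restated here as a standalone function for clarity):

```python
def p_marker(p):
    """Conventional markers: '***' p<=0.001, '**' p<=0.01, '*' p<=0.05, '+' p<=0.1."""
    if p > 0.1:
        return ''
    if p > 0.05:
        return '+'
    if p > 0.01:
        return '*'
    if p > 0.001:
        return '**'
    return '***'

for p in (0.2, 0.07, 0.03, 0.005, 0.0005):
    print(f"p={p}: '{p_marker(p)}'")
```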
Conclusion
Now, since we applied the above to non-RCT data, we can observe that the difference in mpg across cylinder groups is large, especially as the number of cylinders goes up, but we cannot state this as causal, even if the plot makes it look that explicit. Unless you have a strong belief that the data follows the rules and assumptions of RCTs, you should only be seeking interesting, i.e. associational, results and not cause-effect results.
from bioinfokit import analys
import numpy as np
from scipy import stats


class KeyResults:
    """ A basic class to hold all the results """
    def __init__(self, result_full):
        self.keys = []
        self.result_full = result_full

    def add_result(self, name, result):
        if name == 'tukeyhsd':
            self.keys.append(name)
            setattr(self, name, result)
        elif self.result_full:
            self.keys.append(name)
            setattr(self, name, result)


# Anova test code
def do_anova_test(df, res_var, xfac_var, anova_model, ss_typ=3,
                  effectsize='n2', result_full=False, add_res=False):
    """ Do all sequential anova tests
    Step 1) Levene's / Bartlett's test for checking whether variance is homogeneous or not
    Step 2) Main ANOVA/ANCOVA test
    Step 3) Tukey's HSD for individual combinations
    :param df: Pandas DataFrame holding all the columns
    :param res_var: Variable for which we are checking ANOVA
    :param xfac_var: Grouping variables for which we want to do the comparisons
    :param anova_model: SM formula for the model. This is a life saver to make all things work
    :param result_full: To provide all the results of intermediate steps
    """
    results = KeyResults(result_full)

    # initialize stat method
    res = analys.stat()

    # doing Levene's test
    res.levene(df=df, res_var=res_var, xfac_var=xfac_var)
    print("\nLevene's Test Result:")
    print(res.levene_summary)
    results.add_result('levene', res.levene_summary)

    # doing Bartlett's test
    res.bartlett(df=df, res_var=res_var, xfac_var=xfac_var)
    print("\nBartlett's Test Result:")
    print(res.bartlett_summary)
    results.add_result('bartlett', res.bartlett_summary)

    # doing anova / ancova
    res.anova_stat(df=df, res_var=res_var, anova_model=anova_model, ss_typ=ss_typ)
    aov_res = res.anova_summary

    # Add effect sizes (eta-squared by default, else partial eta-squared)
    if effectsize == "n2":
        all_effsize = (aov_res['sum_sq'] / aov_res['sum_sq'].sum()).to_numpy()
        all_effsize[-1] = np.nan
    else:
        ss_resid = aov_res['sum_sq'].iloc[-1]
        all_effsize = aov_res['sum_sq'].apply(lambda x: x / (x + ss_resid)).to_numpy()
        all_effsize[-1] = np.nan
    aov_res[effectsize] = all_effsize
    aov_res = aov_res.round(4)

    # printing results
    print('\nANOVA/ANCOVA Test Result:')
    print(aov_res)
    results.add_result('anova', res.anova_summary.round(4))
    results.add_result('anova_model', res.anova_model_out)

    # doing Tukey's HSD to compare the groups
    res.tukey_hsd(df=df, res_var=res_var, xfac_var=xfac_var,
                  anova_model=anova_model, ss_typ=ss_typ)
    print('\nTukey HSD Result:')
    print(res.tukey_summary.round(4))
    results.add_result('tukeyhsd', res.tukey_summary.round(4))

    # add the full result object again if needed
    if add_res:
        results.add_result('allresult', res)

    return results
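Step 1 above, the homogeneity-of-variance checks, can also be run directly with scipy, which exposes the same Levene and Bartlett tests. A sketch on synthetic groups where the third group is built with a much larger variance, so both tests should flag heterogeneity:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# two groups with equal spread, one with a much larger spread
g1 = rng.normal(0.0, 1, 100)
g2 = rng.normal(0.5, 1, 100)
g3 = rng.normal(1.0, 4, 100)

lev_stat, lev_p = stats.levene(g1, g2, g3)      # robust to non-normality
bar_stat, bar_p = stats.bartlett(g1, g2, g3)    # assumes normality
print(f"Levene:   stat={lev_stat:.3f}, p={lev_p:.4g}")
print(f"Bartlett: stat={bar_stat:.3f}, p={bar_p:.4g}")
```

When these p-values are small, the equal-variance assumption behind the standard ANOVA is in doubt, which is exactly why the wrapper runs them first.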
Plotting results: plot_hsd.py
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('seaborn-bright')


def plot_hsd(hsdres, p_cutoff=0.05, title=None, ax=None, figsize=(10, 7)):
    """ Do plotting of tukeyhsd results
    :param hsdres: 'tukeyhsd' result from the do_anova_test function
    :param p_cutoff: Cutoff at which we say a combination is significant
    :param title: Title of the plot
    :param ax: Define or get the matplotlib axes
    :param figsize: Figure size to draw
    """
    if ax is None:
        fig, axp = plt.subplots(figsize=figsize)
    else:
        axp = ax

    # helper funcs: significance markers and mathtext y-labels
    p_ind = lambda x: '' if x > 0.1 else ('+' if x > 0.05 else ('*' if x > 0.01 else ('**' if x > 0.001 else '***')))
    label_gen = lambda x: f"${x[0]} - {x[1]}\\ |\\ p:{x[2]:0.2f}{p_ind(x[2]):5s}$"

    # setting values
    mask = hsdres['p-value'] <= p_cutoff
    yticklabs = hsdres[['group1', 'group2', 'p-value']].apply(label_gen, axis=1).values
    ys = np.arange(len(hsdres))

    # adding plot to axes: non-significant pairs in gray, significant in red
    axp.errorbar(hsdres[~mask]['Diff'], ys[~mask],
                 xerr=np.abs(hsdres[~mask][['Lower', 'Upper']]).values.T,
                 fmt='o', color='black', ecolor='lightgray', elinewidth=2, capsize=0)
    axp.errorbar(hsdres[mask]['Diff'], ys[mask],
                 xerr=np.abs(hsdres[mask][['Lower', 'Upper']]).values.T,
                 fmt='o', color='red', ecolor='pink', elinewidth=2, capsize=5)
    axp.axvline(x=0, linestyle='--', c='skyblue')
    axp.set_yticks([])

    # widen the x-axis to the left to make room for the labels
    (l, u) = axp.get_xlim()
    axp.set_xlim(l + 1.5 * l, u)
    (l, u) = axp.get_xlim()
    for idx, labs in enumerate(yticklabs):
        axp.text(l - 0.1 * l, ys[idx], labs)
    axp.set_yticklabels([])

    # finally doing what is needed
    if ax is None:
        plt.title('' if title is None else title, fontsize=14)
        plt.show()
    else:
        return axp
Hope this gives you a kickstart to finding your own interesting patterns. Happy Learning!