<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Jitin Kapila</title>
<link>https://jitinkapila.com/writing/</link>
<atom:link href="https://jitinkapila.com/writing/index.xml" rel="self" type="application/rss+xml"/>
<description>AI strategy consulting for COOs, CFOs, and operations leaders. Find the Logic Leak in your AI investment. $90M+ in portfolios. Independent.</description>
<generator>quarto-1.9.30</generator>
<lastBuildDate>Sat, 04 Apr 2026 18:30:00 GMT</lastBuildDate>
<item>
  <title>The AI Confusion Tax: Why Companies Burn Millions on Wrong AI</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/ai-confusion-tax/</link>
  <description><![CDATA[ 





<p><a href="cover-image.png" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://jitinkapila.com/writing/ai-confusion-tax/cover-image.png" class="profile-image img-fluid" style="width: 75%; margin:auto;"></a></p>
<p>It’s Q2 2026 and companies still think AI means one thing: GenAI. Large language models. Chatbots. Content generation.</p>
<p>That’s not AI. That’s one tool in a 50-year-old toolkit.</p>
<p>Classification problems have classifiers. Forecasting problems have regression. Anomaly detection has isolation forests. Survival problems have survival models. Each one is cheaper, faster, and more accurate than GenAI for its specific job.</p>
<p>And yet, companies are paying 12x more for GenAI solutions to problems that $5,000 of classical ML handles better. Worse, the ROI on the expensive option is often nonexistent.</p>
<p>That’s the <strong>AI Confusion Tax</strong>. You’re paying it whether you see the invoice or not.</p>
<p><br></p>
<section id="the-case" class="level2">
<h2 class="anchored" data-anchor-id="the-case">The Case</h2>
<div class="grid">
<div class="g-col-12 g-col-md-7">
<p>This happened: a vendor quoted $62,000 a year for a supplier categorization problem. Implementation: $50K. Recurring API costs: $1K per month.</p>
<p>Doing the same job with topic modeling would cost $15–20K to build and $10 a month to run.</p>
<p>They almost signed. They had no idea another way existed. Let that sink in!</p>
</div>
<div class="g-col-12 g-col-md-5">
<p><a href="idea.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://jitinkapila.com/writing/ai-confusion-tax/idea.png" class="img-fluid"></a></p>
</div>
</div>
<p>This isn’t rare. RAND Corporation research puts the AI project failure rate at over 80%, twice the rate of non-AI technology projects. MIT’s 2025 report found 95% of generative AI pilots at enterprises are failing. And S&amp;P Global’s 2025 survey found 42% of companies abandoned most of their AI initiatives this year — up from 17% in 2024.</p>
<p>The numbers keep climbing. Not because AI doesn’t work. Because the wrong AI gets applied to the wrong problem, at the wrong cost, with nobody asking the right questions upfront.</p>
<p><br></p>
</section>
<section id="what-is-the-ai-confusion-tax" class="level2">
<h2 class="anchored" data-anchor-id="what-is-the-ai-confusion-tax">What is the AI Confusion Tax?</h2>
<p>The AI Confusion Tax is the hidden cost companies pay when they select an AI solution before defining the problem it needs to solve. It shows up as overspend on the wrong tools, mismatched model types, and months of data preparation that never connects to a business decision.</p>
<p>It’s structural, not malicious. Vendors pitch what they sell. Companies don’t know the alternatives exist. The gap between those two facts is where the tax lives.</p>
<p>Three patterns account for the bulk of it.</p>
<section id="wrong-ai-type-for-the-problem" class="level3">
<h3 class="anchored" data-anchor-id="wrong-ai-type-for-the-problem">1. Wrong AI type for the problem</h3>
<p>A glass manufacturer wanted to organize their vendor base — which vendors sell which SKUs, how to consolidate for better negotiating power. A vendor came in with a solution built on LLM APIs. Expensive, slow, hard to explain to a supply chain manager.</p>
<p>The problem was text categorization. Topic modeling — LDA, basic NLP, even keyword extraction — would have solved it at a fraction of the cost. These are solved problems. They’re auditable. They’re cheap.</p>
<p>Recent research backs this up: fine-tuned small language models outperform GPT-4 on 85% of classification tasks while costing 10-100x less per inference. The hidden operational costs of LLMs — token monitoring, latency tracking, security audits — add 20-40% on top of the sticker price.</p>
<p>The vendor sold LLM because they sell LLM. The company bought it because they didn’t know topic modeling existed.</p>
<p>Same result. 12x cost difference. That’s not a technology gap. That’s an information gap — generative AI is one tool in the toolbox, not the toolbox itself.</p>
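<p>To make the “even keyword extraction” point concrete, here is a minimal sketch of that cheap alternative using only the Python standard library. The categories and keyword lists are invented for illustration; a real build would use TF-IDF or LDA over the actual supplier descriptions.</p>

```python
# Minimal sketch: keyword-based supplier categorization, standard library only.
# Categories and keywords below are hypothetical, for illustration.
from collections import Counter
import re

CATEGORY_KEYWORDS = {
    "packaging": {"carton", "box", "film", "pallet", "label"},
    "raw_glass": {"silica", "sand", "cullet", "soda", "ash"},
    "logistics": {"freight", "haulage", "shipping", "transport"},
}

def categorize(description: str) -> str:
    """Assign the category whose keyword set overlaps the description most."""
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    scores = Counter({cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()})
    best, hits = scores.most_common(1)[0]
    return best if hits else "uncategorized"

print(categorize("Bulk silica sand and soda ash, monthly delivery"))
```

<p>This is auditable in the way the post describes: a supply chain manager can read the keyword lists and see exactly why a supplier landed in a bucket.</p>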
</section>
<section id="wrong-problem-selected" class="level3">
<h3 class="anchored" data-anchor-id="wrong-problem-selected">2. Wrong problem selected</h3>
<p>This one costs the most over time.</p>
<p>A food company wanted to predict when their inventory would expire. They hired a team to build a classification model — will this item decay or not?</p>
<p>The model never produced useful outputs. Low confidence, poor predictions, no usable time window.</p>
<p>Nobody asked the right question.</p>
<p>This isn’t a classification problem. It’s a survival analysis problem. You don’t need to know <em>whether</em> the item will decay. You need to know <em>when</em> it will decay — and how much time you have to act before it does.</p>
<p>The difference matters. Classification gives you a yes/no on the day you check. Survival analysis gives you a time window. In inventory management, that window is everything — it’s the difference between selling at full margin and writing off the batch.</p>
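<p>The time-window idea can be sketched with a tiny Kaplan–Meier estimator; in practice you would use a library such as lifelines, and the durations below are illustrative, not from the case.</p>

```python
# Minimal Kaplan-Meier sketch: instead of a yes/no "will it decay?",
# survival analysis estimates the probability an item is still good at
# time t, which yields an action window. Data below is illustrative.
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each observed event time.
    durations: days until decay or end of observation.
    events: 1 if decay was observed, 0 if censored (still good)."""
    times = sorted({t for t, e in zip(durations, events) if e})
    surv, curve = 1.0, []
    for t in times:
        at_risk = sum(1 for d in durations if d >= t)                      # n_i
        died = sum(1 for d, e in zip(durations, events) if e and d == t)   # d_i
        surv *= 1 - died / at_risk
        curve.append((t, surv))
    return curve

durations = [5, 7, 7, 10, 12, 12, 15]
events    = [1, 1, 0, 1,  1,  0,  1]   # 0 = still sellable when last observed
for t, s in kaplan_meier(durations, events):
    print(f"day {t}: {s:.0%} of stock still good")
```

<p>The output is a curve, not a verdict: you can read off the day by which, say, half the batch will have turned, and act before it.</p>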
<p>The same mistake appears in churn prediction. Teams build classification models — will this customer churn? But customers don’t just switch overnight. They gradually reduce engagement, then spending, then leave. A classification model catches none of that progression. A survival model does.</p>
<p>The team blames the data. The data is fine. The model type was wrong from the start. The root cause is almost never data quality. It’s problem framing.</p>
</section>
<section id="data-before-decision" class="level3">
<h3 class="anchored" data-anchor-id="data-before-decision">3. Data before decision</h3>
<p>This one is expensive in a different way. It costs you in delay.</p>
<p>Every manufacturing and FMCG company I’ve worked with has started an AI project by running a data audit. Six months of cleaning, structuring, waiting for the right data.</p>
<p>$100K per quarter in lost optimization value.</p>
<p>The question nobody asks early enough: what decision are we trying to improve?</p>
<p>If you don’t know that, you don’t know which data matters. You end up preparing data that has nothing to do with your actual problem. The order matters. Decision first. Data second. Model third.</p>
<p>Vendors don’t push back. They scope the data work. The delay is yours.</p>
<p><br></p>
</section>
</section>
<section id="the-pattern-underneath" class="level2">
<h2 class="anchored" data-anchor-id="the-pattern-underneath">The pattern underneath</h2>
<p>Vendors pitch what they sell. Companies don’t know the alternatives. Nobody defines the problem before selecting the solution.</p>
<p>Every AI decision — vendor selection, project scoping, budget approval — has this tax embedded in it. Sometimes it’s $57K a year on the wrong tool. Sometimes it’s $100K a quarter in data prep delay. Sometimes it’s a model that works technically and delivers nothing commercially.</p>
<p>The sticker price on an AI vendor quote is only 40-55% of your actual total cost of ownership. The rest is hidden in integration, governance, and waste from solving the wrong problem.</p>
<p>Understanding which AI type fits your problem — whether it’s rules-based automation, analytics, predictive ML, or generative AI — is the single most important question you can answer before signing anything.</p>
</section>
<section id="before-you-sign-the-next-contract" class="level2">
<h2 class="anchored" data-anchor-id="before-you-sign-the-next-contract">Before you sign the next contract</h2>
<p>One question before any AI project gets approved: does the person scoping this understand what they’re scoping?</p>
<p>Not — is the model accurate? Not — is the vendor credible?</p>
<p>Can they tell you which AI type this problem needs? Classification, regression, clustering, survival analysis? And can they explain why that type is the right one for this specific business decision?</p>
<p>If they can’t answer that, they’re guessing. And you’re paying for the guess.</p>
<p>The confusion is solvable. You just need to know what questions to ask before anyone writes a contract.</p>
</section>
<section id="frequently-asked-questions" class="level2">
<h2 class="anchored" data-anchor-id="frequently-asked-questions">Frequently Asked Questions</h2>
<p><strong>What is the AI Confusion Tax?</strong></p>
<p>The AI Confusion Tax is the hidden cost companies pay when they select an AI solution — usually GenAI — before defining the problem it needs to solve. It shows up as 12x overspend on the wrong tools, mismatched model types, and months of data preparation that never connects to a business decision.</p>
<p><strong>Why do companies pay the AI Confusion Tax?</strong></p>
<p>Vendors pitch what they sell. Companies don’t know the alternatives exist. Every business problem has a specific ML model that’s cheaper, faster, and more accurate than GenAI for that job — but most leaders don’t know these alternatives exist.</p>
<p><strong>How much does the wrong AI cost?</strong></p>
<p>Companies routinely pay 12x more for GenAI solutions to problems that $5,000–$20,000 of classical ML handles better. A vendor quoted $62,000/year for a supplier categorization problem. The same job done with topic modeling: $15–20K to build, $10/month to run.</p>
<p><strong>How do you avoid the AI Confusion Tax?</strong></p>
<p>Before approving any AI project, ask: can the person scoping this tell you which AI type this problem needs — classification, regression, clustering, survival analysis — and why that type is the right one for this specific business decision? If they can’t answer, they’re guessing. And you’re paying for the guess.</p>
<hr>
<p><em>If your AI projects are running into walls you can’t explain, the AI Strategy Audit finds the confusion tax in your current initiatives — and tells you exactly what to fix. Or start with the AI Profit Quotient to find out where your company actually sits on the AI spectrum.</em></p>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "The AI Confusion Tax: Why Companies Burn Millions on Wrong AI Solutions",
  "description": "80% of AI projects fail — not from bad data, but wrong problem framing. Learn the 3 patterns that cost enterprises $100K+ per quarter and how to stop paying the AI Confusion Tax.",
  "author": {
    "@type": "Person",
    "name": "Jitin Kapila",
    "url": "https://jitinkapila.com/about"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Jitin Kapila",
    "url": "https://jitinkapila.com"
  },
  "datePublished": "2026-04-05",
  "dateModified": "2026-04-05",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://jitinkapila.com/blog/strategy/ai-confusion-tax/"
  },
  "keywords": ["AI confusion tax", "AI project failure", "wrong AI solution", "classical ML vs GenAI", "AI strategy"],
  "wordCount": 1100,
  "url": "https://jitinkapila.com/blog/strategy/ai-confusion-tax/"
}
</script>

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the AI Confusion Tax?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The AI Confusion Tax is the hidden cost companies pay when they select an AI solution — usually GenAI — before defining the problem it needs to solve. It shows up as 12x overspend on the wrong tools, mismatched model types, and months of data preparation that never connects to a business decision."
      }
    },
    {
      "@type": "Question",
      "name": "Why do companies pay the AI Confusion Tax?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Vendors pitch what they sell. Companies don't know the alternatives exist. Every business problem has a specific ML model that's cheaper, faster, and more accurate than GenAI for that job — but most leaders don't know these alternatives exist."
      }
    },
    {
      "@type": "Question",
      "name": "How much does the wrong AI cost?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Companies routinely pay 12x more for GenAI solutions to problems that $5,000-$20,000 of classical ML handles better. A vendor quoted $62,000/year for a supplier categorization problem. The same job done with topic modeling: $15-20K to build, $10/month to run."
      }
    },
    {
      "@type": "Question",
      "name": "What are the three patterns of the AI Confusion Tax?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "1. Wrong AI type for the problem (using GenAI where classical ML fits). 2. Wrong problem selected (using classification where survival analysis fits). 3. Data before decision (spending months on data prep without knowing what decision the AI is supposed to improve)."
      }
    },
    {
      "@type": "Question",
      "name": "How to avoid the AI Confusion Tax?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Before approving any AI project, ask: can the person scoping this tell me which AI type this problem needs — classification, regression, clustering, survival analysis — and why that type is the right one for this specific business decision? If they can't answer, they're guessing. And you're paying for the guess."
      }
    }
  ]
}
</script>



</section>

]]></description>
  <category>strategy</category>
  <guid>https://jitinkapila.com/writing/ai-confusion-tax/</guid>
  <pubDate>Sat, 04 Apr 2026 18:30:00 GMT</pubDate>
  <media:content url="https://jitinkapila.com/writing/ai-confusion-tax/cover-image.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>AI Strategic Skills: Where Should a CEO Draw the Line?</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/ai-strategic-skills/</link>
  <description><![CDATA[ 





<script type="application/ld+json">
[
  {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "AI Strategic Skills: Where Should a CEO Draw the Line?",
    "description": "The 5 essential AI strategic skills that leaders and teams need to thrive in an AI-driven world — and where the line between strategy and execution really sits.",
    "image": "https://jitinkapila.com/assets/img/pexels-car-dashboard.jpg",
    "author": {
      "@type": "Person",
      "name": "Jitin Kapila",
      "url": "https://jitinkapila.com/about",
      "jobTitle": "AI Strategy Consultant",
      "worksFor": {
        "@type": "Organization",
        "name": "Kriyalytics",
        "url": "https://kriyalytics.com"
      }
    },
    "publisher": {
      "@type": "Organization",
      "name": "Jitin Kapila",
      "logo": {
        "@type": "ImageObject",
        "url": "https://jitinkapila.com/assets/logo.png"
      }
    },
    "datePublished": "2026-01-29",
    "dateModified": "2026-01-29",
    "mainEntityOfPage": {
      "@type": "WebPage",
      "@id": "https://jitinkapila.com/blog/ai-strategic-skills"
    },
    "articleSection": "AI Strategy",
    "keywords": ["AI strategic skills", "executive AI", "AI governance", "problem-solution mapping", "architectural literacy", "data viability", "time-to-value", "prompt strategy", "CEO AI skills"],
    "wordCount": 1200,
    "about": [
      {
        "@type": "Thing",
        "name": "AI Strategic Skills",
        "description": "The 5 essential competencies that business leaders need to evaluate, manage, and extract value from AI investments"
      }
    ]
  },
  {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
      {
        "@type": "Question",
        "name": "What are the 5 AI strategic skills for executives?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The 5 AI strategic skills are: (1) Problem-Solution Mapping — knowing which AI tool solves which business problem and avoiding the mistake of forcing one tool type onto every problem; (2) Architectural Literacy — understanding enough about AI system architecture to know when a vendor is building a dependency versus solving a problem; (3) Data Viability — knowing what data needs to be captured today to enable prediction six months from now; (4) Time-to-Value — defining payback milestones before starting an AI project, not after; and (5) Prompt Strategy — understanding how to structure AI interactions for business outcomes, not just generating outputs."
        }
      },
      {
        "@type": "Question",
        "name": "What's the difference between strategic AI skills and technical AI skills?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "Technical AI skills (coding, model training, data engineering) belong to the technical team. Strategic AI skills belong to business leaders. The key distinction: you own the Pain, they own the Steps. Executives define the strategy, articulate the pain (Point A) and destination (Point B), and link model accuracy to dollars saved. Technical teams choose models, clean data, and build integration. The gap between these two worlds — where nobody translates technical output to business value — is where most AI projects fail."
        }
      },
      {
        "@type": "Question",
        "name": "Why do most AI projects fail due to translation error?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The CEO says 'I want more revenue.' The data team hears 'I need a model with high accuracy score.' These are not the same thing. Revenue can come from lowering churn, but only if you focus on high-CLTV customers. A technically accurate model on low-value customers is meaningless. The translation error — executives not defining the business lever and technical teams not asking — produces a technically successful model that drives zero actual revenue."
        }
      },
      {
        "@type": "Question",
        "name": "How does the dashboard analogy explain AI literacy for executives?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "Think of AI as car instrumentation: the Rearview Mirror (Exploratory AI) answers 'what just happened?' — traditional dashboards. The Windshield (Predictive AI) answers 'what's coming at us?' — forecasting and churn prediction. The GPS Map (Prescriptive AI) answers 'what's the best route?' — optimization and recommendations. Leaders need to know how to read all three, not build any of them. Each requires different AI tools: exploratory uses dashboards, predictive uses machine learning, prescriptive uses optimization."
        }
      },
      {
        "@type": "Question",
        "name": "What does 'you cannot delegate delivery of value' mean in AI?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "AI isn't like buying Microsoft Office where you install it and walk away. Leaders cannot delegate the small involvement (defining the North Star metric) or the large involvement (revamping business processes to actually use the new intelligence). You also cannot delegate the Data Viability Instinct — asking 'what data do we need to capture today to predict the future in six months?' If you delegate the business process change or data strategy, you don't own the value. You just own the bill."
        }
      }
    ]
  },
  {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "name": "Home",
        "item": "https://jitinkapila.com"
      },
      {
        "@type": "ListItem",
        "position": 2,
        "name": "Blog",
        "item": "https://jitinkapila.com/blog"
      },
      {
        "@type": "ListItem",
        "position": 3,
        "name": "AI Strategy",
        "item": "https://jitinkapila.com/blog/ai-strategy"
      },
      {
        "@type": "ListItem",
        "position": 4,
        "name": "AI Strategic Skills",
        "item": "https://jitinkapila.com/blog/strategy/ai-strategic-skills"
      }
    ]
  }
]
</script>

<p><br></p>
<div class="padded text-center justify-center page-columns page-full" style="margin:auto;">
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="pexels-car-dashboard.jpg" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-1" title="Driving your way with AI"><img src="https://jitinkapila.com/writing/ai-strategic-skills/pexels-car-dashboard.jpg" class="profile-image img-fluid figure-img column-body" style="width: 75%; margin:auto;" alt="Driving your way with AI"></a></p>
<figcaption>Driving your way with AI</figcaption>
</figure>
</div>
</div>
<p><br></p>
<p>We’ve talked about the <a href="https://aicrosscurrent.substack.com/p/top-5-ai-skills-for-executives">skills you need to survive</a> and the <a href="../../writing/why-ai-projects-fail/index.html">reasons most AI projects fail</a>. Now, let’s talk about the destination.</p>
<p>What does a healthy, AI-enabled enterprise actually look like?</p>
<p>It doesn’t look like a room full of glowing servers or a workforce in a state of “FOMO” panic. It looks like a driver sitting in a well-engineered car, hands on the wheel, confident in exactly where they are going.</p>
<p>In my <a href="../../ai-profit-os.html"><strong>“AI Profit OS”</strong></a> framework, the ultimate goal isn’t just “automation”—it’s <strong>Clarity</strong>. But to get there, you need to know exactly where your job ends and the technical team’s job begins. If you blur this line, you’re not just micromanaging; you’re likely to crash the car.</p>
<p>Here is the “Resolution”: The final rulebook for governing AI without getting lost in the code.</p>
<p><br></p>
<section id="the-line-in-the-sand-strategy-vs.-steps" class="level2">
<h2 class="anchored" data-anchor-id="the-line-in-the-sand-strategy-vs.-steps">1. The Line in the Sand: Strategy vs.&nbsp;Steps</h2>
<p>I get asked all the time: <em>“Jitin, how deep do I really need to go into the tech?”</em></p>
<p>The answer is simpler than you think: <strong>You own the Pain; they own the Steps.</strong></p>
<ul>
<li><strong>The CEO &amp; Senior Leaders:</strong> Your job is to define the <strong>Strategy and Value</strong>. You have to articulate the “Pain” (Point A) and the “Destination” (Point B). You’re the one who decides <em>what</em> success is actually worth to the P&amp;L.</li>
<li><strong>The Technical Team:</strong> They own the <strong>Execution</strong>. They choose the models, clean the data, manage the integration, and build the feedback loops.</li>
</ul>
<p><strong>The Crux:</strong> The magic (or the disaster) happens in the middle. This is where your <strong>“BS Detector”</strong> comes in. You don’t need to know how the algorithm works, but you <em>must</em> be able to translate their technical jargon (like “F1 scores” or “Accuracy”) into business outcomes. If you can’t link “model accuracy” to “dollars saved,” you don’t have a strategy—you have a science fair project.</p>
<p><br></p>
</section>
<section id="the-dashboard-analogy-driving-the-business" class="level2">
<h2 class="anchored" data-anchor-id="the-dashboard-analogy-driving-the-business">2. The Dashboard Analogy: Driving the Business</h2>
<p>Stop thinking of AI as a “magic box.” Think of it as the instrumentation of your car. To get from A to B effectively, you need three distinct views through the glass:</p>
<ol type="1">
<li><strong>The Rearview Mirror (Exploratory AI):</strong> “What just happened?” (Traditional data analysis and dashboards).</li>
<li><strong>The Windshield (Predictive AI):</strong> “What’s coming at us?” (Forecasting, Churn Prediction).</li>
<li><strong>The GPS Map (Prescriptive AI):</strong> “What’s the best route to take?” (Optimization, Recommendation Engines).</li>
</ol>
<p><strong>Your Job:</strong> You are the driver. You don’t need to be a mechanic to know how the fuel injection works. But you <em>must</em> know how to read the Windshield and the Map. Part of this is <strong>Problem-Solution Mapping</strong>. If you try to use a “GenAI” chatbot to solve a “Math” problem (like inventory optimization), you’re going to drive off a cliff. Knowing which tool to pull from the <a href="../../writing/ai-umbrella/index.html">AI Umbrella</a> is the definition of strategic competence.</p>
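<p>A one-line way to see why inventory optimization is a “Math” problem and not a language problem: the classic economic order quantity formula. The numbers here are illustrative, not from any client engagement.</p>

```python
# Inventory optimization is arithmetic, not language generation:
# the economic order quantity (EOQ) formula, sqrt(2DS/H).
# Demand and cost figures below are hypothetical.
from math import sqrt

def eoq(annual_demand, order_cost, holding_cost_per_unit):
    """Order size that minimizes combined ordering and holding cost."""
    return sqrt(2 * annual_demand * order_cost / holding_cost_per_unit)

print(round(eoq(annual_demand=12000, order_cost=100, holding_cost_per_unit=3)))
```

<p>No chatbot involved — a closed-form calculation a spreadsheet could do, which is exactly the point of Problem-Solution Mapping.</p>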
<p><br></p>
</section>
<section id="the-one-thing-you-cannot-delegate" class="level2">
<h2 class="anchored" data-anchor-id="the-one-thing-you-cannot-delegate">3. The One Thing You Cannot Delegate</h2>
<p>I’ve seen leaders try to delegate everything—including the outcome. This is fatal. <strong>You cannot delegate the “Delivery of Value.”</strong></p>
<p>AI isn’t like buying a copy of Microsoft Office where you “install it” and walk away.</p>
<ul>
<li><strong>Small Involvement:</strong> Defining the “North Star” Metric.</li>
<li><strong>Large Involvement:</strong> Revamping a business process to actually use the new intelligence.</li>
</ul>
<p>You also can’t delegate the <strong>Data Viability Instinct</strong>. You need to ask: <em>“What data do we need to start capturing TODAY so we can predict the future in six months?”</em> If you delegate the business process change or the data strategy, you don’t own the value—you just own the bill.</p>
<p><br></p>
</section>
<section id="the-promised-land-from-grunt-work-to-growth" class="level2">
<h2 class="anchored" data-anchor-id="the-promised-land-from-grunt-work-to-growth">4. The “Promised Land”: From Grunt Work to Growth</h2>
<p>What happens when you get this right? When you move from “Tier 1” (playing with ChatGPT) to “Tier 3” (The Strategist)?</p>
<p><strong>The Day-to-Day Shift:</strong></p>
<ul>
<li><strong>Before (Chaos):</strong> You’re playing defense. Worrying about overstocking, customer tickets, and hiring frantically just to keep up.</li>
<li><strong>After (Leverage):</strong> You’re playing offense. Your day is spent deciding <em>where to invest next</em>. Which market to enter? What feature to launch?</li>
</ul>
<p><strong>The Human Impact:</strong> I once worked with a Fortune 500 client who feared AI would demoralize their team. The reality? The team stopped doing the “robot work” (the grunt work on high-selling products). Instead, they focused on the creative, high-leverage stuff—like upselling new products. They didn’t shrink; they just became more human.</p>
<p><br></p>
</section>
<section id="the-monday-morning-report" class="level2">
<h2 class="anchored" data-anchor-id="the-monday-morning-report">5. The Monday Morning Report</h2>
<p>How do you sleep at night? You don’t need to check code commits. You need to check the <strong>Value Report</strong>.</p>
<p>Every Monday, you should be looking for one thing: <em>“Are the steps taken by the technical team moving us closer to the ROI we defined?”</em></p>
<p>This requires <strong>Architectural Literacy</strong>. You need to look past the “beautiful plating” of a fancy chatbot wrapper and ask if there’s an actual logic engine beneath it. If the answer is “Yes,” keep driving. If “No,” pull over and check the engine.</p>
<p><br></p>
</section>
<section id="lastly-your-license-to-operate" class="level2">
<h2 class="anchored" data-anchor-id="lastly-your-license-to-operate">Lastly, Your License to Operate</h2>
<p>You don’t need to be a mechanic to drive a Ferrari, but you do need to know if the brakes work and exactly where you’re headed. The “AI Profit OS” isn’t about replacing you—it’s about giving you the dashboard you’ve always wanted.</p>
<p><strong>Are you ready to take the wheel?</strong></p>
<p><a href="../../ai-profit-os.html"><strong>Check out the AI Profit OS Course</strong></a> | <a href="../../work-with-me.html"><strong>Book a Strategic Call</strong></a></p>
</section>

]]></description>
  <category>strategy</category>
  <category>career</category>
  <guid>https://jitinkapila.com/writing/ai-strategic-skills/</guid>
  <pubDate>Wed, 28 Jan 2026 18:30:00 GMT</pubDate>
  <media:content url="https://jitinkapila.com/writing/ai-strategic-skills/pexels-introspectivedsgn-23319107.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Why your GenAI strategy is just a pair of pliers in 2026</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/genai-is-plier/</link>
  <description><![CDATA[ 






<script type="application/ld+json">
[
  {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "Why Your GenAI Strategy Is Just a Pair of Pliers in 2026",
    "description": "GenAI is one tool in a large AI toolbox — not a complete strategy. Using it everywhere is like trying to build a skyscraper with just pliers.",
    "image": "https://jitinkapila.com/assets/img/genai-plier-effect.png",
    "author": {
      "@type": "Person",
      "name": "Jitin Kapila",
      "url": "https://jitinkapila.com/about",
      "jobTitle": "AI Strategy Consultant",
      "worksFor": {
        "@type": "Organization",
        "name": "Kriyalytics",
        "url": "https://kriyalytics.com"
      }
    },
    "publisher": {
      "@type": "Organization",
      "name": "Jitin Kapila",
      "logo": {
        "@type": "ImageObject",
        "url": "https://jitinkapila.com/assets/logo.png"
      }
    },
    "datePublished": "2026-01-15",
    "dateModified": "2026-01-15",
    "mainEntityOfPage": {
      "@type": "WebPage",
      "@id": "https://jitinkapila.com/blog/genai-is-plier"
    },
    "articleSection": "AI Strategy",
    "keywords": ["GenAI", "AI strategy", "tool selection", "AI umbrella", "enterprise AI", "LLM", "machine learning", "ROI", "AI failure", "strategy"],
    "wordCount": 1200,
    "about": [
      {
        "@type": "Thing",
        "name": "GenAI Limitations",
        "description": "Generative AI excels at language tasks — summarizing, drafting, answering queries — but is the wrong tool for mathematical optimization, prediction, or relational analysis"
      }
    ]
  },
  {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
      {
        "@type": "Question",
        "name": "What does the plier metaphor mean for GenAI strategy?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "GenAI is a powerful tool, but like pliers — it's specialized for specific tasks (gripping, pulling, twisting). Using pliers to drive screws, hammer nails, or dig foundations fails or damages things. Similarly, GenAI (LLMs) is world-class at summarizing documents, drafting emails, and answering questions — but using it to predict inventory needs, detect fraud, or optimize routes is applying the wrong tool to the wrong problem. Companies saying 'we have a GenAI strategy' are often trying to hammer every nail with pliers."
        }
      },
      {
        "@type": "Question",
        "name": "What are the three dangers of 'plier-only' thinking in AI?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The three dangers are: (1) The Hallucination Trap — using an LLM to 'predict' inventory needs when LLMs are built for language probability, not mathematical optimization; (2) The Thin-Wrapper Problem — building a chat interface on top of bad data where the plating looks nice but the food is terrible; (3) The ROI Gap — spending millions on GPU tokens to do work that simple linear regression or traditional ML can do faster, cheaper, and more accurately. These patterns are primary reasons why most AI projects fail."
        }
      },
      {
        "@type": "Question",
        "name": "How do you know which AI tool to use for which problem?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The AI Umbrella framework gives the mapping: (1) Need to predict churn? Use Machine Learning (the Torque Wrench); (2) Need to find the best delivery route? Use Optimization & Planning (the GPS); (3) Need to find fraud patterns in a network? Use Graph Algorithms (the X-Ray); (4) Need to summarize a 50-page contract? Use GenAI (the Pliers). Your job as a leader isn't to build the tools — it's to know when to use each one."
        }
      },
      {
        "@type": "Question",
        "name": "Why is prompt engineering the least important of the 5 AI strategic skills?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "Prompt engineering is the entry-level skill — it's about how you talk to the AI. Problem-solution mapping, architectural literacy, data viability, and time-to-value matter far more. You could write the perfect prompt and still fail because you chose the wrong problem to apply GenAI to, or because your data is missing, or because the output doesn't map to any business decision. Leaders who focus on prompting miss the strategic skills that actually drive AI value."
        }
      }
    ]
  },
  {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "name": "Home",
        "item": "https://jitinkapila.com"
      },
      {
        "@type": "ListItem",
        "position": 2,
        "name": "Blog",
        "item": "https://jitinkapila.com/blog"
      },
      {
        "@type": "ListItem",
        "position": 3,
        "name": "AI Strategy",
        "item": "https://jitinkapila.com/blog/ai-strategy"
      },
      {
        "@type": "ListItem",
        "position": 4,
        "name": "GenAI Is a Pair of Pliers",
        "item": "https://jitinkapila.com/blog/strategy/genai-is-plier"
      }
    ]
  }
]
</script>
<div class="padded text-center justify-center page-columns page-full" style="margin:auto;">
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="genai-plier-effect.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-1" title="You don’t build your dream home with just a pair of pliers."><img src="https://jitinkapila.com/writing/genai-is-plier/genai-plier-effect.png" class="profile-image img-fluid figure-img column-body" style="width: 75%; margin:auto;" alt="You don’t build your dream home with just a pair of pliers."></a></p>
<figcaption>You don’t build your dream home with just a pair of pliers.</figcaption>
</figure>
</div>
</div>
<p><br></p>
<p>Imagine walking onto a massive construction site where a crew is trying to build a skyscraper. You look at their toolbelts and see one thing: a pair of pliers.</p>
<p>They are trying to tighten bolts with them. They are trying to hammer nails with them. They are even trying to dig the foundation with them.</p>
<p>It sounds ridiculous, right? But in 2026, this is exactly what most “GenAI Strategies” look like. Companies are buying the “pliers” (LLMs like ChatGPT, Claude, or Gemini) and trying to force them to solve every problem in the enterprise—from inventory optimization to fraud detection.</p>
<p><strong>GenAI is a powerful tool. But it is just one tool in a very large toolbox.</strong></p>
<p><br></p>
<section id="the-plier-metaphor-what-genai-actually-does" class="level2">
<h2 class="anchored" data-anchor-id="the-plier-metaphor-what-genai-actually-does">1. The Plier Metaphor: What GenAI Actually Does</h2>
<p>In my <a href="../../writing/ai-umbrella/index.html">AI Umbrella framework</a>, I categorize GenAI as a tool for handling <strong>Cognitive Load</strong>.</p>
<p>Pliers are great for a specific set of tasks: gripping, pulling, and twisting. Similarly, GenAI is world-class at:</p>
<ul>
<li><strong>Summarizing</strong> long documents.</li>
<li><strong>Drafting</strong> emails or reports.</li>
<li><strong>Answering</strong> basic customer queries.</li>
<li><strong>Generating</strong> creative ideas.</li>
</ul>
<p>But you wouldn’t use pliers to drive a screw (you’d use a screwdriver) or to dig a trench (you’d use an excavator). If you try to use a “Plier” (GenAI) to solve a “Math” problem or a “Relationship” problem, the tool will fail—or worse, it will “hallucinate” a solution that looks right but is dangerously wrong.</p>
<p><br></p>
</section>
<section id="the-danger-of-plier-only-thinking" class="level2">
<h2 class="anchored" data-anchor-id="the-danger-of-plier-only-thinking">2. The Danger of “Plier-Only” Thinking</h2>
<p>When a CEO tells me, <em>“We have a GenAI strategy,”</em> I usually hear, <em>“We are trying to hammer every nail with pliers.”</em> This is a primary reason why <a href="../../writing/why-ai-projects-fail/index.html">most AI projects fail</a>.</p>
<p>Here is what happens when you use the wrong tool:</p>
<ul>
<li><strong>The Hallucination Trap:</strong> Using an LLM to “predict” inventory needs. LLMs are built for language probability, not mathematical optimization.</li>
<li><strong>The Thin-Wrapper Problem:</strong> Building a “chat interface” on top of bad data. The “plating” looks nice, but the “food” is still terrible (as I discussed in <a href="../../writing/ai-strategic-skills/index.html">AI Strategic Skills</a>).</li>
<li><strong>The ROI Gap:</strong> Spending millions on GPU tokens to do work that a simple linear regression (traditional ML) could do faster, cheaper, and more accurately.</li>
</ul>
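<p>To make the ROI Gap concrete, here is a minimal Python sketch. Every number in it is invented for illustration (the demand series, the per-call costs, the call volume), and it assumes NumPy and scikit-learn are available; the point is the shape of the comparison, not the figures:</p>

```python
# Illustrative only: invented demand data and invented unit costs.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
months = np.arange(24).reshape(-1, 1)             # two years of history
demand = 1000 + 50 * months.ravel() + rng.normal(0, 30, 24)

model = LinearRegression().fit(months, demand)    # trains in milliseconds
forecast = model.predict(np.array([[24]]))[0]     # next month's demand

# Hypothetical unit economics: a regression call is effectively free,
# while an LLM pipeline pays per token on every forecast it emits.
REGRESSION_COST = 0.00001     # invented $/call
LLM_COST = 0.03               # invented $/call
CALLS_PER_MONTH = 100_000
gap = (LLM_COST - REGRESSION_COST) * CALLS_PER_MONTH
print(f"Forecast: {forecast:.0f} units; monthly cost gap: ${gap:,.0f}")
```

<p>The particular numbers don’t matter; what matters is that a bounded numeric question deserves a bounded numeric tool.</p>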
<p><br></p>
</section>
<section id="building-a-full-toolbox" class="level2">
<h2 class="anchored" data-anchor-id="building-a-full-toolbox">3. Building a Full Toolbox</h2>
<p>To move from “Tier 1” (playing with tools) to “Tier 3” (The Strategist), you need to understand the rest of the toolbox.</p>
<p>As we explored in <a href="../../writing/decision-first-ai/index.html">Decision-First AI</a>, you must start with the <strong>Decision</strong> first, then pick the tool:</p>
<ol type="1">
<li><strong>Need to predict churn?</strong> Use <strong>Machine Learning</strong> (The Torque Wrench).</li>
<li><strong>Need to find the best delivery route?</strong> Use <strong>Optimization &amp; Planning</strong> (The GPS).</li>
<li><strong>Need to find fraud patterns in a network?</strong> Use <strong>Graph Algorithms</strong> (The X-Ray).</li>
<li><strong>Need to summarize a 50-page contract?</strong> <strong>Now</strong> you pull out the <strong>GenAI Pliers</strong>.</li>
</ol>
<p><br></p>
</section>
<section id="the-executives-job-choosing-the-tool" class="level2">
<h2 class="anchored" data-anchor-id="the-executives-job-choosing-the-tool">4. The Executive’s Job: Choosing the Tool</h2>
<p>Your job as a leader isn’t to know how to <em>build</em> the pliers. It’s to know <em>when</em> to use them.</p>
<p>In my <strong>“AI Profit OS”</strong> framework, we teach leaders to move past the hype and look at the “Umbrella.” If your technical team is suggesting a GenAI solution for a problem that requires structural data logic, you need to pull the emergency brake.</p>
<p><strong>Are you building a skyscraper, or just playing with pliers?</strong></p>
<p><br></p>
</section>
<section id="take-the-next-step" class="level2">
<h2 class="anchored" data-anchor-id="take-the-next-step">Take the Next Step</h2>
<p>Don’t let your strategy get stuck in “pilot purgatory” because you picked the wrong tool.</p>
<p><a href="../../ai-profit-os.html"><strong>Join the AI Profit OS Cohort</strong></a> | <a href="../../work-with-me.html"><strong>Book a Strategic Call</strong></a></p>
<p><em>If you need help identifying which tool in the AI Umbrella fits your specific business pain, let’s talk.</em></p>
</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>strategy</category>
  <category>business</category>
  <guid>https://jitinkapila.com/writing/genai-is-plier/</guid>
  <pubDate>Wed, 14 Jan 2026 18:30:00 GMT</pubDate>
</item>
<item>
  <title>Why Most AI Strategies Fail</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/why-ai-projects-fail/</link>
  <description><![CDATA[ 







<div class="column-body padded text-center justify-center" style="margin:auto;">
<p><a href="business-team-thinking-about-possible-solution.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://jitinkapila.com/writing/why-ai-projects-fail/business-team-thinking-about-possible-solution.jpg" class="profile-image img-fluid" style="width: 75%; margin:auto;"></a></p>
</div>
<p><br></p>
<p><strong>Most AI projects don’t fail because the code is bad. They fail because they were born from the wrong emotions.</strong></p>
<p>In my experience advising Fortune 500s and startups on data science, I’ve seen hundreds of AI initiatives. The successful ones all start by identifying a specific business <strong>PAIN</strong>. The failed ones? They almost always begin with the “Triad of Bad Feelings”: <strong>Pressure, FOMO, or Anxiety.</strong></p>
<ul>
<li><strong>Pressure:</strong> The Board says, “We need to do something with AI.”</li>
<li><strong>FOMO (Fear Of Missing Out):</strong> You see a competitor launch a chatbot and feel the need to keep up.</li>
<li><strong>Anxiety:</strong> You read the headlines and fear your business is becoming irrelevant in a hype cycle.</li>
</ul>
<p>When you build from anxiety instead of pain, you get “Science Fair Projects”—expensive pilots that never touch the P&amp;L. Here is my diagnostic guide to why your last AI project might have stalled, and the specific “Technical Fluency” gaps that likely caused it.</p>
<p><br></p>
<section id="the-1-silent-killer-absence-of-pain" class="level2">
<h2 class="anchored" data-anchor-id="the-1-silent-killer-absence-of-pain">The #1 Silent Killer: Absence of Pain</h2>
<p>If you cannot name the specific business pain you are solving, your project is already dead.</p>
<p>I recently audited a company where the C-Suite wanted “AI for efficiency.” That is not a pain; that is a wish. Because they couldn’t quantify the pain (e.g., “We are losing $50,000 per month in manual data entry errors”), they couldn’t quantify the <strong>Value</strong> of a solution.</p>
<p>The result was a project with no “End in Mind.” The team drifted, timelines slipped, and the budget evaporated because no one knew what “success” actually looked like.</p>
<p><strong>The Fix:</strong> Before writing a single line of code, you must be able to fill in this blank:</p>
<blockquote class="blockquote">
<p><em>“This AI solves <strong>[Specific Pain]</strong> which is currently costing us <strong>[$$$$]</strong> per month.”</em></p>
</blockquote>
<p><br></p>
</section>
<section id="the-translation-error-ceo-vs.-data-scientist" class="level2">
<h2 class="anchored" data-anchor-id="the-translation-error-ceo-vs.-data-scientist">The “Translation Error”: CEO vs.&nbsp;Data Scientist</h2>
<p>This is the most common friction point I see in enterprise AI.</p>
<ul>
<li><strong>The CEO says:</strong> <em>“I want more revenue.”</em></li>
<li><strong>The Data Team hears:</strong> <em>“I need a model with a high accuracy score.”</em></li>
</ul>
<p>These are not the same thing. “More revenue” is a goal; high accuracy is just one possible tactic. For example, revenue can come from lowering churn, but only if you focus on the right customers. A 1% churn reduction on low-value customers is meaningless. You need to save the High-CLTV (Customer Lifetime Value) clients.</p>
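<p>A back-of-the-envelope calculation shows the gap. The segment sizes and CLTV figures below are invented purely for illustration:</p>

```python
# Invented numbers: the same 1% churn reduction, two very different payoffs.
low_value = {"customers": 50_000, "cltv": 200}      # many small accounts
high_value = {"customers": 2_000, "cltv": 25_000}   # few strategic accounts

def retained_value(segment, churn_reduction=0.01):
    """Lifetime value retained by cutting churn in this segment."""
    return segment["customers"] * churn_reduction * segment["cltv"]

print(retained_value(low_value))    # 500 customers saved -> $100,000
print(retained_value(high_value))   # 20 customers saved  -> $500,000
```

<p>Saving 20 of the right customers returns five times what saving 500 of the wrong ones does. That is the lever the model must be pointed at.</p>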
<p><a href="../../writing/decision-first-ai/index.html"><strong>The Fluency Gap:</strong></a> The executives didn’t define the business lever (CLTV), and the tech team didn’t ask. The result is a technically “successful” model that drives zero actual revenue.</p>
<p><br></p>
</section>
<section id="when-to-kill-an-ai-project-the-code-red-checklist" class="level2">
<h2 class="anchored" data-anchor-id="when-to-kill-an-ai-project-the-code-red-checklist">When to Kill an AI Project: The “Code Red” Checklist</h2>
<p>Part of my job as a strategist is telling clients when to stop spending money. An AI project should be put on hold or killed immediately if it meets these criteria:</p>
<ol type="1">
<li><strong>Invisible ROI:</strong> You are three months in and still cannot <a href="../../writing/ai-umbrella/index.html">map the model’s output</a> to a dollar value.</li>
<li><strong>Missing Ingredients:</strong> The core data is missing or inaccessible. You can’t build a churn model if you aren’t tracking customer complaints or service failures.</li>
<li><strong>The “30% Drift” Rule:</strong> 30% of the project timeline has passed, and the team still has no clear picture of what the final output looks like, or the project lacks stakeholder support.</li>
</ol>
<p>The only exception is when the potential ROI is massive and justifies a high-risk, “Agile” approach to de-risk the project in smaller steps.</p>
<p><br></p>
</section>
<section id="case-study-how-fixing-definitions-saved-a-fortune-500" class="level2">
<h2 class="anchored" data-anchor-id="case-study-how-fixing-definitions-saved-a-fortune-500">Case Study: How Fixing Definitions Saved a Fortune 500</h2>
<p>I worked with a <a href="../../../case-studies/retail/index.html">global Fortune 500 retailer</a> whose AI initiatives were failing.</p>
<p><strong>The Problem:</strong> Every country operated as a mini-company. They had different definitions for “Delivery Time,” “Stock in Hand,” and “Marketing Spend.”</p>
<p><strong>The Consequence:</strong> The global AI forecasting model was useless because the input data meant different things in different regions. This “Translation Error” was costing them millions in lost inventory.</p>
<p><strong>The Turnaround:</strong> We didn’t start with a better AI model. We started with <strong>Definitions</strong>. By standardizing the meaning of “ROI” and “COGS” across their SAPMENA region, the data became clean. Once the data was clean, the AI model could actually forecast.</p>
<p>The lesson is simple: You cannot layer AI on top of organizational chaos. You must fix the business definitions first.</p>
<p><br></p>
<section id="how-to-audit-your-ai-strategy" class="level3">
<h3 class="anchored" data-anchor-id="how-to-audit-your-ai-strategy">How to Audit Your AI Strategy</h3>
<p>If you are feeling the <strong>Pressure, FOMO, or Anxiety</strong> of AI right now, stop. Don’t build another “Solution looking for a Problem.” My framework helps leaders:</p>
<ol type="1">
<li><strong>Identify the PAIN</strong> (not the hype).</li>
<li><strong>Quantify the Value</strong> (put a dollar sign on it).</li>
<li><strong>Assign Accountability</strong> (who owns the ROI?).</li>
</ol>
<p>If you want to diagnose your current AI roadmap before you spend another dollar, <a href="https://cal.com/kriyalytics/strategy">let’s talk</a>.</p>
</section>
</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>business</category>
  <category>strategy</category>
  <guid>https://jitinkapila.com/writing/why-ai-projects-fail/</guid>
  <pubDate>Wed, 07 Jan 2026 18:30:00 GMT</pubDate>
  <media:content url="https://jitinkapila.com/writing/why-ai-projects-fail/business-team-thinking-about-possible-solution.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Decision-First AI: Why Data Should Follow, Not Lead</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/decision-first-ai/</link>
  <description><![CDATA[ 





<script type="application/ld+json">
[
  {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "Decision-First AI: Why Data Should Follow, Not Lead",
    "description": "The dataset-first trap that kills AI projects — and the Decision Map framework that flips the order. Start with the decision, map it, then pick the simplest data and model.",
    "image": "https://jitinkapila.com/assets/img/panel.jpg",
    "author": {
      "@type": "Person",
      "name": "Jitin Kapila",
      "url": "https://jitinkapila.com/about",
      "jobTitle": "AI Strategy Consultant",
      "worksFor": {
        "@type": "Organization",
        "name": "Kriyalytics",
        "url": "https://kriyalytics.com"
      }
    },
    "publisher": {
      "@type": "Organization",
      "name": "Jitin Kapila",
      "logo": {
        "@type": "ImageObject",
        "url": "https://jitinkapila.com/assets/logo.png"
      }
    },
    "datePublished": "2025-10-09",
    "dateModified": "2025-10-09",
    "mainEntityOfPage": {
      "@type": "WebPage",
      "@id": "https://jitinkapila.com/blog/decision-first-ai"
    },
    "articleSection": "AI Strategy",
    "keywords": ["decision-first AI", "data strategy", "decision map", "AI project planning", "decision-focused learning", "predict then optimize", "AI failure prevention"],
    "wordCount": 1600,
    "about": [
      {
        "@type": "Thing",
        "name": "Decision-First AI",
        "description": "An AI project methodology that starts with defining the decision to be changed, then maps the minimal data and simplest model needed to improve that decision"
      },
      {
        "@type": "Thing",
        "name": "Decision Map",
        "description": "A one-page framework with six steps for scoping an AI project before touching data: name the decision, define the metric, map the process, identify the action trigger, list minimal signals, and plan the feedback loop"
      }
    ]
  },
  {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
      {
        "@type": "Question",
        "name": "What is the dataset-first trap in AI projects?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The dataset-first trap is the pattern where an AI project starts with a dataset, builds dashboards and models, shows a demo with good accuracy — and then nobody can answer 'what decision does this support?' The project optimizes prediction metrics (error, F1, AUC, MAPE) without linking them to business decisions. The result: technically good models that drive zero business impact. The trap wastes time, money, and faith, and it gives AI a bad name."
        }
      },
      {
        "@type": "Question",
        "name": "What are the 6 steps of the Decision Map framework?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The Decision Map has 6 steps: (1) Name the Decision — who decides, how often, what options; (2) Define the Decision Metric — primary and secondary measures of success; (3) Map the Current Process — where decisions are made, who is involved, where delays happen; (4) Identify the Minimal Action Trigger — what threshold or output causes a behavior change; (5) List Minimal Signals — the smallest set of data needed to reduce uncertainty for this decision; (6) Plan the Feedback Loop — how to measure the decision metric after deployment. Completing this map does 80% of the work most teams skip."
        }
      },
      {
        "@type": "Question",
        "name": "Why do models with better accuracy sometimes produce worse business outcomes?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "Because accuracy is not the same as decision value. A model can be 95% accurate on the wrong question and produce confident wrong answers. If your data was collected for reporting purposes and your decision is made on a different variable, the model is accurate to the wrong question. Research in 'decision-focused learning' (smart predict-then-optimize) shows that training models directly for decision loss — not prediction loss — yields better business outcomes even with 'worse' prediction metrics."
        }
      },
      {
        "@type": "Question",
        "name": "How does decision-first prevent AI project failures?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "Decision-first forces alignment before investment. If the decision owner cannot commit to a measurable metric, the project is paused before spending begins. This prevents the common failure of building technically impressive systems nobody uses. The telecom case study demonstrates this: a real-time anomaly detection system saved ~£80K/month because the decision was so clearly scoped — 'dispatch a technician for a suspected DSL fault' — that value could be measured before full scale. The clarity came from starting with the decision, not the data."
        }
      },
      {
        "@type": "Question",
        "name": "What does 'start with a rule baseline' mean in decision-first AI?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "Before building any model, define a simple rule that represents the current baseline behavior — for example, 'if X > T, create a ticket.' This rule is your fail-safe and your benchmark. If a machine learning model cannot beat that rule in decision impact — not just prediction accuracy — it should be scrapped. Many projects spend months and significant budget building models that marginally improve accuracy metrics while matching or underperforming simple business rules in actual business outcomes."
        }
      }
    ]
  },
  {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "name": "Home",
        "item": "https://jitinkapila.com"
      },
      {
        "@type": "ListItem",
        "position": 2,
        "name": "Blog",
        "item": "https://jitinkapila.com/blog"
      },
      {
        "@type": "ListItem",
        "position": 3,
        "name": "AI Strategy",
        "item": "https://jitinkapila.com/blog/ai-strategy"
      },
      {
        "@type": "ListItem",
        "position": 4,
        "name": "Decision-First AI",
        "item": "https://jitinkapila.com/blog/strategy/decision-first-ai"
      }
    ]
  }
]
</script>

<p><strong>Start here:</strong> don’t open a notebook until you know the decision you want to change.</p>
<p>Sounds obvious. But most AI projects don’t start there. They start with a dataset, or with a “let’s try this model,” or with a platform demo that looks great in the cloud. And then weeks later the obvious question appears: “Okay — what decision does this support?” People shrug. The project stalls. The models are good. The business impact is vague.</p>
<p>This is the dataset-first trap. <a href="../../writing/why-ai-projects-fail/index.html">It wastes time, money, and faith.</a> It also gives AI a bad name.</p>
<p>I’ve seen the opposite work — a lot. Start with the decision. Map the decision. Then pick the simplest data and model that make the decision better. The result? Faster pilots, clearer ROI, and systems that actually get used.</p>
<p>That approach is not just a neat management idea. There’s a real body of research showing that aligning models to downstream decision goals yields better decisions than optimizing prediction accuracy alone. And there are real, practical wins — from telecom fault detection to inventory systems — <a href="../../work-with-me.html">when you flip the order.</a></p>
<p><br></p>
<section id="the-dataset-first-trap-what-it-looks-like-and-why-it-hurts" class="level2">
<h2 class="anchored" data-anchor-id="the-dataset-first-trap-what-it-looks-like-and-why-it-hurts">The “dataset-first” trap — what it looks like, and why it hurts</h2>
<p>Here’s the typical playbook I see in companies:</p>
<ul>
<li>Someone discovers a new dataset.</li>
<li>They build dashboards, then a model, then a finer model, then a fancier model.</li>
<li>They show a demo. The demo gets applause. Then the work hits integration, governance, and the messy reality of people who must make decisions every day. <a href="../../writing/ai-umbrella/index.html">The model’s outputs don’t map to a decision process.</a> So adoption fails.</li>
</ul>
<p>Why? Because the project optimized the wrong thing. It optimized prediction metrics — error, F1, AUC, MAPE. And those are useful. But they’re not the measure of business impact. A model with better accuracy can still be useless if it doesn’t change what someone does.</p>
<p>Harvard Business Review captured this idea well: decisions don’t start with data. They start with a problem, a role, a process, and a behavior. If your analytics don’t connect to that reality, you get slides and disappointment.</p>
<p>There are more subtle costs too. Dataset-first projects often create models that are brittle in production: they overfit to historical quirks, they require constant data wrangling, and they produce numbers no one trusts. That kills scale.</p>
<section id="decision-first-in-research-not-new-but-finally-practical" class="level3">
<h3 class="anchored" data-anchor-id="decision-first-in-research-not-new-but-finally-practical">Decision-first in research: not new, but finally practical</h3>
<p>There’s academic grounding for starting with decisions. In the machine-learning community this shows up as “decision-focused learning” or “smart predict-then-optimize.” The idea: train predictive models not for pure accuracy, but to minimize the loss that matters to the downstream optimization or decision task. When you optimize directly for the decision loss, you often get better business outcomes — even with “worse” prediction metrics.</p>
<p>Recent papers and reviews show both the theory and practical methods: surrogate losses that reflect decision outcomes, techniques to differentiate through optimization, and heuristics for discrete problems. The takeaway: the math supports the intuition. If you want a model to help choose inventory levels, price points, or routing, train it with that decision in mind — not just with RMSE.</p>
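<p>To make the contrast concrete, here is a toy sketch (my own illustration, not the SPO method itself; all numbers and costs are made up): two demand forecasts, one better on MSE, the other better on the stocking decision it feeds.</p>

```python
# Toy illustration: the forecast with better accuracy (lower MSE) can still
# be the worse forecast for the decision it feeds.
# We stock to the forecast; each unit of unmet demand costs 10, leftover costs 1.

def decision_cost(stock, demand, underage=10.0, overage=1.0):
    """Business cost of one stocking decision given realized demand."""
    if demand > stock:
        return (demand - stock) * underage  # lost sales hurt a lot
    return (stock - demand) * overage       # holding cost is mild

demands    = [100, 120,  90, 110, 105]
forecast_a = [100, 118,  92, 111, 104]  # accurate (low MSE) but sometimes short
forecast_b = [103, 121,  95, 113, 107]  # biased high (worse MSE), hedges stockouts

def mse(forecast, actual):
    return sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual)

def total_cost(forecast, actual):
    return sum(decision_cost(f, a) for f, a in zip(forecast, actual))

print(mse(forecast_a, demands), mse(forecast_b, demands))                # 2.0 9.6
print(total_cost(forecast_a, demands), total_cost(forecast_b, demands))  # 33.0 14.0
```

<p>Forecast A wins on the prediction metric; forecast B wins on the money. Decision-focused training bakes this asymmetry into the loss instead of discovering it after deployment.</p>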
<p>That doesn’t mean every model must be complex. Often the opposite. Framing the decision reduces model complexity because you only model what matters. This is your Occam’s Razor.</p>
<p><br></p>
</section>
</section>
<section id="the-decision-map-simple-framework-you-can-use-today" class="level2">
<h2 class="anchored" data-anchor-id="the-decision-map-simple-framework-you-can-use-today">The Decision Map: simple framework you can use today</h2>
<p>If you want to flip to decision-first, start with a small, disciplined tool I call a <strong>Decision Map</strong>. It’s a one-page artefact. Build it before you touch data.</p>
<p>Here’s the <strong>Decision Map</strong> — six steps. Do them in order.</p>
<ul>
<li><p><strong>Name the decision.</strong> Who decides, how often, and what options do they choose? Example: “Field-ops decides whether to dispatch a technician to a suspected DSL fault.” Be specific. Frequency matters — hourly, daily, weekly change what you can do.</p></li>
<li><p><strong>Define the decision metric(s).</strong> What counts as success? Lower cost? Faster response time? Increase in net revenue? Pick one primary metric and one secondary. If you can’t name it in a single measurable sentence, you don’t have a decision.</p></li>
<li><p><strong>Map the current process.</strong> Where is the decision made today? Which people and tools are involved? Where does data enter? Where do delays happen? This step exposes the friction you must remove.</p></li>
<li><p><strong>Identify the minimal action the model must trigger.</strong> The model doesn’t need to be perfect. It needs to change behavior. If the model’s output is a probability, what threshold triggers action? Who gets the alert? What’s the handoff?</p></li>
<li><p><strong>List the minimal signals (data) needed.</strong> Only include data that directly reduces uncertainty for the decision. You’ll be surprised how small this list often is. Think: signal → action. Not “all the data.”</p></li>
<li><p><strong>Plan the feedback loop.</strong> How will you measure the decision metric after deployment? How will you collect labels and iterate? Decide that upfront.</p></li>
</ul>
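<p>If it helps your team, the six steps above can be captured as a literal one-page artefact in code. A minimal sketch (the class and field names are my own invention, not a library):</p>

```python
from dataclasses import dataclass, field

@dataclass
class DecisionMap:
    """One-page Decision Map: fill every field before touching data."""
    decision: str            # who decides, how often, choosing among what options
    primary_metric: str      # the one measurable sentence that defines success
    secondary_metric: str
    current_process: str     # where the decision is made today, and by whom
    trigger_action: str      # the minimal action the model output must trigger
    minimal_signals: list = field(default_factory=list)  # signal -> action, not "all the data"
    feedback_loop: str = ""  # how the decision metric is measured post-deployment

    def is_ready_for_pilot(self) -> bool:
        """A pilot is justified only when every step is filled in."""
        return all([self.decision, self.primary_metric, self.current_process,
                    self.trigger_action, self.minimal_signals, self.feedback_loop])

dsl_map = DecisionMap(
    decision="Field-ops decides whether to dispatch a technician to a suspected DSL fault",
    primary_metric="Reduce customer-reported faults",
    secondary_metric="Monthly savings from fewer reactive truck rolls",
    current_process="Customer calls -> ticket -> dispatch, often days later",
    trigger_action="High-confidence anomaly creates a high-priority ticket",
    minimal_signals=["DSL line metrics", "error rates", "device telemetry"],
    feedback_loop="Compare flagged incidents to customer complaints monthly",
)
print(dsl_map.is_ready_for_pilot())  # True
```

<p>If <code>is_ready_for_pilot()</code> is false, you are missing one of the six steps; pause the project until the owner can fill the gap.</p>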
<p>If you complete this map, you’ll have done 80% of the work most teams skip. It forces alignment, and it reveals whether the project is worth doing.</p>
<p><a href="panel.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1"><img src="https://jitinkapila.com/writing/decision-first-ai/panel.jpg" class="img-fluid"></a></p>
<p>P.S. - I am hosting a course where you’ll learn exactly how to <a href="../../ai-profit-os.html">implement this for your use case</a>.</p>
<p><br></p>
<section id="a-telecom-example-mapped-end-to-end-real-story" class="level3">
<h3 class="anchored" data-anchor-id="a-telecom-example-mapped-end-to-end-real-story">A telecom example, mapped end-to-end (real story)</h3>
<p>A brief, real example makes this concrete. I built a real-time anomaly detection system for a European telecom client early in my career. The project didn’t begin with “we have logs.” It began with this decision map:</p>
<ul>
<li><strong>Decision:</strong> Should operations dispatch a field technician proactively for a suspected DSL fault?</li>
<li><strong>Metric:</strong> Reduce customer-reported faults and improve Net Promoter Score (NPS) by reducing time-to-detect. Also: monthly cost savings from fewer reactive truck rolls and more planned ones.</li>
<li><strong>Process:</strong> Operations received the customer call, created a ticket, and dispatched if needed — often hours or days later. That was slow and expensive.
<ul>
<li><strong>Action:</strong> The system scores each line and assigns a tier:</li>
<li><strong>Red:</strong> very high probability of a fault in the next 48 hours.</li>
<li><strong>Amber:</strong> elevated probability of a fault in the next 2–15 days.</li>
<li><strong>Green:</strong> no fault expected in the next 15 days.</li>
<li>A red or amber flag creates a high-priority ticket and dispatches a remote check or a technician, planned region by region.</li>
</ul></li>
<li><strong>Signals:</strong> DSL line metrics, error rates, device telemetry, event logs — a handful of streams, not every log.</li>
<li><strong>Feedback:</strong> Compare flagged incidents to customer complaints and adjust thresholds.</li>
</ul>
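<p>The tiering logic above can be sketched in a few lines. The thresholds here are illustrative placeholders, not the client’s actual values:</p>

```python
from collections import defaultdict

def triage(p_fault_48h, p_fault_15d):
    """Map fault probabilities to the Red/Amber/Green tiers.
    Thresholds are illustrative; in practice they come from the feedback loop."""
    if p_fault_48h >= 0.8:
        return "red"    # very likely fault within 48 hours
    if p_fault_15d >= 0.5:
        return "amber"  # elevated risk over the next 2-15 days
    return "green"      # no visible risk in the next 15 days

# Group non-green lines by region so dispatch follows a planned route.
lines = [("line-1", "north", 0.90, 0.95),
         ("line-2", "north", 0.10, 0.60),
         ("line-3", "south", 0.05, 0.20)]
by_region = defaultdict(list)
for line, region, p48, p15 in lines:
    tier = triage(p48, p15)
    if tier != "green":
        by_region[region].append((line, tier))
print(dict(by_region))  # {'north': [('line-1', 'red'), ('line-2', 'amber')]}
```

<p>Note that the model output only matters at the point where it crosses a threshold and changes a dispatch plan; everything else is noise to operations.</p>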
<p>Because the decision was so clear, we could measure value before full scale. The pilot cut detection time from days to hours, raised customer satisfaction significantly (we could warn customers about possible signal disruption early), and saved roughly £80K per month (by preventing complete DSL breakdowns, such as heat-related failures, and by planning routes instead of dispatching everywhere). It wasn’t an exotic model (or maybe it was; technical details some other day); it was a tightly scoped system that informed a clear action.</p>
<p>Notice how this maps to the <strong>Decision-First steps</strong>. The model existed to change a single operational choice. That focus made deployment possible and measurable fast.</p>
</section>
<section id="another-quick-example-inventory-decisions" class="level3">
<h3 class="anchored" data-anchor-id="another-quick-example-inventory-decisions">Another quick example: inventory decisions</h3>
<p>Inventory forecasting is a classic area where decision-first matters. You can chase lower MAPE and never change stocking policy. Or you can ask: what decision do merchandisers make with this forecast? When you frame it as “which SKUs do we order for next week, and what reorder points trigger expedited shipments,” you design the forecast differently: shorter horizons, bias for understock on fast movers, and direct constraints on reorder costs.</p>
<p>I’ve led projects that delivered $11M in inventory optimization by building forecasts and decision rules that match merchant behavior and supply constraints. The trick was not better models — it was framing forecasts so the merchandisers could act with confidence.</p>
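<p>One concrete way the decision framing changes the forecast is the classic newsvendor critical ratio: when understock costs more than overstock, you stock at a high quantile of demand rather than at the mean. A standard textbook sketch (not the actual project code; all numbers are illustrative):</p>

```python
def order_quantile(understock_cost, overstock_cost):
    """Newsvendor critical ratio: the demand quantile to stock at."""
    return understock_cost / (understock_cost + overstock_cost)

def reorder_point(mean_demand, std_demand, q):
    """Reorder point under a normal demand approximation (illustrative).
    A tiny z-score lookup stands in for a full inverse-CDF."""
    z = {0.5: 0.0, 0.8: 0.84, 0.9: 1.28, 0.95: 1.64}[round(q, 2)]
    return mean_demand + z * std_demand

# A fast mover: a stockout costs 8 per unit, excess stock costs 2 per unit.
q = order_quantile(understock_cost=8.0, overstock_cost=2.0)
print(q)                                                   # 0.8
print(reorder_point(mean_demand=100, std_demand=20, q=q))  # ~116.8
```

<p>The forecast itself didn’t get smarter; the decision frame told us which side of the error to bias toward.</p>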
<p><br></p>
</section>
</section>
<section id="practical-tips-for-teams-do-this-in-week-one" class="level2">
<h2 class="anchored" data-anchor-id="practical-tips-for-teams-do-this-in-week-one">Practical tips for teams (do this in week one)</h2>
<ol type="1">
<li><strong>Run a one-hour Decision Map workshop.</strong> Invite the decision owner, one operator, one engineer, and one product owner. Build the one-page map. If the owner can’t commit to a metric, pause the project.</li>
<li><strong>Start with a simple rule baseline.</strong> Before modeling, define a rule that will be your baseline (e.g., “if X &gt; T, create ticket”). If the model can’t beat that rule in decision impact, scrap it.</li>
<li><strong>Measure decision impact, not model accuracy.</strong> Your dashboard should show business metric delta — not just RMSE. If you show the board a change in cost or conversion, you’ll get attention.</li>
<li><strong>Prioritize deployment constraints.</strong> Decide telemetry, latency, and handoff requirements first. Models that can’t meet latency or trust constraints are useless no matter how accurate.</li>
<li><strong>Iterate with real feedback.</strong> Don’t wait for “perfect.” Ship an MVP that can be measured, then refine. Real decisions provide labels and operational learning.</li>
</ol>
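<p>Tips 2 and 3 work together: score both the rule baseline and the candidate model on the business metric, not on accuracy. A hedged sketch with made-up costs and outcomes:</p>

```python
def dispatch_cost(decisions, faults, truck_roll=100.0, missed_fault=600.0):
    """Business cost of dispatch decisions: each dispatch costs a truck roll;
    each missed real fault costs far more in reactive repair."""
    cost = 0.0
    for dispatched, was_fault in zip(decisions, faults):
        if dispatched:
            cost += truck_roll
        elif was_fault:
            cost += missed_fault
    return cost

faults          = [True, False, True, False, False, True]
rule_decisions  = [True, True, False, True, False, True]   # "if X > T, ticket"
model_decisions = [True, False, True, False, False, True]  # candidate model

baseline = dispatch_cost(rule_decisions, faults)
model = dispatch_cost(model_decisions, faults)
print(baseline, model)  # 1000.0 300.0
# Ship the model only if it beats the rule on cost, not on accuracy.
print("ship" if model < baseline else "keep the rule")
```

<p>This is the dashboard your board actually wants: the cost delta against the rule, in currency, not an AUC curve.</p>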
<p><a href="../../ai-profit-os.html">Learn how to exactly do this here.</a></p>
<section id="common-objections-and-how-to-handle-them" class="level3">
<h3 class="anchored" data-anchor-id="common-objections-and-how-to-handle-them">Common objections and how to handle them</h3>
<ol type="1">
<li><p><strong><em>“But we don’t have a clear decision owner.”</em></strong> Then don’t build a model. Decisions live in roles. Pull the right owner in early, or you’ll build for nobody.</p></li>
<li><p><strong><em>“Our data is messy.”</em></strong> Fine. If you can define the minimal signals, you can often create a proxy or start with manual labels. Messy data is easier to handle when you only need a few signals for a specific decision.</p></li>
<li><p><strong><em>“We need predictions for many uses.”</em></strong> Build a simple decision-first pilot first. Use its success to fund broader platform work. Pilots create proof that unlocks investment.</p></li>
<li><p><strong><em>“Decision-focused methods are academic — too hard.”</em></strong> There’s truth and myth here. The academic techniques show big wins when the decision loss can be written down. But you don’t need complicated differentiable optimization to start. Use a decision map, simple thresholds, A/B tests, and iterative measurement. Graduate to decision-focused training once you have a stable objective. The research just tells us — unsurprisingly — that when you train with the decision in mind, outcomes improve (Optimization Online).</p></li>
</ol>
<p><br></p>
</section>
</section>
<section id="one-page-checklist-copy-this" class="level2">
<h2 class="anchored" data-anchor-id="one-page-checklist-copy-this">One-page checklist (copy this)</h2>
<ul>
<li>Decision name (the go-to problem statement): ____________________</li>
<li>Primary / secondary metric (can be used for ROI calculation later): _____</li>
<li>Decision owner (I don’t want to debate this): ________________________</li>
<li>Frequency of predictions: real-time / hourly / daily / weekly / monthly</li>
<li>Action / intervention the model can trigger: ____________________</li>
<li>Minimal signals required (the core data to begin with): ______________________</li>
<li>Baseline rule (your fail-safe if everything goes wrong): ____________________</li>
<li>Feedback source (might be tricky, but you should have one): __________________</li>
</ul>
<p>If you can fill this in, you’re set for a pilot.</p>
<p><br></p>
</section>
<section id="final-note-start-small-measure-fast-then-scale-with-discipline" class="level2">
<h2 class="anchored" data-anchor-id="final-note-start-small-measure-fast-then-scale-with-discipline">Final note — <br>start small, measure fast, then scale with discipline</h2>
<p>The decision-first approach is simple because business problems are simple when stated well. The hard part is discipline: saying no to shiny demos and yes to measurable change. Start with one decision that matters. Map it. Ship a small system that changes behavior. Measure the business metric. Iterate.</p>
<p>Research supports this: models trained with the decision in mind perform better on the actual outcomes you care about (arXiv). And in practice, teams that flip the order — decision first, data second — get to value faster.</p>
<p>If you want help mapping a decision in your company, send me one line describing the decision and the current process. I’ll reply with the Decision Map template you can use in a one-hour workshop. Or, if you prefer, we can connect directly.</p>
<p>If this post helped you, or you think it could help someone, please share it. Thanks!</p>
<section id="key-sources-further-reading" class="level3">
<h3 class="anchored" data-anchor-id="key-sources-further-reading">Key sources &amp; further reading</h3>
<ul>
<li>Elmachtoub, A. N., &amp; Grigas, P. — “Smart ‘Predict, then Optimize’” (foundational paper on decision-focused loss and SPO), arXiv.</li>
<li>Reviews and recent work on decision-focused learning (predict-and-optimize / decision-focused methods), arXiv.</li>
<li>Harvard Business Review — “Decisions Don’t Start with Data” — on why framing the decision matters.</li>
</ul>
<p><strong>(And — the telecom and inventory examples referenced above are from <a href="../../../case-studies/index.qmd">projects</a>; <a href="https://cal.com/kriyalytics/strategy">let’s talk, if you want more details !!</a> )</strong></p>
</section>
</section>

 ]]></description>
  <category>strategy</category>
  <category>business</category>
  <guid>https://jitinkapila.com/writing/decision-first-ai/</guid>
  <pubDate>Wed, 08 Oct 2025 18:30:00 GMT</pubDate>
  <media:content url="https://jitinkapila.com/writing/decision-first-ai/panel.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>The AI Umbrella: A Simple Guide to What Actually Matters</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/ai-umbrella/</link>
  <description><![CDATA[ 





<script type="application/ld+json">
[
  {
    "@context": "https://schema.org",
    "@type": "BlogPosting",
    "headline": "The AI Umbrella: A Simple Guide to What Actually Matters",
    "description": "The AI Umbrella framework — how to pick the right AI tool for your business problem and avoid the hype of forcing GenAI onto every use case.",
    "image": "https://jitinkapila.com/assets/img/main-umbrella.webp",
    "author": {
      "@type": "Person",
      "name": "Jitin Kapila",
      "url": "https://jitinkapila.com/about",
      "jobTitle": "AI Strategy Consultant",
      "worksFor": {
        "@type": "Organization",
        "name": "Kriyalytics",
        "url": "https://kriyalytics.com"
      }
    },
    "publisher": {
      "@type": "Organization",
      "name": "Jitin Kapila",
      "logo": {
        "@type": "ImageObject",
        "url": "https://jitinkapila.com/assets/logo.png"
      }
    },
    "datePublished": "2025-09-25",
    "dateModified": "2025-09-25",
    "mainEntityOfPage": {
      "@type": "WebPage",
      "@id": "https://jitinkapila.com/blog/ai-umbrella"
    },
    "articleSection": "AI Strategy",
    "keywords": ["AI Umbrella", "GenAI", "machine learning", "graph algorithms", "optimization", "AI framework", "AI strategy", "tool selection", "enterprise AI"],
    "wordCount": 1400,
    "about": [
      {
        "@type": "Thing",
        "name": "AI Umbrella Framework",
        "description": "A framework categorizing AI tools into 5 pillars — ML, Graph, Optimization, Human-Machine Interface, and Legacy — to help leaders select the right tool for each business problem"
      }
    ]
  },
  {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
      {
        "@type": "Question",
        "name": "What is the AI Umbrella framework?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The AI Umbrella is a categorization framework that maps all AI tools into 5 pillars: (1) Machine Learning — pattern recognition for prediction and classification; (2) Graph Algorithms — mapping relationships and networks; (3) Optimization & Planning — finding the best route, allocation, or schedule; (4) Human-Machine Interface — robotics and collaborative systems; and (5) Legacy/Foundational — expert systems and rule-based approaches now embedded in modern methods. The framework helps leaders match business problems to the right AI tool, rather than defaulting everything to GenAI."
        }
      },
      {
        "@type": "Question",
        "name": "Why is GenAI only 8.6% of the AI market?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "The total AI market is ~$235 billion. GenAI (ChatGPT-class language models) is $20.2 billion — just 8.6%. Traditional machine learning dominates at 91.4% ($194.6 billion) because it quietly runs the highest-ROI business applications: fraud detection, recommendation engines, supply chain optimization. GenAI is visible because it's consumer-facing, but enterprise value flows from traditional ML, which delivers up to 30% ROI versus GenAI's ~12%."
        }
      },
      {
        "@type": "Question",
        "name": "What ROI do different AI types deliver?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "Predictive maintenance delivers up to 400% ROI in 6 months. Fraud detection delivers 150% ROI within 9 months. Marketing automation using ML achieves up to 544% ROI annually. Traditional ML projects complete in 3-6 months. GenAI averages 12% ROI over the same period and takes 3-12 months with more tuning. The data is clear: for most business problems, traditional ML outperforms GenAI on ROI and speed."
        }
      },
      {
        "@type": "Question",
        "name": "Why do 95% of GenAI pilots fail to show real ROI?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "GenAI pilots fail because they apply the wrong tool to the wrong problem. LLMs are built for language probability — summarizing, drafting, answering questions. When applied to mathematical optimization, inventory prediction, or fraud detection, they hallucinate or underperform. The failure pattern is consistent: a team uses GenAI for a problem that requires ML or optimization, then blames the data or the implementation. The problem was the tool selection, not the execution."
        }
      },
      {
        "@type": "Question",
        "name": "How do you use the AI Umbrella to avoid wrong tool selection?",
        "acceptedAnswer": {
          "@type": "Answer",
          "text": "Before starting any AI project, ask two questions: 'What problem am I trying to solve?' and 'Which part of the AI Umbrella fits?' If the answer is 'predict what will happen,' use ML. If it's 'find the best route or allocation,' use optimization. If it's 'find relationships in a network,' use graph algorithms. If it's 'summarize or generate text,' use GenAI. The mistake most companies make is using the AI that made headlines instead of the one that solves the actual problem."
        }
      }
    ]
  },
  {
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "name": "Home",
        "item": "https://jitinkapila.com"
      },
      {
        "@type": "ListItem",
        "position": 2,
        "name": "Blog",
        "item": "https://jitinkapila.com/blog"
      },
      {
        "@type": "ListItem",
        "position": 3,
        "name": "AI Strategy",
        "item": "https://jitinkapila.com/blog/ai-strategy"
      },
      {
        "@type": "ListItem",
        "position": 4,
        "name": "AI Umbrella",
        "item": "https://jitinkapila.com/blog/strategy/ai-umbrella"
      }
    ]
  }
]
</script>

<p>AI is everywhere now, but most folks still see it as just ChatGPT or image generators. The truth is, AI covers much more ground. It runs in software, business processes, search engines, robotics, and more. In this post, here’s a plain view of the <strong>“AI Umbrella”</strong> — a way to see what’s really out there and how to use it to solve real problems, not just chase hype.</p>
<section id="what-is-the-ai-umbrella" class="level2">
<h2 class="anchored" data-anchor-id="what-is-the-ai-umbrella">What Is the AI Umbrella?</h2>
<p>AI is a big term. It wraps up everything from crunching numbers in Excel to self-driving cars and chatbots/AI Agents. This umbrella covers many areas, and each one is changing fast. Some ideas fade, some thrive, and others turn into new things. AI is not a magic trick — and it’s not new. For over sixty years, it’s helped businesses do work better, from simple data entry to complex automation.</p>
<section id="hype-vs.-reality" class="level4">
<h4 class="anchored" data-anchor-id="hype-vs.-reality">Hype vs.&nbsp;Reality</h4>
<p>Each cycle brings promises: “AI will change everything.” But those promises rarely deliver overnight. Just like Excel automated bookkeeping but didn’t end jobs, GenAI (like ChatGPT, Claude, Gemini) does not replace human work — it just shifts what we do. Jobs evolve. AI lets us do work faster, spot patterns, and solve tricky problems. But to really get value, businesses need a clear plan, not just headlines.</p>
</section>
</section>
<section id="the-ai-umbrella-framework" class="level2">
<h2 class="anchored" data-anchor-id="the-ai-umbrella-framework">The AI Umbrella Framework</h2>
<p>So how do you make AI work for real business needs? Use the “AI Umbrella” to map your problem and pick the right tool. Here’s a simple view:</p>
<p><img src="https://jitinkapila.com/writing/ai-umbrella/img/main-umbrella.webp" class="img-fluid" alt="AI Umbrella"> <!-- {.profile-image .w-50} --></p>
<p>It covers everything from classic number crunching to the newest breakthroughs. Think of AI like a big toolbox, not just one gadget. We’ll take a deeper dive into each pillar, with one small disclaimer: this is not an exhaustive list.</p>
<section id="machine-learning-ml" class="level3">
<h3 class="anchored" data-anchor-id="machine-learning-ml">Machine Learning (ML)</h3>
<p>Machine learning drives most <a href="../../case-studies/index.html">real business</a> value in AI today. It focuses on finding patterns in data to predict outcomes, classify information, or spot trends that humans might miss. Unlike the flashy GenAI tools, ML quietly runs recommendation engines, fraud detection systems, and supply chain optimization across industries. Here is a non-exhaustive view of Machine Learning (ML):</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/ml-umbrella.webp" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="ML Umbrella"><img src="https://jitinkapila.com/writing/ai-umbrella/img/ml-umbrella.webp" class="profile-image img-fluid figure-img" alt="ML Umbrella"></a></p>
<figcaption>ML Umbrella</figcaption>
</figure>
</div>
<p>Notice that GenAI sits at the bottom, while above it is the vast body of research that makes up a far bigger chunk of AI. Here are some nuggets for your thoughts:</p>
<ul>
<li><p><strong>Supervised learning</strong> delivers 25-30% ROI on average, with applications like fraud detection showing 150% ROI within 9 months. Marketing automation using ML achieves up to 544% ROI annually. This is all about finding answers with labeled data.</p></li>
<li><p><strong>Unsupervised methods</strong> excel at finding hidden patterns - clustering helps retail companies segment customers for personalized campaigns, while anomaly detection catches equipment failures before they happen. And this is all about spotting patterns in unlabeled data.</p></li>
</ul>
<p><strong>ROI Examples:</strong></p>
<ul>
<li><p><strong>Predictive maintenance:</strong> 400% ROI in 6 months by preventing costly equipment breakdowns</p></li>
<li><p><strong>Customer segmentation:</strong> E-commerce businesses see 27% increase in purchase likelihood through ML-powered personalization</p></li>
</ul>
</section>
<section id="graph-algorithms" class="level3">
<h3 class="anchored" data-anchor-id="graph-algorithms">Graph Algorithms</h3>
<p>Graph algorithms map connections between data points, revealing relationships that traditional databases can’t capture. Think of how Google’s PageRank revolutionized search by understanding link relationships, or how LinkedIn suggests connections by analyzing your professional network.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/graph-umbrella.webp" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="Graph Umbrella"><img src="https://jitinkapila.com/writing/ai-umbrella/img/graph-umbrella.webp" class="profile-image img-fluid figure-img" alt="Graph Umbrella"></a></p>
<figcaption>Graph Umbrella</figcaption>
</figure>
</div>
<p>Graph algorithms are used in a variety of ways:</p>
<ul>
<li><p><strong>Neo4j customers</strong> report 417% ROI over three years, with 20% improvement in business results and 60% faster time-to-value. Graph databases excel at real-time pattern recognition and complex relationship analysis.</p></li>
<li><p><strong>Financial services</strong> use graph algorithms for fraud detection by mapping transaction networks - one e-commerce company reduced fraud risk through real-time pattern detection of suspicious shipping routes.</p></li>
<li><p><strong>Maps:</strong> Google Maps uses Dijkstra’s and A* algorithms to find optimal routes through road networks, processing millions of nodes with real-time traffic data and ML predictions for dynamic rerouting.</p></li>
<li><p><strong>Telecom:</strong> BGP uses Bellman-Ford for inter-domain routing between networks, while OSPF runs Dijkstra’s algorithm within networks for shortest-path calculations and load balancing.</p></li>
</ul>
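<p>Since Dijkstra’s algorithm keeps coming up in these examples, here is a compact version on a toy road network (the graph and weights are made up for illustration):</p>

```python
import heapq

def dijkstra(graph, start, goal):
    """Shortest path by cumulative edge weight (e.g. travel time in minutes)."""
    queue = [(0, start, [start])]  # (cost so far, node, path taken)
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in seen:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []  # goal unreachable

roads = {
    "depot": [("A", 4), ("B", 2)],
    "B":     [("A", 1), ("C", 7)],
    "A":     [("C", 3)],
    "C":     [],
}
print(dijkstra(roads, "depot", "C"))  # (6, ['depot', 'B', 'A', 'C'])
```

<p>Production routing engines add real-time traffic, bidirectional search, and precomputed hierarchies, but the core idea is this priority-queue expansion.</p>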
<p><strong>ROI Examples:</strong></p>
<ul>
<li><p><strong>Fraud detection:</strong> Financial institutions prevent losses through network analysis, with ROI justified by avoided fraud costs</p></li>
<li><p><strong>Recommendation systems:</strong> Social platforms and e-commerce sites drive higher engagement through connection-based suggestions</p></li>
</ul>
</section>
<section id="optimization-planning" class="level3">
<h3 class="anchored" data-anchor-id="optimization-planning">Optimization &amp; Planning</h3>
<p>Optimization tackles the “best way” problems - finding the most efficient routes, optimal inventory levels, or ideal resource allocation. This field delivers some of the highest ROI because it directly cuts waste and improves efficiency in existing operations.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/optim-umbrella.webp" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="Optimization Umbrella"><img src="https://jitinkapila.com/writing/ai-umbrella/img/optim-umbrella.webp" class="profile-image img-fluid figure-img" alt="Optimization Umbrella"></a></p>
<figcaption>Optimization Umbrella</figcaption>
</figure>
</div>
<p><strong>ROI Examples:</strong></p>
<ul>
<li><p><strong>Route optimization</strong> shows 15-30% reduction in travel costs, with companies like cold drink bottlers cutting fuel costs by 12% while increasing shop coverage by 18%. CPG brands report 19% savings on field visit costs through AI-based routing.</p></li>
<li><p><strong>Supply chain optimization</strong> delivers 80% ROI within 12 months by reducing inventory costs and improving delivery performance. Manufacturing scheduling optimization can achieve 172% ROI through better resource utilization.</p></li>
<li><p><strong>Inventory management:</strong> 30% reduction in holding costs while improving on-time deliveries by 25%</p></li>
<li><p><strong>Production scheduling:</strong> Manufacturers see 20-30% cost reduction through optimized workflows</p></li>
</ul>
</section>
<section id="human-machine-interface" class="level3">
<h3 class="anchored" data-anchor-id="human-machine-interface">Human-Machine Interface</h3>
<p>This pillar focuses on how humans and machines work together - from industrial robotics to neural interfaces to the phone in your pocket to whatever the future makes possible. It’s about extending human capabilities rather than replacing them, creating hybrid systems that combine human judgment with machine precision.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/hmi-umbrella.webp" class="lightbox" data-gallery="quarto-lightbox-gallery-4" title="Human-Machine Interface"><img src="https://jitinkapila.com/writing/ai-umbrella/img/hmi-umbrella.webp" class="profile-image img-fluid figure-img" alt="Human-Machine Interface"></a></p>
<figcaption>Human-Machine Interface</figcaption>
</figure>
</div>
<p><strong>ROI Examples:</strong></p>
<ul>
<li><p><strong>Industrial robotics</strong> typically delivers ROI within 3-5 years, with autonomous mobile robots showing quick implementation and measurable labor savings. BMW reported 25% reduction in production time and 30% cut in operational costs within two years.</p></li>
<li><p><strong>Warehouse automation</strong> ranges from $5-15 million for semi-automated systems, with companies seeing 24/7 operation capabilities and reduced human error rates.</p></li>
<li><p><strong>Factory automation:</strong> 20% average productivity increase with significant operational cost reductions</p></li>
<li><p><strong>Customer service automation:</strong> 30% reduction in support costs with faster query resolution</p></li>
</ul>
</section>
<section id="faded-but-foundational-ai" class="level3">
<h3 class="anchored" data-anchor-id="faded-but-foundational-ai">Faded but Foundational AI</h3>
<p>Early AI systems like expert systems and rule-based engines laid the foundation for today’s AI. While many faded due to rigidity and high maintenance costs, their core principles live on in modern decision support systems and automated workflows.</p>
<p><strong>Some thoughts:</strong></p>
<ul>
<li><p>Expert systems peaked in the 1980s but declined due to scalability issues and inability to handle ambiguity. However, their legacy continues in clinical decision support systems and business automation tools.</p></li>
<li><p>Clinical decision support: Healthcare systems still use rule-based approaches combined with ML for diagnosis assistance</p></li>
<li><p>Business automation: Modern workflow systems trace back to expert system principles, delivering consistent decision-making in structured environments</p></li>
</ul>
<p>Older methods like rule-based systems, expert systems, and fuzzy logic now live inside more advanced techniques like tree-based ML and probabilistic models. <strong>Legacy system integration</strong> matters too: 95% of GenAI pilots fail to show real ROI, often because they ignore lessons learned from early AI implementations.</p>
</section>
</section>
<section id="the-market-numbers" class="level2">
<h2 class="anchored" data-anchor-id="the-market-numbers">The Market Numbers</h2>
<p>Here’s what the data says about the real business side of AI:</p>
<ul>
<li>The whole AI market is about $235 billion.</li>
<li>GenAI (think ChatGPT) is just 8.6% ($20.2 billion). The rest — traditional ML — is 91.4% ($194.6 billion).</li>
<li>Traditional AI gets better results: up to 30% ROI, while GenAI averages 12%.</li>
<li>Predictive maintenance delivers up to 400% ROI in six months. Fraud detection? 150% in under a year. LLMs do much less: about 12% over the same time.</li>
</ul>
<section id="what-makes-ai-projects-work" class="level3">
<h3 class="anchored" data-anchor-id="what-makes-ai-projects-work">What Makes AI Projects Work?</h3>
<ul>
<li>Using external tools yields a 67% success rate.</li>
<li>Internal builds succeed only 33% of the time.</li>
<li>Many companies spend most of their AI budget on sales and marketing use cases, but the real savings come from automating back-office work.</li>
<li>Traditional machine learning projects finish in 3-6 months. GenAI takes 3-12 months and needs more tuning.</li>
</ul>
</section>
</section>
<section id="putting-it-all-together" class="level2">
<h2 class="anchored" data-anchor-id="putting-it-all-together">Putting It All Together</h2>
<p>Before starting any AI project, ask: <a href="../../writing/decision-first-ai/index.html">What problem am I trying to solve?</a> Use the AI Umbrella map to find what fits. Is it prediction, classification, pattern finding, or optimization? Do you need machine learning, deep learning, or just smarter software? This clarity saves money and time, and it is what produces actual results.</p>
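<p>As a rough illustration, that triage step can be written down as a lookup. The mapping below is a hypothetical sketch — the categories and tool picks are illustrative, not an exhaustive taxonomy:</p>

```python
# Hypothetical sketch: map a problem type to its usual tool family.
# The categories and picks below are illustrative, not exhaustive.
TOOL_MAP = {
    "prediction": "regression / gradient-boosted trees",
    "classification": "classifiers (logistic regression, random forests)",
    "pattern finding": "clustering / anomaly detection",
    "optimization": "linear or constraint programming",
    "content generation": "GenAI / large language models",
}

def triage(problem_type: str) -> str:
    """Return the usual tool family, or flag the problem for restating."""
    return TOOL_MAP.get(problem_type.strip().lower(),
                        "unclear: restate the problem first")

print(triage("Classification"))  # classifiers (logistic regression, random forests)
```

<p>The point is the lookup itself: if you cannot name the row your problem belongs to, you are not ready to buy a tool for it.</p>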
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/ai-flow.webp" class="lightbox" data-gallery="quarto-lightbox-gallery-5" title="AI Decision Flow"><img src="https://jitinkapila.com/writing/ai-umbrella/img/ai-flow.webp" class="profile-image img-fluid figure-img" alt="AI Decision Flow"></a></p>
<figcaption>AI Decision Flow</figcaption>
</figure>
</div>
<p>Next time someone pitches an AI solution, ask “Which part of the AI umbrella are we using?” and “How does it solve our actual problem?” (or <a href="../../ai-profit-os.html">learn to ask the specific questions for your use case</a>). Ask for clarity, not hype, and use the framework above as your map. <strong>AI isn’t a buzzword — it’s a toolbox.</strong> The right tool solves the right problem. And that’s how you make AI work for your business, today and tomorrow.</p>
<p>This is just the start. I’m building an Enterprise AI series that shows how to combine these AI pillars for real business impact.</p>
<p>Up next: how one manufacturing company used ML, optimization, and robotics together to cut costs by 40%. <a href="https://aicrosscurrent.substack.com/">Subscribe to see the full breakdown</a> or <a href="../../work-with-me.html">book a strategic call</a>.</p>
<!-- ::: {.column-screen .bg-navy}

<br>




::: {.column-page}

### Subscribe to Weekly AI Decision Brief 

::: {.grid .column-page-inset .padded}

::: {.g-col-12 .g-col-md-6 style="margin: auto;"}
One sharp insight on making AI work for your business — every week. Frameworks from actual deployments. Case studies with real numbers. The questions your AI vendor hopes you never ask.

No hype. No vendor pitch. Written for operations leaders, not technology teams.
:::

::: {.g-col-0 .g-col-md-1}
:::

::: {.g-col-12 .g-col-md-5}


```{=html}
<form
  method="post"                                                                                                                                         action="https://systeme.io/embedded/39549717/subscription"
> 

    <label for="bd-first-name">First Name</label>
    <input
    class="form-control"
    placeholder="Your first name"
    type="text" name="first_name" id="bd-first-name" />

    <label for="bd-email">Email Address</label>                                                                                                           <input
    class="form-control"
    placeholder="your@email.here" 
    type="email" name="email" id="bd-email" />

    <input 
    class="btn btn-primary"
    style="margin-top: 1em;"
    type="submit" value="Subscribe" />
</form>
```
:::

:::

<br>

<br>

:::



<!-- <form
  method="post" 
  action="https://systeme.io/embedded/36566252/subscription"
>
  <label for="bd-email">Email Address</label>
  <input 
  class="form-control"
  placeholder="your@email.here" 
  type="email" name="email" id="bd-email" />
  
  <input 
  class="btn btn-primary"
  style="margin-top: 1em;"
  type="submit" value="Subscribe" />
</form>
 -->
</section>

]]></description>
  <category>strategy</category>
  <category>business</category>
  <guid>https://jitinkapila.com/writing/ai-umbrella/</guid>
  <pubDate>Wed, 24 Sep 2025 18:30:00 GMT</pubDate>
  <media:content url="https://jitinkapila.com/writing/ai-umbrella/img/AI-umbrealla-v2.png" medium="image" type="image/png" height="97" width="144"/>
</item>
<item>
  <title>TrieKNN: Unleashing KNN’s Power on Mixed Data Types</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/engineering/10_treeknn/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="pexels-gelgas-401213.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Photo by Gelgas Airlangga"><img src="https://jitinkapila.com/writing/engineering/10_treeknn/pexels-gelgas-401213.jpg" class="img-fluid figure-img" alt="Photo by Gelgas Airlangga"></a></p>
<figcaption><a href="https://www.pexels.com/photo/shallow-focus-of-sprout-401213/">Photo by Gelgas Airlangga</a></figcaption>
</figure>
</div>
<!-- [Photo by Anna Tarazevich](https://www.pexels.com/photo/strawberry-plant-on-a-black-container-7299985/)

[Photo by Eva Bronzini](https://www.pexels.com/photo/succulent-plants-in-pot-shaped-soil-7127801/) -->
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>In This Post
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li>We’ll dissect the limitations of traditional KNN when faced with mixed data types.</li>
<li>Introduce TrieKNN, a Trie-based approach that elegantly handles mixed data.</li>
<li>Walk through the implementation and training of a TrieKNN model.</li>
<li>Evaluate its performance and discuss its potential impact.</li>
</ul>
</div>
</div>
<section id="the-allure-and-limitation-of-knn" class="level2">
<h2 class="anchored" data-anchor-id="the-allure-and-limitation-of-knn">The Allure and Limitation of KNN</h2>
<p>In the realm of machine learning, the K-Nearest Neighbors (KNN) algorithm stands out for its intuitive nature and ease of implementation. Its principle is simple: classify a data point based on the majority class among its ‘k’ nearest neighbors in the feature space. This non-parametric approach makes no assumptions about the underlying data distribution, rendering it versatile for various applications. That simplicity has made <a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">KNN</a> very popular, but it comes with limitations.</p>
<p>However, KNN’s Achilles’ heel lies in its reliance on distance metrics, which are inherently designed for numerical data. Real-world datasets often contain a mix of numerical and categorical features, posing a significant challenge for KNN. How do you measure the distance between ‘red’ and ‘blue,’ or ‘large’ and ‘small’?</p>
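<p>To make the limitation concrete, here is a minimal, pure-Python sketch: a plain Euclidean metric handles numeric features fine but is simply undefined once a categorical string appears in the vector.</p>

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance -- defined only for numeric features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean([1.0, 2.0], [4.0, 6.0]))  # 5.0

try:
    euclidean([1.0, "red"], [4.0, "blue"])  # 'red' - 'blue' has no meaning
except TypeError as exc:
    print(f"mixed features break the metric: {exc}")
```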
<section id="prior-art" class="level3">
<h3 class="anchored" data-anchor-id="prior-art">Prior Art</h3>
<p>Several strategies have been proposed to adapt KNN for mixed data:</p>
<ul>
<li><strong>One-Hot Encoding:</strong> Converts categorical features into numerical vectors, but can lead to high dimensionality.</li>
<li><strong>Distance Functions for Mixed Data:</strong> Develops and apply custom distance metrics that can handle both numerical and categorical features such as <a href="https://conservancy.umn.edu/server/api/core/bitstreams/845f587d-079a-469b-97e9-411533fa666d/content">HEOM and many others</a>.</li>
<li><strong>Using mean/mode values</strong>: Replace the missing values with mean/mode.</li>
</ul>
<p>These methods often involve compromises, either distorting the data’s inherent structure or adding computational overhead.</p>
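<p>A sketch of the second strategy shows where the compromise lives. This is a simplified HEOM-style metric (assumptions: numeric features are pre-scaled to [0, 1], and categorical features contribute a 0/1 “overlap” term); the real HEOM definition also handles missing values and range normalization.</p>

```python
def heom(a, b, is_categorical):
    """Simplified HEOM-style distance: 0/1 overlap for categorical
    features, absolute difference for numerics (assumed scaled to [0, 1])."""
    total = 0.0
    for x, y, cat in zip(a, b, is_categorical):
        d = (0.0 if x == y else 1.0) if cat else abs(x - y)
        total += d ** 2
    return total ** 0.5

# Two records shaped as (color, size_scaled):
print(round(heom(("red", 0.2), ("red", 0.5), (True, False)), 3))   # 0.3
print(heom(("red", 0.2), ("blue", 0.2), (True, False)))            # 1.0
```

<p>The compromise is visible in the output: every categorical mismatch costs exactly 1.0, regardless of how “far apart” the two categories really are.</p>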
</section>
</section>
<section id="enter-trieknn-a-novel-approach" class="level2">
<h2 class="anchored" data-anchor-id="enter-trieknn-a-novel-approach">Enter TrieKNN: A Novel Approach</h2>
<p>What if we could cleverly sidestep the distance calculation problem for categorical features, while still leveraging KNN’s power? TrieKNN offers just that—a way to perform KNN on any mixed data!</p>
<p>TrieKNN combines the strengths of Trie data structures and KNN to handle mixed data types gracefully. Here’s the core idea:</p>
<ol type="1">
<li><strong>Trie-Based Categorical Encoding:</strong> A Trie is used to store the categorical features of the data. Each node in the Trie represents a category.</li>
<li><strong>Leaf-Node KNN Models:</strong> At the leaf nodes of the Trie, where specific combinations of categorical features are found, we fit individual KNN models using only the numerical features.</li>
<li><strong>Weighted Prediction:</strong> To classify a new data point, we traverse the Trie based on its categorical features. At each level, we calculate a weighted distance based on available data, ending in a probability score in each leaf node.</li>
</ol>
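<p>Those three steps can be sketched end to end. This is a deliberately minimal version (a dict keyed by the categorical tuple stands in for the Trie, and the leaf “model” is a plain majority-vote KNN); the full implementation below adds per-node counts and weighting.</p>

```python
from collections import Counter, defaultdict

def fit_trie_knn(rows, labels):
    """Group rows by their categorical tuple (the 'Trie path');
    each leaf keeps only the numeric parts plus the labels."""
    leaves = defaultdict(list)
    for (cats, nums), y in zip(rows, labels):
        leaves[cats].append((nums, y))
    return leaves

def predict(leaves, cats, nums, k=3):
    """Walk to the leaf for `cats`, then majority-vote over the
    k numerically nearest neighbors stored there."""
    leaf = leaves.get(cats)
    if leaf is None:
        return None  # unseen categorical combination
    nearest = sorted(leaf, key=lambda item: sum((a - b) ** 2
                                                for a, b in zip(item[0], nums)))
    return Counter(y for _, y in nearest[:k]).most_common(1)[0][0]

rows = [(("red",), (1.0,)), (("red",), (1.2,)),
        (("red",), (5.0,)), (("blue",), (1.1,))]
labels = ["A", "A", "B", "B"]
leaves = fit_trie_knn(rows, labels)
print(predict(leaves, ("red",), (1.1,), k=2))  # A
```

<p>Notice that the categorical features never enter a distance computation at all — they only select which leaf’s numeric neighbors get compared.</p>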
<section id="why-this-works" class="level3">
<h3 class="anchored" data-anchor-id="why-this-works">Why This Works</h3>
<ul>
<li><strong>No Direct Distance Calculation for Categorical Features:</strong> The Trie structure implicitly captures the relationships between categorical values.</li>
<li><strong>Localized KNN Models:</strong> By fitting KNN models at the leaf nodes, we ensure that distance calculations are performed only on relevant numerical features.</li>
<li><strong>Scalability:</strong> The Trie structure efficiently handles a large number of categorical features and values.</li>
</ul>
</section>
</section>
<section id="building-a-trieknn-model" class="level2">
<h2 class="anchored" data-anchor-id="building-a-trieknn-model">Building a TrieKNN Model</h2>
<p>Let’s dive into the implementation. We’ll start with the <code>TrieNode</code> and <code>Trie</code> classes, then move on to the KNN model and the training/prediction process.</p>
<section id="trie-implementation" class="level3">
<h3 class="anchored" data-anchor-id="trie-implementation">Trie Implementation</h3>
<div id="1ded8041" class="cell" data-execution_count="2">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> collections <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Counter</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> TrieNode:</span>
<span id="cb1-5">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-6">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.children <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Dictionary to store child nodes</span></span>
<span id="cb1-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.is_end_of_word <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># True if the node is the end of a word</span></span>
<span id="cb1-8">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Count of how many times a word has been inserted</span></span>
<span id="cb1-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.class_counts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Class counts</span></span>
<span id="cb1-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.class_weights <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {}</span>
<span id="cb1-11">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Model at leaf nodes</span></span>
<span id="cb1-12">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.indexes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Store data indexes belonging to this leaf</span></span>
<span id="cb1-13">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Store data indexes belonging to this leaf</span></span>
<span id="cb1-14">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.node_weight <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb1-15"></span>
<span id="cb1-16"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Trie:</span>
<span id="cb1-17">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-18">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> TrieNode()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Root node of the Trie</span></span>
<span id="cb1-19">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.data_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Initialize data index</span></span>
<span id="cb1-20"></span>
<span id="cb1-21">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> insert(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, word_val, model):</span>
<span id="cb1-22">        current_node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root</span>
<span id="cb1-23">        word, val <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> word_val</span>
<span id="cb1-24">        current_node.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-25"></span>
<span id="cb1-26">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Adding class counts</span></span>
<span id="cb1-27">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> val <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> current_node.class_counts:</span>
<span id="cb1-28">            current_node.class_counts[val] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-29">        current_node.class_counts[val] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-30"></span>
<span id="cb1-31">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> char <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> word:</span>
<span id="cb1-32">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If the character is not in children, add a new TrieNode</span></span>
<span id="cb1-33">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> char <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> current_node.children:</span>
<span id="cb1-34">                current_node.children[char] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> TrieNode()</span>
<span id="cb1-35">            current_node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> current_node.children[char]</span>
<span id="cb1-36"></span>
<span id="cb1-37">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Adding count of instances</span></span>
<span id="cb1-38">            current_node.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-39"></span>
<span id="cb1-40">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># adding class counts</span></span>
<span id="cb1-41">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> val <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> current_node.class_counts:</span>
<span id="cb1-42">                current_node.class_counts[val] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-43">            current_node.class_counts[val] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb1-44"></span>
<span id="cb1-45">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Mark the end of the word and increment count</span></span>
<span id="cb1-46">        current_node.is_end_of_word <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb1-47">        current_node.indexes.append(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.data_index)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Store the data index</span></span>
<span id="cb1-48">        current_node.labels.append(val)</span>
<span id="cb1-49">        current_node.model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model</span>
<span id="cb1-50">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.data_index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Increment data index</span></span>
<span id="cb1-51"></span>
<span id="cb1-52">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> search(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, word):</span>
<span id="cb1-53">        current_node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root</span>
<span id="cb1-54">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> char <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> word:</span>
<span id="cb1-55">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If the character doesn't exist in the children, the word doesn't exist</span></span>
<span id="cb1-56">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> char <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> current_node.children:</span>
<span id="cb1-57">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb1-58">            current_node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> current_node.children[char]</span>
<span id="cb1-59"></span>
<span id="cb1-60">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Return True if it's the end of a word and the word exists</span></span>
<span id="cb1-61">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> current_node.is_end_of_word</span>
<span id="cb1-62"></span>
<span id="cb1-63">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> count_word(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, word):</span>
<span id="cb1-64">        current_node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root</span>
<span id="cb1-65">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> char <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> word:</span>
<span id="cb1-66">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If the character doesn't exist, the word doesn't exist</span></span>
<span id="cb1-67">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> char <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> current_node.children:</span>
<span id="cb1-68">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, current_node.class_counts  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Correctly return class_counts</span></span>
<span id="cb1-69">            current_node <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> current_node.children[char]</span>
<span id="cb1-70"></span>
<span id="cb1-71">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Return the count of the word</span></span>
<span id="cb1-72">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> current_node.count, current_node.class_counts</span>
<span id="cb1-73"></span>
<span id="cb1-74">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> display(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>):</span>
<span id="cb1-75">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Recursively display the tree</span></span>
<span id="cb1-76">        <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _display(node, word):</span>
<span id="cb1-77">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> node.is_end_of_word:</span>
<span id="cb1-78">                <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Data: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>word<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Count: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>node<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>count<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">, Indexes: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(node.indexes)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> Classes :</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>node<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>class_counts<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> weights:</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(node.class_weights)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display indexes too</span></span>
<span id="cb1-79">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> char, child <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> node.children.items():</span>
<span id="cb1-80">                _display(child, word <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> char)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># corrected the display</span></span>
<span id="cb1-81"></span>
<span id="cb1-82">        _display(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)</span>
<span id="cb1-83"></span>
<span id="cb1-84">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, func):</span>
<span id="cb1-85">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb1-86"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Applies a function to all models in the leaf nodes.</span></span>
<span id="cb1-87"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb1-88">        <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _apply(node):</span>
<span id="cb1-89">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> node.is_end_of_word <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> node.model <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb1-90">                func(node)</span>
<span id="cb1-91">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> node.children.values():</span>
<span id="cb1-92">                _apply(child)</span>
<span id="cb1-93"></span>
<span id="cb1-94">        _apply(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root)</span>
<span id="cb1-95"></span>
<span id="cb1-96">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> apply_weight_to_indexes(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, weight):</span>
<span id="cb1-97">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb1-98"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        Applies a weight to the indexes based on the percentage of data available.</span></span>
<span id="cb1-99"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">        """</span></span>
<span id="cb1-100">        <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _apply_weight_to_indexes(node):</span>
<span id="cb1-101">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> node.is_end_of_word:</span>
<span id="cb1-102">                total_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root.children[child].count <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root.children)</span>
<span id="cb1-103">                percentage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> node.count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> total_count <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> total_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-104">                weighted_indexes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [(index, weight <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> percentage) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> index <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> node.indexes]</span>
<span id="cb1-105">                node.class_weights <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> weighted_indexes  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># store (index, weight) pairs on the node</span></span>
<span id="cb1-106">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> node.children.values():</span>
<span id="cb1-107">                _apply_weight_to_indexes(child)</span>
<span id="cb1-108"></span>
<span id="cb1-109">        _apply_weight_to_indexes(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.root)</span></code></pre></div></div>
</details>
</div>
</section>
<section id="knn-model" class="level3">
<h3 class="anchored" data-anchor-id="knn-model">KNN Model</h3>
<div id="a15022ff" class="cell" data-execution_count="3">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> KNNModel:</span>
<span id="cb2-2">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, k<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>):</span>
<span id="cb2-3">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span></span>
<span id="cb2-4">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb2-5">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.k <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> k</span>
<span id="cb2-6"></span>
<span id="cb2-7">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> fit(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, data, indexes, labels):</span>
<span id="cb2-8">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># print("Fitting model with indexes:", len(indexes), "labels:", len(labels))</span></span>
<span id="cb2-9">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data[indexes].astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>)</span>
<span id="cb2-10">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labels <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(labels).astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>)</span>
<span id="cb2-11"></span>
<span id="cb2-12">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> predict(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, data):</span>
<span id="cb2-13">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># print("Predicting with data:", data)</span></span>
<span id="cb2-14">        dist_ind <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.sqrt(np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>((<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> data) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># euclidean distance</span></span>
<span id="cb2-15">        main_arr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.column_stack((<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.labels, dist_ind))  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># labels with distance</span></span>
<span id="cb2-16">        main <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> main_arr[main_arr[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].argsort()]  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># sorting based on distance</span></span>
<span id="cb2-17">        count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Counter(main[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.k, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>])  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># counting labels</span></span>
<span id="cb2-18">        sums <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(count.values()))  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># getting counts</span></span>
<span id="cb2-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> sums <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(sums)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># prediction as probability</span></span></code></pre></div></div>
</details>
</div>
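<p>Stripped of the class scaffolding, the prediction path is just distance, sort, and vote. A minimal standalone walkthrough of those steps (toy numbers of my own, not from the notebook):</p>

```python
import numpy as np
from collections import Counter

# Toy numeric block and labels, mirroring the steps inside KNNModel.predict
train = np.array([[3.0, 5.0], [2.9, 5.1], [7.0, 1.0], [6.8, 0.9], [3.1, 4.9]])
labels = np.array([0, 0, 1, 1, 0])
query = np.array([3.0, 5.0])
k = 3

dist = np.sqrt(np.sum((train - query) ** 2, axis=1))  # euclidean distances
nearest = labels[dist.argsort()][:k]                  # labels of the k closest rows
votes = Counter(nearest)                              # class counts among neighbours
probs = {c: v / k for c, v in votes.items()}          # vote shares as probabilities
```

<p>Here the three nearest rows all carry label 0, so the vote share for class 0 is 1.0; note that because ranking by squared distance and by distance give the same order, the <code>np.sqrt</code> only matters if you report the distances themselves.</p>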
</section>
<section id="training-and-evaluation" class="level3">
<h3 class="anchored" data-anchor-id="training-and-evaluation">Training and Evaluation</h3>
<p>Here’s how we train and evaluate the TrieKNN model:</p>
<div id="ce2ea5e7" class="cell" data-execution_count="4">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Sample data</span></span>
<span id="cb3-2">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span></span>
<span id="cb3-3">data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array((np.random.choice([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Anything '</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'By '</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Chance '</span>], p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.6</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>,<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>],size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n),</span>
<span id="cb3-4">                 np.random.choice([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'can'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'go'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'here'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lets'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'see'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"it"</span>], p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>], size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n),</span>
<span id="cb3-5">                 np.random.normal(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n),</span>
<span id="cb3-6">                 np.random.normal(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n))).T</span>
<span id="cb3-7">y_label <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.choice([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>], size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n)</span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Trie training</span></span>
<span id="cb3-10">trie <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> Trie()</span>
<span id="cb3-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> X, y <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(data, y_label):</span>
<span id="cb3-12">    trie.insert((X[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>], y),<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>)</span>
<span id="cb3-13"></span>
<span id="cb3-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Apply weights to indexes</span></span>
<span id="cb3-15">trie.apply_weight_to_indexes(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb3-16"></span>
<span id="cb3-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit models of leaf nodes</span></span>
<span id="cb3-18"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add_model(node, data):</span>
<span id="cb3-19">    node.model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KNNModel()</span>
<span id="cb3-20">    node.model.fit(data, node.indexes, node.labels)</span>
<span id="cb3-21"></span>
<span id="cb3-22"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> traverse_and_add_model(node, data):</span>
<span id="cb3-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> node.is_end_of_word:</span>
<span id="cb3-24">        add_model(node, data)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add model to leaf node</span></span>
<span id="cb3-25">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> child <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> node.children.values():</span>
<span id="cb3-26">        traverse_and_add_model(child, data)</span>
<span id="cb3-27"></span>
<span id="cb3-28">traverse_and_add_model(trie.root, data[:, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>:])</span></code></pre></div></div>
</details>
</div>
</section>
<section id="explanation" class="level3">
<h3 class="anchored" data-anchor-id="explanation">Explanation</h3>
<ul>
<li>We create sample data with mixed categorical and numerical features.</li>
<li>We insert each data point into the Trie, using the categorical features as the path.</li>
<li>After the Trie is built, we traverse it and fit a KNN model to the data points stored at each leaf node.</li>
<li>Finally, we can predict the class of new data points by traversing the Trie and using the KNN model at the corresponding leaf node.</li>
</ul>
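<p>The whole flow above can be condensed into a self-contained sketch. This is a simplified stand-in, not the notebook's implementation: it keys leaves on whole categorical tuples instead of building a character-level Trie, and the name <code>TrieKNNSketch</code> is illustrative.</p>

```python
import numpy as np

class TrieKNNSketch:
    """Route rows by their categorical features, then run KNN on numeric ones."""

    def __init__(self, k=5):
        self.k = k
        self.leaves = {}   # categorical path (tuple) -> list of row indexes
        self.num = None
        self.labels = None

    def fit(self, cat, num, y):
        self.num = np.asarray(num, dtype=float)
        self.labels = np.asarray(y)
        for i, path in enumerate(map(tuple, cat)):
            self.leaves.setdefault(path, []).append(i)

    def predict_proba(self, cat_row, num_row, classes):
        idx = self.leaves.get(tuple(cat_row), [])
        if not idx:  # unseen categorical path: fall back to a uniform guess
            return np.full(len(classes), 1.0 / len(classes))
        d = np.sum((self.num[idx] - np.asarray(num_row, float)) ** 2, axis=1)
        top = self.labels[np.asarray(idx)[np.argsort(d)[: self.k]]]
        counts = np.array([(top == c).sum() for c in classes], dtype=float)
        return counts / counts.sum()
```

<p>Keying on whole tuples sidesteps prefix collisions between category strings, at the cost of the prefix-sharing (and level-wise weighting) the real Trie provides.</p>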
</section>
</section>
<section id="results-and-discussion" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="results-and-discussion">Results and Discussion</h2>
<p>Let us display the trie.</p>

<div class="no-row-height column-margin column-container"><div class="">
<div id="a01dcaab" class="cell" data-execution_count="5">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1">trie.display()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Data: Anything lets, Count: 1260, Indexes: 1260 Classes :{np.int64(0): 882, np.int64(1): 378} weights:1260
Data: Anything see, Count: 2364, Indexes: 2364 Classes :{np.int64(0): 1710, np.int64(1): 654} weights:2364
Data: Anything go, Count: 616, Indexes: 616 Classes :{np.int64(0): 425, np.int64(1): 191} weights:616
Data: Anything can, Count: 606, Indexes: 606 Classes :{np.int64(0): 401, np.int64(1): 205} weights:606
Data: Anything it, Count: 584, Indexes: 584 Classes :{np.int64(0): 416, np.int64(1): 168} weights:584
Data: Anything here, Count: 619, Indexes: 619 Classes :{np.int64(1): 170, np.int64(0): 449} weights:619
Data: Chance see, Count: 1170, Indexes: 1170 Classes :{np.int64(1): 376, np.int64(0): 794} weights:1170
Data: Chance here, Count: 334, Indexes: 334 Classes :{np.int64(0): 232, np.int64(1): 102} weights:334
Data: Chance lets, Count: 562, Indexes: 562 Classes :{np.int64(0): 387, np.int64(1): 175} weights:562
Data: Chance it, Count: 270, Indexes: 270 Classes :{np.int64(0): 193, np.int64(1): 77} weights:270
Data: Chance can, Count: 291, Indexes: 291 Classes :{np.int64(0): 195, np.int64(1): 96} weights:291
Data: Chance go, Count: 310, Indexes: 310 Classes :{np.int64(0): 229, np.int64(1): 81} weights:310
Data: By can, Count: 99, Indexes: 99 Classes :{np.int64(0): 64, np.int64(1): 35} weights:99
Data: By lets, Count: 210, Indexes: 210 Classes :{np.int64(0): 138, np.int64(1): 72} weights:210
Data: By see, Count: 385, Indexes: 385 Classes :{np.int64(1): 109, np.int64(0): 276} weights:385
Data: By go, Count: 115, Indexes: 115 Classes :{np.int64(1): 31, np.int64(0): 84} weights:115
Data: By here, Count: 104, Indexes: 104 Classes :{np.int64(0): 73, np.int64(1): 31} weights:104
Data: By it, Count: 101, Indexes: 101 Classes :{np.int64(0): 73, np.int64(1): 28} weights:101</code></pre>
</div>
</div>
</div></div><p>The model predicted the following values:</p>

<div class="no-row-height column-margin column-container"><div class="">
<div id="e3579960" class="cell" data-execution_count="6">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Prediction example</span></span>
<span id="cb6-2"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> predict_with_model(node):</span>
<span id="cb6-3">    predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> node.model.predict(np.array([<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>]))</span>
<span id="cb6-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Predictions:"</span>, predictions)</span>
<span id="cb6-5"></span>
<span id="cb6-6">trie.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(predict_with_model)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Predictions: [1.]
Predictions: [0.8 0.2]
Predictions: [1.]
Predictions: [0.6 0.4]
Predictions: [0.6 0.4]
Predictions: [0.6 0.4]
Predictions: [0.8 0.2]
Predictions: [0.6 0.4]
Predictions: [0.8 0.2]
Predictions: [0.2 0.8]
Predictions: [0.2 0.8]
Predictions: [0.2 0.8]
Predictions: [0.2 0.8]
Predictions: [0.8 0.2]
Predictions: [0.2 0.8]
Predictions: [1.]
Predictions: [0.8 0.2]
Predictions: [0.8 0.2]</code></pre>
</div>
</div>
</div></div><p>The predictions vary from run to run because the sample data is random. Even so, they show that KNN can handle mixed data types: the categorical features route each point to a leaf, and the numerical features drive a local KNN vote there.</p>
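<p>One caveat worth flagging: <code>Counter.values()</code> in <code>KNNModel.predict</code> yields counts in first-seen order, so a vector like <code>[0.8 0.2]</code> does not say which class received 0.8, and leaves where only one class appears among the top-k collapse to <code>[1.]</code>. A small fix (my suggestion, not part of the code above; the <code>classes</code> argument is assumed) pins probabilities to a fixed class order:</p>

```python
from collections import Counter
import numpy as np

def class_probabilities(top_k_labels, classes):
    """Return neighbour-vote probabilities aligned to a fixed class order."""
    counts = Counter(top_k_labels)
    probs = np.array([counts.get(c, 0) for c in classes], dtype=float)
    return probs / probs.sum()

# Example: top-3 neighbours vote [0, 0, 1] over classes (0, 1)
print(class_probabilities([0, 0, 1], (0, 1)))  # -> [0.66666667 0.33333333]
```

<p>With this, every leaf reports a same-length vector whose positions always mean the same classes.</p>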
</section>
<section id="conclusion-a-promising-path-forward" class="level2">
<h2 class="anchored" data-anchor-id="conclusion-a-promising-path-forward">Conclusion: A Promising Path Forward</h2>
<p>TrieKNN presents a compelling solution for extending the applicability of KNN to mixed data types. By leveraging the Trie data structure, it avoids direct distance calculations on categorical features, enabling the use of localized KNN models for numerical data.</p>
<p>Further research could explore:</p>
<ul>
<li>Optimizing the weighting scheme for combining predictions from different Trie levels.</li>
<li>Comparing TrieKNN’s performance against other mixed-data KNN approaches on benchmark datasets.</li>
<li>Extending TrieKNN to handle missing data and noisy categorical features.</li>
</ul>
<p>TrieKNN opens up new possibilities for applying KNN in domains where mixed data types are prevalent, such as healthcare, e-commerce, and social science.</p>
<p>Resources and further reads:<br>
1. <a href="https://cran.r-project.org/web/packages/nomclust/nomclust.pdf">Nomclust R package</a><br>
2. <a href="https://ieeexplore.ieee.org/abstract/document/8337394">An Improved kNN Based on Class Contribution and Feature Weighting</a><br>
3. <a href="https://ieeexplore.ieee.org/abstract/document/8780580">An Improved Weighted KNN Algorithm for Imbalanced Data Classification</a><br>
4. <a href="https://ieeexplore.ieee.org/abstract/document/6718270">A weighting approach for KNN classifier</a><br>
5. <a href="https://ieeexplore.ieee.org/abstract/document/9739702">Unsupervised Outlier Detection for Mixed-Valued Dataset Based on the Adaptive k-Nearest Neighbor Global Network</a><br>
6. <a href="https://journalskuwait.org/kjs/index.php/KJS/article/download/18331/1253">A hybrid approach based on k-nearest neighbors and decision tree for software fault prediction</a><br>
7. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7173366/">Analysis of Decision Tree and K-Nearest Neighbor Algorithm in the Classification of Breast Cancer</a></p>


</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>technical</category>
  <category>ml</category>
  <guid>https://jitinkapila.com/writing/engineering/10_treeknn/</guid>
  <pubDate>Tue, 25 Feb 2025 18:30:00 GMT</pubDate>
  <media:content url="https://jitinkapila.com/writing/engineering/10_treeknn/pexels-anntarazevich-7299985.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>CrossTab Sparsity for Classification</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="crossroad-unsplash-thumb.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Cross Roads where everyone meets!"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/crossroad-unsplash-thumb.jpg" class="w-100 img-fluid figure-img" alt="Cross Roads where everyone meets!"></a></p>
<figcaption>Cross Roads where everyone meets!</figcaption>
</figure>
</div>
<section id="introduction-a-journey-into-data" class="level3">
<h3 class="anchored" data-anchor-id="introduction-a-journey-into-data">Introduction: A Journey into Data</h3>
<p>Picture this: you’re standing on the icy shores of Antarctica, the wind whipping around you as you watch a colony of Palmer Penguins waddling about, oblivious to the data detective work you’re about to embark on. As a data science architect, you’re not just an observer; you’re a sleuth armed with algorithms and insights, ready to unravel the mysteries hidden within data. Today, we’ll transform raw numbers into powerful narratives using CrossTab Sparsity as our guiding compass. This blog post will demonstrate how this metric can sharpen classification tasks, shedding light on several fascinating datasets: the charming Palmer Penguins alongside weightier obesity and credit-card data.</p>
</section>
<section id="the-power-of-crosstab-sparsity" class="level3">
<h3 class="anchored" data-anchor-id="the-power-of-crosstab-sparsity">The Power of CrossTab Sparsity</h3>
<section id="what-is-crosstab-sparsity" class="level4">
<h4 class="anchored" data-anchor-id="what-is-crosstab-sparsity">What is CrossTab Sparsity?</h4>
<p>CrossTab Sparsity isn’t just a fancy term that sounds good at dinner parties; it’s a statistical measure that helps us peer into the intricate relationships between categorical variables. Imagine it as a magnifying glass that reveals how different categories interact within a contingency table. Understanding these interactions is crucial in classification tasks, where the right features can make or break your model (and your day).</p>
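<p>The post defines the metric informally, but one concrete reading (an assumption on my part, not the author's formal definition) is the fraction of empty cells in the contingency table of a feature against the target:</p>

```python
import pandas as pd

def crosstab_sparsity(feature, target):
    """Fraction of zero cells in the feature-vs-target contingency table."""
    ct = pd.crosstab(pd.Series(feature), pd.Series(target))
    return float((ct.to_numpy() == 0).mean())

# 'y' never co-occurs with 'q': one empty cell out of four -> 0.25
print(crosstab_sparsity(['x', 'x', 'y', 'y'], ['p', 'q', 'p', 'p']))  # -> 0.25
```

<p>A high value means many category combinations never occur together, which is exactly the kind of signal (or red flag) you want surfaced before feature selection.</p>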
<p><strong>Why Does It Matter?</strong></p>
<p>In the world of data science, especially in classification, selecting relevant features is like picking the right ingredients for a gourmet meal—get it wrong, and you might end up with something unpalatable. CrossTab Sparsity helps us achieve this by:</p>
<ul>
<li>Highlighting Relationships: It’s like having a friend who always points out when two people are meant to be together—understanding how features interact with the target variable.</li>
<li>Streamlining Models: Reducing complexity by focusing on significant features means less time spent untangling spaghetti code.</li>
<li>Enhancing Interpretability: Making models easier to understand and explain to stakeholders is like translating tech jargon into plain English—everyone appreciates that!</li>
</ul>
</section>
</section>
<section id="data-overview-our-data-people-at-work-here" class="level3">
<h3 class="anchored" data-anchor-id="data-overview-our-data-people-at-work-here">Data Overview: Our Data at Work</h3>
<section id="the-datasets" class="level4">
<h4 class="anchored" data-anchor-id="the-datasets">The Datasets</h4>
<p>Data 1: <a href="https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition">Estimation of Obesity Levels Based On Eating Habits and Physical Condition</a></p>
<p><em>A little about the data:</em> This dataset, shared on 8/26/2019, looks at obesity levels in people from Mexico, Peru, and Colombia based on their eating habits and physical health. It includes 2,111 records with 16 features, and classifies individuals into different obesity levels, from insufficient weight to obesity type III. Most of the records (77%) were generated synthetically with a tool, while the rest (23%) were collected directly from users online.</p>
<p>Data 2: <a href="https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success">Predict Students’ Dropout and Academic Success</a></p>
<p><em>A little about the data:</em> This dataset, shared on 12/12/2021, looks at factors like students’ backgrounds, academic paths, and socio-economic status to predict whether they’ll drop out or succeed in their studies. With 4,424 records across 36 features, it covers students from different undergrad programs. The goal is to use machine learning to spot at-risk students early so schools can offer support. The data has been cleaned and has no missing values. It’s a classification task with three outcomes: dropout, still enrolled, or graduated.</p>
<p><strong>Key Features</strong>:</p>
<ul>
<li>Multiclass: Both datasets pose multiclass problems, with <code>NObeyesdad</code> and <code>Target</code> as the respective target columns.</li>
<li>Mixed Data Types: A good mix of categorical and continuous variables is available for use.</li>
<li>Sizeable: More than 2,000 rows are available for testing.</li>
</ul>
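<p>These properties can be checked mechanically once a dataset is loaded. A minimal sketch on a toy frame (column names borrowed from the obesity data; the frame itself is illustrative, not the real download):</p>

```python
import pandas as pd

# Tiny stand-in for the real dataset, just to demonstrate the checks.
df = pd.DataFrame({
    "Age": [21.0, 23.0, 27.0],
    "Gender": ["Female", "Male", "Male"],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Overweight_Level_I"],
})
target = "NObeyesdad"

print("classes:", df[target].nunique())                                # 2
print("categorical:", len(df.select_dtypes(include="object").columns)) # 2
print("numeric:", len(df.select_dtypes(include="number").columns))     # 1
print("rows:", len(df))                                                # 3
```

<p>On the real data the same three lines confirm the multiclass target, the categorical/continuous mix, and the row count before any modeling starts.</p>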
</section>
</section>
<section id="exploratory-data-analysis-eda-setting-the-stage" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="exploratory-data-analysis-eda-setting-the-stage">Exploratory Data Analysis (EDA): Setting the Stage</h3>
<p>Before we dive into model creation, let’s explore our dataset through some quick EDA. Think of this as getting to know your non-obese friends before inviting them to a party.</p>
<section id="eda-for-obesity-data" class="level4 page-columns page-full">
<h4 class="anchored" data-anchor-id="eda-for-obesity-data">EDA for Obesity Data</h4>
<p>Here’s a brief code snippet to perform essential EDA on the Obesity dataset:</p>
<div id="1eff4574" class="cell" data-execution_count="3">
<details class="code-fold">
<summary>Loading data and generating basic descriptives</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the Obesity data</span></span>
<span id="cb1-2">raw_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'ObesityDataSet_raw_and_data_sinthetic.csv'</span>)</span>
<span id="cb1-3">target <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'NObeyesdad'</span></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load Students data</span></span>
<span id="cb1-6"></span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load Credit data</span></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># raw_data = sm.datasets.get_rdataset("credit_data",'modeldata')</span></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># raw_df = raw_data.data</span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># target = 'Status'</span></span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># # Load Palmer penguins data</span></span>
<span id="cb1-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># raw_data = sm.datasets.get_rdataset("penguins",'palmerpenguins')</span></span>
<span id="cb1-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># raw_df = raw_data.data</span></span>
<span id="cb1-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># target = 'species'</span></span>
<span id="cb1-16"></span>
<span id="cb1-17"></span>
<span id="cb1-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># # Load Credit data</span></span>
<span id="cb1-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># raw_data = sm.datasets.get_rdataset("CreditCard",'AER')</span></span>
<span id="cb1-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># raw_df = raw_data.data</span></span>
<span id="cb1-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># target = 'card'</span></span>
<span id="cb1-22"></span>
<span id="cb1-23"></span>
<span id="cb1-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># setting things up for all the next steps</span></span>
<span id="cb1-25">raw_df[target] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> raw_df[target].astype(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'category'</span>) </span>
<span id="cb1-26"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'No of data points available to work:'</span>,raw_df.shape)</span>
<span id="cb1-27">display(raw_df.head())</span>
<span id="cb1-28"></span>
<span id="cb1-29"></span>
<span id="cb1-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Summary statistics</span></span>
<span id="cb1-31">display(raw_df.describe())</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>No of data points available to work: (2111, 17)</code></pre>
</div>
<div class="cell-output cell-output-display">
<div>


<table class="dataframe table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">Gender</th>
<th data-quarto-table-cell-role="th">Age</th>
<th data-quarto-table-cell-role="th">Height</th>
<th data-quarto-table-cell-role="th">Weight</th>
<th data-quarto-table-cell-role="th">Famil_Hist_Owt</th>
<th data-quarto-table-cell-role="th">FAVC</th>
<th data-quarto-table-cell-role="th">FCVC</th>
<th data-quarto-table-cell-role="th">NCP</th>
<th data-quarto-table-cell-role="th">CAEC</th>
<th data-quarto-table-cell-role="th">SMOKE</th>
<th data-quarto-table-cell-role="th">CH2O</th>
<th data-quarto-table-cell-role="th">SCC</th>
<th data-quarto-table-cell-role="th">FAF</th>
<th data-quarto-table-cell-role="th">TUE</th>
<th data-quarto-table-cell-role="th">CALC</th>
<th data-quarto-table-cell-role="th">MTRANS</th>
<th data-quarto-table-cell-role="th">NObeyesdad</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">0</th>
<td>Female</td>
<td>21.0</td>
<td>1.62</td>
<td>64.0</td>
<td>yes</td>
<td>no</td>
<td>2.0</td>
<td>3.0</td>
<td>Sometimes</td>
<td>no</td>
<td>2.0</td>
<td>no</td>
<td>0.0</td>
<td>1.0</td>
<td>no</td>
<td>Public_Transportation</td>
<td>Normal_Weight</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1</th>
<td>Female</td>
<td>21.0</td>
<td>1.52</td>
<td>56.0</td>
<td>yes</td>
<td>no</td>
<td>3.0</td>
<td>3.0</td>
<td>Sometimes</td>
<td>yes</td>
<td>3.0</td>
<td>yes</td>
<td>3.0</td>
<td>0.0</td>
<td>Sometimes</td>
<td>Public_Transportation</td>
<td>Normal_Weight</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">2</th>
<td>Male</td>
<td>23.0</td>
<td>1.80</td>
<td>77.0</td>
<td>yes</td>
<td>no</td>
<td>2.0</td>
<td>3.0</td>
<td>Sometimes</td>
<td>no</td>
<td>2.0</td>
<td>no</td>
<td>2.0</td>
<td>1.0</td>
<td>Frequently</td>
<td>Public_Transportation</td>
<td>Normal_Weight</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3</th>
<td>Male</td>
<td>27.0</td>
<td>1.80</td>
<td>87.0</td>
<td>no</td>
<td>no</td>
<td>3.0</td>
<td>3.0</td>
<td>Sometimes</td>
<td>no</td>
<td>2.0</td>
<td>no</td>
<td>2.0</td>
<td>0.0</td>
<td>Frequently</td>
<td>Walking</td>
<td>Overweight_Level_I</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">4</th>
<td>Male</td>
<td>22.0</td>
<td>1.78</td>
<td>89.8</td>
<td>no</td>
<td>no</td>
<td>2.0</td>
<td>1.0</td>
<td>Sometimes</td>
<td>no</td>
<td>2.0</td>
<td>no</td>
<td>0.0</td>
<td>0.0</td>
<td>Sometimes</td>
<td>Public_Transportation</td>
<td>Overweight_Level_II</td>
</tr>
</tbody>
</table>

</div>
</div>
<div class="cell-output cell-output-display">
<div>


<table class="dataframe table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">Age</th>
<th data-quarto-table-cell-role="th">Height</th>
<th data-quarto-table-cell-role="th">Weight</th>
<th data-quarto-table-cell-role="th">FCVC</th>
<th data-quarto-table-cell-role="th">NCP</th>
<th data-quarto-table-cell-role="th">CH2O</th>
<th data-quarto-table-cell-role="th">FAF</th>
<th data-quarto-table-cell-role="th">TUE</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">count</th>
<td>2111.000000</td>
<td>2111.000000</td>
<td>2111.000000</td>
<td>2111.000000</td>
<td>2111.000000</td>
<td>2111.000000</td>
<td>2111.000000</td>
<td>2111.000000</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">mean</th>
<td>24.312600</td>
<td>1.701677</td>
<td>86.586058</td>
<td>2.419043</td>
<td>2.685628</td>
<td>2.008011</td>
<td>1.010298</td>
<td>0.657866</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">std</th>
<td>6.345968</td>
<td>0.093305</td>
<td>26.191172</td>
<td>0.533927</td>
<td>0.778039</td>
<td>0.612953</td>
<td>0.850592</td>
<td>0.608927</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">min</th>
<td>14.000000</td>
<td>1.450000</td>
<td>39.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>1.000000</td>
<td>0.000000</td>
<td>0.000000</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">25%</th>
<td>19.947192</td>
<td>1.630000</td>
<td>65.473343</td>
<td>2.000000</td>
<td>2.658738</td>
<td>1.584812</td>
<td>0.124505</td>
<td>0.000000</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">50%</th>
<td>22.777890</td>
<td>1.700499</td>
<td>83.000000</td>
<td>2.385502</td>
<td>3.000000</td>
<td>2.000000</td>
<td>1.000000</td>
<td>0.625350</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">75%</th>
<td>26.000000</td>
<td>1.768464</td>
<td>107.430682</td>
<td>3.000000</td>
<td>3.000000</td>
<td>2.477420</td>
<td>1.666678</td>
<td>1.000000</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">max</th>
<td>61.000000</td>
<td>1.980000</td>
<td>173.000000</td>
<td>3.000000</td>
<td>4.000000</td>
<td>3.000000</td>
<td>3.000000</td>
<td>2.000000</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>

<div class="no-row-height column-margin column-container"><div class="">
<p><em>Target distribution</em></p>
<div id="0a7bee65" class="cell" data-execution_count="4">
<details class="code-fold">
<summary>Target and Correlation</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Visualize target data distribution</span></span>
<span id="cb3-2">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>))</span>
<span id="cb3-3">sns.countplot(data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>raw_df, x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>target, hue<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>target, palette<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Set2'</span>,)</span>
<span id="cb3-4">plt.title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'Distribution of </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>target<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> levels'</span>)</span>
<span id="cb3-5">plt.xticks(rotation<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">45</span>)</span>
<span id="cb3-6">plt.show()</span>
<span id="cb3-7"></span>
<span id="cb3-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Heatmap to check for correlations between numeric variables</span></span>
<span id="cb3-9">corr <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> raw_df.corr(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'kendall'</span>,numeric_only<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb3-10">sns.heatmap(corr, annot<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, cmap<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'coolwarm'</span>)</span>
<span id="cb3-11">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Kendall Correlation Heatmap'</span>)</span>
<span id="cb3-12">plt.show()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-4-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-4-output-1.png" width="392" height="382" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-4-output-2.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-4-output-2.png" width="537" height="430" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
</div>
<div class="callout callout-style-simple callout-none no-icon callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">None</span>Some More EDA for the data
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<div id="594cf346" class="cell" data-execution_count="5">
<details class="code-fold">
<summary>EDA code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Visualize the distribution of numerical variables</span></span>
<span id="cb4-2">sns.pairplot(raw_df, hue<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>target, corner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb4-3">plt.show()</span>
<span id="cb4-4"></span>
<span id="cb4-5"></span>
<span id="cb4-6"></span>
<span id="cb4-7"></span>
<span id="cb4-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Getting categorical data</span></span>
<span id="cb4-9">categorical_columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> raw_df.select_dtypes(include<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'object'</span>).columns</span>
<span id="cb4-10"></span>
<span id="cb4-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Plot categorical variables with respect to the target variable</span></span>
<span id="cb4-12"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> col <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> categorical_columns:</span>
<span id="cb4-13">    plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb4-14">    sns.countplot(data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>raw_df,x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>col, hue<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>target)</span>
<span id="cb4-15">    plt.title(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Countplot of </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>col<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> with respect to </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>target<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb4-16">    plt.show()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-1.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-1.png" width="2075" height="1887" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-2.png" class="lightbox" data-gallery="quarto-lightbox-gallery-5"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-2.png" width="961" height="447" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-3.png" class="lightbox" data-gallery="quarto-lightbox-gallery-6"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-3.png" width="961" height="447" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-4.png" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-4.png" width="961" height="447" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-5.png" class="lightbox" data-gallery="quarto-lightbox-gallery-8"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-5.png" width="961" height="447" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-6.png" class="lightbox" data-gallery="quarto-lightbox-gallery-9"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-6.png" width="961" height="447" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-7.png" class="lightbox" data-gallery="quarto-lightbox-gallery-10"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-7.png" width="961" height="447" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-8.png" class="lightbox" data-gallery="quarto-lightbox-gallery-11"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-8.png" width="961" height="447" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
<div class="cell-output cell-output-display">
<div>
<figure class="figure">
<p><a href="index_files/figure-html/cell-5-output-9.png" class="lightbox" data-gallery="quarto-lightbox-gallery-12"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-5-output-9.png" width="961" height="447" class="figure-img img-fluid"></a></p>
</figure>
</div>
</div>
</div>
</div>
</div>
</div>
</div></div></section>
</section>
<section id="model-creation-establishing-a-baseline" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="model-creation-establishing-a-baseline">Model Creation: Establishing a Baseline</h3>
<p>With our exploratory analysis complete, we’re ready to create our baseline model using logistic regression with Statsmodels. This initial model will serve as our reference point—like setting up a benchmark for your favorite video game.</p>
<div id="a6e480d4" class="cell" data-execution_count="6">
<details class="code-fold">
<summary>Splitting data and training a default Multinomial Logit model on our data</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1">data_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> raw_df.dropna().reset_index(drop<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb5-2">data_df[target] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data_df[target].cat.codes</span>
<span id="cb5-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># X = data_df[['bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g']] </span></span>
<span id="cb5-4"></span>
<span id="cb5-5">data_df_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data_df.sample(frac<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>,random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>)</span>
<span id="cb5-6">data_df_train <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data_df.drop(data_df_test.index)</span>
<span id="cb5-7"></span>
<span id="cb5-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit a multinomial logistic regression via the formula API</span></span>
<span id="cb5-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># This essentially boils down to pairwise logistic regression</span></span>
<span id="cb5-10">logit_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sm.MNLogit.from_formula(</span>
<span id="cb5-11">    <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>target<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> ~ </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">' + '</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>join([col <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> col <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> data_df_train.columns <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> col <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> target])<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>, </span>
<span id="cb5-12">    data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>data_df_train</span>
<span id="cb5-13">).fit_regularized()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.17057119619320013
            Iterations: 485
            Function evaluations: 639
            Gradient evaluations: 485</code></pre>
</div>
</div>

<div class="no-row-height column-margin column-container"><div class="">
<div class="callout callout-style-simple callout-none no-icon callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">None</span>Base model summary for geeks
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<div id="7fc3dbfa" class="cell" data-execution_count="7">
<details class="code-fold">
<summary>Display summary</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb7-1">display(logit_model.summary())</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<table class="simpletable table table-sm table-striped small">
<caption>MNLogit Regression Results</caption>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">Dep. Variable:</th>
<td>NObeyesdad</td>
<th data-quarto-table-cell-role="th">No. Observations:</th>
<td>1900</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Model:</th>
<td>MNLogit</td>
<th data-quarto-table-cell-role="th">Df Residuals:</th>
<td>1756</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Method:</th>
<td>MLE</td>
<th data-quarto-table-cell-role="th">Df Model:</th>
<td>138</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Date:</th>
<td>Sat, 09 May 2026</td>
<th data-quarto-table-cell-role="th">Pseudo R-squ.:</th>
<td>0.9122</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Time:</th>
<td>00:48:51</td>
<th data-quarto-table-cell-role="th">Log-Likelihood:</th>
<td>-324.09</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">converged:</th>
<td>True</td>
<th data-quarto-table-cell-role="th">LL-Null:</th>
<td>-3691.8</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Covariance Type:</th>
<td>nonrobust</td>
<th data-quarto-table-cell-role="th">LLR p-value:</th>
<td>0.000</td>
</tr>
</tbody>
</table>


<table class="simpletable table table-sm table-striped small">
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=1</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>-11.2903</td>
<td>3.25e+05</td>
<td>-3.48e-05</td>
<td>1.000</td>
<td>-6.36e+05</td>
<td>6.36e+05</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Gender[T.Male]</th>
<td>-3.4851</td>
<td>0.817</td>
<td>-4.268</td>
<td>0.000</td>
<td>-5.085</td>
<td>-1.885</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Famil_Hist_Owt[T.yes]</th>
<td>-0.8162</td>
<td>0.655</td>
<td>-1.246</td>
<td>0.213</td>
<td>-2.100</td>
<td>0.468</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAVC[T.yes]</th>
<td>0.2636</td>
<td>0.785</td>
<td>0.336</td>
<td>0.737</td>
<td>-1.275</td>
<td>1.802</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.Frequently]</th>
<td>-8.2402</td>
<td>2.312</td>
<td>-3.564</td>
<td>0.000</td>
<td>-12.771</td>
<td>-3.709</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.Sometimes]</th>
<td>-6.2226</td>
<td>2.232</td>
<td>-2.787</td>
<td>0.005</td>
<td>-10.598</td>
<td>-1.847</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.no]</th>
<td>-8.5977</td>
<td>2.889</td>
<td>-2.976</td>
<td>0.003</td>
<td>-14.260</td>
<td>-2.935</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">SMOKE[T.yes]</th>
<td>4.4919</td>
<td>3.115</td>
<td>1.442</td>
<td>0.149</td>
<td>-1.614</td>
<td>10.598</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">SCC[T.yes]</th>
<td>-0.7294</td>
<td>1.447</td>
<td>-0.504</td>
<td>0.614</td>
<td>-3.565</td>
<td>2.106</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.Frequently]</th>
<td>-12.6192</td>
<td>3.25e+05</td>
<td>-3.89e-05</td>
<td>1.000</td>
<td>-6.36e+05</td>
<td>6.36e+05</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.Sometimes]</th>
<td>-13.2985</td>
<td>3.25e+05</td>
<td>-4.1e-05</td>
<td>1.000</td>
<td>-6.36e+05</td>
<td>6.36e+05</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.no]</th>
<td>-14.1585</td>
<td>3.25e+05</td>
<td>-4.36e-05</td>
<td>1.000</td>
<td>-6.36e+05</td>
<td>6.36e+05</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Bike]</th>
<td>15.8909</td>
<td>2489.580</td>
<td>0.006</td>
<td>0.995</td>
<td>-4863.596</td>
<td>4895.378</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Motorbike]</th>
<td>3.9944</td>
<td>47.659</td>
<td>0.084</td>
<td>0.933</td>
<td>-89.416</td>
<td>97.405</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Public_Transportation]</th>
<td>4.4914</td>
<td>0.995</td>
<td>4.514</td>
<td>0.000</td>
<td>2.541</td>
<td>6.441</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Walking]</th>
<td>4.3554</td>
<td>1.502</td>
<td>2.900</td>
<td>0.004</td>
<td>1.412</td>
<td>7.299</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>0.3721</td>
<td>0.097</td>
<td>3.833</td>
<td>0.000</td>
<td>0.182</td>
<td>0.562</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-14.4208</td>
<td>4.118</td>
<td>-3.502</td>
<td>0.000</td>
<td>-22.492</td>
<td>-6.349</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>1.0786</td>
<td>0.146</td>
<td>7.378</td>
<td>0.000</td>
<td>0.792</td>
<td>1.365</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FCVC</th>
<td>-0.7754</td>
<td>0.429</td>
<td>-1.806</td>
<td>0.071</td>
<td>-1.617</td>
<td>0.066</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">NCP</th>
<td>-1.7094</td>
<td>0.491</td>
<td>-3.480</td>
<td>0.001</td>
<td>-2.672</td>
<td>-0.747</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-1.7291</td>
<td>0.578</td>
<td>-2.992</td>
<td>0.003</td>
<td>-2.862</td>
<td>-0.596</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-0.1924</td>
<td>0.280</td>
<td>-0.688</td>
<td>0.491</td>
<td>-0.740</td>
<td>0.356</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>-0.9320</td>
<td>0.456</td>
<td>-2.043</td>
<td>0.041</td>
<td>-1.826</td>
<td>-0.038</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">NObeyesdad=2</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>17.4309</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Gender[T.Male]</th>
<td>-14.0384</td>
<td>1.983</td>
<td>-7.079</td>
<td>0.000</td>
<td>-17.925</td>
<td>-10.151</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Famil_Hist_Owt[T.yes]</th>
<td>2.0527</td>
<td>1.717</td>
<td>1.195</td>
<td>0.232</td>
<td>-1.313</td>
<td>5.418</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FAVC[T.yes]</th>
<td>0.9668</td>
<td>1.752</td>
<td>0.552</td>
<td>0.581</td>
<td>-2.468</td>
<td>4.401</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.Frequently]</th>
<td>-10.0052</td>
<td>4.352</td>
<td>-2.299</td>
<td>0.021</td>
<td>-18.534</td>
<td>-1.476</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.Sometimes]</th>
<td>-1.0074</td>
<td>3.427</td>
<td>-0.294</td>
<td>0.769</td>
<td>-7.724</td>
<td>5.709</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.no]</th>
<td>-0.4896</td>
<td>894.479</td>
<td>-0.001</td>
<td>1.000</td>
<td>-1753.637</td>
<td>1752.658</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">SMOKE[T.yes]</th>
<td>8.1410</td>
<td>4.013</td>
<td>2.029</td>
<td>0.042</td>
<td>0.277</td>
<td>16.005</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">SCC[T.yes]</th>
<td>-7.6940</td>
<td>152.983</td>
<td>-0.050</td>
<td>0.960</td>
<td>-307.535</td>
<td>292.147</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.Frequently]</th>
<td>-2.4516</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.Sometimes]</th>
<td>-7.5316</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.no]</th>
<td>-7.2301</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Bike]</th>
<td>-11.9350</td>
<td>8.09e+07</td>
<td>-1.47e-07</td>
<td>1.000</td>
<td>-1.59e+08</td>
<td>1.59e+08</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Motorbike]</th>
<td>10.9226</td>
<td>48.493</td>
<td>0.225</td>
<td>0.822</td>
<td>-84.123</td>
<td>105.968</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Public_Transportation]</th>
<td>11.1756</td>
<td>1.750</td>
<td>6.387</td>
<td>0.000</td>
<td>7.746</td>
<td>14.605</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Walking]</th>
<td>1.7281</td>
<td>2.759</td>
<td>0.626</td>
<td>0.531</td>
<td>-3.679</td>
<td>7.135</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Age</th>
<td>0.8111</td>
<td>0.132</td>
<td>6.139</td>
<td>0.000</td>
<td>0.552</td>
<td>1.070</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Height</th>
<td>-184.0385</td>
<td>14.746</td>
<td>-12.481</td>
<td>0.000</td>
<td>-212.939</td>
<td>-155.138</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Weight</th>
<td>3.9438</td>
<td>0.288</td>
<td>13.688</td>
<td>0.000</td>
<td>3.379</td>
<td>4.508</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FCVC</th>
<td>0.8915</td>
<td>1.014</td>
<td>0.879</td>
<td>0.379</td>
<td>-1.095</td>
<td>2.878</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NCP</th>
<td>-1.1415</td>
<td>0.711</td>
<td>-1.605</td>
<td>0.109</td>
<td>-2.536</td>
<td>0.253</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-1.5390</td>
<td>0.876</td>
<td>-1.756</td>
<td>0.079</td>
<td>-3.256</td>
<td>0.179</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-1.5295</td>
<td>0.591</td>
<td>-2.586</td>
<td>0.010</td>
<td>-2.689</td>
<td>-0.370</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">TUE</th>
<td>-0.5710</td>
<td>0.840</td>
<td>-0.680</td>
<td>0.497</td>
<td>-2.217</td>
<td>1.075</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=3</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>-138.5068</td>
<td>1.47e+07</td>
<td>-9.41e-06</td>
<td>1.000</td>
<td>-2.89e+07</td>
<td>2.89e+07</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Gender[T.Male]</th>
<td>-16.6365</td>
<td>8.279</td>
<td>-2.010</td>
<td>0.044</td>
<td>-32.863</td>
<td>-0.410</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Famil_Hist_Owt[T.yes]</th>
<td>2.3538</td>
<td>11.601</td>
<td>0.203</td>
<td>0.839</td>
<td>-20.384</td>
<td>25.092</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAVC[T.yes]</th>
<td>-8.7785</td>
<td>5.476</td>
<td>-1.603</td>
<td>0.109</td>
<td>-19.512</td>
<td>1.955</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.Frequently]</th>
<td>-71.7022</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.Sometimes]</th>
<td>-3.9034</td>
<td>4.734</td>
<td>-0.824</td>
<td>0.410</td>
<td>-13.183</td>
<td>5.376</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.no]</th>
<td>7.7265</td>
<td>895.063</td>
<td>0.009</td>
<td>0.993</td>
<td>-1746.566</td>
<td>1762.019</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">SMOKE[T.yes]</th>
<td>3.5306</td>
<td>19.342</td>
<td>0.183</td>
<td>0.855</td>
<td>-34.379</td>
<td>41.440</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">SCC[T.yes]</th>
<td>-19.4879</td>
<td>154.607</td>
<td>-0.126</td>
<td>0.900</td>
<td>-322.512</td>
<td>283.536</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.Frequently]</th>
<td>-43.6020</td>
<td>1.48e+07</td>
<td>-2.95e-06</td>
<td>1.000</td>
<td>-2.9e+07</td>
<td>2.9e+07</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.Sometimes]</th>
<td>-45.7496</td>
<td>1.47e+07</td>
<td>-3.11e-06</td>
<td>1.000</td>
<td>-2.88e+07</td>
<td>2.88e+07</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.no]</th>
<td>-28.2183</td>
<td>1.43e+07</td>
<td>-1.97e-06</td>
<td>1.000</td>
<td>-2.81e+07</td>
<td>2.81e+07</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Bike]</th>
<td>0.0376</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Motorbike]</th>
<td>-2.3812</td>
<td>1.05e+11</td>
<td>-2.27e-11</td>
<td>1.000</td>
<td>-2.06e+11</td>
<td>2.06e+11</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Public_Transportation]</th>
<td>22.5234</td>
<td>6.664</td>
<td>3.380</td>
<td>0.001</td>
<td>9.463</td>
<td>35.584</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Walking]</th>
<td>-5.3334</td>
<td>33.279</td>
<td>-0.160</td>
<td>0.873</td>
<td>-70.560</td>
<td>59.893</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>2.5106</td>
<td>0.964</td>
<td>2.605</td>
<td>0.009</td>
<td>0.621</td>
<td>4.400</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-278.9439</td>
<td>44.201</td>
<td>-6.311</td>
<td>0.000</td>
<td>-365.576</td>
<td>-192.312</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>7.1539</td>
<td>1.394</td>
<td>5.132</td>
<td>0.000</td>
<td>4.422</td>
<td>9.886</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FCVC</th>
<td>4.1064</td>
<td>3.285</td>
<td>1.250</td>
<td>0.211</td>
<td>-2.333</td>
<td>10.546</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">NCP</th>
<td>-1.5637</td>
<td>2.424</td>
<td>-0.645</td>
<td>0.519</td>
<td>-6.315</td>
<td>3.187</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-13.4088</td>
<td>5.560</td>
<td>-2.412</td>
<td>0.016</td>
<td>-24.306</td>
<td>-2.511</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-9.8534</td>
<td>4.356</td>
<td>-2.262</td>
<td>0.024</td>
<td>-18.390</td>
<td>-1.316</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>-5.6951</td>
<td>3.292</td>
<td>-1.730</td>
<td>0.084</td>
<td>-12.147</td>
<td>0.757</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">NObeyesdad=4</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>-87.3214</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Gender[T.Male]</th>
<td>-200.3037</td>
<td>5.41e+07</td>
<td>-3.7e-06</td>
<td>1.000</td>
<td>-1.06e+08</td>
<td>1.06e+08</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Famil_Hist_Owt[T.yes]</th>
<td>-30.9252</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FAVC[T.yes]</th>
<td>-53.1818</td>
<td>3.98e+07</td>
<td>-1.34e-06</td>
<td>1.000</td>
<td>-7.8e+07</td>
<td>7.8e+07</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.Frequently]</th>
<td>-28.5483</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.Sometimes]</th>
<td>-21.5821</td>
<td>5.38e+07</td>
<td>-4.01e-07</td>
<td>1.000</td>
<td>-1.05e+08</td>
<td>1.05e+08</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.no]</th>
<td>-2.2000</td>
<td>4.62e+29</td>
<td>-4.76e-30</td>
<td>1.000</td>
<td>-9.06e+29</td>
<td>9.06e+29</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">SMOKE[T.yes]</th>
<td>-6.0944</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">SCC[T.yes]</th>
<td>-12.3054</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.Frequently]</th>
<td>-6.2460</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.Sometimes]</th>
<td>-37.2004</td>
<td>2.12e+08</td>
<td>-1.76e-07</td>
<td>1.000</td>
<td>-4.15e+08</td>
<td>4.15e+08</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.no]</th>
<td>-64.5032</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Bike]</th>
<td>-0.2989</td>
<td>1.92e+53</td>
<td>-1.56e-54</td>
<td>1.000</td>
<td>-3.76e+53</td>
<td>3.76e+53</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Motorbike]</th>
<td>-0.2031</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Public_Transportation]</th>
<td>-57.6854</td>
<td>7.04e+07</td>
<td>-8.2e-07</td>
<td>1.000</td>
<td>-1.38e+08</td>
<td>1.38e+08</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Walking]</th>
<td>-7.4464</td>
<td>2.03e+15</td>
<td>-3.66e-15</td>
<td>1.000</td>
<td>-3.98e+15</td>
<td>3.98e+15</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Age</th>
<td>-9.3747</td>
<td>103.246</td>
<td>-0.091</td>
<td>0.928</td>
<td>-211.733</td>
<td>192.984</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Height</th>
<td>-174.4727</td>
<td>592.866</td>
<td>-0.294</td>
<td>0.769</td>
<td>-1336.469</td>
<td>987.523</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Weight</th>
<td>8.7405</td>
<td>35.222</td>
<td>0.248</td>
<td>0.804</td>
<td>-60.293</td>
<td>77.774</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FCVC</th>
<td>49.0613</td>
<td>3.02e+04</td>
<td>0.002</td>
<td>0.999</td>
<td>-5.91e+04</td>
<td>5.92e+04</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NCP</th>
<td>2.3650</td>
<td>4572.743</td>
<td>0.001</td>
<td>1.000</td>
<td>-8960.047</td>
<td>8964.777</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-18.5809</td>
<td>34.347</td>
<td>-0.541</td>
<td>0.589</td>
<td>-85.900</td>
<td>48.738</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-65.1761</td>
<td>262.887</td>
<td>-0.248</td>
<td>0.804</td>
<td>-580.424</td>
<td>450.072</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">TUE</th>
<td>-44.3721</td>
<td>285.217</td>
<td>-0.156</td>
<td>0.876</td>
<td>-603.387</td>
<td>514.643</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=5</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>-12.5683</td>
<td>3.25e+05</td>
<td>-3.87e-05</td>
<td>1.000</td>
<td>-6.36e+05</td>
<td>6.36e+05</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Gender[T.Male]</th>
<td>-6.8149</td>
<td>1.085</td>
<td>-6.282</td>
<td>0.000</td>
<td>-8.941</td>
<td>-4.689</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Famil_Hist_Owt[T.yes]</th>
<td>-0.5822</td>
<td>0.790</td>
<td>-0.737</td>
<td>0.461</td>
<td>-2.130</td>
<td>0.966</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAVC[T.yes]</th>
<td>2.6008</td>
<td>0.978</td>
<td>2.660</td>
<td>0.008</td>
<td>0.684</td>
<td>4.517</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.Frequently]</th>
<td>-7.2298</td>
<td>2.507</td>
<td>-2.884</td>
<td>0.004</td>
<td>-12.143</td>
<td>-2.316</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.Sometimes]</th>
<td>-2.8197</td>
<td>2.413</td>
<td>-1.168</td>
<td>0.243</td>
<td>-7.550</td>
<td>1.910</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.no]</th>
<td>-3.8181</td>
<td>3.143</td>
<td>-1.215</td>
<td>0.224</td>
<td>-9.977</td>
<td>2.341</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">SMOKE[T.yes]</th>
<td>3.1451</td>
<td>3.296</td>
<td>0.954</td>
<td>0.340</td>
<td>-3.314</td>
<td>9.604</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">SCC[T.yes]</th>
<td>2.1647</td>
<td>1.617</td>
<td>1.339</td>
<td>0.181</td>
<td>-1.004</td>
<td>5.334</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.Frequently]</th>
<td>-9.0315</td>
<td>3.25e+05</td>
<td>-2.78e-05</td>
<td>1.000</td>
<td>-6.36e+05</td>
<td>6.36e+05</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.Sometimes]</th>
<td>-9.1446</td>
<td>3.25e+05</td>
<td>-2.82e-05</td>
<td>1.000</td>
<td>-6.36e+05</td>
<td>6.36e+05</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.no]</th>
<td>-10.7708</td>
<td>3.25e+05</td>
<td>-3.32e-05</td>
<td>1.000</td>
<td>-6.36e+05</td>
<td>6.36e+05</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Bike]</th>
<td>19.0425</td>
<td>2489.581</td>
<td>0.008</td>
<td>0.994</td>
<td>-4860.446</td>
<td>4898.531</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Motorbike]</th>
<td>1.6235</td>
<td>47.716</td>
<td>0.034</td>
<td>0.973</td>
<td>-91.899</td>
<td>95.146</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Public_Transportation]</th>
<td>5.9777</td>
<td>1.209</td>
<td>4.946</td>
<td>0.000</td>
<td>3.609</td>
<td>8.346</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Walking]</th>
<td>4.3596</td>
<td>1.776</td>
<td>2.454</td>
<td>0.014</td>
<td>0.878</td>
<td>7.841</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>0.4878</td>
<td>0.106</td>
<td>4.597</td>
<td>0.000</td>
<td>0.280</td>
<td>0.696</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-50.0157</td>
<td>6.721</td>
<td>-7.442</td>
<td>0.000</td>
<td>-63.188</td>
<td>-36.844</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>1.7920</td>
<td>0.168</td>
<td>10.651</td>
<td>0.000</td>
<td>1.462</td>
<td>2.122</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FCVC</th>
<td>-0.8369</td>
<td>0.601</td>
<td>-1.393</td>
<td>0.164</td>
<td>-2.014</td>
<td>0.341</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">NCP</th>
<td>-1.4453</td>
<td>0.554</td>
<td>-2.608</td>
<td>0.009</td>
<td>-2.531</td>
<td>-0.359</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-1.7648</td>
<td>0.679</td>
<td>-2.601</td>
<td>0.009</td>
<td>-3.095</td>
<td>-0.435</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-0.5613</td>
<td>0.374</td>
<td>-1.499</td>
<td>0.134</td>
<td>-1.295</td>
<td>0.172</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>-0.7982</td>
<td>0.555</td>
<td>-1.439</td>
<td>0.150</td>
<td>-1.886</td>
<td>0.289</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">NObeyesdad=6</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>-2.1693</td>
<td>6.28e+06</td>
<td>-3.45e-07</td>
<td>1.000</td>
<td>-1.23e+07</td>
<td>1.23e+07</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Gender[T.Male]</th>
<td>-6.6857</td>
<td>1.207</td>
<td>-5.537</td>
<td>0.000</td>
<td>-9.052</td>
<td>-4.319</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Famil_Hist_Owt[T.yes]</th>
<td>1.9296</td>
<td>1.076</td>
<td>1.793</td>
<td>0.073</td>
<td>-0.179</td>
<td>4.038</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FAVC[T.yes]</th>
<td>-0.4617</td>
<td>1.141</td>
<td>-0.405</td>
<td>0.686</td>
<td>-2.698</td>
<td>1.775</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.Frequently]</th>
<td>-5.5324</td>
<td>3.264</td>
<td>-1.695</td>
<td>0.090</td>
<td>-11.930</td>
<td>0.866</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CAEC[T.Sometimes]</th>
<td>0.7854</td>
<td>3.044</td>
<td>0.258</td>
<td>0.796</td>
<td>-5.181</td>
<td>6.752</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CAEC[T.no]</th>
<td>1.7141</td>
<td>3.934</td>
<td>0.436</td>
<td>0.663</td>
<td>-5.997</td>
<td>9.426</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">SMOKE[T.yes]</th>
<td>7.0398</td>
<td>3.570</td>
<td>1.972</td>
<td>0.049</td>
<td>0.043</td>
<td>14.036</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">SCC[T.yes]</th>
<td>1.3664</td>
<td>2.012</td>
<td>0.679</td>
<td>0.497</td>
<td>-2.577</td>
<td>5.309</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.Frequently]</th>
<td>-2.1001</td>
<td>6.28e+06</td>
<td>-3.34e-07</td>
<td>1.000</td>
<td>-1.23e+07</td>
<td>1.23e+07</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">CALC[T.Sometimes]</th>
<td>-4.6772</td>
<td>6.28e+06</td>
<td>-7.45e-07</td>
<td>1.000</td>
<td>-1.23e+07</td>
<td>1.23e+07</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CALC[T.no]</th>
<td>-4.1972</td>
<td>6.28e+06</td>
<td>-6.68e-07</td>
<td>1.000</td>
<td>-1.23e+07</td>
<td>1.23e+07</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Bike]</th>
<td>-21.8420</td>
<td>6.54e+09</td>
<td>-3.34e-09</td>
<td>1.000</td>
<td>-1.28e+10</td>
<td>1.28e+10</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Motorbike]</th>
<td>3.2252</td>
<td>47.781</td>
<td>0.068</td>
<td>0.946</td>
<td>-90.423</td>
<td>96.873</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">MTRANS[T.Public_Transportation]</th>
<td>8.8055</td>
<td>1.416</td>
<td>6.219</td>
<td>0.000</td>
<td>6.030</td>
<td>11.581</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">MTRANS[T.Walking]</th>
<td>1.2540</td>
<td>2.256</td>
<td>0.556</td>
<td>0.578</td>
<td>-3.168</td>
<td>5.676</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Age</th>
<td>0.7030</td>
<td>0.116</td>
<td>6.086</td>
<td>0.000</td>
<td>0.477</td>
<td>0.929</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Height</th>
<td>-104.6838</td>
<td>9.021</td>
<td>-11.605</td>
<td>0.000</td>
<td>-122.364</td>
<td>-87.003</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Weight</th>
<td>2.6259</td>
<td>0.190</td>
<td>13.819</td>
<td>0.000</td>
<td>2.253</td>
<td>2.998</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">FCVC</th>
<td>0.1776</td>
<td>0.764</td>
<td>0.232</td>
<td>0.816</td>
<td>-1.320</td>
<td>1.675</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NCP</th>
<td>-1.8276</td>
<td>0.608</td>
<td>-3.007</td>
<td>0.003</td>
<td>-3.019</td>
<td>-0.636</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-1.8930</td>
<td>0.757</td>
<td>-2.502</td>
<td>0.012</td>
<td>-3.376</td>
<td>-0.410</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-1.0280</td>
<td>0.438</td>
<td>-2.347</td>
<td>0.019</td>
<td>-1.887</td>
<td>-0.169</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">TUE</th>
<td>0.1282</td>
<td>0.670</td>
<td>0.191</td>
<td>0.848</td>
<td>-1.186</td>
<td>1.442</td>
</tr>
</tbody>
</table>
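<p>One thing worth noticing before moving on: several coefficients above carry standard errors like 3.25e+05 or outright nan. That is the classic fingerprint of (quasi-)complete separation — some category levels (CALC, MTRANS[T.Bike]) almost perfectly predict a class, so the MLE runs off to infinity. A tiny, hypothetical helper (not part of statsmodels) can flag such rows programmatically; the usual remedies are collapsing sparse levels or a regularized fit.</p>

```python
import math

def flag_unstable(names, std_errs, threshold=100.0):
    """Return coefficients whose standard error exploded or is undefined —
    the usual fingerprint of (quasi-)complete separation in a logit fit.
    `threshold` is an arbitrary cut-off chosen purely for illustration."""
    return [n for n, se in zip(names, std_errs)
            if math.isnan(se) or se > threshold]

# A few rows lifted from the NObeyesdad=2 block above:
names = ["Gender[T.Male]", "CAEC[T.no]", "CALC[T.no]", "Weight"]
std_errs = [1.983, 894.479, float("nan"), 0.288]
flag_unstable(names, std_errs)  # ['CAEC[T.no]', 'CALC[T.no]']
```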
</div>
</div>
</div>
</div>
</div>
</div></div><section id="evaluating-model-performance" class="level4">
<h4 class="anchored" data-anchor-id="evaluating-model-performance">Evaluating Model Performance</h4>
<p>To gauge our models’ effectiveness, we’ll use the standard metrics: accuracy, precision, recall, and F1-score. A confusion matrix helps visualize how well each model classifies the outcomes—think of it as a report card for your model!</p>
<div id="6bcea57d" class="cell" data-execution_count="8">
<details class="code-fold">
<summary>Evaluating the Logit model</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Predict on test data</span></span>
<span id="cb8-2">base_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> logit_model.predict(data_df_test).idxmax(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb8-3">y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data_df_test[target]</span>
<span id="cb8-4"></span>
<span id="cb8-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Evaluate the model</span></span>
<span id="cb8-6">accuracy_orig <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> accuracy_score(y_test, base_preds)</span>
<span id="cb8-7">report_orig <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> classification_report(y_test, base_preds)</span>
<span id="cb8-8"></span>
<span id="cb8-9"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy:"</span>, accuracy_orig)</span>
<span id="cb8-10"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Classification Report:"</span>)</span>
<span id="cb8-11"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(report_orig)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Accuracy: 0.909952606635071
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.86      0.89        29
           1       0.86      0.83      0.84        29
           2       0.95      0.91      0.93        45
           3       0.94      0.97      0.95        31
           4       1.00      0.96      0.98        27
           5       0.83      0.90      0.86        21
           6       0.84      0.93      0.89        29

    accuracy                           0.91       211
   macro avg       0.91      0.91      0.91       211
weighted avg       0.91      0.91      0.91       211
</code></pre>
</div>
</div>
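<p>The cell above only prints the aggregate report. To see <em>which</em> classes get mistaken for which, you can build the confusion matrix itself — a minimal NumPy sketch of what <code>sklearn.metrics.confusion_matrix</code> computes, on toy labels rather than the obesity data:</p>

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are predicted classes;
    cell (i, j) counts samples of class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Tiny illustrative labels (not the obesity data):
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
# Diagonal cells are correct predictions; per-class recall is the
# diagonal divided by each row's total.
recall_per_class = cm.diagonal() / cm.sum(axis=1)  # [0.5, 1.0, 0.5]
```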
</section>
</section>
<section id="looking-for-some-improvments" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="looking-for-some-improvments">Looking for some <em>Improvements!</em></h3>
<section id="feature-selection-using-crosstab-sparsity" class="level4 page-columns page-full">
<h4 class="anchored" data-anchor-id="feature-selection-using-crosstab-sparsity">Feature Selection Using CrossTab Sparsity</h4>
<p>Now comes the exciting part—using CrossTab Sparsity to refine our feature selection process! It’s like cleaning up your closet and only keeping the clothes that spark joy (thank you, Marie Kondo). <sup>1</sup></p>
<div class="no-row-height column-margin column-container"><div id="fn1"><p><sup>1</sup>&nbsp;This is based on the paper <em>Unique Metric for Health Analysis with Optimization of Clustering Activity and Cross Comparison of Results from Different Approach</em>. <a href="https://arxiv.org/abs/1810.03419">Paper Link</a></p></div></div><p><a href="https://gist.github.com/jkapila/83bb8f6461ec91bfced437762f2c9220">Code is here!</a></p>
</section>
<section id="standared-steps-for-feature-selection" class="level4 page-columns page-full">
<h4 class="anchored" data-anchor-id="standared-steps-for-feature-selection">Standard Steps for Feature Selection</h4>
<ol type="1">
<li><strong>Calculate CrossTab Sparsity</strong>: For each feature against the target variable.</li>
<li><strong>Select Features</strong>: Based on sparsity scores that indicate significant interactions with the target variable.</li>
<li><strong>Recreate Models</strong>: Train new models using only the selected features—less is often more!</li>
</ol>
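<p>The real <code>crosstab_sparsity</code> implementation lives in the linked gist (numeric features are binned — <code>numeric_bin='decile'</code> in the call below). As a rough illustration of the idea only (my simplification, not the gist’s exact code), the score can be read as the share of empty cells in a feature-vs-target crosstab: the sparser the table, the more each feature level concentrates in a few target classes.</p>

```python
from collections import Counter

def crosstab_sparsity_score(feature, target):
    """Share of empty cells in the feature-vs-target crosstab.
    Illustrative simplification: a sparser crosstab means each feature
    level maps onto few target classes, i.e. it segregates the target."""
    counts = Counter(zip(feature, target))
    levels, classes = sorted(set(feature)), sorted(set(target))
    empty = sum(counts[(f, c)] == 0 for f in levels for c in classes)
    return empty / (len(levels) * len(classes))

feature = ["a"] * 5 + ["b"] * 5
target = [0] * 5 + [1] * 5
crosstab_sparsity_score(feature, target)  # 0.5: each level owns one class
```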
<p>Here we go!!!</p>
<div class="page-columns page-full">
<div id="c09c1400" class="cell page-columns page-full" data-execution_count="10">
<details class="code-fold">
<summary>Doing what needs to be done ;)</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb10-1">sns.set_style(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>)</span>
<span id="cb10-2">sns.set_context(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"paper"</span>)</span>
<span id="cb10-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Calculating CrossTab sparsity for each column</span></span>
<span id="cb10-4">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> crosstab_sparsity(data_df_train.iloc[:,:<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],data_df_train[target],numeric_bin<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'decile'</span>)</span>
<span id="cb10-5"></span>
<span id="cb10-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># presenting results for consumption</span></span>
<span id="cb10-7">df_long <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.melt(results[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'scores'</span>], id_vars<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Columns'</span>], value_vars<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'seggregation'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'explaination'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'metric'</span>],</span>
<span id="cb10-8">                  var_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Metric'</span>, value_name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'values'</span>)</span>
<span id="cb10-9"></span>
<span id="cb10-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Adding jitter: small random noise to 'Columns' (x-axis)</span></span>
<span id="cb10-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># df_long['values_jittered'] = df_long['Value'] + np.random.uniform(-0.1, 0.1, size=len(df_long))</span></span>
<span id="cb10-12"></span>
<span id="cb10-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a seaborn scatter plot with jitter, more professional color palette, and transparency</span></span>
<span id="cb10-14">plt.figure(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>))</span>
<span id="cb10-15">sns.scatterplot(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Columns'</span>, y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'values'</span>, hue<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Metric'</span>, style<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Metric'</span>,</span>
<span id="cb10-16">        data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df_long, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.7</span>, palette<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'deep'</span>)</span>
<span id="cb10-17"></span>
<span id="cb10-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Title and labels</span></span>
<span id="cb10-19">plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Metrics by Columns'</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">16</span>)</span>
<span id="cb10-20">plt.xticks(rotation<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">45</span>) </span>
<span id="cb10-21">plt.xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Columns'</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb10-22">plt.ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Value'</span>, fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>)</span>
<span id="cb10-23"></span>
<span id="cb10-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Display legend outside the plot for better readability</span></span>
<span id="cb10-25">plt.legend(title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Metric'</span>, loc<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'upper right'</span>, fancybox<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, framealpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>)</span>
<span id="cb10-26"></span>
<span id="cb10-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show the plot</span></span>
<span id="cb10-28">plt.tight_layout()</span>
<span id="cb10-29">plt.show()</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>CSP calculated with decile for breaks!

Scores for 7 groups(s) is : 140.96057955229762</code></pre>
</div>
<div class="cell-output cell-output-display page-columns page-full">
<div class="page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="index_files/figure-html/cell-10-output-2.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-13"><img src="https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/index_files/figure-html/cell-10-output-2.png" width="1143" height="472" class="figure-img column-page-right"></a></p>
</figure>
</div>
</div>
</div>
</div>
</section>
<section id="and-drum-rolls-pelase" class="level4 page-columns page-full">
<h4 class="anchored" data-anchor-id="and-drum-rolls-pelase">And Drum Rolls please!!!</h4>
<p>Using just the top 5 variables we get similar or even better overall accuracy. This greatly simplifies the model and makes it clear why some variables add little to the modeling.</p>
<div id="9b922400" class="cell" data-execution_count="11">
<details class="code-fold">
<summary>And finally training and evaluating with drum rolls</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb12-1">logit_model_rev <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sm.MNLogit.from_formula(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>target<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> ~ </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">' + '</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>join(results[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'scores'</span>].loc[:<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Columns'</span>].values)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>, </span>
<span id="cb12-2">    data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>data_df_train</span>
<span id="cb12-3">).fit_regularized()</span>
<span id="cb12-4"></span>
<span id="cb12-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Predict on test data</span></span>
<span id="cb12-6">challenger_preds <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> logit_model_rev.predict(data_df_test).idxmax(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb12-7">y_test <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data_df_test[target]</span>
<span id="cb12-8"></span>
<span id="cb12-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Evaluate the model</span></span>
<span id="cb12-10">accuracy_new <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> accuracy_score(y_test, challenger_preds)</span>
<span id="cb12-11">report_new <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> classification_report(y_test, challenger_preds)</span>
<span id="cb12-12"></span>
<span id="cb12-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy:"</span>, accuracy_new)</span>
<span id="cb12-14"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Classification Report:"</span>)</span>
<span id="cb12-15"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(report_new)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-stdout">
<pre><code>Singular matrix E in LSQ subproblem    (Exit mode 5)
            Current function value: nan
            Iterations: 470
            Function evaluations: 1227
            Gradient evaluations: 470
Accuracy: 0.9383886255924171
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.97      0.95        29
           1       0.93      0.93      0.93        29
           2       0.96      1.00      0.98        45
           3       0.93      0.90      0.92        31
           4       0.93      0.93      0.93        27
           5       0.90      0.90      0.90        21
           6       0.96      0.90      0.93        29

    accuracy                           0.94       211
   macro avg       0.94      0.93      0.93       211
weighted avg       0.94      0.94      0.94       211
</code></pre>
</div>
<div class="cell-output cell-output-stderr">
<pre><code>/home/jitin/Documents/applications/perceptions/.venv/lib/python3.12/site-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "</code></pre>
</div>
</div>

<div class="no-row-height column-margin column-container"><div class="">
<div class="callout callout-style-simple callout-none no-icon callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon no-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">None</span>Summary of retrained model
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<div id="956cce05" class="cell" data-execution_count="12">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb15-1">display(logit_model_rev.summary())</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<table class="simpletable table table-sm table-striped small">
<caption>MNLogit Regression Results</caption>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">Dep. Variable:</th>
<td>NObeyesdad</td>
<th data-quarto-table-cell-role="th">No. Observations:</th>
<td>1900</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Model:</th>
<td>MNLogit</td>
<th data-quarto-table-cell-role="th">Df Residuals:</th>
<td>1858</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Method:</th>
<td>MLE</td>
<th data-quarto-table-cell-role="th">Df Model:</th>
<td>36</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Date:</th>
<td>Sat, 09 May 2026</td>
<th data-quarto-table-cell-role="th">Pseudo R-squ.:</th>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Time:</th>
<td>00:48:52</td>
<th data-quarto-table-cell-role="th">Log-Likelihood:</th>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">converged:</th>
<td>False</td>
<th data-quarto-table-cell-role="th">LL-Null:</th>
<td>-3691.8</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Covariance Type:</th>
<td>nonrobust</td>
<th data-quarto-table-cell-role="th">LLR p-value:</th>
<td>nan</td>
</tr>
</tbody>
</table>


<table class="simpletable table table-sm table-striped small">
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=1</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>58.1248</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>0.1130</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-0.8634</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>0.1425</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>0.0579</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-76.5735</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>1.3337</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=2</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>328.4616</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>2.2275</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-1.4150</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-1.3585</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>0.1537</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-426.3945</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>5.3584</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=3</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>306.6447</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>-7.8630</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-21.0118</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-11.3624</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>2.4017</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-710.3867</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>10.1072</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=4</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>352.4249</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>-9.2469</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-20.6780</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-14.7525</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>2.1487</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-758.2318</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>10.5011</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=5</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>126.2892</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>0.5832</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-0.8764</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-0.1920</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>0.0719</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-160.2982</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>2.3663</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">NObeyesdad=6</th>
<th data-quarto-table-cell-role="th">coef</th>
<th data-quarto-table-cell-role="th">std err</th>
<th data-quarto-table-cell-role="th">z</th>
<th data-quarto-table-cell-role="th">P&gt;|z|</th>
<th data-quarto-table-cell-role="th">[0.025</th>
<th data-quarto-table-cell-role="th">0.975]</th>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Intercept</th>
<td>207.3760</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">TUE</th>
<td>1.6561</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">CH2O</th>
<td>-0.6583</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">FAF</th>
<td>-0.1243</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Age</th>
<td>0.1042</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">Height</th>
<td>-266.6050</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Weight</th>
<td>3.6160</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
<td>nan</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div></div></section>
</section>
<section id="impact-on-model-accuracy" class="level3">
<h3 class="anchored" data-anchor-id="impact-on-model-accuracy">Impact on Model Accuracy</h3>
<p>After applying feature selection based on CrossTab Sparsity, we’ll compare the accuracy of our new models against our baseline models. This comparison will reveal how effectively CrossTab Sparsity enhances classification performance.</p>
<section id="results-and-discussion-unveiling-insights" class="level4">
<h4 class="anchored" data-anchor-id="results-and-discussion-unveiling-insights">Results and Discussion: Unveiling Insights</h4>
<p><strong>Model Comparison Table</strong></p>
<p>After implementing CrossTab Sparsity in our feature selection process, let’s take a look at the results:</p>
<div id="96236f1e" class="cell" data-execution_count="13">
<details class="code-fold">
<summary>Comparison Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb16-1">metrics <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb16-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Metric"</span>: [<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accuracy"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Precision"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Recall"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"F1-Score"</span>],</span>
<span id="cb16-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Baseline Model with all Parameters"</span>: [</span>
<span id="cb16-4">        accuracy_score(y_test, base_preds),</span>
<span id="cb16-5">        precision_score(y_test, base_preds, average<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weighted'</span>),</span>
<span id="cb16-6">        recall_score(y_test, base_preds, average<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weighted'</span>),</span>
<span id="cb16-7">        f1_score(y_test, base_preds, average<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weighted'</span>),</span>
<span id="cb16-8">    ],</span>
<span id="cb16-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Challenger Model with only 5 Variables"</span>: [</span>
<span id="cb16-10">        accuracy_score(y_test, challenger_preds),</span>
<span id="cb16-11">        precision_score(y_test, challenger_preds, average<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weighted'</span>),</span>
<span id="cb16-12">        recall_score(y_test, challenger_preds, average<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weighted'</span>),</span>
<span id="cb16-13">        f1_score(y_test, challenger_preds, average<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weighted'</span>),</span>
<span id="cb16-14">    ]</span>
<span id="cb16-15">}</span>
<span id="cb16-16">display(pd.DataFrame(metrics).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>).set_index(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Metric'</span>).T)</span></code></pre></div></div>
</details>
<div class="cell-output cell-output-display">
<div>


<table class="dataframe table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th">Metric</th>
<th data-quarto-table-cell-role="th">Accuracy</th>
<th data-quarto-table-cell-role="th">Precision</th>
<th data-quarto-table-cell-role="th">Recall</th>
<th data-quarto-table-cell-role="th">F1-Score</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">Baseline Model with all Parameters</th>
<td>0.9100</td>
<td>0.9123</td>
<td>0.9100</td>
<td>0.9103</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">Challenger Model with only 5 Variables</th>
<td>0.9384</td>
<td>0.9384</td>
<td>0.9384</td>
<td>0.9381</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p><strong>Insights Gained</strong></p>
<p>Through this analysis, several key insights emerge:</p>
<div id="379d63a1" class="cell" data-execution_count="14">
<div class="cell-output cell-output-stdout">
<pre><code>Reduction of similar accuracy from 16 to 5 i.e 68.75% reduction</code></pre>
</div>
</div>
<ol type="1">
<li><strong>Feature Interactions Matter</strong>: The selected features based on CrossTab Sparsity significantly improved model accuracy—like finding out which ingredients make your favorite dish even better!</li>
<li><strong>Simplicity is Key</strong>: By focusing on relevant features, we enhance accuracy while simplifying model interpretation—because nobody likes unnecessary complexity.</li>
<li><strong>Real-World Applications</strong>: These findings have practical implications in fields such as healthcare, where classification plays a critical role in supporting better decisions.</li>
</ol>
</section>
</section>
<section id="conclusion-the-road-ahead" class="level3">
<h3 class="anchored" data-anchor-id="conclusion-the-road-ahead">Conclusion: The Road Ahead</h3>
<p>In conclusion, this blog has illustrated how CrossTab Sparsity can be a game-changer in classification tasks using the Obesity dataset. By leveraging this metric for feature selection, we achieved notable improvements in model performance—proof that sometimes less really is more!</p>
<p><strong>Future Work: Expanding Horizons</strong></p>
<p>As we look ahead, there are exciting avenues to explore:</p>
<ul>
<li>Investigating regression problems using CrossTab Sparsity.</li>
<li>Comparing its effectiveness with other feature selection methods such as Recursive Feature Elimination (RFE).</li>
</ul>
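<p>The regression extension mentioned above might work by quantile-binning the continuous target as well, then reusing the same crosstab idea. This is a speculative sketch under that assumption; the function name <code>sparsity_score_regression</code> and the empty-cell scoring rule are hypothetical, not an established method.</p>

```python
import numpy as np
import pandas as pd

def sparsity_score_regression(feature, target, bins=10):
    # For a continuous target, bin both feature and target into
    # quantiles, then score the crosstab as in the classification case.
    x = pd.qcut(pd.Series(feature), q=bins, duplicates="drop")
    y = pd.qcut(pd.Series(target), q=bins, duplicates="drop")
    ct = pd.crosstab(x, y)
    return (ct.values == 0).mean()

rng = np.random.default_rng(1)
t = rng.normal(size=500)
related = sparsity_score_regression(t + 0.05 * rng.normal(size=500), t)
unrelated = sparsity_score_regression(rng.normal(size=500), t)
print(related > unrelated)
```

<p>A feature that tracks the target produces a near-diagonal crosstab (mostly empty off-diagonal cells), while an unrelated feature fills the table roughly uniformly.</p>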
<p>By continuing this journey into data science, we not only enhance our technical skills but also contribute valuable insights that can drive meaningful change in various industries.</p>


</section>


<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>classification</category>
  <category>technical</category>
  <guid>https://jitinkapila.com/writing/engineering/04_crosstab_sparsity_classification/</guid>
  <pubDate>Mon, 02 Jan 2023 18:30:00 GMT</pubDate>
</item>
<item>
  <title>CrossTab Sparsity</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/engineering/03_crosstab_sparsity/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/pexels-n-voitkevich.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Photo by Nataliya Vaitkevich"><img src="https://jitinkapila.com/writing/engineering/03_crosstab_sparsity/img/pexels-n-voitkevich.jpg" class="img-fluid figure-img" style="width: 600px; overflow: hidden; position: relative;" alt="Photo by Nataliya Vaitkevich"></a></p>
<figcaption>Photo by <a href="https://www.pexels.com/photo/empty-crossroads-in-hills-5712829/">Nataliya Vaitkevich</a></figcaption>
</figure>
</div>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Cluster analysis has always fascinated me as a window into the hidden structures of data. During my collaboration with Kumarjit Pathak, we grappled with a persistent challenge in unsupervised learning: <strong>how to objectively evaluate clustering quality across different algorithms</strong>. Traditional metrics like the Silhouette Index or Bayesian Information Criterion felt restrictive—they were siloed within specific methodologies, making cross-algorithm comparisons unreliable.</p>
<p>This frustration led us to develop a <strong>universal cluster evaluation metric</strong>, detailed in our paper <em>“Cross Comparison of Results from Different Clustering Approaches”</em>. Our goal was to create a framework that transcends algorithmic biases, enabling:<br>
- Direct comparison of K-Means vs GMM vs DBSCAN vs PAM vs SOM vs Anything results<br>
- Identification of variables muddying cluster separation<br>
- Automated determination of optimal cluster counts</p>
<p>In this blog, I’ll walk you through our journey—from conceptualization to real-world validation—and share insights that didn’t make it into the final paper.</p>
</section>
<section id="the-birth-of-the-metric-a-first-person-perspective" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="the-birth-of-the-metric-a-first-person-perspective">The Birth of the Metric: A First-Person Perspective</h2>
<p><strong><em>Why Existing Methods Fell Short</em></strong><br>
Early in our research, we cataloged limitations of popular evaluation techniques:</p>
<ol type="1">
<li>Method Dependency<br>
</li>
</ol>
<ul>
<li>Silhouette scores worked beautifully for K-Means but faltered with Gaussian Mixture Models (GMM).<br>
</li>
<li>Probability-based metrics like BIC couldn’t handle distance-based clusters.</li>
</ul>
<ol start="2" type="1">
<li><p>Noise Blindness<br>
Noisy variables often contaminated clusters, but traditional methods required manual outlier detection.</p></li>
<li><p>Subjective Optimization<br>
Elbow plots and dendrograms left too much room for human interpretation.</p></li>
</ol>
<section id="our-aha-moment---crosstab-sparsity" class="level3 page-columns page-full">
<h3 class="anchored" data-anchor-id="our-aha-moment---crosstab-sparsity">Our “Aha!” Moment - Crosstab Sparsity</h3>
<div class="page-columns page-full">
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="img/Picture 1.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-2" title="Best Cluster for K-means Using Crosstab sparsity"><img src="https://jitinkapila.com/writing/engineering/03_crosstab_sparsity/img/Picture 1.png" class="img-fluid figure-img column-page-right" alt="Best Cluster for K-means Using Crosstab sparsity"></a></p>
<figcaption>Best Cluster for K-means Using Crosstab sparsity</figcaption>
</figure>
</div>
</div>
<p>While analyzing cross-tab matrices of variable distributions across clusters, we noticed a pattern: <strong>well-segregated clusters consistently showed higher frequencies along matrix diagonals</strong>. This inspired our two-part metric:</p>

<div class="no-row-height column-margin column-container"><div class="">
<ol type="1">
<li><p><strong>Segregation Factor</strong>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Simplified calculation from our codebase  </span></span>
<span id="cb1-2">median <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.median(cross_tab)  </span>
<span id="cb1-3">N_vk <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(cross_tab <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> median)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Count "well-segregated" instances  </span></span></code></pre></div></div></li>
<li><p><strong>Explanation Factor</strong>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb2-1">explanation <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.log(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(data) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (bins <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> clusters))  </span></code></pre></div></div></li>
</ol>
</div></div><ol type="1">
<li><strong>Segregation Factor</strong>: Measures how distinctly clusters separate data points. We used the median (not mean) to avoid skew from outlier-dominated matrices.</li>
<li><strong>Explanation Factor</strong>: Quantifies how well clusters capture data variability. The logarithmic term penalizes overfitting—a critical insight from debugging early over-segmented clusters.</li>
</ol>
<p><strong>And the Final Formula</strong>:<br>
For variable <img src="https://latex.codecogs.com/png.latex?v"> with <img src="https://latex.codecogs.com/png.latex?k"> clusters:<br>
<img src="https://latex.codecogs.com/png.latex?%0AS_v%5Ek%20=%20%5Cunderbrace%7B%5Cfrac%7BN_v%5Ek%7D%7B%5Cmax(l,%20k)%7D%7D_%7B%5Ctext%7BSegregation%7D%7D%20%5Ctimes%20%5Cunderbrace%7B%5Cln%5Cleft(%5Cfrac%7BN_d%7D%7Bl%20%5Ctimes%20k%7D%5Cright)%7D_%7B%5Ctext%7BExplanation%7D%7D%0A"></p>
<p>where:<br>
- <img src="https://latex.codecogs.com/png.latex?N_v%5Ek">: Segregated instances (values above cross-tab matrix median)<br>
- <img src="https://latex.codecogs.com/png.latex?l">: Number of value intervals for variable <img src="https://latex.codecogs.com/png.latex?v"><br>
- <img src="https://latex.codecogs.com/png.latex?N_d">: Total observations</p>
<p>This formulation ensures <strong>algorithmic invariance</strong>, allowing comparison across methods like K-Means (distance-based) and GMM (probability-based). You can also see two scenarios directly from the formula: 1. If a variable’s crosstab is too dense, there is no separation between classes. 2. If it is too sparse, we lose explanatory power.</p>
<p>Hence the curve reaches a maximum and then falls, giving us the separability that the clustering can produce.</p>
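<p>A minimal sketch of the full score, assuming quantile binning via pandas. The helper name is mine, and it mirrors the formula rather than our production code:</p>

```python
import numpy as np
import pandas as pd

def crosstab_sparsity_score(values, labels, bins=10):
    """S_v^k for one variable: segregation x explanation.

    Illustrative helper mirroring the formula above; the name and the
    binning choice (pandas qcut) are mine, not the paper's code.
    """
    # l value intervals for the variable (quantile-based)
    binned = pd.qcut(values, q=bins, duplicates="drop")
    ct = pd.crosstab(binned, labels).to_numpy()
    l, k = ct.shape
    # Segregation factor: cells above the crosstab median, normalized
    n_vk = np.sum(ct > np.median(ct))
    segregation = n_vk / max(l, k)
    # Explanation factor: log penalty against over-segmentation
    explanation = np.log(len(values) / (l * k))
    return segregation * explanation

# Two well-separated blobs give a clearly positive score
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0, 1, 100), rng.normal(8, 1, 100)])
labels = np.array([0] * 100 + [1] * 100)
print(crosstab_sparsity_score(values, labels))
```

<p>Sweeping this score over candidate cluster counts traces exactly the rise-then-fall curve described above.</p>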
</section>
</section>
<section id="case-study-vehicle-silhouettes-through-my-eyes" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="case-study-vehicle-silhouettes-through-my-eyes">Case Study: Vehicle Silhouettes (Through My Eyes)</h2>
<section id="the-dataset-that-almost-broke-us" class="level4 page-columns page-full">
<h4 class="anchored" data-anchor-id="the-dataset-that-almost-broke-us">The Dataset That Almost Broke Us</h4>
<p>We tested our metric on a vehicle silhouette dataset with 18 shape-related features (e.g., compactness, circularity). Initially, inconsistent results plagued us—until we realized our binning strategy for continuous variables was flawed.</p>
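<p>The difference between the two binning strategies is easy to demonstrate on skewed data (a toy pandas sketch, not the original preprocessing):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# A right-skewed variable, like many of the shape features
skewed = pd.Series(rng.exponential(scale=2.0, size=1000))

# Equal-width bins: most observations pile into the first intervals
equal_width = pd.cut(skewed, bins=10)
# Quantile bins: roughly 100 observations per interval, even under skew
quantile = pd.qcut(skewed, q=10)

print(equal_width.value_counts().max())
print(quantile.value_counts().max())
```

<p>On skewed variables the equal-width crosstab is dominated by a few overloaded cells, which is exactly what distorted our early scores.</p>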

<div class="no-row-height column-margin column-container"><div class="">
<p><strong>Key Adjustments</strong>:<br>
- Switched from equal-width to <strong>quantile-based binning</strong> (10 bins per variable).<br>
- For categorical variables, retained native levels instead of coercing bins.</p>
</div></div></section>
<section id="the-breakthrough" class="level4 page-columns page-full">
<h4 class="anchored" data-anchor-id="the-breakthrough">The Breakthrough</h4>
<p>After refining the preprocessing:</p>
<blockquote class="blockquote">
<p>Optimal Clusters: Our metric plateaued at <img src="https://latex.codecogs.com/png.latex?k=6"> , aligning perfectly with known vehicle categories (sedans, trucks, etc.).<br>
Noise Detection: Variables like <em>Max.LWR</em> (length-width ratio) scored poorly, revealing inconsistent clustering. We later found this was due to manufacturers’ design variances.</p>
</blockquote>
<p>Finding the best cluster count for K-Means alone:</p>
<div class="page-columns page-full">
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="img/Kmeans on Vehicle.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-3" title="Best Cluster for K-Means Using Crosstab sparsity"><img src="https://jitinkapila.com/writing/engineering/03_crosstab_sparsity/img/Kmeans on Vehicle.png" class="img-fluid figure-img column-page-right" alt="Best Cluster for K-Means Using Crosstab sparsity"></a></p>
<figcaption>Best Cluster for K-Means Using Crosstab sparsity</figcaption>
</figure>
</div>
</div>
<p>Comparing all clustering methods to find the optimal one:</p>
<div class="page-columns page-full">
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="img/Comparision across many methods.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-4" title="Optimal Cluster for many methods"><img src="https://jitinkapila.com/writing/engineering/03_crosstab_sparsity/img/Comparision across many methods.png" class="img-fluid figure-img column-page-right" alt="Optimal Cluster for many methods"></a></p>
<figcaption>Optimal Cluster for many methods</figcaption>
</figure>
</div>
</div>
<p><strong>The chunkiest part</strong>: understanding each variable’s contribution to separateness. This gives direct insight into which variable in your data is the most critical separator.</p>
<div class="page-columns page-full">
<div class="quarto-figure quarto-figure-center page-columns page-full">
<figure class="figure page-columns page-full">
<p class="page-columns page-full"><a href="img/Variable segregation.png" class="lightbox page-columns page-full" data-gallery="quarto-lightbox-gallery-5" title="All kind of variable scored against Metrics"><img src="https://jitinkapila.com/writing/engineering/03_crosstab_sparsity/img/Variable segregation.png" class="img-fluid figure-img column-page-right" alt="All kind of variable scored against Metrics"></a></p>
<figcaption>All kind of variable scored against Metrics</figcaption>
</figure>
</div>
</div>
</section>
</section>
<section id="comparative-advantages-and-creativity-at-work" class="level2">
<h2 class="anchored" data-anchor-id="comparative-advantages-and-creativity-at-work">Comparative Advantages and Creativity at Work</h2>
<p><strong><em>Comparative Advantage Over Traditional Metrics</em></strong></p>
<table class="table">
<colgroup>
<col style="width: 24%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="header">
<th><strong>Feature/Scenario</strong></th>
<th><strong>Silhouette Index</strong></th>
<th><strong>Davies-Bouldin</strong></th>
<th><strong>Crosstab Sparsity</strong></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Algorithm Agnostic</td>
<td>❌ (Distance-based only)</td>
<td>❌</td>
<td>✔️</td>
</tr>
<tr class="even">
<td>Handles Mixed Data</td>
<td>❌</td>
<td>❌</td>
<td>✔️</td>
</tr>
<tr class="odd">
<td>Identifies Noisy Vars</td>
<td>❌</td>
<td>❌</td>
<td>✔️</td>
</tr>
<tr class="even">
<td>Optimal Cluster Detection</td>
<td>Manual elbow analysis</td>
<td>Manual analysis</td>
<td>Automated plateau detection</td>
</tr>
<tr class="odd">
<td>Mixed Algorithms</td>
<td>Failed (GMM vs K-Means)</td>
<td>Failed (needs numerical data)</td>
<td>Achieved 92% consistency[1]</td>
</tr>
<tr class="even">
<td>Noisy Variables</td>
<td>Manual outlier removal</td>
<td>Manual outlier removal</td>
<td>Auto-detected (e.g., Max.LWR)</td>
</tr>
<tr class="odd">
<td>Optimal Cluster Detection</td>
<td>Subjective elbow plots</td>
<td>Subjective elbow/dendrogram analysis</td>
<td>Objective plateau detection</td>
</tr>
</tbody>
</table>
<p><br></p>
<p>Our creativity yielded boons. We wanted a simple metric to judge different kinds of clusters, but we got much more from our experiments and work on this metric:</p>
<ol type="1">
<li>Variable-Level Diagnostics: Low <img src="https://latex.codecogs.com/png.latex?S_v%5Ek"> scores pinpoint variables muddying cluster separation.<br>
</li>
<li>Cross-Method Benchmarking: Compare K-Means (distance) vs GMM (probability) vs hierarchical vs partition clustering fairly using a unified score.<br>
</li>
<li>Scale Invariance: Logarithmic term makes scores comparable across datasets of varying sizes.<br>
</li>
<li>Debug Cluster Quality: Identify and remove noisy variables preemptively<br>
</li>
<li>Automate Model Selection: Objectively choose between K-Means, GMM, PAM, Agglomerative.</li>
</ol>
</section>
<section id="lessons-learned-and-future-vision" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned-and-future-vision">Lessons Learned and Future Vision</h2>
<p><strong>A few takeaways from these experiments</strong><br>
1. Binning Sensitivity: Quantile-based binning was transformative. Equal-width bins distorted scores for skewed variables.<br>
2. Categorical Handling: Native levels for categorical outperformed frequency-based grouping.<br>
3. Non-Parametric Approach: This approach allowed us to make sense of data without being tied down by assumptions. We have seen how this metric can be a game-changer for statisticians, providing insights not just into cluster behavior but also into rare event modeling.</p>
<p>The plots from these experiments not only clarify how clusters behave but also offer valuable insights for identifying outliers. I believe there’s exciting potential to extend this metric into classification and value estimation modeling. Imagine using it as a loss function in both linear and non-linear methods to achieve better data segmentation! A topic for another blog someday!</p>
<section id="a-personal-reflection" class="level4">
<h4 class="anchored" data-anchor-id="a-personal-reflection">A Personal Reflection</h4>
<p>Developing this metric taught me that <strong>simplicity often masks depth</strong>. A two-component formula now underpins clustering decisions in industries we never imagined—from fraud detection to genomics. Yet, I’m most proud of how it democratizes cluster analysis: business analysts at our partner firms now optimize clusters without PhD-level stats.</p>
<p>You can find a Python implementation <a href="https://gist.github.com/jkapila/83bb8f6461ec91bfced437762f2c9220">here</a>.</p>
<p><em>This blog synthesizes findings from our original paper, available <a href="https://arxiv.org/abs/1810.03419">here</a>. For a deeper dive into the math, check Section 3 of the paper.</em></p>
<p><strong>To my readers</strong>: Have you tried implementing cross-algorithm clustering? Share your war stories in the comments—I’d love to troubleshoot together!</p>


</section>
</section>

<a onclick="window.scrollTo(0, 0); return false;" id="quarto-back-to-top"><i class="bi bi-arrow-up"></i> Back to top</a> ]]></description>
  <category>clustering</category>
  <category>technical</category>
  <guid>https://jitinkapila.com/writing/engineering/03_crosstab_sparsity/</guid>
  <pubDate>Mon, 02 May 2022 18:30:00 GMT</pubDate>
</item>
<item>
  <title>A flow to Test Your Hypothesis in Python</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/engineering/02_hypothesis_test/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="pexels-tara-winstead-7722866-thumb.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Hypothesis testing Photo by Tara Winstead"><img src="https://jitinkapila.com/writing/engineering/02_hypothesis_test/pexels-tara-winstead-7722866-thumb.jpg" class="w-100 img-fluid figure-img" alt="Hypothesis testing Photo by Tara Winstead"></a></p>
<figcaption>Hypothesis testing <a href="https://www.pexels.com/photo/text-7722866/">Photo by Tara Winstead</a></figcaption>
</figure>
</div>
<section id="overview" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="overview">Overview</h2>
<p>All practitioners of data science eventually hit one giant task with data, and you know it well: <em>EDA - Exploratory Data Analysis</em>. The term <em>EDA</em><sup>1</sup> was coined by Tukey himself in his seminal book published in 1983. But do you think that <em>EDA</em> didn’t exist before that?</p>
<div class="no-row-height column-margin column-container"><div id="fn1"><p><sup>1</sup>&nbsp;Emerson, J. D., &amp; Hoaglin, D. C. (1983). Stem-and-leaf displays. In D. C. Hoaglin, F. Mosteller, &amp; J. W. Tukey (Eds.) Understanding Robust and Exploratory Data Analysis, pp.&nbsp;7–32. New York: Wiley. <a href="https://www.wiley.com/en-in/Understanding+Robust+and+Exploratory+Data+Analysis-p-9780471384915">Book is here.</a></p></div></div><p>Glad you thought about it. Before that, everyone was doing what is called <em><em>Hypothesis Testing</em></em>. Yes, back then the race was mostly to fit the data and produce the most unbiased and robust estimates. But remember one thing: <em>Hypothesis Testing</em> was, and largely remains, tied to <em>RCTs (Randomized Controlled Trials)</em>, a.k.a. Randomized Clinical Trials, the <em>Gold Standard</em> of data.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>More on RCTs and ODs
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>Let me not hijack the discussion into what <em>RCTs</em> and <em>Observational Data (ODs)</em> are, as that is more a question of <em>Philosophical Reasoning</em> than of data quality. Essentially, what we are trying to find is whether we can, using stats, identify <em>interesting patterns</em> in data.</p>
<p>The one thing that happens with RCT data is that we tend to believe these interesting patterns coincide with some sort of <em>‘Cause-Effect’</em> relationship. Due to the biased nature of ODs, we certainly can’t conclude this, and hence can only find <em>interesting</em> patterns.</p>
</div>
</div>
</div>
<p>Let’s move on. The big question is: for whatever reason you are doing <em>HT</em>, you are doing it to find <em>something interesting</em>. And that something interesting is usually found using <em><em>Post-Hoc Tests</em></em>. There is a variety of <em>Post-Hocs</em> available, but the best known, and hence the most readily available implementation, is <em>Tukey’s HSD</em>.</p>
<p>So let’s jump directly to how to follow this procedure. We’ll be using <code>bioinfokit</code> for this, as it is a much simpler wrapper around what’s implemented in <code>statsmodels</code>.</p>
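<p>Before the wrapper, it helps to see what the variance-homogeneity checks look like on their own. The helper below is a hypothetical standalone version (the function name and return shape are mine) using <code>scipy.stats</code> directly:</p>

```python
import pandas as pd
from scipy import stats

def check_variance_assumptions(df, res_var, group_var):
    """Levene's and Bartlett's tests for equal group variances.

    Hypothetical standalone version of the checks the wrapper prints;
    the function name and return shape are mine.
    """
    groups = [g[res_var].to_numpy() for _, g in df.groupby(group_var)]
    return {
        "levene": stats.levene(*groups),      # robust to non-normality
        "bartlett": stats.bartlett(*groups),  # assumes normal groups
    }
```

<p>A small p-value in either test warns that the equal-variance assumption behind classical ANOVA is shaky, which is exactly what the printed results later in this post show for the mpg data.</p>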
</section>
<section id="what-are-the-results" class="level2">
<h2 class="anchored" data-anchor-id="what-are-the-results">What are the results</h2>
<p>Pheww… That’s a lot of code, right? But it will save you a lot of time in real life, where you would write the code in the 3 steps below:</p>
<div id="7964598c" class="cell" data-execution_count="2">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># import libraries</span></span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Getting car data from UCI</span></span>
<span id="cb1-5">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'</span>,</span>
<span id="cb1-6">                 sep<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s+'</span>,header<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,</span>
<span id="cb1-7">                 names<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mpg'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cylinders'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'displacement'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'horsepower'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weight'</span>,</span>
<span id="cb1-8">                 <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'acceleration'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'model_year'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'origin'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'car_name'</span>])</span>
<span id="cb1-9">df.head()</span>
<span id="cb1-10"></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Syntax to do anove with validating the assumption, doing test and a post-hoc</span></span>
<span id="cb1-12">results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> do_anova_test(df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df, res_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mpg'</span>,xfac_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cylinders'</span>, </span>
<span id="cb1-13">                        anova_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'mpg ~ C(cylinders)+C(origin)+C(cylinders):C(origin)'</span>,</span>
<span id="cb1-14">                        ss_typ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, result_full<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
</details>
</div>
<p>Results from the <code>do_anova_test</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode numberSource {markdown} number-lines code-with-copy"><code class="sourceCode"><span id="cb2-1">Levens Test Result:</span>
<span id="cb2-2">                 Parameter    Value</span>
<span id="cb2-3">0      Test statistics (W)  14.5856</span>
<span id="cb2-4">1  Degrees of freedom (Df)   4.0000</span>
<span id="cb2-5">2                  p value   0.0000</span>
<span id="cb2-6"></span>
<span id="cb2-7">Bartletts Test Result:</span>
<span id="cb2-8">                 Parameter    Value</span>
<span id="cb2-9">0      Test statistics (T)  61.2143</span>
<span id="cb2-10">1  Degrees of freedom (Df)   4.0000</span>
<span id="cb2-11">2                  p value   0.0000</span>
<span id="cb2-12"></span>
<span id="cb2-13">ANOVA\ANCOVA Test Result:</span>
<span id="cb2-14">                           df     sum_sq    mean_sq         F  PR(&gt;F)      n2</span>
<span id="cb2-15">Intercept                 1.0  6195.1701  6195.1701  296.3452  0.0000  0.2727</span>
<span id="cb2-16">C(cylinders)              4.0  7574.5864  1893.6466   90.5824  0.0000  0.3334</span>
<span id="cb2-17">C(origin)                 2.0   241.0703   120.5351    5.7658  0.0034  0.0106</span>
<span id="cb2-18">C(cylinders):C(origin)    8.0   577.4821    72.1853    3.4530  0.0046  0.0254</span>
<span id="cb2-19">Residual                389.0  8132.1404    20.9052       NaN     NaN     NaN</span>
<span id="cb2-20"></span>
<span id="cb2-21">Tukey HSD Result:</span>
<span id="cb2-22">   group1  group2     Diff    Lower    Upper  q-value  p-value</span>
<span id="cb2-23">0       8       4  14.3237  12.8090  15.8383  36.6527   0.0010</span>
<span id="cb2-24">1       8       6   5.0226   3.1804   6.8648  10.5671   0.0010</span>
<span id="cb2-25">2       8       3   5.5869  -0.7990  11.9728   3.3909   0.1183</span>
<span id="cb2-26">3       8       5  12.4036   5.0643  19.7428   6.5503   0.0010</span>
<span id="cb2-27">4       4       6   9.3011   7.6765  10.9256  22.1910   0.0010</span>
<span id="cb2-28">5       4       3   8.7368   2.4102  15.0633   5.3524   0.0017</span>
<span id="cb2-29">6       4       5   1.9201  -5.3676   9.2078   1.0212   0.9000</span>
<span id="cb2-30">7       6       3   0.5643  -5.8486   6.9772   0.3410   0.9000</span>
<span id="cb2-31">8       6       5   7.3810   0.0182  14.7437   3.8854   0.0491</span>
<span id="cb2-32">9       3       5   6.8167  -2.7539  16.3873   2.7606   0.2919</span></code></pre></div></div>
<p>Nice!!!</p>
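<p>As an aside, if you only need the pairwise post-hoc step, recent SciPy (1.8+) ships <code>scipy.stats.tukey_hsd</code> directly. A minimal sketch on synthetic cylinder groups (the numbers here are made up, not the auto-mpg data):</p>

```python
import numpy as np
from scipy.stats import tukey_hsd  # available in SciPy >= 1.8

rng = np.random.default_rng(0)
# Synthetic mpg samples for three cylinder groups (made-up numbers)
g4 = rng.normal(29, 4, 60)
g6 = rng.normal(20, 3, 50)
g8 = rng.normal(15, 3, 40)

res = tukey_hsd(g4, g6, g8)
# res.pvalue[i, j] holds the adjusted p-value for group i vs group j
print(res.pvalue.round(4))
```

<p>This skips the assumption checks and the ANOVA table, so it complements rather than replaces the full flow above.</p>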
<p><br></p>
<p>And plotting is even easier</p>
<div id="dc6233eb" class="cell" data-execution_count="3">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Numbers are clumsy for most. Making more interpretable plot on above results.</span></span>
<span id="cb3-2">plot_hsd(results.tukeyhsd.sort_values(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Diff'</span>), title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Tukey HSD results Anova of MPG ~ Cylinder"</span>)</span></code></pre></div></div>
</details>
</div>
<p>Results from the <code>plot_hsd</code>:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="TukeyHSD.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="Tukey’s HSD comparison based on Anova Results"><img src="https://jitinkapila.com/writing/engineering/02_hypothesis_test/TukeyHSD.png" class="img-fluid figure-img" alt="Tukey’s HSD comparison based on Anova Results"></a></p>
<figcaption><strong>Tukey’s HSD comparison based on Anova Results</strong></figcaption>
</figure>
</div>
<p>Plots look good with ‘p-values’.</p>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p><span data-bg="lightgreen">Since we applied the above to a <em><em>non-RCT</em></em> dataset, we cannot conclude that the difference in mpg based on cylinders is huge, especially as the number of cylinders goes up. This statement may not be as explicit as the plot makes it appear. Unless you have a strong belief that the data follows the rules and assumptions of RCTs, we should only be seeking <em>interesting</em>, as in <em>associated</em>, results and not <em>cause-effect</em> results.</span></p>
</section>
<section id="give-me-the-code" class="level2">
<h2 class="anchored" data-anchor-id="give-me-the-code">Give me “The Code”</h2>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-1-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-1" aria-controls="tabset-1-1" aria-selected="true" href="">Performing Anova</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-2" aria-controls="tabset-1-2" aria-selected="false" href="">Plotting Results</a></li></ul>
<div class="tab-content">
<div id="tabset-1-1" class="tab-pane active" aria-labelledby="tabset-1-1-tab">
<div id="2c5deadc" class="cell" data-execution_count="4">
<details class="code-fold">
<summary>Anova Test <code>anova_test.py</code></summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> bioinfokit <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> analys</span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb4-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> stats</span>
<span id="cb4-5"></span>
<span id="cb4-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> KeyResults:</span>
<span id="cb4-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    A basic class to hold all the results</span></span>
<span id="cb4-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb4-10">    </span>
<span id="cb4-11">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">__init__</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>,result_full):</span>
<span id="cb4-12">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.keys <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb4-13">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.result_full <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> result_full</span>
<span id="cb4-14">    </span>
<span id="cb4-15">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> add_result(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>,name,result):</span>
<span id="cb4-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tukeyhsd'</span>:</span>
<span id="cb4-17">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.keys.append(name)</span>
<span id="cb4-18">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">setattr</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, name, result)</span>
<span id="cb4-19">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">elif</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.result_full:</span>
<span id="cb4-20">            <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.keys.append(name)</span>
<span id="cb4-21">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">setattr</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, name, result)</span>
<span id="cb4-22"></span>
<span id="cb4-23"></span>
<span id="cb4-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Anova test code</span></span>
<span id="cb4-25"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> do_anova_test(df, res_var, xfac_var, anova_model,ss_typ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb4-26">                  effectsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'n2'</span>,result_full<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,add_res<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>):</span>
<span id="cb4-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb4-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Run the sequential ANOVA test workflow</span></span>
<span id="cb4-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb4-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Step 1) Levene's / Bartlett's test to check whether variance is homogeneous</span></span>
<span id="cb4-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Step 2) Main ANOVA/ANCOVA test</span></span>
<span id="cb4-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Step 3) Tukey's HSD for individual combinations</span></span>
<span id="cb4-33"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb4-34"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param df: Pandas DataFrame holding all the columns</span></span>
<span id="cb4-35"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param res_var: Variable for which we are checking ANOVA</span></span>
<span id="cb4-36"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param xfac_var: Grouping Variables for which we want to do the comparisons</span></span>
<span id="cb4-37"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param anova_model: statsmodels formula for the model. This is a life saver for making everything work</span></span>
<span id="cb4-38"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param result_full: Whether to also keep the results of intermediate steps</span></span>
<span id="cb4-39"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb4-40"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb4-41"></span>
<span id="cb4-42">    results <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> KeyResults(result_full)</span>
<span id="cb4-43">    </span>
<span id="cb4-44">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># initialize stat method</span></span>
<span id="cb4-45">    res <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> analys.stat()</span>
<span id="cb4-46">    </span>
<span id="cb4-47">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Levene's test</span></span>
<span id="cb4-48">    res.levene(df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df, res_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>res_var,xfac_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>xfac_var)</span>
<span id="cb4-49">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Leven</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s Test Result:'</span>)</span>
<span id="cb4-50">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(res.levene_summary)</span>
<span id="cb4-51">    results.add_result(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'levene'</span>,res.levene_summary)</span>
<span id="cb4-52"></span>
<span id="cb4-53">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># doing bartlett test</span></span>
<span id="cb4-54">    res.bartlett(df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df, res_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>res_var,xfac_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>xfac_var)</span>
<span id="cb4-55">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Bartlett</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\'</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">s Test Result:'</span>)</span>
<span id="cb4-56">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(res.bartlett_summary)</span>
<span id="cb4-57">    results.add_result(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'bartlett'</span>,res.bartlett_summary)</span>
<span id="cb4-58">    </span>
<span id="cb4-59">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># doing anova / ancova</span></span>
<span id="cb4-60">    res.anova_stat(df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df, res_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>res_var, anova_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>anova_model,ss_typ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ss_typ)</span>
<span id="cb4-61">    aov_res <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> res.anova_summary</span>
<span id="cb4-62">    </span>
<span id="cb4-63">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Add effect sizes</span></span>
<span id="cb4-64">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> effectsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"n2"</span>:</span>
<span id="cb4-65">        all_effsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (aov_res[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sum_sq'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> aov_res[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sum_sq'</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>()).to_numpy()</span>
<span id="cb4-66">        all_effsize[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nan</span>
<span id="cb4-67">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb4-68">        ss_resid <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> aov_res[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sum_sq'</span>].iloc[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]</span>
<span id="cb4-69">        all_effsize <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> aov_res[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sum_sq'</span>].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> ss_resid)).to_numpy()</span>
<span id="cb4-70">        all_effsize[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.nan</span>
<span id="cb4-71">    aov_res[effectsize] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> all_effsize</span>
<span id="cb4-72">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#aov_res['bw_'] = res.anova_model_out.params.iloc[-1]</span></span>
<span id="cb4-73">    aov_res <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> aov_res.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb4-74">    </span>
<span id="cb4-75">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># printing results</span></span>
<span id="cb4-76">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ANOVA</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">/</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">ANCOVA Test Result:'</span>)</span>
<span id="cb4-77">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(aov_res)</span>
<span id="cb4-78">    results.add_result(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'anova'</span>,res.anova_summary.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb4-79">    results.add_result(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'anova_model'</span>,res.anova_model_out)</span>
<span id="cb4-80">    </span>
<span id="cb4-81">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tukey's HSD to compare the groups</span></span>
<span id="cb4-82">    res.tukey_hsd(df<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df, res_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>res_var,xfac_var<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>xfac_var, anova_model<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>anova_model,ss_typ<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>ss_typ)</span>
<span id="cb4-83">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Tukey HSD Result:'</span>)</span>
<span id="cb4-84">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(res.tukey_summary.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb4-85">    results.add_result(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'tukeyhsd'</span>,res.tukey_summary.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>))</span>
<span id="cb4-86">    </span>
<span id="cb4-87">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add the full result object again if requested</span></span>
<span id="cb4-88">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> add_res:</span>
<span id="cb4-89">        results.add_result(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'allresult'</span>,res)</span>
<span id="cb4-90">    </span>
<span id="cb4-91">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> results</span></code></pre></div></div>
</details>
</div>
</div>
<div id="tabset-1-2" class="tab-pane" aria-labelledby="tabset-1-2-tab">
<div id="72816376" class="cell" data-execution_count="5">
<details class="code-fold">
<summary>Plotting results <code>plot_hsd.py</code></summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode numberSource python number-lines code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> seaborn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sns</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb5-3"></span>
<span id="cb5-4">plt.style.use(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'seaborn-v0_8-bright'</span>)</span>
<span id="cb5-5"></span>
<span id="cb5-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> plot_hsd(hsdres,p_cutoff<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>,title<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,ax<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>,figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)):</span>
<span id="cb5-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb5-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Plot the Tukey HSD results with significance annotations</span></span>
<span id="cb5-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb5-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">  </span></span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param hsdres: 'tukeyhsd' result from the do_anova_test function</span></span>
<span id="cb5-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param p_cutoff: Cutoff below which a combination is considered significant</span></span>
<span id="cb5-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param title: Title of the plot</span></span>
<span id="cb5-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param ax: Define or get the matplotlib axes</span></span>
<span id="cb5-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    :param figsize: Mention Figure size to draw</span></span>
<span id="cb5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    </span></span>
<span id="cb5-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb5-18"></span>
<span id="cb5-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb5-20">        fig,axp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>figsize)</span>
<span id="cb5-21">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb5-22">        axp <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ax</span>
<span id="cb5-23">    </span>
<span id="cb5-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># helper func</span></span>
<span id="cb5-25">    p_ind <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x : <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'+'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'*'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.01</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> (<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'**'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'***'</span>)))</span>
<span id="cb5-26">    label_gen  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: <span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"$</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> - </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> |</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">\</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> p:</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>]<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:0.2f}{</span>p_ind(x[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>])<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:5s}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">$"</span></span>
<span id="cb5-27">    </span>
<span id="cb5-28">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#setting values</span></span>
<span id="cb5-29">    mask <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hsdres[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p-value'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> p_cutoff</span>
<span id="cb5-30">    yticklabs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> hsdres[[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group1'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'group2'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'p-value'</span>]].<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">apply</span>(label_gen,axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).values</span>
<span id="cb5-31">    ys <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.arange(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(hsdres))</span>
<span id="cb5-32">    </span>
<span id="cb5-33">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># adding plot to axes</span></span>
<span id="cb5-34">    axp.errorbar(hsdres[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>mask][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Diff'</span>],ys[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>mask],xerr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(hsdres[<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>mask][[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Lower'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Upper"</span>]]).values.T,</span>
<span id="cb5-35">                fmt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'black'</span>, ecolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'lightgray'</span>, elinewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, capsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb5-36">    axp.errorbar(hsdres[mask][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Diff'</span>],ys[mask],xerr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(hsdres[mask][[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Lower'</span>,<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Upper"</span>]]).values.T,</span>
<span id="cb5-37">                fmt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'o'</span>, color<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'red'</span>, ecolor<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'pink'</span>, elinewidth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, capsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span>
<span id="cb5-38">    axp.axvline(x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,linestyle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'--'</span>,c<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'skyblue'</span>)</span>
<span id="cb5-39">    axp.set_yticks([])</span>
<span id="cb5-40">    (l,u) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> axp.get_xlim()</span>
<span id="cb5-41">    axp.set_xlim(l<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.5</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>l,u)</span>
<span id="cb5-42">    (l,u) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> axp.get_xlim()</span>
<span id="cb5-43">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> idx,labs <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">enumerate</span>(yticklabs):</span>
<span id="cb5-44">        axp.text(l<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>l,ys[idx],labs)</span>
<span id="cb5-45">    axp.set_yticklabels([])</span>
<span id="cb5-46">    </span>
<span id="cb5-47">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># finally doing what is needed</span></span>
<span id="cb5-48">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> ax <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb5-49">        plt.title(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">''</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> title <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">is</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> title,fontsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span>)</span>
<span id="cb5-50">        plt.show()</span>
<span id="cb5-51">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb5-52">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> axp</span></code></pre></div></div>
</details>
</div>
</div>
</div>
</div>
<p>Hope this gives you a kickstart to find interesting patterns. Happy Learning!</p>


</section>


 ]]></description>
  <category>technical</category>
  <guid>https://jitinkapila.com/writing/engineering/02_hypothesis_test/</guid>
  <pubDate>Mon, 09 Aug 2021 18:30:00 GMT</pubDate>
</item>
<item>
  <title>Adaptive Regression</title>
  <dc:creator>Jitin Kapila</dc:creator>
  <link>https://jitinkapila.com/writing/engineering/01_adaptive_regression/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/pexels-zlfdmr23-20692065-thumb.jpg" class="lightbox" data-gallery="quarto-lightbox-gallery-1" title="Adapting path through mountains! Photo by Zülfü Demir📸"><img src="https://jitinkapila.com/writing/engineering/01_adaptive_regression/img/pexels-zlfdmr23-20692065-thumb.jpg" class="w-100 img-fluid figure-img" alt="Adapting path through mountains! Photo by Zülfü Demir📸"></a></p>
<figcaption>Adapting path through mountains! <a href="https://www.pexels.com/photo/view-of-empty-railway-and-a-hill-in-distance-20692065/">Photo by Zülfü Demir📸</a></figcaption>
</figure>
</div>
<section id="introduction" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Here I will try to lay out our logic for finding such observations. Let’s dive in.</p>
<p>There are different value estimation techniques, such as regression analysis and time-series analysis. Most of us have experimented with regression using OLS, MLE, Ridge, LASSO, Robust, etc., and have evaluated the results using RMSE (Root Mean/Median Square Error), MAD (Mean/Median Absolute Deviation), MAE (Mean/Median Absolute Error), MAPE (Mean/Median Absolute Percentage Error), and so on.</p>
<p>But all of these give a single point estimate of what the overall error looks like. Here is a different thought: can we really trust this single value of MAPE or MAE? How easy is it to infer that our trained model has fitted well across the whole distribution of the dependent variable?</p>

<div class="no-row-height column-margin column-container"><div class="">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/Anscombe_Data.png" class="lightbox" data-gallery="quarto-lightbox-gallery-2" title="Plot of Anscombe’s Quartet"><img src="https://jitinkapila.com/writing/engineering/01_adaptive_regression/img/Anscombe_Data.png" class="img-fluid figure-img" alt="Plot of Anscombe’s Quartet"></a></p>
<figcaption>Plot of Anscombe’s Quartet</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/Anscombe_Stats.png" class="lightbox" data-gallery="quarto-lightbox-gallery-3" title="Some Descriptive Stats for Anscombe’s Quartet"><img src="https://jitinkapila.com/writing/engineering/01_adaptive_regression/img/Anscombe_Stats.png" class="img-fluid figure-img" alt="Some Descriptive Stats for Anscombe’s Quartet"></a></p>
<figcaption>Some Descriptive Stats for Anscombe’s Quartet</figcaption>
</figure>
</div>
</div></div><p>Let me give you a pretty small data-set to play with: “Anscombe’s quartet”, a famous data-set constructed by Francis Anscombe. Please refer to the plots below to understand the distributions of y1, y2, y3, and y4. Aren’t they different?</p>
<p>Would the measures of central tendency and dispersion be the same for this data? I am sure none of us would believe so, but to our utter surprise all the descriptive stats are nearly identical. Don’t believe me? Please see the results below (source: Wikipedia):</p>
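<p>If you want to check this yourself, here is a small sketch using the published quartet values (illustrative code, not part of the original analysis):</p>

```python
import numpy as np

# y-series of Anscombe's quartet (published values)
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

for name, y in [("y1", y1), ("y2", y2), ("y3", y3), ("y4", y4)]:
    # all four series agree to two decimals: mean ~ 7.50, sample variance ~ 4.12
    print(name, round(float(np.mean(y)), 2), round(float(np.var(y, ddof=1)), 2))
```

<p>All four series print essentially the same mean and sample variance, even though their plots look nothing alike.</p>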
</section>
<section id="so-what-we-do-now" class="level2">
<h2 class="anchored" data-anchor-id="so-what-we-do-now">So What Do We Do Now?</h2>
<p>Astonished? Don’t be. This is what has been hiding behind those numbers, and this is why we often cannot cross a certain performance level: short of changing some features or doing a lot of hyperparameter tuning, the results won’t vary much.</p>
<p>If you look at the average MAPE value in each decile, you will see an interesting pattern. Let me show you that pattern. One day, while working on a business problem that used regression, a discussion with Kumarjit led us to devise a different way of doing model diagnosis. We worked together to give it shape and build on it.</p>
<p><a href="img/Pre_Mape_Plot.png" class="lightbox" data-gallery="quarto-lightbox-gallery-4"><img src="https://jitinkapila.com/writing/engineering/01_adaptive_regression/img/Pre_Mape_Plot.png" class="img-fluid"></a></p>
<p>As you can see, it is absolutely evident that either end of the distribution of MAPE values is going wild, <strong><em>yet the overall MAPE still looks good (18%).</em></strong></p>
</section>
<section id="seeking-scope-of-improvement" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="seeking-scope-of-improvement">Seeking Scope of Improvement</h2>
<p>We worked together to build a framework that addresses such issues on the go and reduces the MAPE deterioration at the edges of the distribution.</p>
<p>This problem gives rise to a concept we named <strong>Distribution Assertive Regression (DAR).</strong></p>
<p>DAR is a framework that cancels the weakness of one-point summaries by using a classical concept from <strong><em>Reliability Engineering: the Bath Tub Curve.</em></strong></p>

<div class="no-row-height column-margin column-container"><div class="">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/Image_2.png" class="lightbox" data-gallery="quarto-lightbox-gallery-5" title="Plot for Classical Bath Tub Curve using a Hazard Function"><img src="https://jitinkapila.com/writing/engineering/01_adaptive_regression/img/Image_2.png" class="img-fluid figure-img" alt="Plot for Classical Bath Tub Curve using a Hazard Function"></a></p>
<figcaption>Plot for Classical Bath Tub Curve using a Hazard Function</figcaption>
</figure>
</div>
</div></div><p>The specialty of this curve is that it tells you in which areas failures are likely to occur at high rates. In our experiments, when we replace failure with the MAPE value and time with the sorted (ascending) values of the target / dependent variable, we observe the same phenomenon. This is likely because most regression techniques assume a Normal (Gaussian) distribution of the data and fit themselves towards the central tendency of that distribution.</p>
<p>Because of this tendency, any regression method tends to learn less about data that lies away from the central tendency of the target.</p>
<p>Let’s look at the BostonHousing data from the “mlbench” package in R.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="img/Plot_Bathtub.png" class="lightbox" data-gallery="quarto-lightbox-gallery-6" title="Plot for MAPE Bath Tub Curve for Decile Split “mdev” from Data"><img src="https://jitinkapila.com/writing/engineering/01_adaptive_regression/img/Plot_Bathtub.png" class="img-fluid figure-img" alt="Plot for MAPE Bath Tub Curve for Decile Split “mdev” from Data"></a></p>
<figcaption>Plot for MAPE Bath Tub Curve for Decile Split “mdev” from Data</figcaption>
</figure>
</div>
<p>Here the MAPE is calculated for each decile split of the ordered target variable. As you can observe, it follows the bath tub curve, which validates our hypothesis that the regression method is not able to learn much about the data at either end of the distribution.</p>
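<p>A minimal sketch of this decile diagnostic in Python (the data here is synthetic and the <code>decile_mape</code> helper is a hypothetical name, shown only to illustrate the computation):</p>

```python
import numpy as np
import pandas as pd

def decile_mape(y_true, y_pred):
    """Mean absolute percentage error within each decile of the sorted target."""
    df = pd.DataFrame({"y": y_true, "yhat": y_pred}).sort_values("y")
    df["decile"] = pd.qcut(df["y"], q=10, labels=False)
    ape = (df["y"] - df["yhat"]).abs() / df["y"].abs()
    return ape.groupby(df["decile"]).mean() * 100

# synthetic example: a single global linear fit on right-skewed data
rng = np.random.default_rng(0)
x = rng.lognormal(size=500)
y = 3 * x + 5 + rng.normal(scale=2, size=500)
coef = np.polyfit(x, y, 1)       # one OLS line for all the data
yhat = np.polyval(coef, x)
per_decile = decile_mape(y, yhat)
print(per_decile)                # error is far from uniform across deciles
```

<p>On real data, plotting <code>per_decile</code> against the decile index is what reveals the bath tub shape described above.</p>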
</section>
<section id="final-analysis" class="level2">
<h2 class="anchored" data-anchor-id="final-analysis">Final Analysis</h2>
<p>Now, the DAR framework essentially fixes this weakness of the regression method. It learns the behavior of the data in a way that is stable and can be tweaked for use in general practice.</p>
<p>Plot of the MAPE Bath Tub Curve after applying the DAR Framework, for the decile split of “mdev” from the data:</p>
<p><a href="img/Post_Mape_Plot.png" class="lightbox" data-gallery="quarto-lightbox-gallery-7"><img src="https://jitinkapila.com/writing/engineering/01_adaptive_regression/img/Post_Mape_Plot.png" class="img-fluid"></a></p>
<p>How did this framework, using the same regression method, reduce the MAPEs so much and make the model so much more stable? Well, here it is:</p>
<p>The DAR framework splits the data at either end of the ordered target variable and performs regression on each of these “splits” individually. This inherently reduces the so-called “noise” part of the data by treating each split as its own data-set.</p>
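<p>As a rough sketch of the splitting idea (illustrative code with arbitrary boundaries at the bottom and top deciles; <code>fit_segments</code> is a hypothetical name, not the paper’s exact implementation):</p>

```python
import numpy as np

def fit_segments(x, y, lo_q=0.1, hi_q=0.9):
    """Fit one simple linear model per segment of the ordered target."""
    lo, hi = np.quantile(y, [lo_q, hi_q])
    segments = {
        "low": y <= lo,               # left edge of the target distribution
        "mid": (y > lo) & (y < hi),   # bulk of the data
        "high": y >= hi,              # right edge
    }
    return {name: np.polyfit(x[m], y[m], 1)
            for name, m in segments.items() if m.sum() > 1}

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = 2 * x + rng.normal(scale=1 + x, size=300)   # noise grows with x
models = fit_segments(x, y)
print({name: np.round(c, 2) for name, c in models.items()})
```

<p>Each segment now gets its own fit, so the edges of the distribution no longer have to share parameters with the bulk of the data.</p>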
</section>
<section id="scoring-on-new-data" class="level2">
<h2 class="anchored" data-anchor-id="scoring-on-new-data">Scoring on New Data</h2>
<p>Now you might be thinking: this sounds good when fitting the regression, but how will one score new data? To answer that, we used our most simple yet very effective friend, KNN (though any multiclass classifier can be used here). Scoring involves a two-step method:</p>
<ol type="1">
<li>Score the new value against the KNN / multiclass classifier model to find the closest part of the data.</li>
<li>Predict it with the regression model fitted on that part of the data.</li>
</ol>
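<p>The two steps above can be sketched as follows — a toy illustration using a 1-nearest-neighbour router in place of a full KNN classifier (function names such as <code>train_dar</code> and <code>score_dar</code> are hypothetical):</p>

```python
import numpy as np

def train_dar(x, y, lo_q=0.1, hi_q=0.9):
    """Label each training point with its segment; fit one linear model per segment."""
    lo, hi = np.quantile(y, [lo_q, hi_q])
    labels = np.where(y <= lo, 0, np.where(y >= hi, 2, 1))
    models = {s: np.polyfit(x[labels == s], y[labels == s], 1)
              for s in np.unique(labels)}
    return labels, models

def score_dar(x_new, x_train, labels, models):
    """Step 1: route each new point to the segment of its nearest training point.
    Step 2: predict with that segment's regression model."""
    preds = np.empty(len(x_new))
    for i, xi in enumerate(x_new):
        seg = labels[np.argmin(np.abs(x_train - xi))]
        preds[i] = np.polyval(models[seg], xi)
    return preds

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(size=200)
labels, models = train_dar(x, y)
preds = score_dar(np.array([0.5, 5.0, 9.5]), x, labels, models)
print(preds)
```

<p>In practice a proper KNN (or any multiclass classifier) replaces the 1-NN lookup, but the routing-then-predicting structure stays the same.</p>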
<p>So now we know how to improve the predictive power of regression on such data.</p>
</section>
<section id="code-and-flowchart" class="level2">
<h2 class="anchored" data-anchor-id="code-and-flowchart">Code and Flowchart</h2>
<p>If things are simple, let’s keep them simple. Refer to the flowchart and code below for an implementation of this framework. <a href="https://arxiv.org/abs/1805.01618">Paper here!</a></p>
<div class="tabset-margin-container"></div><div class="panel-tabset">
<ul class="nav nav-tabs"><li class="nav-item"><a class="nav-link active" id="tabset-1-1-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-1" aria-controls="tabset-1-1" aria-selected="true" href="">R code</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-2-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-2" aria-controls="tabset-1-2" aria-selected="false" href="">Python code</a></li><li class="nav-item"><a class="nav-link" id="tabset-1-3-tab" data-bs-toggle="tab" data-bs-target="#tabset-1-3" aria-controls="tabset-1-3" aria-selected="false" href="">Here is the Flow Chart</a></li></ul>
<div class="tab-content">
<div id="tabset-1-1" class="tab-pane active" aria-labelledby="tabset-1-1-tab">
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Click to Expand
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<script src="https://gist.github.com/jkapila/ccc3d0f05fce86ea3075dc7190f8c181.js"></script>
</div>
</div>
</div>
</div>
<div id="tabset-1-2" class="tab-pane" aria-labelledby="tabset-1-2-tab">
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Tip</span>Click to Expand
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<script src="https://gist.github.com/jkapila/b97d881e2ae8b75141184ac0f7831601.js"></script>
</div>
</div>
</div>
</div>
<div id="tabset-1-3" class="tab-pane" aria-labelledby="tabset-1-3-tab">
<div class="cell" data-layout-align="center">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">graph TB
    
    subgraph Testing
        p1(Finding bucket of model to choose)
        p1 --&gt; p2([Making predictions &lt;br&gt; based on selected model for inference])
        p2 --&gt; p3(Consolidate final score of prediction)
    end

    subgraph Training
        md([Fitting a &lt;br&gt;Regression model])==&gt; di
        di{Binning Data via &lt;br/&gt; evaluating Distribution &lt;br/&gt; MAPE values }
        di --&gt; md2([Fitting a Bucketing model &lt;br/&gt; to Binned MAPE Buckets])
        md2 --&gt; md3([Fitting Regression &lt;br&gt; Models on Binned Data])
        md == Keeping main&lt;br/&gt;model ==&gt; ro        
        md3 ==&gt; ro(Final Models &lt;br&gt; Binning Data Models + &lt;br&gt; Set of Regression Models)
    end

    
    od([Data Input]) -- Training&lt;br&gt; Data--&gt; md
    od -- Testing&lt;br&gt; Data--&gt; p1
    ro -.-&gt; p1
    ro -.-&gt; p2

    classDef green fill:#9f6,stroke:#333,stroke-width:2px;
    classDef yellow fill:#ff6,stroke:#333,stroke-width:2px;
    classDef blue fill:#00f,stroke:#333,stroke-width:2px,color:#fff;
    classDef orange fill:#f96,stroke:#333,stroke-width:4px;
    class md,md2,md3 green
    class di orange
    class p1,p2 yellow
    class ro,p3 blue

</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>technical</category>
  <guid>https://jitinkapila.com/writing/engineering/01_adaptive_regression/</guid>
  <pubDate>Mon, 30 Apr 2018 18:30:00 GMT</pubDate>
</item>
</channel>
</rss>
