TrieKNN: Unleashing KNN’s Power on Mixed Data Types
Discover TrieKNN, a novel approach to extend the K-Nearest Neighbors algorithm to datasets with both categorical and numerical features. Learn how it works, see it in action, and explore its potential.
In this post, we’ll:
Dissect the limitations of traditional KNN when faced with mixed data types.
Introduce TrieKNN, a Trie-based approach that elegantly handles mixed data.
Walk through the implementation and training of a TrieKNN model.
Evaluate its performance and discuss its potential impact.
The Allure and Limitation of KNN
In the realm of machine learning, the K-Nearest Neighbors (KNN) algorithm stands out for its intuitive nature and ease of implementation. Its principle is simple: classify a data point based on the majority class among its ‘k’ nearest neighbors in the feature space. This non-parametric approach makes no assumptions about the underlying data distribution, rendering it versatile for various applications.
However, KNN’s Achilles’ heel lies in its reliance on distance metrics, which are inherently designed for numerical data. Real-world datasets often contain a mix of numerical and categorical features, posing a significant challenge for KNN. How do you measure the distance between ‘red’ and ‘blue,’ or ‘large’ and ‘small’?
Prior Art
Several strategies have been proposed to adapt KNN for mixed data:
One-Hot Encoding: Converts categorical features into numerical vectors, but can lead to high dimensionality.
Distance Functions for Mixed Data: Develop and apply custom distance metrics, such as HEOM (the Heterogeneous Euclidean-Overlap Metric), that handle numerical and categorical features together (a minimal sketch follows below).
Mean/mode imputation: Replace missing values with the feature’s mean (numerical) or mode (categorical).
These methods often involve compromises, either distorting the data’s inherent structure or adding computational overhead.
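To make the trade-off concrete, here is a minimal sketch of a HEOM-style distance, assuming rows are passed as tuples with known categorical and numerical positions. The function name, index lists, and feature ranges below are illustrative, not part of TrieKNN:
import numpy as np

def heom_distance(x, y, cat_idx, num_idx, num_ranges):
    # HEOM-style distance: 0/1 "overlap" on categorical features,
    # range-normalized difference on numerical features.
    d = 0.0
    for i in cat_idx:
        d += 0.0 if x[i] == y[i] else 1.0              # a category mismatch costs 1
    for i in num_idx:
        rng = num_ranges[i] if num_ranges[i] > 0 else 1.0
        d += ((float(x[i]) - float(y[i])) / rng) ** 2  # normalized squared difference
    return np.sqrt(d)

# Example with made-up features (color, size, height, weight):
a = ("red", "large", 172.0, 68.5)
b = ("blue", "large", 165.0, 70.0)
print(heom_distance(a, b, cat_idx=[0, 1], num_idx=[2, 3],
                    num_ranges={2: 60.0, 3: 80.0}))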
Enter TrieKNN: A Novel Approach
What if we could cleverly sidestep the distance calculation problem for categorical features, while still leveraging KNN’s power? TrieKNN offers just that: a way to apply KNN to datasets that mix categorical and numerical features.
TrieKNN combines the strengths of Trie data structures and KNN to handle mixed data types gracefully. Here’s the core idea:
Trie-Based Categorical Encoding: A Trie is used to store the categorical features of the data. Each node in the Trie represents a category.
Leaf-Node KNN Models: At the leaf nodes of the Trie, where specific combinations of categorical features are found, we fit individual KNN models using only the numerical features.
Weighted Prediction: To classify a new data point, we traverse the Trie based on its categorical features. At each level, a weight is computed from how much training data falls along that path, and the KNN model at the reached leaf produces a class-probability score.
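As a concrete illustration, a mixed-type row splits into a categorical "word" that selects a path through the Trie and a numerical vector that the leaf's KNN model consumes. The feature names and values below are made up:
# Hypothetical row: (color, size) are categorical, (height, weight) are numerical.
row = {"color": "red", "size": "large", "height": 172.0, "weight": 68.5}

cat_features = ["color", "size"]
num_features = ["height", "weight"]

trie_key = tuple(row[f] for f in cat_features)   # ("red", "large") -> path in the Trie
num_vector = [row[f] for f in num_features]      # [172.0, 68.5]   -> input to the leaf KNN

# Rows sharing the same trie_key land in the same leaf,
# where a KNN model is fit on their numerical vectors only.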
Why This Works
No Direct Distance Calculation for Categorical Features: The Trie structure implicitly captures the relationships between categorical values.
Localized KNN Models: By fitting KNN models at the leaf nodes, we ensure that distance calculations are performed only on relevant numerical features.
Scalability: The Trie structure efficiently handles a large number of categorical features and values.
Building a TrieKNN Model
Let’s dive into the implementation. We’ll start with the TrieNode and Trie classes, then move on to the KNN model and the training/prediction process.
Trie Implementation
Code
import numpy as np
from collections import Counter

class TrieNode:
    def __init__(self):
        self.children = {}            # Dictionary to store child nodes
        self.is_end_of_word = False   # True if the node is the end of a word
        self.count = 0                # Count of how many times a word has been inserted
        self.class_counts = {}        # Class counts
        self.class_weights = {}
        self.model = None             # Model at leaf nodes
        self.indexes = []             # Store data indexes belonging to this leaf
        self.labels = []              # Store labels belonging to this leaf
        self.node_weight = None

class Trie:
    def __init__(self):
        self.root = TrieNode()  # Root node of the Trie
        self.data_index = 0     # Initialize data index

    def insert(self, word_val, model):
        current_node = self.root
        word, val = word_val
        current_node.count += 1
        # Adding class counts at the root
        if val not in current_node.class_counts:
            current_node.class_counts[val] = 0
        current_node.class_counts[val] += 1
        for char in word:
            # If the character is not in children, add a new TrieNode
            if char not in current_node.children:
                current_node.children[char] = TrieNode()
            current_node = current_node.children[char]
            # Adding count of instances
            current_node.count += 1
            # Adding class counts
            if val not in current_node.class_counts:
                current_node.class_counts[val] = 0
            current_node.class_counts[val] += 1
        # Mark the end of the word and store the data for this leaf
        current_node.is_end_of_word = True
        current_node.indexes.append(self.data_index)  # Store the data index
        current_node.labels.append(val)
        current_node.model = model
        self.data_index += 1  # Increment data index

    def search(self, word):
        current_node = self.root
        for char in word:
            # If the character doesn't exist in the children, the word doesn't exist
            if char not in current_node.children:
                return False
            current_node = current_node.children[char]
        # Return True if it's the end of a word and the word exists
        return current_node.is_end_of_word

    def count_word(self, word):
        current_node = self.root
        for char in word:
            # If the character doesn't exist, return the counts seen so far
            if char not in current_node.children:
                return 0, current_node.class_counts
            current_node = current_node.children[char]
        # Return the count and class counts of the word
        return current_node.count, current_node.class_counts

    def display(self):
        # Recursively display the tree
        def _display(node, word):
            if node.is_end_of_word:
                print(f"Data: {word}, Count: {node.count}, Indexes: {len(node.indexes)}, "
                      f"Classes: {node.class_counts}, Weights: {len(node.class_weights)}")
            for char, child in node.children.items():
                _display(child, word + char)
        _display(self.root, "")

    def apply(self, func):
        """Applies a function to all models in the leaf nodes."""
        def _apply(node):
            if node.is_end_of_word and node.model is not None:
                func(node)
            for child in node.children.values():
                _apply(child)
        _apply(self.root)

    def apply_weight_to_indexes(self, weight):
        """Applies a weight to the indexes based on the percentage of data available."""
        def _apply_weight_to_indexes(node):
            if node.is_end_of_word:
                total_count = sum(self.root.children[child].count for child in self.root.children)
                percentage = node.count / total_count if total_count > 0 else 0
                node.class_weights = [(index, weight * percentage) for index in node.indexes]
            for child in node.children.values():
                _apply_weight_to_indexes(child)
        _apply_weight_to_indexes(self.root)
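Before wiring the Trie into the full pipeline, a quick smoke test on toy data helps confirm its behaviour. The categorical paths and labels below are arbitrary, and the model argument is left as None for now:
trie = Trie()

# Each entry is ((categorical path, class label)); models are attached later.
samples = [(("red", "large"), 0), (("red", "small"), 1), (("blue", "large"), 0)]
for word_val in samples:
    trie.insert(word_val, model=None)

print(trie.search(("red", "large")))   # True: this path was inserted
print(trie.count_word(("red",)))       # (2, {0: 1, 1: 1}): two rows share the 'red' prefix
trie.display()                         # one line per leaf (category combination)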
KNN Model
Code
class KNNModel:
    def __init__(self, k=5):
        self.data = None
        self.labels = []
        self.k = k

    def fit(self, data, indexes, labels):
        # Keep only the rows (numerical features) that belong to this leaf.
        self.data = data[indexes].astype(float)
        self.labels = np.array(labels).astype(float)

    def predict(self, data):
        dist_ind = np.sqrt(np.sum((self.data - data) ** 2, axis=1))  # Euclidean distance
        main_arr = np.column_stack((self.labels, dist_ind))  # labels with distance
        main = main_arr[main_arr[:, 1].argsort()]             # sort by distance
        count = Counter(main[0:self.k, 0])                    # count labels among the k nearest
        sums = np.array(list(count.values()))                 # get counts per class
        return sums / np.sum(sums)                            # prediction as class probabilities
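The KNN model can likewise be checked on its own with a tiny, made-up numerical dataset. Note that predict returns a probability distribution over the classes found among the k nearest neighbours, in the order Counter encounters them:
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 2.2]])
y = [0, 0, 1, 1, 0]

knn = KNNModel(k=3)
knn.fit(X, indexes=list(range(len(X))), labels=y)

# Probabilities over the classes of the 3 nearest neighbours of the query point.
print(knn.predict(np.array([1.1, 2.1])))  # e.g. [1.] if all 3 neighbours share class 0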
Training and Evaluation
Here’s how we train and evaluate the TrieKNN model:
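The original training script is not reproduced here, so what follows is only a minimal sketch of how the pieces could be wired together, using a synthetic dataset with two categorical and two numerical columns. All names, sizes, and the train/test split are illustrative:
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic mixed-type data: two categorical columns, two numerical columns, binary label.
colors = rng.choice(["red", "blue"], size=n)
sizes = rng.choice(["small", "large"], size=n)
numeric = rng.normal(size=(n, 2))
labels = ((numeric[:, 0] + (colors == "red")) > 0.5).astype(int)

split = 150  # simple train/test split
trie = Trie()

# Training: each row's categorical values form the "word"; the label and row index
# are stored at the leaf, and a KNNModel is attached to that leaf.
for i in range(split):
    trie.insert(((colors[i], sizes[i]), labels[i]), KNNModel(k=5))

# Fit every leaf's KNN model on the numerical features of the rows it collected.
trie.apply(lambda node: node.model.fit(numeric, node.indexes, node.labels))

def predict_row(trie, cats, nums):
    # Walk the Trie along the categorical values, then query the leaf's KNN model.
    node = trie.root
    for value in cats:
        if value not in node.children:
            return None  # unseen combination of categories
        node = node.children[value]
    return node.model.predict(nums) if node.model is not None else None

# Evaluation: probabilities are over the classes found among the k nearest neighbours.
for i in range(split, split + 5):
    probs = predict_row(trie, (colors[i], sizes[i]), numeric[i])
    print((colors[i], sizes[i]), "->", probs, "| true label:", labels[i])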
The exact predictions vary from run to run, but they demonstrate that KNN can be applied to mixed data types in this way.
Conclusion: A Promising Path Forward
TrieKNN presents a compelling solution for extending the applicability of KNN to mixed data types. By leveraging the Trie data structure, it avoids direct distance calculations on categorical features, enabling the use of localized KNN models for numerical data.
Further research could explore:
Optimizing the weighting scheme for combining predictions from different Trie levels.
Comparing TrieKNN’s performance against other mixed-data KNN approaches on benchmark datasets.
Extending TrieKNN to handle missing data and noisy categorical features.
TrieKNN opens up new possibilities for applying KNN in domains where mixed data types are prevalent, such as healthcare, e-commerce, and social science.