Encoding fixed-length, high-cardinality, non-numeric columns for an ML algorithm

Figure 1: A simple road network represented as a graph, where points of interest are nodes and the roads connecting them are edges (left), with the corresponding adjacency matrix (right)
from sklearn.cluster import KMeans
import numpy as np

# Adjacency matrix of the road network in Figure 1: row i, column j is 1
# if an edge connects node i to node j. ML algorithms like KMeans consume
# exactly this kind of numeric matrix.
X = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0]
])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
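After fitting, kmeans.labels_ holds the cluster id assigned to each node of the graph.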
Figure 2: Sample tabular data which is used as a running example in this article
Figure 3: One-hot encoding the browser column adds as many new columns as the column's cardinality
  1. Bucketing or Hashing
  2. Character Encodings
  3. Embeddings

Bucketing or Hashing

Figure 4: Hashing IPv4 addresses into a fixed number of buckets
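The article illustrates this only with the figure above, but the idea is simple to sketch in code. This is a minimal, hedged example; the hash function (MD5) and the bucket count are illustrative assumptions, not choices from the original article:

import hashlib

def bucket_ip(ip, num_buckets=1024):
    """Map an IP address string to one of num_buckets buckets.

    A stable hash (MD5 here) keeps the mapping deterministic across
    processes, unlike Python's built-in hash(), which is salted per run.
    """
    digest = hashlib.md5(ip.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

bucket_ip("172.217.21.36")  # always the same bucket id in [0, 1024)

The cardinality of the encoded column is now at most num_buckets, at the cost of collisions: unrelated IPs can land in the same bucket, which is the data skew risk mentioned in the conclusion.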

Character Encodings

Figure 5: Expanding an IPv4 address to a fixed length of 12 characters by zero-padding each of its sections. The cardinality of each character position is at most 10
def transform_ip(ip):
    """
    If IPv4: zero-pads each group to 3 characters, then left-pads the whole
    string with zeros to match the IPv6 length
    If IPv6: converts to lower case (assumes the full 39-character form)
    Returns the address as a list of characters
    """
    IPV6_LENGTH = 39
    IPV4_GROUP_LENGTH = 3  # each group in IPv4 is padded to this length
    if len(ip) < IPV6_LENGTH:
        # IPv4 address
        groups = ip.split(".")
        equalized = "".join(group.zfill(IPV4_GROUP_LENGTH) for group in groups)
        return list(equalized.zfill(IPV6_LENGTH))
    else:
        return list(ip.lower())
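For example, transform_ip("172.16.254.1") first equalizes the groups to "172016254001" (12 characters) and then left-pads with 27 more zeros, returning a list of 39 single characters ready to be split into per-position columns.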
import pandas as pd
from sklearn.preprocessing import CategoricalEncoder

def one_hot_ip(df):
    """
    Converts the ipAddress column of the pandas DataFrame df to one-hot
    Also returns the encoder used
    """
    enc = CategoricalEncoder()
    ip_df = df.ipAddress.apply(transform_ip).apply(pd.Series)  # one column per character of the IP
    X_ip = enc.fit_transform(ip_df).toarray()
    return X_ip, enc
At the time of writing, CategoricalEncoder was only available in the development version of scikit-learn, installed straight from the repository:

pip install git+https://github.com/scikit-learn/scikit-learn.git
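CategoricalEncoder never made it into a scikit-learn release; it was split into OneHotEncoder and OrdinalEncoder in version 0.20. A sketch of the same function against the released API (assuming the same transform_ip helper and DataFrame layout as above):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_ip_released(df):
    """Same as one_hot_ip, but with the released OneHotEncoder API."""
    # handle_unknown="ignore" encodes characters unseen during fit as all
    # zeros instead of raising an error at transform time
    enc = OneHotEncoder(handle_unknown="ignore")
    ip_df = df.ipAddress.apply(transform_ip).apply(pd.Series)
    X_ip = enc.fit_transform(ip_df).toarray()
    return X_ip, enc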

Embeddings

Enter Deep Neural Nets

Figure 6: Neural network architecture showing the training process of embeddings for IP addresses
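The training code for Figure 6 is not included in the article, but a minimal Keras sketch of the idea could look like the following. The bucket count, embedding dimension, network shape, and the binary target are all illustrative assumptions:

from tensorflow import keras
from tensorflow.keras import layers

NUM_BUCKETS = 1024   # e.g. hashed IP buckets from the earlier section
EMBEDDING_DIM = 16   # size of the learned latent representation

inputs = keras.Input(shape=(1,), dtype="int32")           # bucket id of an IP
x = layers.Embedding(NUM_BUCKETS, EMBEDDING_DIM)(inputs)  # lookup table, trained end to end
x = layers.Flatten()(x)
x = layers.Dense(32, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)        # some supervised label, e.g. is_bot

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(bucket_ids, labels, ...) on the supervised task

# After training, the embedding matrix itself is the encoding:
ip_vectors = model.layers[1].get_weights()[0]  # shape: (NUM_BUCKETS, EMBEDDING_DIM)

Once trained, each row of ip_vectors is a dense, compact representation of one IP bucket and can be fed to any downstream model.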

Conclusion

  • One-hot encoding, the simplest of all, works well when the cardinality is small, but produces very sparse matrices as the cardinality grows.
  • Bucketing solves this problem by reducing the cardinality, but may introduce unwanted data skews if the hashing scheme is not chosen carefully.
  • Character encoding solves the cardinality problem by exploiting two properties of the input strings: the low cardinality of each character position and the fixed length of the inputs. Some domain knowledge is needed to determine whether such an encoding is applicable.
  • Embeddings offload the responsibility of encoding entirely to the neural network. Higher-quality, compact latent features are obtained for free. However, this requires the encoding task to be modeled as a supervised learning problem, and it may require a lot of data, since the network has many more parameters to tune.
Figure 7: Comparison of character one-hot encoding with the embedding + hashing encoding schemes, showing when each technique works best
