Encoding fixed-length, high-cardinality non-numeric columns for an ML algorithm

Figure 1: A simple road network represented as a graph, where points of interest are nodes and the roads connecting them are edges (left), with its corresponding adjacency matrix (right)
from sklearn.cluster import KMeans
import numpy as np

# each row of X is a feature vector, e.g. a row of the adjacency matrix in Figure 1
X = np.array([
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0]
])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
Figure 2: Sample tabular data which is used as a running example in this article
Figure 3: One-hot encoded browser column results in additional columns equal to its cardinality
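The figure above can be reproduced in a few lines. This is a minimal sketch, assuming a hypothetical sample that resembles the running example's browser column; `pd.get_dummies` produces one column per distinct value, i.e. as many columns as the cardinality.

```python
import pandas as pd

# hypothetical sample resembling the running example's browser column
df = pd.DataFrame({"browser": ["chrome", "firefox", "chrome", "safari"]})

# one-hot encode: one new column per distinct browser value
one_hot = pd.get_dummies(df["browser"], prefix="browser")
print(one_hot)
```

With only three distinct browsers this is harmless; with millions of distinct IP addresses the same call would add millions of columns, which is exactly the problem the techniques below address.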
  1. Bucketing or Hashing
  2. Character Encoding
  3. Embeddings

Bucketing or Hashing

Figure 4: Hashing IPv4 addresses into a fixed number of buckets
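The hashing shown in the figure can be sketched as follows. The bucket count of 1024 is an assumption for illustration, as is the choice of MD5; any stable, well-distributed hash works, but Python's built-in `hash()` is salted per process and so is not reproducible across runs.

```python
import hashlib

NUM_BUCKETS = 1024  # assumed bucket count; a tunable knob

def ip_to_bucket(ip: str) -> int:
    """Hash an IP address string into one of NUM_BUCKETS buckets."""
    # MD5 is used only as a stable, well-distributed hash, not for security
    digest = hashlib.md5(ip.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

print(ip_to_bucket("192.168.0.1"))  # a bucket id in [0, 1024)
```

The bucket id then replaces the raw address as a single low-cardinality feature; the trade-off is that unrelated addresses can collide into the same bucket.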

Character Encodings

Figure 5: Expanding an IPv4 address to a fixed length of 12 characters by zero-padding each of its sections. Note that the cardinality of each character position is at most 10
def transform_ip(ip):
    """
    If IPv4, zero-pads each group and left-pads the whole address to IPv6 length.
    If IPv6, converts it to lower case.
    Returns the address as a list of characters.
    """
    IPV6_LENGTH = 39
    IPV4_GROUP_LENGTH = 3  # each group in an IPv4 address is padded to this length
    if len(ip) < IPV6_LENGTH:
        # IPv4 address
        groups = ip.split(".")
        equalized_groups = "".join(group.zfill(IPV4_GROUP_LENGTH) for group in groups)
        return list(equalized_groups.zfill(IPV6_LENGTH))
    else:
        return list(ip.lower())
import pandas as pd
from sklearn.preprocessing import CategoricalEncoder

def one_hot_ip(df):
    """
    Converts the ipAddress column of pandas DataFrame df to one-hot.
    Also returns the encoder used.
    """
    enc = CategoricalEncoder()
    # apply(pd.Series) creates a separate column for each character of the IP
    ip_df = df.ipAddress.apply(transform_ip).apply(pd.Series)
    X_ip = enc.fit_transform(ip_df).toarray()
    return X_ip, enc
pip install git+https://github.com/scikit-learn/scikit-learn.git
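CategoricalEncoder only ever lived in scikit-learn's development branch, hence the install from source above. In released scikit-learn (0.20 and later) the same behaviour is available as OneHotEncoder, so a sketch under that assumption, with two hypothetical 12-character expanded IPv4 addresses, looks like this:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# two hypothetical IPv4 addresses already expanded to 12 characters
# (one row per address, one column per character position)
chars = np.array([
    list("192168000001"),
    list("010000000255"),
])

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(chars).toarray()
print(X.shape)  # one column per distinct character seen at each position
```

Because each character position has at most 10 distinct values, the encoded width is bounded by 10 columns per position regardless of how many distinct addresses appear.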

Embeddings

Enter Deep Neural Nets

Figure 6: Neural network architecture showing the training process of embeddings for IP addresses
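At its core, the embedding layer in the figure is just a trainable lookup table. A minimal sketch, with the bucket count (1024) and embedding dimension (8) assumed for illustration: a deep-learning framework initialises the table randomly like this and then adjusts its rows via backpropagation.

```python
import numpy as np

# assumed sizes: IPs hashed into 1024 buckets, each mapped to an 8-d vector
NUM_BUCKETS, EMBEDDING_DIM = 1024, 8

rng = np.random.default_rng(0)
# the embedding matrix: one trainable row per bucket, randomly initialised
embeddings = rng.normal(size=(NUM_BUCKETS, EMBEDDING_DIM))

bucket_id = 93  # hypothetical bucket id of some hashed IP address
vector = embeddings[bucket_id]  # the dense feature vector fed to later layers
print(vector.shape)
```

After training, rows for IPs that behave similarly with respect to the supervised target end up close together, which is where the "higher quality, compact latent features" in the conclusion come from.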

Conclusion

  • One-hot encoding is the simplest of all and works well when the cardinality is small, but it produces very sparse matrices.
  • Bucketing solves this problem by reducing the cardinality, but may introduce unwanted data skew if the buckets are not chosen carefully.
  • Character encoding solves the cardinality problem by exploiting two properties of the input strings: the low cardinality of each character position and the fixed length of the inputs. Some domain knowledge is needed to verify that such an encoding is applicable.
  • Embeddings offload the responsibility of encoding entirely to the neural network, and higher-quality, more compact latent features are obtained for free. However, this requires modeling the encoding problem as a supervised one and may need much more data, as the network has many more parameters, or knobs, to tune.
Comparison of the character one-hot encoding and embedding + hashing encoding schemes, showing when each technique works best

Nitin Pasumarthy, Applied Deep Learning Engineer | LinkedIn