Python Quickie: Normalize Text Easily

Remove unwanted accents and other non-ASCII characters from any Unicode string by normalizing it. Learn how to normalize text in this quickie. Fast and easy.

Today I’ll show you how to normalize text (any Unicode string) with Python using the unicodedata module. You’ll need this if you want to remove accents and other special characters. For me it’s a pretty common thing, since I live in Switzerland. Some sites still can’t really handle Zürich because of the ü.

Normalize Text Cuz Encoding

Imagine you fill a dropdown on your website (to select the location) from a database containing Zürich. But your website doesn’t support ü. A good option would be to just store it in the database as Zurich, or to change the encoding of the website, since we don’t really need to stick with the old encodings in the first place… Now that I've declared this tutorial pointless by telling you two easy ways to solve the problem anyway, let’s start with it  😛 .

All you need to do is run every string (in this case the name of the location) through a normalization step before adding it to the dropdown. But first you’ll have to import the module, of course.

import unicodedata

The normalizing process should look a little bit like this (This is a hardcoded example):

data = u'Zürich'
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')

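The value of normal will be b'Zurich'. Note the b prefix: encode() returns a bytes object, so append .decode('ASCII') if you need a regular string back. You can test it here.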
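For the dropdown scenario, the same call can be wrapped in a small helper and applied to every location name. This is only a minimal sketch: the function name to_ascii and the sample list of locations are made up for illustration and stand in for whatever your database returns.

```python
import unicodedata


def to_ascii(text):
    """Strip accents from text and return a plain ASCII str (hypothetical helper)."""
    # Decompose accented characters into base letter + combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop everything that is not ASCII, then turn the bytes back into a str
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Hypothetical location names, standing in for rows from the database
locations = ['Zürich', 'Genève', 'Neuchâtel', 'Bern']
dropdown_options = [to_ascii(name) for name in locations]
print(dropdown_options)  # ['Zurich', 'Geneve', 'Neuchatel', 'Bern']
```

Decoding at the end means the dropdown gets ordinary strings instead of bytes objects.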

Normalize Text NFKD

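NFKD stands for Normalization Form Compatibility Decomposition and you probably didn’t even read the whole acronym. It splits every character that has a decomposition into a base character plus combining marks, so ü becomes a plain u followed by a combining diaeresis (U+0308). The .encode('ASCII', 'ignore') call then throws away everything that isn’t ASCII, which includes that combining mark  ➡ TL;DR: It makes magic happen to words.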
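If you want to see the magic step by step, here is a rough sketch that inspects what NFKD does to Zürich. The variable names are just for illustration. It also shows a caveat: a character with no decomposition, such as ß, passes through NFKD unchanged and is then silently dropped by 'ignore'.

```python
import unicodedata

original = 'Z\u00fcrich'  # 'Zürich' with a single precomposed ü
decomposed = unicodedata.normalize('NFKD', original)

# The ü has been split in two, so the string grew by one code point
print(len(original), len(decomposed))  # 6 7

# Look at the first three code points after decomposition
for ch in decomposed[:3]:
    print(hex(ord(ch)), unicodedata.name(ch))
# 0x5a LATIN CAPITAL LETTER Z
# 0x75 LATIN SMALL LETTER U
# 0x308 COMBINING DIAERESIS

# Caveat: ß has no decomposition, so 'ignore' removes it entirely
street = unicodedata.normalize('NFKD', 'Stra\u00dfe')
street_ascii = street.encode('ascii', 'ignore').decode('ascii')
print(street_ascii)  # Strae
```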

If you’re looking for a more detailed explanation or discussion, check out this Stack Overflow thread.