marko devcic

Software Engineer
  • github: deva666
  • email: madevcic {at} gmail.com
Zero Width Space

Posted on 12/29/2021

An invisible Unicode character that can waste a lot of your time



Look at these two strings, #co​ol and #cool. They are the same, right?
Well, not exactly. If we count the characters ourselves, we end up with 5 in both.
But if we let our computer count them, the first string has 6 characters and the second one has 5.

If we inspect the individual character codes, for example with JavaScript, which uses UTF-16 encoding, we get the array [35, 99, 111, 8203, 111, 108] from this code: let charArray = Array.from(s1).map(c => c.charCodeAt(0));
Or, the same in hex: ['0023', '0063', '006F', '200B', '006F', '006C']. Now, if we look up each of these Unicode characters, we will see that 200B is the intruder.
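Here is a small runnable version of that inspection. It is only a sketch: the helper names codePoints and toHex are made up for this example, and it uses codePointAt instead of charCodeAt so that characters outside the Basic Multilingual Plane would also be reported as a single number. For the strings in this post both methods give the same result.

```javascript
// '\u200B' is written as an escape so the invisible character is visible in the source.
const s1 = '#co\u200Bol';
const s2 = '#cool';

// Hypothetical helper: one numeric code point per character.
function codePoints(s) {
  return Array.from(s, c => c.codePointAt(0));
}

// Hypothetical helper: format code points as 4-digit uppercase hex strings.
function toHex(codes) {
  return codes.map(n => n.toString(16).toUpperCase().padStart(4, '0'));
}

console.log(s1.length, s2.length);   // 6 5
console.log(codePoints(s1));         // [35, 99, 111, 8203, 111, 108]
console.log(toHex(codePoints(s1)));  // ['0023', '0063', '006F', '200B', '006F', '006C']
```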

Enter the Zero Width Space. Wikipedia says it is a non-printing character used to mark word boundaries without displaying a visible space.

Great, but it can also ruin your day. Let's say you call your API for some data by a Primary Key which is a String, and you get something completely different from what you see directly in the database. You debug the API and everything seems to be OK. Then, after checking the database, you realize that you have two different rows with what seems to be the same Primary Key. How can this be? Have you found a bug in your DB? Before you open a ticket with your DB vendor, you want to double check, so you compare the hash codes of the two seemingly identical Primary Keys. They are not the same. After inspecting the characters one by one and wasting a couple of hours, you find out about the Zero Width Space.
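The character-by-character inspection can be automated. The sketch below assumes you have already pulled the two keys out of the database as strings; firstDifference and hex are hypothetical helper names invented for this example, not part of any library.

```javascript
// Hypothetical helper: format one character as U+XXXX, or null if it is missing.
function hex(ch) {
  if (ch === undefined) return null;
  return 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}

// Hypothetical helper: report the first position where two strings differ,
// or null when they are identical.
function firstDifference(a, b) {
  const ca = Array.from(a);
  const cb = Array.from(b);
  const len = Math.max(ca.length, cb.length);
  for (let i = 0; i < len; i++) {
    if (ca[i] !== cb[i]) {
      return { index: i, left: hex(ca[i]), right: hex(cb[i]) };
    }
  }
  return null;
}

// Two keys that look the same on screen.
const keyA = '#co\u200Bol';
const keyB = '#cool';

console.log(firstDifference(keyA, keyB));
// { index: 3, left: 'U+200B', right: 'U+006F' }
```

A result of null means the keys really are identical. Anything else tells you where to look and which code point is hiding there.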

Now, should you be removing all zero width space characters from your input text fields? Probably not. How can you be sure that a user didn't put one there intentionally?

And even if you want to remove it, how can you do it? If it is at the end of the string, do you think calling trim on it will remove it? Try this JavaScript code yourself: let trimmed = ('#cool' + '\u200B').trim() and then dump the character codes to an array.
The Zero Width Space is still at the end. I've tried to trim it in JavaScript, Java 8, Kotlin 1.5, Dart 2.12 and Python 3.10. None of them removed it.
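A quick way to see this in JavaScript is to compare string lengths after trimming. Note the parentheses around the concatenation: without them, trim() would be applied only to the second literal. The variable names are just for this example. As a side note, JavaScript's trim() does treat U+FEFF as whitespace, so that particular zero width character is removed while U+200B is not.

```javascript
// Ordinary whitespace at the end is removed by trim().
const withSpaces = ('#cool' + ' \t\n').trim();

// A trailing Zero Width Space (U+200B) survives trim().
const withZwsp = ('#cool' + '\u200B').trim();

// A trailing U+FEFF is removed, because JavaScript counts it as whitespace.
const withFeff = ('#cool' + '\uFEFF').trim();

console.log(withSpaces.length); // 5
console.log(withZwsp.length);   // 6
console.log(withFeff.length);   // 5
console.log(withZwsp.codePointAt(5).toString(16)); // 200b
```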

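But it can be easily done with Regular Expressions. This one matches both ordinary whitespace (spaces, tabs, line breaks) and the zero width characters: [\s\u200B-\u200D\uFEFF]. This one matches only the zero width characters (yes, there is more than one): [\u200B-\u200D\uFEFF]. That range covers the zero width space (200B), the zero width non-joiner (200C), the zero width joiner (200D) and the zero width no-break space (FEFF).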
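As a minimal sketch, the two character classes can be wrapped in small helpers. stripZeroWidth and trimAll are hypothetical names for this example. The first removes zero width characters anywhere in the string, the second only trims whitespace and zero width characters from both ends. Keep in mind that the joiner and non-joiner are meaningful in some scripts and in emoji sequences, so stripping them everywhere can change legitimate text.

```javascript
// Matches only the zero width characters, anywhere in the string.
const ZERO_WIDTH = /[\u200B-\u200D\uFEFF]/g;

// Matches runs of whitespace or zero width characters at the start or end.
const EDGES = /^[\s\u200B-\u200D\uFEFF]+|[\s\u200B-\u200D\uFEFF]+$/g;

// Hypothetical helper: remove every zero width character.
function stripZeroWidth(s) {
  return s.replace(ZERO_WIDTH, '');
}

// Hypothetical helper: a trim() that also removes zero width characters,
// but leaves the inside of the string untouched.
function trimAll(s) {
  return s.replace(EDGES, '');
}

console.log(stripZeroWidth('#co\u200Bol') === '#cool');    // true
console.log(trimAll(' \u200B#cool\u200B\t') === '#cool');  // true
console.log(trimAll('#co\u200Bol\u200B').length);          // 6, the inner one is kept
```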