Why do witches burn?

Because they’re made of wood, like bridges, and so float on water, like churches, very small rocks, gravy, and a duck.

(This page contains a few unusual characters, such as kanji, that may not display properly on your computer. Don’t worry, they are not essential to the text.)
(I also put this text in the French version because I was too lazy to translate it. You’ll manage, go for it!)

Unicode is the future (even the present), but please, not in identifiers…


You can use (almost) any Unicode letter in a JavaScript variable name…

I wanted to create a one-letter object acting as a big library of functionality, like jQuery’s « $ » object.
I happily created my object « ø », then wondered whether the programmers using my class would have this character on their keyboard, or whether they’d hate me for making them copy-paste this variable all the time…
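
To show that this really works, here is a minimal sketch. The object and its « upper » helper are made up for illustration and are not the actual library; the only point is that « ø » is accepted as a variable name.

```javascript
// A hypothetical one-letter library object, in the spirit of jQuery's $.
// ø (U+00F8) is a Unicode letter, so JavaScript accepts it as an identifier.
const ø = {
  // Illustrative helper only; not part of any real library.
  upper(text) {
    return text.toUpperCase();
  },
};

const shouted = ø.upper("hello");
console.log(shouted);
```

It runs fine, provided the file is saved as UTF-8 and whoever edits it can actually type the name.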

I love UTF-8, but the truth is that we can program everything we want with the small set of printable ASCII characters: 95 of them. That’s in fact more than enough. Why make things more complicated?
Advantages: cool, shorter variable names.
Disadvantages: variables you can’t understand or simply can’t type.
Clearly the drawbacks outweigh the advantages.
It’s already hard enough to understand what the vars « data » and « arr_list » contain; what are we going to do with « øû$ »?

The brevity advantage only really shows up in JavaScript, with its über-powerful library objects like « $ » (aka jQuery).
It’s used so much that it needs to be typed fast.
That’s why I wanted to create my object « ø », because « $ » and « $$ » are already taken. But what is available on your keyboard? It turns out « £ », « ø », « µ », « § » weren’t on my colleagues’ keyboards.
So Unicode in variable names isn’t even that useful.
Don’t use them.
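
If you want to enforce this in a project, a check along these lines could do. It is only a sketch: the function name is an assumption, and it ignores reserved words like « class » or « return ».

```javascript
// Hypothetical helper: true when a name uses only ASCII letters, digits,
// "_" and "$", and does not start with a digit.
// It does not check for reserved words.
function isAsciiIdentifier(name) {
  return /^[A-Za-z_$][A-Za-z0-9_$]*$/.test(name);
}

console.log(isAsciiIdentifier("$"));        // true
console.log(isAsciiIdentifier("arr_list")); // true
console.log(isAsciiIdentifier("ø"));        // false
```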


This came up when people were talking about email validation. The official definition of a valid email address takes several pages, and to be honest, it looks like anything with at least one « @ » is a valid email.
And that is utterly stupid.

Email addresses should be seen as simple identifiers: something easy to copy from a business card or even to remember, like a domain name.

In the real world, most emails are simple. If you get an account at most webmails, they will limit the characters you can use. The only person who ends up with a valid email like « i’m great »@my.domain.is.cool.moadsqkfj.com is a geek running his own server. I don’t care about playing with the limits of a technology; when I validate an email, I expect it to be a real-life email.

So, not the validation that is used by PHP:
Comments on php.net say that me@localhost and « this is a valid email@[]{}and it should be seen as such »@example.com are seen as valid by filter_var($email, FILTER_VALIDATE_EMAIL)

So, not the HTML5 validation by <input type="email" />, which is this regexp:

Here is what I use and will continue to:

Complex and fun names can still be used in the display-name part of a “To:” field. Example:
Sir Jöhñ Åbæ-Ølýk’n Jr. <john.abae-olykn@gmail.com>

The simplest reason? I don’t know how to type Å or 漢 on my keyboard. If I absolutely need to be able to type them to send an email, I won’t be able to send this email.
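
To make the idea of a « real-life email » concrete, here is a minimal sketch of a deliberately strict, ASCII-only check. It is not the pattern mentioned above, just an illustration of the principle: the names « SIMPLE_EMAIL » and « isSimpleEmail » and the exact rules (ASCII letters and digits, a few separators, at least one dot in the domain) are my assumptions.

```javascript
// Assumed rules, for illustration only:
// - local part: ASCII letters/digits, optionally separated by one of . _ + -
// - domain: dot-separated ASCII labels, ending with a TLD of 2+ letters
const SIMPLE_EMAIL =
  /^[a-z0-9]+(?:[._+-][a-z0-9]+)*@(?:[a-z0-9]+(?:-[a-z0-9]+)*\.)+[a-z]{2,}$/i;

// Hypothetical helper: true only for plain, easy-to-type addresses.
function isSimpleEmail(email) {
  return SIMPLE_EMAIL.test(email);
}

console.log(isSimpleEmail("john.abae-olykn@gmail.com")); // true
console.log(isSimpleEmail("me@localhost"));              // false
console.log(isSimpleEmail("j\u00F6hn@example.com"));     // false (contains ö)
```

A check like this will reject some technically valid addresses. That is the whole point of the trade-off argued here.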

Domain names

See previous paragraph.

I know that the registrars who come up with new TLDs every day just want to make more money.
I don’t want any Unicode domains.
And I’m sorry for people whose languages don’t use the Latin alphabet, but be honest: you’ve learnt English and use it on the web just like the rest of us. You don’t need www.ドラゴンボール.com

Last reason: identical-looking characters

Unicode is a load of fun and has many tricky characters. In text, they are very useful. In identifiers, they are horrible.

And we don’t even need Unicode for that. For example: « Iol ».
See the problem? It all depends on the font of your favorite browser. There is an uppercase « i » and a lowercase « L »: two identical letters in many fonts, for no good reason.
I once received a password displayed in a way that prevented me from copy-pasting it. It contained an uppercase « i ». I typed it 100 times with a lowercase « L ». It never worked, and I didn’t understand why.

This is annoying enough already. Now imagine it with the thousands of Unicode characters. See:
ー (12540), 一 (19968), — (8212), ― (8213)
− (8722), ‒ (8210), – (8211)
- (45), ‐ (8208), ‑ (8209)

These are Cyrillic letters: асеорхуАВСЕНІЈКМОРЅТХ. So А (1040) isn’t A (65).
This site exists: www.axe.com, this one does not: www.ахе.com
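
You can check those numbers yourself. The sketch below uses escape sequences so that nothing depends on your font; the helper name « nonAsciiCodePoints » is mine, made up for the example.

```javascript
// Latin capital A and Cyrillic capital A look alike but are different characters.
const latinA = "A";
const cyrillicA = "\u0410";

console.log(latinA.codePointAt(0));    // 65
console.log(cyrillicA.codePointAt(0)); // 1040
console.log(latinA === cyrillicA);     // false

// Hypothetical helper: list every non-ASCII character of a string
// together with its code point.
function nonAsciiCodePoints(text) {
  return Array.from(text)
    .filter((ch) => ch.codePointAt(0) > 127)
    .map((ch) => [ch, ch.codePointAt(0)]);
}

// The three middle letters are Cyrillic, written as escapes here.
const suspicious = nonAsciiCodePoints("www.\u0430\u0445\u0435.com");
console.log(suspicious.map((pair) => pair[1])); // [ 1072, 1093, 1077 ]
```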

Between the two As, are inserted my favorite characters: A​A (zero-width space) and A­A (soft hyphen). Yes, there is a character there.
You can see their usage if you resize your browser and see how this very long line with very long words cuts on the right: zero_width_space_A​A_zero_width_space soft_hyphen_A­A_soft_hyphen zero_width_space_A​A_zero_width_space soft_hyphen_A­A_soft_hyphen zero_width_space_A​A_zero_width_space soft_hyphen_A­A_soft_hyphen zero_width_space_A​A_zero_width_space soft_hyphen_A­A_soft_hyphen zero_width_space_A​A_zero_width_space soft_hyphen_A­A_soft_hyphen zero_width_space_A​A_zero_width_space soft_hyphen_A­A_soft_hyphen zero_width_space_A​A_zero_width_space soft_hyphen_A­A_soft_hyphen. It simply cuts the line at space, and between AA, either just cutting (zero width space) or adding a hyphen in the process (soft hyphen).
Can you imagine these characters in an email address or a unique identifier? You shouldn’t.
Next time you send an email or a piece of code by email or through Skype (or any program that handles Unicode), remember to randomly put these characters in the text for a good (sic) joke :)
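
If you are on the receiving end of that joke, a few lines are enough to find and remove the culprits. This is a sketch that only covers the two characters discussed here, the zero-width space (U+200B) and the soft hyphen (U+00AD); the function names are assumptions.

```javascript
// Hypothetical helper: true if the text hides a zero-width space or a soft hyphen.
function hasInvisible(text) {
  return /[\u200B\u00AD]/.test(text);
}

// Hypothetical helper: remove every zero-width space and soft hyphen.
function stripInvisible(text) {
  return text.replace(/[\u200B\u00AD]/g, "");
}

const tricky = "A\u200BA"; // looks like "AA" on screen
console.log(tricky.length);                 // 3
console.log(hasInvisible(tricky));          // true
console.log(stripInvisible(tricky));        // AA
console.log(stripInvisible(tricky).length); // 2
```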

I think it’s clear enough. Make simple emails and URLs. Don’t code with weird characters. Life will be better.

3 Responses to “Make simple Variables and Identifiers”

  1. Best method to validate an email: send something to the email address.
    If some action/answer: this was clearly a good email address. And an active one !

    Xavier Nicollet

  2. Best, but not simplest, for sure!
    It requires a lot of resources and skills, whereas a simple regexp can be shared among small developers.


  3. Guenhwyvar
