Thursday, January 15, 2015

Understanding character encodings - Why are they important?

The XML declaration at the top of an XML document contains encoding information like this:

<?xml version="1.0" encoding="UTF-8" ?>

Please note the latter part of the declaration above: encoding="UTF-8".

Encoding is not just about XML; it matters whenever you deal with textual data.

(I remember it happened because an end user was using a German address, and I had written some bad code that was supposed to truncate this address and completely messed it up.)

It was time to see what this was all about and why it really mattered.

This is how I would put it: without this information, it is impossible to send, receive and parse XML (or any other text) consistently and correctly every time, because there is no reliable way to convert the text to bytes in memory or on disk, or to convert those bytes back to the same text.
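
To make this concrete, here is a minimal Python sketch. The sample word is only my own illustration (it is not taken from any real message); it shows that the same text becomes different bytes depending on the encoding chosen, which is why the reader has to know which one the writer used.

```python
# Hypothetical sample text; any string with a non-ASCII character would do.
text = "Straße"

# The same six characters, written out under three different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")
utf16_bytes = text.encode("utf-16-le")

for name, data in [("UTF-8", utf8_bytes),
                   ("ISO-8859-1", latin1_bytes),
                   ("UTF-16LE", utf16_bytes)]:
    # The byte count and the byte values differ from one encoding to the next.
    print(name, len(data), data.hex(" "))

# Decoding with the encoding that was used for writing gives the text back.
round_trip = utf8_bytes.decode("utf-8")
```

The character count stays the same in all three cases; only the byte representation changes, and nothing in the raw bytes says which encoding produced them.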

This article:

http://www.joelonsoftware.com/articles/Unicode.html is a great quick tutorial on character sets, character encodings and their importance.

I recently encountered an issue in which an end system kept complaining about junk characters in our XML. The problem turned out to be character encodings: they were saving and reading UTF-8 encoded XML as Latin-1 (ISO-8859-1). This generally works, because for the ASCII range (the first 128 characters) both encodings produce exactly the same bytes. But alas, not for everything else, and wherever the two schemes differed they started seeing invalid characters.
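
A small Python sketch of this kind of mismatch follows. The sample string is an assumption of mine, not the actual data from that incident; it only illustrates what happens when UTF-8 bytes are decoded as ISO-8859-1.

```python
# Hypothetical sample; the real payload from the incident is not shown here.
original = "Müller Straße"

# The sender writes the text as UTF-8.
utf8_bytes = original.encode("utf-8")

# The receiver wrongly assumes ISO-8859-1 when reading the same bytes.
misread = utf8_bytes.decode("iso-8859-1")

# ASCII letters survive untouched, but each non-ASCII character
# turns into two unrelated characters.
print(repr(misread))

# If nothing else has altered the data, reversing the wrong step restores it.
repaired = misread.encode("iso-8859-1").decode("utf-8")
```

ISO-8859-1 assigns a character to every possible byte value, so the wrong decode never raises an error. That is exactly why this kind of problem tends to go unnoticed until non-ASCII data shows up.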

Here, I would like to mention another encoding (not a character encoding, though) that confuses novice users a lot: Base64. It serves a different purpose, that of representing binary data as a character string, and is generally used when transmitting binary data over a textual medium such as XML or email. The encoding splits the input into 6-bit blocks, and since 6 bits can take 64 different values, it needs an alphabet of 64 characters. These 64 characters are chosen so that they are printable and common to most character encodings.
It generally uses the characters a-z, A-Z, 0-9, + and /.
The = is used as a padding character.
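
As a quick illustration, here is a sketch using Python's standard base64 module. The four input bytes are arbitrary values I picked for the example; two of them (0x00 and 0xFF) would not be safe to paste into a text document as they are.

```python
import base64
import string

# Arbitrary binary data chosen for illustration.
raw = bytes([0x00, 0xFF, 0x10, 0x80])

# Encode to Base64; the result contains only ASCII characters.
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)

# Every output character comes from the 64-character alphabet or is '=' padding.
alphabet = string.ascii_letters + string.digits + "+/"
only_safe_chars = all(ch in alphabet or ch == "=" for ch in encoded)

# Decoding gives back exactly the original bytes.
decoded = base64.b64decode(encoded)
```

Every 3 input bytes become 4 output characters, so the encoded form is roughly a third larger than the original. Here the 4 input bytes produce 8 characters, the last two of which are padding.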

Added Note:
For the Oracle JMS Adapter, the property jca.message.encoding is applicable for both inbound and outbound messages.
Similarly, for the Oracle FTP/File Adapters, the property 'character set' or encoding can be used.

