The XML declaration at the top of an XML document contains encoding information, like this:
<?xml version="1.0" encoding="UTF-8" ?>
Please note the latter part of the declaration above: encoding="UTF-8".
Encoding is not just an XML concern; it matters whenever you deal with textual data.
(I remember first running into it because an end user had a German address, and I had written some bad code that was supposed to truncate this address and completely messed it up.)
It was time to see what this was all about and why it really mattered.
This is how I would put it: without this information, it is impossible to send, receive and parse
XML (or any other text) consistently and correctly at all times, because there is no consistent way to write that XML to, or read it from, memory or disk.
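To make this concrete, here is a minimal Python sketch. The sample word, the `<city>` element and the `parse_city` helper are my own inventions for illustration; they are not from any real system. It shows that the same text becomes different bytes under different encodings, and that a parser can recover the text only because the declaration tells it which encoding was used.

```python
import xml.etree.ElementTree as ET

text = "München"  # sample text with one non-ASCII character

# The same text turns into different bytes depending on the encoding.
utf8_bytes = text.encode("utf-8")         # 'ü' becomes two bytes: C3 BC
latin1_bytes = text.encode("iso-8859-1")  # 'ü' becomes one byte: FC


def parse_city(declared_encoding: str, payload: bytes) -> str:
    """Hypothetical helper: wrap the payload in a tiny XML document
    whose declaration names the encoding, then parse it."""
    header = f'<?xml version="1.0" encoding="{declared_encoding}"?>'.encode("ascii")
    return ET.fromstring(header + b"<city>" + payload + b"</city>").text


# With a matching declaration, both byte sequences decode to the same text.
city_from_utf8 = parse_city("UTF-8", utf8_bytes)
city_from_latin1 = parse_city("ISO-8859-1", latin1_bytes)

print(utf8_bytes, latin1_bytes)
print(city_from_utf8, city_from_latin1)
```

The bytes on the wire differ, yet both documents parse to the same string, because in each case the declaration matches the bytes that follow it.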
This article,
http://www.joelonsoftware.com/articles/Unicode.html, is a great quick tutorial on character sets, character encodings and their importance.
I recently encountered an issue in which an end system kept complaining about junk characters in our XML. The problem turned out to be character encodings: in our case, they were saving and reading a UTF-8 encoded XML document as Latin-1 (ISO-8859-1). This generally works, because both encodings represent the plain ASCII characters identically, but alas not all characters, and wherever the two schemes differ they started seeing invalid characters.
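The effect is easy to reproduce. The following is a small Python sketch of the mismatch, not the actual code from either system; the sample strings are made up for illustration.

```python
original = "München"
utf8_bytes = original.encode("utf-8")

# Reading UTF-8 bytes as ISO-8859-1 never fails, because every byte value
# is a valid Latin-1 character. It just silently produces the wrong text:
# the two bytes of 'ü' (C3 BC) turn into the two characters 'Ã' and '¼'.
misread = utf8_bytes.decode("iso-8859-1")

# Pure ASCII text is encoded identically in both schemes, which is why
# the mistake goes unnoticed most of the time.
ascii_misread = "Berlin".encode("utf-8").decode("iso-8859-1")

# If nothing else has touched the text, the damage can be undone by
# reversing the wrong step and decoding with the right encoding.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(repr(misread), repr(ascii_misread), repr(repaired))
```

Note that the misread string is one character longer than the original, which is also the kind of thing that breaks naive truncation code.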
Here, I would like to mention another encoding (not a character encoding, though) that confuses novice users a lot: Base64. It serves a different purpose, that of representing binary data as a character string, and is generally used when transmitting binary data over a textual medium such as XML or email. The encoding works on 6-bit blocks, and since 6 bits can take 64 different values, it needs 64 different characters; every 3 bytes of input become 4 characters of output. The 64 characters are chosen so that they are common to most encodings and are printable.
It generally uses the characters a-z, A-Z, 0-9, + and /.
The = character is used for padding.
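A short Python sketch using the standard `base64` module shows the alphabet and the padding in action. The input bytes are arbitrary values chosen for illustration.

```python
import base64
import string

# 3 input bytes map to exactly 4 output characters, so no padding is needed.
three_bytes = base64.b64encode(b"Man").decode("ascii")  # "TWFu"

# Shorter inputs are padded with '=' up to a multiple of 4 characters.
two_bytes = base64.b64encode(b"Ma").decode("ascii")     # "TWE="
one_byte = base64.b64encode(b"M").decode("ascii")       # "TQ=="

# Arbitrary binary data, including bytes that are not printable text.
raw = bytes([0x00, 0xFF, 0x10, 0x80])
encoded = base64.b64encode(raw).decode("ascii")         # "AP8QgA=="
decoded = base64.b64decode(encoded)                     # original bytes again

# Every output character comes from the 64-character alphabet plus '='.
alphabet = set(string.ascii_letters + string.digits + "+/=")
uses_only_alphabet = set(encoded) <= alphabet

print(three_bytes, two_bytes, one_byte, encoded, uses_only_alphabet)
```

Because the output contains only plain ASCII characters, it survives being stored or sent in practically any character encoding, which is exactly why it is used to carry binary data inside XML or email.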
Added Note:
For the Oracle JMS Adapter, the property
jca.message.encoding
applies to both inbound and outbound messages.
Similarly, for the Oracle FTP/File Adapters, the 'character set' (encoding) property can be used.