Post on 01-Jan-2017
Adobe® Marketing Cloud
Multi-Byte Characters
Contents
Multi-Byte Character Sets
Web Page Encodings and Character Sets
    ISO-8859-1 Encoding and Character Set
    CP1252 Windows-1252 Character Set
    UTF-8 Encoding Unicode Character Set
Analytics Report Suites - Standard ISO and Multi-byte Enabled
Using the charSet Property
Analytics Display Language
Character Codes 128-255 - ISO vs. UTF-8
Variable Lengths
Enabling Multi-Byte Support
    Supported Character Sets
Contact and Legal Information
Last updated 2/11/2015
Multi-Byte Character Sets
Analytics allows data to be captured and reported in multiple languages, which allows international sites to be easily tagged with Analytics code, and generate reports that reflect the site content as displayed to the user. A single report suite can be used to collect and report data in multiple languages.
Properly utilizing the internationalization capability of Analytics involves coordination of the report suite configuration, web page encoding and the Analytics property charSet.
For example, if the sites mysite.com (English), mysite.co.jp (Japanese), and mysite.co.kr (Korean) are all sending data to a single global report suite, Analytics can display the English, Japanese, and Korean data simultaneously in a single report.
In addition to collecting and displaying international data, the Analytics interface can be displayed in several languages, including English, German, Japanese, Chinese, and Korean.
Web Page Encodings and Character Sets
Web pages display textual data by converting numeric character codes to physical characters based on the page encoding, which defines the range of available characters that can be properly displayed on the page.
The page encoding is set with one of the following three methods.
• Using a <META> tag inside the <HEAD> tag of the page, for example, <META http-equiv="Content-Type" content="text/html;charset=ISO-8859-1">
• Within the HTTP header, for example, Content-Type: text/html; charset=ISO-8859-1
• By browser auto-detection. If methods one and two are not used, modern browsers attempt to detect the page encoding based on the content or simply use a default encoding based on user preferences.
For greater visibility of the page encoding, Adobe recommends using the first method whenever possible. The third method may be unreliable for international sites and should be avoided whenever possible.
For additional information on encodings and character sets, refer to http://www.w3.org/International/tutorials/tutorial-char-enc/.
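As a rough illustration of the first two methods, the sketch below pulls the encoding name out of a Content-Type value such as the ones shown above. The function name charsetFromContentType is hypothetical and is not part of the Analytics code; the parameter name is matched without regard to case.

```javascript
// Hypothetical helper (not part of the Analytics code): returns the encoding
// named in a Content-Type value, or null when no charset parameter is present.
function charsetFromContentType(contentType) {
  // Match ";charset=VALUE", ignoring case and optional spaces or quotes.
  const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType);
  return match ? match[1] : null;
}

console.log(charsetFromContentType("text/html; charset=ISO-8859-1"));
console.log(charsetFromContentType("text/html"));
```

When the result is null, neither explicit method supplied an encoding and the browser falls back to auto-detection.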
ISO-8859-1 Encoding and Character Set
The most commonly used encoding for Latin based languages (English, French, Spanish, etc.) is "ISO-8859-1," which is one of many standards that use single-byte encodings.
Each character is represented by one (and only one) byte of data. Therefore, single-byte encodings, including ISO-8859-1, are limited to 256 character codes, not all of which are displayable.
The following table contains the complete set of characters that are available within ISO-8859-1.
Character Number  Character    Character Description
0-31              N/A          non-displayed control codes
32                space        space
33                !            exclamation point
34                "            straight quote marks
35                #            hash mark/number sign
36                $            dollar sign
37                %            percent sign
38                &            ampersand
39                '            straight quote mark/apostrophe
40                (            left parenthesis
41                )            right parenthesis
42                *            asterisk
43                +            plus sign
44                ,            comma
45                -            hyphen
46                .            period
47                /            slash
48                0            zero
49                1            one
50                2            two
51                3            three
52                4            four
53                5            five
54                6            six
55                7            seven
56                8            eight
57                9            nine
58                :            colon
59                ;            semicolon
60                <            less than sign
61                =            equals sign
62                >            greater than sign
63                ?            question mark
64                @            commercial "at" sign
65                A            uppercase A
66                B            uppercase B
67                C            uppercase C
68                D            uppercase D
69                E            uppercase E
70                F            uppercase F
71                G            uppercase G
72                H            uppercase H
73                I            uppercase I
74                J            uppercase J
75                K            uppercase K
76                L            uppercase L
77                M            uppercase M
78                N            uppercase N
79                O            uppercase O
80                P            uppercase P
81                Q            uppercase Q
82                R            uppercase R
83                S            uppercase S
84                T            uppercase T
85                U            uppercase U
86                V            uppercase V
87                W            uppercase W
88                X            uppercase X
89                Y            uppercase Y
90                Z            uppercase Z
91                [            left square bracket
92                \            backslash
93                ]            right square bracket
94                ^            caret
95                _            underscore bar
96                `            grave accent
97                a            lowercase a
98                b            lowercase b
99                c            lowercase c
100               d            lowercase d
101               e            lowercase e
102               f            lowercase f
103               g            lowercase g
104               h            lowercase h
105               i            lowercase i
106               j            lowercase j
107               k            lowercase k
108               l            lowercase l
109               m            lowercase m
110               n            lowercase n
111               o            lowercase o
112               p            lowercase p
113               q            lowercase q
114               r            lowercase r
115               s            lowercase s
116               t            lowercase t
117               u            lowercase u
118               v            lowercase v
119               w            lowercase w
120               x            lowercase x
121               y            lowercase y
122               z            lowercase z
123               {            left curly brace
124               |            solid vertical bar/pipe
125               }            right curly brace
126               ~            tilde
127-159           N/A          unused
160               space        non-breaking space
161               ¡            inverted exclamation point
162               ¢            cents sign
163               £            pound sterling sign
164               ¤            general currency sign
165               ¥            yen sign
166               ¦            broken vertical bar
167               §            section
168               ¨            umlaut/dieresis
169               ©            copyright symbol
170               ª            feminine ordinal
171               «            left angle quote marks
172               ¬            not sign
173               soft hyphen  soft hyphen
174               ®            registered symbol
175               ¯            macron accent
176               °            degree sign
177               ±            plus or minus
178               ²            superscript 2
179               ³            superscript 3
180               ´            acute accent
181               µ            micro sign (Greek mu)
182               ¶            paragraph sign
183               ·            middle dot
184               ¸            cedilla
185               ¹            superscript 1
186               º            masculine ordinal
187               »            right angle quote marks
188               ¼            fraction one-fourth
189               ½            fraction one-half
190               ¾            fraction three-fourths
191               ¿            inverted question mark
192               À            uppercase A, grave accent
193               Á            uppercase A, acute accent
194               Â            uppercase A, circumflex accent
195               Ã            uppercase A, tilde
196               Ä            uppercase A, umlaut/dieresis
197               Å            uppercase A, ring
198               Æ            uppercase AE ligature, diphthong
199               Ç            uppercase C, cedilla
200               È            uppercase E, grave accent
201               É            uppercase E, acute accent
202               Ê            uppercase E, circumflex accent
203               Ë            uppercase E, umlaut/dieresis
204               Ì            uppercase I, grave accent
205               Í            uppercase I, acute accent
206               Î            uppercase I, circumflex accent
207               Ï            uppercase I, umlaut/dieresis
208               Ð            uppercase Eth, Icelandic
209               Ñ            uppercase N, tilde
210               Ò            uppercase O, grave accent
211               Ó            uppercase O, acute accent
212               Ô            uppercase O, circumflex accent
213               Õ            uppercase O, tilde
214               Ö            uppercase O, umlaut/dieresis
215               ×            multiplication sign
216               Ø            uppercase O, slash
217               Ù            uppercase U, grave accent
218               Ú            uppercase U, acute accent
219               Û            uppercase U, circumflex accent
220               Ü            uppercase U, umlaut/dieresis
221               Ý            uppercase Y, acute accent
222               Þ            uppercase Thorn, Icelandic
223               ß            small sharp s, German
224               à            lowercase a, grave accent
225               á            lowercase a, acute accent
226               â            lowercase a, circumflex accent
227               ã            lowercase a, tilde
228               ä            lowercase a, umlaut/dieresis
229               å            lowercase a, ring
230               æ            lowercase ae ligature, diphthong
231               ç            lowercase c, cedilla
232               è            lowercase e, grave accent
233               é            lowercase e, acute accent
234               ê            lowercase e, circumflex accent
235               ë            lowercase e, umlaut/dieresis
236               ì            lowercase i, grave accent
237               í            lowercase i, acute accent
238               î            lowercase i, circumflex accent
239               ï            lowercase i, umlaut/dieresis
240               ð            lowercase eth, Icelandic
241               ñ            lowercase n, tilde
242               ò            lowercase o, grave accent
243               ó            lowercase o, acute accent
244               ô            lowercase o, circumflex accent
245               õ            lowercase o, tilde
246               ö            lowercase o, umlaut/dieresis
247               ÷            division sign
248               ø            lowercase o, slash/null set
249               ù            lowercase u, grave accent
250               ú            lowercase u, acute accent
251               û            lowercase u, circumflex accent
252               ü            lowercase u, umlaut/dieresis
253               ý            lowercase y, acute accent
254               þ            small thorn, Icelandic
255               ÿ            lowercase y, umlaut/dieresis
CP1252 Windows-1252 Character Set
The CP1252 encoding and character set (otherwise known as the Windows-1252 or simply Windows character set) is a superset of ISO-8859-1.
The CP1252 character set was developed by Microsoft and is used primarily by Microsoft Windows systems. This encoding uses the 128-159 code range to display additional characters not included in the ISO-8859-1 character set.
Character Number  Character    Character Description
128               €            Euro currency symbol
129               N/A          unused
130               ‚            single low-9 quotation mark
131               ƒ            Latin letter f with hook
132               „            double low-9 quotation mark
133               …            horizontal ellipsis
134               †            dagger
135               ‡            double dagger
136               ˆ            modifier letter circumflex accent
137               ‰            per mille sign
138               Š            Latin letter S with caron
139               ‹            single left angle quotation mark
140               Œ            Latin ligature OE
141               N/A          unused
142               Ž            Latin letter Z with caron
143               N/A          unused
144               N/A          unused
145               ‘            left single quotation mark
146               ’            right single quotation mark
147               “            left double quotation mark
148               ”            right double quotation mark
149               •            bullet
150               –            endash
151               —            emdash
152               ˜            small tilde
153               ™            trademark sign
154               š            Latin letter s with caron
155               ›            single right angle quotation mark
156               œ            Latin ligature oe
157               N/A          unused
158               ž            Latin letter z with caron
159               Ÿ            Latin letter Y with dieresis
Note: Since this character set is not standardized across all platforms and browsers, these character codes are not valid HTML, though they will display properly on some systems and browsers. Use of these character codes will result in inconsistent display across browser versions and operating systems. To properly display these characters requires a more advanced character set and encoding, such as UTF-8 Encoding Unicode Character Set.
UTF-8 Encoding Unicode Character Set
UTF-8 encoding is quickly becoming the standard for displaying multilingual (as well as mathematical and scientific) data on the web. UTF-8 is based on the standardized (but evolving) Unicode character set.
Unicode is an advanced character set that, as of version 4.0, includes more than 70,000 characters from nearly all written languages. UTF-8 is one of the most common encoding methods used to convert Unicode character codes into a data byte sequence. Unlike single-byte encoding methods, each character can consist of one to four bytes of data in UTF-8.
For more information on Unicode and UTF-8, refer to the following web sites.
• http://www.unicode.org
• http://en.wikipedia.org/wiki/Unicode
• http://en.wikipedia.org/wiki/UTF-8
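The one-to-four-byte range described above can be observed with the standard TextEncoder API, which always produces UTF-8. This is a minimal sketch; the helper name utf8ByteLength and the sample characters are chosen here for illustration only.

```javascript
// TextEncoder is a standard API that always encodes to UTF-8.
const utf8Encoder = new TextEncoder();

// Illustrative helper: number of bytes the text occupies when stored as UTF-8.
function utf8ByteLength(text) {
  return utf8Encoder.encode(text).length;
}

// One sample character for each possible UTF-8 length.
const samples = [
  "A",          // Latin letter, character code 65
  "\u00E9",     // e with acute accent, character code 233
  "\u65E5",     // a CJK ideograph
  "\u{1F600}",  // an emoji outside the Basic Multilingual Plane
];

for (const ch of samples) {
  console.log(ch, utf8ByteLength(ch)); // byte counts: 1, 2, 3, 4
}
```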
Analytics Report Suites - Standard ISO and Multi-byte Enabled
Each Analytics report suite is configured to be either a standard (or ISO) report suite or a multi-byte (UTF-8/localized) report suite.
This setting determines what encoding is to be used to store and display Analytics data. A standard report suite uses ISO-8859-1 encoding while a multi-byte suite uses UTF-8 encoding. Any characters that are not in the ISO-8859-1 character set (including those in the CP1252 character set) will not display properly in a standard ISO report suite. Some of these non-supported characters might cause display problems such as line breaks, odd characters, or even truncation of the value passed to Analytics.
If the data you are passing to Analytics contains any characters not in the ISO-8859-1 character set, you should use a multi-byte report suite. Contact your Implementation Consultant or Adobe Client Care to make the change. A report suite can be changed from standard to multi-byte, and vice-versa. However, for data that has already been collected, characters above ISO 127 might not display properly after the change is made. The best practice is to determine the needed report suite type when the report suite is created.
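One way to judge which report suite type a site needs is to scan sample values for characters outside the 0-255 range. The sketch below is an assumption-laden illustration, not an Adobe utility: the name fitsIso88591 is hypothetical, and the check treats every code up to 255 as acceptable even though codes 127-159 are listed as unused.

```javascript
// Hypothetical check (not an Adobe utility): true when every character in the
// value has a character code of 255 or lower, the ISO-8859-1 range.
function fitsIso88591(value) {
  for (const ch of value) {            // iterates by Unicode code point
    if (ch.codePointAt(0) > 255) {
      return false;                    // e.g. the Euro sign or any CJK character
    }
  }
  return true;
}

console.log(fitsIso88591("Caf\u00E9"));          // accented Latin text
console.log(fitsIso88591("\u20AC5"));            // Euro sign, a CP1252 addition
console.log(fitsIso88591("\u65E5\u672C\u8A9E")); // Japanese text
```

A value that fails this check is a sign that a multi-byte report suite is needed.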
Using the charSet Property
The charSet property, which is normally set in the JavaScript file, is used by Analytics to convert incoming data into UTF-8 for storage and reporting by Analytics.
Note: The charSet property is required when sending data to a multi-byte report suite and should never be used with a standard report suite. Setting the charSet property with a standard ISO report suite can result in variable truncation or unexpected character conversion.
The value of the charSet property should match the web page encoding in the META tag or HTTP header, even though the syntax may differ slightly. Although the META tag may use an alias for the encoding, the value of charSet should use the preferred (or official) name of the encoding.
Some of the more common encodings with their preferred name and aliases are listed in the following table.
Preferred Name  Aliases
ISO-8859-1      ISO_8859-1, CP819, latin1
ISO-8859-2      ISO_8859-2, latin2
ISO-8859-5      ISO_8859-5, cyrillic
Big5            Big-5
Shift_JIS       SJIS
Because numerous encodings and aliases exist, contact your Implementation Consultant or Adobe Customer Care to confirm the proper value for charSet if it does not appear in the table above.
If a site has different web encodings on different pages, or a single JavaScript file is used for multiple sites, the charSet property can be set to a default value in the JavaScript file and then reset on specific pages as needed to override the default; for example, s.charSet="UTF-8" or s.charSet="SJIS".
Any non-blank value of the charSet property will cause data to be converted into UTF-8 for storage. Any characters in the 128-255 range will be converted to the proper UTF-8 two-byte sequence and stored. These characters will not display properly in a standard report suite. Therefore, the charSet property should never be used with a standard report suite.
Likewise, a blank value of the charSet property will bypass the data conversion process, and any characters in the range 128-255 will be stored as a single byte. These characters will not display properly in a multi-byte report suite since the single-byte codes for these characters are not valid UTF-8. Therefore, the charSet property should always be used with a multi-byte report suite. Additionally, the proper value should be used with respect to the web page encoding.
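The two rules above (leave charSet blank for a standard report suite, always set it for a multi-byte report suite) can be sketched as a small helper. Everything here other than the charSet property name is an assumption made for illustration: the object s is a plain stand-in for the tracking object in the JavaScript file, and charSetFor and its parameters are hypothetical.

```javascript
// Plain stand-in for the tracking object normally defined in the JavaScript
// file. In a real page that object already exists; this one is illustrative.
const s = {};

// Hypothetical helper: picks the charSet value from the report suite type
// and the encoding declared by the page.
function charSetFor(isMultiByteSuite, pageEncoding) {
  if (!isMultiByteSuite) {
    return ""; // standard report suite: charSet stays blank, no conversion
  }
  if (!pageEncoding) {
    throw new Error("A multi-byte report suite needs a charSet value");
  }
  return pageEncoding; // should match the page's META tag or HTTP header
}

// Default for the site, set once in the shared file.
s.charSet = charSetFor(true, "UTF-8");

// A page served in a different encoding would override the default, e.g.:
// s.charSet = charSetFor(true, "SJIS");
console.log(s.charSet);
```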
Analytics Display Language
The Analytics interface can be displayed in alternate languages using the Language menu in the interface.
Selecting any option other than English causes Analytics to display using UTF-8 encoding. Displaying a standard report suite using a setting other than English might cause some data to display improperly.
Character Codes 128-255 - ISO vs. UTF-8
Characters in the range 1-127 are represented by the same byte sequence (actually a single byte) in ISO-8859-1 and UTF-8. However, the characters in the range 128-255, which include all characters with diacritical (accent) marks, are represented by a single byte in ISO-8859-1 and two bytes in UTF-8.
The difference becomes apparent when changing the report suite type. For collected data, characters in the 128-255 range that display properly in a standard report suite will not display properly in a multi-byte report suite. Any of these characters that display properly in a multi-byte report suite will not display properly in a standard report suite. Determining the proper report suite type before collecting data is absolutely critical.
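The mismatch can be reproduced with a single character. The sketch below uses the standard TextEncoder and TextDecoder APIs and takes character code 233 (lowercase e, acute accent) as the sample; the variable names are illustrative only.

```javascript
const eAcute = "\u00E9"; // lowercase e, acute accent (character code 233)

// ISO-8859-1 stores the character code itself as a single byte.
const isoBytes = [eAcute.charCodeAt(0)];                        // [233]

// UTF-8 stores the same character as a two-byte sequence.
const utf8Bytes = Array.from(new TextEncoder().encode(eAcute)); // [195, 169]

// Reading the single ISO byte as UTF-8 fails: the decoder substitutes the
// replacement character U+FFFD because a lone 0xE9 is not valid UTF-8.
const isoReadAsUtf8 = new TextDecoder("utf-8").decode(new Uint8Array(isoBytes));

// Reading the two UTF-8 bytes as ISO-8859-1 yields two unrelated characters
// (codes 195 and 169) instead of the one intended.
const utf8ReadAsIso = String.fromCharCode(...utf8Bytes);

console.log(isoBytes, utf8Bytes);
console.log(isoReadAsUtf8 === "\uFFFD", utf8ReadAsIso.length);
```

This is the same effect a report shows when data collected for one report suite type is displayed by the other.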
Variable Lengths
For a standard report suite, all characters occupy a single byte by definition. When sending data to a standard report suite, all variable length limits expressed in bytes have the same length limit in characters.
For a multi-byte report suite, data is stored as UTF-8. Each character in UTF-8 encoding can occupy one to four bytes of data, which means an Analytics variable with a 100-byte limit may hold as few as 25 characters. Additionally, the limit on the number of characters is determined by the characters themselves. For example, with a 100-byte limit in UTF-8, you could have a page name consisting of 100 "A" characters. However, the character "À" would have a limit of only 50 characters since its character code (192) requires two bytes for storage.
Languages such as French and Spanish frequently make use of diacritical characters. Since each of these characters occupies two bytes of data when stored as UTF-8, variable length limits become an issue. With languages such as Japanese and Chinese, the issue is more pronounced since each variable can be limited to as few as 25 characters.
Compounding the issue, if you simply pass a longer value to Analytics, the string is truncated at the byte limit when the data is stored. Truncation can change the last character displayed, since the database may contain only part of that character's byte sequence. For web pages using UTF-8 encoding, you can use JavaScript to limit a variable to a set number of bytes before sending it to Analytics. However, this technique may not be possible with other encodings such as Big5 or Shift_JIS.
Each Analytics variable has a defined length limit expressed in bytes. For standard report suites, each character is represented by a single byte; therefore, a variable with a limit of 100 bytes also has a limit of 100 characters. However, multi-byte report suites store data as UTF-8, which expresses each character with one to four bytes of data. This effectively limits some variables to as few as 25 characters with languages such as Japanese and Chinese that commonly use between two and four bytes per character.
The character limit is directly related to the characters being used, which makes a fixed character limit difficult to determine in advance. For multi-byte report suites, the best practice is to limit Analytics variables to the specific number of bytes for the variable before passing data to Analytics.
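The byte-limit practice described above could be sketched as follows for pages that use UTF-8. This is one possible approach, not Adobe's implementation: the function name truncateToUtf8Bytes is hypothetical, and the 100-byte limit in the usage lines is taken from the example in this section rather than from any specific variable.

```javascript
const byteCounter = new TextEncoder(); // standard API, always UTF-8

// Hypothetical helper: returns the longest prefix of value whose UTF-8
// encoding fits within maxBytes, without splitting a character's bytes.
function truncateToUtf8Bytes(value, maxBytes) {
  let result = "";
  let usedBytes = 0;
  for (const ch of value) {                       // iterates by code point
    const size = byteCounter.encode(ch).length;   // 1 to 4 bytes
    if (usedBytes + size > maxBytes) {
      break;                                      // next character would not fit whole
    }
    result += ch;
    usedBytes += size;
  }
  return result;
}

// With a 100-byte limit: 100 "A" characters fit, but only 50 of character 192.
console.log(truncateToUtf8Bytes("A".repeat(120), 100).length);      // 100
console.log(truncateToUtf8Bytes("\u00C0".repeat(120), 100).length); // 50
```

Because the cut is made on a character boundary, the last character stored is always complete.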
Enabling Multi-Byte Support
Steps to enable multi-byte support.
1. The multi-byte pages must use a standard language encoding character set.
2. The Analytics report suite must be multi-byte enabled.
3. The Analytics code (charSet) must be set to the correct language identifier for a given language-encoded page.
The JS file must define the charSet variable. (All page views and traffic are assumed to be standard 7-bit ASCII unless otherwise specified.) Setting the charSet variable tells the Analytics engine which source encoding to convert into UTF-8. Some language identifiers used in META tags or JavaScript variables do not match up with the Analytics conversion filter. Supported Character Sets describes the character sets currently supported by Analytics.
Supported Character Sets
List of other single-byte and multi-byte encodings that are used on the web.
Some of the more common additional encodings include the following:
Character Set  3-Character Language Code  Language         2-Character Code  Country
Big5           chi                        HK Trad Chinese  hk                Hong Kong
Big5           chi                        TW Trad Chinese  tw                Taiwan
EUC-KR         kor                        Korean           kr                Korea
GB2312         chi                        Simp Chinese     cn                China
ISO-8859-1     eng                        English          aa                Africa
ISO-8859-1     fre                        French           aa                Africa
ISO-8859-1     spa                        LA Spanish       ar                Argentina
ISO-8859-1     eng                        English          au                Australia
ISO-8859-1     ger                        German           at                Austria
ISO-8859-1     dut                        Dutch            be                Belgium
ISO-8859-1     fre                        French           be                Belgium
ISO-8859-1     spa                        LA Spanish       bo                Bolivia
ISO-8859-1     por                        BR Portuguese    br                Brazil
ISO-8859-1     fre                        Canadian French  ca                Canada
ISO-8859-1     eng                        English          ca                Canada
ISO-8859-1     eng                        English          cb                Caribbean
ISO-8859-1     spa                        LA Spanish       ns                Central America
ISO-8859-1     spa                        LA Spanish       cl                Chile
ISO-8859-1     spa                        LA Spanish       co                Colombia
ISO-8859-1     dan                        Danish           dk                Denmark
ISO-8859-1     spa                        LA Spanish       ec                Ecuador
ISO-8859-1     fin                        Finnish          fi                Finland
ISO-8859-1     fre                        French           fr                France
ISO-8859-1     ger                        German           de                Germany
ISO-8859-1     eng                        English          hk                Hong Kong
ISO-8859-1     eng                        English          in                India
ISO-8859-1     eng                        English          id                Indonesia
ISO-8859-1     eng                        English          ie                Ireland
ISO-8859-1     ita                        Italian          it                Italy
ISO-8859-1     eng                        English          my                Malaysia
ISO-8859-1     spa                        LA Spanish       mx                Mexico
ISO-8859-1     eng                        English          me                Middle East
ISO-8859-1     dut                        Dutch            ni                Netherlands
ISO-8859-1     eng                        English          nz                New Zealand
ISO-8859-1     nor                        Norwegian        no                Norway
ISO-8859-1     spa                        LA Spanish       py                Paraguay
ISO-8859-1     spa                        LA Spanish       pe                Peru
ISO-8859-1     eng                        English          ph                Philippines
ISO-8859-1     por                        PT Portuguese    pt                Portugal
ISO-8859-1     spa                        LA Spanish       pr                Puerto Rico
ISO-8859-1     eng                        English          sg                Singapore
ISO-8859-1     eng                        English          za                South Africa
ISO-8859-1     spa                        Spanish          es                Spain
ISO-8859-1     swe                        Swedish          se                Sweden
ISO-8859-1     fre                        French           ch                Switzerland
ISO-8859-1     ger                        German           ch                Switzerland
ISO-8859-1     eng                        English          th                Thailand
ISO-8859-1     eng                        English          uk                United Kingdom
ISO-8859-1     eng                        English          us                United States
ISO-8859-1     spa                        LA Spanish       uy                Uruguay
ISO-8859-1     spa                        LA Spanish       ve                Venezuela
ISO-8859-1     eng                        English          vn                Vietnam
ISO-8859-10    est                        Estonian         ee                Estonia
ISO-8859-2     cro                        Croatian         hr                Croatia
ISO-8859-2     cze                        Czech            cz                Czech Republic
ISO-8859-2     hun                        Hungarian        hu                Hungary
ISO-8859-2     pol                        Polish           pl                Poland
ISO-8859-2     rom                        Romanian         ro                Romania
ISO-8859-2     slk                        Slovak           sk                Slovak Republic
ISO-8859-2     slv                        Slovenian        si                Slovenia
ISO-8859-4     lit                        Lithuanian       lt                Lithuania
ISO-8859-5     bul                        Bulgarian        bg                Bulgaria
Windows-1257   ukr                        Russian          ua                Ukraine
Windows-1257   rus                        Russian          ru                Russian Federation
Windows-1257   gre                        Greek            gr                Greece
Windows-1257   tur                        Turkish          tr                Turkey
Windows-1257   heb                        Hebrew           il                Israel
Windows-1257   lat                        Latvian          lv                Latvia
SJIS           jpn                        Japanese         jp                Japan
Contact and Legal Information
Information to help you contact Adobe and to understand the legal issues concerning your use of this product and documentation.
Help & Technical Support
The Adobe Marketing Cloud Customer Care team is here to assist you and provides a number of mechanisms by which they can be engaged:
• Check the Marketing Cloud help pages for advice, tips, and FAQs
• Ask us a quick question on Twitter @AdobeMktgCare
• Log an incident in our customer portal
• Contact the Customer Care team directly
• Check availability and status of Marketing Cloud Solutions
Service, Capability & Billing
Dependent on your solution configuration, some options described in this documentation might not be available to you. As each account is unique, please refer to your contract for pricing, due dates, terms, and conditions. If you would like to add to or otherwise change your service level, or if you have questions regarding your current service, please contact your Account Manager.
Feedback
We welcome any suggestions or feedback regarding this solution. Enhancement ideas and suggestions for Adobe Analytics can be added to our Customer Idea Exchange.
Legal
© 2015 Adobe Systems Incorporated. All Rights Reserved.
Published by Adobe Systems Incorporated.
Terms of Use | Privacy Center
Adobe and the Adobe logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States and/or other countries.
All third-party trademarks are the property of their respective owners.