Post on 02-Apr-2015
description
®
Unicode Support in Oracle9i Database
®
Topics
• Customer New Requirement
• Existing Unicode Support in 8i
• New Unicode Features in 9i
• Character Semantics Support in 9i
• Unicode reliable data type as NCHAR in 9i
• VARCHAR2 vs. NVARCHAR2 for Unicode
• UTF-8 or UTF-16 for NCHAR
• Unicode Access Interface
• Unicode Migration and Compatibility
• Conclusion
®
Requirements
• Consistent character length semantics– Column definition close to visual length
– SQL standard recommends varchar2(10) to hold 10 characters
• Reliable Unicode data type– To be independent of database character set
– To enable gradually migration to Unicode
– To enable third vendor/component for Unicode
• Easy programming for other environments– Java, XML, Window NT
• High information density for performance– Storage efficiency for Asian or European data
®
Existing Unicode Support in 8i
• UTF-8 as database character set– UTF8 for ASCII based platform
– UTFE for EBCDIC based platform
• UTF-8 as client character set
• UTF-16 for OCI bind/define buffer
• UCHAR/UVARCHAR as UTF-16 in ProC
• UTF-16 for ODBC and OLEDB
• UTF-16 for JDBC
• Unicode binary sort
®
New Unicode features in Oracle9i
• Character semantics support for text column
• Reliable Unicode datatype as NCHAR
• UTF-16 support for Oracle Call Interface(OCI)
• Complete Unicode support for ODBC/OLEDB /JDBC
• Unicode and ISO14651 based multilingual sort
• Unicode enabled Oracle utilities such as SQL*Loader
• Unicode based locale builder for locale customization
®
Length Semantics Support in Oracle9i
• A new semantics– CHAR ( size [BYTE | CHAR] )
– VARCHAR2 ( size [BYTE | CHAR] )
• It meets Ansi SQL standard– the size is defined in character in the standard,
but most vender implemented in byte
• It fulfills Customer’s requirement– Portable database schema
– Character set independent
– Same data size across server, client, and third middle tier
– Easy migration to Unicode support
®
Character Semantics Support in 9i - Cont.
• Character semantics column– Explicit with quantifier: varchar2(30 char)
– Implicit with NLS_LENGTH_SEMANTICS setting to ‘char’ for varchar2(30)
• Same semantics support for PL/SQL variable
• Character length constraint checking
• SQL functions for different flavor of length semantics in Unicode
– like/like2/like4/likec
– lengthb/length/length2/length4/lengthc
– substrb/substr/substr2/substr4/substrc
– compose/decompose
®
Character Semantics Support in 9i - Cont.
• UTF-16 semantics – UTF8 encodes surrogate by a pair of three bytes
– It has the same semantics as UTF-16 and has the match between varchar2(10 char) and wchar(10)
– It has the same binary sorting order as UTF-16
• UTF-32 semantics– AL32UTF8 follows UTF-8 standard by encoding
surrogate in 4 bytes
– It has the same semantics as UTF-32 in coding point and the same binary order
• Conversion between UTF8 and AL32UTF8– AL32UTF8 can be used at client for the UTF-8
compliance
®
Reliable Unicode Data Type Support
• NCHAR, NVARCHAR2, NCLOB - 8i NCHAR: any fixed width character set
- Defined in SQL standard
• Unicode Character set encoding - UTF-8, UTF-16
- Independent on DB character set
• Character Length Semantics Only - Avoid migration issues in the future
• Support Unicode in non-Unicode database
®
Inter-operability With Other Data Types• Explicit Conversion Functions
- TO_NCHAR()
- TO_CHAR()
- ROWIDTONCHAR()
- CHARTOROWID()
- TO_CLOB()
- TO_NCLOB()
- TO_NUMBER()
- TO_DATE()
- TO_TIMESTAMP()
- TO_TIMESTAMP_TZ()
- TO_YMINTERVAL()
…...
®
Inter-operability - cont.
• Implicit Conversion
- Between NCHAR and CHAR types
- Between NCHAR and NUMBER, DATE, ROWID, RAW,
CLOBs etc.
• Conversion Direction: - Insert/select into/update/assignment operations:
convert to target
- Comparison, concatenation: SQL CHAR to SQL NCHAR avoid any data loss
- SQL function: convert to first string parameter
• Makes migration to SQL NCHAR much easier
®
Data Loss Exception Handling
• NLS Parameter:
- NLS_NCHAR_CONV_EXCP
- Dynamically changed in each session
- Effective for both explicit and implicit conversions
• Smoothness of operation vs. accuracy of operation
®
SQL Unicode String Processing
• Same level of support as CHAR
- Can use NCHAR same way as CHAR.
• SQL functions support for NCHAR
- SUBSTR, LENGTH, INSTR, LIKE, CONCAT,
LPAD/RPAD, LTRIM, RTRIM, NLS_SORT,
NLS_UPPER, NLS_LOWER etc.
- UNISTR, ASCIISTR
• Mixed type arguments
- CONCAT(nchar,char) - result type is based on first string parameter
• Easy programming
®
Unicode Database vs. Unicode Data Type
• Codepoint semantics for UTF8 will make Oracle database a virtual UTF-16 database
– There is no need to use NVARCHAR2 unless it is for the storage compression for Asian data
– The migration effort is minimum as there is no need to convert VARCHAR2 into NVARCHAR2
– It is recommended to use one-step migration for a new system
• NCHAR/NVARCHAR2– It allows incremental migration to Unicode
– It is always a Unicode column
– It can use UTF-16 encoding natively
®
NCHAR Choice between UTF8 and AL16UTF16
• UTF-8 - ASCII compatible
- Internet friendly: HTML, XML etc.
- More space efficient for western languages
• UTF-16 - More space efficient for Asian languages
- Faster in string processing
- Supported by JAVA, WINDOWS etc.
®
Programming Interfaces
• OCI Unicode Support - Support UTF-16 bind/define buffers
- Unicode meta data, SQL_TEXT, error
messages through mode parameter
- Unicode interface support independent on
server or client character set
- Character length semantics
• PL/SQL
• Pro*C/C++: Unicode support through UCHAR, UVARCHAR
• JDBC
• ODBC/OLEDB
®
Migration, Conversion and Compatibility
• Old NCHAR to 9i NCHAR migration
• Migration to Unicode Columns
ALTER TABLE tname MODIFY col (NCHAR(n))
• Convert whole database to Unicode database
®
Migration, Conversion and Compatibility
• Character length semantics - Database schema
ALTER TABLE tname MODIFY col (CHAR(n CHAR))
- Modify application to be in sync with database length semantics
Example: PL/SQL migration
1. Set NLS_LENGTH_SEMANTICS to CHAR
2. Apply %ROWTYPE, or explicit CHAR quantifier
3. Change substrb, lengthb and instrb to substr, length and
instr
®
Summary
• A flexible and complete Unicode support– Character semantics on UTF-8 or Unicode data
type
– All major access interfaces support Unicode
• High performance by high information density– UTF-8 for Western scripts
– UTF-16 for Asian scripts
• Easy programming– Same length semantics between database and
other components
• Easy migration– One step migration or gradual migration