21st International Unicode Conference, Dublin, Ireland, May 2002

Optimizing the Usage of Normalization

Vladimir Weinstein

vweinste@us.ibm.com

Globalization Center of Competency, San Jose, CA

Introduction

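1. The Unicode standard has multiple ways to encode equivalent strings

Example: "résumé" can be written with precomposed é (U+00E9), with e followed by a combining acute accent (U+0301), or with a mix of both. NFD: every é is fully decomposed. NFC: every é is fully composed.

2. Accents that don’t interact are put into a unique order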
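
The two points above can be checked with a small sketch. It uses Python's standard `unicodedata` module rather than ICU, and the variable names are chosen only for illustration.

```python
import unicodedata

# Three encodings of "résumé": precomposed, decomposed, and mixed.
precomposed = "r\u00e9sum\u00e9"
decomposed = "re\u0301sume\u0301"
mixed = "re\u0301sum\u00e9"
forms = (precomposed, decomposed, mixed)

# Each normalization form maps all three encodings to a single string.
nfc_results = {unicodedata.normalize("NFC", s) for s in forms}
nfd_results = {unicodedata.normalize("NFD", s) for s in forms}

# Dot above (combining class 230) and dot below (class 220) do not interact,
# so normalization puts them into one fixed order: lower class first.
above_then_below = "a\u0307\u0323"
below_then_above = "a\u0323\u0307"
reordered = unicodedata.normalize("NFD", above_then_below)
```

Both result sets contain exactly one string, and `reordered` equals `below_then_above`.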

Introduction (contd.)

• Normalization provides a way to transform a string to a unique form (NFD, NFC)

• Strings that can be transformed to the same form are called canonically equivalent

• Time-critical applications need to minimize the number of passes over the text

• ICU gives a number of tools to deal with this problem

• We will use collation (language-sensitive string comparison) as an example

Avoiding Normalization

• Force users to provide already normalized data

• The performance problem does not go away

• When the strings are processed many times, it could be beneficial to normalize them beforehand

• Forcing users to provide a specific form can be unpopular

Check for Normalized Text

• Most strings are already in normalized form

• Quick Check is significantly faster than full normalization

• Needs canonical class data and additional data for checking the relation between a code point and a normalization form

• Algorithm in UAX #15 Annex 8 (http://www.unicode.org/unicode/reports/tr15/#Annex8)
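
A minimal sketch of a quick check for NFD, following the shape of the UAX #15 algorithm. It assumes Python's standard `unicodedata` module instead of ICU data tables, and it derives the "not allowed in NFD" property from the canonical decomposition data rather than from a precomputed quick-check table. The function name `quick_check_nfd` is hypothetical.

```python
import unicodedata

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables

def quick_check_nfd(text):
    """Return True if text is already in NFD, without normalizing it."""
    last_class = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # Combining marks must appear in non-decreasing class order.
        if ccc != 0 and ccc < last_class:
            return False
        # A character with a canonical decomposition cannot occur in NFD.
        # Compatibility decompositions are tagged with "<...>" and are ignored.
        decomposition = unicodedata.decomposition(ch)
        if decomposition and not decomposition.startswith("<"):
            return False
        # Hangul syllables decompose algorithmically and are not listed above.
        if HANGUL_FIRST <= ord(ch) <= HANGUL_LAST:
            return False
        last_class = ccc
    return True
```

The check makes one pass, allocates nothing, and stops at the first offending character, which is why it is cheaper than normalizing. For NFC the real algorithm can also answer "maybe", which this NFD-only sketch never needs.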

Normalize Incrementally

• Instead of normalizing the whole string at once, normalize one piece at a time

• This technique is usually combined with an incremental Quick Check

• Useful for procedures with early exit, such as string comparison or scanning

• Normalizes up to the next safe point

Incremental Normalization: Example

Figure: the initial string "résumé", in which only some accents are decomposed, processed in two ways.

• Non-incremental normalization: if normalized regularly, the whole string is processed by normalization

• Quick check plus incremental normalization: normalize just the parts that fail quick check
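
The idea can be sketched as a generator that normalizes one segment at a time, so that a caller with an early exit never touches the rest of the string. This is an illustration built on Python's `unicodedata` (with `is_normalized`, available from Python 3.8), not ICU's implementation. The helper names and the definition of a safe point used here (a character whose decomposition begins with a class-0 character) are assumptions made for the sketch.

```python
import unicodedata
from itertools import chain, zip_longest

def starts_segment(ch):
    """A segment may begin at ch if its decomposition starts with a starter."""
    return unicodedata.combining(unicodedata.normalize("NFD", ch)[0]) == 0

def segments(text):
    """Split text at safe points; reordering never crosses these boundaries."""
    start = 0
    for i in range(1, len(text)):
        if starts_segment(text[i]):
            yield text[start:i]
            start = i
    if text:
        yield text[start:]

def nfd_pieces(text):
    """Lazily yield NFD pieces, normalizing only segments that fail the check."""
    for piece in segments(text):
        if unicodedata.is_normalized("NFD", piece):
            yield piece
        else:
            yield unicodedata.normalize("NFD", piece)

def canonically_equal(a, b):
    """Compare two strings piece by piece, stopping at the first difference."""
    chars_a = chain.from_iterable(nfd_pieces(a))
    chars_b = chain.from_iterable(nfd_pieces(b))
    return all(x == y for x, y in zip_longest(chars_a, chars_b))
```

Because `all` stops at the first mismatch and the generators are lazy, two strings that differ in their first few characters are compared without normalizing their tails.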

Optimized Concatenation

• Simple concatenation of two normalized strings can yield a string that is not normalized

• One option is to normalize the result

• Unnecessarily duplicates normalization

Optimized Concatenation: Example

Figure: two pieces of "résumé" are concatenated where the join falls between the letter e and its combining acute accent. Two approaches are shown, both yielding "résumé":

• Concatenate, then normalize

• Find boundaries, then concatenate and normalize up to the boundaries

• It is enough to normalize the boundary parts

• Incremental normalization is used

• Much faster than redoing the whole resulting string
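
A hedged sketch of boundary-only concatenation. To keep the boundary test simple with Python's standard `unicodedata` module, it works on NFD strings (the slide example uses composed output), and the function name `concat_nfd` is hypothetical.

```python
import unicodedata

def concat_nfd(left, right):
    """Concatenate two NFD strings so that the result is also NFD."""
    # Find the last starter (combining class 0) in the left string.
    i = len(left)
    while i > 0:
        i -= 1
        if unicodedata.combining(left[i]) == 0:
            break
    # Find the first starter in the right string.
    j = 0
    while j < len(right) and unicodedata.combining(right[j]) != 0:
        j += 1
    # Only the marks around the join can end up out of order.
    middle = unicodedata.normalize("NFD", left[i:] + right[:j])
    return left[:i] + middle + right[j:]

# Dot above ends the left string, dot below begins the right one.
joined = concat_nfd("a\u0307", "\u0323b")
```

Here `joined` is `"a\u0323\u0307b"`, the same string that normalizing the whole concatenation would produce, but only three characters were passed to the normalizer.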

Accepting the FCD Form

• Fast Composed or Decomposed form is a partially normalized form

• Not unique

• More lenient than NFD or NFC form

• It requires that the procedure has support for all the canonically equivalent strings on input

• It is possible to quick check the FCD form

FCD Form: Examples

SEQUENCE              FCD   NFC   NFD

A-ring                 Y     Y

Angstrom               Y

A + ring               Y           Y

A + grave              Y           Y

A-ring + grave         Y

A + cedilla + ring     Y           Y

A + ring + cedilla

A-ring + cedilla             Y
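
The FCD column can be reproduced with a definition-level check: a string is FCD if decomposing each character on its own, without any reordering, already gives the NFD form. The sketch below uses Python's `unicodedata`. It illustrates the definition and is not the fast table-driven check a library would use. The names `is_fcd` and `EXAMPLES` are chosen for illustration.

```python
import unicodedata

def is_fcd(text):
    """True if per-character decomposition of text needs no reordering."""
    raw = "".join(unicodedata.normalize("NFD", ch) for ch in text)
    return raw == unicodedata.normalize("NFD", text)

# The eight sequences from the table above.
EXAMPLES = {
    "A-ring": "\u00c5",
    "Angstrom": "\u212b",
    "A + ring": "A\u030a",
    "A + grave": "A\u0300",
    "A-ring + grave": "\u00c5\u0300",
    "A + cedilla + ring": "A\u0327\u030a",
    "A + ring + cedilla": "A\u030a\u0327",
    "A-ring + cedilla": "\u00c5\u0327",
}

fcd_flags = {name: is_fcd(sequence) for name, sequence in EXAMPLES.items()}
```

The last two sequences fail because the ring (class 230) ends up before the cedilla (class 202) once A-ring is decomposed.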

Canonical Closure

• Preprocessing data to support the FCD form

• Ensures that if data is assigned to a sequence (or a code point) it will also be assigned to all canonically equivalent FCD sequences

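Example: if the value X is assigned to one of the following canonically equivalent sequences, canonical closure assigns X to all of them:

• A-ring (U+00C5)

• Angstrom sign (U+212B)

• A + combining ring above (U+0041 U+030A)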
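
A limited sketch of the closure step, assuming Python's `unicodedata`. It only adds the NFD form and single code points that decompose to it, whereas a full closure would enumerate every canonically equivalent FCD sequence. The function names are hypothetical, and Hangul syllables are skipped because `unicodedata.decomposition` does not list them.

```python
import sys
import unicodedata

def build_equivalents_index():
    """Map each NFD string to the single code points that decompose to it."""
    index = {}
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates are not characters
            continue
        ch = chr(cp)
        decomposition = unicodedata.decomposition(ch)
        # Keep canonical decompositions only; "<...>" marks compatibility ones.
        if decomposition and not decomposition.startswith("<"):
            index.setdefault(unicodedata.normalize("NFD", ch), []).append(ch)
    return index

def canonical_closure(mapping, index):
    """Copy each assigned value to the equivalent sequences found in index."""
    closed = dict(mapping)
    for sequence, value in mapping.items():
        nfd = unicodedata.normalize("NFD", sequence)
        closed.setdefault(nfd, value)
        for equivalent in index.get(nfd, []):
            closed.setdefault(equivalent, value)
    return closed

index = build_equivalents_index()  # done once, at build time
closed = canonical_closure({"\u00c5": "X"}, index)
```

After closure, a lookup of `"\u212b"` or `"A\u030a"` in `closed` returns the same value as `"\u00c5"`, so the runtime code no longer needs to normalize before the lookup.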

Collation

• Locale specific sorting of strings

• Relation between code points and collation elements

• Context sensitive:

– Contractions: H < Z, but CZ < CH

– Expansions: OE < Œ < OF

– Both: カー < カイ or キー > キイ

See “Collation in ICU” by Mark Davis
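
A toy illustration of how contractions and expansions turn code points into collation elements. The weights, the tables, and the function name `toy_sort_key` are invented for this sketch and are not ICU data. It only handles the letters a to z and œ.

```python
# Hypothetical primary weights for a-z, spaced to leave room between letters.
PRIMARY = {ch: (i + 1) * 10 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

# Contraction: "ch" is one collation element that sorts just after "h".
CONTRACTIONS = {"ch": PRIMARY["h"] + 5}

# Expansion: one code point produces the collation elements of "o" and "e".
EXPANSIONS = {"\u0153": "oe"}

def toy_sort_key(text):
    """Return (primary weights, tie-breakers) for a lowercase-folded string."""
    primaries, tie_breakers = [], []
    text = text.lower()
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in CONTRACTIONS:
            primaries.append(CONTRACTIONS[pair])
            tie_breakers.append(0)
            i += 2
        elif text[i] in EXPANSIONS:
            for ch in EXPANSIONS[text[i]]:
                primaries.append(PRIMARY[ch])
                tie_breakers.append(1)  # ligature sorts after the plain letters
            i += 1
        else:
            primaries.append(PRIMARY[text[i]])
            tie_breakers.append(0)
            i += 1
    return (primaries, tie_breakers)

ordered = sorted(["ch", "h", "cz", "c"], key=toy_sort_key)
```

With these made-up weights, `ordered` is `["c", "cz", "h", "ch"]`, and `"oe"` sorts before `"œ"`, which sorts before `"of"`.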

Collation Implementation in ICU

• Two modes of operation:

– Normalization OFF: expects the users to pass in FCD strings

– Normalization ON: accepts any strings

• Some locales require normalization to be turned on

• Canonical closure done for contractions and regular mappings

• Two important services:

– Sort key generation

– String compare function

More about ICU at the end of the presentation

FCD Support in Collation

• Much higher performance

• Values assigned to a code point or a contraction are equal to those for its FCD canonically equivalent sequences

• This process is time consuming, but it is done at build time

• May increase the size of the data set

Sort Key Generation

• Whole strings are processed

• Sort keys tend to get reused, so the emphasis is on producing sort keys that are as short as possible

• Two modes of operation:

– Normalization ON: strings are quick checked and normalization is performed, if required

– Normalization OFF: depends on strings being in FCD form. The performance increases by 20% to 50%
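
The two modes can be sketched with a tiny, canonically closed weight table. Everything here is hypothetical (the weights, the table, and the function name `make_sort_key`), and Python's `unicodedata.is_normalized` (Python 3.8 or later) stands in for the quick check.

```python
import unicodedata

# Hypothetical weights; the three A-ring encodings share one value (closure).
WEIGHTS = {
    "A": (10,),
    "B": (20,),
    "A\u030a": (15,),
    "\u00c5": (15,),
    "\u212b": (15,),
}

def make_sort_key(text, normalization_on):
    """Build a sort key; with normalization off the input must be FCD."""
    if normalization_on and not unicodedata.is_normalized("NFD", text):
        text = unicodedata.normalize("NFD", text)
    key = []
    i = 0
    while i < len(text):
        # Prefer the longest match so base + accent is looked up as one unit.
        for length in (2, 1):
            piece = text[i:i + length]
            if piece in WEIGHTS:
                key.extend(WEIGHTS[piece])
                i += len(piece)
                break
        else:
            raise KeyError(f"no weight for {text[i]!r}")
    return tuple(key)
```

With normalization off, `"\u00c5B"`, `"\u212bB"`, and `"A\u030aB"` all produce the key `(15, 20)` without any normalization pass, because the table already covers each FCD encoding.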

String Compare

• Very time critical

• Result is usually determined before fully processing both strings

• First step is binary comparison for equality

• When it fails, comparison continues from a safe spot

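Figure: example pairs (A and Å; c h and c z) showing where comparison resumes after the first binary difference:

• No need to back up in the normal situation

• Must back up to the start of a contraction (c h versus c z)

• Must back up to the normalization safe spot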
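
A simplified sketch of the first two steps of a string compare: find the first binary difference, then back up to a spot from which collation can safely resume. The set of contraction-starting letters is a hypothetical parameter, and the safe-spot test (no combining mark at the position, no contraction starter just before it) is a simplification of what a real implementation checks.

```python
import unicodedata

def common_prefix_length(a, b):
    """Length of the identical leading part of a and b (binary comparison)."""
    limit = min(len(a), len(b))
    i = 0
    while i < limit and a[i] == b[i]:
        i += 1
    return i

def safe_compare_start(a, b, contraction_first=frozenset("c")):
    """Index from which the full comparison of a and b must start."""
    def needs_backup(k):
        # A combining mark at k belongs to the sequence that starts earlier.
        for s in (a, b):
            if k < len(s) and unicodedata.combining(s[k]) != 0:
                return True
        # The shared character before k might begin a contraction.
        return k > 0 and a[k - 1] in contraction_first

    i = common_prefix_length(a, b)
    while i > 0 and needs_backup(i):
        i -= 1
    return i
```

For `"abc"` and `"abd"` the comparison resumes at index 2. For `"ch"` and `"cz"`, or for `"A\u030a"` and `"A\u0300"`, it backs up to index 0.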

String Compare Continued

• Normalization ON: incremental FCD check and incremental FCD normalization if required

• Normalization OFF: assumes that the source strings are FCD

• Most locales don’t require normalization to be turned on, and are thus 20% faster by using FCD

International Components for Unicode

• International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support

• The ICU normalization engine supports the optimizations mentioned here

• Library services accept FCD strings as input

• Wide variety of supported platforms

• Open source (X license – non-viral)

• C/C++ and Java versions

• http://oss.software.ibm.com/icu/

Conclusion

• The presented techniques allow much faster string processing

• In the case of collation, sort key generation gets up to 50% faster than if normalizing beforehand

• String compare function becomes up to 3 times faster!

• May increase data size

• Canonical closure preprocessing takes more time to build, but pays off at runtime

Q & A

Summary

• Introduction

• Avoiding normalization

• Check for normalized text

• Normalize incrementally

• Concatenation of normalized strings

• Accepting the FCD form

• Implementation of collation in ICU