Unicode CJK Compatible Variations

Unicode CJK Compatible Variations Taken from UNIHAN Database

Transcript of Unicode CJK Compatible Variations

  • 1. Unicode CJK Compatible Variations Taken from UNIHAN Database

2. I once got a task to convert Chinese text

    • Unicode text extracted from PDF
    • Traditional Chinese to Simplified Chinese
    • Saved as a GBK text file for full-text search
    • But some common words produced errors
    • Once converted to GBK, they come out as the ? character
  • $ echo | iconv -f utf8 -t gbk
  • iconv: illegal input sequence at position 0
  • $ echo | perl -MEncode -ne 'print encode q(utf8), decode q(gb2312), encode q(gb2312), decode q(utf8), $_'
  • ?
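
The failure above can be reproduced without iconv. The following is a minimal sketch in Python (chosen here only because its standard library ships a GBK codec), assuming the offending character is a CJK compatibility ideograph such as U+F977, the code point named on slide 6. Whether that is the exact character piped into the commands above is an assumption.

```python
# Assumption: U+F977 stands in for the character that failed above.
compat = "\uf977"    # CJK COMPATIBILITY IDEOGRAPH-F977
unified = "\u4eae"   # the unified ideograph with the same appearance


def to_gbk(text):
    """Encode to GBK, substituting '?' for anything GBK cannot represent."""
    return text.encode("gbk", errors="replace")


# Strict encoding behaves like iconv: it refuses the compatibility ideograph.
try:
    compat.encode("gbk")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print("strict encode succeeded:", strict_ok)
print("lenient encode gives:", to_gbk(compat))
print("unified form round-trips:",
      unified.encode("gbk").decode("gbk") == unified)
```

The lenient mode mirrors what the Perl Encode round trip does by default: the unmappable character is silently replaced, which is how the ? ends up in the saved file.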

3. There is more than one way to enlighten the world

  • http://www.isthisthingon.org/unicode/index.phtml?page=0F&subpage=9

4. So I declared they are Korean chars :)

    • That would mean the conversion is impossible
    • But is there any report about this trend?
    • I came to learn what CJK Compatibility Ideographs are
    • And I want a mapping to the Simplified Unicode characters
    • And finally something compatible with GBK
    • One site I discovered while writing this slide

5. Is there any power that can make this happen?

    • One PDF viewer named evince works magically
    • I guess that means perl can do it too
    • With help from CPAN modules
    • Unicode::Unihan can be a solution
    • It was introduced to me by fayland three years ago
    • Now can it do more than extract Pinyin?

6. The table I want is:

  • perl -MEncode -e 'print encode q(UCS-2LE), decode q(utf8), qq(\xEF\xA5\xB7\0\xE4\xBA\xAE)' | od -x
  • 0000000 f977 0000 4eae
  • 0xf977 => 0x4eae
  • perl -MEncode -e 'print encode q(UCS-2LE), decode q(utf8), qq(\xEF\xA5\xBD\0\xE8\xAB\x92)' | od -x
  • 0000000 f97d 0000 8ad2
  • 0xf97d => 0x8ad2
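
The two pairs above can be cross-checked with nothing but a standard library. This sketch uses Unicode normalization (NFC) rather than the Unihan kZVariant field, so it is a different technique that happens to give the same answer for these two characters: compatibility ideographs carry a canonical decomposition to their unified counterpart. The helper names `unify` and `utf16_units` are made up for this illustration.

```python
import unicodedata


def unify(ch):
    """Map a compatibility ideograph to its unified form via NFC."""
    return unicodedata.normalize("NFC", ch)


def utf16_units(text):
    """Return the UTF-16 code units of text as hex strings, like `od -x`."""
    data = text.encode("utf-16-le")
    return [
        format(int.from_bytes(data[i:i + 2], "little"), "04x")
        for i in range(0, len(data), 2)
    ]


for cp in (0xF977, 0xF97D):
    ch = chr(cp)
    # Same layout as the one-liners: character, NUL, mapped character.
    print(" ".join(utf16_units(ch + "\0" + unify(ch))))
```

Normalization only covers characters that have a canonical decomposition, so it is not a general replacement for a variant lookup.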

7. XYZ variants, I picked the last

    • Tag: kZVariant
    • Status: Provisional
    • Category: Variants
    • Separator: space
    • Syntax: U+2?[0-9A-F]{4}(:k[A-Za-z]+)?
    • Description: The Unicode value(s) for known z-variants of this character.
    • x-axis to represent meaning
    • y-axis to represent abstract shape
    • z-axis is used for stylistic variations
  • http://search.cpan.org/~dankogai/Unicode-Unihan-0.03/Unihan.pm
  • But the latest version is no longer 0.03! Thanks Dan =>
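
The Separator and Syntax lines above are enough to parse a kZVariant field value. A minimal sketch follows, assuming the value arrives as one plain string. The function name `parse_zvariant` and the sample value with a `:kHypothetical` suffix are invented for illustration and are not taken from the Unihan data.

```python
import re

# The Syntax line from the field description, used as-is.
ZVARIANT_ITEM = re.compile(r"U\+2?[0-9A-F]{4}(:k[A-Za-z]+)?")


def parse_zvariant(field):
    """Split a space-separated kZVariant value into integer code points."""
    code_points = []
    for item in field.split():
        if not ZVARIANT_ITEM.fullmatch(item):
            raise ValueError("not a valid kZVariant item: %r" % item)
        # Drop the optional ":kSource" suffix, then the leading "U+".
        code = item.split(":", 1)[0]
        code_points.append(int(code[2:], 16))
    return code_points


print([hex(cp) for cp in parse_zvariant("U+4EAE")])
# Hypothetical value, only to exercise the optional parts of the syntax.
print([hex(cp) for cp in parse_zvariant("U+F977 U+20000:kHypothetical")])
```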

8. But how to get them in bulk?

    • The quick and dirty way
    • So I had already forgotten it once I got it done
    • What I remember is iterating from F900 to FA00
    • with help from Unicode::String
    • use Unicode::Unihan;
    • use Encode;
    • $uh = Unicode::Unihan->new();
    • print $uh->ZVariant(decode(q(utf8), qq(\xEF\xA5\xB7)))
    • U+4EAE
    • print $uh->ZVariant(decode(q(utf8), qq(\xE4\xBA\xAE)))
    • U+F977

9. The loop starts here

    • Unicode::String's special constructor
    • and its special behavior
  • perl -MUnicode::String=uchr -e 'print uchr($_)->utf8 for 0xf900 .. 0xf9ff'
    • Combining the two powers
  • perl -MUnicode::String=uchr,uhex -MUnicode::Unihan -MEncode -e '$uh=Unicode::Unihan->new(); print uhex $uh->ZVariant(decode(q(utf8), uchr($_)->utf8)) for 0xf900 .. 0xf9ff'

10. Do I really need to install those modules on the production machine?

  • perl -MUnicode::String=uchr,uhex -MUnicode::Unihan -MEncode -e '$uh=Unicode::Unihan->new(); print qq("), $x = uchr($_)->utf8, qq(" => "), uhex($uh->ZVariant(decode(q(utf8), $x))), qq("\n) for 0xf900 .. 0xfaff' | less
  • " " => " "
  • " " => " "
  • " " => " "
  • " " => " "
  • ...
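
One way to avoid extra modules on the production machine is to generate the table from data that already ships with the language runtime. The sketch below is a Python stand-in for the Perl loop, under the assumption that the canonical decomposition in the Unicode Character Database is an acceptable substitute for kZVariant in the F900..FAFF range. It is not the method used in the talk, and `build_table` is a made-up name.

```python
import unicodedata


def build_table(first=0xF900, last=0xFAFF):
    """Map each compatibility ideograph in the range to its unified form."""
    table = {}
    for cp in range(first, last + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # An empty string means unassigned, or one of the few unified
        # ideographs that live in this block; "<...>" marks a
        # compatibility-only mapping, which is skipped here.
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        if len(parts) == 1:
            table[cp] = int(parts[0], 16)
    return table


table = build_table()

# str.translate accepts the {code point: code point} dict directly.
text = "\uf977\uf97d"
fixed = text.translate(table)
print(fixed.encode("gbk") == "\u4eae\u8ad2".encode("gbk"))

# Print a few rows in the same "from" => "to" layout as above, as code points.
for cp in sorted(table)[:4]:
    print('"U+%04X" => "U+%04X"' % (cp, table[cp]))
```

The resulting dict can be dumped once and pasted into the production script as a literal, so the lookup at run time needs no database at all. It handles only the compatibility ideographs. The Traditional to Simplified step is a separate conversion.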

11. Thanks!

  • ...