Unicode CJK Compatible Variations
Unicode CJK Compatible Variations Taken from UNIHAN Database
Uploaded by joe-jiang
Transcript of Unicode CJK Compatible Variations
1. Unicode CJK Compatible Variations Taken from UNIHAN Database
2. Once I got a task to convert Chinese text
- Unicode text extracted from PDF
- Traditional Chinese to Simplified Chinese
- Saved as a GBK text file for full-text search
- But some common words caused errors
- Once converted to GBK, they come out as a ? character
- $ echo | iconv -f utf8 -t gbk
- iconv: illegal input sequence at position 0
- $ echo | perl -MEncode -ne 'print encode q(utf8), decode q(gb2312), encode q(gb2312), decode q(utf8), $_'
- ?
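The failure above can be checked from a script without printing any non-ASCII text. This is a minimal sketch, not the original conversion code: it assumes the troublesome character is U+F977 (the code point that appears in the mapping table later in these slides), uses `cp936`, which is the name Perl's Encode gives to GBK, and introduces a hypothetical helper `encodable`.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if $text can be represented in $encoding.
sub encodable {
    my ($encoding, $text) = @_;
    my $copy = $text;    # encode() may modify its input when a CHECK flag is given
    # FB_CROAK makes encode() die on an unmappable character instead of emitting "?"
    return eval { encode($encoding, $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Assumption: U+F977 is the compatibility form, U+4EAE the unified form.
for my $cp (0xF977, 0x4EAE) {
    printf "U+%04X %s in GBK\n", $cp,
        encodable('cp936', chr $cp) ? 'is representable' : 'is NOT representable';
}
```

Checking before encoding avoids writing silent `?` substitutions into the full-text index.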
3. There is more than one way to enlighten the world
- http://www.isthisthingon.org/unicode/index.phtml?page=0F&subpage=9
4. So I declared they are Korean chars :)
- That means it's impossible to do the conversion
- But is there any report about this trend?
- I came to know what CJK Compatibility Ideographs are
- And I want a mapping to Simplified Unicode
- And finally one that is compatible with GBK
- One site I discovered while writing this slide
5. Is there any power that can make this happen?
- One PDF viewer named evince works magically
- I guess that means Perl can do it too
- With help from CPAN modules
- Unicode::Unihan can be a solution
- It was introduced to me by fayland three years ago
- Can it now do more than extract Pinyin?
6. The table I want is:
- perl -MEncode -e 'print encode q(UCS-2LE), decode q(utf8), qq(\xEF\xA5\xB7\0\xE4\xBA\xAE)' | od -x
- 0000000 f977 0000 4eae
- 0xf977 => 0x4eae
- perl -MEncode -e 'print encode q(UCS-2LE), decode q(utf8), qq(\xEF\xA5\xBD\0\xE8\xAB\x92)' | od -x
- 0000000 f97d 0000 8ad2
- 0xf97d => 0x8ad2
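The `od -x` dump above can also be reproduced inside Perl, which sidesteps any dependence on how `od` groups bytes on a given machine. A small sketch under assumptions: the helper name `code_units` is made up for illustration, and the input string is written with `\x{...}` escapes rather than literal characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the 16-bit code units of a string, as lowercase hex words.
sub code_units {
    my ($text) = @_;
    my $bytes = encode('UCS-2LE', $text);                 # two bytes per character, little-endian
    return map { sprintf '%04x', $_ } unpack 'v*', $bytes; # 'v' = 16-bit little-endian
}

# Compatibility form, a NUL separator, then the unified form.
print join(' ', code_units("\x{F977}\0\x{4EAE}")), "\n";
print join(' ', code_units("\x{F97D}\0\x{8AD2}")), "\n";
```

Writing the characters as escapes keeps the two look-alike forms distinguishable in the source file.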
7. XYZ variants, I picked the last
- Tag: kZVariant
- Status: Provisional
- Category: Variants
- Separator: space
- Syntax: U+2?[0-9A-F]{4}(:k[A-Za-z]+)?
- Description: The Unicode value(s) for known z-variants of this character.
- x-axis represents meaning
- y-axis represents abstract shape
- z-axis is used for stylistic variations
- http://search.cpan.org/~dankogai/Unicode-Unihan-0.03/Unihan.pm
- But the latest version is not 0.03 now! Thanks Dan =>
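The Syntax and Separator lines above are enough to turn a kZVariant field into code points. A minimal sketch, assuming the field arrives as a plain string; `parse_zvariant` is a hypothetical helper, and the sample input with a `:kFoo` suffix is invented only to exercise the optional source tag.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a kZVariant field value into a list of code points.
sub parse_zvariant {
    my ($field) = @_;
    my @code_points;
    for my $entry (split ' ', $field) {    # Separator: space
        # Same shape as the documented Syntax, with a capture group for the hex digits.
        next unless $entry =~ /\AU\+(2?[0-9A-F]{4})(?::k[A-Za-z]+)?\z/;
        push @code_points, hex $1;
    }
    return @code_points;
}

# Invented sample: one plain entry and one with a source tag.
printf "U+%04X\n", $_ for parse_zvariant('U+4EAE U+2F8A6:kFoo');
```

Entries that do not match the syntax are skipped silently here; a stricter version could die instead.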
8. But how to get them in bulk?
- Quick and dirty way
- So I had already forgotten it once I got it done
- What I remember is iterating from F900 to FA00
- with help from Unicode::String
- use Unicode::Unihan;
- use Encode;
- $uh = Unicode::Unihan->new();
- print $uh->ZVariant(decode(q(utf8), qq(\xEF\xA5\xB7)))
- U+4EAE
- print $uh->ZVariant(decode(q(utf8), qq(\xE4\xBA\xAE)))
- U+F977
9. The loop starts here
- Unicode::String's special constructor
- and its special behavior
- perl -MUnicode::String=uchr -e 'print uchr($_)->utf8 for 0xf900 .. 0xf9ff'
- combining the two powers
- perl -MUnicode::String=uchr,uhex -MUnicode::Unihan -MEncode -e '$uh=Unicode::Unihan->new(); print uhex $uh->ZVariant(decode(q(utf8), uchr($_)->utf8)) for 0xf900 .. 0xf9ff'
10. Do I really need to install those modules on the production machine?
- perl -MUnicode::String=uchr,uhex -MUnicode::Unihan -MEncode -e '$uh=Unicode::Unihan->new(); print qq("), $x = uchr($_)->utf8, qq(" => "), uhex($uh->ZVariant(decode(q(utf8), $x))), qq("\n) for 0xf900 .. 0xfaff' | less
- " " => " "
- " " => " "
- " " => " "
- " " => " "
- ...
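If installing CPAN modules in production is the concern, a similar table can be sketched with a core module only. Note that this swaps in a different technique: it uses canonical normalization from Unicode::Normalize instead of the kZVariant field from Unicode::Unihan, on the assumption that the two agree for the characters of interest (they do for the two pairs shown earlier, but that is not verified here for the whole block). The name `build_table` is hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: { compatibility code point => canonical code point } for a range.
sub build_table {
    my ($first, $last) = @_;
    my %table;
    for my $cp ($first .. $last) {
        my $normalized = NFC(chr $cp);
        # Skip code points that normalize to themselves (unassigned, or already unified)
        next if length($normalized) != 1 || ord($normalized) == $cp;
        $table{$cp} = ord $normalized;
    }
    return \%table;
}

# Same range as the one-liner above; output uses hex only, so no wide characters are printed.
my $table = build_table(0xF900, 0xFAFF);
printf "0x%04x => 0x%04x,\n", $_, $table->{$_} for sort { $a <=> $b } keys %$table;
```

The printed pairs can be pasted into a static hash, so the production machine needs neither Unicode::Unihan nor Unicode::String at run time.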
11. Thanks!
- ...