Unicode CJK Compatible Variations

1. Unicode CJK Compatible Variations Taken from the Unihan Database

Unicode extracted from PDF

Traditional Chinese to Simplified Chinese

Saved in GBK text file for full-text search

But some common words resulted in errors

Once converted to GBK, they come out as the ? character

$ printf '\xEF\xA5\xB7\n' | iconv -f utf8 -t gbk    # U+F977, compatibility form of 亮

iconv: illegal input sequence at position 0

$ printf '\xEF\xA5\xB7\n' | perl -MEncode -ne 'print encode q(utf8), decode q(gb2312), encode q(gb2312), decode q(utf8), $_'

That means a direct conversion is impossible
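To find which characters trigger the failure, one option is to try encoding each character separately and collect the ones that are rejected. This is only a sketch: it assumes Encode's `cp936` table is an acceptable stand-in for GBK, and `unencodable_in_gbk` is a hypothetical helper name, not something from the slides.

```perl
use strict;
use warnings;
use Encode ();

# Return the code points (as "U+XXXX" strings) in $text that cannot be
# encoded with Encode's cp936 table (assumed here to stand in for GBK).
sub unencodable_in_gbk {
    my ($text) = @_;
    my @bad;
    for my $ch (split //, $text) {
        my $copy = $ch;    # encode() with a CHECK value may modify its input
        my $ok = eval { Encode::encode('cp936', $copy, Encode::FB_CROAK); 1 };
        push @bad, sprintf('U+%04X', ord $ch) unless $ok;
    }
    return @bad;
}

# U+4EAE is the unified ideograph; U+F977 is its compatibility twin.
print join(' ', unencodable_in_gbk("\x{4EAE}\x{F977}")), "\n";
```

Running this over the extracted text gives a list of code points to look up, rather than a single "position 0" error.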

But are there any reports about this problem?

I came to learn what CJK Compatibility Ideographs are

And I want a mapping to Simplified Unicode

And finally compatible with GBK

One site I discovered while writing these slides

One PDF viewer named evince works magically

I guess that means Perl can do it too

With help from CPAN modules

Unicode::Unihan can be a solution

It was introduced to me by fayland three years ago

Can it now do more than extract Pinyin?

perl -MEncode -e 'print encode q(UCS-2LE), decode q(utf8), qq(\xEF\xA5\xB7\0\xE4\xBA\xAE)' | od -x

0000000 f977 0000 4eae

0xf977 => 0x4eae

perl -MEncode -e 'print encode q(UCS-2LE), decode q(utf8), qq(\xEF\xA5\xBD\0\xE8\xAB\x92)' | od -x

0000000 f97d 0000 8ad2

0xf97d => 0x8ad2
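The two pairs above can be cross-checked without any CPAN module. In the Unicode Character Database, U+F977 and U+F97D carry canonical decompositions to U+4EAE and U+8AD2, so normalizing with the core module Unicode::Normalize folds them into the unified ideographs. This is a swapped-in technique, not the kZVariant lookup used later in the slides, and the two may not agree for every character; `unify` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Map one code point to the code point NFC normalization turns it into.
# For these compatibility ideographs the result is the unified ideograph.
sub unify {
    my ($cp) = @_;
    return ord NFC(chr $cp);
}

printf "U+%04X => U+%04X\n", $_, unify($_) for 0xF977, 0xF97D;
```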

Tag: kZVariant

Status: Provisional

Category: Variants

Separator: space

Syntax: U\+2?[0-9A-F]{4}(:k[A-Za-z]+)?

Description: The Unicode value(s) for known z-variants of this character.

x-axis represents meaning

y-axis represents abstract shape

z-axis is used for stylistic variations
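Given the separator and syntax stated for kZVariant, a field value can be split into code points with a few lines of Perl. A minimal sketch, assuming the field arrives as a plain string; `parse_zvariant` is a hypothetical helper name and the optional `:kSource` suffix is simply discarded.

```perl
use strict;
use warnings;

# Parse a kZVariant field value (space-separated entries) into code points.
sub parse_zvariant {
    my ($field) = @_;
    my @cps;
    for my $item (split ' ', $field // '') {
        # Entry shape from the syntax line: U+XXXX or U+2XXXX, optional :kSource
        next unless $item =~ /\AU\+(2?[0-9A-F]{4})(?::k[A-Za-z]+)?\z/;
        push @cps, hex $1;
    }
    return @cps;
}

printf "U+%04X\n", $_ for parse_zvariant('U+4EAE');
```

Entries that do not match the syntax are skipped silently here; a stricter version might report them instead.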

http://search.cpan.org/~dankogai/Unicode-Unihan-0.03/Unihan.pm

But the latest version is no longer 0.03! Thanks, Dan

Quick and dirty way

So I forgot the details as soon as it was done

What I remember is iterating from U+F900 to U+FA00

with help from Unicode::String
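The iteration described above can also be sketched with core modules only. This version substitutes Unicode::Normalize for the Unihan lookup and Encode's `cp936` table for GBK, both of which are assumptions rather than what the slides use, and it walks the whole U+F900..U+FAFF block. The helper names `gbk_ok` and `compat_table` are hypothetical.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize qw(NFC);

# True if the single code point $cp can be encoded with Encode's cp936 table.
sub gbk_ok {
    my ($cp) = @_;
    my $s = chr $cp;    # work on a copy; encode() may modify its input
    return eval { Encode::encode('cp936', $s, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Build { compatibility code point => unified code point } for a range,
# keeping only entries whose target is encodable in cp936.
sub compat_table {
    my ($from, $to) = @_;
    my %map;
    for my $cp ($from .. $to) {
        my $norm = NFC(chr $cp);
        # Skip code points that NFC leaves unchanged or expands.
        next if length($norm) != 1 || ord($norm) == $cp;
        my $target = ord $norm;
        $map{$cp} = $target if gbk_ok($target);
    }
    return \%map;
}

my $table = compat_table(0xF900, 0xFAFF);
printf "U+%04X => U+%04X\n", $_, $table->{$_}
    for sort { $a <=> $b } keys %$table;
```

Printing code points rather than the characters themselves keeps the output readable in any terminal encoding.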

use Unicode::Unihan;

use Encode;

my $uh = Unicode::Unihan->new();

print $uh->ZVariant(decode(q(utf8), qq(\xEF\xA5\xB7)));   # input is U+F977

U+4EAE

print $uh->ZVariant(decode(q(utf8), qq(\xE4\xBA\xAE)));   # input is U+4EAE

U+F977

Unicode::String's special constructor

and its special behavior

perl -MUnicode::String=uchr -e 'print uchr($_)->utf8 for 0xf900 .. 0xf9ff'

Combining the power of the two

perl -MUnicode::String=uchr,uhex -MUnicode::Unihan -MEncode -e '$uh=Unicode::Unihan->new(); print uhex $uh->ZVariant(decode(q(utf8), uchr($_)->utf8)) for 0xf900 .. 0xf9ff'

perl -MUnicode::String=uchr,uhex -MUnicode::Unihan -MEncode -e '$uh=Unicode::Unihan->new(); print qq("), $x = uchr($_)->utf8, qq(" => "), uhex($uh->ZVariant(decode(q(utf8), $x))), qq("\n) for 0xf900 .. 0xfaff' | less

" " => " "
