Unicode basics in Python
Uploaded by navaneethan-ramasamy
![Page 1: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/1.jpg)
Unicode
in Python
![Page 2: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/2.jpg)
We cover these now
● Unicode history
● Clarifying terms (code point, BOM, UTF-8, UTF-16)
● Decoding and encoding in Python
● How Django handles these
● Helpful Python modules to tackle it
Note: a BOM (byte order mark) is used in UTF-16 because each code unit spans two bytes, so the byte order has to be signalled.
![Page 3: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/3.jpg)
How did it come about?
Americans came up with the 7-bit ASCII representation, covering only the English alphabet, as a standard for exchanging information ('A' = 65, 'a' = 97).
The rest of the world added their own accented and non-English characters (such as 'ä'), each in their own way, which made a mess.
![Page 4: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/4.jpg)
Why was Unicode born?
To exchange information in all languages, there were some requirements:
● A unique and simple rule was needed
● Adoptable across all machines (Windows, IBM, etc.)
● Storage as efficient as possible
![Page 5: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/5.jpg)
Unicode
Unicode = UCS (Universal Character Set) + bit representation logic
UCS: character + code point ('a', 97)
Bit representation: a BOM (byte order mark) signals big-endian (or) little-endian byte order. "Hello" in UTF-16:
00 48 00 65 00 6C 00 6C 00 6F (big endian) (or) 48 00 65 00 6C 00 6C 00 6F 00 (little endian)
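The two byte sequences above can be reproduced with a minimal Python 3 sketch. The helper name `hex_dump` is hypothetical and only there for display; the codec names and the `codecs.BOM_UTF16_*` constants come from the standard library.

```python
import codecs

def hex_dump(data):
    """Format bytes as space-separated uppercase hex pairs (display helper only)."""
    return " ".join(f"{byte:02X}" for byte in data)

text = "Hello"

# A code point is just a number assigned to a character.
code_point = ord("a")  # 97

# Codecs with an explicit byte order write no BOM.
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")
print(hex_dump(big_endian))
print(hex_dump(little_endian))

# The generic "utf-16" codec prepends a BOM so a reader can detect the order.
with_bom = text.encode("utf-16")
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding with "utf-16" consumes the BOM and recovers the original text.
round_trip = with_bom.decode("utf-16")
```

Which BOM the generic codec writes depends on the machine's native byte order, so the sketch only checks that one of the two is present.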
![Page 6: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/6.jpg)
UTF-8 is popular because
● multi-byte encoding
● variable-width encoding
● a code point takes up to 4 bytes in UTF-8
● mostly no need for a BOM (8-bit code units)
● memory efficient
How? For non-ASCII characters, the first byte is reserved to indicate the number of bytes the character uses (similar to a compression scheme).
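As a rough Python 3 illustration of the lead-byte rule, the sketch below reads the length of a UTF-8 sequence from the high bits of its first byte. The function name `utf8_length_from_lead_byte` and the sample characters are chosen for illustration and are not from the slides.

```python
def utf8_length_from_lead_byte(first_byte):
    """Return the length of a UTF-8 sequence, judged only by its first byte."""
    if first_byte & 0b10000000 == 0:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if first_byte & 0b11100000 == 0b11000000:   # 110xxxxx: 2 bytes
        return 2
    if first_byte & 0b11110000 == 0b11100000:   # 1110xxxx: 3 bytes
        return 3
    if first_byte & 0b11111000 == 0b11110000:   # 11110xxx: 4 bytes
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

# 'a', a-umlaut, the euro sign, and an emoji: 1, 2, 3 and 4 bytes in UTF-8.
samples = ["a", "\u00e4", "\u20ac", "\U0001F600"]

actual_lengths = [len(ch.encode("utf-8")) for ch in samples]
predicted_lengths = [
    utf8_length_from_lead_byte(ch.encode("utf-8")[0]) for ch in samples
]

for ch, length in zip(samples, actual_lengths):
    lead_bits = format(ch.encode("utf-8")[0], "08b")
    print(f"U+{ord(ch):04X} uses {length} byte(s), lead byte {lead_bits}")
```

ASCII text therefore costs one byte per character, which is where the memory efficiency comes from.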
![Page 7: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/7.jpg)
decoding
● Converts encoded characters (bytes) to numeric values (code points)
● from <type 'str'> to <type 'unicode'>
● most often raises "UnicodeDecodeError:"
(samples demo)
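A small decoding sketch, written for Python 3, where the same step goes from `bytes` to `str` rather than from `str` to `unicode`. The sample bytes are an assumption for illustration.

```python
# The UTF-8 bytes for "cafe" with an acute accent on the last letter.
raw = b"caf\xc3\xa9"

# Decoding with the right codec yields text (code points).
text = raw.decode("utf-8")

# Decoding with the wrong codec raises UnicodeDecodeError.
try:
    raw.decode("ascii")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc)

# An error handler trades the exception for replacement characters (U+FFFD).
lenient = raw.decode("ascii", errors="replace")
```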
![Page 8: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/8.jpg)
encoding
● Converts numeric values (code points) to encoded characters (bytes)
● from <type 'unicode'> to <type 'str'>
● most often raises "UnicodeEncodeError:"
(samples demo)
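The reverse direction can be sketched the same way in Python 3, where encoding goes from `str` to `bytes`. The sample string is an assumption for illustration.

```python
text = "caf\u00e9"  # "cafe" with an acute accent on the last letter

# UTF-8 can represent every code point, so this always succeeds.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent U+00E9, so encoding raises UnicodeEncodeError.
try:
    text.encode("ascii")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print("encode failed:", exc)

# Error handlers substitute something encodable instead of raising.
with_question_mark = text.encode("ascii", errors="replace")
with_escape = text.encode("ascii", errors="backslashreplace")
```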
![Page 9: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/9.jpg)
Rules to remember…
● Decode early, Unicode everywhere, encode late
● UTF-8 is the best guess for an encoding
● chardet.detect()
==========================
In Python 3 this is solved…
● str (<class 'str'>) is a Unicode object
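One way to picture "decode early, Unicode everywhere, encode late" is the Python 3 sketch below. The function name `shout` and its uppercase step are hypothetical stand-ins for whatever processing a real program does; UTF-8 is used as the default guess, as the rules suggest.

```python
def shout(raw_input_bytes, encoding="utf-8"):
    """Uppercase some encoded text while keeping bytes at the edges only."""
    text = raw_input_bytes.decode(encoding)  # decode early, at the input boundary
    processed = text.upper()                 # work on Unicode text in the middle
    return processed.encode(encoding)        # encode late, at the output boundary

result = shout(b"caf\xc3\xa9")
print(result.decode("utf-8"))
```

Uppercasing the raw bytes directly would leave the accented letter untouched, because `bytes.upper()` only knows about ASCII.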
![Page 10: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/10.jpg)
How does Django handle this? (Python 2)
>>> def to_unicode(obj, encoding='utf-8'):
...     # Decode byte strings; leave unicode and non-string objects alone.
...     if isinstance(obj, basestring):
...         if not isinstance(obj, unicode):
...             obj = unicode(obj, encoding)
...     return obj
Helpers in django.utils.encoding:
smart_text(s, encoding='utf-8', strings_only=False, errors='strict')
force_text(s, encoding='utf-8', strings_only=False, errors='strict')
smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict')
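For readers on Python 3, the same idea as `to_unicode` can be sketched without `basestring` or `unicode`. This is an illustrative analogue, not Django's implementation; the name `to_text` is hypothetical.

```python
def to_text(obj, encoding="utf-8", errors="strict"):
    """Return text for bytes input; pass everything else through unchanged."""
    if isinstance(obj, bytes):
        return obj.decode(encoding, errors)
    return obj

from_bytes = to_text(b"caf\xc3\xa9")  # bytes are decoded
from_text = to_text("already text")   # str is returned as-is
from_other = to_text(42)              # non-string objects are left alone
```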
![Page 11: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/11.jpg)
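How to set your Python default encoding? (Python 2 only)
>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'
(or) declare the encoding of the source file on its first or second line:
# -*- coding: utf-8 -*-
(tells Python that <mod_name.py> is saved in UTF-8; it does not change the default encoding)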
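On Python 3 none of this is needed: `sys.setdefaultencoding` no longer exists and the default is already UTF-8. A minimal sketch, assuming a throwaway file named `sample.txt` in a temporary directory, that checks the default and passes the encoding explicitly when doing file I/O:

```python
import os
import sys
import tempfile

# On Python 3 this is always 'utf-8'.
default_encoding = sys.getdefaultencoding()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name

    # Encode late: state the encoding at the point where text becomes bytes.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("caf\u00e9")

    # The bytes on disk are the UTF-8 encoding of the text.
    with open(path, "rb") as handle:
        stored_bytes = handle.read()

    # Decode early: state the encoding again when reading back.
    with open(path, encoding="utf-8") as handle:
        read_back = handle.read()
```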
![Page 12: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/12.jpg)
Related Python modules
● chardet.detect()
● unicodedata
● codecs
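A short Python 3 tour of the two standard-library modules in this list. `chardet` is a third-party package and is left out so the sketch runs without extra installs; the sample character is an assumption for illustration.

```python
import codecs
import unicodedata

char = "\u00e4"  # a with diaeresis

# unicodedata: look up the official character name.
name = unicodedata.name(char)

# unicodedata: NFD splits the character into a base letter plus a combining mark,
# and NFC composes it back into a single code point.
decomposed = unicodedata.normalize("NFD", char)
recomposed = unicodedata.normalize("NFC", decomposed)

# codecs: resolve codec aliases and access BOM constants.
codec_name = codecs.lookup("UTF8").name
utf8_bom = codecs.BOM_UTF8

print(name, len(decomposed), codec_name, utf8_bom)
```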
![Page 13: Unicode basics in python](https://reader038.fdocuments.us/reader038/viewer/2022100500/554fa06ab4c905ad218b49d9/html5/thumbnails/13.jpg)
Thanks for your time
Post your questions.