UNICODE & CONTROL DAY 13 - 9/24/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane...
Unicode & control
Day 13 - 9/24/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
24-Sept-2014, NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Review of Unicode
ASCII characters
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 – – – – – – – – – – – – – – – –
1 – – – – – – – – – – – – – – – –
2 ␣ ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ –
(Rows give the first hexadecimal digit of the code, columns the second. – marks a control character; ␣ marks the space character.)
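As a rough illustration of how the grid above is laid out, the following Python 3 sketch rebuilds the printable rows from the character codes themselves. The slides use Python 2; the helper name ascii_row is hypothetical and only serves this example.

```python
# Python 3 sketch; ascii_row is an illustrative helper, not part of the slides.
def ascii_row(high):
    """Return the 16 characters whose code is high*16 + low, for low = 0..15."""
    return [chr(high * 16 + low) for low in range(16)]

# Header: the second hexadecimal digit of each code.
print('  ' + ' '.join('0123456789ABCDEF'))

# Rows 2 to 7 hold the printable characters; rows 0 and 1 are control codes.
for high in range(2, 8):
    # Show a dash for anything that is not printable (only DEL, 0x7F, here).
    cells = [c if c.isprintable() else '-' for c in ascii_row(high)]
    print(high, ' '.join(cells))
```

Reading the grid is then a matter of joining the row digit and the column digit: 'A' sits in row 4, column 1, so its code is hexadecimal 41.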
6.2.1. Character encoding in Python
Open Spyder
6. Non-English characters: one code to rule them all
6.2.2. What happens when you type a non-ASCII character into a Python console?
>>> import sys
>>> sys.getdefaultencoding()

>>> special = 'ó'
>>> special
'\xc3\xb3'
>>> print special
ó
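The session above is Python 2, where a plain string literal is stored as bytes, which is why one typed character shows up as two escapes. As a hedged sketch for readers on Python 3 (an assumption; the slides themselves use Python 2), the same two bytes only become visible when the character is encoded explicitly:

```python
# Python 3 sketch; the slides use Python 2. Variable names are illustrative.
import sys

special = '\u00f3'                   # the character ó as a Unicode string
as_bytes = special.encode('utf-8')   # its UTF-8 byte sequence

print(sys.getdefaultencoding())      # utf-8 on Python 3
print(len(special))                  # 1: one character
print(as_bytes)                      # b'\xc3\xb3'
print(len(as_bytes))                 # 2: two bytes
```

The escapes \xc3\xb3 in the Python 2 session are exactly these two bytes.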
6.2.3. How to translate into and out of Unicode with decode() and encode()
>>> S1 = 'ca\xc3\xb1\xc3\xb3n'
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> len(uS1)
5
>>> utf8S1 = uS1.encode('utf8')
>>> print utf8S1
cañón
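The round trip above can also be sketched in Python 3, assuming that is the interpreter at hand (the slides use Python 2). There, bytes and str are separate types, so decode() belongs to bytes and encode() to str; the variable names below are illustrative only.

```python
# Python 3 sketch; in Python 2 the plain str type plays the role of bytes.
raw = b'ca\xc3\xb1\xc3\xb3n'        # UTF-8 bytes, same content as S1
text = raw.decode('utf-8')          # bytes -> Unicode string
round_trip = text.encode('utf-8')   # Unicode string -> bytes

print(ascii(text))                  # 'ca\xf1\xf3n'
print(len(raw), len(text))          # 7 5
print(round_trip == raw)            # True
```

The length drops from 7 to 5 because each of the two accented letters takes two bytes in UTF-8 but counts as a single character once decoded.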
6.2.4.1. How to turn on non-ASCII character matching with re.UNICODE
>>> S1 = 'ca\xc3\xb1\xc3\xb3n' # same as before
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> import re
>>> lS1 = re.findall(r'\w{5}', uS1, re.U)
>>> lS1
[u'ca\xf1\xf3n']
>>> eS1 = ''.join(lS1)
>>> eS1
u'ca\xf1\xf3n'
>>> utf8S1 = eS1.encode('utf8')
>>> utf8S1
'ca\xc3\xb1\xc3\xb3n'
>>> print utf8S1
cañón
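For comparison, here is a minimal Python 3 sketch of the same match, offered as an assumption about a newer interpreter rather than as part of the slides. In Python 3, patterns applied to str are Unicode-aware by default, so re.U is no longer needed and the re.ASCII flag is what switches the behaviour off.

```python
# Python 3 sketch; the slides use Python 2, where re.U must be passed explicitly.
import re

word = 'ca\xf1\xf3n'                                 # cañón as a Unicode string

unicode_hits = re.findall(r'\w{5}', word)            # \w matches ñ and ó
ascii_hits = re.findall(r'\w{5}', word, re.ASCII)    # \w limited to [a-zA-Z0-9_]

print(unicode_hits == [word])   # True
print(ascii_hits)               # []
```

With re.ASCII the accented letters break the word into runs shorter than five characters, so nothing matches.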
6.2.5. How to translate between Unicode strings and numbers with ord() and unichr()
>>> 'ó'
'\xc3\xb3'
>>> 'ó'.decode('utf8')
u'\xf3'
>>> ord(u'\xf3')
243
>>> unichr(243)
u'\xf3'
>>> test = unichr(243).encode('utf8')
>>> print test
ó
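A Python 3 counterpart of the session above might look like the sketch below. It assumes Python 3, where unichr() no longer exists and chr() covers the whole Unicode range; the variable names are illustrative.

```python
# Python 3 sketch; chr() here does the job of Python 2's unichr().
code_point = ord('\xf3')       # character -> number
char = chr(243)                # number -> character
test = char.encode('utf-8')    # character -> UTF-8 bytes

print(code_point)              # 243
print(ascii(char))             # '\xf3'
print(test)                    # b'\xc3\xb3'
```

Note that 243 is hexadecimal F3, which is why the character is written \xf3.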
Chapter numbering
I am going to fold the Unicode chapter into §1 & §2 and move the next chapter (§8) up a notch, so the chapter numbering will change.
8. Control
Up to now, your short programs have depended entirely on you to make decisions. This is fine for pieces of text that fit on a single line, but is clearly insufficient for texts that run to hundreds of lines. You will want Python to make decisions for you. How to tell Python to do so is the topic of this chapter and falls under the rubric of control.
8.1. Conditions
The first step in making a decision is to distinguish the cases in which the decision applies from those in which it does not. In computer science, the test that draws this distinction is usually known as a condition.
8.1.1. How to check for the presence of an item with in
Perhaps the simplest condition in text processing is whether an item is present or not. Python handles this in a way that looks a lot like English:
>>> greeting = 'Yo!'
>>> 'Y' in greeting
>>> 'o' in greeting
>>> '!' in greeting
>>> 'o!' in greeting
>>> 'Yo!' in greeting
>>> 'Y!' in greeting
>>> 'n' in greeting
>>> '?' in greeting
>>> '' in greeting
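The queries above are left for you to try in the console. As a sketch of what to expect, written for Python 3 (the slides use Python 2, but in behaves the same way on strings in both), the snippet below collects every answer in one pass; the names probes and results are illustrative.

```python
# Sketch: evaluate each membership test from the slide and record the answer.
greeting = 'Yo!'
probes = ['Y', 'o', '!', 'o!', 'Yo!', 'Y!', 'n', '?', '']

# in on a string asks whether the left side is a contiguous substring.
results = {p: (p in greeting) for p in probes}

for p, found in results.items():
    print(repr(p), 'in greeting ->', found)
```

Two cases deserve attention: 'Y!' is False because the two characters are not adjacent in 'Yo!', and the empty string '' is True because it counts as a substring of every string.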
in & lists
in works on lists much as it does on strings, with the proviso that the string being asked about must match a string in the list exactly; a substring of a list element does not count:
>>> fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']
>>> 'apple' in fruit
>>> 'peach' in fruit
>>> 'app' in fruit
>>> '' in fruit
>>> [] in fruit
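The list queries above can be checked with a short sketch like the following, written for Python 3 (the slides use Python 2, where the results are the same). The variable names on the left are illustrative.

```python
# Sketch: in on a list compares the item with each element as a whole.
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

whole_word = 'apple' in fruit     # True: an element equals 'apple'
missing_word = 'peach' in fruit   # False: no such element
partial_word = 'app' in fruit     # False: substrings of elements do not count
empty_string = '' in fruit        # False: no element is the empty string
empty_list = [] in fruit          # False: no element is an empty list

print(whole_word, missing_word, partial_word, empty_string, empty_list)
```

This is where lists part ways with strings: '' in 'apple' is True, but '' in fruit is False.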
Python can understand sequences of in conditions
>>> 'app' in 'apple' in fruit
# 'app' in 'apple' -> True
# 'apple' in fruit -> True
>>> 'aple' in 'apple' in fruit
>>> 'pea' in 'peach' in fruit
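Python reads a chain such as a in b in c as (a in b) and (b in c), the same rule it applies to chained comparisons like 1 < 2 < 3. The sketch below, written for Python 3 (the slides use Python 2, where chaining works the same way), spells the chains out so the two forms can be compared; the variable names are illustrative.

```python
# Sketch: a chained in is the conjunction of its two links.
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

chain_1 = 'app' in 'apple' in fruit
spelled_1 = ('app' in 'apple') and ('apple' in fruit)    # True and True

chain_2 = 'aple' in 'apple' in fruit
spelled_2 = ('aple' in 'apple') and ('apple' in fruit)   # False and True

chain_3 = 'pea' in 'peach' in fruit
spelled_3 = ('pea' in 'peach') and ('peach' in fruit)    # True and False

print(chain_1, chain_2, chain_3)   # True False False
```

A chain is therefore True only when every link in it is True.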
8.1.2. How to check for the absence of an item with not in
>>> not 'n' in greeting
>>> 'n' not in greeting
>>> 'Y' not in greeting
>>> 'Y!' not in greeting
>>> 'Yo' not in greeting
>>> '' not in greeting
>>> 'apple' not in fruit
>>> 'peach' not in fruit
>>> 'app' not in fruit
>>> '' not in fruit
>>> 'pee' not in 'peach' not in fruit
>>> 'pea' not in 'peach' not in fruit
>>> 'pea' not in 'apple' not in fruit
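A few of the queries above are worked through in the sketch below, written for Python 3 (the slides use Python 2, where the results are the same). It assumes the earlier definitions of greeting and fruit; the other variable names are illustrative.

```python
# Sketch: not in is the negation of in, and it chains the same way in does.
greeting = 'Yo!'
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

prefix_not = not 'n' in greeting          # parsed as not ('n' in greeting)
infix_not = 'n' not in greeting           # same meaning, reads more naturally
empty_absent = '' not in greeting         # False: '' is in every string
partial_absent = 'app' not in fruit       # True: 'app' is not a whole element

# A chain is the conjunction of its links:
# ('pee' not in 'peach') and ('peach' not in fruit) -> True and True
chain_a = 'pee' not in 'peach' not in fruit
# ('pea' not in 'peach') and ('peach' not in fruit) -> False and True
chain_b = 'pea' not in 'peach' not in fruit
# ('pea' not in 'apple') and ('apple' not in fruit) -> True and False
chain_c = 'pea' not in 'apple' not in fruit

print(prefix_not, infix_not, empty_absent, partial_absent)   # True True False True
print(chain_a, chain_b, chain_c)                             # True False False
```

The two spellings in the first pair always agree; 'n' not in greeting is the more readable choice.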
Next time
More on control