Unicode & control

Day 13 - 9/24/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University

Course organization

24-Sept-2014 NLP, Prof. Howard, Tulane University

http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/

Review of Unicode

ASCII characters

Rows give the high hex digit and columns the low hex digit. A – marks a non-printing control character; the blank cell at 20 is the space character.

  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 – – – – – – – – – – – – – – – –
1 – – – – – – – – – – – – – – – –
2   ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ –
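To read the table programmatically, here is a minimal sketch. It assumes Python 3 (the slides use a Python 2 console), and the helper name ascii_cell is made up for illustration. It turns a row and column into a character and back again:

```python
def ascii_cell(row, col):
    """Return the character at a row (high hex digit) and column (low hex digit)."""
    return chr(row * 16 + col)

# Row 4, column 1 is code point 0x41, i.e. 65
letter_a = ascii_cell(4, 1)
code_of_a = ord('A')

# Going the other way: which cell holds the tilde?
hex_of_tilde = format(ord('~'), '02X')

print(letter_a, code_of_a, hex_of_tilde)   # A 65 7E
```

The first hex digit of the result is the table row and the second is the column, so 7E is row 7, column E.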

6.2.1. Character encoding in Python

Open Spyder

6. Non-English characters: one code to rule them all

6.2.2. What happens when you type a non-ASCII character into a Python console?

>>> import sys
>>> sys.getdefaultencoding()

>>> special = 'ó'
>>> special
'\xc3\xb3'
>>> print special
ó
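For comparison, here is a rough sketch of the same experiment in Python 3. This is an assumption, since the session above is Python 2. In Python 3 a quoted string is already Unicode, and the two bytes shown above only appear once you encode it:

```python
special = '\u00f3'                     # the character ó, written as an escape
utf8_bytes = special.encode('utf-8')   # the two bytes that Python 2 displayed

print(ascii(special))                  # '\xf3'
print(utf8_bytes)                      # b'\xc3\xb3'
print(len(special), len(utf8_bytes))   # 1 2
```

One character, two bytes: that mismatch is why the Python 2 console shows '\xc3\xb3' for a single typed letter.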

6.2.3. How to translate into and out of Unicode with decode() and encode()

>>> S1 = 'ca\xc3\xb1\xc3\xb3n'
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> len(uS1)
5
>>> utf8S1 = uS1.encode('utf8')
>>> print utf8S1
cañón
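The round trip above can be sketched in Python 3 as well. This is an assumption, since the slides use Python 2. There the byte string needs a b prefix, and the variable names raw, text and back are mine:

```python
raw = b'ca\xc3\xb1\xc3\xb3n'    # UTF-8 bytes, the counterpart of S1
text = raw.decode('utf-8')      # bytes -> Unicode string, the counterpart of uS1
back = text.encode('utf-8')     # Unicode string -> bytes, the counterpart of utf8S1

print(len(raw), len(text))      # 7 5
print(ascii(text))              # 'ca\xf1\xf3n'
print(back == raw)              # True
```

The length drops from 7 to 5 because ñ and ó each take two bytes in UTF-8 but count as one character after decoding.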

6.2.4.1. How to turn on non-ASCII character matching with re.UNICODE

>>> S1 = 'ca\xc3\xb1\xc3\xb3n' # same as before
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> import re
>>> lS1 = re.findall(r'\w{5}', uS1, re.U)
>>> lS1
[u'ca\xf1\xf3n']
>>> eS1 = ''.join(lS1)
>>> eS1
u'ca\xf1\xf3n'
>>> utf8S1 = eS1.encode('utf8')
>>> utf8S1
'ca\xc3\xb1\xc3\xb3n'
>>> print utf8S1
cañón
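In Python 3 (an assumption; the session above is Python 2) the situation is reversed: \w matches Unicode letters by default for string patterns, and re.ASCII switches that off. A small sketch, with variable names of my own choosing, shows the difference:

```python
import re

word = b'ca\xc3\xb1\xc3\xb3n'.decode('utf-8')        # the five-letter word from above

unicode_hits = re.findall(r'\w{5}', word)            # Unicode-aware by default
ascii_hits = re.findall(r'\w{5}', word, re.ASCII)    # ASCII-only: no run of 5 letters
ascii_runs = re.findall(r'\w+', word, re.ASCII)      # the word breaks at the accented letters

print(ascii(unicode_hits))   # ['ca\xf1\xf3n']
print(ascii_hits)            # []
print(ascii_runs)            # ['ca', 'n']
```

The ASCII-only result is what the Python 2 pattern would find without re.U: the accented letters do not count as word characters, so the word falls apart.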

6.2.5. How to translate between Unicode strings and numbers with ord() and unichr()

>>> 'ó'
'\xc3\xb3'
>>> 'ó'.decode('utf8')
u'\xf3'
>>> ord(u'\xf3')
243
>>> unichr(243)
u'\xf3'
>>> test = unichr(243).encode('utf8')
>>> print test
ó
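Python 3 has no unichr(); its chr() covers all of Unicode. Assuming Python 3 rather than the Python 2 console used here, the same conversions might look like this:

```python
code = ord('\u00f3')          # character -> number
char = chr(243)               # number -> character (Python 2: unichr)
test = char.encode('utf-8')   # character -> UTF-8 bytes

print(code, hex(code))        # 243 0xf3
print(ascii(char))            # '\xf3'
print(test)                   # b'\xc3\xb3'
```

The number 243 is the same value as the hex escape \xf3 that Python displays for the character.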

Chapter numbering

I am going to fold the Unicode chapter into §1 & §2 and move the next chapter (§8) up a notch, so the chapter numbering will change.

8. Control

Up to now, your short programs have depended entirely on you to make the decisions. That is fine for pieces of text that fit on a single line, but it is clearly insufficient for texts that run to hundreds of lines. You will want Python to make decisions for you. How to tell Python to do so is the topic of this chapter, and it falls under the rubric of control.

8.1. Conditions

The first step in making a decision is to distinguish those cases in which the decision applies from those in which it does not. In computer science, this is usually known as a condition.

8.1.1. How to check for the presence of an item with in

Perhaps the simplest condition in text processing is whether an item is present or not. Python handles this in a way that looks a lot like English:

>>> greeting = 'Yo!'
>>> 'Y' in greeting
>>> 'o' in greeting
>>> '!' in greeting
>>> 'o!' in greeting
>>> 'Yo!' in greeting
>>> 'Y!' in greeting
>>> 'n' in greeting
>>> '?' in greeting
>>> '' in greeting
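To check your answers to the lines above, here is a short sketch (written for Python 3, though in behaves the same way on Python 2 strings). The names tests and results are mine:

```python
greeting = 'Yo!'
tests = ['Y', 'o', '!', 'o!', 'Yo!', 'Y!', 'n', '?', '']

# For strings, in asks whether the left side is a contiguous substring
results = {item: item in greeting for item in tests}

for item in tests:
    print(repr(item), results[item])
# 'Y' True, 'o' True, '!' True, 'o!' True, 'Yo!' True,
# 'Y!' False, 'n' False, '?' False, '' True
```

Two cases are worth a second look: 'Y!' is False because the two characters are not adjacent in 'Yo!', and the empty string is True because it is a substring of every string.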

in & lists

Lists work much like strings with in, with the proviso that the item being asked about must match an element of the list exactly:

>>> fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']
>>> 'apple' in fruit
>>> 'peach' in fruit
>>> 'app' in fruit
>>> '' in fruit
>>> [] in fruit
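A sketch of the expected answers for the list case (Python 3 syntax; the variable names other than fruit are mine):

```python
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

# For lists, in compares the left side against each whole element
whole_word = 'apple' in fruit     # True: an exact element
missing_word = 'peach' in fruit   # False: not in the list
partial_word = 'app' in fruit     # False: substrings of elements do not count
empty_string = '' in fruit        # False: no element is the empty string
empty_list = [] in fruit          # False: no element is an empty list

print(whole_word, missing_word, partial_word, empty_string, empty_list)
# True False False False False
```

Note the contrast with strings: '' in 'Yo!' is True, but '' in fruit is False, because list membership tests whole elements rather than substrings.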

Python can understand sequences of in conditions

>>> 'app' in 'apple' in fruit
# 'app' in 'apple' -> True
# 'apple' in fruit -> True
>>> 'aple' in 'apple' in fruit
>>> 'pea' in 'peach' in fruit
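Python reads a sequence of in conditions as a chained comparison: each adjacent pair is tested, and the results are joined with and. The sketch below (Python 3 syntax; variable names other than fruit are mine) spells that out and contrasts it with a parenthesized version, which means something different:

```python
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

chained = 'app' in 'apple' in fruit
spelled_out = ('app' in 'apple') and ('apple' in fruit)
print(chained, spelled_out)                 # True True

# The first link fails: 'aple' is not a substring of 'apple'
print('aple' in 'apple' in fruit)           # False

# The second link fails: 'peach' is not an element of fruit
print('pea' in 'peach' in fruit)            # False

# Parentheses break the chain: this asks whether True is an element of fruit
grouped = ('app' in 'apple') in fruit
print(grouped)                              # False
```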

8.1.2. How to check for the absence of an item with not in

>>> not 'n' in greeting
>>> 'n' not in greeting
>>> 'Y' not in greeting
>>> 'Y!' not in greeting
>>> 'Yo' not in greeting
>>> '' not in greeting
>>> 'apple' not in fruit
>>> 'peach' not in fruit
>>> 'app' not in fruit
>>> '' not in fruit
>>> 'pee' not in 'peach' not in fruit
>>> 'pea' not in 'peach' not in fruit
>>> 'pea' not in 'apple' not in fruit
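A sketch of how a few of these evaluate (Python 3 syntax; the variable names other than greeting and fruit are mine). The chained forms follow the same rule as chained in: each link is tested and the results are joined with and.

```python
greeting = 'Yo!'
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

# Two spellings of the same test; 'n' not in greeting is the usual one
prefix_form = not 'n' in greeting
infix_form = 'n' not in greeting
print(prefix_form, infix_form)                 # True True

print('Y!' not in greeting)                    # True: 'Y!' is not a substring
print('' not in greeting, '' not in fruit)     # False True

# Chains: (left not in middle) and (middle not in right)
both_absent = 'pee' not in 'peach' not in fruit     # True and True
first_fails = 'pea' not in 'peach' not in fruit     # False: 'pea' is in 'peach'
second_fails = 'pea' not in 'apple' not in fruit    # False: 'apple' is in fruit
print(both_absent, first_fails, second_fails)  # True False False
```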

Next time

More on control
