1

String and Data Processing

2

Sequential Processing

Processing each element in a sequence

for e in [1,2,3,4]:
    print e

for c in "hello":
    print c

for e in (1,2,3,"Name"):
    print e

3

List Comprehension

Use a list comprehension to create a new list by applying a condition or a mapping.

[ x for x in [1,2,3] ]  # [1,2,3]

[ x*x for x in [1,2,3] ]  # [1,4,9]

[ x for x in [1,2,3,…,10] if x%2 == 0 ]  # [2,4,6,8,10]
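A minimal sketch of the three comprehensions above. The variable names are illustrative and not from the slides, and range(1, 11) stands in for the elided list [1,2,3,…,10]. Each print takes a single argument so the code runs on Python 2 and Python 3 alike.

```python
# Illustrative names; range(1, 11) stands in for [1,2,3,…,10].
numbers = list(range(1, 11))

same = [x for x in [1, 2, 3]]                # identity mapping
squares = [x * x for x in [1, 2, 3]]         # mapping
evens = [x for x in numbers if x % 2 == 0]   # condition (filter)

print(same)     # [1, 2, 3]
print(squares)  # [1, 4, 9]
print(evens)    # [2, 4, 6, 8, 10]
```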

4

Operation with data records

L = [ (1, 3), (1, 4), (1, 5), (2, 1), (2, 2) ]

Each tuple has (<group id>, <number>).

How do we count the number of records in group 1?

Create a list with the records from group 1.

L0 = [ x for x in L if x[0] == 1 ]

Then count its elements with len(L0).

How do we sum the numbers from group 1?

5

Operation with data records

Create another list that holds only the numbers of the records from group 1.

N0 = [ x[1] for x in L if x[0] == 1 ]

Then, apply the sum() built-in function.

sum(N0)  # 12
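The counting and summing steps can be combined into one helper. This is only a sketch: count_and_sum is a hypothetical name, not part of the slides, and it unpacks each tuple instead of indexing with x[0] and x[1].

```python
# Records from the slide: (<group id>, <number>)
L = [(1, 3), (1, 4), (1, 5), (2, 1), (2, 2)]

def count_and_sum(records, group_id):
    """Return (count, total) for the records of one group (hypothetical helper)."""
    numbers = [number for (gid, number) in records if gid == group_id]
    return len(numbers), sum(numbers)

print(count_and_sum(L, 1))  # (3, 12)
print(count_and_sum(L, 2))  # (2, 3)
```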

What other built-in functions work on a sequence?

6

Built-in Functions for a sequence

To compute the maximum, use the max(L) function

max([1,2,3,4,5,4,3,2,1])  # 5

To compute the minimum, use the min(L) function

min([1,2,3,4,5,4,3,2,1])  # 1

To create a sorted sequence, use the sorted(L) function

sorted([1,2,3,4,5,4,3,2,1])  # [1,1,2,2,3,3,4,4,5]

7

Built-in Functions for a sequence

What if we have to deal with the inner product?

The zip(L1,L2,…,Ln) function will help!

zip([1,2,3],[4,5,6])  # [ (1,4), (2,5), (3,6) ]

8

Built-in Functions for a sequence

How do we use it with a loop?

Use packing & unpacking!

for (a,b) in zip([1,2,3],[4,5,6]):
    print a,b

This prints the pairs 1 4, 2 5, and 3 6, one pair per loop iteration.

9

Built-in Functions for a sequence

What if you need an index in a for statement?

Use the enumerate() function

for (i, e) in enumerate(["Tom", "Jack", "Bob"]):
    print i, e

This will print

0 Tom
1 Jack
2 Bob
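By default enumerate() counts from 0. Its optional second argument sets the first index, which is one way to get 1-based numbering. A small sketch with illustrative variable names:

```python
names = ["Tom", "Jack", "Bob"]

# Default: indices start at 0.
zero_based = list(enumerate(names))

# Optional second argument: the starting index.
one_based = list(enumerate(names, 1))

for (i, e) in one_based:
    print("%d %s" % (i, e))
```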

10

Built-in Functions for a sequence

Functions:

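min(<sequence>): the minimum element

max(<sequence>): the maximum element

sum(<numeric list>, <number>): the sum of the elements, added to the optional start value <number> (default 0)

zip(<sequence>, <sequence>, …): a list of tuples pairing the elements at the same position; if the lengths differ, it stops at the shortest sequence

enumerate(<sequence>): tuples (<index>, <element>), with the index counting 0,1,2,…, one for each element of the given sequence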

11

Built-in Functions for a sequence

Packing and unpacking are useful for handling multiple values in one operation.

Collecting each element whose index is a multiple of 2

[ x for (j,x) in enumerate([4,5,6]) if j%2 == 0 ]  # [4, 6]

Computing the inner product

sum( [ a*b for (a,b) in zip((1,2),(1,2)) ])  # 5
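The inner product can be wrapped in a function. In this sketch inner_product is a hypothetical helper name. The length check is an added safeguard, since zip() silently stops at the shorter sequence.

```python
def inner_product(u, v):
    """Inner product of two equal-length sequences (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a * b for (a, b) in zip(u, v))

print(inner_product((1, 2), (1, 2)))        # 5
print(inner_product([1, 2, 3], [4, 5, 6]))  # 32
```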

12

Playing with Real Data

13

File

Hopefully, you didn’t forget how to read a file.

file = open(<path-to-a-file>,<read or write>)

lines = file.readlines()

file.close()
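An alternative to calling close() by hand is the with statement, which closes the file even when reading fails. The sketch below first writes a small throwaway file so that it is self-contained. The file contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("first line\nsecond line\n")

# The file is closed automatically at the end of the with block.
with open(path, "r") as f:
    lines = f.readlines()

os.remove(path)
print(lines)  # ['first line\n', 'second line\n']
```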

Data processing essentially means dealing with a list of lines.

However, you need a clear picture of the structure of your data.

15

Data Processing

Pre-existing data is mostly a subject of statistical analysis.

Basic description: count, sum, min, max, set operations

Descriptive statistics such as mean, variance

Information Visualization

Comparative Analysis

Cross correlation, Hypothesis testing

Modeling and Validation

Linear regression via Least Squares
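Of the items above, the mean and variance need only sum() and len(). The sketch below assumes the population variance (dividing by n). A sample variance would divide by n - 1 instead. The function names and the sample values are illustrative.

```python
def mean(values):
    # float() keeps the division exact on Python 2 as well
    return sum(values) / float(len(values))

def variance(values):
    """Population variance: the mean squared deviation from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / float(len(values))

sample = [40, 41, 52, 45]  # made-up response times
print(mean(sample))      # 44.5
print(variance(sample))  # 22.25
```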

16

Pythonic Way for Descriptive Statistics

17

Sternberg’s experiment

Do people process a set of numbers in parallel or sequentially?

18

Data that we have

In Excel, we have

19

Data that we have

You can download the previous data as a text file

https://www.cs.unc.edu/~joohwi/comp116/ResponseTime.txt

Download the file into your project directory

Let’s make a list of tuples whose type is

(<number>, <number>, <number>, <number>)

0th element: the id of each trial

1st element: the response time in 1/100 sec

2nd element: the number of digits

3rd element: 1 if the digit is included, 2 if not

20

Transforming data

Let’s make the data more readable.

For example,

21

The structure of data

The first (uppermost) group: the number of digits (1/3/5)

The second group: the presence of the digit in the given number (Y/N)

Let’s make a hierarchical structure

22

The structure of data

1
    Y: 40,41,…
    N: 52,45,…
3
    Y: 73,83,…
    N: 73,47,…
5
    Y: 39,65,…
    N: 66,53,…

23

The structure of data

Make a tuple of

(1st level data, 2nd level data, 3rd level data)

(1,1,40) (3,1,73) (5,1,39)

(1,2,45) (3,2,73) (5,2,66)

… … …

24

The structure of data

Read a list of strings from the file

Strip whitespace using the strip() method

Separate each line into a list of four words using split()

Create a list of tuples with a list comprehension

25

Reading data step by step

file = open("<path-to-file>", "r")
lines = file.readlines()
file.close()

lines = [ l.strip() for l in lines ]
words = [ l.split() for l in lines ]

26

Reading data step by step

data = [ (int(w[1]), int(w[2]), int(w[3])) for w in words ]
print data  # [ (1,1,40), (1,1,41), … ]
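The whole reading pipeline can be tried without the data file. The sketch below uses three made-up lines and assumes the column order id, number of digits, presence, response time, which is what the tuple (1,1,40) built from w[1], w[2], w[3] suggests. The actual file may differ.

```python
# Made-up sample lines; assumed column order: id, digits, presence, time.
raw_lines = [
    "1 1 1 40\n",
    "2 1 1 41\n",
    "3 3 2 73\n",
]

lines = [l.strip() for l in raw_lines]   # drop surrounding whitespace
words = [l.split() for l in lines]       # four words per line
data = [(int(w[1]), int(w[2]), int(w[3])) for w in words]

print(data)  # [(1, 1, 40), (1, 1, 41), (3, 2, 73)]
```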

28

Now, we have data

Now, let’s make a list of strings, each containing 15 reaction times.

The first step is to split a list into sublists of 15 elements. For example,

L = [ 1,2,3,4,5,…,100 ]

D = [ [1,2,3,4,…,15], [16,17,18,…,30], [31,32,…,45], … ]

How could we do that?

There are many ways to do it.

29

Collection by counting

1. Create a counter variable and collect a sublist for every 15 elements.

D = []
for j in range(0, len(L), 15):
    S = []
    # stop at len(L) so a short final group does not raise IndexError
    for k in range(j, min(j+15, len(L))):
        S.append(L[k])
    D.append(S)

30

Collection by counting

2. Use a list comprehension

D = [ [ L[k] for k in range(j, min(j+15, len(L))) ] \
      for j in range(0, len(L), 15) ]

3. Use list slicing to replace the inner loop

D = [ L[j:j+15] for j in range(0, len(L), 15) ]

4. Convert each inner list into a string (join() needs strings, so convert each number with str())

output = [ " ".join([ str(x) for x in sublist ]) for sublist in D ]
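A runnable sketch of the slicing approach with the group size as a parameter. The name chunks is hypothetical, and list(range(1, 101)) stands in for the elided list [1,2,3,4,5,…,100]. Slicing never runs past the end of the list, so the last sublist is simply shorter.

```python
def chunks(L, size):
    """Split L into consecutive sublists of at most `size` elements."""
    return [L[j:j + size] for j in range(0, len(L), size)]

L = list(range(1, 101))   # stands in for [1,2,3,4,5,…,100]
D = chunks(L, 15)

print(len(D))       # 7: six full sublists and a final one of 10 elements
print(D[0][:3])     # [1, 2, 3]

# join() accepts only strings, so convert each number first.
output = [" ".join(str(x) for x in sublist) for sublist in D]
print(output[-1])   # 91 92 93 94 95 96 97 98 99 100
```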

31

Repeat for each group

Before repeating the process, define a function to give the piece of code a name.

def format_data(L):
    D = [ L[j:j+15] for j in range(0, len(L), 15) ]
    return "<BR>".join([ " ".join([ str(x) for x in sublist ])
                         for sublist in D ])

The signature of our function is

format_data(<list>) -> <string>

32

Creating a report

Let’s make an HTML report

html_template = "<table>… %s … %s … %s …</table>"
fo = open("<path-to-html-file>", "w")
o1 = format_data(L1)
o2 = format_data(L2)
fo.write(html_template % (o1, o2, o3, …, ))
fo.close()

Open the file with your browser.
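A self-contained sketch of the report step. The real template and data lists are elided in the slides, so the two-cell template, L1, L2, and the temporary output file here are all made up for illustration.

```python
import os
import tempfile

def format_data(L):
    D = [L[j:j + 15] for j in range(0, len(L), 15)]
    return "<BR>".join(" ".join(str(x) for x in sublist) for sublist in D)

# Hypothetical template and data, one %s placeholder per group.
html_template = "<table><tr><td>%s</td><td>%s</td></tr></table>"
L1 = list(range(1, 18))
L2 = [40, 41, 52]

fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as fo:
    fo.write(html_template % (format_data(L1), format_data(L2)))

# Read the report back to check what was written, then clean up.
with open(path) as fi:
    report = fi.read()
os.remove(path)
print(report)
```

The number of %s placeholders must match the number of values in the tuple, otherwise the % operator raises TypeError.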

33

Textual Visualization

More informative visualization by frequency

counting.

Make a histogram!

34

Multiple countings

Counting several things at once is a classic problem, solved with another data structure called a dictionary or an associative array.

Make a tuple (key, value)

A list holds multiple such tuples while keeping each key unique within the list.

35

Dictionary

Whenever inserting a tuple, test whether the key already exists

If the key exists, overwrite the value

If not, append the tuple to the list

Inserting (1, 3), (2, 4), (3, 1), (1, 4) in this order:

(1,3) is overwritten by (1,4), leaving (1,4) (2,4) (3,1)
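The insertion rule can be sketched directly on a list of tuples. This illustrates the idea only. It is not how Python implements its built-in dictionary, and insert is a hypothetical helper name.

```python
def insert(pairs, key, value):
    """Insert (key, value) into a list of tuples, keeping keys unique."""
    for i, (k, _) in enumerate(pairs):
        if k == key:
            pairs[i] = (key, value)   # key exists: overwrite the value
            return
    pairs.append((key, value))        # new key: append the tuple

pairs = []
for (k, v) in [(1, 3), (2, 4), (3, 1), (1, 4)]:
    insert(pairs, k, v)

print(pairs)  # [(1, 4), (2, 4), (3, 1)]
```

Every insertion scans the list from the start, which is slow for many keys. The built-in dictionary avoids that scan.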

36

Dictionary

A dictionary is created with the {} constructor.

For example

data = { } # an empty dictionary
data = { 1: 3, 2: 4 }

A key and value pair is written as key:value within the constructor

37

Dictionary

Accessing an element requires its key

The bracket [] operator takes a key

data = { 1: 3, 2: 4 }

print data[1] # will print 3

print data[2] # will print 4

Don’t confuse this with sequence indexing: data[1] looks up the key 1, not the element at position 1.

38

Dictionary

A value can be of any type, and a key can be of any hashable (immutable) type, such as a number, a string, or a tuple of those!

data = { "Tom": "Cruise", 1:3, (0,2):(3,4) }

This is different from a sequence.

print data["Tom"]   # will print Cruise
print data[1]       # will print 3
print data[(0,2)]   # will print (3,4)

39

Dictionary

Accessing a key that is not in the dictionary raises KeyError (here d is the dictionary from the previous slide)

print d["Nicole"]
KeyError: 'Nicole'

The in operator tests whether a key exists

print "Tom" in d     # True
print "Nicole" in d  # False
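Besides testing with in, the dictionary method get() returns a default value instead of raising KeyError. A short sketch of the three ways to handle a missing key. The variable names are illustrative.

```python
d = {"Tom": "Cruise", 1: 3, (0, 2): (3, 4)}

# 1. Test with `in` before indexing.
if "Nicole" in d:
    surname = d["Nicole"]
else:
    surname = None

# 2. Let get() return a default instead of raising KeyError.
same = d.get("Nicole")            # None when no default is given
fallback = d.get("Nicole", "?")   # the given default

# 3. Catch the KeyError.
try:
    d["Nicole"]
    found = True
except KeyError:
    found = False

print(fallback)  # ?
```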

40

Dictionary

A list of its keys is obtained with the keys() method

print d.keys()
>>> [ 1, (0,2), "Tom" ]

Note that the insertion order is not guaranteed to be preserved

It depends on the implementation and version of Python (since Python 3.7, dictionaries do preserve insertion order)

41

Dictionary

A list of its values is obtained with the values() method

print d.values()
>>> [ 3, (3,4), "Cruise" ]

Note that the insertion order is not guaranteed to be preserved

It depends on the implementation and version of Python

42

Dictionary

The for loop works well with the dictionary type; it iterates over the keys

for k in d:
    print "(", k, ",", d[k], ")"

There is dictionary comprehension as well!

43

Dictionary

{ j:j+1 for j in range(0,3) }

>>> { 0:1, 1:2, 2:3 }

{ j:e for (j,e) in enumerate(L) }

>>> { 0:L[0], 1:L[1], …, n:L[n] }

{ e:[] for e in L } # assume L = [ 1, 1, 2 ]

>>> { 1:[], 2:[] }

44

Frequency counting

Make each response time value a key of the counts

L = [ (1,1,40), (1,1,41), … ]

F = { e[2]: 0 for e in L }

Each value is initialized with 0

Duplicate keys are merged automatically!

45

Frequency counting

L = [ (1,1,40), (1,1,41), … ]

F = { 36:0, 37:0, …, }

Loop over each element in L and update F[key]

for e in L:
    F[ e[2] ] += 1

F = {36: 1, 37: 1, … }
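Putting the two steps together gives a textual histogram. The records below are made up for illustration, and the row format (a right-aligned value followed by one star per occurrence) is only one possible layout. Iterating over sorted(F) makes the row order independent of how the dictionary stores its keys.

```python
# Made-up records: (digits, presence, response time)
L = [(1, 1, 40), (1, 1, 41), (1, 2, 40), (3, 1, 73), (3, 2, 40)]

F = {e[2]: 0 for e in L}     # every distinct response time starts at 0
for e in L:
    F[e[2]] += 1             # count one occurrence

# One row per response time, in sorted key order, one '*' per occurrence.
histogram = ["%3d %s" % (t, "*" * F[t]) for t in sorted(F)]
print("\n".join(histogram))
```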

46

Frequency counting

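Interestingly, the keys appear in increasing order!

The reason is the implementation, but this order is not guaranteed.

A Python dictionary is a hash table, which finds a key efficiently without keeping the keys sorted.

Why are they ordered here? A small integer is its own hash value, so in CPython 2 small integer keys often end up in increasing order.

To get the keys in sorted order reliably, loop over sorted(F).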