1
String and Data Processing
2
Sequential Processing
Processing each element in a sequence
for e in [1,2,3,4]:
    print e
for c in "hello":
    print c
for e in (1,2,3,"Name"):
    print e
3
List Comprehension
Use a list comprehension to create a new list by filtering
or mapping an existing sequence.
[ x for x in [1,2,3] ] → [1,2,3]
[ x*x for x in [1,2,3] ] → [1,4,9]
[ x for x in [1,2,3,…,10] if x%2 == 0 ] → [2,4,6,8,10]
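To make the filtering and mapping roles concrete, here is a small sketch that builds the same list with a comprehension and with an explicit loop. The names numbers, even_squares, and loop_result are illustrative and not from the slides; the code uses the print() function so it also runs on Python 3.

```python
# Filter (keep even numbers) and map (square them) in one comprehension.
numbers = list(range(1, 11))          # 1 through 10
even_squares = [x * x for x in numbers if x % 2 == 0]

# The same result written as an explicit loop.
loop_result = []
for x in numbers:
    if x % 2 == 0:
        loop_result.append(x * x)

print(even_squares)   # [4, 16, 36, 64, 100]
```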
4
Operation with data records
L = [ (1, 3), (1, 4), (1, 5), (2, 1), (2, 2) ]
Each tuple has (<group id>, <number>).
How do we count the records in group 1?
First, create a list of the records from group 1.
L0 = [ x for x in L if x[0] == 1 ]
Then count its elements with len(L0).
How do we sum the numbers in group 1?
5
Operation with data records
Create another list that holds only the number from each
group 1 record.
N0 = [ x[1] for x in L if x[0] == 1 ]
Then apply the built-in sum() function.
sum(N0) → 12
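The counting and summing steps can be combined into one helper. This is only a sketch: the function name group_count_and_sum is hypothetical, and it is written to run on Python 3 as well.

```python
# Records are (group id, number) tuples, as on the slide.
L = [(1, 3), (1, 4), (1, 5), (2, 1), (2, 2)]

def group_count_and_sum(records, group_id):
    """Return (count, total) for the records whose group id matches."""
    numbers = [number for (gid, number) in records if gid == group_id]
    return len(numbers), sum(numbers)

count1, total1 = group_count_and_sum(L, 1)
print(count1)   # 3
print(total1)   # 12
```

A group id that never occurs yields (0, 0), because len() and sum() of an empty list are both 0.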
What other built-in functions work on a sequence?
6
Built-in Functions for a sequence
To compute the maximum, use the max(L) function
max([1,2,3,4,5,4,3,2,1]) → 5
To compute the minimum, use the min(L) function
min([1,2,3,4,5,4,3,2,1]) → 1
To create a sorted sequence, use the sorted(L) function
sorted([1,2,3,4,5,4,3,2,1]) → [1,1,2,2,3,3,4,4,5]
7
Built-in Functions for a sequence
What if we have to deal with the inner product?
The zip(L1,L2,…,Ln) function will help!
zip([1,2,3],[4,5,6]) → [ (1,4), (2,5), (3,6) ]
8
Built-in Functions for a sequence
How do we use zip() in a loop?
Use packing and unpacking!
for (a,b) in zip([1,2,3],[4,5,6]):
    print a,b
This prints the pairs 1 4, 2 5, and 3 6, one pair per line.
9
Built-in Functions for a sequence
Need an index inside a for statement?
Use the enumerate() function
for (i, e) in enumerate(["Tom", "Jack", "Bob"]):
    print i, e
This will print
0 Tom
1 Jack
2 Bob
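enumerate() counts from 0 by default. When numbering from 1 is wanted, a start value can be passed as the second argument. A short sketch (the variable names are illustrative, and print() is used so it runs on Python 3 too):

```python
names = ["Tom", "Jack", "Bob"]

# Default: indices start at 0.
zero_based = list(enumerate(names))

# Passing a start value of 1 numbers the items from 1 instead.
one_based = list(enumerate(names, 1))

for (i, e) in one_based:
    print("%d %s" % (i, e))
```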
10
Built-in Functions for a sequence
Functions:
min(<sequence>): the minimum element
max(<sequence>): the maximum element
sum(<numeric list>, <number>): the sum of the elements plus
<number> (optional, 0 by default)
zip(<sequence>, <sequence>, …): tuples that pair up the
elements at the same position in each sequence
enumerate(<sequence>): tuples that each hold an index
(0,1,2,…) and the element at that index in
the given sequence.
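A few calls that exercise the less obvious parts of this summary, namely the start value of sum() and what zip() does with inputs of different lengths. The variable names are illustrative; list() is wrapped around zip() so the result is a list on Python 3 as well.

```python
# sum() adds the optional second argument to the total (default 0).
total = sum([1, 2, 3])                # 6
total_from_10 = sum([1, 2, 3], 10)    # 16

# zip() stops at the shortest input, so extra items are silently dropped.
pairs = list(zip([1, 2, 3], ["a", "b"]))   # [(1, 'a'), (2, 'b')]

# min() and max() accept any sequence, including strings and tuples.
smallest = min("hello")    # 'e'
largest = max((3, 9, 4))   # 9
```

Because of the truncation, it is worth checking the lengths first when the sequences are expected to match.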
11
Built-in Functions for a sequence
Packing and unpacking are useful for handling
multiple values in one operation.
Collecting each element whose index is a multiple of 2
[ x for (j,x) in enumerate([4,5,6]) if j%2 == 0 ] → [4, 6]
Computing the inner product
sum( [ a*b for (a,b) in zip((1,2),(1,2)) ]) → 5
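Both one-liners can be wrapped in small functions. The sketch below uses hypothetical names (inner_product, every_other) and adds a length check, since zip() would otherwise ignore trailing elements of the longer sequence.

```python
def inner_product(u, v):
    """Sum of element-wise products of two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a * b for (a, b) in zip(u, v))

def every_other(seq):
    """Elements at even indices 0, 2, 4, ..."""
    return [x for (j, x) in enumerate(seq) if j % 2 == 0]

print(inner_product((1, 2), (1, 2)))   # 5
print(every_other([4, 5, 6]))          # [4, 6]
```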
13
File
Hopefully, you didn’t forget how to read a file.
file = open(<path-to-a-file>,<read or write>)
lines = file.readlines()
file.close()
Data processing is essentially dealing with a list of
lines.
However, you need a clear picture of the
structure of your data.
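As an alternative to calling close() by hand, a with statement closes the file automatically. The sketch below first writes a throwaway file so that it is self-contained; the file contents and variable names are made up for the example.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("first line\nsecond line\n")

# "with" closes the file automatically, even if reading raises an error.
with open(path, "r") as f:
    lines = f.readlines()

os.remove(path)
print(lines)   # ['first line\n', 'second line\n']
```

Note that readlines() keeps the trailing newline on each line, which is why a strip() step is usually needed afterwards.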
15
Data Processing
Pre-existing data is mostly a subject of statistical
analysis.
Basic description: count, sum, min, max, set
operations
Descriptive statistics such as mean, variance
Information Visualization
Comparative Analysis
Cross correlation, Hypothesis testing
Modeling and Validation
Linear regression via Least Squares
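The descriptive statistics mentioned above need only the built-ins covered so far. A minimal sketch, assuming the population variance is wanted (dividing by n rather than n - 1); the function names and sample values are illustrative and not taken from the data file.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))

def variance(values):
    """Population variance: the mean squared deviation from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / float(len(values))

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
print(mean(sample))       # 5.0
print(variance(sample))   # 4.0
```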
19
Data that we have
You can download the previous data in a text file
https://www.cs.unc.edu/~joohwi/comp116/ResponseTime.txt
Download the file into your project directory
Let’s make a list of tuples whose type is
(<number>, <number>, <number>, <number>)
0th element is the id of each trial
1st element is the response time in 1/100 sec
2nd element is the number of digits
3rd element is 1 if the digit is included, 2 if not
21
The structure of data
The first (uppermost) group: the number of digits
(1/3/5)
The second group: the presence of the digit in a given
number (Y/N)
Let’s make a hierarchical structure
23
The structure of data
Make a tuple of
(1st level data, 2nd level data, 3rd level data)
(1,1,40) (3,1,73) (5,1,39)
(1,2,45) (3,2,73) (5,2,66)
… … …
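One practical benefit of putting the outermost level first is that sorting then groups the records by level. The sketch below uses the six tuples shown on the slide, in shuffled order; the variable names are illustrative.

```python
# (number of digits, presence flag, response time) tuples from the slide.
records = [(3, 1, 73), (1, 2, 45), (5, 1, 39), (1, 1, 40), (3, 2, 73), (5, 2, 66)]

# Tuples compare element by element, so sorting orders by digits,
# then by presence, then by response time.
ordered = sorted(records)
print(ordered[0])   # (1, 1, 40)
```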
24
The structure of data
Read a list of strings from the file
Strip whitespace using the strip() function
Separate each line into a list of four words using
split()
Create a list of tuples with list comprehension
25
Reading data step by step
file = open("<path-to-file>", "r")
lines = file.readlines()
file.close()
lines = [ l.strip() for l in lines ]
words = [ l.split() for l in lines ]
26
Reading data step by step
data = [ (int(w[2]), int(w[3]), int(w[1]))
         for w in words ]   # (digits, presence, time)
print data
[ (1,1,40), (1,1,41), …, ]
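The parsing steps can be tried without the data file. The sketch below assumes the column layout described earlier (trial id, response time, number of digits, presence flag); the sample lines and the helper name parse_line are made up for illustration.

```python
# Hypothetical lines in the assumed layout:
# <trial id> <response time> <number of digits> <presence flag>
lines = ["1 40 1 1\n", "2 41 1 1\n", "  3 73 3 2  \n", "\n"]

def parse_line(line):
    """Turn one text line into a (digits, presence, time) tuple."""
    w = line.split()          # split() also discards surrounding whitespace
    return (int(w[2]), int(w[3]), int(w[1]))

# Skip blank lines, which would otherwise fail in parse_line.
data = [parse_line(l) for l in lines if l.strip()]
print(data)   # [(1, 1, 40), (1, 1, 41), (3, 2, 73)]
```

If the real file uses a different column order, only the indices inside parse_line need to change.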
28
Now, we have data
Now, let’s make a list of strings, each holding 15
reaction times.
For example,
L = [ 1,2,3,4,5,…,100 ]
D = [ [ 1,2,3,4,…,15], [16,17,18,…,30], [31,32,
…,45], … ]
How could we do that?
There are many ways.
29
Collection by counting
1. Create a counter variable and collect a sublist of
every 15 elements.
D = []
for j in range(0, len(L), 15):
    S = []
    for k in range(j, min(j+15, len(L))):
        S.append(L[k])
    D.append(S)
30
Collection by counting
2. Use list comprehension
D = [ [ L[k] for k in range(j, min(j+15, len(L))) ] \
      for j in range(0, len(L), 15) ]
3. Use list slicing to replace the inner loop
D = [ L[j:j+15] for j in range(0, len(L), 15) ]
4. Convert each inner list into a string
output = [ " ".join([ str(x) for x in sublist ]) for sublist in D ]
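The slicing approach generalises to any group size. Here is a sketch with a hypothetical helper named chunks; because a slice never runs past the end of the list, a final group shorter than the requested size is handled without extra code.

```python
def chunks(seq, size):
    """Split seq into consecutive sublists of at most `size` items."""
    return [seq[j:j + size] for j in range(0, len(seq), size)]

L = list(range(1, 101))        # 1 through 100, as in the example
D = chunks(L, 15)
print(len(D))        # 7: six full groups and one group of 10
print(len(D[-1]))    # 10
```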
31
Repeat for each group
Before repeating the process, define a function
to give the piece of code a name.
def format_data(L):
    D = [ L[j:j+15] for j in range(0, len(L), 15) ]
    rows = [ " ".join([ str(x) for x in sublist ]) for sublist in D ]
    return "<BR>".join(rows)
The signature of our function is
format_data(<list>) → <string>
32
Creating a report
Let’s make an HTML report
html_template = "<table>… %s … %s … %s …</table>"
fo = open("<path-to-html-file>", "w")
o1 = format_data(L1)
o2 = format_data(L2)
…
fo.write(html_template % (o1, o2, o3, …, ))
fo.close()
Open the file with your browser.
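A complete, runnable version of the report step might look like the sketch below. The two-cell template, the file name report_sketch.html, and the number lists are all made up for illustration; the actual template and data on the slides are elided and are not reproduced here.

```python
import os
import tempfile

def format_data(L):
    """Join numbers into rows of 15, with rows separated by <BR>."""
    D = [L[j:j + 15] for j in range(0, len(L), 15)]
    return "<BR>".join([" ".join([str(x) for x in sublist]) for sublist in D])

# Hypothetical two-cell template and made-up response times.
html_template = "<table><tr><td>%s</td><td>%s</td></tr></table>"
L1 = list(range(40, 60))   # 20 values: two rows
L2 = list(range(60, 70))   # 10 values: one row
html = html_template % (format_data(L1), format_data(L2))

# Write the report into the system's temporary directory.
path = os.path.join(tempfile.gettempdir(), "report_sketch.html")
with open(path, "w") as fo:
    fo.write(html)
```

The tuple after the % operator must contain exactly as many strings as the template has %s placeholders.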
34
Multiple countings
A classic problem that calls for another data
structure: the dictionary, or
associative array.
Make a tuple (key, value).
A list holds multiple tuples while
keeping each key unique within the
list.
35
Dictionary
Whenever inserting a tuple, test whether the key
already exists
If the key exists, overwrite the value
If not, append the tuple to the list
Inserting (1, 3), (2, 4), (3, 1), (1, 4) in this order
leaves (1, 4), (2, 4), (3, 1): the 4 overwrites the 3 stored under key 1.
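The insert-or-overwrite rule can be sketched with a plain list of tuples. This only illustrates the idea; it is not how Python implements its built-in dictionary, and the names insert and table are hypothetical.

```python
def insert(pairs, key, value):
    """Insert (key, value) into a list of tuples, keeping keys unique."""
    for (i, (k, _)) in enumerate(pairs):
        if k == key:
            pairs[i] = (key, value)   # key exists: overwrite the value
            return
    pairs.append((key, value))        # new key: append the tuple

table = []
for (k, v) in [(1, 3), (2, 4), (3, 1), (1, 4)]:
    insert(table, k, v)
print(table)   # [(1, 4), (2, 4), (3, 1)]
```

Every insertion scans the whole list, which becomes slow for large tables; the built-in dictionary avoids that scan.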
36
Dictionary
A dictionary is created with the {} constructor.
For example
data = { } # an empty dictionary
data = { 1: 3, 2: 4 }
Each key and value pair is written as
key:value within the braces
37
Dictionary
Accessing an element requires its key
The bracket [] operator takes a key
data = { 1: 3, 2: 4 }
print data[1] # will print 3
print data[2] # will print 4
Don’t confuse this with indexing: data[1] looks up the key 1, not the element at position 1
38
Dictionary
A value can be of any type, and a key can be of any
hashable type (numbers, strings, tuples of these, …)!
data = { "Tom": "Cruise", 1:3, (0,2):(3,4) }
This is different from a sequence.
print data["Tom"] # will print Cruise
print data[1] # will print 3
print data[(0,2)] # will print (3,4)
39
Dictionary
Accessing a key that does not exist raises
KeyError (here d is the dictionary from the previous slide)
print d["Nicole"]
KeyError: 'Nicole'
The in operator tests whether a key exists
print "Tom" in d # True
print "Nicole" in d # False
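There are three common ways to deal with a key that may be missing: test with in, use the get() method, or catch KeyError. A sketch, with d assumed to be the dictionary from the earlier slide and the other variable names chosen for illustration:

```python
d = {"Tom": "Cruise", 1: 3, (0, 2): (3, 4)}

# Test for the key first...
if "Nicole" in d:
    surname = d["Nicole"]
else:
    surname = None

# ...or use get(), which returns a default instead of raising KeyError.
same = d.get("Nicole")             # None
fallback = d.get("Nicole", "?")    # "?"

# Catching the exception is a third option.
try:
    d["Nicole"]
    raised = False
except KeyError:
    raised = True
```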
40
Dictionary
A list of its keys is obtained with the keys()
method
print d.keys()
→ [ 1, (0,2), "Tom" ]
Note that the insertion order is not
necessarily preserved
It depends on the implementation and version of Python
(since Python 3.7, dictionaries do keep insertion order)
41
Dictionary
A list of its values is obtained with the values()
method
print d.values()
→ [ 3, (3,4), "Cruise" ]
The values come in the same order as the keys from keys(),
which is not necessarily the insertion order
It depends on the implementation and version of Python
42
Dictionary
A for loop works well with the dictionary type; it iterates over the keys
for k in d:
    print "(", k, ",", d[k], ")"
There are dictionary comprehensions as well!
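Looping over a dictionary yields its keys, so the value has to be looked up with d[k] or obtained through items(). A sketch with made-up entries, using print() so it runs on Python 3 as well:

```python
d = {"Tom": 3, "Jack": 5}   # hypothetical entries

# Iterating over a dictionary yields its keys; look the value up with d[k].
pairs_by_key = [(k, d[k]) for k in d]

# items() yields (key, value) tuples that can be unpacked directly.
pairs_by_items = [(k, v) for (k, v) in d.items()]

# Sorting makes the output order predictable on every Python version.
for (k, v) in sorted(d.items()):
    print("(%s, %s)" % (k, v))
```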
43
Dictionary
{ j:j+1 for j in range(0,3) }
→ { 0:1, 1:2, 2:3 }
{ j:e for (j,e) in enumerate(L) }
→ { 0:L[0], 1:L[1], …, n:L[n] } # n is the last index of L
{ e:[] for e in L } # assume L = [ 1, 1, 2 ]
→ { 1:[], 2:[] }
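The last form, one empty list per distinct key, is a handy starting point for grouping records. A sketch using (group id, number) tuples with illustrative values; dictionary comprehensions need Python 2.7 or later.

```python
# Records are (group id, number) tuples; the values are illustrative.
records = [(1, 3), (1, 4), (1, 5), (2, 1), (2, 2)]

# One empty list per distinct group id; duplicate keys collapse into one.
groups = {gid: [] for (gid, _) in records}

# Append each number to the list of its group.
for (gid, number) in records:
    groups[gid].append(number)

print(groups[1])   # [3, 4, 5]
print(groups[2])   # [1, 2]
```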
44
Frequency counting
Make each response time value a key of the
count dictionary
L = [ (1,1,40), (1,1,41), … ]
F = { e[2]: 0 for e in L }
Each value is initialized to 0
Duplicate keys are merged
automatically!
45
Frequency counting
L = [ (1,1,40), (1,1,41), … ]
F = { 36:0, 37:0, …, }
Loop over each element in L and update
F[key]
for e in L:
    F[ e[2] ] += 1
Afterwards F holds the counts, for example
F = {36: 1, 37: 1, … }
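The two steps (initialise, then count) can be run end to end on a small list. The sketch below uses made-up tuples and also shows collections.Counter from the standard library, which is not used on the slides but produces the same counts in one call.

```python
from collections import Counter

# (digits, presence, response time) tuples; the values are made up.
L = [(1, 1, 40), (1, 1, 41), (1, 2, 40), (3, 1, 40)]

# The two-step approach: initialise every key with 0, then count.
F = {e[2]: 0 for e in L}
for e in L:
    F[e[2]] += 1

# Counter does both steps at once.
C = Counter(e[2] for e in L)

print(F[40])   # 3
print(C[41])   # 1
```

Unlike a plain dictionary, a Counter returns 0 for a key it has never seen instead of raising KeyError.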