
Introduction to Matlab & Data Analysis

Tutorials 8 and 9: Cell Arrays

Advanced Text Processing And File Handling

Please change directory to E:\Matlab (cd E:\Matlab;)

From the course website
(http://www.carine.co.il/htmls/page_1176.aspx?c0=13889&bsp=14333&bssearch=4,0,5,3,41,0)
download t89.zip and unzip it.

Weizmann 2010 ©

Outline

Cell arrays:
  Creating and indexing
  Useful functions for lists of strings
Structures
Advanced string manipulation
  Regular expressions
File handling
  Reading files
  Writing to files
  High-level file handling functions
Final example – P53

Cell Arrays – Lecture Reminders

Cell arrays are used for keeping different types of data in the same array.

For example (notice the curly brackets):

    A{1} = 2;
    A{2} = 4:2:44;
    A{3} = 'hello';

They are extremely useful for handling lists of strings.

[Figure: a cell array whose three cells hold 2, 4:2:44 and 'hello']

Creating Cell Arrays – Lecture Reminder

    A(1) = {3};
    A{2} = 3;
    A{3} = 'radio blabla';
    A{4} = 2:2:66;
    B(1:3) = {3, [1, 2], 'abc'};
    C = {'george clooney'; ...
         'richard gere'};
    % Initializing an empty cell array:
    D = cell(4,2);

    >> A'
    ans =
        [           3]
        [           3]
        'radio blabla'
        [1x33 double]

    C =
        'george clooney'
        'richard gere'

    D =
        []    []
        []    []
        []    []
        []    []

Indexing Cell Arrays

Define a cell array:

    >> A(1) = {3};
    >> A{2} = 3;
    >> A{3} = 'radio blabla';
    >> A{4} = 2:2:66;

(or load A.mat;)

What is the difference between A(1) and A{1}? Parentheses return a cell; curly brackets return the content of the cell. Try:

    >> x = A(1)
    x = [3]
    >> class(x)
    cell

    >> x = A{1}
    x = 3
    >> class(x)
    double

    >> x = A(3)
    x = 'radio blabla'
    >> class(x)
    cell

    >> x = A{3}
    x = radio blabla
    >> class(x)
    char

[Figure: a cell array whose cells hold 3, [1,2,7] and 'Str']

Manipulating Cell Arrays

Just like numerical arrays. Examples:

    x([1,3,5]) = {'aaa','bbb','ccc'}
    x = repmat(x,2,3)
    x(:,4)
    x(1:2,3:5)

    % Notice: indexing several cells with curly brackets returns
    % their contents as separate outputs
    [a, b] = x{1:2}

Unassigned elements of a numerical array default to zero; in a cell array they default to [].
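A minimal sketch of the two points above, using a made-up cell array x: assigning to scattered indices fills the skipped cells with [], and curly-bracket indexing over several cells yields one output per cell.

```matlab
x = {};                                % start from an empty cell array
x([1,3,5]) = {'aaa','bbb','ccc'};      % cells 2 and 4 are created implicitly
gaps_are_empty = isempty(x{2}) && isempty(x{4});

% Curly brackets over two cells give a comma-separated list,
% which can be captured in two output variables.
[a, b] = x{[1 3]};
```

Here a receives 'aaa' and b receives 'bbb', while x itself has grown to a 1x5 cell array.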

Cell Arrays Are Very Useful For Keeping Lists of Strings

Cell arrays of strings can be treated similarly to numerical arrays:
  Many functions work on both numerical and cell arrays
  Many functions that work on strings can also handle cell arrays

    load fruit.mat;
    % fruit = {'mango','banana','melon','apple','kiwi','orange'};
    % fruit_prices = [30 15 10 5 35 8];

What is the price of a melon?

    ind = find(strcmp(fruit,'melon'));
    fruit_prices(ind)

    ans = 10

Sort the fruits from cheapest to most expensive:

    [sorted_p, y] = sort(fruit_prices);
    fruit(y)

    ans = {'apple','orange','melon','banana','mango','kiwi'}


Manipulating Cell Arrays That Hold Lists Of Strings

unique

intersect

setdiff

union
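As a rough illustration of the four functions listed above, the sketch below applies them to two small cell arrays of strings. The lists stock and sold are hypothetical; note that all four functions return their results sorted alphabetically.

```matlab
stock = {'mango','banana','melon','apple'};   % hypothetical list
sold  = {'mango','banana','mango','kiwi'};    % hypothetical list, with repeats

sold_once  = unique(sold);             % repeats removed
in_both    = intersect(stock, sold);   % items that appear in both lists
unsold     = setdiff(stock, sold);     % items in stock that were not sold
everything = union(stock, sold);       % items that appear in either list
```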

Manipulating Cell Arrays That Hold Lists Of Strings - Example

    % fruit = {'mango','banana','melon','apple','kiwi','orange'};
    % fruit_sales = {'mango','banana','melon', ...
    %                'mango','mango','kiwi','banana','mango'};

Which fruits were not sold today?

    setdiff(fruit, unique(fruit_sales))

    ans = {'apple','orange'}

(unique is applied to fruit_sales first for efficiency.)

ismember Function Is Useful For Mapping One List To Another

Finds whether an element exists in a list:

    >> b = {'z','y','x','w'};
    >> a = ismember('x',b)
    a = 1

If it does, ismember can also tell you where it is:

    >> [a,map] = ismember('x',b)
    a = 1, map = 3

ismember is good for mapping one list to another when order is important:

    >> [a,map] = ismember({'x','y','c'},b);
    a = [1 1 0], map = [3 2 0]

Comparing Two Lists of Strings: ismember, find and intersect

Which function to use?

  I want to find the order of the elements of one list in another list: ismember
  I want to find which elements of a list are also in another list: intersect
  I want to find all the occurrences of an element in a list: find

When an element appears in the list more than once, ismember returns only one position: the last one in the MATLAB release this course was written for (2010), and the first one from R2013a onward (unless the 'legacy' flag is given).

Using ismember - Example

    >> a = ismember('banana', fruit_sales)
    a = 1
    >> a = ismember('orange', fruit_sales)
    a = 0
    >> a = ismember(fruit, fruit_sales);
    a = [1 1 1 0 1 0]
    % Reminder: fruit_prices = 30 15 10 5 35 8

Example: calculate the amount of money made by each fruit sale

    >> [a,b] = ismember(fruit_sales, fruit);
    a = [1, 1, 1, 1, 1, 1, 1, 1]
    b = [1, 2, 3, 1, 1, 5, 2, 1]

    >> sales_money = fruit_kilos .* fruit_prices(b)
    sales_money = [90, 30, 10, 60, 240, 17.5, 45, 150]
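The mapping above can be reproduced end to end with the sketch below. The slides never show fruit_kilos, so the values used here are an assumption, back-calculated from the sales_money output; the per-fruit totals via accumarray are an extra step that is not part of the original example.

```matlab
fruit        = {'mango','banana','melon','apple','kiwi','orange'};
fruit_prices = [30 15 10 5 35 8];
fruit_sales  = {'mango','banana','melon','mango','mango','kiwi','banana','mango'};
fruit_kilos  = [3 2 1 2 8 0.5 3 5];    % ASSUMED: kilos sold in each sale

% idx(k) is the position in fruit of the k-th sale
[found, idx] = ismember(fruit_sales, fruit);
sales_money  = fruit_kilos .* fruit_prices(idx);

% Extra step: total money per fruit, in the order of the fruit list
money_per_fruit = accumarray(idx(:), sales_money(:), [numel(fruit) 1])';
```

Because idx maps every sale back to a row of the price list, the same index vector can drive both the per-sale price lookup and the per-fruit aggregation.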


Structures

Lecture Reminder - Structures Creation

    >> dogs.name = 'rufus';
    >> dogs.breed = 'Bulldog';
    >> dogs.age = 1.5; % in years
    >> dogs.special_food = 'none';
    >> dogs
    dogs =
                name: 'rufus'
               breed: 'Bulldog'
                 age: 1.5000
        special_food: 'none'

Lecture Reminder - Structures Creation

Adding more dogs…

    >> dogs(2).name = 'king-kong';
    >> dogs(2).breed = 'Chihuahua';
    >> dogs(2).age = 5;
    >> dogs(2).special_food = 'filet mignon';

    >> dogs(3).name = 'wong';
    >> dogs(3).breed = 'pekingese';
    >> dogs(3).age = 20;
    >> dogs(3).special_food = 'sushi';

    >> dogs
    dogs =
    1x3 struct array with fields:
        name
        breed
        age
        special_food

Structures – Short Example

Define a "fruits" structure array that has the fields name, price and color, and contains two fruits of your choice.

Get:
  A cell array of the names
  An array of the prices
  The first fruit

    >> fruits(1).name = 'Lemon';
    >> fruits(1).color = 'Yellow';
    >> fruits(1).price = 20;
    >> fruits(2).name = 'Apple';
    >> fruits(2).color = 'Green';
    >> fruits(2).price = 10;

    >> {fruits.name}
    'Lemon' 'Apple'
    >> [fruits.price]
    20 10
    >> a = fruits(1)
    a =
         name: 'Lemon'
        color: 'Yellow'
        price: 20
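Building on the exercise above, a short sketch of how the [fruits.price] idiom can be combined with ordinary array functions. Picking the cheapest fruit is an illustrative task that the slides do not ask for.

```matlab
fruits(1).name = 'Lemon';  fruits(1).color = 'Yellow';  fruits(1).price = 20;
fruits(2).name = 'Apple';  fruits(2).color = 'Green';   fruits(2).price = 10;

prices = [fruits.price];               % collect one field into a numeric array
[min_price, idx] = min(prices);        % position of the cheapest fruit
cheapest = fruits(idx).name;           % index back into the structure array
msg = sprintf('%s is the cheapest fruit (%d)', cheapest, min_price);
```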


Structure Advertisement

Although this tutorial focuses on cells:

Using Structures to aggregate variables that belong to the same entity makes the program easier to design, more readable and easier to debug.

Advanced Text Processing (String Manipulation)

1. Review of useful functions:
   1. findstr, strfind, strtok, strtrim
   2. sprintf
2. Regular expressions

Review of Useful Functions For String Manipulation

So far we have learned simple string manipulations:
  str2num, num2str
  strcmp, strncmp, strcmpi, strncmpi

More advanced string manipulation functions (used in text processing):
  findstr, strfind
  strtok
  strtrim
  sprintf (related functions: fprintf, sscanf)

Finding One String Inside Another - findstr and strfind

findstr(str1,str2) searches the longer of the two input strings for any occurrences of the shorter string (input order does not matter!):

    >> k = findstr('beauty is in the eyes of the beholder','be')
    k = [1, 30]

strfind(str1,str2): the order matters. It finds str2 inside str1, and str1 can be a cell array of strings! (Newer MATLAB releases recommend strfind over findstr.)
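A small sketch of strfind applied to a cell array of strings. The list words is hypothetical; the result is a cell array holding, for each string, the positions at which the pattern starts (empty where it does not occur).

```matlab
words = {'beholder','eyes','beauty','maybe'};   % hypothetical list

k = strfind(words, 'be');              % one cell of start positions per word
has_be  = ~cellfun(@isempty, k);       % logical mask: which words contain 'be'
with_be = words(has_be);               % keep only those words
```

Combining strfind with cellfun in this way gives a simple substring filter over a whole list without an explicit loop.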

Example – Parsing a Line Using strtok

Consider the line 'this is an example'. How do we write a program that breaks it into a cell array of single words?

    rem = 'this is an example';
    words = cell(0);
    while 1
        [tok,rem] = strtok(rem);
        if isempty(tok)
            break;
        end
        words{end+1} = tok;
    end

    >> words'
    ans =
        'this'
        'is'
        'an'
        'example'

sprintf – Write Formatted Data Into Strings

sprintf(format,…) writes formatted data into a string.
  Good for creating messages for disp
  Related functions: fprintf, sscanf

Special characters in format:
  %s – a string
  %d – an integer
  %f – a floating-point number

    load fruit.mat;
    for i=1:length(fruit)
        % %d is replaced by the number i, %s by the string fruit{i}
        s = sprintf('Fruit number %d: %s', i, fruit{i});
        disp(s);
    end

    Fruit number 1: mango
    Fruit number 2: banana
    Fruit number 3: melon
    Fruit number 4: apple
    Fruit number 5: kiwi
    Fruit number 6: orange

sprintf - Example

Consider the cell array

    names = {'Danny', 'Noa', 'Moti'};

Write a script that prints:

    Number:1, Name:Danny.
    Number:2, Name:Noa.
    Number:3, Name:Moti.

Answer:

    for i=1:length(names)
        s = sprintf('Number:%d, Name:%s.', ...
            i, names{i});
        disp(s);
    end

See also: sscanf & fprintf
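Two related details, sketched with made-up values: a precision can be added to %f to control the number of decimals, and sscanf reads a value back out of a string using the same kind of format.

```matlab
% %.2f prints a number with exactly two digits after the decimal point
s = sprintf('%s costs %.2f per kilo', 'melon', 10);

% sscanf is the reverse operation: literal text in the format must match,
% and %d extracts the integer that follows it
n = sscanf('Number:12', 'Number:%d');
```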

More Useful String Manipulation Functions

strtrim(str) – removes all leading and trailing white-space

    >> strtrim(' do not blink ')
    'do not blink'

strtok(str,delim) – breaks a string into "tokens"

    >> [tok,rem] = strtok('this is an example', ' ')
    tok = 'this'
    rem = ' is an example'

strfind(str1,str2) – searches for str2 in str1. str1 can be a cell array of strings!

    >> k = strfind('beauty is in the eyes of the beholder','be')
    k = [1, 30]

findstr(str1,str2) – searches the longer of the two input strings for any occurrences of the shorter string

More useful functions at: Help -> Matlab -> Functions by category -> Strings functions


Regular expressions

Regular Expression - Definition

Wikipedia: regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters.

    ind = regexp(long_str,'\w+ain')

Here '\w+ain' is the regular expression. We need to learn the syntax of the regular expression "language".

Regular Expressions Syntax

Defining a pattern: [] is like OR
  Any character out of a, b, c or d: [abcd]
  Anything other than a, b, c or d: [^abcd]
  Character range (all characters a to z): [a-z]

Special characters used in defining a pattern:
  Any character: .
  Whitespace: \s
  Newline: \n
  Tab: \t
  Any alphanumeric character or underscore: \w (same as [a-zA-Z_0-9])
  Any digit: \d (same as [0-9])

Regular Expressions Syntax

Pattern definition - expression quantifiers:
  One or more: exp+ (example: '[\w]+')
  Zero or more: exp*
  Between n and m times: exp{n,m}

Examples:
  '\w\s+\w' – two alphanumeric characters with one or more whitespace characters between them
  '[SRM]amy' – Ramy, Samy or Mamy

Function: loc = regexp(str, pattern)

Read more about "regular expressions" in the MATLAB help (search "regular expressions").
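The syntax elements above can be tried out on a few short strings. The test strings below are made up for illustration; the patterns are the ones from the slide, plus one use of the {n,m} quantifier.

```matlab
% Character set: S, R or M followed by 'amy' ('Tammy' does not match)
names = regexp('Ramy met Samy and Tammy', '[SRM]amy', 'match');

% Quantifier {n,m}: runs of two or three digits (the single '1' is skipped)
runs = regexp('a1 b22 c333', '\d{2,3}', 'match');

% \s+ allows any number of whitespace characters between the two \w
loc = regexp('x   y', '\w\s+\w');
```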

Using Regular Expressions to Search For Pattern Occurrences In a Long String

Example:

    prof_higgins = 'The rain in Spain falls mainly on the plain.';

We would like to find all the words that rhyme with 'ain'.

1. Defining the pattern: one or more alphanumeric characters, followed by 'ain' at the end of the word (that is, followed by whitespace or a period):

    pattern = '\w+ain[\s\.]'
  OR
    pattern = '[a-zA-Z]+ain[\s\.]'

Using Regular Expressions to Search For Pattern Occurrences In a Long String

    >> prof_higgins = ...
       'The rain in Spain falls mainly on the plain.';

Find the indices of the occurrences:

    >> loc = regexp(prof_higgins,'\w+ain')
    loc = [5 13 25 39]

Get the pattern occurrences:

    >> words = regexp(prof_higgins,'\w+ain','match')
    words = {'rain','Spain','main','plain'}

Using Regular Expressions to Replace Pattern Occurrences In a Long String

Replace all pattern occurrences:

    >> eliza_doolittle = regexprep(prof_higgins,'ain','yne')
    eliza_doolittle = 'The ryne in Spyne falls mynely on the plyne.'

Split a line into words (good for parsing lines of an input file):

    >> words = regexp(prof_higgins, '\s', 'split');
    words = {'The','rain','in','Spain','falls','mainly','on','the','plain.'}

Using Regular Expression to Parse a Line (see strtok for another option)

    no_rhymes = regexp(prof_higgins, 'ain\w*\s', 'split')
    no_rhymes =
        {'The r' 'in Sp' 'falls m' 'on the plain.'}

Error: the last word is not followed by a space, so 'plain.' is not split.

Fixing it:

    no_rhymes = regexp(prof_higgins, '\w+ain[\s\.]', 'split')
    no_rhymes =
        {'The ' 'in ' 'falls mainly on the ' ''}

Running Example – Finding Bomb Threats

You are a CIA agent in charge of identifying potential bombing threats to cities by going over the emails of terrorists.

Using Regular Expression to Identify Significant Lines

Assume an email is stored as a cell array of strings (each line in a cell), called "email".

Using a regular expression, identify lines that contain the expression "bomb". When you find such a line, print "HELP!!!".

    load email.mat;
    for i=1:length(email)
        line = email{i};
        if (~isempty(regexp(line,'bomb')))
            disp('HELP!!!');
        end
    end

Using Regular Expression to Identify Significant Lines

Notice there is a "bug" in the code:

    load email.mat;
    for i=1:length(email)
        line = email{i};
        if (~isempty(regexp(line,'bomb')))
            disp(['HELP!!!:' line]);
        end
    end

    HELP!!!:thinking of bombing rehovot
    HELP!!!:thinking of bombing sderot
    HELP!!!:thinking of going to the bombamella festival next week

How do we fix the bug?
Hint: | is "or" inside a group: 'smil(e|ed|ing)'

Using Regular Expression to Identify Significant Lines

Here is a fix for the bug (| is "or"; (?:...) groups the alternatives without capturing them):

    load email.mat;
    for i=1:length(email)
        line = email{i};
        if (~isempty(regexp(line,'[Bb]omb(?:ed|ing|s)?\s')))
            disp(['HELP!!!:' line]);
        end
    end

    HELP!!!:thinking of bombing rehovot
    HELP!!!:thinking of bombing sderot
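One detail is easy to get wrong here: inside square brackets, | is a literal character rather than an OR operator, so a bracketed list of suffixes is really a set of single characters. The sketch below contrasts the two forms on a made-up probe string; the variable names are illustrative only.

```matlab
class_pat = 'bomb[ed|ing|s]*\s';      % [...]: any run of the characters e d | i n g s
group_pat = 'bomb(?:ed|ing|s)?\s';    % (?:...): one of the suffixes ed, ing, s (or none)

probe = 'bombdigs away';              % hypothetical non-word, not a real suffix

class_hit = ~isempty(regexp(probe, class_pat, 'once'));   % set form accepts 'digs'
group_hit = ~isempty(regexp(probe, group_pat, 'once'));   % group form rejects it

% Both forms accept a genuine suffix
real_hit = ~isempty(regexp('thinking of bombing rehovot', group_pat, 'once'));
```

The group form is stricter, which is usually what is wanted when the intent is "one of these endings".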

Regular Expression Tokens Are Used to Retrieve Specific Parts of the Pattern Occurrences

    tokens = regexp( ...
        'bla bla ami@weizmann.ac.il bli bli tami@tau.ac.il ya', ...
        '(\w+)@(\w+)\.ac\.il', 'tokens')

Each pair of parentheses in the pattern defines a token: (\w+) before the @ is token 1 and (\w+) after it is token 2.

    tokens =
        { {'ami', 'weizmann'} {'tami', 'tau'} }

tokens{k} holds the tokens of occurrence k, so tokens{1}{1} = 'ami' (occurrence 1, token 1).

Using Tokens to Retrieve Specific Parts of the Pattern Occurrences

Now that you have identified the suspicious email, take out the threatened city.
Hint: use regexp(line, <some expression>, 'tokens').

    for i=1:length(email)
        line = email{i};
        if (~isempty(regexp(line,'[Bb]omb(?:ed|ing|s)?\s')))
            city = regexp(line, ...
                '[Bb]omb(?:ed|ing|s)?\s+(\w+)', ...
                'tokens');
            disp(['HELP!!! Bomb threat on:' city{1}{1}]);
        end
    end

    HELP!!! Bomb threat on:rehovot
    HELP!!! Bomb threat on:sderot

Using Tokens to Retrieve Specific Parts of the Pattern Occurrences

Here is a loop-less version (regexp can handle a cell array of strings):

    load email.mat;
    cities = regexp(email, '[Bb]omb(?:ed|ing|s)?\s(\w+).*', 'tokens')
    is_threat = ~cellfun('isempty',cities);
    cities = cities(is_threat);
    cities = [cities{:}];   % unwrap the per-line level
    cities = [cities{:}];   % unwrap the per-occurrence level
    warnings = strcat('HELP!!! Bomb threat on: ', cities)
    disp(strvcat(warnings))

    HELP!!! Bomb threat on:rehovot
    HELP!!! Bomb threat on:sderot


Handling Files

Lecture Reminder – Opening and Closing Files

Opening a file for reading:
    fid = fopen('filename','r');
Opening a file for writing:
    fid = fopen('filename','w');

fid is a scalar MATLAB integer, called a file identifier. You use the fid as the first argument to other file input/output routines.

Other permissions: 'a' – append, 'r+' – read and write. More in the HELP…

Always close your file!!!
    fclose(fid);

Lecture Reminder – Reading a File Line by Line

Reading line by line:
    line = fgetl(fid);

How can we read the entire file?

    fid = fopen('names.txt');        % open
    while feof(fid)==0               % feof: has the file reached its end?
        tline = fgetl(fid);          % fgetl: "file get line"
        if ~ischar(tline)            % break if not char
            break;
        end
        tline = strtrim(tline);
        % <do whatever you want>
    end
    fclose(fid);                     % close

Lecture Reminder – Writing to a File

Open the file with writing permission.
Write, line by line, using:
    fprintf(fid,format,…); % similar to sprintf!!!
format is a string with special characters:
    %s – a string, %d – an integer, %f – a floating-point number
Close the file.

Example:

    fid = fopen('tmp.txt', 'w');
    for i=1:length(lines)
        fprintf(fid,'this is a line: %s\n',lines{i});
    end
    fclose(fid);
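To check that the writing and reading loops fit together, here is a small round-trip sketch: it writes two lines to a temporary file, reads them back with fgetl, and deletes the file. The use of tempname and the contents of lines are assumptions made for the illustration.

```matlab
fname = [tempname '.txt'];             % temporary file name (assumption)
lines = {'first', 'second'};           % hypothetical content

% Write: one formatted line per cell
fid = fopen(fname, 'w');
for i = 1:length(lines)
    fprintf(fid, 'this is a line: %s\n', lines{i});
end
fclose(fid);

% Read back, line by line; fgetl returns -1 (not a char) at end of file
fid = fopen(fname, 'r');
read_back = {};
while true
    tline = fgetl(fid);
    if ~ischar(tline)
        break;
    end
    read_back{end+1} = tline;
end
fclose(fid);
delete(fname);                         % clean up the temporary file
```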

File Handling - Example

Open the file names.txt for reading.
Display it with line numbers:
    Line number 1: <line1>
    Line number 2: <line2>
    …
Close the file.

    fid = fopen('names.txt', 'r');
    l_cnt = 0;
    while feof(fid)==0
        line = fgetl(fid);
        if ~ischar(line)
            break;
        end
        l_cnt = l_cnt + 1;
        disp(['Line number ' num2str(l_cnt) ': ' line]);
    end
    fclose(fid);

File Handling - Example

Congratulations! You were just promoted to senior spy.
You have a directory full of email text files.
Now you need to read all the email files, identify the bomb threats, and write them into a summary file, threat_report.txt.

File Handling - Example

Solution strategy:
1. Open the output threats file
2. Go over all the emails in a given directory:
   1. Open an input email file
   2. Read it, line by line
   3. Identify threats. When a threat is identified, print the line
   4. Close the input email file
3. Close the output threats file

File Handling – Example: Programs Design

searchEmailsDirForThreats:
  Open the report output file
  Open a directory and get all the file names
  For each file, run searchEmailForThreats

searchEmailForThreats (inputs: 1. email file name, 2. report output fid):
  Open the email input file
  Search line by line for a threat
  If a threat is found, write it to the output file

File Handling – Example: Main Function Design

    function threats_found = searchEmailsDirForThreats(in_emails_dir, out_report_fname)
    % <getting all files names>
    % <opening report output file>
    % <going over the files>
    % <closing report output file>

File Handling – Example: Main Function Design

    function threats_found = searchEmailsDirForThreats(in_emails_dir, out_report_fname)
    % <getting all files names>
    if (~isdir(in_emails_dir))
        error([in_emails_dir ' is not a directory']);
    end
    % getting file names
    fs = dir(in_emails_dir);
    file_names = {fs.name};

Directory management functions: dir, pwd, cd, copyfile, delete, movefile, mkdir, rmdir, …

File Handling – Example: Main Function Design

    function threats_found = searchEmailsDirForThreats(in_emails_dir, out_report_fname)
    % <getting all files names>
    % <opening report output file>
    out_report_fid = fopen(out_report_fname, 'w');
    if out_report_fid < 0
        error(['File ', out_report_fname, ' could not be opened']);
    end
    threats_found = 0;

File Handling – Example: Main Function Design

    function threats_found = searchEmailsDirForThreats(in_emails_dir, out_report_fname)
    % <going over the files>
    for i=1:length(file_names)
        % build the full path, so that the directory test below does not
        % depend on the current directory
        email_fname = fullfile(in_emails_dir, file_names{i});
        if (~isdir(email_fname))
            threats_found = threats_found + ...
                searchEmailForThreats(out_report_fid, email_fname);
        end
    end
    % <closing report output file>
    fclose(out_report_fid);

File Handling – Example: Looking for Threats In an Email

    function threats_found = searchEmailForThreats(out_report_fid, email_fname)
    % <opening email input file>
    % <going over the file line by line>
    while feof(in_fid) == 0
        % <read line>
        if % <is found threat>
            % <get the threatened city>
            % <adding to the report>
        end
    end
    % <closing input file>

File Handling – Example: Opening File For Read

    function threats_found = searchEmailForThreats(out_report_fid, email_fname)
    % <opening email input file>
    in_fid = fopen(email_fname, 'rt');
    if in_fid < 0
        error(['File ', email_fname, ' was not found.']);
    end
    threats_found = 0;
    l_cnt = 0;

File Handling – Example: Reading a File Line by Line

    function threats_found = searchEmailForThreats(out_report_fid, email_fname)
    % <opening email input file>
    % <going over the file line by line>
    while feof(in_fid) == 0
        % <read line>
        line = fgetl(in_fid);
        if ~ischar(line)
            break;
        end
        l_cnt = l_cnt + 1;
        line = strtrim(line);
        if % <is found threat>
            …
        end
    end

File Handling – Example: Using Regular Expressions to Find and Retrieve Pattern Occurrences

    while feof(in_fid) == 0
        % <read line>
        % <is found threat>
        if (~isempty(regexp(line,'.*bomb.*')))
            city = regexp(line, '.*bomb\w*\s([\w-]+).*', 'tokens');
            % <adding to the report>
            fprintf(out_report_fid,'File: %s, Line number:%d, Threat on %s - %s\n', ...
                email_fname, l_cnt, city{1}{1}, line);
            threats_found = threats_found + 1;
        end
    end
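Indexing city{1}{1} assumes that the token pattern matched. A line that contains "bomb" but no following word would make that indexing fail, so a guarded variant may be safer. The sketch below is one possible approach, not the course's solution; it uses the 'once' option, which returns the tokens of the first match directly, and the example lines are made up.

```matlab
lines = {'thinking of bombing rehovot', ...   % hypothetical input lines
         'the bomb', ...
         'lunch at noon'};

cities = {};
for i = 1:length(lines)
    % With 'once', tok is a cell array of the tokens of the first match,
    % or empty if there is no match
    tok = regexp(lines{i}, 'bomb\w*\s([\w-]+)', 'tokens', 'once');
    if ~isempty(tok)
        cities{end+1} = tok{1};
    end
end
```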

File Handling – Example: Looking For Threats in an Email

    function threats_found = searchEmailForThreats(out_report_fid, email_fname)
    % <opening email input file>
    % <going over the file line by line>
    while feof(in_fid) == 0
        % <read line>
        if % <is found threat>
            % <get the threatened city>
            % <adding to the report>
        end
    end
    % <closing input file>
    fclose(in_fid);


High-Level File Handling Functions

Matlab Has a Collection of High-Level Read / Write Functions

These functions can save the need to read or write the file line by line.

Examples:
  dlmread, dlmwrite
  textread, textscan
  xlsread
  importdata

High-level File Reading Function Example - textread

Reading an entire text file in one line of code:
    lines = textread(filename,format,parameters)

Example: when reading a file containing a single word in every line:
    names = textread('names.txt','%s');

If there are more words in a line, each word will be read separately.

Example 1:
    email = textread('email.txt','%s');
What happens?
    email = {'thinking' 'of' 'bombing' 'rehovot' 'thinking' …}

High-level File Reading Function Example - textread

Example 2: Reading a text file, line by line. Try:

    email = textread('email.txt', '%s', 'delimiter', '\n');

What happens?

    email = {'thinking of bombing rehovot'
             'thinking of bombing sderot'
             'thinking of going to the bombamella festival next week'}
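Newer MATLAB releases recommend textscan over textread. As a hedged sketch of the same line-by-line read, the block below first writes a small temporary file (its name and content are assumptions for the illustration) and then reads it back with textscan, which works on a file identifier and returns a cell array with one cell per format specifier.

```matlab
% Create a small input file so that the example is self-contained
fname = [tempname '.txt'];             % temporary file name (assumption)
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', 'thinking of bombing rehovot', 'thinking of bombing sderot');
fclose(fid);

% Read whole lines: newline is the only delimiter
fid = fopen(fname, 'r');
c = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
delete(fname);

email = c{1};                          % column cell array, one line per cell
```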

MATLAB Functions for High-level File Reading

Reading an entire Excel file in one line:
    [nums,t] = xlsread(filename,options…)
This creates a numerical array nums and a cell array t.

Try:
    [n,t] = xlsread('rt_example3.xls')
What happens?
  Textual cells are set to NaN in n
  Numerical cells are set to '' (empty strings) in t

Note: xlsread can read each sheet separately (read the HELP).

MATLAB Functions for High-level File Reading

Reading an entire Excel file, tab-delimited text file or other preformatted file:
    A = importdata(filename,options…)
This creates a structure A, which contains:
  A.data – a numerical array
  A.textdata – a cell array

Try:
    A = importdata('rt_example3.xls')
What happens?

Summary – File Handling

Matlab has diverse and powerful functions for text processing.

Before you start coding with low-level I/O functions, check whether one of the high-level functions solves the problem.

Final example: Looking for p53 TFBS (Transcription Factor Binding Sites) in human promoters

Looking for p53 TFBS in human promoters

A TF can recognize a variable site:
  Some positions are fixed
  Some are optional, e.g. A/T are acceptable, but not G/C
Consensus sequence: the pattern representing all possible recognized sites.

Looking for p53 TFBS in human promoters

Let's define a consensus for the p53 half-site:
1. Pos #1: G/A/T
2. Pos #2: G/A
3. Pos #3: A/G/C
4. Pos #4: C
5. Pos #5: A/T
6. Pos #6: A/T
7. Pos #7: G
8. Pos #8: N
9. Pos #9: T/C/G
10. Pos #10: T/C

[Figure: two half-sites separated by a variable space of 0-13 positions]
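The consensus above translates almost directly into a regular expression: each position becomes a character set, N becomes ".", and the variable space becomes .{0,13}. The sketch below builds the pattern and tries it on a made-up sequence; the variable names half_site and seq, and the sequence itself, are hypothetical.

```matlab
% One character set per position of the half-site; '.' stands for N
half_site = '[GAT][GA][AGC]C[AT][AT]G.[TCG][TC]';

% Full site: two half-sites separated by 0 to 13 arbitrary characters
p53_site = [half_site '.{0,13}' half_site];

% Hypothetical test sequence: TT + half-site + AAA + half-site + CC
seq = 'TTGGACATGCCCAAAGGGCATGTCTCC';

[m, s] = regexp(seq, p53_site, 'match', 'start');
```

For this sequence a single hit is found, starting at position 3 and spanning both half-sites and the three-character space between them.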

Looking for p53 TFBS in human promoters

How do we even start???
1. Read the promoter file into a cell array.
2. Go through the promoters:
   Look for the p53 consensus (need to define it as a regular expression)
   When we find it, store the data on the hit
3. Open a result file
4. Go through all the hits you found:
   Print them into the results file

Looking for p53 TFBS in human promoters

1. Reading the promoter file:

The file name: masked_promoters.some.txt
The file format: FASTA

    >gene1 header line
    Sequence…
    Sequence…
    >gene2 header line
    Sequence…
    Sequence…

For example:

    >GENE=ENSG00000001036 Transcript=1 LLid=2519 orgDBsym=FUCA2 other details…
    CCATGTTCTAAACGACTTCATAGATTTATTTCTTTCAGTCAT…

Looking for p53 TFBS in human promoters

1. Reading the promoter file:

    promoters = {};
    ensID = {};
    symb = {};

    fid = fopen('masked_promoters.all.txt');
    while feof(fid)==0
        tline = fgetl(fid);
        % <process the data>
    end
    fclose(fid);

Looking for p53 TFBS in human promoters

1. Reading the promoter file:

    while feof(fid)==0
        % <from previous slide…>
        if (tline(1)=='>')
            % it is a header
            tmp = regexp(tline, ...
                '.*GENE=(\w+)\s.*orgDBsym=(\w+)', ...
                'tokens');
            ensID{end+1} = tmp{1}{1};
            symb{end+1} = tmp{1}{2};
        else
            % it is a promoter
            promoters{end+1} = tline;
        end
    end
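To see what the header-parsing step produces, the token pattern can be tried on the example header line shown earlier (with the elided "other details…" part written out as plain text, which is an assumption about the file content).

```matlab
header = '>GENE=ENSG00000001036 Transcript=1 LLid=2519 orgDBsym=FUCA2 other details';

% Token 1: the word after GENE=, token 2: the word after orgDBsym=
tmp = regexp(header, '.*GENE=(\w+)\s.*orgDBsym=(\w+)', 'tokens');

ens_id = tmp{1}{1};                    % first occurrence, first token
symbol = tmp{1}{2};                    % first occurrence, second token
```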

Looking for p53 TFBS in human promoters

2. Go through the promoters:

    hit_seq = {};
    hit_gene = [];
    hit_pos = [];
    % the pattern is split into two pieces only to keep the line short
    p53_consensus = ['[GAT][GA][AGC]C[AT][AT]G.[TCG][TC].{0,13}' ...
                     '[GAT][GA][AGC]C[AT][AT]G.[TCG][TC]'];

    for i=1:length(promoters)
        [m, s, e] = regexp(promoters{i}, p53_consensus, 'match', ...
            'start', 'end');
        % let's ignore that DNA is double stranded…
        if (~isempty(m))
            hit_seq(end+1:end+length(m)) = m;
            hit_gene(end+1:end+length(m)) = repmat(i,1,length(m));
            hit_pos(end+1:end+length(m)) = s;
        end
    end

Looking for p53 TFBS in human promoters

3 & 4. Open a result file, print all the hits

    fid = fopen('p53_TFBS.txt','w');
    % printing a header line
    fprintf(fid,'gene ID\tgene name\tsite\tpos\n');
    for i=1:length(hit_gene)
        fprintf(fid,'%s\t%s\t%s\t%d\n', ...
            ensID{hit_gene(i)}, ...
            symb{hit_gene(i)}, ...
            hit_seq{i}, ...
            hit_pos(i));
    end
    fclose(fid);