Calculating Word Frequency in a Document. 11/6 (Thu): quiz this Thursday, 5. Threaded Binary.

Date posted: 20-Jan-2016

Page 1: Data Structure Project 2

Calculating Word Frequency in a Document

Page 2: TA's website & reminder

http://mpc.cs.nctu.edu.tw/forum/
Quiz this Thursday, 11/6 (Thu); 5. Threaded Binary Tree will not be on the quiz.
Midterm exam: 11/15 (Sat) 10:10~12:00!

Page 3: About Project One…

About the extra-line problem: >> version

    ifstream input(argv[1]);
    while (!input.eof() && input.peek() > 0) {
        input >> buf;
        cout << buf;
        input >> buf;
        input.get(); /* consume the '\n' character */
        cout << " " << buf << endl;
    }

Page 4: About Project One…

Getline version

    ifstream input(argv[1]);
    while (!input.eof()) {
        input.getline(buf, 500);
        if (input.gcount() > 0) /* check whether anything was actually read */
            cout << buf << endl;
    }

Another one

    ifstream input(argv[1]);
    while (input.getline(buf, 500)) {
        cout << buf << endl;
    }
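The same loop can be written with std::string instead of a fixed 500-character buffer. The following is a minimal sketch, not the TA's reference solution; the function name copy_lines and the use of generic streams are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>

// Copies every line from `in` to `out` and returns how many lines were read.
// The read itself is the loop condition, so the body never runs after a
// failed read and the last line is not printed twice.
std::size_t copy_lines(std::istream& in, std::ostream& out) {
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        out << line << '\n';
        ++count;
    }
    return count;
}
```

With an ifstream opened on argv[1], calling copy_lines(input, std::cout) would play the role of the loops above, and std::string grows as needed, so no line-length limit has to be chosen.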

Page 5: About Project One…

About ^@ appearing in the output
◦ If ^@ shows up during the demo, your program wrote '\0' (that is, the byte 0) to the file.
◦ From now on, a program that produces this extra output will not pass the demo and will be counted as wrong.

How to fix?
◦ The most common cause is writing a buffer/string to the file without computing its length correctly.

    int i; FILE* fw; const char *a = "123";
    fw = fopen(argv[1], "w");
    /* this does not output ^@ */
    for(i=0; i<3; i++) fprintf(fw, "%c", a[i]);
    /* this outputs ^@ */
    for(i=0; i<4; i++) fprintf(fw, "%c", a[i]);
    fclose(fw);

Contents of a: 1 2 3 \0
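One way to avoid counting the length by hand is to let std::strlen do it, since it counts only the characters before the terminating '\0'. This is a small sketch under assumptions: the helper name write_text is hypothetical, and it writes to a std::ostream rather than the FILE* used on the slide.

```cpp
#include <cassert>
#include <cstring>
#include <ostream>
#include <sstream>
#include <string>

// Writes the characters of the C string `s` to `out`, excluding the
// terminating '\0'. For "123" this writes exactly 3 bytes, never 4.
void write_text(std::ostream& out, const char* s) {
    out.write(s, static_cast<std::streamsize>(std::strlen(s)));
}
```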

Page 6: About Project One

For a make-up demo of Project 1, first upload your code to ftp://mpc.cs.nctu.edu.tw, in a directory named with your own student ID.

Scores for the first demo: http://www.cs.nctu.edu.tw/~hhyou/ds.php

Page 7: Project Two

Input: a text file and a stop words list
◦ Using argc and argv
◦ ./a.out stopword textfile

Output: pairs of a word and its number of occurrences
◦ To stdout (the screen)

Page 8: Project Two

Text file (without stop words)
Hello, I’m Billy, not bi|ly or 6illy or b.

Output
◦ Hello,: 1
◦ I’m: 1
◦ Billy,: 1
◦ not: 1
◦ bi|ly: 1
◦ or: 2
◦ 6illy: 1
◦ b.: 1

Page 9: Project Two

Text file (same)

Stop word list
◦ and
◦ not
◦ or

Output
◦ Hello,: 1
◦ I’m: 1
◦ Billy,: 1
◦ bi|ly: 1
◦ 6illy: 1
◦ b.: 1

Page 10: Project Two

Text file
◦ a b c d e f g h i j a b c d e

Stop words list
◦ a b c d

Output
◦ e:2 ; f:1 ; g:1 ; h:1 ; i:1 ; j:1

Page 11: Project Two

Input
◦ Text file
    Words are separated by ' ', '\t', or '\n'.
    Case sensitive: "Do" and "do" are different words.
    There are at most 2000 characters in one line.
    There will be no Chinese input.
    A text file may contain more than one line.
    There might be consecutive '\t', ' ', or '\n'.
    Program execution time is limited.
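The separator rules above can be sketched as a small tokenizer. This is an illustration under assumptions, not the required implementation; the names is_separator and split_words are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true for the three separators named in the assignment.
bool is_separator(char c) {
    return c == ' ' || c == '\t' || c == '\n';
}

// Splits one line into words. Runs of consecutive separators produce no
// empty words, and letter case is left untouched ("Do" and "do" differ).
std::vector<std::string> split_words(const std::string& line) {
    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < line.size()) {
        while (i < line.size() && is_separator(line[i])) ++i;   // skip separators
        std::size_t start = i;
        while (i < line.size() && !is_separator(line[i])) ++i;  // scan one word
        if (i > start) words.push_back(line.substr(start, i - start));
    }
    return words;
}
```

Punctuation stays attached to the word, which matches the sample output where "Hello," and "b." are counted as written.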

Page 12: Project Two

Input
◦ Stop words list
    One word per line.
    No space or '\t' within a line.
    No more than 2000 characters per line.

Correct
◦ Haha
◦ Hehe
◦ kerker

Incorrect
◦ 囧 oo
◦ A b
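Because the stop words list has one word per line, it can be loaded with a plain line-by-line read. The sketch below assumes a hash set (std::unordered_set) for fast lookup, which the slides do not prescribe; the function name load_stop_words is hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_set>

// Reads a stop words list with one word per line and returns the set of words.
// Empty lines are skipped. The hash set is an assumption chosen for fast lookup.
std::unordered_set<std::string> load_stop_words(std::istream& in) {
    std::unordered_set<std::string> stop_words;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) stop_words.insert(line);
    }
    return stop_words;
}
```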

Page 13: Project Two

Word occurrence
◦ Format: string + ' ' + number + '\n'
    A 3
    B 5

The order of the strings does not matter.
    B 5
    A 3
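A short sketch of the output format, assuming the counts are held in a std::unordered_map (an illustrative choice, not something the assignment requires). The name print_counts is hypothetical.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>

// Prints each pair as: word, one space, count, newline.
// The iteration order of an unordered_map is unspecified, which is acceptable
// here because the order of the output lines does not matter.
void print_counts(std::ostream& out,
                  const std::unordered_map<std::string, long>& counts) {
    for (const auto& entry : counts) {
        out << entry.first << ' ' << entry.second << '\n';
    }
}
```

Using '\n' rather than std::endl avoids flushing the stream after every line, which can matter when the large test case produces many lines.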

Page 14: Project Two

You can use any data structure to store the pair (word, occurrence), such as an array. (Watch out for the large case.)
◦ One array for your strings, another for the occurrences

Your data structure must be fast at insertion and selection (search).
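One structure that is fast at both insertion and search is a hash table. The sketch below uses the standard std::unordered_map as one possible choice; the slides leave the data structure open, and the name add_word is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

typedef std::unordered_map<std::string, long> WordCounts;

// Records one occurrence of `word`. operator[] inserts the word with a
// count of 0 the first time it is seen, so insertion and lookup are a
// single call that runs in expected constant time.
void add_word(WordCounts& counts, const std::string& word) {
    ++counts[word];
}
```

For the large case, calling counts.reserve(n) up front with an estimated number of distinct words may reduce rehashing; the right value of n is a guess that depends on the test data.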

Page 15: Project Two

We'll use a program to judge your homework.
◦ Please take care with the I/O format.

You cannot read the whole file at once.
◦ You may read at most one line at a time.

We'll release some test data.
Due: 11/21
Your bonus will depend on the efficiency of your program.

Page 16: Project Two

Large case
◦ A lot of different words (more than 1000000)
◦ A lot of words in a text file
◦ 30%
◦ One of them will be released

10% per test case
We will release 2 normal test cases and 1 large test case for testing.

Page 17: Project Two

A simple algorithm
Assume STOPWORD has N words and TEXTFILE has M words.
We build SW_LIST to store the stop words and TXT_LIST to store the text file words.

Page 18: Project Two (Brute Force)

    Read in STOPWORD, store it as SW_LIST
    foreach ( word read from TEXTFILE ) {
        if ( the word is in SW_LIST )
            continue to read another word
        else                                  /* the word is not in SW_LIST */
            if ( the word is in TXT_LIST )
                add 1 to the count of the word
            else                              /* the word is not in TXT_LIST */
                insert the word into TXT_LIST
    }

Complexity annotations: O(N), O(M), O(M), O(N)
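The brute-force pseudocode can be sketched in C++ with two plain lists and linear search. This is an illustrative translation under assumptions: the names WordList, CountList, in_stop_list, and count_words are hypothetical, and words are read with operator>>, which splits on any whitespace (including ' ', '\t', and '\n').

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::string> WordList;                     // SW_LIST
typedef std::vector<std::pair<std::string, long> > CountList;  // TXT_LIST

// Linear search over the stop words: up to N comparisons per lookup.
bool in_stop_list(const WordList& sw_list, const std::string& word) {
    for (std::size_t i = 0; i < sw_list.size(); ++i)
        if (sw_list[i] == word) return true;
    return false;
}

// Counts the words of `text` that are not in `sw_list`,
// reading one word at a time.
CountList count_words(std::istream& text, const WordList& sw_list) {
    CountList txt_list;
    std::string word;
    while (text >> word) {
        if (in_stop_list(sw_list, word)) continue;  // stop word: skip it
        bool found = false;
        // Linear search over the words seen so far: up to M comparisons.
        for (std::size_t i = 0; i < txt_list.size() && !found; ++i) {
            if (txt_list[i].first == word) {
                ++txt_list[i].second;
                found = true;
            }
        }
        if (!found) txt_list.push_back(std::make_pair(word, 1L));
    }
    return txt_list;
}
```

Each of the M text words costs up to N comparisons against SW_LIST plus up to M against TXT_LIST, so the worst case is O(M·(N+M)) string comparisons. That is likely too slow for the large test case, which is why a faster lookup structure is worth considering.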

Page 19: Project Two

Faster programs will earn a bonus on this assignment.
Everyone's programs will be run on a certain mystery workstation to see whose is fast and whose is slow.
If you have any concerns about the fairness of the bonus, please raise them before class on 11/6 (Thu).

Page 20: Project Two – How to hand in

First, create a folder named with your own student ID on ftp://mpc.cs.nctu.edu.tw.

Upload C/C++ source code files that compile and run to ftp://mpc.cs.nctu.edu.tw.

Page 21: Q & A

Any questions?