Date posted: 20-Jan-2016
Data Structure Project 2
Calculating Word Frequency in a Document
TA's website: http://mpc.cs.nctu.edu.tw/forum/
Quiz this Thursday, 11/6 (Thu). Section 5, Threaded Binary Tree, will not be on it.
Midterm exam: 11/15 (Sat) 10:10~12:00.
TA's website & reminder
About the extra-line problem: >> version
◦ ifstream input(argv[1]);
◦ while (!input.eof() && input.peek() > 0) {
◦     input >> buf;
◦     cout << buf;
◦     input >> buf;
◦     input.get(); /* consume the '\n' character */
◦     cout << " " << buf << endl;
◦ }
About Project One…
Getline version
◦ ifstream input(argv[1]);
◦ while (!input.eof()) {
◦     input.getline(buf, 500);
◦     if (input.gcount() > 0) /* check whether anything was actually read */
◦         cout << buf << endl;
◦ }
Another one
◦ ifstream input(argv[1]);
◦ while (input.getline(buf, 500)) {
◦     cout << buf << endl;
◦ }
About Project One…
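About ^@ showing up
◦ If ^@ appears during the demo, you wrote a '\0' (that is, the byte 0) to the output file.
◦ From now on, a program that produces this extra output will not pass the demo and will be counted as wrong.
How to fix?
◦ The most common cause is writing a buffer/string to the file without computing its length correctly.
◦ int i; FILE* fw; const char *a = "123";
◦ fw = fopen(argv[1], "w");
◦ /* this does not output ^@ */
◦ for (i = 0; i < 3; i++) fprintf(fw, "%c", a[i]);
◦ /* this does output ^@: a[3] is the terminating '\0' */
◦ for (i = 0; i < 4; i++) fprintf(fw, "%c", a[i]);
◦ fclose(fw);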
About Project One…
[Figure: memory layout of the string "123": '1' '2' '3' '\0']
To make up the Project 1 demo, first upload your code to ftp://mpc.cs.nctu.edu.tw and create a directory named after your student ID.
First demo scores: http://www.cs.nctu.edu.tw/~hhyou/ds.php
About Project One
Input: a text file and a stop words list
◦ Using argc and argv
◦ ./a.out stopword textfile
Output: each word paired with its number of occurrences
◦ To stdout (the screen)
Project Two
Text file (without stop words)
◦ Hello, I’m Billy, not bi|ly or 6illy or b.
Output
◦ Hello,: 1
◦ I’m: 1
◦ Billy,: 1
◦ not: 1
◦ bi|ly: 1
◦ or: 2
◦ 6illy: 1
◦ b.: 1
Project Two
Text file (same)
Stop word list
◦ and
◦ not
◦ or
Output
◦ Hello,: 1
◦ I’m: 1
◦ Billy,: 1
◦ bi|ly: 1
◦ 6illy: 1
◦ b.: 1
Project Two
Text file
◦ a b c d e f g h i j a b c d e
Stop words list
◦ a b c d
Output
◦ e:2 ; f:1 ; g:1 ; h:1 ; i:1 ; j:1
Project Two
Input
◦ Text file
Words are separated by ' ', '\t', or '\n'.
Case sensitive: Do and do are different words.
There are at most 2000 characters in one line.
There will be no Chinese input.
A text file may contain more than one line.
There might be consecutive '\t', ' ', or '\n'.
Program execution time is limited.
Project Two
Input
◦ Stop words list
One word per line.
No space or '\t' within a line.
No more than 2000 characters per line.
Correct
◦ Haha
◦ Hehe
◦ kerker
Incorrect
◦ 囧 oo
◦ A b
Project Two
Word occurrence
◦ String + ' ' + number + '\n'
A 3
B 5
The order of the strings does not matter.
B 5
A 3
Project Two
You can use any data structure to store the pair (word, occurrence), such as an array. (Watch out for the large case.)
One array for the strings, another for the occurrence counts.
Your data structure must be fast at insertion and selection (search).
Project Two
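One structure that is fast at both insertion and search is a hash table. The sketch below assumes the standard library's std::unordered_map is allowed for the assignment, which the slides do not say; WordCounter is a hypothetical name.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical wrapper around a hash table keyed by word.
struct WordCounter {
    std::unordered_map<std::string, long> counts;

    // Insert the word with count 1, or increment its existing count.
    // operator[] value-initializes a missing entry to 0 before the increment.
    void add(const std::string& word) { ++counts[word]; }

    // Return the stored count, or 0 if the word was never added.
    long count(const std::string& word) const {
        auto it = counts.find(word);
        return it == counts.end() ? 0 : it->second;
    }
};
```

Lookups and insertions take constant time on average, compared with a linear scan through a pair of parallel arrays. Keys are compared byte for byte, so Do and do remain different words.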
We’ll use a program to judge your homework
◦ Please take care with the I/O format
You cannot read the whole file at once
◦ Read at most one line at a time
We’ll release some test data.
Due: 11/21
Your bonus will depend on the efficiency of your program.
Project Two
Large case
◦ A lot of different words (more than 1000000)
◦ A lot of words in a text file
◦ 30%
◦ One of them will be released
10% per test case
We will release 2 normal test cases and 1 large test case for testing.
Project Two
A simple algorithm
Assume STOPWORD has N words and TEXTFILE has M words.
We build SW_LIST to store the stop words and TXT_LIST to store the text file words.
Project Two
Read in STOPWORD, store it as SW_LIST              /* O(N) */
foreach ( word read from TEXTFILE ) {              /* O(M) iterations */
    if ( the word is in SW_LIST )                  /* O(N) linear search */
        continue to read another word
    else if ( the word is in TXT_LIST )            /* O(M) linear search */
        add 1 to the count of the word
    else
        insert the word into TXT_LIST with count 1
}
Project Two (Brute Force)
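For comparison with the brute-force version, here is a sketch that replaces both linear searches with hash lookups, so each word costs constant time on average. It assumes the standard hash containers are permitted; count_words and Counts are hypothetical names. The text is read one line at a time, in keeping with the rule against reading the whole file at once.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>

using Counts = std::unordered_map<std::string, long>;

// Hypothetical function: count every word of `text` that is not a stop word.
Counts count_words(std::istream& stopwords, std::istream& text) {
    // SW_LIST as a hash set: one word per line, so operator>> is enough.
    std::unordered_set<std::string> sw_list;
    std::string word;
    while (stopwords >> word) {
        sw_list.insert(word);
    }

    // TXT_LIST as a hash map from word to occurrence count.
    Counts txt_list;
    std::string line;
    while (std::getline(text, line)) {   // at most one line in memory
        std::istringstream words(line);  // splits on runs of whitespace
        while (words >> word) {
            if (sw_list.count(word) == 0) {
                ++txt_list[word];        // inserts with 0, then increments
            }
        }
    }
    return txt_list;
}
```

Note that operator>> also treats other whitespace such as '\r' as a separator, which goes slightly beyond the three delimiters the assignment lists.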
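Faster programs for this assignment will earn a bonus.
We will run everyone's program on a certain mystery workstation and see whose is fast and whose is slow.
If you have concerns about the fairness of the bonus, raise them before class on 11/6 (Thu).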
Project Two
First create a folder named after your student ID on ftp://mpc.cs.nctu.edu.tw.
Upload C/C++ source code that compiles and runs to ftp://mpc.cs.nctu.edu.tw.
Project Two – How to hand in
Any questions ?
Q & A