Google N-Gram Patterns CS 8621 Fall 2007 By Team Flamengo: Darshan Paranjape Bin Lan Anurag Jain...


Page 1: Google N-Gram Patterns CS 8621 Fall 2007 By Team Flamengo: Darshan Paranjape Bin Lan Anurag Jain Vishnu Pedireddy.

Google N-Gram Patterns

CS 8621, Fall 2007

By Team Flamengo: Darshan Paranjape, Bin Lan, Anurag Jain, Vishnu Pedireddy

Project Goal

Build a co-occurrence network using Google unigram and bi-gram data.

Analyze the network layout and minimize the problem by finding disjoint networks.

Extract the path information for user query words.

Prune the network using unigram and association score cutoff.

System Flow

[Flow diagram with the following stages:]

Network Building (File I/O)

Network Analysis (Find disjoint network)

Unigram Cutoff

Path Finding

Association Score Cutoff

Data Structures

Two important aspects of a co-occurrence network: Nodes & Edges.

Data structure “node” stores the information about the unigram data.

Data structure “edge” stores the information about the bi-gram data.

Implementation choices: Array of structures or Linked list.

Data Structures

typedef struct edge edge;      //Forward declaration so that node can point to edge

typedef struct node
{
    char* token;               //Word string
    long int freq;             //Unigram Frequency
    int index;                 //Location in unigram array
    edge *incoming;            //Starting pointer of incoming linked list
    edge *curr_incoming;       //Current pointer of incoming linked list
    edge *outgoing;            //Starting pointer of outgoing linked list
    edge *curr_outgoing;       //Current pointer of outgoing linked list
    int has_seen;              //Variable used while finding distinct networks
    int is_checked;            //Variable used while finding distinct networks
    int count_outgoing;        //Total number of outgoing edges
    int count_incoming;        //Total number of incoming edges
    long int total_out_weight; //Sum of all outgoing weights
    int incoming_nodes_added;  //Variable used in the beta stage
    int outgoing_nodes_added;  //Variable used in the beta stage
} node;

struct edge
{
    int index;                 //Location in unigram array
    long int freq;             //Weight associated with edge
    struct edge *next;         //Pointer to next entry in the linked list
    int marked;                //Variable used in beta stage
    int marked_before;         //Variable used in beta stage
};

Network Building Steps

Count the number of unigrams.

Memory Allocation.

Read the unigram file.

Bi-gram File Distribution.

Finding index from Unigram array.

Adding incoming and outgoing edge information.
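The last two steps can be illustrated with a minimal sketch. It assumes the unigram array is sorted by token so that a binary search can locate a word, and it uses cut-down versions of the node and edge structures. The function names find_index and add_bigram are hypothetical and do not come from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* Cut-down versions of the slide's structures (assumption: only these fields are needed here). */
typedef struct edge {
    int index;              /* location of the other word in the unigram array */
    long freq;              /* bigram frequency, used as the edge weight */
    struct edge *next;
} edge;

typedef struct node {
    const char *token;
    long freq;
    edge *incoming;
    edge *outgoing;
    int count_incoming;
    int count_outgoing;
    long total_out_weight;
} node;

/* Binary search over a unigram array sorted by token; returns -1 if the token is absent. */
int find_index(const node *unigrams, int n, const char *token)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int cmp = strcmp(token, unigrams[mid].token);
        if (cmp == 0) return mid;
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return -1;
}

/* Allocate one list entry; returns NULL if memory is exhausted. */
static edge *new_edge(int index, long freq)
{
    edge *e = malloc(sizeof *e);
    if (e != NULL) { e->index = index; e->freq = freq; e->next = NULL; }
    return e;
}

/* Record the bigram "w1 w2 freq": an outgoing edge on w1 and an incoming edge on w2.
   Returns 0 on success, -1 if either word is unknown or allocation fails. */
int add_bigram(node *unigrams, int n, const char *w1, const char *w2, long freq)
{
    int i = find_index(unigrams, n, w1);
    int j = find_index(unigrams, n, w2);
    if (i < 0 || j < 0) return -1;

    edge *out = new_edge(j, freq);
    edge *in = new_edge(i, freq);
    if (out == NULL || in == NULL) { free(out); free(in); return -1; }

    out->next = unigrams[i].outgoing;      /* prepend to w1's outgoing list */
    unigrams[i].outgoing = out;
    unigrams[i].count_outgoing++;
    unigrams[i].total_out_weight += freq;

    in->next = unigrams[j].incoming;       /* prepend to w2's incoming list */
    unigrams[j].incoming = in;
    unigrams[j].count_incoming++;
    return 0;
}
```

Prepending keeps insertion constant-time; the curr_incoming and curr_outgoing pointers in the slide's structure suggest the original may instead append at a tracked tail, which this sketch does not reproduce.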

Network Analysis

Two-step algorithm:

1. Local analysis. (Based on the information on each processor.)

2. The global result is derived from the local results.

Local Analysis

Two checking bits are used during this process (has_seen and is_checked):

– First, the function check_all_branch is applied to a node A. Inside the function, node A’s has_seen bit is marked because we have “seen” node A.

– Second, check_all_branch is applied to all the neighbors of node A.

– Finally, node A’s is_checked bit is marked because we have covered all neighbors of node A. So node A is “checked”.

– During the local analysis, each disjoint network is assigned a unique network ID.
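The traversal described above can be sketched as a recursive depth-first search. This is a minimal sketch under assumptions: the network_id field and the local_analyze driver are hypothetical, neighbors are taken to be both incoming and outgoing edges, and plain recursion is used even though a network with millions of nodes would need an explicit stack.

```c
#include <stddef.h>

typedef struct edge {
    int index;              /* index of the neighboring node */
    struct edge *next;
} edge;

typedef struct node {
    edge *incoming;
    edge *outgoing;
    int has_seen;           /* set when the node is first reached */
    int is_checked;         /* set once all of its neighbors have been visited */
    int network_id;         /* hypothetical field: ID of the node's disjoint network */
} node;

/* Mark node i as seen, visit every neighbor, then mark node i as checked. */
void check_all_branch(node *nodes, int i, int id)
{
    if (nodes[i].has_seen) return;
    nodes[i].has_seen = 1;
    nodes[i].network_id = id;
    for (edge *e = nodes[i].outgoing; e != NULL; e = e->next)
        check_all_branch(nodes, e->index, id);
    for (edge *e = nodes[i].incoming; e != NULL; e = e->next)
        check_all_branch(nodes, e->index, id);
    nodes[i].is_checked = 1;
}

/* Start a traversal from every unseen node; returns the number of disjoint networks found. */
int local_analyze(node *nodes, int n)
{
    int next_id = 0;
    for (int i = 0; i < n; i++)
        if (!nodes[i].has_seen)
            check_all_branch(nodes, i, next_id++);
    return next_id;
}
```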

Global Analysis

After the local analysis, each processor has a unique but incomplete view of the network, based on the edge information it stores.

– For example: in CPU A, node 1 is connected to nodes 2, 3, and 5. Their network ID is 5. Meanwhile, in CPU B, node 1 is connected to nodes 2, 6, and 7. Their network ID is 2.

Global Analysis (Cont.)

So what we need to do during the global analysis is combine the local results so that the global result reflects the real network layout.

– In the previous example, what our algorithm basically does is tell everyone that network 5 in CPU A is actually connected to network 2 in CPU B. Hence, node 1 is connected not only to nodes 2, 3, and 5, but also to nodes 6 and 7. (Note that the network ID of the same node can differ between CPUs, but the node ID, or index, is always the same on all CPUs.)
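One way to combine local results like these is a union-find (disjoint-set) structure over (CPU, local network ID) labels. The slides do not say which technique the team used, so the following is only a sketch of the idea; every name in it (uf_find, uf_union, label, report, MAX_LABELS) is hypothetical.

```c
#define MAX_LABELS 64       /* assumed upper bound on distinct (CPU, network ID) labels */

static int parent[MAX_LABELS];

/* Every label starts as its own set. */
void uf_init(void)
{
    for (int i = 0; i < MAX_LABELS; i++) parent[i] = i;
}

/* Find the representative of x's set, shortening the chain as we go (path halving). */
int uf_find(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Merge the sets containing a and b. */
void uf_union(int a, int b)
{
    parent[uf_find(a)] = uf_find(b);
}

/* Turn (cpu, local network ID) into one global label; assumes local_id < ids_per_cpu. */
int label(int cpu, int local_id, int ids_per_cpu)
{
    return cpu * ids_per_cpu + local_id;
}

/* A CPU reports that node v carries label lbl. first_label[v] is -1 until v is first
   reported; a second, different label for the same node joins the two networks. */
void report(int *first_label, int v, int lbl)
{
    if (first_label[v] < 0)
        first_label[v] = lbl;
    else
        uf_union(first_label[v], lbl);
}
```

Applied to the example, CPU A reports nodes 1, 2, 3, 5 under its network 5 and CPU B reports nodes 1, 2, 6, 7 under its network 2. Because the node index is the same on every CPU, the second report of node 1 merges the two labels, so nodes 3 and 6 end up with the same representative.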

PRINTING METHOD AND OBLIGATIONS
-------------------------------

To be able to print paths of a specific length with target ‘X’ at the center:

1) We collect X’s disjoint network on the master processor.
2) Then we recursively print the paths.

Note:
1) We don’t print cyclic paths. (A cyclic path is one in which the same edge occurs more than once.)
2) We print complete paths. (If the last node in the printed path does not have any parent or a child or both, then that path is called a complete path.)
3) We do not collect the whole disjoint network for the target, only the part of it that the specified length requires.

Avoiding Self-Loops (Edge-Marking Method)
-----------------------------------------

In this method, while building up a path to be printed, we mark all the edges that have been included in the path. If a marked edge occurs again, we skip it and move to the next connected edge, and so on until we find an unmarked edge. In this way we avoid self-loops.

Note:
1) We do not mark the last outgoing edge, as doing so is useless.
2) We will not be collecting nodes for the last nodes in printed paths.

Printing Complete Paths
-----------------------

To print the complete paths, we took advantage of the property that the last node in a complete path has no parent or no child or both. If the last node in the path does not have any child or parent or both, and the length is less than or equal to the specified length, we print that path.
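The edge-marking idea can be illustrated with a simplified sketch. It follows outgoing edges only, starting from one node, rather than growing the path on both sides of a central target. It also marks every edge on the current path and unmarks it on backtracking, which is an assumption; the slides do not spell out when marks are cleared. The names walk, print_forward_paths and MAX_LEN are hypothetical.

```c
#include <stdio.h>

#define MAX_LEN 16          /* assumed cap on path length for this sketch */

typedef struct edge {
    int index;              /* destination node */
    long freq;
    int marked;             /* 1 while this edge is part of the current path */
    struct edge *next;
} edge;

typedef struct node {
    const char *token;
    edge *outgoing;
} node;

/* Extend the current path from node cur; returns the number of paths printed. */
static int walk(node *nodes, int cur, int depth, int max_len, int *path)
{
    int printed = 0, extended = 0;
    path[depth] = cur;
    if (depth < max_len) {
        for (edge *e = nodes[cur].outgoing; e != NULL; e = e->next) {
            if (e->marked) continue;        /* edge already on this path: skip it */
            e->marked = 1;
            printed += walk(nodes, e->index, depth + 1, max_len, path);
            e->marked = 0;                  /* unmark when backtracking */
            extended = 1;
        }
    }
    if (!extended && depth > 0) {           /* full length reached, or no usable edge left */
        for (int i = 0; i <= depth; i++)
            printf("%s%s", i ? "->" : "", nodes[path[i]].token);
        printf("\n");
        printed++;
    }
    return printed;
}

/* Print every forward path of at most max_len edges from start that uses no edge twice. */
int print_forward_paths(node *nodes, int start, int max_len)
{
    int path[MAX_LEN + 1];
    if (max_len > MAX_LEN) max_len = MAX_LEN;
    return walk(nodes, start, 0, max_len, path);
}
```

On a network with edges A->A, A->B, B->A, B->B and F->A, starting at A with length 2, the candidate A->A->A is skipped because it would reuse the A->A edge.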

1gm (Vocab) File And 2gm File To Be Read

1gm File
--------
A 1000
B 2000
F 3000

2gm file to read
----------------
A A 100
A B 200
B A 400
B B 500
F A 600


PRINT PATHS

TARGET ‘A’ LENGTH 2

NUMBER OF PROCESSORS 2

Number Of Processors 2 | Target Token A | Length 2
--------------------------------------------------

A’s disjoint network distributed among the processors. Each list entry is index,frequency.

Processor 1
-----------
|A| -> 0,100 -> 1,200 (Outgoing)
    -> 0,100 -> 1,400 (Incoming)
|B| -> 0,400 (Outgoing)
    -> 0,200 (Incoming)
|F| -> NULL (Outgoing)
    -> NULL (Incoming)

Processor 2
-----------
|A| -> NULL (Outgoing)
    -> 2,600 (Incoming)
|B| -> 1,500 (Outgoing)
    -> 1,500 (Incoming)
|F| -> 0,600 (Outgoing)
    -> NULL (Incoming)

INDICES    1gm File
-------    --------
0          A 1000
1          B 2000
2          F 3000

Number Of Processors 2 | Target Token A | Length 2 (Before Printing)
-------------------------------------------------------------------

Collect A’s Disjoint Network On Processor 1

Processor 1
-----------
|A| -> 0,100 -> 1,200 (Outgoing)
    -> 0,100 -> 1,400 -> 2,600 (Incoming)
|B| -> 0,400 -> 1,500 (Outgoing)
    -> 0,200 -> 1,500 (Incoming)
|F| -> 0,600 (Outgoing)
    -> NULL (Incoming)

Processor 2
-----------
|A| -> NULL (Outgoing)
    -> 2,600 (Incoming)
|B| -> 1,500 (Outgoing)
    -> 1,500 (Incoming)
|F| -> 0,600 (Outgoing)
    -> NULL (Incoming)

Print Path For ‘A’ And Length 2

NETWORK PROCESSOR FORM

Processor 1
-----------
|A| -> 0,100 -> 1,200 (Outgoing)
    -> 0,100 -> 1,400 -> 2,600 (Incoming)
|B| -> 0,400 -> 1,500 (Outgoing)
    -> 0,200 -> 1,500 (Incoming)
|F| -> 0,600 (Outgoing)
    -> NULL (Incoming)

NETWORK GRAPH FORM

[Figure: the same network drawn as a graph.]

Print Path For ‘A’ And Length 2

[Figure: two graphs over the nodes A, B and F, labeled NETWORK and A’s LENGTH 2 NETWORK.]

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

[Figure: A’s length 2 network with the target A marked as the center. Legend: Printed (Unmarked); Might Be Printed (Marked); Not Printed.]

A->(100)->A->(100)->A (SKIPPED)

SOLUTION

(empty so far)

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

B->(400)->A->(100)->A->(100)->A (SKIPPED)

SOLUTION

(empty so far)

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

B->(400)->A->(100)->A->(200)->B->(400)->A (SKIPPED)

SOLUTION

(empty so far)

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

B->(400)->A->(100)->A->(200)->B->(500)->B (PRINTED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

F->(600)->A->(100)->A->(100)->A (SKIPPED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

F->(600)->A->(100)->A->(200)->B->(400)->A (PRINTED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

F->(600)->A->(100)->A->(200)->B->(500)->B (PRINTED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

A->(200)->B->(400)->A->(100)->A->(100)->A (SKIPPED)
A->(200)->B->(400)->A->(100)->A->(200)->B (SKIPPED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

A->(200)->B->(400)->A->(200)->B (SKIPPED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

B->(500)->B->(400)->A->(100)->A->(100)->A (SKIPPED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

B->(500)->B->(400)->A->(100)->A->(200)->B (PRINTED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B
B->(500)->B->(400)->A->(100)->A->(200)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

B->(500)->B->(400)->A->(200)->B->(400)->A (SKIPPED)
B->(500)->B->(400)->A->(200)->B->(500)->B (SKIPPED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B
B->(500)->B->(400)->A->(100)->A->(200)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

F->(600)->A->(100)->A->(100)->A (SKIPPED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B
B->(500)->B->(400)->A->(100)->A->(200)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

F->(600)->A->(100)->A->(200)->B (PRINTED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B
B->(500)->B->(400)->A->(100)->A->(200)->B
F->(600)->A->(100)->A->(200)->B

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

F->(600)->A->(200)->B->(400)->A (PRINTED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B
B->(500)->B->(400)->A->(100)->A->(200)->B
F->(600)->A->(100)->A->(200)->B
F->(600)->A->(200)->B->(400)->A

Print Path For ‘A’ And Length 2

Query: Target: A, Length: 2

F->(600)->A->(200)->B->(500)->B (PRINTED)

SOLUTION

B->(400)->A->(100)->A->(200)->B->(500)->B
F->(600)->A->(100)->A->(200)->B->(400)->A
F->(600)->A->(100)->A->(200)->B->(500)->B
B->(500)->B->(400)->A->(100)->A->(200)->B
F->(600)->A->(100)->A->(200)->B
F->(600)->A->(200)->B->(400)->A
F->(600)->A->(200)->B->(500)->B

IMPORTANT OBSERVATIONS

Paths And Unigram Cut-Off
-------------------------
Unigram Cut-Off is the heart. It lets us reduce the size of the network as much as we want, which helps us run our system even with a small amount of memory.

Paths And Associative Cut-Off
-----------------------------
If Unigram Cut-Off is the heart, then Associative Cut-Off is the soul. It helps us run the system with high lengths even when disk space is lacking. Not only that, it also helps us reduce the size of the collected network.

IMPORTANT OBSERVATIONS CONTD.

Ques: Which system is suitable for printing?
Ans: We can use IBM BladeCenter or SGI Altix 3700 BX2.

IBM BladeCenter
---------------
It can be used only with high unigram cut-offs and very small associative cut-offs, because it lacks memory and disk space, respectively.

SGI Altix 3700 BX2
------------------
It can be used even with no unigram cut-off, because plenty of memory is available, but it still needs very small associative cut-offs because disk space is limited.

IMPORTANT OBSERVATIONS CONTD.

TARGET LIST
-----------
It is hard to pick the best target list for the Google n-gram data, because there is one really big network with 13163490 nodes, while the other networks have very few nodes (1, 2 or a few more). So if we pick a word from the big network, we might end up with a very large network residing on the master processor. This makes IBM BladeCenter unsuitable for printing because of its memory limitation and forces us to choose ALTIX. On ALTIX, with enough memory, you can probably print paths for any token present in the unigram array and for any length.

Note: We might still be short of disk space unless we specify a high unigram cut-off and a very, very low associative cut-off (e.g. 0.000000000000001).

NO PATHS PRINTED FOR LENGTH GREATER THAN HALF THE NUMBER OF EDGES IN THE DISJOINT NETWORK
-----------------------------------------------------------------------------------------

Cut-Off Frequencies And Association Cut

What is their purpose?

1. To reduce the size of the network built in system memory.

2. Tools to manipulate the structure of the graph.

Cut-Off Frequency

Design Options.

1. Create the array with all the unigrams but the edge information.

2. Create the unigram array with unigrams that are above the cut-off frequency.

Where to plug in?

Create the unigram array to reflect the total number of unigrams.

Before adding the unigram into the array, check if it satisfies the unigram cut-off.

Before adding the edge information (from the bigram), check for the presence of the unigram using binary search.
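The two checks above can be sketched as follows. This is a minimal sketch, assuming the unigrams arrive sorted by token and that a unigram "satisfies" the cut-off when its frequency is greater than or equal to it; the slides do not say whether the comparison is strict. The names unigram, apply_unigram_cutoff and bigram_allowed are hypothetical, and the standard library bsearch stands in for the team's own binary search.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *token;
    long freq;
} unigram;

/* Copy into kept only the unigrams that meet the cut-off. Input order is preserved,
   so a sorted input stays sorted. Returns the number of unigrams kept. */
size_t apply_unigram_cutoff(const unigram *all, size_t n, long cutoff, unigram *kept)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (all[i].freq >= cutoff)          /* assumption: ">=" rather than ">" */
            kept[k++] = all[i];
    return k;
}

/* bsearch comparator: the key is a token string, the element is a unigram. */
static int cmp_token(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const unigram *)elem)->token);
}

/* Returns 1 if both words of a bigram survived the cut-off, 0 otherwise. */
int bigram_allowed(const unigram *kept, size_t k, const char *w1, const char *w2)
{
    return bsearch(w1, kept, k, sizeof *kept, cmp_token) != NULL
        && bsearch(w2, kept, k, sizeof *kept, cmp_token) != NULL;
}
```

With the unigrams A 1000, B 2000 and F 3000 and a cut-off of 2000, only B and F are kept, so a bigram that involves A is dropped before any edge is added.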

Association Cut

Determines which bigram pairs are included in the path, based on the associative score.

Unigram cut: taken care of during the network build.

Association cut: taken care of during path finding.

Where to plug in?

Acts as a regulator for path tracking.

Prunes the whole subgraph if one of the branches does not satisfy the association cut.
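The pruning behavior can be sketched in a few lines. The slides do not define the association score, so the formula below (the edge frequency divided by the source node's total outgoing weight) and the rule that an edge is followed when its score is at least the cut-off are both assumptions made purely for illustration. The names association_score and count_followed_edges are hypothetical.

```c
#include <stddef.h>

typedef struct edge {
    int index;              /* destination node */
    long freq;
    struct edge *next;
} edge;

typedef struct node {
    edge *outgoing;
    long total_out_weight;  /* sum of all outgoing weights, as in the slide's structure */
} node;

/* Hypothetical score: the share of the source node's outgoing weight carried by this edge. */
double association_score(const node *src, const edge *e)
{
    if (src->total_out_weight <= 0) return 0.0;
    return (double)e->freq / (double)src->total_out_weight;
}

/* Count the edges followed within depth steps of cur. An edge below the cut-off is not
   followed, so everything reachable only through it is pruned as well. */
long count_followed_edges(const node *nodes, int cur, int depth, double cutoff)
{
    long followed = 0;
    if (depth == 0) return 0;
    for (const edge *e = nodes[cur].outgoing; e != NULL; e = e->next) {
        if (association_score(&nodes[cur], e) < cutoff) continue;
        followed += 1 + count_followed_edges(nodes, e->index, depth - 1, cutoff);
    }
    return followed;
}
```

Because the check happens before the recursive call, a rejected edge costs nothing further: the subgraph behind it is never visited.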

Network Analysis (alternate approach)

An approach based on message passing that finds one network completely before finding the next.

Useful for learning the statistics of the network to which a word belongs, rather than building the whole network.

How does it work?

Processor 0 is the master.

The rest of the processors are slaves.

Every new node tracked in the master is broadcast.

Slaves receive the nodes and perform a localized search.

Slaves broadcast their results to account for a disjoint network spread over processors.

How does it work? (Cont.)

The master checks whether its local list has been updated.

If yes, continue the iteration, beginning from the first step again.

If no, the whole network has been found.
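The iteration can be illustrated with a single-process simulation. This is only a sketch: a shared membership array stands in for the broadcasts, each "processor" is just an edge list, and the names bigram_edge, processor, local_search and find_network are hypothetical. A real implementation would exchange the newly found nodes through message passing.

```c
#include <stddef.h>

typedef struct {
    int from, to;
} bigram_edge;

typedef struct {
    const bigram_edge *edges;   /* edges stored on this simulated processor */
    size_t count;
} processor;

/* Localized search: add any node that shares an edge with a known member.
   Returns 1 if the membership list changed, 0 otherwise. */
static int local_search(const processor *p, int *member)
{
    int updated = 0;
    for (size_t i = 0; i < p->count; i++) {
        int a = p->edges[i].from, b = p->edges[i].to;
        if (member[a] != member[b]) {       /* exactly one endpoint is known so far */
            member[a] = member[b] = 1;
            updated = 1;
        }
    }
    return updated;
}

/* Repeat the search on every processor until no list changes.
   member must have n entries; returns the size of target's disjoint network. */
int find_network(const processor *procs, size_t nprocs, int target, int *member, int n)
{
    int updated = 1, size = 0;
    for (int i = 0; i < n; i++) member[i] = 0;
    member[target] = 1;
    while (updated) {
        updated = 0;
        for (size_t p = 0; p < nprocs; p++) /* stands in for broadcast plus local search */
            updated |= local_search(&procs[p], member);
    }
    for (int i = 0; i < n; i++) size += member[i];
    return size;
}
```

The loop ends on the first full round in which no processor adds a node, which corresponds to the master finding its local list unchanged.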

Future Work

Combine the network information in a better way while building the network.

Faster algorithm to find disjoint networks.

Question & Answer

Thank You!

Enjoy your winter break!