Document Data Mining Design Review

Page 1: Document Data Mining Design Review

Document Data Mining Design Review

November 18, 2010

Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips
Advisor: Gregory Donohoe, Ph.D.

Page 2: Document Data Mining Design Review

The Problem

• State Board collects meeting minutes and other documents recording decisions made

• Board members want to retrieve text from old documents that relate to current issues
  – May not recall when issue was discussed
  – May not know exact keywords to search for

Page 3: Document Data Mining Design Review

The Existing Solution

• Currently, all files exist on a large, unorganized shared network drive.

• Finding information recorded in documents requires knowing when it was recorded, and in which document.


Page 4: Document Data Mining Design Review

Requirements / Design Decisions


Page 5: Document Data Mining Design Review

Multiple File Types

• System limited to the more common file types
  – Word documents (.doc, .docx)
  – PDF files (.pdf)
  – Excel (.xls, .xlsx)
  – Text (.txt)

• Lacking
  – WordPerfect (.wpd)
  – PDF files that were scanned in
  – Open Office document types
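
One way to picture the file-type restriction is a lookup keyed on the file extension. The sketch below is in Python for brevity and is only an illustration: the `SUPPORTED` table and the `classify` function are hypothetical names, not part of the team's code. Note that an extension check alone cannot tell a scanned PDF from a text PDF.

```python
from pathlib import PurePath

# Hypothetical table built from the extensions listed on the slide.
SUPPORTED = {
    ".doc": "word", ".docx": "word",
    ".pdf": "pdf",
    ".xls": "excel", ".xlsx": "excel",
    ".txt": "text",
}

def classify(path):
    """Return the document family for a supported extension, or None otherwise."""
    return SUPPORTED.get(PurePath(path).suffix.lower())
```

A file such as `old.wpd` yields `None` here, which an indexer could use to skip it or report it as unsupported.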


Page 6: Document Data Mining Design Review

Multi-User Access

Web Based
• Pros:
  – Information searchable anywhere
  – Only one index required
  – Index on a regular basis without interruption
• Cons:
  – File permissions

Individual User Application
• Pros:
  – Can be programmed to learn user behavior
  – Apply more emphasis to files he/she used before (looks at search history to aid in new searches)
• Cons:
  – Software package installed on each user's machine

Page 7: Document Data Mining Design Review

Search Collection of Documents Efficiently

• Real Time Searching
  – Pros:
    • Easy
    • No initial overhead
  – Cons:
    • Time consuming (> 100,000 words)
    • Unable to find non-exact search results

• Reverse Indexing
  – Pros:
    • Fast and efficient
    • Able to find useful information without exact search text known
  – Cons:
    • Large initial overhead (pre-analyze all documents)
    • Keep index file up to date
    • Storage space necessary

Results displayed in less than a second
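
The reverse index (often called an inverted index) trades one up-front pass over every document for fast lookups afterwards. The following is a minimal Python sketch of that idea, not the team's implementation; `build_index` and `lookup` are hypothetical names, and the tokenizing is deliberately crude.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document names containing it.

    `documents` is a dict of {name: full text}. This is the "large initial
    overhead" step: every document is read once, up front.
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?()\"'")].add(name)
    index.pop("", None)  # drop tokens that were pure punctuation
    return index

def lookup(index, word):
    """Return the sorted names of documents containing `word`."""
    return sorted(index.get(word.lower(), set()))
```

A lookup is a single dictionary access regardless of how many documents were indexed, which is what makes sub-second results plausible. The cost is that the index must be rebuilt or updated whenever files change.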

Page 9: Document Data Mining Design Review

Find Useful Information Without Exact String Specification (A: Stemming)

• Create our own
  – Pros:
    • Pay attention to details that may be lacking in existing algorithms (aglet vs. readable)
    • More efficient
    • Define special cases
  – Cons:
    • Requires a lot of time

• Use existing algorithm
  – Pros:
    • Readily available
    • Spend more time on other important details
  – Cons:
    • Special cases incorrect
    • Some root words are truncated

Page 10: Document Data Mining Design Review

Porter Stemming Algorithm

• Large set of steps based on English Natural Language to determine root of word

• Extensively used in programs

• Outdated: Results not always correct
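
To illustrate what rule-based stemming does, and why its results are "not always correct", here is a deliberately tiny suffix-stripping sketch in Python. It is not the Porter algorithm, which has many more rules and conditions; the `SUFFIXES` list, the `min_stem` threshold, and `simple_stem` are assumptions made up for this example.

```python
# Hypothetical rule list, longest suffixes first so they win over shorter ones.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")

def simple_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

With these rules, "meetings" becomes "meet", but "decided" becomes "decid" and "policies" becomes "polici". Truncated roots like these still work for matching, as long as the index and the query are stemmed the same way.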


Page 11: Document Data Mining Design Review

Find Useful Information Without Exact String Specification (B: Thesaurus)

• Own Model
  – Pros:
    • Fine tune thesaurus to have only relevant terms (terms that exist inside our index file)
  – Cons:
    • Very time consuming and complex

• Using pre-built Thesaurus
  – Pros:
    • Quick and easy to use
    • Very extensive
  – Cons:
    • Has irrelevant search term results
    • Unnecessary terms for State Board
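
A middle path between the two options is to start from a pre-built thesaurus and keep only the synonyms that actually occur in the index. The Python sketch below shows that filtering step under assumed data shapes: a thesaurus as a dict of word to synonym list, and the index vocabulary as a set. `trim_thesaurus` is a hypothetical helper, not something from the project.

```python
def trim_thesaurus(thesaurus, indexed_terms):
    """Keep only synonyms that appear in the index; drop entries left empty."""
    trimmed = {}
    for term, synonyms in thesaurus.items():
        kept = [s for s in synonyms if s in indexed_terms]
        if kept:
            trimmed[term] = kept
    return trimmed
```

Synonyms that never occur in any indexed document cannot produce a hit, so removing them shrinks the search without changing its results.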


Page 12: Document Data Mining Design Review

Searching

• User types in search criteria
  – Determine whether they want Narrow Search results or Broad Search results
    • May retrieve too many results in Broad Search

• Search algorithm converts each typed word into a list of possible stems and synonyms

• Tries all possible permutations of words, trying to find the closest match to the search

• Calculate standard deviation of the distance between all of the words
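
The steps above can be sketched roughly as follows, in Python. This is one possible reading of the slide rather than the team's algorithm: each query word is expanded into candidates, every combination of candidates is tried against the word positions recorded for one document, and the population standard deviation of the chosen positions serves as the closeness measure. The `stems`, `synonyms`, and `positions` dictionaries and both function names are assumptions. Because the spread of positions does not depend on word order, reordered matches are covered without enumerating permutations separately.

```python
from itertools import product
from statistics import pstdev

def expand(word, stems, synonyms):
    """Return the word, its stem, and its synonyms, without duplicates."""
    candidates = [word, stems.get(word, word), *synonyms.get(word, [])]
    return list(dict.fromkeys(candidates))

def best_spread(query, positions, stems, synonyms):
    """Return (spread, terms) for the tightest cluster of matches, or None.

    `positions` maps each term to its word offsets within one document.
    """
    best = None
    candidate_lists = [expand(w, stems, synonyms) for w in query]
    for combo in product(*candidate_lists):
        if not all(term in positions for term in combo):
            continue  # this combination does not fully occur in the document
        for spots in product(*(positions[t] for t in combo)):
            spread = pstdev(spots)  # smaller means the words sit closer together
            if best is None or spread < best[0]:
                best = (spread, combo)
    return best
```

The nested products make the cost grow multiplicatively with the number of query words and candidates, so a real implementation would likely need to prune.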


Page 13: Document Data Mining Design Review

Searching (cont.)

• Each file is ranked based on the number of matches it contains
  – Exact matches rank highest
  – Reordering of exact match is ranked next
  – Stems, synonyms, partial matches, and large spacing between searched words rank lowest

• All rank values found inside a file are summed
• Highest ranked files considered most relevant
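
The ranking rule can be illustrated with a small Python sketch. The numeric weights are invented for the example; the slide gives only the ordering (exact, then reordered, then everything else). `MATCH_WEIGHTS` and `rank_files` are hypothetical names.

```python
# Hypothetical weights; only their relative order comes from the slide.
MATCH_WEIGHTS = {"exact": 10, "reordered": 5, "stem": 2, "synonym": 2, "partial": 1}

def rank_files(matches_by_file):
    """Sum the weight of every match in each file and sort best-first.

    `matches_by_file` maps a file name to a list of match kinds found in it.
    """
    scores = {
        name: sum(MATCH_WEIGHTS[kind] for kind in kinds)
        for name, kinds in matches_by_file.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the values are summed, a file with many weak matches can outrank a file with a single exact match. Whether that is desirable depends on how far apart the weights are set.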

Page 21: Document Data Mining Design Review

Unit Testing


Page 22: Document Data Mining Design Review

Unit Testing

• Benefits
  – Goal
  – Facilitates change

• Limitations
  – Not omnipotent
  – Low cost performance

Page 23: Document Data Mining Design Review

Unit Testing

DocumentTest:

/// Returns the document location
public void getFileLocationTest()
{
    convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
    string expected = "D:\\Class\\test.pdf";
    string actual = converpdf.getFileLocation();
    // Assert.AreEqual takes the expected value first, then the actual value.
    Assert.AreEqual(expected, actual);
}

Page 24: Document Data Mining Design Review

Unit Testing

/// Creates word counts in alphabetical order for all words located inside the PDF
public void createDictionaryTest()
{
    convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
    string toDictionary = "this is test code code code";
    converpdf.createDictionary(toDictionary);
    int actual;
    // "code" appears three times in the input string.
    converpdf.WordCounts.TryGetValue("code", out actual);
    Assert.AreEqual(3, actual);
}

Page 25: Document Data Mining Design Review

End of Semester Status

• Goals:
  – Working, tested prototype
  – Documentation for future teams

• Plenty of areas open for extension or improvement

Page 26: Document Data Mining Design Review

Future Possibilities: File Types

• Currently supported file types
  – Microsoft Word
  – Microsoft Excel
  – PDF

• No optical character recognition

• Our system will allow for easy extension

Page 28: Document Data Mining Design Review

Future Possibilities: Indexing

• We have a relatively simple indexing scheme
• More complex indexing would lead to decreased search time
• Our indexing scheme is very general
  – Could be specific to the State Board
  – Could lead to more relevant results

Page 29: Document Data Mining Design Review

Future Possibilities: Searching

• Search time increases quickly as search terms are added

• Thesaurus is broad
  – Large number of synonyms can slow search
  – Could be trimmed to fit domain

• Porter stemming algorithm could be replaced
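
A rough count shows why search time climbs so fast. Assuming the search tries every combination of stem and synonym candidates and every ordering of the terms, as the Searching slide suggests, the number of variants is the product of the candidate counts times the factorial of the number of terms. The Python function below is a back-of-the-envelope sketch under that assumption, not a measurement of the actual system; `search_space` is a hypothetical name.

```python
from math import factorial, prod

def search_space(candidates_per_term):
    """Count query variants: one candidate per term, in every possible order."""
    return prod(candidates_per_term) * factorial(len(candidates_per_term))
```

With five candidates per word, two words give 50 variants, three give 750, and four give 15,000. This is why trimming the thesaurus has a large effect on search time.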


Page 30: Document Data Mining Design Review

Future Possibilities: Correlation

• Related documents should be correlated
  – By date?
  – Using a tagging system?

Page 31: Document Data Mining Design Review

Future Possibilities: Decision Database

• A client need that is not addressed by our software

• Many board decisions have been passed, with varying lifetimes

• A database could track all board decisions and their lifespans

• Possible connection to our search engine?


Page 32: Document Data Mining Design Review

Future Possibilities: Web-Based Interface

• Software will be installed on each user’s computer

• GUI could be web based, with access restricted to State Board employees

• Users could search from home or while on the road, not just in the office

• Indexing would be simplified


Page 33: Document Data Mining Design Review

Questions?
