Document Data Mining Design Review

November 18, 2010

Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips

Advisor: Gregory Donohoe, Ph.D.

The Problem

• State Board collects meeting minutes and other documents recording decisions made

• Board members want to retrieve text from old documents that relate to current issues
– May not recall when the issue was discussed
– May not know the exact keywords to search for

The Existing Solution

• Currently, all files exist on a large, unorganized shared network drive.

• Finding information recorded in documents requires knowing when it was recorded, and in which document.

Requirements / Design Decisions

Multiple File Types

• System limited to the major file types
– Word documents (.doc, .docx)
– PDF files (.pdf)
– Excel (.xls, .xlsx)
– Text (.txt)

• Lacking support for
– WordPerfect (.wpd)
– PDF files that were scanned in
– Open Office document types

Multi-User Access

Web Based

• Pros:
– Information searchable anywhere
– Only one index required
– Index updated on a regular basis without interruption

• Cons:
– File permissions

Individual User Application

• Pros:
– Can be programmed to learn user behavior
– Applies more emphasis to files the user has used before (looks at search history to aid in new searches)

• Cons:
– Software package must be installed on each user's machine

Search Collection of Documents Efficiently

• Real Time Searching
– Pros:
    • Easy
    • No initial overhead
– Cons:
    • Time consuming (> 100,000 words)
    • Unable to find non-exact search results

• Reverse Indexing
– Pros:
    • Fast and efficient
    • Able to find useful information without exact search text known
– Cons:
    • Large initial overhead (pre-analyze all documents)
    • Must keep index file up to date
    • Storage space necessary

Results displayed in less than a second
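As a minimal Python sketch of the reverse-indexing idea (not the team's implementation), the index can be pictured as a map from each word to the documents and word positions where it occurs. The function names and sample documents below are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to {document name: [word positions]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(position)
    return index


def lookup(index, word):
    """Return the sorted names of the documents that contain the word."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical sample documents, not taken from the State Board's files.
documents = {
    "minutes_a.txt": "The board approved the budget",
    "minutes_b.txt": "Budget review was postponed by the board",
}
index = build_index(documents)
print(lookup(index, "budget"))
```

The up-front cost is the pass over every document in `build_index`; after that, a lookup is a dictionary access rather than a scan of the files.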

Find Useful Information Without Exact String Specification (A: Stemming)

• Create our own
– Pros:
    • Pay attention to details that may be lacking in existing algorithms (aglet vs. readable)
    • More efficient
    • Define special cases
– Cons:
    • Requires a lot of time

• Use existing algorithm
– Pros:
    • Readily available
    • Spend more time on other important details
– Cons:
    • Special cases incorrect
    • Some root words are truncated

Porter Stemming Algorithm

• Large set of rule-based steps, built around English word endings, that reduce a word to its root

• Extensively used in programs

• Outdated: Results not always correct
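To give a feel for the rule-based steps, here is a Python sketch of only the first step (step 1a) of the original Porter algorithm, which handles plural endings. It is an illustration, not the full stemmer and not the team's code; the function name is hypothetical.

```python
def porter_step_1a(word):
    """Apply step 1a of the Porter stemmer (plural suffixes only)."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS, e.g. caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # IES -> I, e.g. ponies -> poni
    if word.endswith("ss"):
        return word       # SS -> SS, e.g. caress -> caress
    if word.endswith("s"):
        return word[:-1]  # S -> (nothing), e.g. cats -> cat
    return word


for word in ["caresses", "ponies", "caress", "cats"]:
    print(word, "->", porter_step_1a(word))
```

The "ponies" case shows the truncation concern: the stem "poni" is not an English word, although it still groups "ponies" with other words that reduce to the same stem.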

Find Useful Information Without Exact String Specification (B: Thesaurus)

• Own Model
– Pros:
    • Fine tune thesaurus to have only relevant terms (terms that exist inside our index file)
– Cons:
    • Very time consuming and complex

• Using pre-built Thesaurus
– Pros:
    • Quick and easy to use
    • Very extensive
– Cons:
    • Has irrelevant search term results
    • Unnecessary terms for State Board
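One middle path between the two options is to take a pre-built thesaurus and keep only the synonyms that occur in the indexed documents. The Python sketch below assumes the thesaurus is a simple word-to-synonyms dictionary; the function name and sample data are hypothetical.

```python
def trim_thesaurus(thesaurus, indexed_words):
    """Keep only the synonyms that occur in the index vocabulary."""
    trimmed = {}
    for word, synonyms in thesaurus.items():
        kept = [synonym for synonym in synonyms if synonym in indexed_words]
        if kept:
            trimmed[word] = kept
    return trimmed


# Hypothetical thesaurus entries and index vocabulary.
thesaurus = {
    "budget": ["funds", "allocation", "purse"],
    "vote": ["ballot", "poll"],
}
indexed_words = {"budget", "funds", "allocation", "vote"}
print(trim_thesaurus(thesaurus, indexed_words))
```

Here "purse" is dropped because it never appears in the index, and the entry for "vote" disappears entirely because none of its synonyms do.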

Searching

• User types in search criteria
– Chooses between Narrow Search results and Broad Search results
    • Broad Search may retrieve too many results

• Search algorithm converts each typed word into a list of possible stems and synonyms

• Tries all possible permutations of the words to find the closest match to the search

• Calculates the standard deviation of the distance between all of the words
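The expansion and permutation steps can be sketched in Python as follows. This is one possible reading of the slide, not the team's algorithm: it assumes each query word has a list of candidate stems and synonyms, and it interprets the distance measure as the standard deviation of the gaps between matched word positions. All names and data are hypothetical.

```python
from itertools import permutations, product
from statistics import pstdev


def candidate_phrases(query_words, expansions):
    """Yield every ordering of every combination of stems and synonyms."""
    options = [expansions.get(word, [word]) for word in query_words]
    for combination in product(*options):
        for ordering in permutations(combination):
            yield ordering


def spacing_spread(positions):
    """Population standard deviation of the gaps between matched words."""
    ordered = sorted(positions)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return pstdev(gaps) if gaps else 0.0


# Hypothetical expansions for a two-word query.
expansions = {
    "budget": ["budget", "funds"],
    "approved": ["approved", "approve"],
}
phrases = list(candidate_phrases(["budget", "approved"], expansions))
print(len(phrases))
print(spacing_spread([3, 4, 5]), spacing_spread([0, 1, 11]))
```

Evenly spaced matches give a spread of zero, while one distant word raises it, so a lower value suggests the matched words sit together as a phrase.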

Searching (cont.)

• Each file is ranked based on the number of matches it contains
– Exact matches rank highest
– Reordering of exact match is ranked next
– Stems, synonyms, partial matches, and large spacing between searched words rank lowest

• All rank values found inside a file are summed

• Highest ranked files considered most relevant
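The summing rule can be sketched in Python as below. The slides give only the ordering of the match types, so the numeric weights here are assumptions chosen for illustration, as are the function name and sample matches.

```python
# Hypothetical weights: the slides specify the ordering, not the values.
MATCH_WEIGHTS = {"exact": 3, "reordered": 2, "partial": 1}


def rank_files(matches_by_file):
    """Sum the weight of every match in a file and sort best-first."""
    scores = {
        name: sum(MATCH_WEIGHTS[kind] for kind in kinds)
        for name, kinds in matches_by_file.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical match results for three files.
matches_by_file = {
    "minutes_a.txt": ["exact", "partial"],
    "minutes_b.txt": ["reordered", "partial"],
    "minutes_c.txt": ["partial"],
}
print(rank_files(matches_by_file))
```

Because scores are summed, a file with many weak matches can outrank a file with a single exact match; whether that is desirable depends on the weights chosen.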

Unit Testing

• Benefits
– Goal
– Facilitates change

• Limitations
– Not omnipotent
– Low cost performance

DocumentTest:

/// Returns the document location
public void getFileLocationTest()
{
    convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
    string expected = "D:\\Class\\test.pdf";
    string actual = converpdf.getFileLocation();
    // Assert.AreEqual takes the expected value first, then the actual value.
    Assert.AreEqual(expected, actual);
}

/// Creates word counts, in alphabetical order, for all words located inside the PDF
public void createDictionaryTest()
{
    convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
    string toDictionary = "this is test code code code";
    converpdf.createDictionary(toDictionary);
    int actual;
    // "code" appears three times in the input string.
    converpdf.WordCounts.TryGetValue("code", out actual);
    Assert.AreEqual(3, actual);
}
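For readers who want to see the behavior this test checks, the following Python sketch counts words and orders them alphabetically. It is an analogy to `createDictionary`, not the team's C# code; the function name and the whitespace-based splitting are assumptions.

```python
from collections import Counter


def create_dictionary(text):
    """Count each word and return the counts in alphabetical order."""
    counts = Counter(text.lower().split())
    return dict(sorted(counts.items()))


word_counts = create_dictionary("this is test code code code")
print(word_counts)
```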

End of Semester Status

• Goals:
– Working, tested prototype
– Documentation for future teams

• Plenty of areas open for extension or improvement

Future Possibilities: File Types

• Currently supported file types
– Microsoft Word
– Microsoft Excel
– PDF

• No optical character recognition

• Our system will allow for easy extension

Future Possibilities: Indexing

• We have a relatively simple indexing scheme

• More complex indexing would lead to decreased search time

• Our indexing scheme is very general
– Could be made specific to the State Board
– Could lead to more relevant results

Future Possibilities: Searching

• Search time increases quickly as search terms are added

• Thesaurus is broad
– Large number of synonyms can slow search
– Could be trimmed to fit domain

• Porter stemming algorithm could be replaced
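A rough Python estimate shows why search time grows so quickly. It assumes the search tries every ordering of every combination of candidates, and that each term expands to five stems and synonyms; both the count of five and the function name are hypothetical.

```python
from math import factorial, prod


def phrase_count(candidates_per_term):
    """Combinations of candidates times orderings of the terms."""
    return prod(candidates_per_term) * factorial(len(candidates_per_term))


# Assume five candidate stems/synonyms per search term.
for terms in range(1, 6):
    print(terms, phrase_count([5] * terms))
```

Under these assumptions the number of phrases to try goes from 5 for one term to 375,000 for five terms, so trimming the thesaurus reduces the base of that growth.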

Future Possibilities: Correlation

• Related documents should be correlated
– By date?
– Using a tagging system?

Future Possibilities: Decision Database

• A client need that is not addressed by our software

• Many board decisions have been passed, with varying lifetimes

• A database could track all board decisions and their lifespans

• Possible connection to our search engine?

Future Possibilities: Web-Based Interface

• Software will be installed on each user’s computer

• GUI could be web based, with access restricted to State Board employees

• Users could search from home or while on the road, not just in the office

• Indexing would be simplified

Questions?
