Finding Similar Files in Large Document Repositories
KDD'05, August 21-24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM.
George Forman, Hewlett-Packard Labs
Kave Eshghi, Hewlett-Packard Labs, [email protected]
Stephane Chiocchetti, Hewlett-Packard France
Presented by Joyce Chen
Agenda
• Introduction
• Method
• Results
• Related work
• Conclusions
Introduction
Millions of technical support documents, covering many different products, solutions, and phases of support.
The content in a new document may duplicate existing documents.
Authors prefer to copy rather than link to content by reference, to avoid the possibility of dead links.
By mistake or limited authorization, copied versions are then not kept up to date.
Solution:
• Use chunking technology to break each document into paragraph-like pieces.
• Detect collisions among the hash signatures of these chunks.
• Efficiently determine which files are related in a large repository.
Method
Step 1: Use a 'content-based chunking algorithm' to break up each file into a sequence of chunks.
Step 2: Compute the hash of each chunk.
Step 3: Find the files that share chunk hashes, reporting only those pairs whose intersection is above some threshold.
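A minimal end-to-end sketch of this three-step pipeline in Python (not the authors' C++/Perl implementation; `tttd_chunks` is the content-based chunker sketched later in this talk, and the 2000-byte threshold is an illustrative choice):

```python
import hashlib
from collections import defaultdict

def find_similar_files(files, min_shared_bytes=2000):
    """Chunk every file, hash the chunks, and report file pairs whose
    shared chunk bytes exceed a threshold (threshold value illustrative).

    `files` maps file name -> byte content; `tttd_chunks` is the
    content-based chunker sketched later.
    """
    chunk_owners = defaultdict(set)   # chunk hash -> files containing it
    chunk_len = {}                    # chunk hash -> chunk byte length
    for name, data in files.items():
        for chunk in tttd_chunks(data):          # step 1: chunk
            h = hashlib.md5(chunk).digest()      # step 2: hash
            chunk_owners[h].add(name)
            chunk_len[h] = len(chunk)
    shared = defaultdict(int)         # (fileA, fileB) -> shared bytes
    for h, owners in chunk_owners.items():       # step 3: intersect
        names = sorted(owners)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                shared[(a, b)] += chunk_len[h]
    return [(a, b, n) for (a, b), n in shared.items() if n >= min_shared_bytes]
```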
Hashing background
Use the ‘compare by hash’ method to compare chunks occurring in different files.
Hashes are short, fixed-size sequences for which it is practically impossible to find two different chunks with the same hash.
The MD5 algorithm is used, which generates 128-bit hashes.
Two advantages over comparing the chunks themselves:
• Comparison time is shorter.
• Being short and of fixed size, hashes lend themselves to efficient data structures for lookup and comparison.
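For illustration, compare-by-hash with Python's standard `hashlib` (the chunk contents here are made up):

```python
import hashlib

chunk_a = b"Restart the print spooler service, then retry the job."
chunk_b = b"Restart the print spooler service, then retry the job."

# Compare fixed-size 16-byte MD5 digests instead of the chunks
# themselves: lookups become cheap hash-table operations, and finding
# two different chunks with the same digest is practically impossible.
print(hashlib.md5(chunk_a).digest() == hashlib.md5(chunk_b).digest())  # True
```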
Chunking
Breaking a file into a sequence of chunks.
Chunk boundaries are determined by the local contents of the file.
Basic Sliding Window Algorithm:
• A pair of pre-determined integers D and r, with r < D.
• A fixed-width sliding window of width W.
• F_k: the fingerprint of the window ending at position k.
• Position k is a chunk boundary if F_k mod D = r.
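A sketch of this algorithm, with a simple polynomial rolling hash standing in for a Rabin-style fingerprint; the W, D, and r values are illustrative, not the paper's settings:

```python
def basic_sliding_window_chunks(data, W=48, D=1000, r=13,
                                base=257, mod=(1 << 61) - 1):
    """Cut `data` (bytes) wherever the window fingerprint F_k mod D == r."""
    if len(data) < W:
        return [data] if data else []
    chunks, start = [], 0
    f = 0
    for b in data[:W]:                     # fingerprint of the first window
        f = (f * base + b) % mod
    pow_w = pow(base, W - 1, mod)          # factor to drop the outgoing byte
    for k in range(W, len(data) + 1):
        if f % D == r:                     # position k is a chunk boundary
            chunks.append(data[start:k])
            start = k
        if k < len(data):                  # slide the window one byte right
            f = ((f - data[k - W] * pow_w) * base + data[k]) % mod
    if start < len(data):
        chunks.append(data[start:])        # trailing chunk
    return chunks
```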
Chunking and file similarity
The desired property of a content-based chunking algorithm:
When two sequences R and R' share a contiguous sub-sequence larger than the average chunk size, there should be a good probability that at least one shared chunk falls within the shared region.
The basic algorithm's chunk sizes vary widely, with no lower or upper bound, which works against this property.
TTTD, the Two Thresholds, Two Divisors algorithm, is used to avoid this problem. It has four parameters:
• D: the main divisor
• D': the backup divisor
• Tmin: the minimum chunk size threshold
• Tmax: the maximum chunk size threshold
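A hedged sketch of TTTD built on the same rolling fingerprint as above; the parameter values are illustrative, not the paper's tuning. The backup divisor D' is smaller and so easier to hit, which usually leaves a backup boundary available when no main boundary appears before Tmax:

```python
def tttd_chunks(data, W=48, D=540, D_backup=270, Tmin=460, Tmax=2800,
                r=13, base=257, mod=(1 << 61) - 1):
    """Two Thresholds, Two Divisors: like the basic algorithm, but no
    chunk is smaller than Tmin or larger than Tmax."""
    if len(data) <= W:
        return [data] if data else []
    chunks, start, backup = [], 0, None
    f = 0
    for b in data[:W]:                     # fingerprint of the first window
        f = (f * base + b) % mod
    pow_w = pow(base, W - 1, mod)
    for k in range(W, len(data) + 1):
        if k - start >= Tmin:              # ignore boundaries below Tmin
            if f % D == r:                 # main-divisor boundary: cut here
                chunks.append(data[start:k])
                start, backup = k, None
            else:
                if f % D_backup == r:      # remember a backup boundary
                    backup = k
                if k - start >= Tmax:      # forced cut at the maximum size
                    cut = backup if backup is not None else k
                    chunks.append(data[start:cut])
                    start, backup = cut, None
        if k < len(data):                  # slide the window one byte right
            f = ((f - data[k - W] * pow_w) * base + data[k]) % mod
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```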
File similarity algorithm
Step 1: Break each file's content into chunks.
For each chunk, record its byte length and its hash code.
The bit-length of the hash code must be sufficiently long to avoid many accidental hash collisions among truly different chunks.
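A small sketch of this per-chunk metadata (the `ChunkMeta` record and `file_chunk_meta` helper are names of my choosing; MD5's 128 bits serve as the 'sufficiently long' hash):

```python
import hashlib
from typing import NamedTuple

class ChunkMeta(NamedTuple):
    length: int      # chunk byte length
    digest: bytes    # 16-byte MD5 hash of the chunk

def file_chunk_meta(data):
    """Step 1 metadata for one file; assumes the tttd_chunks sketch
    above is in scope."""
    return [ChunkMeta(len(c), hashlib.md5(c).digest())
            for c in tttd_chunks(data)]
```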
File similarity algorithm (cont.)
Step 2 (optional, for scalability):
Prune and partition the above metadata into independent sub-problems, each small enough to fit in memory.
Step 3: Construct a bipartite graph
with an edge between a file vertex and a chunk vertex iff the chunk occurs in the file.
File nodes are annotated with their file length; chunk nodes are annotated with their chunk length.
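A sketch of this bipartite graph as two adjacency maps plus the length annotations (the data-structure choices are mine, not the paper's):

```python
from collections import defaultdict

def build_bipartite_graph(meta_by_file):
    """`meta_by_file` maps file name -> list of ChunkMeta from step 1."""
    file_to_chunks = defaultdict(set)   # edges: file -> chunk digests
    chunk_to_files = defaultdict(set)   # edges: chunk digest -> files
    file_len = {}                       # file-node annotation: total bytes
    chunk_len = {}                      # chunk-node annotation: chunk bytes
    for fname, metas in meta_by_file.items():
        file_len[fname] = sum(m.length for m in metas)
        for m in metas:
            file_to_chunks[fname].add(m.digest)
            chunk_to_files[m.digest].add(fname)
            chunk_len[m.digest] = m.length
    return file_to_chunks, chunk_to_files, file_len, chunk_len
```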
File similarity algorithm (cont.)
Step 4: Construct a separate file-file similarity graph.
For each file A:
(a) Look up the chunks AC that occur in file A.
(b) For each chunk in AC, look up the files it appears in, accumulating the set of other files BS that share any chunks with file A. (As an optimization due to symmetry, files that have previously been considered as file A in step 4 are excluded.)
(c) For each file B in set BS, determine its chunks in common with file A, and add A-B to the file similarity graph if the total chunk bytes in common exceeds some threshold or percentage of file length.
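A sketch of step 4 over the maps built above; the absolute and percentage thresholds are illustrative stand-ins for the paper's settings:

```python
from collections import defaultdict

def file_similarity_graph(file_to_chunks, chunk_to_files, chunk_len,
                          file_len, min_bytes=2000, min_fraction=0.5):
    edges, done = {}, set()
    for a, a_chunks in file_to_chunks.items():      # each file plays role A
        shared = defaultdict(int)                   # file B -> shared bytes
        for h in a_chunks:                          # (a) chunks AC of A
            for b in chunk_to_files[h]:             # (b) files sharing them
                if b != a and b not in done:        # symmetry optimization
                    shared[b] += chunk_len[h]
        done.add(a)
        for b, total in shared.items():             # (c) apply thresholds
            if (total >= min_bytes or
                    total >= min_fraction * min(file_len[a], file_len[b])):
                edges[(a, b)] = total
    return edges
```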
File similarity algorithm (cont.)
Step 5: Output the file-file similarity pairs as desired.
A union-find algorithm determines clusters of interconnected files (sketched below).
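A compact union-find sketch for clustering the similarity edges:

```python
from collections import defaultdict

def clusters(edges):
    """Group files into connected components of the similarity graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:                      # union every similar pair
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for x in parent:
        groups[find(x)].append(x)
    return list(groups.values())
```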
Handling identical files
Repositories may contain multiple files with identical content.
The same metadata is used, with a small enhancement:
While loading the file-chunk data, compute a hash over all of a file's chunk hashes.
Maintain a hash table that references file nodes by their unique content hashes.
If a file with the same content has already been loaded,
note the duplicate file name and avoid duplicating the chunk data in memory.
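A sketch of this enhancement (the function names are mine): one whole-content hash per file, computed over its chunk hashes, lets identical files share a single copy of the chunk data:

```python
import hashlib

def content_hash(chunk_digests):
    """Hash of all the chunk hashes: identical content -> identical key."""
    h = hashlib.md5()
    for d in chunk_digests:
        h.update(d)
    return h.digest()

def register_file(fname, chunk_digests, by_content, duplicates):
    """Return True if the file's chunk data should be loaded; False if an
    identical file was seen before (record only the duplicate name).
    `by_content` maps content hash -> first file name loaded with it."""
    key = content_hash(chunk_digests)
    if key in by_content:
        duplicates.setdefault(by_content[key], []).append(fname)
        return False        # avoid duplicating the chunk data in memory
    by_content[key] = fname
    return True
```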
Complexity analysis
The chunking of the files is linear in the total size N of the content.
The similarity analysis is O(C log C), where C is the number of chunks in the repository (including duplicates); the log factor comes from sorting and looking up the chunk hashes.
Since C is linear in N, the overall complexity is O(N log N).
Results
• The chunking algorithm was implemented in C++ (~1200 lines of code).
• Perl was used for the similarity analysis algorithm (~500 LOC), the bipartite partitioning algorithm (~250 LOC), and a shared union-find module (~300 LOC).
• Performance on a given repository varies widely with the average chunk size (a controllable parameter).
• Test corpus: 52,125 technical support documents in 347 folders, comprising 327 MB of HTML content.
• On a 3 GHz Intel processor with 1 GB RAM:
– Chunk size set to 5000 bytes: took 25 minutes and generated 88,510 chunks.
– Chunk size set to 100 bytes: took 39 minutes and generated 3.8 million chunks.
Related work
Brin et al., "Copy detection mechanisms for digital documents":
• Maintains a large indexed database of existing documents.
• Detects whether a new document contains material that already exists in the database.
• It is a 1-vs-N document method, whereas this paper's method is all-to-all.
• Its chunk boundaries are based on the hashes of 'text units' (paragraphs or sentences), which does not handle technical documentation well; this paper uses the TTTD chunking algorithm instead.
Conclusions
The method identifies pieces of content that may have been duplicated across files.
It relies on chunking technology rather than paragraph boundary detection.
The bottleneck is the human attention required to review the many results.
Future work: reducing false alarms and missed detections, and making the human review process as productive as possible.