Australian Document Computing Conference Dec 3 2011
Information Retrieval in Large Organisations
Simon Kravis
Copyright 2010 Fujitsu Limited
FUJITSU CONFIDENTIAL
Large Organisations
• Can’t rely on personal contacts to obtain information
• Have difficulty in storing and retrieving information
• Often use multiple systems for storing information
– Paper files
– Shared filesystems
– Document management systems
  • Intranets (SharePoint)
  • Specialised systems (e.g. TRIM, Documentum, Alfresco)
• Are only interested in Internet-style search to meet legal challenges
Paper files
• Well understood
• Easy to manage
• Can be stored over hundreds of years
• Expensive to store and search
• Most documents now ‘born digital’
Electronic Documents
• Cheap to create, exchange and store in the short term
• Price of powerful applications is poor management
Filesystems
• Files are building blocks of
– Operating systems
– Applications
• Desktop applications commonly store electronic documents as files
• Hardware costs of storage have become very low
• Difficult to model statistically
– many attributes follow power laws (files/folder, file size, subfolders, file types)
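One way to check the power-law claim for an attribute such as files per folder is to estimate the exponent directly from the observed values. The sketch below is illustrative only: it assumes a continuous power law p(x) ~ x^(-alpha) above a cut-off, and the function name and the `x_min` parameter are hypothetical, not taken from the talk.

```python
import math

def power_law_exponent(values, x_min=1.0):
    """Maximum-likelihood estimate of alpha for p(x) ~ x**(-alpha), x >= x_min.

    Assumption: a continuous power law; for small integer counts this is
    only an approximation.
    """
    tail = [v for v in values if v >= x_min]
    log_sum = sum(math.log(v / x_min) for v in tail)
    if log_sum == 0:
        raise ValueError("need at least one value above x_min")
    return 1.0 + len(tail) / log_sum
```

Feeding it the number of files in each folder of a volume gives a single exponent that can be compared between volumes; a heavy tail is also why averages such as "mean files per folder" describe these filesystems poorly.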
Why shared filesystems?
• Cheap & simple
• Access to documents from different computers
• Support collaborative work
Shared Filesystem Organisation
• Multiple volumes, often based on organisational structure
• Tree structure of folders and files
• User and Group areas
• Permissions based on user ID and group membership
• Higher levels of folder trees usually controlled by administrators
Are shared filesystems unstructured?
• Folder tree represents a high degree of structure created by users
– Local but not global consistency
– Users structure folder trees to facilitate their own work
– Structures are usually highly efficient information stores
• Small survey of users in an IT service company in 2005 showed that only 1 user out of 12 had spent more than 15 mins/day looking for files on share drives over the past week
Filesystem volume growth & effect of quotas
[Figure: Finance sector file server growth over 2 years. x-axis: date, 01-Mar-05 to 01-Mar-07; y-axis: Vol (GBytes), 0 to 16,000]
• 3000 users, 90 volumes
• Basically linear with small acceleration
• Linear component = 190 GBytes/month
• 600 MBytes/month/user
• Growth acceleration = 7 GBytes/month²

[Figure: Transport organisation file server growth over 4 years. Series: Before Quotas, After Quotas. x-axis: date, 11/2003 to 11/2007; y-axis: User and Group Vol (MBytes), 0 to 16,000]
• 22,000 users, 328 user and group volumes
• Quadratic fit to cleaned data before quotas
• Linear component = 160 GBytes/month
• 7 MBytes/month/user
• Growth acceleration = 0.07 GBytes/month²
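The "linear component" and "growth acceleration" figures come from a quadratic fit to volume against time. A minimal sketch of such a fit is given below, using only the standard library. It is not the tool used for the slides: the function names are hypothetical, and it assumes that "acceleration" means the second derivative of the fitted curve (twice the quadratic coefficient), which the slides do not state.

```python
def fit_quadratic(ts, vs):
    """Least-squares fit of v = c0 + c1*t + c2*t**2; returns (c0, c1, c2)."""
    # Normal equations: power sums of t on the left, moments of v on the right.
    s = [sum(t ** k for t in ts) for k in range(5)]
    rhs = [sum(v * t ** k for t, v in zip(ts, vs)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [rhs[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x4 augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        known = sum(a[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = (a[i][3] - known) / a[i][i]
    return tuple(coeffs)

def growth_summary(months, volumes):
    """Linear growth rate and acceleration (assumed to be 2*c2) of the fit."""
    _c0, c1, c2 = fit_quadratic(months, volumes)
    return {"linear_per_month": c1, "acceleration_per_month2": 2 * c2}
```

With `months` counted from the first sample and `volumes` in GBytes, `growth_summary` returns GBytes/month and GBytes/month². Dividing the linear rate by the number of users gives the per-user figure: for the transport organisation, 160 GBytes/month over 22,000 users is roughly 7 MBytes/month/user.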
Volume and count profiles (Financial Services)
[Figure: Volume Profile for 11 TBytes of Data. Series: Vol, Count; y-axis: 0% to 60%; file types shown: < 2%, xls, mdb, prj, doc, TXT, zip, pst, nsf, pro, csv, DBF, pdf]
[Figure: Count Profile for 21 Million Files. Series: Vol, Count; y-axis: 0% to 60%; file types shown: < 2%, doc, xls, (blank), txt, lnk, htm, pdf, prj, gif, CSV, jpg, A, png]
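Profiles like these can be produced by walking a volume and totalling file count and bytes per extension. The following is a sketch under stated assumptions rather than the tool behind the charts: the function names are hypothetical, extensions are folded to lower case (the charts appear to keep case distinct), and types below a 2% share are lumped together as in the charts.

```python
import os
from collections import Counter

def extension_profile(root):
    """Return (counts, volumes): files and bytes per extension under root."""
    counts, volumes = Counter(), Counter()
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            ext = os.path.splitext(name)[1].lower().lstrip(".") or "(blank)"
            try:
                size = os.path.getsize(os.path.join(folder, name))
            except OSError:
                continue  # unreadable or vanished file: skip it
            counts[ext] += 1
            volumes[ext] += size
    return counts, volumes

def as_percent_shares(totals, minor_threshold=2.0):
    """Convert totals to percentages, lumping small types into one bucket."""
    grand_total = sum(totals.values())
    shares = {}
    if grand_total == 0:
        return shares
    minor_label = f"< {minor_threshold:g}%"
    for key, value in totals.items():
        pct = 100.0 * value / grand_total
        if pct < minor_threshold:
            shares[minor_label] = shares.get(minor_label, 0.0) + pct
        else:
            shares[key] = pct
    return shares
```

Calling `as_percent_shares` once on the counts and once on the volumes gives the two profiles side by side, which makes it easy to see types that are numerous but small or rare but bulky.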
File Size and Count Profile
• Size range covers 5 orders of magnitude
• 50% of volume used by 3% of files
[Figure: Size Histogram, all files. Series: Count, Vol, Cum %; y-axis: % (0 to 100); x-axis: file size in 15 logarithmic bins from 46.4 - 100.0 KBytes to 2.1 - 4.4 GBytes]
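Both observations on this slide can be reproduced from a plain list of file sizes. The sketch below is illustrative: the function names are hypothetical, and the choice of three bins per decade is an assumption read off the bin edges of the histogram.

```python
import math

def log_size_histogram(sizes, bins_per_decade=3):
    """Map bin index k to (count, volume); bin k covers sizes from
    10**(k/bins_per_decade) up to 10**((k+1)/bins_per_decade) bytes."""
    hist = {}
    for size in sizes:
        if size <= 0:
            continue  # empty files have no logarithm; ignore them here
        k = math.floor(math.log10(size) * bins_per_decade)
        count, volume = hist.get(k, (0, 0))
        hist[k] = (count + 1, volume + size)
    return hist

def fraction_of_files_holding(sizes, volume_share=0.5):
    """Smallest fraction of files, largest first, that holds volume_share
    of the total volume."""
    ordered = sorted(sizes, reverse=True)
    target = volume_share * sum(ordered)
    running = 0
    for n, size in enumerate(ordered, start=1):
        running += size
        if running >= target:
            return n / len(ordered)
    return 0.0
```

`fraction_of_files_holding(sizes)` is the quantity behind a statement such as "50% of volume used by 3% of files"; it is a quick indicator of how far storage clean-up can get by looking only at the largest files.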
Why filesystems are like poorly sorted soil
Most of volume taken up by large particles
Duplication by count and volume
Volume and count spectra usually different – vol savings seldom > 20% from de-duplication
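Exact duplication can be measured by hashing file contents and comparing what would remain if only one copy of each distinct content were kept. This is a minimal sketch, not the method used for the slides; the function names are hypothetical and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_savings(records):
    """records: iterable of (digest, size_in_bytes), one entry per file.

    Returns (count_saving, volume_saving) as fractions of the totals if
    only one copy of each distinct content were kept.
    """
    total_count = total_volume = 0
    unique = {}
    for digest, size in records:
        total_count += 1
        total_volume += size
        unique.setdefault(digest, size)
    if total_count == 0:
        return 0.0, 0.0
    count_saving = 1 - len(unique) / total_count
    volume_saving = 1 - sum(unique.values()) / total_volume if total_volume else 0.0
    return count_saving, volume_saving
```

The two fractions can differ widely: many small duplicated files give a large saving by count but little by volume, which is consistent with the observation that volume savings from de-duplication are usually modest.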
File Use Profiles – 6500 accesses to 3.5 million files over 21 days by 145 users
• 2 accesses per user per day
• About 3 read accesses for every modification
• Files on share drives not frequently shared between users
• Files accessed many times by many users are applications
[Figure: file use profile; axes: Users, Files (logarithmic, 1 to 10,000), Accesses]
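The headline rate follows from the slide title: 6500 accesses / (145 users × 21 days) ≈ 2.1 accesses per user per day. The other figures could be derived from an access log along the following lines. This is a sketch only: the `(user, path, op)` record layout and the function name are assumptions, not the format of the audit data behind the slide.

```python
from collections import defaultdict

def access_summary(events, days):
    """events: iterable of (user, path, op) with op 'read' or 'modify'."""
    users_by_file = defaultdict(set)
    users = set()
    reads = modifications = 0
    for user, path, op in events:
        users.add(user)
        users_by_file[path].add(user)
        if op == "modify":
            modifications += 1
        else:
            reads += 1
    total = reads + modifications
    # A file counts as shared when more than one distinct user touched it.
    shared = sum(1 for u in users_by_file.values() if len(u) > 1)
    return {
        "accesses_per_user_per_day": total / (len(users) * days) if users and days else 0.0,
        "reads_per_modification": reads / modifications if modifications else float("inf"),
        "shared_file_fraction": shared / len(users_by_file) if users_by_file else 0.0,
    }
```

A low `shared_file_fraction` is what "not frequently shared between users" looks like in the data.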
Text Documents in Large Organisations
• Mainly created by desktop applications (Office)
• Usually comprise 15-20% of file count, 10-15% of volume
• Collections used by different parts of the organisation
• Small collections often very intensively used
– Collateral for service companies
Duplication in 12,00 text documents from software development project
[Figure panels: Exact; Near (Document Vector Comparison)]
Similar cluster spectra for 40,000 text documents from Govt. Department
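Document vector comparison can be illustrated with term-frequency vectors and cosine similarity. The slides do not say how the vectors were generated or matched, so everything below is an assumption made for illustration: the tokenisation, the similarity measure, the function names and the 0.9 threshold are all hypothetical.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Bag-of-words term frequencies (lower-cased alphanumeric tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def near_duplicate_pairs(docs, threshold=0.9):
    """docs: mapping of name to extracted text. Returns pairs of names
    whose similarity reaches the (hypothetical) matching level."""
    vectors = {name: term_vector(text) for name, text in docs.items()}
    names = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine_similarity(vectors[a], vectors[b]) >= threshold
    ]
```

The all-pairs loop is quadratic in the number of documents, so it only suits small collections. A bag-of-words vector also ignores word order and anything that is not text, so two documents with the same few words but different pictures score as identical.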
Evaluating Measures of Near-Duplication
• Very large parameter space to test
– Document vector generation, matching algorithm, matching level
• False positives detected by sampling clusters
• Very difficult to detect false negative clustering
– Do documents with similar names have similar content?
– Trigram matching – very compute-intensive
• Most clusters are versions of documents
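Trigram matching can be sketched as a set comparison. The slide does not say whether character or word trigrams were used; the example below assumes character trigrams and Jaccard similarity, and the function names are hypothetical.

```python
def char_trigrams(text):
    """Set of 3-character substrings after lower-casing and collapsing whitespace."""
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + 3] for i in range(len(normalised) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets (0.0 if both are empty)."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

Every document yields roughly as many trigrams as it has characters, and every pair of documents needs a set intersection, which is where the compute cost comes from. Applied to file names instead of content, the same function gives a cheap first test of whether similarly named documents are worth comparing in full.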
Example of correct clustering
10 versions of the same file, all in same folder
Example of incorrect clustering
RfA Diagram2.rtf
UI navigation diagrams 010210.RTF
Same 3 words – different pictures
Information Retrieval by Search for Internal Collections
• Few or no hyperlinks
• Composite documents are common
• Documents frequently have implicit content
• High level of near duplication
• Search terms are often commonly occurring words or phrases
-> Poor search results when compared to Internet search
• Users prefer to ask people or browse
Is tagging the answer?
Sparse access means that common tags don’t emerge
[Figure: file use profile; axes: Users, Files (logarithmic, 1 to 10,000), Accesses]
What might help?
• Automated tagging
– Training sets
– Synonym groups
– Learning required to adapt to rapidly changing vocabulary
• Extraction of document headings & captions
– “Find a good paragraph on reporting capability”
• Clustering of similar documents
– “Find the most recent version of this document” is a very common requirement
• Using a document management system with version control
– Presence of a capability doesn’t mean it will be used
– Cluster spectra of documents in DMS very similar to filesystem for software development docs
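Once similar documents have been clustered, the "most recent version" request reduces to picking one member per cluster. A minimal sketch, assuming the cluster is a collection of file paths and that filesystem modification time is an acceptable proxy for recency (copying or restoring a file can change it); the function name is hypothetical.

```python
import os

def most_recent_version(cluster, modified_time=os.path.getmtime):
    """Return the member of a near-duplicate cluster with the latest
    modification time, or None for an empty cluster.

    modified_time maps a member to a sortable timestamp; it defaults to the
    filesystem modification time and can be replaced, for example with a
    lookup into document metadata.
    """
    members = list(cluster)
    if not members:
        return None
    return max(members, key=modified_time)
```

Passing a different `modified_time` function lets the same selection run over a document management system export or any other source that records a date per document.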