Managing Shared Content Part 1 - Susan J. Sullivan, CRM; NARA Part 2 - Tim Shinkle, Gimmal 1.

34
Managing Shared Content Part 1 - Susan J. Sullivan, CRM; NARA Part 2 - Tim Shinkle, Gimmal 1

Transcript of Managing Shared Content Part 1 - Susan J. Sullivan, CRM; NARA Part 2 - Tim Shinkle, Gimmal 1.

1

Managing Shared Content

Part 1 - Susan J. Sullivan, CRM; NARAPart 2 - Tim Shinkle, Gimmal

2

Target Outcomes

Knowledge of:• How NARA combined people and technology to meet

the shared drive cleanup challenge, and • How cleaning up shared drives can lead to efficient

management of all electronic content. • How people interact with Indexing and

Categorization (I&C) Management Tools to manage content

• Start thinking how you can do this in your organization

3

First meeting with the CIO as Records Officer

Clean up those shared drives!

ARMA is in Florida this

year.

*This is a dramatization

4

A bit of luck at ARMA

ARMA 2009

- How to Clean Shared Drives

May I have your card?

*This is a dramatization

5

Services Contract• Award winning RM experts• Approach rooted in records management• Appliance with crawlers• Data gathering period, pre-crawls• Pilot project, draft rules – cleanup pilot• Information Management Plan• Communications Plan (posters wiki,

meetings)• Met and worked with 36 SMEs, crawled

and reported, validated, cleaned content

6

Overall Process

Business Assessment Content Assessment

Information Management Plan

Human Validation

Cleanup

Policies

Litigation Preservation

Organization Migration

7

People + Tools = CleanupPeople Technology

Where is the content? Pre-crawl for overview

What to protect / preserve? Find /lock preservation content

Develop and validate auto-cleanup rules (.tmp, backup, ~)

Find / report on auto-cleanup content

Approve auto-clean rules Auto-clean (2% moved)

Develop rules for content to be reviewed before cleaning or moving (large, old, marked “draft, old, trash”, music, video, audio)

Find / report on reviewable content

Review and identify for cleanup Cleanup or move individual or groups of files

Develop data integrity policies, develop and communicate cleanup maintenance rules and policies

Cleanup / identify continually

8

What is eTrash

• Non-business value, non-records– Temporary files– Obsolete applications– Files with zero content– Un-openable/un-readable content

• Not so important content that is not FOIA, Litigation related, retention scheduled and is:– Duplicates– Large– Voluminous– Obsolete

Content identified for auto-clean

• System generated temporary files • System generated backup files• Zero Content files and folders• Abandoned applications• Obsolete install files• System status reporting files

9

10

Content reviewed for cleanup• Old files• Large files• Large duplicate

archive• Identified old files• Compressed file

duplicates• Renditions

convenience copies

• Non accessed drafts older than 1 year

• Human identified backup files

• Human-identified as delete-able content

• Non-business images music or audio video or media

• Office Document Duplicates

11

Content Identified for Risk

• PII and eDiscovery– Litigation hold files– Credit card numbers– Social security numbers

A new way of viewing content

12

13

Outcomes• Shares cleaned or under review • Bulletin issued - NARA Bulletin 2012-02, December 06, 2011;

Guidance on Managing Content on Shared Drives [http://www.archives.gov/records-mgmt/bulletins/2012/2012-02.html]

• Inputs to Presidential Memorandum -- Managing Government Records, November 28, 2011, [http://www.whitehouse.gov/the-press-office/2011/11/28/presidential-memorandum-managing-government-records

]• Policies identified

– cleaning shares, – managing content on shared drives

• Most importantly…..clear path forward

14

Data Management Policies

• Reliability and content – Preserving file attributes• Context – Preserve file path, include recordkeeping

metadata within user profiles• Usability

– Align storage with usage, preservation and access needs. Widely accessible storage location for high value content

– Use robust search capability eliminate complex hierarchical folder structures.

• Authenticity - only authorized users can add, change and delete content for a specified data set.

15

Benefits

• Storage space – when cleaned all identified, can provide upwards of 47% in reclaimed storage.

• Money - To maintain 27 TB clean per year, save $187K per year (shares and email).

• Time – Staff time culling through volumes of content to find or migrate their information (FOIA – E-discovery)

• Access efficiencies – Right information, right place right time

16

Enterprise Value of Cleaning

• If users are to migrate their own content to a target system (e.g., new share or ECRM), they decide:1. Should it migrate (30 seconds)2. How should it be tagged (30 seconds)

• If 55% of 10TB should not be migrated:– There are 16.8M documents (at 360KB/DOC)– 140K hours of wasted decision making– $10.4M of staff time (at $75 per hour)

17

Next Opportunity: Align management with value

Permanent

Mid to long term retention

Short term retention

E-trash

18

Beyond cleanup – Auto-class

• Keyword/Boolean - group• Concept Search

– Ontology (file plan) based– Analyze word/phrase

relationships (Bayesian)• Clustering (Bayesian)

– Auto-classifier– Near duplicates– Predictive coding

• Learning by example

19

People will still have a role

• Connect with business owners• Develop / enforce file integrity & metadata policies • Locate and categorize content

– Develop rules Boolean / Bayesian / Train tools by example– Run & review crawls and refine rules

• Determine and assign value (metadata)– Search legal terms, financial terms, health terms– How often accessed, how often duplicated?

• Develop / enforce storage / management policies according to value

• Manage by exception

20

Vision of future

Workers store

content

Crawler indexes content nightly

Maintain current item-level,

categorized index

Tweek indexcrawler, learns

I&C Tools

21

Retention Bell CurveGranular retention schedules

Big bucketretentionschedules

Granular retention schedules

Paper-basedfile

plans

Large categories by business function /

creator role

Indexed and Categorized at

item level

I&C Tools

22

Vision Evolution

• Get clean and lean– Mature cleanup program– Cleanup and network policies enforced– Comfortable with crawlers and rules

• Develop rules to index at various value levels– Transitory, 3 year, 7 year, 10 year, etc.– Permanent

• Group, cluster, train, metadata extraction, migrate, and tag with retention

• Go granular as needed• Crawl nightly, manage by exception

23

Automated Policies• Run continuously

– Cleanup policies for files that are no longer needed– Security risk policies for files meeting PII criteria– Legal policies for files meeting legal hold criteria– IT drive management policies for databases, multimedia, images,

applications, web content and other categories– Lifecycle management policies for

• inactive files• draft files• active files• records• reference material

– Content integrity policies for abandoned files– Migration policies for files that require migration to target systems

24

I&C Tools

• Indexes & categorizes content according to rules

• Clusters content around trends (retention)• Ingests content samples and learns.• More crawling = more learning

– Manage by exception = more learning• Act upon content (move, lock)

Part 2

Tim Shinkle

Professional Services Approach

• Phase 0 – Source repository assessment• Phase I – Program setup• Phase II – Pilot• Phase III – Enterprise Transformation

Phase 0 – Source Repository Assessment

• Perform preliminary analysis, e.g., Shares, SMEs, business processes

• Scan a percentage of enterprise shares that represent a good cross section of the data

• Produce an assessment report and plan moving forward– Percentage of candidate eTrash, migrate, relocate, review leave– PII analysis– Metadata extraction– Classification and categorization– Strategy and approach– Expected cost savings and benefit analysis

Phase I – Program setup

• Meet with sponsors• Identify scope (shares, groups, SMEs)• Develop communication plan• Stand up program communications portal (e.g.,

corporate portal, wiki)• Identify pilot group(s) (shares and SMEs)• Meet with RM and Legal to identify candidate

records, holds and PII• Approve policies, cleanup and classification rules

Phase II – Pilot

• Interview pilot group• Perform analysis and refine rules• Crawl and report on pilot group shares • Review results with SMEs and move approved files to

cleanup location• Refine global cleanup and classification rules and

update published policies• Develop benefit analysis report and enterprise

strategy• Publish updates to program portal (wiki)

Phase III – Enterprise Transformation

• Outline schedule, project plan and group priority• Update communications on portal (e.g., wiki)• Repeat for each group

– Meet, refine share mapping to SMEs and perform high level analysis, size and scope

– Crawl shares, generate reports and publish for review– Meet with SMEs, review results and start the clock– Move approved files to temporary storage location– Perform any additional migration and transformation based on

classification, categorization and metadata extraction of results– Repeat until completed

• Publish final reports, develop and implement strategy for on going policy information governance automation

Technology Demonstration

Example Data

Classification Best Practices

• Create categories that can be leveraged by machines (e.g., Put like content in like containers, e.g., expense reports)

• People can provide context that is not found in the content (e.g., the folder name is a project number where anything to do with the project is dumped into the folder, such as proposal, meeting notes, deliverables, reference material, directions, local hotels, expenses, etc...)

• Folders can only sometimes tell you what can be found in the folder, other times it cannot (e.g., \Invoices vs. \Misc)

Classification Best Practices

• File names are mixed and matched (e.g., \Proposals\Proposal XYZ.doc, \Meeting notes\Proposal XYZ.doc)

• Don’t confuse how individuals want to organize their content with the classification of the content

• The more the information is shared, the more it needs to be generally classified (big buckets)

• Combine algorithm extracted topics and identified classifications with subject matter expert search rules

Q&A

• Susan Sullivan, CRMDirector - Corporate Records Management National Archives and Records [email protected]: (301) 837-2088

• Tim ShinkleDirector, Gimmal [email protected](703) 927-5650