Welsh Government Workshop
Transcript of Welsh Government Workshop
![Page 1: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/1.jpg)
Abaca: Technically Assisted Sensitivity Review of Digital Records
![Page 2: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/2.jpg)
Agenda
● Transferring of Records to Archives
● The Digital Problem
● The Abaca Project
● Abaca Classifier Experiment
● The Test Collection
● The Abaca Project - Where Next?
● Break-Out Group Session
● Groups Discussion
![Page 3: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/3.jpg)
Transferring of Records to Archives
● Department selects and appraises records for permanent preservation
  – In paper, about 5% of output selected; digital may rise to 20%
● Prior to transfer, department must complete sensitivity review
  – Paper review is well understood
  – Digital presents many new challenges and is not so well understood
● Hence our research!
![Page 4: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/4.jpg)
The Digital Problem
● The file has gone
● Volume will increase
  – The way business is done has changed
  – Largely unstructured despite EDRMs
● Big transfers of departmental records
● Appraisal
  – Separate issue not addressed today
● Precautionary closure
  – Need to research a solution
● Not unique to public records
![Page 5: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/5.jpg)
Our Approach
● Provide a Framework of Utilities ...
  – to assist the Review Process
● Need Methods ...
  – that respect the reality of Digital Records in all their “Glory”
  – that can be tailored to specific circumstances
● Need tools ...
  – to help reviewers be more productive
![Page 6: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/6.jpg)
The Abaca Project
● Research to show that utilities will help
● Two Phases
  – Proof of Concept (In Progress)
  – Full Project (Seeking external funding)
● Today we are describing our proof-of-concept work
● Abaca: Technically Assisted Sensitivity Review of Digital Records
![Page 7: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/7.jpg)
Abaca Classifier Experiment
● Overview of the Task & Approach
● Predicting Exemptions using a Classifier
  – Features
  – Types of Features
● Example Sensitive Document
● Research Question
● Overview of Classification
● Evaluation Methodology
● Results
![Page 8: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/8.jpg)
The Task
Produce a classifier that can predict the presence of sensitive material within unstructured text.
Initially focusing on two FOIA sensitivities:
● Section 27: International Relations
● Section 40: Personal Information
![Page 9: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/9.jpg)
Approach
Manually review sensitive data to create a test collection.
Split test collection into training and test sets.
Train a classifier to predict the sensitivities in documents using the set of identified features.
Test the classifier on previously “unseen” documents.
Measure classification success.
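
The steps above can be sketched end to end in a few lines. This is only an illustration: the toy collection, the 25% test fraction and the nearest-centroid classifier are assumptions made for the example, not the classifier or data used in the Abaca experiment.

```python
import random

def split_collection(docs, test_fraction=0.25, seed=0):
    """Shuffle the labelled documents and split them into training and test sets."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_centroids(training):
    """Learn one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for vector, label in training:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(model, vector):
    """Predict the label whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda label: distance(model[label]))

# Hypothetical toy collection of (feature vector, is_sensitive) pairs.
collection = [
    ([float(i % 2), 1.0 + (i % 2) * 3 + (i % 3) * 0.1], bool(i % 2))
    for i in range(20)
]

training, test = split_collection(collection)   # split into training and test sets
model = train_centroids(training)                # train on the training set only
correct = sum(predict(model, vec) == label for vec, label in test)  # test on unseen documents
print(f"{correct} of {len(test)} unseen documents classified correctly")
```

Keeping the test documents out of training is what makes the final count a fair estimate of how the classifier behaves on records it has never seen.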
![Page 10: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/10.jpg)
Predict Exemptions Using a Classifier
[Diagram: two pipelines. Training: Feature Extraction (using External Resources) → Learn Classifier → Learned Model. Prediction: Feature Extraction (using External Resources) → Run Classifier (using the Learned Model) → Predictions. In both pipelines, features are represented as real numbers and documents are represented as feature vectors.]
![Page 14: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/14.jpg)
Features
Document features, such as the words it contains or the entities it references, convey information about a document.
A document can be modelled by using a statistical representation of its features.
We use external knowledge bases, Natural Language Processing and semantic analysis to better understand the document features.
The classifier recognises patterns in the documents’ feature sets and uses them for prediction.
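
One way to picture a "statistical representation" is a tf/idf weighted term vector. The sketch below uses one common tf/idf variant (relative term frequency times the log of inverse document frequency) and whitespace tokenisation. Both are assumptions for illustration, as are the three sample sentences; the slides do not say which variant the project used.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one tf/idf weighted vector per document."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(tokenised)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for an empty document
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

# Hypothetical sample documents.
docs = [
    "minister met the ambassador",
    "the minister approved the budget",
    "the ambassador sent the report",
]
vocabulary, vectors = tfidf_vectors(docs)
for term, weight in zip(vocabulary, vectors[1]):
    print(f"{term:<12} {weight:.4f}")
```

A term that occurs in every document, such as "the" here, gets a weight of zero, while a term confined to one document gets the largest weight, so each vector emphasises what is distinctive about its document.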
![Page 15: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/15.jpg)
Types of Features
The features we use can be divided into three main categories.

| Feature Type | Examples | Comments |
|---|---|---|
| Structure | Lists of Words (tf/idf); Document Length; Number of Recipients | Ubiquitous throughout the collection. Can expose patterns in document types. High value information about the nature of the communication. |
| Content | Subjectivity; Verbs; “D.O.B”; Negation | By applying techniques such as Natural Language Processing and dictionary-based term matching, we can identify the tone of the communication. |
| Entities | Countries; People; Organisations | Tells us what the document “is about”. Context related to the entity, such as a “high-risk” country or a “significant” person or role, can suggest sensitivity likelihood. |
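
As a rough illustration of how features from the three categories might become numbers, the sketch below builds a small feature vector with dictionary-based term matching. The word lists, the regular expression and the function name `extract_features` are all hypothetical; a real system would draw on external knowledge bases and proper entity recognition.

```python
import re

# Hypothetical, illustrative resources, not the project's actual dictionaries.
COUNTRIES = {"france", "japan", "brazil"}
NEGATIONS = {"not", "no", "never"}
DOB_PATTERN = re.compile(r"\b(?:d\.o\.b\.?|date of birth)", re.IGNORECASE)

def extract_features(text, recipients):
    """Map one document to a vector of real numbers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [
        float(len(tokens)),                          # structure: document length in tokens
        float(len(recipients)),                      # structure: number of recipients
        float(len(DOB_PATTERN.findall(text))),       # content: "D.O.B" mentions
        float(sum(t in NEGATIONS for t in tokens)),  # content: negation count
        float(sum(t in COUNTRIES for t in tokens)),  # entities: country count
    ]

vector = extract_features(
    "The applicant's D.O.B is not recorded. Visit to France confirmed.",
    recipients=["a@example.org", "b@example.org"],
)
print(vector)
```

Each position in the vector has a fixed meaning, so vectors from different documents can be compared directly by a classifier.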
![Page 18: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/18.jpg)
Research Question:
Can we produce a classifier that can predict the presence of sensitive material within unstructured text?

Measure:
Balanced Accuracy: the arithmetic mean of the true positive rate and the true negative rate, with random = 0.5000

Test Collection:

| | Count |
|---|---|
| Total Documents | 1849 |
| Total Section 27 | 208 |
| Total Section 40 | 142 |
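
A short sketch shows why balanced accuracy suits a collection where sensitive documents are a small minority. It uses the slide's figures of 1849 documents with 208 marked Section 27; the "always not sensitive" classifier is a hypothetical worst case introduced only for the comparison, and the function assumes both classes are present.

```python
def balanced_accuracy(actual, predicted):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    positives = sum(actual)
    negatives = len(actual) - positives
    return 0.5 * (tp / positives + tn / negatives)

def accuracy(actual, predicted):
    """Plain accuracy: the fraction of documents labelled correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Collection sized like the slide's figures: 1849 documents, 208 sensitive.
actual = [True] * 208 + [False] * (1849 - 208)
# Hypothetical classifier that never flags anything as sensitive.
always_negative = [False] * len(actual)

print(round(accuracy(actual, always_negative), 4))           # 0.8875
print(round(balanced_accuracy(actual, always_negative), 4))  # 0.5
```

Plain accuracy rewards the useless classifier for the class imbalance, whereas balanced accuracy puts it at 0.5, the same level as random guessing.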
![Page 19: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/19.jpg)
Overview of Classification
[Diagram: the Test Collection is divided into training data and unseen data. Learn Classifier on training data → Learned Model → Run Classifier on unseen data → Predictions.]
![Page 20: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/20.jpg)
Evaluation Methodology
[Diagram: Test Collection Assessor Judgements and Classifier Predictions → Statistical analysis → Results.]
![Page 21: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/21.jpg)
Results
By adding features to a tf/idf text classification baseline, we see noticeable improvement in both Section 27 and Section 40 predictions.
But there is still much work to be done!

| Features | Balanced Accuracy (s27) | Balanced Accuracy (s40) |
|---|---|---|
| Text Classification | 0.6327 | 0.6344 |
| + Source Count | 0.6369 | 0.6303 |
| + Country Count | 0.6453 | 0.6406 |
| + Country Risk Score | 0.6417 | 0.6368 |
| + DOB Score | 0.6327 | 0.6391 |
| + Negation Score | 0.6378 | 0.6382 |
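
To read the size of each change, the reported figures can be compared with the text classification baseline. The numbers below are copied from the results table. The slide does not say whether the "+" rows are cumulative, so this sketch only compares each row with the baseline and makes no claim about how the features were combined.

```python
# Balanced accuracy figures from the results table, as (s27, s40) pairs.
baseline = (0.6327, 0.6344)
with_feature = {
    "Source Count": (0.6369, 0.6303),
    "Country Count": (0.6453, 0.6406),
    "Country Risk Score": (0.6417, 0.6368),
    "DOB Score": (0.6327, 0.6391),
    "Negation Score": (0.6378, 0.6382),
}

def change_from_baseline(scores, base):
    """Difference between a row and the baseline, rounded to four decimal places."""
    return tuple(round(score - ref, 4) for score, ref in zip(scores, base))

changes = {
    name: change_from_baseline(scores, baseline)
    for name, scores in with_feature.items()
}
for name, (d27, d40) in changes.items():
    print(f"{name:<20} s27 {d27:+.4f}  s40 {d40:+.4f}")

best_s27 = max(changes, key=lambda name: changes[name][0])
print("Largest s27 change:", best_s27)
```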
![Page 23: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/23.jpg)
Test Collection - Aims
● To provide sensitivity judgements and training data to develop and measure tools
● To measure and understand assessors’ behaviour
![Page 26: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/26.jpg)
Test Collection - Measurements
● Time
● Agreement of sensitivity
  – Not previously studied
● Hard Judgements
● Identify borderline cases
● Sensitivity sub-categories
  – Good indicator for features
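
Agreement between assessors can be quantified in several ways. The sketch below uses Cohen's kappa, a widely used chance-corrected agreement statistic for two raters. It is offered only as an example of such a measure: the slides do not state which statistic the project used, and the judgements shown are hypothetical.

```python
def cohens_kappa(judgements_a, judgements_b):
    """Chance-corrected agreement between two assessors' binary judgements.

    Assumes the assessors do not both assign the same single label to every
    document, since expected agreement would then be 1 and kappa undefined.
    """
    n = len(judgements_a)
    observed = sum(a == b for a, b in zip(judgements_a, judgements_b)) / n
    rate_a = sum(judgements_a) / n  # share of documents assessor A marks sensitive
    rate_b = sum(judgements_b) / n  # share of documents assessor B marks sensitive
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    return (observed - expected) / (1 - expected)

# Hypothetical judgements (True = sensitive) by two assessors on ten documents.
assessor_1 = [True, True, False, False, True, False, False, False, True, False]
assessor_2 = [True, False, False, False, True, False, True, False, True, False]

kappa = cohens_kappa(assessor_1, assessor_2)
print(f"kappa = {kappa:.4f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, so documents on which assessors disagree are natural candidates for the borderline cases mentioned above.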
![Page 27: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/27.jpg)
The Abaca Project - Where Next?
● Understanding the real digital environment
  – Changes in working practice
● Testing our proof-of-concept system against real data
● More, wider and deeper
  – More exemptions, more data, more features
  – BIS, HO, MOJ, FCO, ... and more to come!
  – Funding
![Page 28: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/28.jpg)
Questions and Feedback
![Page 29: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/29.jpg)
Break-Out Groups
Aims:
● Discuss sensitivity review in the Welsh Government and language context.
● Share your understanding and develop some ideas.
![Page 30: Welsh Government Workshop](https://reader034.fdocuments.us/reader034/viewer/2022051818/54c1ac674a7959120f8b457c/html5/thumbnails/30.jpg)
Break-Out Groups
Questions:
1. What digital records does the Welsh Government create?
2. What sort of sensitivities are expected within these digital records?
3. What aspects of the sensitivity review process could be technically supported by a software tool or system?
4. What document features could be used to identify the expected sensitivities?