Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate...
Transcript of Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate...
![Page 1: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/1.jpg)
Industry Mentors
Junghoon Woo, Director Data Scientist,
Data & Analytics (The Lighthouse), KPMG LLP, US
Viral Chawda, Principal, Innovation & Enterprise Solutions (I&ES), Lighthouse and Global lead,
AI & Analytics for Government & Infrastructure, KPMG LLP, US
Algorithmic Comment Processing
Members*
Gayani Perera, Liliana Cruz-Lopez, Minsu Yeom, Pranjal Bajaj
Data Science Institute Mentor
Sining Chen, Lecturer, Columbia University
* In alphabetical order
![Page 2: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/2.jpg)
Automate the Identification and Summarisation of Sections in PDF Documents
OUR GOAL
2
![Page 3: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/3.jpg)
1. Problem Statement2. Module 1: PDF Ingestion3. Module 1: Data Preparation4. Module 1: Modelling5. Module 2: Section Summarization
Roadmap
3
![Page 4: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/4.jpg)
Problem Statement: Background
Client: Regulations.gov
Final RulingPre-rule
4
![Page 5: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/5.jpg)
Problem Statement: Business Impact
12-20 Weeks
30People
Prior to Automation
2People
2Weeks
Post Automation
Source: KPMG 5
![Page 6: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/6.jpg)
Problem Statement: Our Solution
Filename SectionID Summary
6
![Page 7: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/7.jpg)
Module 1
7
![Page 8: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/8.jpg)
PDF Ingestion: PDF to ?
PDFs to TextIssue: White spaces only between the paragraphs
Two other attempts
PDFs to XMLIssue: A section title appears within
a paragraph
PDFs to HTMLs
Information extracted from HTMLs led us to build extra features used in our models.
8
![Page 9: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/9.jpg)
PDF Ingestion: Can you tell which one is an original
PDF?
9
![Page 10: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/10.jpg)
HTML-based features (“raw”) in blue. Engineered features from the raw in red.
Data Preparation: Feature Engineering
10
![Page 11: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/11.jpg)
11
Data Preparation: Feature EngineeringCategory Feature Name Description
Binary Leading_Char_Upper A line start with a uppercase character
Leading_Numeral A line start with Arabic or Roman numeral
Ends_in_Period A line ends with a period
Leading_Number_Period A line starts with any numeral combination followed by period
Leading_Char_Period A line start with any uppercase or lowercase character followed by period
Leading_Roman_Numeral A line start with any Roman numeral
Roman_Period A line start with Roman numeral followed by period
Numerical Num_Word Number of words in the text line
Num_of_Spec_Char Number of special characters in the text line
LS A line space between previous and current lines.
Punctuation_Count Number of punctuations in the text line
Title_Word_Count Number of title word counts in the text line
Upper_Case_Word_Count Number of uppercase word counts in the text line
Ratio_of_Title_Word_To_Total Ratio of the number of title words to all words in the line
Categorical Document File Name
Textural Last_Word Last word of the text line
First_Three_Words First three words of the text line
11
![Page 12: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/12.jpg)
12
Data Preparation: HTML to Data frame
PDF Features Data frame
12
![Page 13: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/13.jpg)
Data Preparation: Getting Modelling-ready
● Treating Missing Data
● OneHotEncoding Categorical Data
● Scaling Continuous Features
● Transforming Text Data: Last Word and First 3 Words○ One Hot Encoded Representation: CountVectorizer and TfidfVectorizer○ n_grams: (e.g. “not happy”, “deeply sad”)○ stop_words (e.g. “a”, “in”)
13
![Page 14: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/14.jpg)
14
Scikit-learn pipeline prevents leakage by chaining transformations with
cross-validation
Modelling: Test-Train splits and Pipelines
7744 lines coming from 19 documents
70% Train
30% Test
14
![Page 15: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/15.jpg)
15
Modelling: Class Imbalance and Evaluation Metrics
2.08% of the lines are section titles
• False Negative: Section titles incorrectly identified as a in-text line
• False Positive: in-text line incorrectly identified as a section header
• In our scenario, we cared slightly more about False Negatives.
15
![Page 16: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/16.jpg)
Classification Algorithms:
1. Baseline Model: Logistic Regression
2. Random Forest Classifier
3. XGBoost Classifier
Modelling: Algorithms
Parameter Tuning and Cross-validation
● Grid-search over parameters
● Using a 5-fold cross-validation: Stratified Shuffle Split
● Embedded in a scikit-learn Pipeline
Outlier Detection Algorithms:
1. Isolation Forest: Picks outliers by
randomly selecting features
2. Elliptic Envelope: Assume Gaussian
Covariance to isolate outliers
16
![Page 17: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/17.jpg)
Modelling: Best Results on an Independent Test Set
True Negatives: 2,273 False Positives: 10
False Negatives: 4 True Positives: 36
Results Table
Random Forest
Max Depth: 50 Number of Trees: 100
Oversampling Minority Class
Empirical Rule Any line that begins
with “RE:” is labelled as a section title
Threshold Precision Recall F1Score
ROC AUC
Accuracy
0.31 0.78 0.90 0.84 0.95 0.99
Confusion Matrix
17
![Page 18: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/18.jpg)
Modelling: Important Features
18
![Page 19: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/19.jpg)
Module 2
19
![Page 20: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/20.jpg)
Background and Methodology
20
• Module 2 Objective: take the intermediate output generated by Module 1 and produce good quality text summarization
• We consider 5 different text summarization techniques that range from simple frequency based to semantic based analysis
• We consider two metrics (Levenshtein distance, Jaccard distance) to compare the output generated by these 5 methods
• Experimental evaluation and comparison of summarization output
• Lessons learned from summarization exploration
Data Prep& Input Data
Text Summarization Methods
SummarizedText Output
Comparison Framework
LevenshteinJaccard
Summarizationw/Comparison Score
Original Document
Module 1
Module 1Output
Module 2ETL
Module 2Input
![Page 21: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/21.jpg)
How Text Summarization Works?
Abstractive Summarization: This method produces summarization that is more human like where important concepts are produced.
This method selects words based on semantic understanding and tries to summarize based on important concepts. Most methods interpret and examine the text using advanced natural language techniques in order to generate a new shorter text that conveys the most critical information.
Input document → understand context → semantics → create own summary.
Extractive Summarization: Sentences are ranked based on important part of the sentences. Summarization method chooses top ranked sentences.
Different algorithm and techniques are used to define weights for the sentences and further rank them based on importance and similarity among each other.
Input document → sentences similarity → weight sentences → select sentences with higher rank.
Extractive Summarization returns top-N sentences as summarized output whereas Abstractive Summarization produces a key set of concepts as summarization based on semantic analysis. The latter is often hard and more complex but more human-like.
21
Broadly two categories of Text Summarization: Extractive and Abstractive
![Page 22: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/22.jpg)
Module 1 Output schema
1. document: name of the document
2. page : page number where each text belongs to
3. text: the text from each line is store in this column
4. Class: the classification of each line text line
Module 2 Input Schema
1. document : document name
2. secIDin: the section id of a particular text
3. text: the text for each section
Data Preparation for Summarization Step
22
Data Preparation (Module 2 ETL)• Original document is processed by Module 1 to generate a set of meta tags• Module 2 ETL utilizes Module 1 Output to generate input data with appropriate features for Text Summarization Methods
Module 2 Data ETL
![Page 23: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/23.jpg)
Summarization ModelsLuhn Model Lex Rank Model Tex Rank Model LSA Model NLTK
Core Idea Each sentence is assigned a score based on frequency of occurrence and distance among significant words; next is to extract top-N sentences with top scores.
Sentences are assigned a score based on TF-IDF and creating a graph with edges between similar sentences; PageRank based approach is used to compute rank of each sentence; top-N ranked sentences are extracted.
Similar to LexRank; While LexRank uses cosine similarity of TF-IDF vectors, TextRank uses a measure based on the number of words two sentences have in common.
LSA projects data into a lower dimensional space using SVD; singular vectors can capture and represent word combination patterns; magnitude of singular value indicates importance of the pattern in a document.
Simple text based approach summarization using basic NLP techniques such as word tokenization.
Category Extractive Extractive Extractive Close to abtractive Extractive
Frequency based ranking
Graph based ranking
ML Unsupervised
Semantic
23
![Page 24: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/24.jpg)
Comparing Summarization Quality with Similarity Metrics … cont
• Our hypothesis: If summarization output produced by these methods are ”very similar” to each other, this consensus is an indicator that summarization quality may be good. Conversely, if the output are “highly dissimilar”, the summarization quality is at least is non conclusive.
• We want to experimentally validate if “maximal consensus” is a good policy of picking good summarization.
• Automated hypothesis testing: We choose two metrics to measure similarity between two strings
○ Levenshtein distance: measures similarity at character level○ Jaccard distance: measures dissimilarity at word level
How do we know whether summarization is good quality?
24
![Page 25: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/25.jpg)
Jaccard distance: dissimilarity between two strings
represents the total number of attributes where A and B both have a value of 1.
represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
represents the total number of attributes where A and B both have a value of 0.
Levenshtein distance: similarity between two strings
Mathematically, the Levenshtein distance between two strings (of length and respectively) is given by
Comparing Summarization Quality with Similarity Metrics
where is the indicator function equal to 0 when and equal to 1 otherwise, and is the distance between the first characters of and the first characters of
25
![Page 26: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/26.jpg)
Experiments
26
![Page 27: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/27.jpg)
Experiments and Results
27
Key Results
• L-score is more optimistic compared to J-score
• All methods have lowest similarity J-score with LSA
• Luhn and Text Rank seem to have highest similarity J-score
• LexRank and Text Rank summarization differs significantly although both use PageRanking/Graph based model!
• Maximal Consensus (highest number of methods with similar summarization) provided good summarization and validates our hypothesis
• Associativity of similarity does not hold with summarization!
![Page 28: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/28.jpg)
Summarization Output
Best Model:● Maximal consensus on summarization seems to be a good choice ● Luhn and Text Rank have highest similarity score in our analysis● Jaccard score is a better candidate for text summarization comparison
28
![Page 29: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/29.jpg)
Lesson Learned & Future Work
29
Lessons
● Check the integrity of your dataset until the last moment● Make sure to manually inspect where your model is making mistakes ● ML is not a panacea to all ills, so be flexible about other ways of supporting it● NLTK based summarization are counterintuitive as was shown in metrics table● Jaccard score is a better metric for comparison ● Maximal consensus based summarization gives better quality results
Future Work
● Evaluate abstractive summarization ● Explore CNN vector representations ● Evaluate models using other metrics such as Rouge, Blue, and Meteor
![Page 30: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/30.jpg)
3030
Thanks!Questions?
![Page 31: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/31.jpg)
Liliana Cruz-Lopez
● Module 1: Converted PDFs to HTMLs, extracted raw features from HTMLs and contributed to engineered features
● Module 2: completed end-to-end text summarization
Pranjal Bajaj
● Model concept and development
● Model implementation: Choosing Metrics and Implementing best practices using scikit-learn
* In alphabetical order
Main contribution from team members
31
Minsu Yeom
● Preprocessing: Feature engineering (Line space(LS), Ratio of title word to total), Converted PDFs to XMLs
● Model implementation: XGBoost
Gayani Perera
● PDF Ingestion, Feature engineering, model implementation : Random Forest
● Extractive text summarization
![Page 32: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/32.jpg)
Appendix
32
![Page 33: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/33.jpg)
Scikit-learn Pipeline
33
![Page 34: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/34.jpg)
34
Precision - Recall vs Threshold for Best Model: Random Forest
![Page 35: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/35.jpg)
35
HTML-based features (“raw”)
![Page 36: Algorithmic Comment Processing - Columbia DataSciencecolumbia.edu… · Future Work Evaluate abstractive summarization Explore CNN vector representations Evaluate models using other](https://reader036.fdocuments.us/reader036/viewer/2022062506/5f0db4a47e708231d43bad95/html5/thumbnails/36.jpg)
36
Table of Best Results