FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup
jim-salmons -
Problem: Text Soup
OCR’s “Dirty” Little Secret
FactMiners & PRImA’s Knight News Challenge Entry
“Turn Text Soup into Smart Data in Newspaper & Magazine Archives”
A self-running video slideshow.
One slide every 15 seconds.
Pause as needed.
Q: What is “Text Soup”?
• A: The uncorrected and usually hidden text “layer” that is generated by OCR (optical character recognition) during bulk scanning and digitization of historic and cultural heritage documents.
FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”
Scanned Images
or photos of pages!
Q: How is “Text Soup” Used?
• A: Primarily “behind the scenes” to support “full text” search.
• Good for things like:
  • Show me the pages with the word “razor” on them in this book.
  • What books are about shaving?
  • What words are found in proximity to the word “strop”?
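The two page-level queries above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular archive implements search: the `pages` dictionary, its sample text, and the helper names `tokenize`, `pages_with_word`, and `words_near` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical OCR text layer: page number -> uncorrected page text.
# The sample text is invented for illustration only.
pages = {
    1: "A good razor needs a leather strop to keep a keen edge",
    2: "Shaving soap and brush sold here",
    3: "Strop the razor before every shave",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def pages_with_word(pages, word):
    """Return the sorted page numbers whose text contains the word."""
    word = word.lower()
    return sorted(n for n, text in pages.items() if word in tokenize(text))

def words_near(pages, target, window=2):
    """Count words found within `window` tokens of each occurrence of target."""
    target = target.lower()
    counts = Counter()
    for text in pages.values():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                start = max(0, i - window)
                # Neighbors on both sides, excluding the target itself.
                counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

print(pages_with_word(pages, "razor"))
print(words_near(pages, "strop").most_common(3))
```

Both queries treat each page as a flat bag of words, which is why they work on a plain text layer without any knowledge of where on the page a word sits or which article it belongs to.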
Scanned
Image of text!
Hidden text
layer…
Q: What are Text Soup’s limits?
• Automated OCR (text recognition) is a “one size fits all” process in the workflow of bulk scanning and digitization.
• Good for basic books & monographs with simple document structure…
Q: What are Text Soup’s limits?
• Newspapers & magazines have complex document structures
• Multiple articles, multiple authors, text continuations, advertisements, images, sidebars, text used as art in design, etc.
• All this data is locked in our archives waiting to be “fact-mined”
Q: What are Text Soup’s limits?
• On these pages from Softalk magazine we have lots of “facts” in ads and a monthly column
• We can’t “locate” facts and assess their meaning based on the jumbled or missing info in the pages’ Text Soup.
Complex
document
structures
not identified!
We have to “tame” Text Soup to unlock “facts” in archive data.
• Our project will focus on recognizing complex document structure and on “fact-revealing” content modeling.
• In the next slideshow, we describe our vision for “fact-mining” Smart Data from newspaper & magazine digital archives…
FactMiners & PRImA: Our Knight News Challenge Entry
• “Turn Text Soup into Smart Data in Newspaper & Magazine Archives” - https://goo.gl/99Vn5M
• Team
  • Jim Salmons, FactMiners
  • Timlynn Babitsky, FactMiners
  • Apostolos Antonacopoulos, PRImA
  • Christian Clausner, PRImA