FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

Problem: Text Soup

OCR’s “Dirty” Little Secret

FactMiners & PRImA’s Knight News Challenge Entry

“Turn Text Soup into Smart Data in Newspaper & Magazine Archives”

A self-running video slideshow.

One slide every 15 seconds.

Pause as needed.

Q: What is “Text Soup”?

• A: The uncorrected and usually hidden text “layer” that is generated by OCR (optical character recognition) during bulk scanning and digitization of historic and cultural heritage documents.

FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”

Scanned images or photos of pages!

Q: How is “Text Soup” Used?

• A: Primarily “behind the scenes” to support “full text” search.

• Good for things like:

• Show me the pages with the word “razor” on them in this book.

• What books are about shaving?

• What words are found in proximity to the word “strop”?
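The first and third questions above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the hidden text layer is available as plain text per page; the `pages` sample data and the helper names `tokenize`, `pages_with_word`, and `words_near` are hypothetical and not part of the project.

```python
import re

# Hypothetical per-page OCR output (the hidden text layer); sample data only.
pages = {
    1: "A good razor needs a good strop and steady hand",
    2: "Shaving soap and brush sold here",
    3: "Hone the razor then strop the blade before shaving",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def pages_with_word(pages, word):
    """Return the sorted page numbers whose text contains the word."""
    word = word.lower()
    return sorted(n for n, text in pages.items() if word in tokenize(text))

def words_near(pages, target, window=2):
    """Return the sorted set of words within `window` tokens of the target."""
    target = target.lower()
    found = set()
    for text in pages.values():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                start = max(0, i - window)
                found.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return sorted(found)

print(pages_with_word(pages, "razor"))
print(words_near(pages, "strop", window=1))
```

Both helpers treat each page as a flat bag of words, which is all that Text Soup offers: they can say where a word appears, but not which article, ad, or column it belongs to.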


Scanned image of text!

Hidden text layer…

Q: What are Text Soup’s limits?

• Automated OCR (text recognition) is a “one size fits all” process in the workflow of bulk scanning and digitization.

• Good for basic books & monographs with simple document structure…



• Newspapers & magazines have complex document structures

• Multiple articles, multiple authors, text continuations, advertisements, images, sidebars, text used as art in design, etc.

• All this data is locked in our archives waiting to be “fact-mined”



• On these pages from Softalk magazine we have lots of “facts” in ads and a monthly column

• We can’t “locate” facts and assess their meaning based on the jumbled or missing info in the pages’ Text Soup.
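One way text can end up jumbled is when a multi-column page is read straight across, ignoring column boundaries. The sketch below is a toy illustration under that assumption, not a description of how any particular OCR engine works; the two-column sample text and the function names are hypothetical.

```python
# Hypothetical two-column page; each column holds a separate item.
left_column = ["Spring sale on", "shaving kits", "ends Friday."]
right_column = ["Our columnist", "reviews three", "new razors."]

def naive_row_order(left, right):
    """Read each printed row straight across both columns."""
    return " ".join(f"{a} {b}" for a, b in zip(left, right))

def column_aware_order(left, right):
    """Read each column top to bottom, keeping the two items separate."""
    return [" ".join(left), " ".join(right)]

print(naive_row_order(left_column, right_column))
print(column_aware_order(left_column, right_column))
```

The naive reading interleaves the two items into one run of words, while the column-aware reading needs to know the page structure first, which is the information Text Soup lacks.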


Complex document structures not identified!

We have to “tame” Text Soup to unlock “facts” in archive data.

• Our project will focus on recognizing complex document structure and on “fact-revealing” content modeling.

• In the next slideshow, we describe our vision for “fact-mining” Smart Data from newspaper & magazine digital archives…


FactMiners & PRImA: Our Knight News Challenge Entry

• “Turn Text Soup into Smart Data in Newspaper & Magazine Archives” - https://goo.gl/99Vn5M

• Team

• Jim Salmons, FactMiners

• Timlynn Babitsky, FactMiners

• Apostolos Antonacopoulos, PRImA

• Christian Clausner, PRImA


