FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup


Transcript of FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

Page 1: FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

Problem: Text Soup: OCR’s “Dirty” Little Secret

FactMiners & PRImA’s Knight News Challenge Entry

“Turn Text Soup into Smart Data in Newspaper & Magazine Archives”

A self-running video slideshow. One slide every 15 seconds. Pause as needed.

Page 2: FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

Q: What is “Text Soup”?

• A: The uncorrected and usually hidden text “layer” that is generated by OCR (optical character recognition) during bulk scanning and digitization of historic and cultural heritage documents.

FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”

Scanned images or photos of pages!

Page 3: FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

Q: How is “Text Soup” Used?

• A: Primarily “behind the scenes” to support “full text” search.

• Good for things like:

  • Show me the pages with the word “razor” on them in this book.

  • What books are about shaving?

  • What words are found in proximity to the word “strop”?
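As a rough illustration of the word-level queries listed above, here is a minimal Python sketch of a full-text index built over an OCR text layer. The page texts, function names, and the two-word proximity window are all hypothetical and are not taken from the FactMiners or PRImA tooling. A topical question such as "What books are about shaving?" would need more than a word index, so it is not attempted here.

```python
import re
from collections import defaultdict

# Hypothetical OCR text layer: page number -> uncorrected text (made-up sample data).
pages = {
    1: "A good razor needs a leather strop and a steady hand.",
    2: "Shaving soap and brushes sold here.",
    3: "Hone the razor, then strop the blade before shaving.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Map each word to the set of page numbers on which it appears."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_no)
    return index

def pages_with(index, word):
    """Answer 'show me the pages with this word on them'."""
    return sorted(index.get(word.lower(), set()))

def words_near(pages, target, window=2):
    """Collect the words found within `window` positions of `target`."""
    near = set()
    for text in pages.values():
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == target:
                lo = max(0, i - window)
                near.update(tokens[lo:i] + tokens[i + 1:i + 1 + window])
    return sorted(near)

index = build_index(pages)
print(pages_with(index, "razor"))   # [1, 3]
print(words_near(pages, "strop"))
```

Note that the index knows only which words occur on which page. It records nothing about which article, ad, or sidebar a word belongs to, which is the limitation the following slides describe.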

Scanned image of text!

Hidden text layer…

Page 4: FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

Q: What are Text Soup’s limits?

• Automated OCR (text recognition) is a “one size fits all” process in the workflow of bulk scanning and digitization.

• Good for basic books & monographs with simple document structure…

Page 5: FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

Q: What are Text Soup’s limits?

• Newspapers & magazines have complex document structures

• Multiple articles, multiple authors, text continuations, advertisements, images, sidebars, text used as art in design, etc.

• All this data is locked in our archives waiting to be “fact-mined”

Page 6: FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

Q: What are Text Soup’s limits?

• On these pages from Softalk magazine we have lots of “facts” in ads and a monthly column.

• We can’t “locate” facts and assess their meaning based on the jumbled or missing info in its Text Soup.
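To make the "jumbled" problem concrete, the following toy Python sketch shows one way Text Soup can scramble a multi-column page: reading words strictly top-to-bottom and left-to-right interleaves two unrelated items, while grouping words by region first keeps each item intact. The word coordinates, the sample words, and the fixed column split at x = 50 are all invented for illustration; this is not a description of how any particular OCR engine or the PRImA tools work.

```python
# Hypothetical word boxes from a two-column page: (x, y, text).
# The left column is an ad, the right column is a monthly column.
words = [
    (10, 0, "Razor"), (60, 0, "Softalk"),
    (10, 1, "sale"),  (60, 1, "column"),
    (10, 2, "today"), (60, 2, "returns"),
]

def naive_text(words):
    """Read every word row by row, ignoring the page's column structure."""
    ordered = sorted(words, key=lambda w: (w[1], w[0]))
    return " ".join(w[2] for w in ordered)

def region_text(words, column_split=50):
    """Assign words to a column region first, then read each region top to bottom.

    `column_split` is an assumed, hard-coded boundary; real layout analysis
    would have to detect regions rather than be told where they are.
    """
    regions = {"left": [], "right": []}
    for w in words:
        regions["left" if w[0] < column_split else "right"].append(w)
    return {
        name: " ".join(w[2] for w in sorted(ws, key=lambda w: w[1]))
        for name, ws in regions.items()
    }

print(naive_text(words))    # Razor Softalk sale column today returns
print(region_text(words))   # {'left': 'Razor sale today', 'right': 'Softalk column returns'}
```

The words are identical in both outputs. Only the second preserves which words belong together, which is the kind of structure a fact would need in order to be located.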

Complex document structures not identified!

Page 7: FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

We have to “tame” Text Soup to unlock “facts” in archive data.

• Our project will focus on recognizing complex document structure and on “fact-revealing” content modeling.

• In the next slideshow, we describe our vision for “fact-mining” Smart Data from newspaper & magazine digital archives…

Page 8: FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Problem: Text Soup

FactMiners & PRImA: Our Knight News Challenge Entry

• “Turn Text Soup into Smart Data in Newspaper & Magazine Archives” - https://goo.gl/99Vn5M

• Team

  • Jim Salmons, FactMiners

  • Timlynn Babitsky, FactMiners

  • Apostolos Antonacopoulos, PRImA

  • Christian Clausner, PRImA
