Content analysis for ECM with Apache Tika

Paolo Mottadelli

paolo@apache.org

ON BOARD!

Agenda

Main challenge

Lucene index

Other challenges

A real world challenge

Searching .docx .xlsx .pptx in Alfresco ECM

Agenda

What is Tika?

Another Indian Lucene project? No.

What is Tika?

It is a Toolkit

Current coverage

A brief history of Tika

Sponsored by the Apache Lucene PMC

Tika organization

Changing after graduation

Getting Tika

… and contributing

Tika Design

The Parser interface

void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException;
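
To make the contract concrete, here is a minimal sketch of a parser that reads a stream, fills in metadata, and reports text as XHTML SAX events. It uses only the JDK's SAX classes so it runs without Tika on the classpath. `SketchParser`, `PlainTextSketchParser` and `ParserSketchDemo` are hypothetical names, and a plain `Map` stands in for Tika's `Metadata`; this is an illustration of the idea, not Tika's implementation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the Parser contract: a Map replaces Tika's Metadata.
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

// Toy parser: treats the input as UTF-8 text and emits one XHTML <p> element per line.
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain");
        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", noAttributes);
        handler.startElement(XHTML, "body", "body", noAttributes);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            handler.startElement(XHTML, "p", "p", noAttributes);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }
}

class ParserSketchDemo {
    // Runs the toy parser and collects the character events, one line of text per <p>.
    static String extractText(String input, Map<String, String> metadata)
            throws IOException, SAXException {
        StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    text.append('\n');
                }
            }
        };
        InputStream stream = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
        new PlainTextSketchParser().parse(stream, collector, metadata);
        return text.toString();
    }
}
```

The caller decides what to do with the events: collect plain text as above, serialize the XHTML, or feed an indexer.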

Tika Design

Document input stream

Tika Design

XHTML SAX events

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

</head>

</html>

Why XHTML?

• Reflect the structured text content of the document

• Not recreating the low-level details

• For low-level details use low-level parser libs

ContentHandler (CH) and Decorators (CHD)
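
A decorator wraps a ContentHandler, forwards every SAX event to it, and adds behaviour on the way. The sketch below assumes nothing from Tika: it builds on the JDK's `XMLFilterImpl`, which already forwards events to a wrapped handler. `CountingDecorator` and `DecoratorSketchDemo` are hypothetical names chosen for illustration.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: forwards every event to the wrapped handler
// and counts the characters that pass through.
class CountingDecorator extends XMLFilterImpl {
    private int characterCount;

    CountingDecorator(ContentHandler delegate) {
        setContentHandler(delegate);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        characterCount += length;
        super.characters(ch, start, length); // delegate to the wrapped handler
    }

    int getCharacterCount() {
        return characterCount;
    }
}

class DecoratorSketchDemo {
    // Sends the chunks through the decorator and returns "<collected text>:<count>".
    static String run(String... chunks) throws SAXException {
        StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        CountingDecorator counting = new CountingDecorator(collector);
        counting.startDocument();
        for (String chunk : chunks) {
            counting.characters(chunk.toCharArray(), 0, chunk.length());
        }
        counting.endDocument();
        return text + ":" + counting.getCharacterCount();
    }
}
```

Because the decorator is itself a ContentHandler, decorators can be stacked without the parser knowing about them.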

Tika Design

Document metadata

… more metadata: HPSF

Tika Design

Parser implementations

The AutoDetectParser

• Encapsulates all Tika functionalities

• Can handle any type of document
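
The "detect, then delegate" idea behind an auto-detecting parser can be sketched in a few lines. Everything here is an assumption made for illustration: `AutoDetectSketch`, its toy detector and its placeholder parsers are not Tika classes, and real detection and parsing are far richer.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: detect the media type first, then hand the document
// to whichever parser is registered for that type.
class AutoDetectSketch {
    private final Function<byte[], String> detector;
    private final Function<byte[], String> fallback;
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    AutoDetectSketch(Function<byte[], String> detector, Function<byte[], String> fallback) {
        this.detector = detector;
        this.fallback = fallback;
    }

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    String parse(byte[] document) {
        String mediaType = detector.apply(document);
        return parsers.getOrDefault(mediaType, fallback).apply(document);
    }

    // Demo wiring: a toy detector that only knows the "%PDF" prefix.
    static AutoDetectSketch demo() {
        AutoDetectSketch auto = new AutoDetectSketch(
                bytes -> bytes.length >= 4 && bytes[0] == '%' && bytes[1] == 'P'
                        && bytes[2] == 'D' && bytes[3] == 'F'
                        ? "application/pdf" : "text/plain",
                bytes -> "[unsupported]");
        auto.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        auto.register("application/pdf", bytes -> "[PDF text would be extracted here]");
        return auto;
    }
}
```

Client code calls a single `parse` method and never needs to know which concrete parser did the work.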

Type Detection

MimeType type = types.getMimeType(…);

tika-mimetypes.xml

An example: Gzip

<mime-type type="application/x-gzip">

</magic>

</mime-type>
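
Magic-based detection boils down to peeking at the first bytes of a stream and comparing them with a known signature. The sketch below checks for the two-byte gzip signature (0x1f 0x8b) and then rewinds the stream so a parser can still read it from the start. `MagicSketch` is a hypothetical name, and the fallback type is an assumption for the example; this is not how Tika reads its tika-mimetypes.xml rules.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of magic-byte detection for a single format.
class MagicSketch {
    static String detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(2);           // remember the current position
        int first = in.read();
        int second = in.read();
        in.reset();           // rewind so the caller can parse from the beginning
        if (first == 0x1f && second == 0x8b) {
            return "application/x-gzip";
        }
        return "application/octet-stream"; // assumed fallback for unknown content
    }
}
```

Wrapping an arbitrary stream in a `BufferedInputStream` is a simple way to get the mark/reset support this method requires.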

Supported formats

A really simple example

InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
new OfficeParser().parse(input, handler, metadata);
String contentType = metadata.get(Metadata.CONTENT_TYPE);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();

Future Goals

Who uses Tika?

Agenda

ECM: what is it?

ECM: Manage

• Indexing

• Categorization

ECM: we love SEARCHING!

Don’t do it on your own

Tika shields ECM from using many single components

Agenda

Alfresco: short presentation

Who uses Alfresco?

Alfresco Repository

JSR-170 Level 2 compatible

Repository Architecture

[Diagram: repository layers Services, Components and Storage; labels include Node, Content, Search, Query, Index, Hibernate, Database and Lucene]

Alfresco Search

Use case

Without Tika:

Step 1

Step 2

for (ContentTransformer transformer : transformers) {
    long transformationTime = transformer.getTransformationTime();
    if (bestTransformer == null || transformationTime < bestTime) {
        bestTransformer = transformer;
        bestTime = transformationTime;
    }
}
return bestTransformer;

ContentTransformerRegistry: provides the most appropriate ContentTransformer
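
The selection loop on this slide can be exercised in isolation. The sketch below is a self-contained version under stated assumptions: `TimedTransformer` is a hypothetical stand-in for Alfresco's ContentTransformer that models only the timing accessor shown above, and the transformer names and timings in the usage note are made up.

```java
import java.util.List;

class TransformerSelectionSketch {
    // Hypothetical stand-in for ContentTransformer: only the timing accessor is modelled.
    interface TimedTransformer {
        String name();
        long getTransformationTime();
    }

    // Small factory so the demo and tests can build transformers with fixed timings.
    static TimedTransformer transformer(String name, long millis) {
        return new TimedTransformer() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public long getTransformationTime() {
                return millis;
            }
        };
    }

    // Returns the transformer with the lowest reported time, or null for an empty list.
    static TimedTransformer selectFastest(List<TimedTransformer> transformers) {
        TimedTransformer bestTransformer = null;
        long bestTime = 0L;
        for (TimedTransformer candidate : transformers) {
            long transformationTime = candidate.getTransformationTime();
            if (bestTransformer == null || transformationTime < bestTime) {
                bestTransformer = candidate;
                bestTime = transformationTime;
            }
        }
        return bestTransformer;
    }
}
```

Given transformers timed at 30, 10 and 20 milliseconds, `selectFastest` returns the 10 ms one; callers must handle the null result when no transformer is registered.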

Step 2 (explained)

Too many different ContentTransformer implementations

Step 3

Transform

public void transformInternal(ContentReader reader, ContentWriter writer, TransformationOptions options) throws Exception {
    ...
    HSSFWorkbook workbook = new HSSFWorkbook(is);
    ...
    for (int i = 0; i < sheetCount; i++) {
        HSSFSheet sheet = workbook.getSheetAt(i);
        String sheetName = workbook.getSheetName(i);
        writeSheet(os, sheet, encoding);
    }
    ...
}

Example: PoiHssfContentTransformer

Step 3 (explained)

Too many different ContentTransformer implementations

... again !?!

Step 4

Lucene index creation

ContentReader reader = contentService.getReader(nodeRef, propertyName);
ContentTransformer transformer = contentService.getTransformer(reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
transformer.transform(reader, writer);
reader = writer.getReader();
. . . . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));

Let’s do it using Tika

Step 1 + Step 2 + Step 3

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
new AutoDetectParser().parse(input, handler, metadata);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();

Step 1 to 4 (compressed)

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Reader reader = new ParsingReader(input, name);
. . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));
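
Handing the indexer a Reader means the extracted text can be consumed while parsing is still in progress. One way to bridge a push-style producer and a pull-style Reader is a pipe fed by a background thread, sketched below with JDK classes only. `PipedExtractionSketch` and `TextProducer` are hypothetical names, and the sketch only illustrates the general technique; it does not reproduce ParsingReader's code.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;

class PipedExtractionSketch {
    // Hypothetical callback that pushes extracted text into a Writer.
    interface TextProducer {
        void writeTo(Writer out) throws IOException;
    }

    // Starts the producer on a background thread and returns the reading end of the pipe.
    static Reader asReader(TextProducer producer) throws IOException {
        PipedReader reader = new PipedReader();
        PipedWriter writer = new PipedWriter(reader);
        Thread worker = new Thread(() -> {
            try (Writer out = writer) {
                producer.writeTo(out);
            } catch (IOException e) {
                // The writer is closed by try-with-resources; the consumer then sees end of stream.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return reader;
    }

    // Drains a Reader into a String, as an indexer would consume it.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[256];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        return text.toString();
    }
}
```

The consumer never holds the whole document text in memory unless it chooses to, which matters for large files.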

Results: 1 & 2

Extension use case

Adding support for Microsoft Office Open XML Documents (Office 2007+)

Apache POI

Apache POI provides text extraction support for Office Open XML formats and advanced coverage of the SpreadsheetML specification (WordprocessingML & PresentationML to come)

Apache POI status

Apache POI TextExtractors

POIXMLDocument document;
Package pkg = Package.open(stream);
textExtractor = ExtractorFactory.createExtractor(pkg);
if (textExtractor instanceof XSSFExcelExtractor) {
    setType(metadata, OOXML_EXCEL_MIMETYPE);
    document = new XSSFWorkbook(pkg);
}
else if (textExtractor instanceof XWPFWordExtractor) {…}
else if (textExtractor instanceof XSLFPowerPointExtractor) {…}
setPOIXMLProperties(metadata, document);

Can we find it?

Results: 3 & 4

p.mottadelli@sourcesense.com
