Content analysis for ECM with Apache Tika

Post on 08-May-2015

Presentation at ApacheCon US 2008 (New Orleans) by Paolo Mottadelli. It covers the Apache Tika project and how it was integrated into Alfresco to support full-text search of Office Open XML documents.

Content analysis for ECM with Apache Tika

Paolo Mottadelli

paolo@apache.org

ON BOARD!

Agenda

Main challenge

Lucene index

Other challenges

A real world challenge

? ? ?

Searching .docx .xlsx .pptx in Alfresco ECM

Agenda

What is Tika?

Another Indian Lucene project? No.

What is Tika?

It is a Toolkit

Current coverage

A brief history of Tika

Sponsored by the Apache Lucene PMC

Tika organization

Changing after graduation

Getting Tika

… and contributing

Tika Design

The Parser interface

void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException;

Tika Design

Document input stream

Tika Design

XHTML SAX events

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>...</title>
  </head>
  <body> ... </body>
</html>
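
To make the idea of consuming XHTML SAX events concrete, here is a minimal sketch that uses only the JDK SAX parser. The class `BodyTextCollector` and its `extractBody` helper are hypothetical names invented for this illustration, not Tika classes; the handler simply keeps the character data that arrives between the start and end of `<body>`, which is roughly the behavior one would expect from a body-only content handler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler (not a Tika class): keeps only the text inside <body>. */
class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(localName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(localName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside <body> (for example the <title>) is ignored.
        if (inBody) {
            text.append(ch, start, length);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Parses an XHTML string with the JDK SAX parser and returns the body text. */
    static String extractBody(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so that localName is populated
            BodyTextCollector handler = new BodyTextCollector();
            factory.newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>Sample</title></head>"
            + "<body><p>Hello Tika</p></body></html>";
        System.out.println(extractBody(xhtml)); // prints "Hello Tika"
    }
}
```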

Why XHTML?

• Reflect the structured text content of the document

• Not recreating the low level details

• For low level details use low level parser libs

ContentHandler (CH) and Decorators (CHD)
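
The decorator idea can be sketched with plain JDK SAX classes. Everything below is an assumption made for illustration: `LimitingDecorator`, `TextSink` and `DecoratorDemo` are hypothetical names, not Tika classes, and the JDK's `XMLFilterImpl` is used as the forwarding base. The decorator wraps another `ContentHandler`, passes events through, and changes one behavior (here, it stops forwarding text after a character limit).

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Demo entry point: pushes character events through a decorated handler. */
class DecoratorDemo {
    /** Sends the input as one characters() event through a LimitingDecorator. */
    static String truncate(String input, int limit) {
        TextSink sink = new TextSink();
        ContentHandler handler = new LimitingDecorator(sink, limit);
        try {
            handler.startDocument();
            char[] chars = input.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(truncate("Content analysis with Tika", 7)); // prints "Content"
    }
}

/** Hypothetical decorator (not a Tika class): forwards at most `limit` characters. */
class LimitingDecorator extends XMLFilterImpl {
    private int remaining;

    LimitingDecorator(ContentHandler delegate, int limit) {
        setContentHandler(delegate); // XMLFilterImpl forwards SAX events to this handler
        this.remaining = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int n = Math.min(length, remaining);
        if (n > 0) {
            super.characters(ch, start, n); // delegate only the allowed prefix
            remaining -= n;
        }
    }
}

/** Terminal handler that simply accumulates character data. */
class TextSink extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}
```

Because the decorator is itself a `ContentHandler`, several of them can be stacked around one terminal handler without the parser knowing about any of them.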

Tika Design

Document metadata

… more metadata: HPSF

Tika Design

Parser implementations

The AutoDetectParser

• Encapsulates all Tika functionality

• Can handle any type of document

Type Detection

MimeType type = types.getMimeType(…);

tika-mimetypes.xml

An example: Gzip

<mime-type type="application/x-gzip">
  <magic priority="40">
    <match value="\037\213" type="string" offset="0" />
  </magic>
  <glob pattern="*.tgz" />
  <glob pattern="*.gz" />
  <glob pattern="*-gz" />
</mime-type>
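
The gzip entry above combines a magic-byte rule with file-name globs. As a rough sketch of what such a rule means in code (not Tika's actual detection logic), the hypothetical `GzipDetector` below checks the two bytes `\037\213` (octal, i.e. 0x1F 0x8B) at offset 0 and then falls back to the three name patterns. Checking the magic bytes before the name, treating the globs as case-sensitive, and returning `application/octet-stream` for unknown input are assumptions of this example.

```java
/** Hypothetical detector (not Tika's MimeTypes): mirrors the gzip entry shown above. */
class GzipDetector {
    static final String GZIP = "application/x-gzip";
    static final String UNKNOWN = "application/octet-stream"; // assumed fallback type

    /** Magic rule: octal \037\213 equals the bytes 0x1F 0x8B at offset 0. */
    static boolean hasGzipMagic(byte[] header) {
        return header != null
            && header.length >= 2
            && (header[0] & 0xFF) == 0x1F
            && (header[1] & 0xFF) == 0x8B;
    }

    /** Glob rules: *.tgz, *.gz and *-gz, written as simple suffix checks. */
    static boolean hasGzipName(String fileName) {
        return fileName != null
            && (fileName.endsWith(".tgz")
                || fileName.endsWith(".gz")
                || fileName.endsWith("-gz"));
    }

    /** Assumption of this sketch: content (magic) is checked first, then the name. */
    static String detect(byte[] header, String fileName) {
        if (hasGzipMagic(header) || hasGzipName(fileName)) {
            return GZIP;
        }
        return UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] gzipHeader = {(byte) 0x1F, (byte) 0x8B, 0x08};
        System.out.println(detect(gzipHeader, "unnamed"));       // application/x-gzip
        System.out.println(detect(new byte[0], "backup.tgz"));   // application/x-gzip
        System.out.println(detect(new byte[0], "report.docx"));  // application/octet-stream
    }
}
```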

Supported formats

A really simple example

InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
new OfficeParser().parse(input, handler, metadata);
String contentType = metadata.get(Metadata.CONTENT_TYPE);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();

Demo

?

Future Goals

Who uses Tika?

Agenda

ECM: what is it?

ECM: Manage

• Indexing

• Categorization

ECM: we love SEARCHING!

ECM: we love SEARCHING!

ECM: we love SEARCHING!

Don’t do it on your own

Tika shields ECM from using many single components

Agenda

Alfresco: short presentation

Alfresco: short presentation

Who uses Alfresco?

Alfresco Repository

JSR-170 Level 2 Compatible

Repository Architecture

[Diagram: three layers labeled Services, Components and Storage; box labels include Node, Content, Search, Query, Index, Hibernate, Lucene and Database]

Repository Architecture

[Diagram: three layers labeled Services, Components and Storage; box labels include Node, Content, Search, Query, Index, Hibernate, Lucene and Database]

Alfresco Search

Alfresco Search

Use case

Use case

Without Tika:

Step 1

Step 2

ContentTransformerRegistry provides the most appropriate ContentTransformer

for (ContentTransformer transformer : transformers)
{
    long transformationTime = transformer.getTransformationTime();
    if (bestTransformer == null || transformationTime < bestTime)
    {
        bestTransformer = transformer;
        bestTime = transformationTime;
    }
}
return bestTransformer;
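
The selection loop on this slide is an excerpt, so the following self-contained sketch fills in the surrounding pieces under stated assumptions. The nested `Transformer` interface is a hypothetical stand-in that models only `getTransformationTime()` (plus a `name()` added for the demo); the initial values of the "best" variables and the `null` result for an empty list are also assumptions, not taken from the Alfresco source.

```java
import java.util.Arrays;
import java.util.List;

class TransformerSelection {
    /** Hypothetical stand-in for a content transformer; only timing is modeled. */
    interface Transformer {
        String name();                 // added for this demo only
        long getTransformationTime();  // average time the transformer reports
    }

    /** Small factory for demo transformers with a fixed reported time. */
    static Transformer timed(final String name, final long millis) {
        return new Transformer() {
            public String name() { return name; }
            public long getTransformationTime() { return millis; }
        };
    }

    /** Returns the transformer reporting the lowest time, or null if the list is empty. */
    static Transformer pickFastest(List<Transformer> transformers) {
        Transformer bestTransformer = null; // assumed initial state
        long bestTime = -1L;                // ignored until a first candidate is seen
        for (Transformer transformer : transformers) {
            long transformationTime = transformer.getTransformationTime();
            if (bestTransformer == null || transformationTime < bestTime) {
                bestTransformer = transformer;
                bestTime = transformationTime;
            }
        }
        return bestTransformer;
    }

    public static void main(String[] args) {
        List<Transformer> candidates = Arrays.asList(
            timed("slow", 120L), timed("fast", 15L), timed("medium", 60L));
        System.out.println(pickFastest(candidates).name()); // prints "fast"
    }
}
```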

Step 2 (explained)

Too many different ContentTransformer implementations

Step 3

Transform

Example: PoiHssfContentTransformer

public void transformInternal(ContentReader reader, ContentWriter writer,
        TransformationOptions options) throws Exception
{
    ...
    HSSFWorkbook workbook = new HSSFWorkbook(is);
    ...
    for (int i = 0; i < sheetCount; i++)
    {
        HSSFSheet sheet = workbook.getSheetAt(i);
        String sheetName = workbook.getSheetName(i);
        writeSheet(os, sheet, encoding);
    }
    ...
}

Step 3 (explained)

Too many different ContentTransformer implementations

... again !?!

Step 4

Lucene index creation

ContentReader reader = contentService.getReader(nodeRef, propertyName);
ContentTransformer transformer = contentService.getTransformer(reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
transformer.transform(reader, writer);
reader = writer.getReader();
. . . . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));

Let’s do it using Tika

Step 1 + Step 2 + Step 3

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
new AutoDetectParser().parse(input, handler, metadata);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();

Step 1 to 4 (compressed)

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Reader reader = new ParsingReader(input, name);
. . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));

Results: 1 & 2

Extension use case

Adding support for Microsoft Office Open XML Documents (Office 2007+)

Apache POI

Apache POI provides text extraction support for Office Open XML formats and advanced coverage of the SpreadsheetML specification (WordprocessingML & PresentationML to come)

Apache POI

Apache POI status

Apache POI TextExtractors

POIXMLDocument document;
Package pkg = Package.open(stream);
textExtractor = ExtractorFactory.createExtractor(pkg);
if (textExtractor instanceof XSSFExcelExtractor) {
    setType(metadata, OOXML_EXCEL_MIMETYPE);
    document = new XSSFWorkbook(pkg);
}
else if (textExtractor instanceof XWPFWordExtractor) {…}
else if (textExtractor instanceof XSLFPowerPointExtractor) {…}
setPOIXMLProperties(metadata, document);

Can we find it?

Results: 3 & 4

Q & A

p.mottadelli@sourcesense.com