Transcript
Page 1: Content analysis for ECM with Apache Tika

Content analysis for ECM with Apache Tika

Paolo Mottadelli -

Page 2: Content analysis for ECM with Apache Tika

Paolo Mottadelli

[email protected]

Page 3: Content analysis for ECM with Apache Tika

ON BOARD!

Page 4: Content analysis for ECM with Apache Tika

Agenda

Page 5: Content analysis for ECM with Apache Tika

Main challenge

Lucene index

Page 6: Content analysis for ECM with Apache Tika

Other challenges

Page 7: Content analysis for ECM with Apache Tika

A real world challenge

Searching .docx .xlsx .pptx in Alfresco ECM

? ? ?

Page 8: Content analysis for ECM with Apache Tika

Agenda

Page 9: Content analysis for ECM with Apache Tika

What is Tika?

Another Indian Lucene project? No.

Page 10: Content analysis for ECM with Apache Tika

What is Tika?

It is a Toolkit

Page 11: Content analysis for ECM with Apache Tika

Current coverage

Page 12: Content analysis for ECM with Apache Tika

A brief history of Tika

Sponsored by the Apache Lucene PMC

Page 13: Content analysis for ECM with Apache Tika

Tika organization

Changing after graduation

Page 14: Content analysis for ECM with Apache Tika

Getting Tika

… and contributing

Page 15: Content analysis for ECM with Apache Tika

Tika Design

Page 16: Content analysis for ECM with Apache Tika

The Parser interface

void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;
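
To make the contract of the Parser interface concrete, here is a minimal sketch of a parser for plain text. It is not Tika code: SketchParser and PlainTextSketchParser are hypothetical names, a Map stands in for Tika's Metadata class, and TikaException is left out so that the example needs only the JDK. It shows the flow the interface implies: read the stream, record metadata, and report the content to the handler as XHTML SAX events.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical parser: treats the stream as UTF-8 text and reports
// every line as an XHTML <p> element.
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain");
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    // Demo helper: parses a string and collects the character data,
    // adding a newline after each paragraph.
    static String extractText(String input) {
        StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    text.append('\n');
                }
            }
        };
        try {
            new PlainTextSketchParser().parse(
                    new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)),
                    collector, new HashMap<>());
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.print(extractText("first line\nsecond line"));
    }
}

// Stand-in for the Parser interface on the slide (assumption: Map instead of
// Tika's Metadata, no TikaException).
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}
```

The caller decides what happens to the events by choosing the handler, which is why the same parser can feed a text extractor, an indexer, or a serializer.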

Page 17: Content analysis for ECM with Apache Tika

Tika Design

Page 18: Content analysis for ECM with Apache Tika

Document input stream

Page 19: Content analysis for ECM with Apache Tika

Tika Design

Page 20: Content analysis for ECM with Apache Tika

XHTML SAX events

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>...</title>
  </head>
  <body> ... </body>
</html>

Page 21: Content analysis for ECM with Apache Tika

Why XHTML?

• Reflects the structured text content of the document
• Does not recreate the low-level details
• For low-level details, use the low-level parser libraries

Page 22: Content analysis for ECM with Apache Tika

ContentHandler (CH) and Decorators (CHD)
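
A rough illustration of the decorator idea, using only the JDK's SAX classes rather than Tika's own handler classes. BodyOnlyDecorator is a hypothetical name. It wraps another ContentHandler and forwards character data only while inside the body element, which is roughly the kind of filtering a body-only text handler needs.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator: wraps another ContentHandler and forwards
// character data only while the event stream is inside <body>.
class BodyOnlyDecorator extends DefaultHandler {
    private final ContentHandler delegate;
    private boolean inBody;

    BodyOnlyDecorator(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(localName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(localName)) {
            inBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (inBody) {
            delegate.characters(ch, start, length);
        }
    }

    private static void text(ContentHandler handler, String s) throws SAXException {
        handler.characters(s.toCharArray(), 0, s.length());
    }

    // Feeds a small hand-written event stream through the decorator and
    // returns what reached the wrapped handler.
    static String demo() {
        StringBuilder collected = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                collected.append(ch, start, length);
            }
        };
        BodyOnlyDecorator handler = new BodyOnlyDecorator(collector);
        AttributesImpl none = new AttributesImpl();
        String ns = "http://www.w3.org/1999/xhtml";
        try {
            handler.startElement(ns, "title", "title", none);
            text(handler, "Ignored title");
            handler.endElement(ns, "title", "title");
            handler.startElement(ns, "body", "body", none);
            text(handler, "Hello body");
            handler.endElement(ns, "body", "body");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return collected.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the decorator is itself a ContentHandler, decorators can be stacked, each one adding a single concern.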

Page 23: Content analysis for ECM with Apache Tika

Tika Design

Page 24: Content analysis for ECM with Apache Tika

Document metadata

Page 25: Content analysis for ECM with Apache Tika

… more metadata: HPSF

Page 26: Content analysis for ECM with Apache Tika

Tika Design

Page 27: Content analysis for ECM with Apache Tika

Parser implementations

Page 28: Content analysis for ECM with Apache Tika

The AutoDetectParser

• Encapsulates all Tika functionality
• Can handle any type of document

Page 29: Content analysis for ECM with Apache Tika

Type Detection

MimeType type = types.getMimeType(…);

Page 30: Content analysis for ECM with Apache Tika

tika-mimetypes.xml

An example: Gzip

<mime-type type="application/x-gzip">
  <magic priority="40">
    <match value="\037\213" type="string" offset="0" />
  </magic>
  <glob pattern="*.tgz" />
  <glob pattern="*.gz" />
  <glob pattern="*-gz" />
</mime-type>
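
The Gzip entry can be read as two rules: a magic match on the first bytes and a set of file name globs. The sketch below mirrors those rules in plain Java. GzipTypeSketch is a hypothetical class, not part of Tika. The value \037\213 is written in octal and corresponds to the bytes 0x1F 0x8B. Accepting either rule and falling back to application/octet-stream are assumptions made for the example and say nothing about how Tika weighs magic against globs.

```java
// Hypothetical detector that mirrors the Gzip entry: a magic match at
// offset 0 and three file name globs.
class GzipTypeSketch {
    static final String GZIP = "application/x-gzip";
    // Assumed fallback type for the sketch.
    static final String UNKNOWN = "application/octet-stream";

    // "\037\213" (octal) is 0x1F 0x8B.
    private static final byte[] MAGIC = {(byte) 0x1F, (byte) 0x8B};
    // Suffix equivalents of the globs *.tgz, *.gz and *-gz.
    private static final String[] SUFFIXES = {".tgz", ".gz", "-gz"};

    static boolean matchesMagic(byte[] prefix) {
        if (prefix == null || prefix.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (prefix[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    static boolean matchesGlob(String fileName) {
        if (fileName == null) {
            return false;
        }
        for (String suffix : SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Either rule is enough in this sketch.
    static String detect(byte[] prefix, String fileName) {
        return (matchesMagic(prefix) || matchesGlob(fileName)) ? GZIP : UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] gzipHeader = {(byte) 0x1F, (byte) 0x8B, 8};
        System.out.println(detect(gzipHeader, "backup.bin"));
        System.out.println(detect(new byte[] {'P', 'K'}, "archive.tgz"));
        System.out.println(detect(new byte[] {'P', 'K'}, "archive.zip"));
    }
}
```

Magic bytes identify content even when the file name is misleading, while globs still give an answer when only a name is available.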

Page 31: Content analysis for ECM with Apache Tika

Supported formats

Page 32: Content analysis for ECM with Apache Tika

A really simple example

// Parse a PowerPoint file from the classpath with the Office parser
InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
new OfficeParser().parse(input, handler, metadata);

// Read back the content type, the title and the extracted text
String contentType = metadata.get(Metadata.CONTENT_TYPE);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();

Page 33: Content analysis for ECM with Apache Tika

Demo

?

Page 34: Content analysis for ECM with Apache Tika

Future Goals

Page 35: Content analysis for ECM with Apache Tika

Who uses Tika?

Page 36: Content analysis for ECM with Apache Tika

Agenda

Page 37: Content analysis for ECM with Apache Tika

ECM: what is it?

Page 38: Content analysis for ECM with Apache Tika

ECM: Manage

• Indexing
• Categorization

Page 39: Content analysis for ECM with Apache Tika

ECM: we love SEARCHING!

Page 40: Content analysis for ECM with Apache Tika

ECM: we love SEARCHING!

Page 41: Content analysis for ECM with Apache Tika

ECM: we love SEARCHING!

Page 42: Content analysis for ECM with Apache Tika

Don’t do it on your own

Tika shields the ECM from using many single components

Page 43: Content analysis for ECM with Apache Tika

Agenda

Page 44: Content analysis for ECM with Apache Tika

Alfresco: short presentation

Page 45: Content analysis for ECM with Apache Tika

Alfresco: short presentation

Page 46: Content analysis for ECM with Apache Tika

Who uses Alfresco?

Page 47: Content analysis for ECM with Apache Tika

Alfresco Repository

JSR-170 Level 2 compatible

Page 48: Content analysis for ECM with Apache Tika

Repository Architecture

[Diagram. Layers: Services, Components, Storage. Labels: Node, Content, Search, Query, Index, Database, Hibernate, Lucene]

Page 49: Content analysis for ECM with Apache Tika

Repository Architecture

[Diagram. Layers: Services, Components, Storage. Labels: Node, Content, Search, Query, Index, Database, Hibernate, Lucene]

Page 50: Content analysis for ECM with Apache Tika

Alfresco Search

Page 51: Content analysis for ECM with Apache Tika

Alfresco Search

Page 52: Content analysis for ECM with Apache Tika

Use case

Page 53: Content analysis for ECM with Apache Tika

Use case

Page 54: Content analysis for ECM with Apache Tika

Without Tika:

Page 55: Content analysis for ECM with Apache Tika

Step 1

Page 56: Content analysis for ECM with Apache Tika

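Step 2

ContentTransformerRegistry provides the most appropriate ContentTransformer

// Pick the transformer that reports the lowest transformation time.
// The two declarations are assumed; the slide shows only the loop.
ContentTransformer bestTransformer = null;
long bestTime = Long.MAX_VALUE;
for (ContentTransformer transformer : transformers)
{
    long transformationTime = transformer.getTransformationTime();
    if (bestTransformer == null || transformationTime < bestTime)
    {
        bestTransformer = transformer;
        bestTime = transformationTime;
    }
}
return bestTransformer;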
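
The selection loop can be tried in isolation. The sketch below is a stand-alone illustration, not Alfresco code: TimedTransformer is a hypothetical stand-in that keeps only the getTransformationTime() method seen on the slide, and the names and times are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-alone version of the "fastest transformer wins" selection.
class FastestTransformerSketch {

    // Hypothetical stand-in for ContentTransformer.
    interface TimedTransformer {
        String name();
        long getTransformationTime();
    }

    // Simple fixed-value implementation used for the demo.
    static final class FixedTransformer implements TimedTransformer {
        private final String name;
        private final long millis;

        FixedTransformer(String name, long millis) {
            this.name = name;
            this.millis = millis;
        }

        @Override
        public String name() {
            return name;
        }

        @Override
        public long getTransformationTime() {
            return millis;
        }
    }

    // Returns the transformer with the lowest reported time, or null
    // when the list is empty. The first one wins on ties.
    static TimedTransformer pickFastest(List<TimedTransformer> transformers) {
        TimedTransformer best = null;
        long bestTime = Long.MAX_VALUE;
        for (TimedTransformer transformer : transformers) {
            long time = transformer.getTransformationTime();
            if (best == null || time < bestTime) {
                best = transformer;
                bestTime = time;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<TimedTransformer> candidates = Arrays.<TimedTransformer>asList(
                new FixedTransformer("slow", 30),
                new FixedTransformer("fast", 10),
                new FixedTransformer("medium", 20));
        System.out.println(pickFastest(candidates).name());
    }
}
```

Every source format needs at least one registered transformer for this to return anything, which is the maintenance burden the following slides point at.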

Page 57: Content analysis for ECM with Apache Tika

Step 2 (explained)

Too many different ContentTransformer implementations

Page 58: Content analysis for ECM with Apache Tika

Step 3

Transform

Example: PoiHssfContentTransformer

public void transformInternal(ContentReader reader, ContentWriter writer,
        TransformationOptions options) throws Exception
{
    ...
    HSSFWorkbook workbook = new HSSFWorkbook(is);
    ...
    for (int i = 0; i < sheetCount; i++)
    {
        HSSFSheet sheet = workbook.getSheetAt(i);
        String sheetName = workbook.getSheetName(i);
        writeSheet(os, sheet, encoding);
    }
    ...
}

Page 59: Content analysis for ECM with Apache Tika

Step 3 (explained)

Too many different ContentTransformer implementations

... again !?!

Page 60: Content analysis for ECM with Apache Tika

Step 4

Lucene index creation

ContentReader reader = contentService.getReader(nodeRef, propertyName);
ContentTransformer transformer = contentService.getTransformer(
        reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
transformer.transform(reader, writer);
reader = writer.getReader();
. . . . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));

Page 61: Content analysis for ECM with Apache Tika

Let’s do it using Tika

Page 62: Content analysis for ECM with Apache Tika

Step 1 + Step 2 + Step 3

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
new AutoDetectParser().parse(input, handler, metadata);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();

Page 63: Content analysis for ECM with Apache Tika

Step 1 to 4 (compressed)

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Reader reader = new ParsingReader(input, name);
. . . . . .
doc.add(new Field(attributeName, reader, Field.TermVector.NO));

Page 64: Content analysis for ECM with Apache Tika

Results: 1 & 2

Page 65: Content analysis for ECM with Apache Tika

Extension use case

Adding support for Microsoft Office Open XML documents (Office 2007+)

Page 66: Content analysis for ECM with Apache Tika

Apache POI

Apache POI provides text extraction support for Office Open XML formats and advanced coverage of the SpreadsheetML specification (WordprocessingML & PresentationML to come)

Page 67: Content analysis for ECM with Apache Tika

Apache POI

Apache POI status

Page 68: Content analysis for ECM with Apache Tika

Apache POI TextExtractors

POIXMLDocument document;
Package pkg = Package.open(stream);
textExtractor = ExtractorFactory.createExtractor(pkg);

if (textExtractor instanceof XSSFExcelExtractor) {
    setType(metadata, OOXML_EXCEL_MIMETYPE);
    document = new XSSFWorkbook(pkg);
}
else if (textExtractor instanceof XWPFWordExtractor) {…}
else if (textExtractor instanceof XSLFPowerPointExtractor) {…}

setPOIXMLProperties(metadata, document);

Page 69: Content analysis for ECM with Apache Tika

Can we find it?

Page 70: Content analysis for ECM with Apache Tika

Results: 3 & 4

Page 71: Content analysis for ECM with Apache Tika

Q & A

[email protected]