NLP and Graph Databases in Lumify

Charlie Greenbacker & Joe Kerner

files.meetup.com/7616132/Lumify_DC-NLP+GraphDB_meetup.pdf


Agenda

•  Introductions

•  Lumify Overview

•  Natural Language Processing

•  Graph Databases

photo: Columbia Pictures

About me: @greenbacker
Theories: popular tripe
Methods: sloppy
Conclusions: highly questionable

Best reason for not finishing PhD

@ExploreAltamira

Lumify is an open source big data analysis and visualization platform built by Altamira engineers

Key Lumify Concepts

Ontology: structure for organizing information (i.e., your data model)

Entities: any “thing” you want to represent (e.g., person, place, event)

Relationships: a link between two entities (e.g., leader-of, works-for, sibling-of)

Properties: data about an entity (e.g., first name, last name, date of birth)

Graph: collection of entities and the relationships between them
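The five concepts above can be sketched as a tiny in-memory data model. This is only an illustration, not Lumify's implementation: the class names, the `neighbors` helper, and the sample people and organization are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    id: str
    concept: str                                    # must be a concept in the ontology
    properties: dict = field(default_factory=dict)  # data about the entity


@dataclass(frozen=True)
class Relationship:
    source: str   # id of one entity
    target: str   # id of the other entity
    label: str    # e.g. "works-for"


class Graph:
    """Entities plus the relationships between them, checked against an ontology."""

    def __init__(self, concepts, labels):
        self.concepts = set(concepts)   # ontology: allowed entity types
        self.labels = set(labels)       # ontology: allowed relationship types
        self.entities = {}
        self.relationships = []

    def add_entity(self, entity):
        if entity.concept not in self.concepts:
            raise ValueError(f"unknown concept: {entity.concept}")
        self.entities[entity.id] = entity

    def add_relationship(self, rel):
        if rel.label not in self.labels:
            raise ValueError(f"unknown relationship: {rel.label}")
        if rel.source not in self.entities or rel.target not in self.entities:
            raise KeyError("both endpoints must already be in the graph")
        self.relationships.append(rel)

    def neighbors(self, entity_id):
        """Return (label, other entity id) pairs for every link touching entity_id."""
        found = []
        for rel in self.relationships:
            if rel.source == entity_id:
                found.append((rel.label, rel.target))
            elif rel.target == entity_id:
                found.append((rel.label, rel.source))
        return found


# Hypothetical sample data, invented for this sketch.
graph = Graph(concepts={"person", "organization"},
              labels={"leader-of", "works-for", "sibling-of"})
graph.add_entity(Entity("p1", "person", {"first name": "Ana", "last name": "Example"}))
graph.add_entity(Entity("p2", "person", {"first name": "Ben", "last name": "Example"}))
graph.add_entity(Entity("o1", "organization", {"name": "Example Corp"}))
graph.add_relationship(Relationship("p1", "o1", "works-for"))
graph.add_relationship(Relationship("p1", "p2", "sibling-of"))
```

Calling `graph.neighbors("p1")` walks the links from one entity, which is the basic operation behind exploring connections on a graph.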

Live Demo

Who can Lumify help?

Intelligence Analyst

Lumify helps analysts fuse structured and unstructured data from myriad sources into actionable intelligence.

Police Investigator

Law enforcement personnel can use Lumify to explore criminal networks, uncover hidden connections, and develop leads.

Financial Analyst

Lumify analyzes financial data and transaction records to help detect fraud and identify possible insider threats.

photo: Ken Teegardin (https://flic.kr/p/9rn9Yh)

Research Staff

Scientists, law firms, news organizations, and others can track their research in Lumify to unearth latent knowledge and discover critical new insights.

photo: UK National Archives (http://bit.ly/1n9dhR8)

Why Lumify?

Free and Open Source

•  Distributed under the permissive Apache 2.0 license

•  No restrictions on modifications

•  No licensing or usage constraints

Built on Scalable Open Source Tech

•  Hadoop CDH 4

•  Accumulo

•  ElasticSearch

•  tesseract, CLAVIN, CMU Sphinx, OpenNLP, OpenCV, ffmpeg

•  Apache Storm

•  Secure Graph

•  custom code

Highly Secure

•  Separate security restrictions at the entity, property, and relationship level

•  Implemented in and enforced by Accumulo cell-level security

Joaquin Guzman Loera
DOB: 1957-04-04
POB: Badiraguato
Nationality: Mexican

Zarka de Mexico
Founded: 2010-01-11
Location: Mexico City
Employees: 121
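Property-level restrictions can be pictured as each property carrying its own visibility label, with a reader seeing only the properties their authorizations satisfy. The sketch below is a simplified stand-in, not Accumulo's API: it treats a visibility as a plain set of labels that must all be held, whereas real Accumulo visibility expressions also allow `|` and parentheses. The label names `pii` and `analyst`, and their assignment to these properties, are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecuredProperty:
    name: str
    value: str
    required: frozenset  # hypothetical labels; a reader must hold every one


def visible_properties(properties, authorizations):
    """Return only the properties whose required labels the reader holds."""
    held = frozenset(authorizations)
    return {p.name: p.value for p in properties if p.required <= held}


# Properties from the slide, with invented visibility labels.
person = [
    SecuredProperty("name", "Joaquin Guzman Loera", frozenset()),
    SecuredProperty("date_of_birth", "1957-04-04", frozenset({"pii"})),
    SecuredProperty("nationality", "Mexican", frozenset({"pii", "analyst"})),
]

public_view = visible_properties(person, [])
pii_view = visible_properties(person, ["pii"])
full_view = visible_properties(person, ["pii", "analyst"])
```

Two readers looking at the same entity therefore get different views of it: `public_view` holds only the name, while `full_view` holds all three properties.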

Supported

•  Full-time development staff

•  Custom development and customization services

•  Commercial support offerings

AWS Compatible

•  Day-to-day development done on Amazon infrastructure

•  Primarily use EC2, VPC, S3, SES, CloudWatch

•  Altamira is an AWS consulting partner

Natural Language Processing in Lumify

Text Extraction

•  text docs, structured data: extractor

•  images: OCR (tesseract)

•  audio: CMU Sphinx

•  video: CMU Sphinx and OCR (tesseract)
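The text extraction diagram amounts to routing each input by media type to the step that can turn it into text. A minimal sketch of that routing follows. The step names and the `plan_extraction` function are hypothetical, and the sketch assumes video goes through both the speech and OCR steps, which the diagram suggests but does not spell out. No real tesseract or CMU Sphinx calls are made here.

```python
# Hypothetical step names standing in for the tools named on the slide.
OCR = "ocr (tesseract)"
SPEECH = "speech-to-text (CMU Sphinx)"
PASSTHROUGH = "text extractor"

# Assumed routing table: media type -> ordered extraction steps.
EXTRACTION_STEPS = {
    "text": [PASSTHROUGH],
    "structured": [PASSTHROUGH],
    "image": [OCR],
    "audio": [SPEECH],
    "video": [SPEECH, OCR],
}


def plan_extraction(media_type):
    """Return the extraction steps to run for one input, by media type."""
    try:
        return list(EXTRACTION_STEPS[media_type])
    except KeyError:
        raise ValueError(f"unsupported media type: {media_type}") from None


video_plan = plan_extraction("video")
image_plan = plan_extraction("image")
```

Keeping the routing in a table makes it easy to add a new media type or swap a tool without touching the dispatch logic.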

Text Enrichment

Apache OpenNLP: Named Entity Recognition

•  Extracts names of entities from unstructured text

•  Persons, Orgs, & Locations

•  Highlighted in preview text

•  User must confirm/resolve

CLAVIN: Geospatial Entity Resolution

•  Resolves extracted location names to gazetteer records

•  Solves “Springfield problem”

•  Disambiguates place names

•  Turns text docs into maps!

Machine-powered entity extraction and resolution, combined with human QA and supplementation, supports rich semantic analysis of raw text.
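The “Springfield problem” can be illustrated with a toy resolver: when a place name matches several gazetteer records, prefer the record that shares a region with an unambiguous place mentioned in the same document, and otherwise fall back to the most populous match. This is a sketch of the general idea only, not CLAVIN's actual algorithm. The `Place` and `resolve` names are hypothetical, and the population figures are rough values included purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    admin: str        # state or region
    country: str
    population: int   # rough, illustrative figure


# A tiny stand-in gazetteer; a real one holds millions of records.
GAZETTEER = [
    Place("Springfield", "Illinois", "US", 116_000),
    Place("Springfield", "Missouri", "US", 169_000),
    Place("Springfield", "Massachusetts", "US", 155_000),
    Place("Chicago", "Illinois", "US", 2_700_000),
    Place("Boston", "Massachusetts", "US", 650_000),
]


def resolve(names, gazetteer=GAZETTEER):
    """Map each extracted location name to a single gazetteer record."""
    candidates = {n: [p for p in gazetteer if p.name == n] for n in names}
    # Regions of names with exactly one match act as context for the rest.
    context = {opts[0].admin for opts in candidates.values() if len(opts) == 1}
    resolved = {}
    for name, options in candidates.items():
        if not options:
            continue  # not in the gazetteer: leave for a human to resolve
        in_context = [p for p in options if p.admin in context]
        pool = in_context or options
        resolved[name] = max(pool, key=lambda p: p.population)
    return resolved


near_chicago = resolve(["Springfield", "Chicago"])["Springfield"]
near_boston = resolve(["Springfield", "Boston"])["Springfield"]
alone = resolve(["Springfield"])["Springfield"]
```

With these sample records, a document mentioning Chicago resolves Springfield to Illinois, one mentioning Boston resolves it to Massachusetts, and a lone mention falls back to the largest candidate. Names missing from the gazetteer stay unresolved, which is where the human confirmation step comes in.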

Enriched Text Documents

Drug Lord “El Chapo” Captured in Mexico

Published date: 2014/02/22
Source: Wikipedia

Although Guzman had long hidden successfully in remote areas of the Sierra Madre mountains, the arrested members of his security team told the military he had begun venturing out to Culiacan and the beach town of Mazatlan. A week prior to his capture, Guzman and Zambada were reported to have attended a family reunion in Sinaloa. The Mexican military followed the bodyguards’ tips to Guzman’s ex-wife’s house, but they had trouble ramming the steel-reinforced front door, which allowed Guzman to escape through a system of secret tunnels that connected six houses, eventually moving south to Mazatlan. He planned to stay a few days in Mazatlan to see his twin baby daughters before retreating to the mountains. On 22 February 2014, at around 6:40 a.m., Mexican authorities arrested Guzman at a hotel in a beachfront area in Mazatlan, Sinaloa, following an operation by the Mexican Navy, with joint intelligence from the DEA and

Benefits to Users

Increases Discoverability: quickly find relevant data without reading

Helps Deal with Information Overload: machines process text faster than humans

Uncovers Hidden Connections: enables object-based analysis & investigations

Future NLP Integration

Support other NER tools: e.g., Stanford NER, SUTime, MITIE

Event/Relationship Extraction: e.g., OpenIE (formerly ReVerb)

Coreference Resolution: augmenting/extending GATE/ANNIE

Additional Text Analytics: e.g., frequency analysis, topic modeling, sentiment analysis

Multilingual Support: use non-English language models for NER, etc.

Graph Databases in Lumify

view part 2 of the presentation here: github.com/altamiracorp/secure-graph-presentation

Questions?

more info: lumify.io