Natural Language Processing and Graph Databases in Lumify

Lumify is an open source platform for big data analysis and visualization, designed to help organizations derive actionable insights from the large volumes of diverse data flowing through their enterprise. Utilizing both Hadoop and Storm, it ingests and integrates virtually any kind of data, from unstructured text documents and structured datasets, to images and video. Several open source analytic tools (including Tika, OpenNLP, CLAVIN, OpenCV, and ElasticSearch) are used to enrich the data, increase its discoverability, and automatically uncover hidden connections. All information is stored in a secure graph database implemented on top of Accumulo to support cell-level security of all data and metadata elements. A modern, browser-based user interface enables analysts to explore and manipulate their data, discovering subtle relationships and drawing critical new insights. In addition to full-text search, geospatial mapping, and multimedia processing, Lumify features a powerful graph visualization supporting sophisticated link analysis and complex knowledge representation. Charlie Greenbacker, Director of Data Science at Altamira, will provide an overview of Lumify and discuss how natural language processing (NLP) tools are used to enrich the text content of ingested data and automatically discover connections with other bits of information. Joe Ferner, Senior Software Engineer at Altamira, will describe the creation of SecureGraph and how it supports authorizations, visibility strings, multivalued properties, and property metadata in a graph database.

Transcript of Natural Language Processing and Graph Databases in Lumify

  • NLP and Graph Databases in Lumify: Charlie Greenbacker & Joe Ferner
  • Agenda: Introductions; Lumify Overview; Natural Language Processing; Graph Databases
  • About me: @greenbacker. Theories: popular tripe; Methods: sloppy; Conclusions: highly questionable (photo: Columbia Pictures)
  • Best reason for not finishing PhD
  • @ExploreAltamira
  • Lumify is an open source big data analysis and visualization platform built by Altamira engineers
  • Key Lumify Concepts
      • Ontology: structure for organizing information (i.e., your data model)
      • Entities: any thing you want to represent (e.g., person, place, event)
      • Relationships: a link between two entities (e.g., leader-of, works-for, sibling-of)
      • Properties: data about an entity (e.g., first name, last name, date of birth)
      • Graph: collection of entities and the relationships between them
  • Live Demo
  • Who can Lumify help?
  • Intelligence Analyst: Lumify helps analysts fuse structured and unstructured data from myriad sources into actionable intelligence.
  • Police Investigator: Law enforcement personnel can use Lumify to explore criminal networks, uncover hidden connections, and develop leads.
  • Financial Analyst: Lumify analyzes financial data and transaction records to help detect fraud and identify possible insider threats. photo: Ken Teegardin (https://
  • Research Staff: Scientists, law firms, news organizations, and others can track their research in Lumify to unearth latent knowledge and discover critical new insights. photo: UK National Archives (http://
  • Why Lumify?
  • Free and Open Source: distributed under the permissive Apache 2.0 license; no restrictions on modifications; no licensing or usage constraints
  • Built on Scalable Open Source Tech: Hadoop CDH 4, Accumulo, ElasticSearch, tesseract, CLAVIN, CMU Sphinx, OpenNLP, OpenCV, ffmpeg, Apache Storm, SecureGraph, custom code
  • Highly Secure: separate security restrictions at the entity, property, and relationship level; implemented in and enforced by Accumulo cell-level security
      • Example entity: Joaquin Guzman Loera (DOB: 1957-04-04; POB: Badiraguarto; Nationality: Mexican)
      • Example entity: Zarka de Mexico (Founded: 2010-01-11; Location: Mexico City; Employees: 121)
  • Supported: full-time development staff; custom development and customization services; commercial support offerings
  • AWS Compatible: day-to-day development done on Amazon infrastructure; primarily use EC2, VPC, S3, SES, CloudWatch; Altamira is an AWS consulting partner
  • Natural Language Processing in Lumify
  • Text Extraction (diagram): text docs and structured data (extractor); images (OCR with tesseract); audio (CMU Sphinx); video (CMU Sphinx and OCR with tesseract)
  • Text Enrichment
      • Apache OpenNLP (Named Entity Recognition): extracts names of entities from unstructured text; Persons, Orgs, & Locations; highlighted in preview text; user must confirm/resolve
      • CLAVIN (Geospatial Entity Resolution): resolves extracted location names to gazetteer records; solves the "Springfield problem"; disambiguates place names; turns text docs into maps!
  • Enriched Text Documents: Machine-powered entity extraction and resolution, combined with human QA and supplementation, supports rich semantic analysis of raw text.
      • Example document: "Drug Lord El Chapo Captured in Mexico" (published date: 2014/02/22; source: Wikipedia)
      • Document text: Although Guzman had long hidden successfully in remote areas of the Sierra Madre mountains, the arrested members of his security team told the military he had begun venturing out to Culiacan and the beach town of Mazatlan. A week prior to his capture, Guzman and Zambada were reported to have attended a family reunion in Sinaloa. The Mexican military followed the bodyguards' tips to Guzman's ex-wife's house, but they had trouble ramming the steel-reinforced front door, which allowed Guzman to escape through a system of secret tunnels that connected six houses, eventually moving south to Mazatlan. He planned to stay a few days in Mazatlan to see his twin baby daughters before retreating to the mountains. On 22 February 2014, at around 6:40 a.m., Mexican authorities arrested Guzman at a hotel in a beach front area in Mazatlan, Sinaloa, following an operation by the Mexican Navy, with joint intelligence from the DEA and
  • Benefits to Users
      • Increases Discoverability: quickly find relevant data without reading
      • Helps Deal with Information Overload: machines process text faster than humans
      • Uncovers Hidden Connections: enables object-based analysis & investigations
  • Future NLP Integration
      • Support other NER tools: e.g., Stanford NER, SUTime, MITIE
      • Event/Relationship Extraction: e.g., OpenIE (formerly ReVerb)
      • Coreference Resolution: augmenting/extending GATE/ANNIE
      • Additional Text Analytics: e.g., frequency analysis, topic modeling, sentiment analysis
      • Multilingual Support: use non-English language models for NER, etc.
  • Graph Databases in Lumify; view part 2 of the presentation here:
  • Questions? more info:
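
The "Key Lumify Concepts" slide defines five terms: ontology, entities, relationships, properties, and graph. The sketch below shows one way those terms could fit together in plain Python. It is an illustration only and not Lumify's or SecureGraph's actual API: the class names, the shape of the `ontology` mapping, and the sample entities (`p1`, `o1`, "Alice") are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Any thing you want to represent, e.g. a person, place, or event."""
    entity_id: str
    concept: str                                      # ontology concept, e.g. "person"
    properties: dict = field(default_factory=dict)    # data about the entity


@dataclass(frozen=True)
class Relationship:
    """A labeled link between two entities."""
    source_id: str
    label: str                                        # e.g. "works-for"
    target_id: str


class Graph:
    """A collection of entities and the relationships between them."""

    def __init__(self, ontology):
        # Hypothetical ontology format: each relationship label maps to the
        # (source concept, target concept) pair it is allowed to connect.
        self.ontology = ontology
        self.entities = {}
        self.relationships = []

    def add_entity(self, entity):
        self.entities[entity.entity_id] = entity

    def relate(self, source_id, label, target_id):
        source = self.entities[source_id]
        target = self.entities[target_id]
        if self.ontology.get(label) != (source.concept, target.concept):
            raise ValueError(f"ontology does not allow {label!r} here")
        self.relationships.append(Relationship(source_id, label, target_id))

    def neighbors(self, entity_id):
        """Return (label, other entity id) pairs for outgoing, then incoming links."""
        outgoing = [(r.label, r.target_id)
                    for r in self.relationships if r.source_id == entity_id]
        incoming = [(r.label, r.source_id)
                    for r in self.relationships if r.target_id == entity_id]
        return outgoing + incoming


# Made-up sample data, for illustration only.
graph = Graph({"works-for": ("person", "organization")})
graph.add_entity(Entity("p1", "person", {"first name": "Alice"}))
graph.add_entity(Entity("o1", "organization"))
graph.relate("p1", "works-for", "o1")
```

Checking each new relationship against the ontology is what makes the ontology act as a data model rather than documentation: a link such as an organization that "works-for" a person is rejected when it is created.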
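
The "Highly Secure" slide says security restrictions apply separately at the entity, property, and relationship level. A minimal sketch of property-level filtering follows, using the property values shown on that slide. It is a deliberate simplification and not Accumulo's or SecureGraph's implementation: it accepts only visibility strings whose labels are joined by `&` alone or by `|` alone, and it rejects parentheses and mixed operators. The function names and the labels `analyst` and `supervisor` are assumptions introduced for the example.

```python
def is_visible(visibility, authorizations):
    """Return True if a reader holding `authorizations` may see the value.

    Simplified rules (an assumption of this sketch):
      ""      -> visible to everyone
      "a&b"   -> reader needs every label
      "a|b"   -> reader needs at least one label
    """
    if not visibility:
        return True
    if "(" in visibility or ")" in visibility:
        raise ValueError("parenthesized expressions are not handled in this sketch")
    if "&" in visibility and "|" in visibility:
        raise ValueError("mixed operators are not handled in this sketch")
    if "|" in visibility:
        return any(label.strip() in authorizations
                   for label in visibility.split("|"))
    return all(label.strip() in authorizations
               for label in visibility.split("&"))


def visible_properties(properties, authorizations):
    """Filter (name, value, visibility) triples down to what the reader may see."""
    return {name: value
            for name, value, visibility in properties
            if is_visible(visibility, authorizations)}


# Property values come from the slide; the visibility labels are hypothetical.
properties = [
    ("Nationality", "Mexican", ""),
    ("DOB", "1957-04-04", "analyst"),
    ("POB", "Badiraguarto", "analyst&supervisor"),
]
for_analyst = visible_properties(properties, {"analyst"})
```

Because each property carries its own visibility string, two readers querying the same entity can receive different subsets of its properties without the application keeping separate copies of the data.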
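
The "Text Enrichment" slide mentions the "Springfield problem": one extracted place name can match several gazetteer records. The toy sketch below illustrates the idea with a three-entry gazetteer and a single context clue. It is not CLAVIN's algorithm or API; the `GAZETTEER` structure, the `resolve_location` function, and the region-name heuristic are assumptions chosen to keep the example short.

```python
# Toy gazetteer: place name -> candidate records. A real gazetteer holds
# coordinates and far more entries; these three are for illustration only.
GAZETTEER = {
    "Springfield": [
        {"name": "Springfield", "region": "Illinois"},
        {"name": "Springfield", "region": "Massachusetts"},
        {"name": "Springfield", "region": "Missouri"},
    ],
}


def resolve_location(name, context):
    """Pick the gazetteer record for `name`, using the surrounding text as a clue.

    Returns None when the name is unknown or still ambiguous, which leaves
    the decision to a human reviewer.
    """
    candidates = GAZETTEER.get(name, [])
    if len(candidates) == 1:
        return candidates[0]
    lowered = context.lower()
    matches = [c for c in candidates if c["region"].lower() in lowered]
    return matches[0] if len(matches) == 1 else None


resolved = resolve_location("Springfield",
                            "The governor spoke in Springfield, Illinois.")
unresolved = resolve_location("Springfield", "He drove to Springfield.")
```

Returning `None` for the ambiguous case mirrors the slide's note that the user must confirm or resolve extracted entities: the machine proposes a match only when the text supports exactly one candidate.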