Importing into Neo4j No witty subtitle Dave Fauth @davefauth July 27

Importing into Neo4j - No witty subtitle -

Dave Fauth, @davefauth, July 27

Life as a Field Engineer

Ask the right questions

Model the Problem

[Diagram: property graph model for the expense example]
Nodes: Employee, Event, Expense Report (each with various properties)
Relationships:
• :SUBMITTED (SubmitDate, ReportNbr): a relationship entity has a name and can also have various properties
• :ATTENDED: a simple relationship has a name, but no properties
• :REFERENCES

User stories

Derive questions

Which people claimed expenses for the same event?

From user story to model

person ATTENDED event
person SUBMITTED expense_report

(person)-[:ATTENDED]->(event)<-[:ATTENDED]-(colleague)

MATCH (e:EMPLOYEE)-[:ATTENDED]->(ev:EVENT)<-[:ATTENDED]-(e1:EMPLOYEE)
WITH e, ev, e1
MATCH (ev)-[:EXPENSED_ON]->(er:EXPENSE_REPORT)
RETURN e, ev, e1, er;
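To make the question concrete before any database is involved, here is a minimal Python sketch that answers it over a tiny in-memory data set. Everything in it is illustrative: the sample people, reports and events are made up, and it assumes that an expense report points at its event through the :REFERENCES relationship, which the slides do not state explicitly.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data; the edge names mirror the relationship types on the slides.
submitted = [("alice", "er1"), ("bob", "er2"), ("carol", "er3")]  # person SUBMITTED expense_report
references = [("er1", "graphconnect"), ("er2", "graphconnect"), ("er3", "meetup")]  # assumed: report REFERENCES event


def co_claimants(submitted, references):
    """Return {event: pairs of people who both submitted a report for that event}."""
    event_of = dict(references)
    people_by_event = defaultdict(set)
    for person, report in submitted:
        event = event_of.get(report)
        if event is not None:
            people_by_event[event].add(person)
    # Keep only events with at least two claimants and list each pair once.
    return {
        event: list(combinations(sorted(people), 2))
        for event, people in people_by_event.items()
        if len(people) > 1
    }


result = co_claimants(submitted, references)
print(result)
```

In the graph, the same grouping falls out of the pattern match itself, which is the reason for modelling the relationships explicitly.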

Get some representative data

Loading your Data

Load CSV

More than just a basic data ingestion mechanism

It's an ETL Power Tool

Load CSV

ETL Power Tool
• Combines multiple aspects in a single operation
• Supports loading / ingesting CSV data from a URI (file://, http://, https://, ftp://)
• Direct mapping of input data into complex graph/domain structure
• Data conversion
• Supports complex computations
• Create or merge data, relationships and structure

Load CSV

Nodes - Indexes - Relationships
• Do multiple passes to create nodes and relationships instead of large, combined statements

LOAD CSV WITH HEADERS FROM "file:///path/to/file.csv" AS line
MERGE (a:Person {id: line.id})
ON CREATE SET a.name = line.name;

CREATE INDEX ON :Person(id);
CREATE INDEX ON :Movie(id);

LOAD CSV WITH HEADERS FROM "file:///path/to/file.csv" AS line
MATCH (m:Movie {id: line.movieId})
MATCH (a:Person {id: line.personId})
CREATE (a)-[:ACTED_IN {roles: [line.role]}]->(m);

Load CSV

Periodic Commit
• ALWAYS prefix your LOAD CSV with USING PERIODIC COMMIT.
• The number given is the number of import rows after which the imported data is committed.
• Depending on the complexity of your import operation, you might create anywhere from 100 elements per 1000 rows (if you have a lot of duplicates) up to 100,000 when you have complex operations that generate up to 100 nodes and relationships per row of input.
• That's why a commit size of 1000 might be a safe bet.
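As a rough illustration of what the commit size means, the Python sketch below splits an input of rows into batches of at most 1000 and shows how the prefix is attached to a statement. The helper names `commit_batches` and `with_periodic_commit` are hypothetical; this models the idea only and is not how Neo4j implements it.

```python
from itertools import islice


def commit_batches(rows, commit_size=1000):
    """Yield lists of at most commit_size rows; each list stands for one commit."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, commit_size))
        if not batch:
            return
        yield batch


def with_periodic_commit(load_csv_statement, commit_size=1000):
    """Prefix a LOAD CSV statement with USING PERIODIC COMMIT <commit_size>."""
    return f"USING PERIODIC COMMIT {commit_size}\n{load_csv_statement}"


# 2500 input rows with a commit size of 1000 give three commits.
batch_sizes = [len(batch) for batch in commit_batches(range(2500), commit_size=1000)]
statement = with_periodic_commit(
    'LOAD CSV WITH HEADERS FROM "file:///path/to/file.csv" AS line\n'
    "MERGE (a:Person {id: line.id});"
)
print(batch_sizes)
print(statement)
```

A smaller commit size keeps less uncommitted state in memory per transaction at the cost of more commits, which is the trade-off behind the "safe bet" of 1000.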

Load CSV

More Tips
• Rather than a long merge/create statement that attempts to create multiple entities in one pass, favor short, simple statements and do multiple passes over the input as needed
• If you load your CSV file over the network make sure the network is fast enough to sustain the ingestion rate you'd like to have. Otherwise:
• If possible download it, and use a file:// URL.
• Column names are case sensitive.
• Misspelled column names result in null values.
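Because a misspelled or wrongly cased column name silently yields null, it can help to check the header row before running the import. The following is a minimal Python sketch of such a check; the function name `missing_columns` and the sample CSV text are illustrative and not part of the talk.

```python
import csv
import io


def missing_columns(csv_text, required, delimiter=","):
    """Return the required column names that are absent from the header row.

    The comparison is case-sensitive on purpose, matching how column names
    are looked up during the import.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader, [])
    return [name for name in required if name not in header]


# Illustrative sample: "companyName" differs from the header only by case.
sample = "CustomerID,CompanyName,Phone\nALFKI,Alfreds Futterkiste,030-0074321\n"
problems = missing_columns(sample, ["CustomerID", "companyName", "Fax"])
print(problems)
```

Running a check like this against every file first is cheap compared with discovering a column of nulls after a long import.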

Load CSV

Considerations
• Make sure you have sufficient RAM
• Use file:///path/to/file.csv on OSX and Unix, use file:c:/path/to/file.csv on Windows
• Check for correct delimiters and columns
• Column names are case sensitive
• Empty columns are treated as null
• Default data type is String. Use toInt or toFloat to convert
• Change the delimiter if needed with ... AS line FIELDTERMINATOR ';'
• Create necessary indexes and constraints upfront
• Use the Neo4j-Shell for larger imports
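Several of these considerations (custom delimiter, empty columns as null, everything a string until converted) can be tried out on a sample file before touching the database. The Python sketch below mimics them with the standard csv module; the helpers `read_rows`, `to_int` and `to_float` and the sample rows are hypothetical stand-ins, simplified compared with the Cypher functions of the same name.

```python
import csv
import io


def to_int(value):
    """Simplified stand-in for toInt: None stays None, otherwise parse an integer."""
    return None if value is None else int(value)


def to_float(value):
    """Simplified stand-in for toFloat: None stays None, otherwise parse a float."""
    return None if value is None else float(value)


def read_rows(csv_text, delimiter=";"):
    """Yield one dict per row; empty fields become None, all other values stay strings."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        yield {key: (value if value != "" else None) for key, value in row.items()}


# Illustrative semicolon-separated sample; the second row has an empty UnitPrice.
sample = "ProductID;ProductName;UnitPrice;UnitsInStock\n1;Chai;18.00;39\n2;Chang;;17\n"
rows = [
    {**row, "UnitPrice": to_float(row["UnitPrice"]), "UnitsInStock": to_int(row["UnitsInStock"])}
    for row in read_rows(sample)
]
print(rows)
```

Note that ProductID stays a string here, just as it would in the import unless it is converted explicitly.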

CSV

CSV Files for Northwind Data

Step-by-step Creating the Graph

1. Import Nodes
2. Create Indexes
3. Import Relationships

LOADing the Data

// Create customers
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/customers.csv" AS row
CREATE (:Customer {companyName: row.CompanyName, customerID: row.CustomerID, fax: row.Fax, phone: row.Phone});

// Create products
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/products.csv" AS row
CREATE (:Product {productName: row.ProductName, productID: row.ProductID, unitPrice: toFloat(row.UnitPrice)});

// Create suppliers
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/suppliers.csv" AS row
CREATE (:Supplier {companyName: row.CompanyName, supplierID: row.SupplierID});

// Create employees
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/employees.csv" AS row
CREATE (:Employee {employeeID: row.EmployeeID, firstName: row.FirstName, lastName: row.LastName, title: row.Title});

// Create categories
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/categories.csv" AS row
CREATE (:Category {categoryID: row.CategoryID, categoryName: row.CategoryName, description: row.Description});

// Create orders
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/orders.csv" AS row
MERGE (order:Order {orderID: row.OrderID})
ON CREATE SET order.shipName = row.ShipName;

Creating the Indexes

CREATE INDEX ON :Product(productID);
CREATE INDEX ON :Product(productName);
CREATE INDEX ON :Category(categoryID);
CREATE INDEX ON :Employee(employeeID);
CREATE INDEX ON :Supplier(supplierID);
CREATE INDEX ON :Customer(customerID);
CREATE INDEX ON :Customer(customerName);

Creating the Relationships

// Customers to their orders
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (customer:Customer {customerID: row.CustomerID})
MERGE (customer)-[:PURCHASED]->(order);

// Suppliers to their products
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/products.csv" AS row
MATCH (product:Product {productID: row.ProductID})
MATCH (supplier:Supplier {supplierID: row.SupplierID})
MERGE (supplier)-[:SUPPLIES]->(product);

// Orders to their products; the line items (ProductID, UnitPrice, Quantity) come from order-details.csv
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/order-details.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (product:Product {productID: row.ProductID})
MERGE (order)-[pu:PRODUCT]->(product)
ON CREATE SET pu.unitPrice = toFloat(row.UnitPrice), pu.quantity = toFloat(row.Quantity);

// Employees to the orders they sold
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/orders.csv" AS row
MATCH (order:Order {orderID: row.OrderID})
MATCH (employee:Employee {employeeID: row.EmployeeID})
MERGE (employee)-[:SOLD]->(order);

// Products to their categories
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/products.csv" AS row
MATCH (product:Product {productID: row.ProductID})
MATCH (category:Category {categoryID: row.CategoryID})
MERGE (product)-[:PART_OF]->(category);

// Employees to their managers
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/developer-resources/gh-pages/data/northwind/employees.csv" AS row
MATCH (employee:Employee {employeeID: row.EmployeeID})
MATCH (manager:Employee {employeeID: row.ReportsTo})
MERGE (employee)-[:REPORTS_TO]->(manager);

Neo4j-Import

Neo4j-Import: Handling the initial firehose of data

Neo4j-Import

Create your initial database
• Allows you to load large amounts of data into a Neo4j database
• Allows you to specify nodes and relationships in separate files
• Supports loading / ingesting CSV data from your file system
• Does not support data transformations

Neo4j-Import

Notes:
• Fields are comma separated by default but a different delimiter can be specified.
• All files must use the same delimiter.
• Multiple data sources can be used for both nodes and relationships.
• A data source can optionally be provided using multiple files.
• A header which provides information on the data fields must be on the first row of each data source.
• Fields without corresponding information in the header will not be read.
• UTF-8 encoding is used.
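The notes above can be sketched as a small Python script that writes one node file per entity and one relationship file, each with a header on the first row, comma separated and UTF-8 encoded. The header fields `:ID`, `:LABEL`, `:START_ID`, `:END_ID` and `:TYPE` follow the neo4j-import header convention as I understand it, so check them against the documentation for your version. The property columns and the sample rows are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_csv(path, header, rows):
    """Write a UTF-8, comma-separated file whose first row is the header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)  # comma is the default delimiter
        writer.writerow(header)
        writer.writerows(rows)


out_dir = Path(tempfile.mkdtemp())

# Node files: an ID column, properties, and a label column (sample rows are made up).
write_csv(out_dir / "employee.csv", ["employeeId:ID", "name", ":LABEL"],
          [["e1", "Alice", "Employee"], ["e2", "Bob", "Employee"]])
write_csv(out_dir / "events.csv", ["eventId:ID", "title", ":LABEL"],
          [["ev1", "GraphConnect", "Event"]])

# Relationship file: start node ID, end node ID, relationship type.
write_csv(out_dir / "events_rels.csv", [":START_ID", ":END_ID", ":TYPE"],
          [["e1", "ev1", "ATTENDED"], ["e2", "ev1", "ATTENDED"]])

employee_header = (out_dir / "employee.csv").read_text(encoding="utf-8").splitlines()[0]
rels_lines = (out_dir / "events_rels.csv").read_text(encoding="utf-8").splitlines()
print(employee_header)
print(rels_lines)
```

Since the tool does not transform data, any cleanup or type conversion has to happen at this stage, while the files are being produced.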

Neo4j-Import

Sample Script
./bin/neo4j-import --into /Users/davidfauth/testDB \
  --nodes /Users/davidfauth/neo4j-atlanta-meetup/employee.csv \
  --nodes /Users/davidfauth/neo4j-atlanta-meetup/locations.csv \
  --nodes /Users/davidfauth/neo4j-atlanta-meetup/events.csv \
  --nodes /Users/davidfauth/neo4j-atlanta-meetup/expense_report.csv \
  --relationships /Users/davidfauth/neo4j-atlanta-meetup/events_rels.csv \
  --relationships /Users/davidfauth/neo4j-atlanta-meetup/exp_rep_rels.csv \
  --relationships /Users/davidfauth/neo4j-atlanta-meetup/expense_report_rels.csv \
  --bad-tolerance 10000
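When the list of files changes often, the command line can be assembled programmatically instead of edited by hand. The Python sketch below builds the argument list using only the flags that appear in the sample script (--into, --nodes, --relationships, --bad-tolerance). The helper `build_import_command` is hypothetical and only constructs the command; it does not run it.

```python
import shlex


def build_import_command(into, node_files, relationship_files,
                         bad_tolerance=None, binary="./bin/neo4j-import"):
    """Return the neo4j-import invocation as a list of arguments."""
    args = [binary, "--into", into]
    for path in node_files:
        args += ["--nodes", path]
    for path in relationship_files:
        args += ["--relationships", path]
    if bad_tolerance is not None:
        args += ["--bad-tolerance", str(bad_tolerance)]
    return args


base = "/Users/davidfauth/neo4j-atlanta-meetup"
command = build_import_command(
    into="/Users/davidfauth/testDB",
    node_files=[f"{base}/employee.csv", f"{base}/events.csv"],
    relationship_files=[f"{base}/events_rels.csv"],
    bad_tolerance=10000,
)
# Print a shell-safe rendering for review before running anything.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list could be handed to `subprocess.run(command, check=True)` once the paths have been verified.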

Transactional REST Endpoint

Pass Cypher Statements over REST
• The Neo4j transactional HTTP endpoint allows you to execute a series of Cypher statements within the scope of a transaction.
• Can use the same transaction for multiple HTTP requests
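As a sketch of what such a request looks like, the Python below builds (but does not send) a POST that carries two Cypher statements in one transaction. It assumes a local server at http://localhost:7474 and the /db/data/transaction/commit path used by Neo4j 2.x and 3.x; newer versions use a different path, and the `{id}` parameter syntax shown here was later replaced by `$id`. The helper name and the sample statements are hypothetical.

```python
import json
import urllib.request


def build_commit_request(statements, base_url="http://localhost:7474"):
    """Build a POST that runs all statements in a single transaction and commits.

    statements is a list of (cypher_text, parameters) pairs.
    """
    payload = {
        "statements": [
            {"statement": text, "parameters": params} for text, params in statements
        ]
    }
    return urllib.request.Request(
        url=f"{base_url}/db/data/transaction/commit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


request = build_commit_request([
    ("MERGE (e:Employee {employeeID: {id}}) SET e.name = {name}", {"id": "e1", "name": "Alice"}),
    ("MERGE (ev:Event {eventID: {id}})", {"id": "ev1"}),
])
body = json.loads(request.data)
print(request.get_method(), request.full_url)
print(json.dumps(body, indent=2))
```

Sending it with `urllib.request.urlopen(request)` requires a running server. To spread one transaction over several requests, the same payload shape is posted to the transaction resource first and committed in a later request.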

Building a Neo4j Application

Additional Resources
• Max DeMarzi (http://maxdemarzi.com)
  • Wikipedia into Neo4j with Graphipedia
  • Scaling concurrent writes in Neo4j
  • Online payment risk management with Neo4j
• Mark Needham (http://markneedham.com/blog)
  • Loading data – REST API vs Batch Import
  • The Batch Inserter and the sunk cost fallacy
• Rik Van Bruggen (http://blog.bruggen.com)
  • Import: summarised
  • Using `LOAD CSV` to import data from a Google Spreadsheet
  • Food networks, countries, diets, health and Load CSV
  • Some Neo4j import tweaks, what and where
  • Spreadsheet method: plenty!
• Michael Hunger (http://www.jexp.de/blog)
  • `LOAD CSV` into Neo4j quickly and successfully
  • Use `LOAD CSV` to import Git history into Neo4j
  • On Importing Data into Neo4j (blog)

Thank You