Solr JDBC: Presented by Kevin Risden, Avalon Consulting

OCTOBER 11-14, 2016 BOSTON, MA


Solr JDBC Kevin Risden

Apache Lucene/Solr Committer; Avalon Consulting, LLC

About me

•  Consultant with Avalon Consulting, LLC

•  ~4 years working with Hadoop and Search

•  Contributed patches to Ambari, HBase, Knox, Solr, Storm

•  Installation, security, performance tuning, development, administration

•  Kevin Risden

•  Apache Lucene/Solr Committer

•  YCSB Contributor

Overview

•  Background

•  Use Case

•  Solr JDBC

•  Demo

•  Future Development/Improvements

Background - What is JDBC?

The JDBC API is a Java API that can access any kind of tabular data, especially data stored in a Relational Database.

Source: https://docs.oracle.com/javase/tutorial/jdbc/overview/

JDBC drivers convert SQL into a backend query.

Background - Why should you care about Solr JDBC?

•  SQL skills are widespread.

•  JDBC drivers exist for most relational databases.

•  Existing reporting tools work with JDBC/ODBC drivers.

Solr 6 works with SQL and existing JDBC tools!

Use Case – Analytics – Utility Rates

Data set: 2011 Utility Rates

Questions:

•  How many utility companies serve the state of Maryland?

•  Which Maryland utility has the cheapest residential rates?

•  What are the minimum and maximum residential power rates excluding missing data elements?

•  What is the state and zip code with the highest residential rate?

How could you answer those questions with Solr?

•  Facets

•  Filter Queries

•  Filters

•  Grouping

•  Sorting

•  Stats

•  String queries together

Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/

Use Case – Analytics – Utility Rates

Method: Lucene syntax

Questions:

•  How many utility companies serve the state of Maryland?

http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1

•  Which Maryland utility has the cheapest residential rates?

http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc

•  What are the minimum and maximum residential power rates excluding missing data elements?

http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate

•  What is the state and zip code with the highest residential rate?

http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1
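The request URLs above are hard to read because every parameter is percent-encoded. As a minimal sketch using only the Python standard library, the first URL can be assembled from plain parameters and the third decoded back into its readable form. The helper name `select_url` is hypothetical and not part of Solr or SolrJ.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(params, base="http://solr:8983/solr/rates/select"):
    """Hypothetical helper: append URL-encoded parameters to the /select handler."""
    return base + "?" + urlencode(params)


# Question 1: group the Maryland documents by utility name.
maryland_utilities = select_url({
    "q": 'state:"MD"',
    "wt": "json",
    "indent": "true",
    "group": "true",
    "group.field": "utility_name",
    "rows": 10,
    "group.limit": 1,
})

# Question 3, copied from the slide and decoded into readable parameters.
min_max_url = (
    "http://solr:8983/solr/rates/select?q=*:*"
    "&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate"
    "&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate"
)
min_max_params = parse_qs(urlsplit(min_max_url).query)

if __name__ == "__main__":
    print(maryland_utilities)
    print(min_max_params["fq"][0])
```

Decoded, the filter query of the third request reads `{!frange l=0.0 incl=false}res_rate`: a function range query with an exclusive lower bound of 0.0, which is how the slide excludes the missing (zero) rates.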

Is there a better way?

Solr JDBC Highlights

•  JDBC Driver for Solr

•  Powered by Streaming Expressions and Parallel SQL

    •  Thursday - Parallel SQL and Analytics with Solr – Yonik Seeley

    •  Thursday - Creating New Streaming Expressions – Dennis Gove

•  Integrates with any* JDBC client

* tested with the JDBC clients in this presentation

Usage: jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME

Apache Solr Reference Guide - Parallel SQL Interface
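As a small sketch of how the usage template is filled in, the helper below joins a ZooKeeper connection string and a collection name into a JDBC URL. The function name `solr_jdbc_url` is hypothetical. The values `solr:9983` and `test` are the ones used in the demo slides. The extra properties `aggregationMode` and `numWorkers` are assumed from the Parallel SQL documentation for Solr 6 and should be checked against the Solr version in use.

```python
from urllib.parse import urlencode


def solr_jdbc_url(zk_connection_string, collection, **properties):
    """Hypothetical helper (not part of SolrJ): fill in the slide's URL template."""
    params = {"collection": collection}
    params.update(properties)
    return "jdbc:solr://{}?{}".format(zk_connection_string, urlencode(params))


# Values from the demo slides: ZooKeeper at solr:9983, collection "test".
DEMO_URL = solr_jdbc_url("solr:9983", "test")

# Assumed optional properties; verify the names against your Solr release.
TUNED_URL = solr_jdbc_url(
    "solr:9983", "test", aggregationMode="map_reduce", numWorkers=2
)

if __name__ == "__main__":
    print(DEMO_URL)
    print(TUNED_URL)
```

Note that the host and port in the URL point at ZooKeeper, not at a Solr node, which is why the demos connect to port 9983 while the HTTP examples use port 8983.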

Solr JDBC - Architecture

Demo

Programming Languages

•  Java

•  Python/Jython

•  R

•  Apache Spark

Web

•  Apache Zeppelin

•  RStudio

GUI – JDBC

•  DbVisualizer

•  SQuirreL SQL

GUI – ODBC

•  Microsoft Excel

•  Tableau*

https://github.com/risdenk/solrj-jdbc-testing

Demo – Java

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.sql.*;

public class SolrJJDBCTestingJava {
  private static final Logger LOGGER = LoggerFactory.getLogger(SolrJJDBCTestingJava.class);

  public static void main(String[] args) throws Exception {
    // The SQL statement to run is passed as the first command-line argument.
    String sql = args[0];

    // solr:9983 is the ZooKeeper connection string; "test" is the collection.
    try (Connection con = DriverManager.getConnection("jdbc:solr://solr:9983?collection=test")) {
      try (Statement stmt = con.createStatement()) {
        try (ResultSet rs = stmt.executeQuery(sql)) {
          ResultSetMetaData rsMetaData = rs.getMetaData();
          int columns = rsMetaData.getColumnCount();

          // Log the column labels as a header row (JDBC column indexes are 1-based).
          StringBuilder header = new StringBuilder();
          for (int i = 1; i < columns + 1; i++) {
            header.append(rsMetaData.getColumnLabel(i)).append(",");
          }
          LOGGER.info(header.toString());

          // Log each result row as comma-separated values.
          while (rs.next()) {
            StringBuilder row = new StringBuilder();
            for (int i = 1; i < columns + 1; i++) {
              row.append(rs.getObject(i)).append(",");
            }
            LOGGER.info(row.toString());
          }
        }
      }
    }
  }
}

Apache Solr Reference Guide - Generic

Demo – Python

#!/usr/bin/env python

# https://pypi.python.org/pypi/JayDeBeApi/
import jaydebeapi
import sys

if __name__ == '__main__':
  jdbc_url = "jdbc:solr://solr:9983?collection=test"
  driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl"
  statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10"

  # Open the connection through the Solr JDBC driver and run the query.
  conn = jaydebeapi.connect(driverName, jdbc_url)
  curs = conn.cursor()
  curs.execute(statement)
  print(curs.fetchall())

  conn.close()

Apache Solr Reference Guide - Python/Jython

Demo – Jython

#!/usr/bin/env jython

# http://www.jython.org/jythonbook/en/1.0/DatabasesAndJython.html
# https://wiki.python.org/jython/DatabaseExamples#SQLite_using_JDBC
import sys

from java.lang import Class
from java.sql import DriverManager, SQLException

if __name__ == '__main__':
  jdbc_url = "jdbc:solr://solr:9983?collection=test"
  driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl"
  statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10"

  # Jython calls the java.sql classes directly.
  dbConn = DriverManager.getConnection(jdbc_url)
  stmt = dbConn.createStatement()

  resultSet = stmt.executeQuery(statement)
  while resultSet.next():
    print(resultSet.getString("fielda"))

  resultSet.close()
  stmt.close()
  dbConn.close()

Apache Solr Reference Guide - Python/Jython

Demo – R

# https://www.rforge.net/RJDBC/
library("RJDBC")

# Build the classpath from the SolrJ jars shipped in the Solr distribution.
solrCP <- c(list.files('/opt/solr/dist/solrj-lib', full.names=TRUE),
            list.files('/opt/solr/dist', pattern='solrj', full.names=TRUE, recursive = TRUE))

drv <- JDBC("org.apache.solr.client.solrj.io.sql.DriverImpl",
            solrCP,
            identifier.quote="`")
conn <- dbConnect(drv, "jdbc:solr://solr:9983?collection=test", "user", "pwd")

dbGetQuery(conn, "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10")

dbDisconnect(conn)

Apache Solr Reference Guide - R

Demo – Apache Zeppelin

Apache Solr Reference Guide - Apache Zeppelin

Demo – RStudio

Demo – DbVisualizer

Apache Solr Reference Guide - DbVisualizer

Demo – SQuirreL SQL

Apache Solr Reference Guide - SQuirreL SQL

Demo – Microsoft Excel


Use Case – Analytics – Utility Rates

Method: SQL

Questions:

•  How many utility companies serve the state of Maryland?

select distinct utility_name from rates where state='MD';

•  Which Maryland utility has the cheapest residential rates?

select utility_name,min(res_rate) from rates where state='MD' group by utility_name order by min(res_rate) asc limit 1;

•  What are the minimum and maximum residential power rates excluding missing data elements?

select min(res_rate),max(res_rate) from rates where not res_rate = 0;

•  What is the state and zip code with the highest residential rate?

select state,zip,max(res_rate) from rates group by state,zip order by max(res_rate) desc limit 1;
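To make the shape of each answer concrete, the sketch below runs the four statements against a tiny in-memory SQLite table. This is an illustration under loud assumptions: SQLite stands in for Solr (Solr's SQL dialect is narrower and may behave differently), the table layout is guessed from the field names in the queries, and every row is invented and not taken from the 2011 Utility Rates data set.

```python
import sqlite3

# Invented sample rows for illustration only; NOT from the 2011 data set.
# A res_rate of 0 stands for a missing value, as the third query implies.
SAMPLE_ROWS = [
    ("MD", "00001", "Utility A", 0.12),
    ("MD", "00002", "Utility B", 0.10),
    ("MD", "00003", "Utility A", 0.11),
    ("VA", "00004", "Utility C", 0.0),
    ("AK", "00005", "Utility D", 0.85),
]

# The four statements from the slide, unchanged.
QUERIES = {
    "utilities_in_md": "select distinct utility_name from rates where state='MD';",
    "cheapest_md_utility": (
        "select utility_name,min(res_rate) from rates where state='MD' "
        "group by utility_name order by min(res_rate) asc limit 1;"
    ),
    "min_max_rate": "select min(res_rate),max(res_rate) from rates where not res_rate = 0;",
    "highest_rate_location": (
        "select state,zip,max(res_rate) from rates "
        "group by state,zip order by max(res_rate) desc limit 1;"
    ),
}


def run_queries():
    """Load the sample rows into SQLite and return each query's result rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table rates (state text, zip text, utility_name text, res_rate real)"
        )
        conn.executemany("insert into rates values (?, ?, ?, ?)", SAMPLE_ROWS)
        return {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
    finally:
        conn.close()


RESULTS = run_queries()

if __name__ == "__main__":
    for name, rows in RESULTS.items():
        print(name, rows)
```

The first statement returns the list of distinct utilities rather than a count, so the answer to "how many" is the number of rows returned.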

How should you answer those questions with Solr? Using SQL!


Future Development/Improvements

•  Replace Presto with Apache Calcite - SOLR-8593

•  Improve SQL compatibility

•  Ability to specify optimization rules (push downs, joins, etc)

•  Potentially use Avatica JDBC/ODBC drivers

•  Streaming Expressions/Parallel SQL improvements - SOLR-8125

•  JDBC driver improvements - SOLR-8659

Info on how to get involved

Future Development/Improvements

SQL Join

SELECT movie_title,character_name,line
FROM movie_dialogs_movie_titles_metadata a
JOIN movie_dialogs_movie_lines b
ON a.movieID=b.movieID;

Streaming Expression Join

select(
  innerJoin(
    search(movie_dialogs_movie_titles_metadata,
           q=*:*,
           fl="movieID,movie_title",
           sort="movieID asc"),
    search(movie_dialogs_movie_lines,
           q=*:*,
           fl="movieID,character_name,line",
           sort="movieID asc"),
    on="movieID"
  ),
  movie_title,character_name,line
)
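Both search() streams in the expression are sorted by movieID because innerJoin expects its inputs ordered on the join field. The sketch below shows the general sort-merge join technique that this ordering makes possible, followed by the field projection that the outer select() performs. It is an illustration of the idea, not Solr's implementation; the function name `inner_merge_join` and all sample tuples are invented.

```python
from itertools import groupby
from operator import itemgetter


def inner_merge_join(left, right, on):
    """Inner join of two iterables of dicts that are both already sorted by `on`."""
    key = itemgetter(on)
    left_groups = groupby(left, key=key)
    right_groups = groupby(right, key=key)
    done = (None, None)  # sentinel returned once a side is exhausted
    left_key, left_rows = next(left_groups, done)
    right_key, right_rows = next(right_groups, done)
    while left_rows is not None and right_rows is not None:
        if left_key < right_key:
            # No match on the right for this key: skip the left group.
            left_key, left_rows = next(left_groups, done)
        elif left_key > right_key:
            # No match on the left for this key: skip the right group.
            right_key, right_rows = next(right_groups, done)
        else:
            # Keys match: emit every left/right combination for this key.
            right_buffer = list(right_rows)
            for left_row in left_rows:
                for right_row in right_buffer:
                    yield {**left_row, **right_row}
            left_key, left_rows = next(left_groups, done)
            right_key, right_rows = next(right_groups, done)


# Invented tuples, sorted by movieID like the two search() streams.
titles = [
    {"movieID": "m0", "movie_title": "Title A"},
    {"movieID": "m1", "movie_title": "Title B"},
    {"movieID": "m3", "movie_title": "Title C"},
]
lines = [
    {"movieID": "m0", "character_name": "X", "line": "Hello"},
    {"movieID": "m0", "character_name": "Y", "line": "Hi"},
    {"movieID": "m2", "character_name": "Z", "line": "Bye"},
    {"movieID": "m3", "character_name": "W", "line": "Well"},
]

# The outer select() keeps only these three fields.
FIELDS = ("movie_title", "character_name", "line")
joined = [
    {field: row[field] for field in FIELDS}
    for row in inner_merge_join(titles, lines, on="movieID")
]

if __name__ == "__main__":
    for row in joined:
        print(row)
```

Because each side is read once and in order, a join of this kind can stream results without holding either collection fully in memory; only the rows sharing the current key on the right side are buffered.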

Questions?