Solr JDBC: Presented by Kevin Risden, Avalon Consulting
-
Upload
lucidworks -
Category
Technology
-
view
240 -
download
5
Transcript of Solr JDBC: Presented by Kevin Risden, Avalon Consulting
3
03 About me
• Consultant with Avalon Consulting, LLC
• ~4 years working with Hadoop and Search • Contributed patches to Ambari, HBase, Knox, Solr, Storm • Installation, security, performance tuning, development,
administration
• Kevin Risden
• Apache Lucene/Solr Committer
• YCSB Contributor
5
01 Background - What is JDBC?
The JDBC API is a Java API that can access any kind of tabular data, especially data stored in a Relational Database.
Source: https://docs.oracle.com/javase/tutorial/jdbc/overview/
JDBC drivers convert SQL into a backend query.
6
01 Background - Why should you care about Solr JDBC?
• SQL skills are prolific.
• JDBC drivers exist for most relational databases.
• Existing reporting tools work with JDBC/ODBC drivers.
Solr 6 works with SQL and existing JDBC tools!
7
01 Use Case – Analytics – Utility Rates Data set: 2011 Utility Rates
Questions: • How many utility companies serve the state of Maryland?
• Which Maryland utility has the cheapest residential rates?
• What are the minimum and maximum residential power rates excluding missing data elements?
• What is the state and zip code with the highest residential rate?
How could you answer those questions with Solr?
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
• Facets • Filter Queries • Filters • Grouping
• Sorting • Stats • String queries together
8
01 Use Case – Analytics – Utility Rates
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
Method: Lucene syntax
Questions: • How many utility companies serve the state of Maryland?
http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1
• Which Maryland utility has the cheapest residential rates? http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc
• What are the minimum and maximum residential power rates excluding missing data elements? http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate
• What is the state and zip code with the highest residential rate? http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1
Is there a better way?
9
01 Solr JDBC Highlights
• JDBC Driver for Solr
• Powered by Streaming Expressions and Parallel SQL • Thursday - Parallel SQL and Analytics with Solr – Yonik Seeley • Thursday - Creating New Streaming Expressions – Dennis Gove
• Integrates with any* JDBC client * tested with the JDBC clients in this presentation
Usage jdbc:solr://SOLR_ZK_CONNECTION_STRING?collection=COLLECTION_NAME
Apache Solr Reference Guide - Parallel SQL Interface
11
01 Demo
Programming Languages • Java • Python/Jython • R • Apache Spark Web • Apache Zeppelin • RStudio
GUI – JDBC • DbVisualizer • SQuirreL SQL GUI – ODBC • Microsoft Excel • Tableau*
https://github.com/risdenk/solrj-jdbc-testing
12
01 Demo – Java import org.slf4j.Logger; import org.slf4j.LoggerFactory; import java.sql.*; public class SolrJJDBCTestingJava { private static final Logger LOGGER = LoggerFactory.getLogger(SolrJJDBCTestingJava.class); public static void main(String[] args) throws Exception { String sql = args[0]; try (Connection con = DriverManager.getConnection("jdbc:solr://solr:9983?collection=test")) { try (Statement stmt = con.createStatement()) { try (ResultSet rs = stmt.executeQuery(sql)) { ResultSetMetaData rsMetaData = rs.getMetaData(); int columns = rsMetaData.getColumnCount(); StringBuilder header = new StringBuilder(); for(int i = 1; i < columns + 1; i++) { header.append(rsMetaData.getColumnLabel(i)).append(","); } LOGGER.info(header.toString()); while (rs.next()) { StringBuilder row = new StringBuilder(); for(int i = 1; i < columns + 1; i++) { row.append(rs.getObject(i)).append(","); } LOGGER.info(row.toString()); } } } } } }
Apache Solr Reference Guide - Generic
13
01 Demo – Python #!/usr/bin/env python # https://pypi.python.org/pypi/JayDeBeApi/ import jaydebeapi import sys if __name__ == '__main__': jdbc_url = "jdbc:solr://solr:9983?collection=test” driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl” statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10” conn = jaydebeapi.connect(driverName, jdbc_url) curs = conn.cursor() curs.execute(statement) print(curs.fetchall()) conn.close()
Apache Solr Reference Guide - Python/Jython
14
01 Demo – Jython #!/usr/bin/env jython # http://www.jython.org/jythonbook/en/1.0/DatabasesAndJython.html # https://wiki.python.org/jython/DatabaseExamples#SQLite_using_JDBC import sys from java.lang import Class from java.sql import DriverManager, SQLException if __name__ == '__main__': jdbc_url = "jdbc:solr://solr:9983?collection=test” driverName = "org.apache.solr.client.solrj.io.sql.DriverImpl” statement = "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10” dbConn = DriverManager.getConnection(jdbc_url) stmt = dbConn.createStatement() resultSet = stmt.executeQuery(statement) while resultSet.next(): print(resultSet.getString("fielda")) resultSet.close() stmt.close() dbConn.close() Apache Solr Reference Guide - Python/Jython
15
01 Demo – R # https://www.rforge.net/RJDBC/ library("RJDBC") solrCP <- c(list.files('/opt/solr/dist/solrj-lib', full.names=TRUE), list.files('/opt/solr/dist', pattern='solrj', full.names=TRUE, recursive = TRUE)) drv <- JDBC("org.apache.solr.client.solrj.io.sql.DriverImpl", solrCP, identifier.quote="`") conn <- dbConnect(drv, "jdbc:solr://solr:9983?collection=test", "user", "pwd") dbGetQuery(conn, "select fielda, fieldb, fieldc, fieldd_s, fielde_i from test limit 10") dbDisconnect(conn)
Apache Solr Reference Guide - R
21
01 Use Case – Analytics – Utility Rates
Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
Method: Lucene syntax
Questions: • How many utility companies serve the state of Maryland?
http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=10&group.limit=1
• Which Maryland utility has the cheapest residential rates? http://solr:8983/solr/rates/select?q=state%3A%22MD%22&wt=json&indent=true&group=true&group.field=utility_name&rows=1&group.limit=1&sort=res_rate+asc
• What are the minimum and maximum residential power rates excluding missing data elements? http://solr:8983/solr/rates/select?q=*:*&fq=%7b!frange+l%3D0.0+incl%3Dfalse%7dres_rate&wt=json&indent=true&rows=0&stats=true&stats.field=res_rate
• What is the state and zip code with the highest residential rate? http://solr:8983/solr/rates/select?q=res_rate:0.849872773537&wt=json&indent=true&rows=1
Is there a better way?
22
01 Use Case – Analytics – Utility Rates Method: SQL
Questions: • How many utility companies serve the state of Maryland?
select distinct utility_name from rates where state='MD';
• Which Maryland utility has the cheapest residential rates? select utility_name,min(res_rate) from rates where state='MD' group by utility_name order by min(res_rate) asc limit 1;
• What are the minimum and maximum residential power rates excluding missing data elements? select min(res_rate),max(res_rate) from rates where not res_rate = 0;
• What is the state and zip code with the highest residential rate? select state,zip,max(res_rate) from rates group by state,zip order by max(res_rate) desc limit 1;
How should you answer those questions with Solr? – Using SQL! Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
23
01 Use Case – Analytics – Utility Rates
How should you answer those questions with Solr? – Using SQL! Inspired By: http://blog.cloudera.com/blog/2015/10/how-to-use-apache-solr-to-query-indexed-data-for-analytics/
24
01 Future Development/Improvements • Replace Presto with Apache Calcite - SOLR-8593
• Improve SQL compatibility
• Ability to specify optimization rules (push downs, joins, etc)
• Potentially use Avatica JDBC/ODBC drivers
• Streaming Expressions/Parallel SQL improvements - SOLR-8125
• JDBC driver improvements - SOLR-8659
Info on how to get involved
25
01 Future Development/Improvements
SQL Join
Info on how to get involved
SELECT
movie_title,character_name,line
FROM
movie_dialogs_movie_titles_metadata a
JOIN
movie_dialogs_movie_lines b
ON
a.movieID=b.movieID;
select(
innerJoin(
search(movie_dialogs_movie_titles_metadata,
q=*:*,
fl="movieID,movie_title",
sort="movieID asc"),
search(movie_dialogs_movie_lines,
q=*:*,
fl="movieID,character_name,line",
sort="movieID asc"),
on="movieID”
),
movie_title,character_name,line
)
Streaming Expression Join