Legal Informatics with AWS CloudSearch - AWS Michigan

AWS Michigan October 9, 2012 LEGAL INFORMATICS WITH CLOUDSEARCH

description

Bommarito Consulting will be covering legal informatics on AWS, giving the first AWS Michigan talk on CloudSearch. The presentation will provide structure, detail, and comparison between solutions, including a solr vs. CloudSearch implementation comparison and code to take you from scratch to searching. An outline of the presentation is provided below:

1. Generalize information retrieval more than you’re used to.
2. Build a basic case search engine with solr/EC2.
3. Build a basic case search engine with CloudSearch.
4. Understand the relative strengths of solr and CloudSearch.
5. Discuss other legal services that CloudSearch can augment.

Transcript of Legal Informatics with AWS CloudSearch - AWS Michigan

Page 1: Legal Informatics with AWS CloudSearch - AWS Michigan

AWS Michigan, October 9, 2012

LEGAL INFORMATICS WITH CLOUDSEARCH

Page 2: Legal Informatics with AWS CloudSearch - AWS Michigan

GOALS

1. Generalize information retrieval more than you’re used to.

2. Build a basic case search engine with solr/EC2.

3. Build a basic case search engine with CloudSearch.

4. Understand the relative strengths of solr and CloudSearch.

5. Discuss other legal services that CloudSearch can augment.

© Bommarito Consulting

Page 3: Legal Informatics with AWS CloudSearch - AWS Michigan

SEARCH: INFORMATION RETRIEVAL

(loose – history has built silos where none might exist)

Wiki: “[the science] of obtaining information resources relevant to an information need from a collection of information resources.”

Examples:
• Search: Google, Yahoo, Bing
• Citation: MLA, Harvard Blue Book
• Classification: Dewey Decimal, LOC

We’ll start by building a basic case search engine, like Lexis or West.


Page 4: Legal Informatics with AWS CloudSearch - AWS Michigan

SEARCH: INFORMATION RETRIEVAL

[Diagram: Resources, Store, Engine; example queries: pages about Python, images of cats, locations near Ashley’s]

Examples:
• Store=ext2, Engine=ext2
• Store=btrfs, Engine=grep
• Store=tar, Engine=inverted index
• Store=Oracle, Engine=PL/SQL
• Store=Postgres, Engine=Sphinx
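To make the store/engine split concrete, here is a minimal sketch of the "inverted index" engine over an in-memory store. The documents, the tokenizer, and the AND-only query semantics are illustrative assumptions, not something taken from the slides.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(store):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in store.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical toy store: document id -> text.
store = {
    "doc1": "An introduction to Python programming",
    "doc2": "Pictures of cats and more cats",
    "doc3": "Python snakes and cats at the zoo",
}
index = build_index(store)
print(search(index, "python cats"))
```

The store here is just a dictionary; swapping it for a tar archive or a database would leave the engine unchanged, which is the point of separating the two.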


Page 5: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

[Diagram: Resources, Lucene, Solr; example queries: pages about Python, images of cats, locations near Ashley’s]

• All APL.
• Lucene
  • The text search engine library of choice for Java developers.
• Solr
  • Wraps Lucene in a blanket of RESTful goodness.
  • Many other search and architectural additions as well.
  • (You should know about Tika.)
• (Yes, there is also ElasticSearch.)


Page 6: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

Using Solr
1. Deploy and configure Solr infrastructure.
2. Configure your schema, indexing, etc.
3. Seed index with initial data.
4. If index updates, figure out how to do it without breaking things.
5. Connect to search client with REST.
6. Oops, we don’t have enough capacity for search, update volume, etc.


Page 7: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

Launch and connect to an m1.small.
mjbommar@cluster0:~$ ec2run ami-c371cdaa -t m1.small --region us-east-1 --key ec2-keypair -g tomcat
mjbommar@cluster0:~$ ec2din $INSTANCE_ID | grep '^INSTANCE' | awk '{print $4}'
mjbommar@cluster0:~$ ssh -i ~/.ssh/ec2 ubuntu@$INSTANCE_HOST

Configure a solr deployment under /opt/solr.
ubuntu@domU$ apt-get update --fix-missing && apt-get install default-jdk tomcat6
ubuntu@domU$ cd /opt
ubuntu@domU$ wget http://www.apache.org/dist/lucene/solr/4.0.0-BETA/apache-solr-4.0.0-BETA.tgz
ubuntu@domU$ tar xzf apache-solr-4.0.0-BETA.tgz
ubuntu@domU$ echo "SOLR_HOME=/opt/solr" >> /etc/environment
ubuntu@domU$ wget -O /etc/tomcat6/Catalina/localhost/solr.xml http://bommarito-consulting.s3.amazonaws.com/legal-informatics-presentation/solr.xml
ubuntu@domU$ mkdir solr
ubuntu@domU$ cp apache-solr-4.0.0-BETA/dist/apache-solr-4.0.0-BETA.war solr/solr.war
ubuntu@domU$ cd solr

OK, let’s try it.

1. Deploy and configure Solr infrastructure.


Page 8: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

Configure the initial collection schema and options.
ubuntu@domU$ mkdir collection1 && cd collection1
ubuntu@domU$ wget http://bommarito-consulting.s3.amazonaws.com/legal-informatics-presentation/solr-conf-scotus.tar.gz
ubuntu@domU$ tar xzf solr-conf-scotus.tar.gz
ubuntu@domU$ less conf/solrconfig.xml
ubuntu@domU$ less conf/schema.xml

Start tomcat and make sure we’re clean.
ubuntu@domU$ chown -R tomcat6:tomcat6 /opt/solr
ubuntu@domU$ service tomcat6 restart
ubuntu@domU$ tail -f /var/log/tomcat6/catalina.out

2. Configure your schema, indexing, etc.


Page 9: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

Schema Example
<schema name="scotus" version="1.5">
  <fields>
    <field name="title" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="content" type="text_en" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
  </fields>
  <uniqueKey>title</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        …

2. Configure your schema, indexing, etc.


Page 10: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

Solr Config Handler Example
<requestHandler name="/query" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
  </lst>
</requestHandler>

2. Configure your schema, indexing, etc.


Page 11: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

Download sample document.
ubuntu@domU$ wget http://bommarito-consulting.s3.amazonaws.com/legal-informatics-presentation/sample.xml

Sample File
<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="title">Marbury v. Madison</field>
    <field name="content">The clerks of the Department of State of the United States may be called upon to give evidence of transactions in the Department which are not of a confidential character.</field>
  </doc>
</add>

POST the sample document to the Solr update handler.
ubuntu@domU$ curl --header 'Content-Type: application/xml' --data-binary @sample.xml 'http://localhost:8080/solr/update?commit=true'

3. Seed index with initial data.
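For more than one document, the same XML update message can be generated rather than written by hand. The sketch below builds an `<add>` message with Python's standard library. The helper names `build_add_xml` and `post_update` are hypothetical, and the POST helper assumes the local Solr URL used on the slide; it is defined but not called here.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> update message from a list of {field: value} dicts."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", {"name": name})
            field.text = value  # ElementTree escapes &, <, > when serializing
    return ET.tostring(add, encoding="unicode")


def post_update(payload, url="http://localhost:8080/solr/update?commit=true"):
    """POST the message to the update handler (requires a running Solr)."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


cases = [
    {
        "title": "Marbury v. Madison",
        "content": "The clerks of the Department of State of the United States "
                   "may be called upon to give evidence of transactions in the "
                   "Department which are not of a confidential character.",
    },
]
payload = build_add_xml(cases)
print(payload)
```

Building the message through an XML library rather than string formatting matters for case law, where titles and text routinely contain ampersands and angle brackets.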


Page 12: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

ubuntu@domU$ curl 'http://localhost:8080/solr/query?q=content:confidential'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"content:confidential"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "title":"Marbury v. Madison",
        "content":"The clerks of the Department of State of the United States may be called upon to give evidence of transactions in the Department which are not of a confidential character."}]
  }}

5. Connect to search client with REST.

Some helpful references:
• http://wiki.apache.org/solr/CommonQueryParameters
• http://wiki.apache.org/solr/SearchHandler
• http://lucene.apache.org/core/4_0_0-BETA/index.html
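A search client only needs to build the query URL and read the JSON that comes back. The following sketch does both against an abbreviated copy of the response shown on the slide, so it runs without a Solr server. The function names are hypothetical, and the base URL assumes the local Tomcat deployment from the earlier steps.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, field, term):
    """Build a URL for the /query handler, URL-encoding the q parameter."""
    return base_url + "/query?" + urlencode({"q": f"{field}:{term}"})


def extract_titles(response_text):
    """Return (numFound, [titles]) from a Solr JSON response body."""
    body = json.loads(response_text)["response"]
    return body["numFound"], [doc["title"] for doc in body["docs"]]


# Abbreviated copy of the response shown on the slide.
sample = '''{"responseHeader": {"status": 0, "QTime": 0,
  "params": {"q": "content:confidential"}},
 "response": {"numFound": 1, "start": 0, "docs": [
   {"title": "Marbury v. Madison", "content": "The clerks of ..."}]}}'''

url = build_query_url("http://localhost:8080/solr", "content", "confidential")
found, titles = extract_titles(sample)
print(url)
print(found, titles)
```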

Page 13: Legal Informatics with AWS CloudSearch - AWS Michigan

SOLR

Let’s summarize:
1. Deploy and configure Solr infrastructure.
2. Configure your schema, indexing, etc.
3. Seed index with initial data.
4. If index updates, figure out how to do it without breaking things.
5. Connect to search client over RESTful interface.
6. Oops, we don’t have enough capacity for search, update volume, etc. Scale!

Overall, solr is great. ElasticSearch makes some of these items easier as well. Can you imagine implementing this from scratch?

However, items 1, 4, and 6 seem like low-hanging fruit for PaaS.


Page 14: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

[Diagram: Resources, CloudSearch; example queries: pages about Python, images of cats, locations near Ashley’s]

• Yet another AWS managed service.
  • Compare to RDS, but for search.
• Solr vs CloudSearch
  • Collection : Domain
  • Both RESTful
  • CloudSearch schema/text configuration much less flexible


Page 15: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH


Using CloudSearch
1. Create a domain.
2. Configure schema, indexing, access, etc.
3. Seed index with data.
4. Connect to search client with REST.
5. Relax (well, just make sure your billing information is correct).


Page 16: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH


Page 17: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

$ cd /opt
$ sudo wget http://s3.amazonaws.com/amazon-cloudsearch-data/cloud-search-tools-1.0.0.1-2012.03.05.tar.gz
$ sudo tar xzf cloud-search-tools-1.0.0.1-2012.03.05.tar.gz
$ export CS_HOME=/opt/cloud-search-tools-1.0.0.1-2012.03.05
$ export PATH=$PATH:$CS_HOME/bin
$ export AWS_CREDENTIAL_FILE=/home/user/.ec2/credentials

OK, let’s try it. Make sure you have your JRE and AWS credentials configured.

Install CloudSearch command line tools.


Page 18: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

Request domain creation.
$ cs-create-domain -d rehnquist-express

Monitor the domain until available. This can take more than 5-10 minutes.
$ cs-describe-domain -d rehnquist-express
$ grab-a-drink-or-two

1. Create a domain.


Page 19: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

Configure access policies. This can also take a while.
$ cs-configure-access-policies -d rehnquist-express --update --allow IP_ADDRESS --service doc
$ cs-configure-access-policies -d rehnquist-express --update --allow all --service search
$ cs-configure-access-policies -d rehnquist-express --retrieve

Configure the schema.
$ cs-configure-fields -d rehnquist-express --name title --type text --option result
$ cs-configure-fields -d rehnquist-express --name content --type text --option result

Show current stopwords.
$ cs-configure-text-options -d rehnquist-express -psw
===== Stop Words =====
State: Active
======================
a
an

Not shown: custom synonyms, stemming rules, stopwords, or ranking.

2. Configure schema, indexing, access, etc.


Page 20: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

$ cd /tmp
$ wget http://bulk.resource.org/courts.gov/c/US.tar.bz2 && tar xjf US.tar.bz2
$ for d in `find /tmp/US/ -type d`; do cs-generate-sdf --source "$d/*.html" -d rehnquist-express; done
$ cs-index-documents -d rehnquist-express

3. Seed index with data.
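If your documents are not in a format the command line tool understands, you can build the upload batch yourself. The sketch below produces a JSON "add" batch in what I understand to be the SDF layout for the 2011-02-01 API (`type`, `id`, `version`, `lang`, `fields`); treat that layout, the ID rule, and the epoch-seconds version as assumptions to check against the CloudSearch documentation. The ID helper mimics the look of the IDs in the sample search output (lowercased paths with underscores), and the file path is hypothetical.

```python
import json
import re
import time


def make_doc_id(path):
    """Derive a document id: lowercase, non-alphanumerics replaced by '_'."""
    doc_id = re.sub(r"[^a-z0-9]", "_", path.lower())
    # Assumption: ids should start with a letter or digit, so drop leading '_'.
    return doc_id.lstrip("_")


def make_sdf_batch(docs, version=None):
    """Build a JSON 'add' batch from (path, title, content) tuples."""
    if version is None:
        # Assumption: epoch seconds serve as an increasing document version.
        version = int(time.time())
    batch = [
        {
            "type": "add",
            "id": make_doc_id(path),
            "version": version,
            "lang": "en",
            "fields": {"title": title, "content": content},
        }
        for path, title, content in docs
    ]
    return json.dumps(batch)


# Hypothetical input; the field names match the schema configured in step 2.
docs = [("/tmp/US/5/5.US.137.html", "Marbury v. Madison", "The clerks of ...")]
print(make_sdf_batch(docs, version=1))
```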


Page 21: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

$ curl "http://search-rehnquist-express-5nvzkxgvupbufypbvcmg57lw7m.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q=confidential&return-fields=title"

{"rank":"-text_relevance","match-expr":"(label 'confidential')","hits":{"found":449,"start":0,"hit":[
{"id":"d__data_workspace_lle_data_us_454_454_us_170_80_1103_80_885_html","data":{}},
{"id":"d__data_workspace_lle_data_us_508_508_us_165_91_2054_html","data":{}},
{"id":"d__data_workspace_lle_data_us_340_340_us_332_21_html","data":{}},
{"id":"d__data_workspace_lle_data_us_484_484_us_19_86_422_html","data":{}},
{"id":"d__data_workspace_lle_data_us_510_510_us_1103___2_html","data":{}},
{"id":"d__data_workspace_lle_data_us_291_291_us___338_html","data":{}},
{"id":"d__data_workspace_lle_data_us_537_537_us_941_01_1521_html","data":{}},
{"id":"d__data_workspace_lle_data_us_537_537_us_941_01_1708_html","data":{}},
{"id":"d__data_workspace_lle_data_us_357_357_us_144_621_html","data":{}},
{"id":"d__data_workspace_lle_data_us_351_351_us_345_503_html","data":{}}]},
"info":{"rid":"b7c167f6c2da6dd31b0fda497afcf1775b775c683dee5a356e2b9115965a3eb688f6fc18e0a36950","time-ms":3,"cpu-time-ms":0}}

4. Connect to search client with REST.
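The same request can be issued from application code. This sketch builds the search URL in the shape used on the slide and pulls the hit count and document IDs out of an abbreviated copy of the response, so it runs offline. The endpoint name `search-example...` and the function names are hypothetical placeholders; the API version string is the one that appears in the slide's URL.

```python
import json
from urllib.parse import urlencode

API_VERSION = "2011-02-01"  # version string taken from the URL on the slide


def build_search_url(endpoint, query, return_fields):
    """Build a search URL for a domain's search endpoint."""
    params = urlencode({"q": query, "return-fields": return_fields})
    return f"http://{endpoint}/{API_VERSION}/search?{params}"


def summarize_hits(response_text):
    """Return (total matches, [ids on this page]) from a search response."""
    hits = json.loads(response_text)["hits"]
    return hits["found"], [hit["id"] for hit in hits["hit"]]


# Abbreviated copy of the response shown on the slide (first two hits only).
sample = '''{"rank": "-text_relevance", "match-expr": "(label 'confidential')",
 "hits": {"found": 449, "start": 0, "hit": [
  {"id": "d__data_workspace_lle_data_us_454_454_us_170_80_1103_80_885_html", "data": {}},
  {"id": "d__data_workspace_lle_data_us_508_508_us_165_91_2054_html", "data": {}}]},
 "info": {"time-ms": 3, "cpu-time-ms": 0}}'''

# Hypothetical endpoint; substitute the one reported for your own domain.
url = build_search_url(
    "search-example.us-east-1.cloudsearch.amazonaws.com", "confidential", "title"
)
found, ids = summarize_hits(sample)
print(url)
print(found, len(ids))
```

Note that `found` is the total number of matches, while `hit` holds only the current page of results, so the two counts differ.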


Page 22: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

What about pricing?

“Sticky” opex

Type          Estimated Capacity*   $/hr.   $/mo.
Small         1M docs               0.12    $86
Large         4M docs               0.48    $346
Extra Large   8M docs               0.68    $489

* 1k documents, no result storage.

Variable opex
• $0.10 per 1,000 uploads
• $0.98/GB per index
• $0.12/GB network out
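The monthly column can be approximated from the hourly rate. The sketch below assumes a 30-day month (720 hours), which lands within a dollar of each monthly figure on the slide; the slide's own rounding is not stated. The `variable_cost` helper reads the three variable charges as per 1,000 uploads, per GB indexed, and per GB transferred out, which is my interpretation of the bullets rather than a confirmed billing formula.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month

# Hourly rates from the slide, in USD.
HOURLY_RATES = {"Small": 0.12, "Large": 0.48, "Extra Large": 0.68}


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Fixed ("sticky") cost of keeping one search instance up for a month."""
    return hourly_rate * hours


def variable_cost(uploads, index_gb, gb_out):
    """Variable charges: uploads, indexing, and outbound traffic (assumed units)."""
    return 0.10 * uploads / 1000 + 0.98 * index_gb + 0.12 * gb_out


for name, rate in HOURLY_RATES.items():
    print(f"{name}: ${monthly_cost(rate):.2f}/month")
```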


Page 23: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

What about pricing?

Building this sample:

Some cycles on office servers not counted.


Page 24: Legal Informatics with AWS CloudSearch - AWS Michigan

CLOUDSEARCH

Next steps:
• Try CloudSearch with Boto:
  • http://boto.cloudhackers.com/en/latest/cloudsearch_tut.html
• Write your own custom content transformer with Tika:
  • http://tika.apache.org/1.2/formats.html

• Understand how types, search, faceting, and results interact and constrain in CloudSearch.

• Understand how response times are affected by document count, document size, types, search, faceting, and results.


Page 25: Legal Informatics with AWS CloudSearch - AWS Michigan

SEARCH SUMMARY

Solr
• Can share infrastructure and application containers.
• Highly flexible configuration via XML.
• Can customize or extend by writing Java.

CloudSearch
• Managed service.
• Highly scalable per unit of labor.
• Like all AWS services, stop paying immediately when you’re done.

Best solution probably depends on relative scarcity of SA/developer labor and document volume.


Page 26: Legal Informatics with AWS CloudSearch - AWS Michigan

THINK MORE VARIABLE

Case search is a pretty simple, static example. You build a Lexis/West clone, monetize with some ads or a subscription, and figure out how to handle monthly or yearly updates as cheaply as possible.

Think more variable – what might CloudSearch help with in legal services? Prime tasks should have durations or scales that are hard to meet with fixed assets.


Page 27: Legal Informatics with AWS CloudSearch - AWS Michigan

THINK MORE VARIABLE

Here’s an idea, shamelessly copied from some of my marketing material:

  Imagine you’re a smaller law firm that specializes in HR disputes.  As part of a time-sensitive non-solicitation claim filed by your client, you’ve subpoenaed email from fifteen employees at a client’s competitor.   It’s Friday afternoon at 5PM, and you finally receive a hard drive with the emails.  However, in an effort to overwhelm your small team, the other party has dumped 10GB of data on your plate.  There’s no way you can search through this by hand.  You have a hearing on Wednesday, but need to prepare a memo for your client by Monday morning.  Do you disappoint your client and motion to reschedule?  How could you possibly make the deadline?  If only you could just press a button and get something like Google for your data…

Discovery is a perfect task for CloudSearch.
• Very tight deadlines.
• Short project lifetimes.
• Wide variety of data volumes.

Page 28: Legal Informatics with AWS CloudSearch - AWS Michigan

THINK MORE VARIABLE

For public corporations, you might have enough compliance or discovery work to keep an application running 24/7.

Enter KnowCave – a SolidLogic project.
• Deep search
• Natural language content alert
• Responsive, multi-device interface

http://knowcave.com


Page 29: Legal Informatics with AWS CloudSearch - AWS Michigan

REFERENCES

Bommarito Consulting Blog
• Installing AWS Cloud Search Command Line Tools
• Building an AWS CloudSearch domain for the Supreme Court
• eDiscovery Consulting in the Cloud: Searching an Outlook mailbox and attachments
• Generating AWS CloudSearch SDF for Emails

Solid Logic
• KnowCave

Electronic Discovery Reference Model
• EDRM Stages Explained

Cornell Legal Information Institute
• Federal Rules of Civil Procedure

Amazon Web Services
• CloudSearch documentation
• CloudSearch command line tools (Windows, Linux/Mac)

International Association for Artificial Intelligence and Law

Boto

Apache Tika

Apache Lucene

Apache Solr

ElasticSearch


Page 30: Legal Informatics with AWS CloudSearch - AWS Michigan

THANKS!

Michael J Bommarito II
CEO, Bommarito Consulting, LLC
Email: [email protected]
Web: http://bommaritollc.com/

You can get these slides on my blog: http://bommaritollc.com/blog/. Here’s the post.
