Post on 03-Oct-2020

Installing Warcbase on OS X and Linux

Ian Milligan Assistant Professor

@ianmilligan1

This Guide

• Installing Warcbase on OS X

• Installing Warcbase on Ubuntu (Linux)

• And a link to the rudimentary Windows instructions

This is a supplement to the install docs at http://lintool.github.io/warcbase-docs/Getting-Started/.

Installing Warcbase

• Warcbase is a tricky installation!

• http://lintool.github.io/warcbase-docs/Getting-Started/

• These slides walk through the installation on the major platforms.

Installing Warcbase

• On OS X, Warcbase requires a few dependencies

• Install homebrew - https://brew.sh/

• brew install git

• brew install maven

• brew cask install java
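Before moving on, it can help to confirm that all three tools actually ended up on your PATH. The following is only a minimal sketch and not part of the original slides: `missing_tools` is a hypothetical helper name, and it relies on nothing beyond the standard `command -v` shell builtin.

```shell
# Hypothetical helper (not part of Warcbase): print each named tool
# that cannot be found on the current PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Check the three dependencies installed on this slide.
missing="$(missing_tools git mvn java)"
if [ -n "$missing" ]; then
  echo "Still missing from PATH:" $missing
else
  echo "git, mvn and java were all found"
fi
```

If anything is reported as missing, re-running the matching brew command is the first thing to try.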

Installing Warcbase

• For some reason, on OS X, JAVA_HOME is the bane of my existence. I don’t know why.

• export JAVA_HOME=/usr/lib/jvm/java-8-oracle
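The path above follows the layout used by JDK packages on Linux, so it may not exist on your Mac. OS X ships a small utility, `/usr/libexec/java_home`, that prints the home directory of the active JDK. The sketch below is an assumption about how you might combine the two, not something from the original slides; `detect_java_home` is a hypothetical helper name.

```shell
# Hypothetical helper: print a plausible JAVA_HOME, or return 1 if none is found.
detect_java_home() {
  # On OS X, /usr/libexec/java_home prints the active JDK's home directory.
  if [ -x /usr/libexec/java_home ]; then
    /usr/libexec/java_home 2>/dev/null && return 0
  fi
  # Otherwise fall back to the first existing directory among the arguments.
  for candidate in "$@"; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Try the OS X utility first, then the path from the slide.
if found="$(detect_java_home /usr/lib/jvm/java-8-oracle)"; then
  export JAVA_HOME="$found"
fi
echo "JAVA_HOME=${JAVA_HOME:-not found}"
```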

Installing Warcbase

• With Git and Maven installed and JAVA_HOME set, you can now install Warcbase

• Clone the repo with:

• git clone http://github.com/lintool/warcbase.git

Installing Warcbase

• Now go to the warcbase directory (cd warcbase)

• and build just warcbase-core

• mvn clean package -pl warcbase-core -DskipTests

Installing Warcbase

• In theory, you should now see this at right. Hurray!

• If not, please feel free to post your build error as an issue on the GitHub repo (warcbase.org).
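Another way to confirm the build worked is to look for the jar that the later spark-shell step loads from warcbase-core/target. This is a minimal sketch under two assumptions: the repo was cloned into your home directory, and `find_fatjar` is a hypothetical helper name rather than anything shipped with Warcbase.

```shell
# Hypothetical helper: print the path of the first *-fatjar.jar under
# <warcbase-dir>/warcbase-core/target, or return 1 if none exists.
find_fatjar() {
  for jar in "$1"/warcbase-core/target/*-fatjar.jar; do
    if [ -f "$jar" ]; then
      printf '%s\n' "$jar"
      return 0
    fi
  done
  return 1
}

# Assumes the clone lives in ~/warcbase; change this to match your system.
if jar="$(find_fatjar "$HOME/warcbase")"; then
  echo "Build output found: $jar"
else
  echo "No fatjar yet; check the Maven output for errors"
fi
```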

Installing Warcbase

• Now we need to install “Spark Shell” so we can interface with Warcbase.

• As of April 20th, install this version in a different directory.

• wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz

Installing Warcbase

• Now we need to untar it

• tar -xvf spark-1.6.1-bin-hadoop2.6.tgz

Warning: The next screen has a path – make sure to change it to match your own system!

Installing Warcbase

• We are almost ready!

• cd to the Spark directory you just untarred

• And then run the following command (make sure to point --jars at the warcbase jar)

• ./bin/spark-shell --jars ~/penn/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar

Installing Warcbase

• You should now be at a prompt.

• Type :paste and press enter

• Now paste the script at right, MAKING SURE TO CHANGE THE PATH TO POINT TO WARCBASE-CORE

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/Users/ianmilligan1/penn/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Installing Warcbase

• Success if you see the results at right!

Now for Linux (tested on Ubuntu 14)

Installing Warcbase

• Similar to OS X, but some differences.

• Install git

• sudo apt-get install git

Installing Warcbase

• Download Apache Maven manually

• wget http://apache.mirror.gtcomm.net/maven/maven-3/3.5.0/binaries/apache-maven-3.5.0-bin.tar.gz

• Untar

• tar -xvf apache-maven-3.5.0-bin.tar.gz

Installing Warcbase

• Now we need to set Maven variables so we can use it from anywhere

• export M2_HOME=/home/ubuntu/apache-maven-3.5.0

• export M2=$M2_HOME/bin

• export PATH=$M2:$PATH

• You may need to change your paths accordingly (in M2_HOME)
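The three exports can also be written so that they are safe to run more than once and so that they report whether `mvn` is actually reachable afterwards. This is a sketch rather than the slides' own method; the default install path is simply the one shown above and is an assumption about your system.

```shell
# Assumed install location, taken from the slide; adjust to your system.
M2_HOME="${M2_HOME:-/home/ubuntu/apache-maven-3.5.0}"
export M2_HOME
export M2="$M2_HOME/bin"

# Prepend $M2 to PATH only if it is not already there.
case ":$PATH:" in
  *":$M2:"*) ;;
  *) PATH="$M2:$PATH" ;;
esac
export PATH

# Print where mvn was found, or a hint if it was not.
command -v mvn || echo "mvn not found; check M2_HOME ($M2_HOME)"
```

Exports only last for the current terminal session. In bash, adding the same lines to ~/.bashrc makes them apply to new sessions as well.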

Installing Warcbase

• Finally, we need to install Java JDK. The following should work on Ubuntu 14 (next slide for 16):

• sudo apt-get install openjdk-7-jdk

• export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

• Again you may need to change paths (above tested April 2017 on AWS Ubuntu 14 VM).

Installing Warcbase

• Finally, we need to install Java JDK. The following should work on Ubuntu 16 (last slide for 14):

• sudo add-apt-repository ppa:openjdk-r/ppa

• sudo apt-get update

• sudo apt-get install openjdk-7-jdk

• export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

• Again you may need to change paths (above tested April 2017 on AWS Ubuntu 14 VM).

Installing Warcbase

• Phew!

• Now we’re ready to install Warcbase. Go back to your home directory.

• git clone http://github.com/lintool/warcbase.git

Installing Warcbase

• Now let’s build. From the warcbase directory:

• mvn clean package -pl warcbase-core -DskipTests

Installing Warcbase

• Hey we’re looking good now!

• Now let’s download Spark shell.

• wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz

• And then untar with

• tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
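The untar step creates a directory named after the top-level folder inside the archive. If you are unsure which directory to cd into, you can list the archive's contents rather than guess. The following is a minimal sketch using tar's standard `-t` (list) mode; `top_level_dir` is a hypothetical helper name.

```shell
# Hypothetical helper: print the top-level directory name inside a .tgz archive.
top_level_dir() {
  # List the archive, keep only the first entry, and strip everything
  # from the first slash onward.
  tar -tzf "$1" | sed -n '1s|/.*||p'
}

# Only runs if the archive from the slide is in the current directory.
archive="spark-1.6.1-bin-hadoop2.6.tgz"
if [ -f "$archive" ]; then
  echo "After untarring, cd into: $(top_level_dir "$archive")"
fi
```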

Installing Warcbase

• Go to the Spark directory (cd spark-1.6.1-bin-hadoop2.6)

• And then run, MAKING SURE TO CHANGE PATH TO POINT AT WARCBASE JAR

• ./bin/spark-shell --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar

Installing Warcbase

• You should now be at a prompt.

• Type :paste and press enter

• Now paste the script at right, MAKING SURE TO CHANGE THE PATH TO POINT TO WARCBASE-CORE

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/home/ubuntu/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

For Windows, I unfortunately don’t have access to a box.

Instructions are at http://lintool.github.io/warcbase-docs/Getting-Started/

Thanks!

Ian Milligan Assistant Professor

@ianmilligan1