HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

26
Applications of HPCC Systems at Clemson University Amy Apon, PhD Linh Ngo, PhD Michael Payne Big Data Systems Laboratory Clemson University

description

Presenters: Amy Apon, Linh Ngo and Michael Payne, Clemson University Big Data research at Clemson University includes the development of computational science applications, evaluation of software and hardware infrastructure systems to support Big Data, and design and development of data analytics platforms. HPCC Systems has been designed to support production quality data storage and analytics in an enterprise environment. In this talk we describe how we are using the HPCC Systems platform in an academic research environment to support research in Big Data. Like many major research universities, Clemson University hosts a shared high performance supercomputing cluster. The Palmetto cluster at Clemson ranks in the top five among university-owned supercomputers in the U.S. and is a significant research resource for faculty, students, and staff. As a shared resource, computing nodes are allocated to users via a batch scheduler system, and users do not have administrative privileges. In this talk we describe how we use HPCC Systems on the shared campus resource using user privileges only. This work opens doors for use of HPCC Systems to a broad community of academic users who have access to shared computing resources. HPCC Systems is a key tool in several analytics projects at Clemson University. In this talk we describe how we use the HPCC Systems platform in our study of the research productivity of academic institutions. This research requires the use of detailed data over decades of time from various sources and formats without any consistency or a predefined data structure. LexisNexis and Reed Elsevier are key partners in the acquisition of data for this research. The strengths of HPCC Systems include its ability to support easy ingestion of various data formats, dynamic transformation and integration of ingested data, and a seamless interface between the data processing interface and the analytic tools written in R. NOTE: The video of this presentation is the 2nd one shown in the accompanying YouTube link.

Transcript of HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Page 1: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Applications of HPCC Systems at Clemson University

Amy Apon, PhD ● Linh Ngo, PhD ● Michael Payne Big Data Systems Laboratory

Clemson University

Page 2: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Applications of HPCC Systems at Clemson University Clemson Strengths and Opportunities

PhD-level faculty & research staff Talented students

Significant industry collaborators

Palmetto – Top 5 in US Academic Supercomputers

~2000 nodes, 20K cores, 600 GPUs 100Gb Internet connectivity

Facilities People

Page 3: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Applications of HPCC Systems at Clemson University Big Data Systems Lab Overview

Perform World Class Research on the Systems and Enabling Information

Technology for Advanced Data Analytics

Big Data Systems Lab Research Areas

Systems and Architectures

Tools and Operations

Data Analytics and Applications

Big Data Systems Lab Vision

Page 4: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Applications of HPCC Systems at Clemson University Effect of High Performance Computing on Academic Research Productivity

Motivation: There is a lot of pressure on federal funding We propose efficiency as a measure from which to gain insights on return on investment We show that locally-available HPC has a positive effect on the ability of a university to do research

Page 5: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Motivation: Government and business need information about public sentiment. Research: We develop and apply methods to analyze large amounts of textual data to enable inquiry of social and business problems.

Applications of HPCC Systems at Clemson University Text mining of news reports and social media for business intelligence

Page 6: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Shared Execution Environment

Temporary Local Storage

User Privileges Only

Applications of HPCC Systems at Clemson University Shared Computing Resources among Researchers

Page 7: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Applications of HPCC Systems at Clemson University

Linh Ngo, PhD HPCC Systems in a Shared Research Computing Environment

Page 8: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Shared Execution

Environment

Temporary Local Storage

User Privileges Only

Applications of HPCC Systems at Clemson University Shared Computing Resources among Researchers

How to provision and configure an HPCC cluster dynamically for research purposes?

• Step 1: Configure, install, and deploy HPCC as a non-root user

• Step 2: Dynamically provision HPCC cluster in a shared research environment

Page 9: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

binutils ICU

XALAN APR

… /

/usr

/lib64

/lib

/opt

/

/home

$USER …

/parallel_scratch

/local_scratch

Applications of HPCC Systems at Clemson University Installation and Configuration of Dependencies

Page 10: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Administrative privileges Non-administrative privileges

/ etc

init.d HPCCSystems

opt HPCCSystems

var lib

HPCCSystems

log HPCCSystems

configmgr mydafilesrv mydafilesrv

/ home

$USER

hpcc local_scratch

hpcc

parallel_scratch

lib HPCCSystems

log

$USER

lock pid

$USER

Applications of HPCC Systems at Clemson University Resolving Non-default Installation Path Conflicts

Page 11: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Remove/relax root-level settings: i.e.: is_root

Reduce default configuration settings for resource

requirements: depended on resource allocation requests

Applications of HPCC Systems at Clemson University Non-root Deployment

Page 12: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

mydafilesrv mydafile myeclc mythor myroxie

PBS_NODEFILE environment.xml

1

2 3

4

5

user.palmetto.clemson.edu

Applications of HPCC Systems at Clemson University Dynamic Provisioning

Deploy to /local_scratch or /parallel_scratch?

Page 13: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Applications of HPCC Systems at Clemson University

Michael Payne

Using HPCC Systems to Manage Academic Data LexisNexis Summer 2014 Internship

Page 14: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

• Research in Scholarly Data requires academic data from many different sources, which store data under various formats

• Aggregating these sources into a useful and cohesive structure requires a data-intensive approach to preprocessing, integration and analysis

• HPCC Systems is a platform to streamline this process

Applications of HPCC Systems at Clemson University Using HPCC Systems to Manage Academic Data

Page 15: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Higher Education

Institutions

Research

High Performance Computing Capability

Funding Support

Applications of HPCC Systems at Clemson University Categories of Scholarly Data

Page 16: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Higher Education

Institutions

Research

High Performance Computing Capability

Funding Support

Detailed Award (XML) Federal Funding (tab-delimited) Expenditures (tab-delimited) Institution Patent (Excel) Degrees Conferred (Excel)

NIH Award Data (CSV/Excel) Top500 Supercomputer affiliated with academic institutions (XML)

Institutions from States with EPSCoR status (tab-delimited)

Institutional Information with Carnegie’s Research Classifications (multi-sheet Excel)

Enrollment (multi-sheet Excel) Financial (multi-sheet Excel) Faculty (multi-sheet Excel)

Detailed list of articles by discipline with abstracts, references, and disambiguated authors. (XML)

Applications of HPCC Systems at Clemson University Scholarly Data Description

Page 17: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Institution name/email

Name similarity Address

PI/A

utho

r nam

e In

stitu

tion

nam

e

Name similarity

Match name with WoS ‘s Organization-Enhanced name

Ackn

owle

dgm

ent a

ttrib

utes

(200

8 on

war

d, a

utom

atic

)

Applications of HPCC Systems at Clemson University Examples of Scholarly Data Links

Page 18: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

• Porting data analytic processes to ECL

• Applying Machine Learning techniques for article abstract classification

Applications of HPCC Systems at Clemson University Ongoing Work

Page 19: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

LexisNexis Internship Machine Learning

Manager Timothy Humphrey

Mentor Arjuna Chala

Applications of HPCC Systems at Clemson University Summer 2014 Internship - Logistic Regression for Dense Matrices

Page 20: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

• Prediction using continuous and discrete values

• No distributional assumptions on the predictors • May not be

normally distributed or linearly related

• Relationship between the discrete variable and the predictor is non-linear

Applications of HPCC Systems at Clemson University Logistic Regression

Page 21: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Matrices can be partitioned

Schemes must be compatible

There are multiple choices!

X = 4 x 4 4 x 1 4 x 1

2 x 3

X = 2 x 1 1 x 1 2 x 1

3 x 1

Applications of HPCC Systems at Clemson University Parallel Block Basic Linear Algebra Subprograms (PB-BLAS)

Page 22: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

• Logistic Runtimes

• Hard Coded Mapping

• Full Higgs Dataset 11,000,000 x 28

0

5

10

15

20

25

30

35

Higgs 1,000 Higgs 10,000 Higgs 100,000

Tim

e in

Min

utes

PB-BLAS Non PB-BLAS

Applications of HPCC Systems at Clemson University Machine Learning in ECL

Page 23: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

• Logistic Runtimes

• Auto Mapping

• Full Elsevier Dataset 100,000 x 3,291 0

5

10

15

20

25

Elsevier 100

Tim

e in

Min

utes

PB-BLAS Non PB-BLAS

Applications of HPCC Systems at Clemson University Machine Learning in ECL

Page 24: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

050

100150200250300350400450

Elsevier 1,000

Tim

e in

Min

utes

PB-BLAS Non PB-BLAS

Applications of HPCC Systems at Clemson University Machine Learning in ECL

• Logistic Runtimes

• Auto Mapping

• Full Elsevier Dataset 100,000 x 3,291

Page 25: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

• Logistic Regression code and supporting functions have been documented and merged to ECL-ML GitHub repository

• Auto block vector mapping function for any user that wants to use PB-BLAS

• Ready to use element wise multiplication in PB-BLAS • Updated debugging statements that a clear

understanding of errors • Test functions for both block vector mapping function • Sample code for using logistic regression • Currently working on K-means implementation that

utilizes PB-BLAS

Applications of HPCC Systems at Clemson University Project Summary

Page 26: HPCC Systems Engineering Summit - Applications of HPCC Systems at Clemson University

Linh Ngo, PhD ● Alex Herzog, PhD ● Michael Payne ● Amy Apon, PhD

{lngo, aherzog, mpayne3, aapon}@clemson.edu

Big Data Systems Laboratory Clemson University