Presented by, MySQL & O’Reilly Media, Inc. Financial Informatics: Startup, low-cost, dataload Challenges and Solutions


Page 1: Financial Informatics_ Startup Low-cost Dataload Challeng...

Presented by,

MySQL & O’Reilly Media, Inc.

Financial Informatics:

Startup, low-cost, dataload Challenges and Solutions

Page 2: Financial Informatics_ Startup Low-cost Dataload Challeng...

What are we talking about today?

Financial Data, more specifically stock market data as an example

The basic design of a MySQL database that contains a daily history of stock prices

Building a stock machine and some of the challenges posed

Some large data ‘gotchas’ and solutions

Some large MySQL ‘gotchas’ and solutions

Page 3: Financial Informatics_ Startup Low-cost Dataload Challeng...

Financial data

CSV records of what happened that day, or of a signal

Often have unexplained anomalies

Daily arrival of row data which doesn’t conform to spec

Page 4: Financial Informatics_ Startup Low-cost Dataload Challeng...

1. Big Picture

Who is your audience?

Make your analytics and application work with a small dataset first

Market data rules: You can’t scrape Yahoo

QA is not a bad word: Data quality is key

What’s a security?

What’s a corporate action?

OLAP: This is once-a-day processing

Take performance of your dev boxes seriously: Dell 2950 with 32GB of RAM, 6 disks, RAID10.

Page 5: Financial Informatics_ Startup Low-cost Dataload Challeng...

Where does financial data come from?

Thomson / Reuters

McGraw Hill / Interactive Data

Securities and Exchange Commission

Dow Jones

Standard & Poor’s

Bloomberg

Lots of ‘boutique’ $100M companies

Page 6: Financial Informatics_ Startup Low-cost Dataload Challeng...

Market data rules

Information about a security’s trade on an exchange is owned by the exchange and distributed to those who have made a license agreement (Reuters, Interactive Data, et al.)

Your license agreement with these 3rd parties will start at $20k-$50k a year

Scraping Yahoo, MSN Money, Forbes or another site is infringement

There are different license levels with financial data providers; redistribution usually costs more than a quantitative black box

After three days most data is less valuable / expensive, so you may get a bargain for the dev phase

Working with financial data providers is a slow process; it may take you 8 weeks from your initial point of contact with a rep before securing a license agreement. Work with your business decision team to prepare for this

Even indexes like the S&P 500 and industry data are under license.

Page 7: Financial Informatics_ Startup Low-cost Dataload Challeng...

What Data Do You Need?

Historical Price - Everyone needs this for charts, models, etc

Corporate Actions - Adjustments going forward for historical data

Real-time Price - You may want this for real-time charts (100’s of Megs a Day)

SEC Filings - You may want to decompose for quant models or present reports to users

3rd Party Quant Data - Black box trading solution, quant box

Page 8: Financial Informatics_ Startup Low-cost Dataload Challeng...

Don’t load everything day 1

AAPL, INTL, T, X, XOM, DVW, DELL, GE

S&P 500

Russell 3000

FTSE

APAC

OTC / PINK / BB

Mutual Funds

Money Market

Indexes

Page 9: Financial Informatics_ Startup Low-cost Dataload Challeng...

What’s a security?

Stocks, bonds, mutual funds and more

In this context traded on an exchange

A note held for you by your broker

Represents a debt to be paid by issuer -or-

Represents a share of the issuer -or-

Represents a bet on the issuer -or-

Represents an index of multiple securities -or-

Represents another abstraction of ownership or bet

Page 10: Financial Informatics_ Startup Low-cost Dataload Challeng...

SECURITY

Page 11: Financial Informatics_ Startup Low-cost Dataload Challeng...

What’s a corporate action?

A change to an attribute of a security or a security’s price

Split; reverse split

Dividend

Name change

Listing; delisting

Exchange change

Notes change

Regional change

Currency change

Page 12: Financial Informatics_ Startup Low-cost Dataload Challeng...

CORPORATE_ACTIONS (ABRIDGED) 70 Cols!!!

Page 13: Financial Informatics_ Startup Low-cost Dataload Challeng...

QA is not a bad word

QA of financial data is much different from QA of software

Row data can arrive empty, wrong, or with portions missing

Row data can fail to arrive

Stocks may be priced wrong

Corporate actions may be for the wrong stock

A Canadian stock can be listed in the US with Canadian dollar prices

All kinds of other fun

You must have Excel jockeys to identify and explain noise to:

Engineers

Your data provider

Your customers

Page 14: Financial Informatics_ Startup Low-cost Dataload Challeng...

2. Table Designs

Page 15: Financial Informatics_ Startup Low-cost Dataload Challeng...

2. Table Designs

SECURITY - Attributes of a security

RAW_PRICE - Attributes of a security’s trades from csv, unadjusted

PRICE - Attributes of a security’s trades, adjusted for corporate actions

CORPORATE_ACTIONS - Change records of a security or price attributes

JOBS - Attributes of a job

COUNTRY - A reference table for a security’s country

EXCHANGE - A reference table for a security’s exchange

REGION - A reference table for a security’s region

SOURCE - A reference table of the data provider for a security

Page 16: Financial Informatics_ Startup Low-cost Dataload Challeng...

SECURITY

Page 17: Financial Informatics_ Startup Low-cost Dataload Challeng...

SECURITY

security_id is your abstraction of data industry identifiers

SECURITY_ID, your identifier, int unsigned NOT NULL AUTO_INCREMENT, pk

SECURITY_NAME, the exchange’s name for the company

SOURCE_ID, which data provider supplies this, char(1)

CUSIP, us and canada unique identifier, char(9)

TICKER, an identifier, a gotcha, varchar(14)

SYMBOL, an identifier, a gotcha, varchar(14)

EXCHANGE_ID, what exchange is it traded on

REGION_ID, what region does this trade in int

COUNTRY_ID, what country does this trade in int

Page 18: Financial Informatics_ Startup Low-cost Dataload Challeng...

SECURITY table

Uses internal identifier SECURITY_ID

If you’re experimenting with different providers, SOURCE_ID should be added to pk

Holds the provider’s key for a security (RIC, symbol, ticker, CUSIP)

500k rows max

Page 19: Financial Informatics_ Startup Low-cost Dataload Challeng...

RAW_PRICE

Page 20: Financial Informatics_ Startup Low-cost Dataload Challeng...

RAW_PRICE table (Load Everything)

The rows just as they’ve come from the provider with an artificial key

Price corrections with asof_date in the past may come in, check for these

Sometimes attributes are missing from source files (asof, open, etc.). A NOT NULL constraint loses the whole row, and it might take days to get another one resent

Page 21: Financial Informatics_ Startup Low-cost Dataload Challeng...

PRICE

Page 22: Financial Informatics_ Startup Low-cost Dataload Challeng...

PRICE table

SECURITY_ID, your identifier, int unsigned, pk

ASOF_DATE, the trading day the price applies to, pk

OPEN, the opening price, decimal

LOW, the low price for the day, decimal

CLOSE, the closing price for the day

HIGH, the high price for the day

VOLUME, how many shares sold that day

SPLIT ADJUSTMENT, split multiplier (Reuters, not Comstock), decimal

Page 23: Financial Informatics_ Startup Low-cost Dataload Challeng...

PRICE table

Only one price per security per day

Validation happens from RAW_PRICE to PRICE

Instead of bouncing rows you may consider a suspect data flag which bubbles up to UI
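The suspect-flag idea can be sketched as a small check that runs while rows move from RAW_PRICE to PRICE. This is only an illustration: the function name, the reason codes and the rules themselves are assumptions, not something the slides specify. The row is kept either way and the returned reasons can be stored alongside it for the UI.

```php
<?php
// Hypothetical validator: field names follow the PRICE slide,
// the rules and reason codes are illustrative assumptions.
function price_suspect_reasons(array $row): array
{
    $reasons = [];
    // Every price field must be present, numeric and positive.
    foreach (['open', 'high', 'low', 'close'] as $field) {
        if (!isset($row[$field]) || !is_numeric($row[$field]) || (float) $row[$field] <= 0) {
            $reasons[] = "bad_$field";
        }
    }
    // Range checks only make sense when all four fields are usable.
    if (empty($reasons)) {
        $high  = (float) $row['high'];
        $low   = (float) $row['low'];
        $close = (float) $row['close'];
        if ($high < $low) {
            $reasons[] = 'high_below_low';
        }
        if ($close > $high || $close < $low) {
            $reasons[] = 'close_outside_range';
        }
    }
    return $reasons; // empty array means nothing suspicious was found
}
```

An empty result means the row loads normally; anything else loads the row with the flag set rather than rejecting it.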

Page 24: Financial Informatics_ Startup Low-cost Dataload Challeng...

CORPORATE_ACTIONS

Page 25: Financial Informatics_ Startup Low-cost Dataload Challeng...

CORPORATE_ACTIONS table

Comstock: Splits and Reverses are in this file

Reuters: Splits and Reverses are in price file

Denormalized - Boo!

Much of this information is display information

Changes to exchange or trading status are in here (bankruptcy, emerging from bankruptcy, changing from NASDAQ to OTC.BB, etc)

Dividend information is in here too

Page 26: Financial Informatics_ Startup Low-cost Dataload Challeng...

COUNTRY, REGION, EXCHANGE

Page 27: Financial Informatics_ Startup Low-cost Dataload Challeng...

COUNTRY, REGION, EXCHANGE tables

COUNTRY, keeps track of what country a security trades in

USA

CANADA

REGION, keeps track of what region a security trades in

NORTH AMERICA

APAC

EXCHANGE, keeps track of what exchange a security is traded on

VANCOUVER

NASDAQ

NASDAQ OTC.BB

Page 28: Financial Informatics_ Startup Low-cost Dataload Challeng...

SOURCE

Page 29: Financial Informatics_ Startup Low-cost Dataload Challeng...

SOURCE table

Keeps track of who provides what data in the security table

Good for side-by-side comparisons where data comes from two different providers

Helps build organizational knowledge about which providers have good data quality

Page 30: Financial Informatics_ Startup Low-cost Dataload Challeng...

Data Gotchas

Do: load everything; don’t build constraints based on provider specs prior to understanding the data

Do: use MySQL 5.0.31 or above with InnoDB

Do: wrap batches in BEGIN / COMMIT

Do: set innodb_rollback_on_timeout = ON

Do: stage feeds in raw tables, because if you adjust for splits in the live history table and make mistakes you’ll be loading millions of rows again

Don’t: run things like:

exec("mysql -u user -e 'source /feed/load_statements.sql'");

Don’t: add foreign keys until the process is hardened, or never
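As one way to follow the "wrap batches in a transaction" advice without shelling out to the mysql client, here is a minimal sketch using PHP's PDO extension. The function name load_batch() and the reduced column list are assumptions for illustration; the slides use their own sm_query() helper, whose implementation is not shown. The sketch assumes the connection was opened with PDO::ERRMODE_EXCEPTION so that a failed insert throws.

```php
<?php
// Hypothetical helper: inserts a batch of rows inside one transaction.
// Each row is [cusip, ric, asof_date, close, load_date]; the column
// subset is an assumption chosen to keep the sketch short.
function load_batch(PDO $db, array $rows): int
{
    // Prepared statement: values are bound, not concatenated into SQL.
    $stmt = $db->prepare(
        'INSERT INTO RAW_PRICE (CUSIP, RIC, ASOF_DATE, CLOSE, LOAD_DATE) VALUES (?, ?, ?, ?, ?)'
    );
    $db->beginTransaction();
    try {
        foreach ($rows as $row) {
            $stmt->execute($row);
        }
        $db->commit();          // all rows become visible together
    } catch (Throwable $e) {
        $db->rollBack();        // a failure leaves the table untouched
        throw $e;
    }
    return count($rows);
}
```

If any row fails, the whole batch is rolled back, so a rerun does not have to work out which rows already made it in.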

Page 31: Financial Informatics_ Startup Low-cost Dataload Challeng...

3. Gears

load_raw_prices();

daily_price_clean();

load_security();

load_price();

split(); Special sauce for you to write

undo_split(); Ditto
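The slides leave split() and undo_split() for the reader to write. A rough sketch of the usual back-adjustment is shown here, under stated assumptions: the ratio is new shares per old share (2.0 for a 2-for-1 split), prices before the effective date are divided by it and volumes multiplied by it. The function names, the in-memory row layout and the ratio convention are all hypothetical; a provider's split multiplier may follow a different convention.

```php
<?php
// Hypothetical back-adjustment for a split. $rows is a list of
// ['asof_date' => 'YYYY-MM-DD', 'open', 'high', 'low', 'close', 'volume'].
function adjust_for_split(array $rows, string $effective_date, float $ratio): array
{
    if ($ratio <= 0) {
        throw new InvalidArgumentException('split ratio must be positive');
    }
    foreach ($rows as &$row) {
        // ISO dates compare correctly as plain strings.
        if ($row['asof_date'] < $effective_date) {
            foreach (['open', 'high', 'low', 'close'] as $field) {
                $row[$field] = $row[$field] / $ratio;
            }
            $row['volume'] = (int) round($row['volume'] * $ratio);
        }
    }
    unset($row); // break the reference left by foreach
    return $rows;
}

// Undoing a split is the same adjustment with the inverse ratio.
function undo_adjust_for_split(array $rows, string $effective_date, float $ratio): array
{
    return adjust_for_split($rows, $effective_date, 1 / $ratio);
}
```

Running this against a staged copy rather than the live history table fits the advice to keep raw feeds untouched, since a wrong ratio can then be undone or simply recomputed.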

Page 32: Financial Informatics_ Startup Low-cost Dataload Challeng...

An approach to data loads

Daily load phase 1

Get data from provider in csv or xml

Don’t translate

Import into raw tables

Run variance checks to throw alerts (~50k securities)

Is ( yesterday n rows / today n rows ) between 99.99% and 100.01%?

Daily load phase 2

Load data into live tables

Make adjustments for corporate actions

Run your models

Run variance checks to throw alerts
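The row-count check from phase 1 can be written as a few lines of PHP. This is a sketch only: the function name and the default tolerance parameter are assumptions, with the default chosen to match the 99.99% to 100.01% band on the slide.

```php
<?php
// Hypothetical variance check: is yesterday's row count within
// $tolerance_pct percentage points of today's?
function row_count_within_tolerance(int $yesterday_rows, int $today_rows, float $tolerance_pct = 0.01): bool
{
    if ($today_rows === 0) {
        return false; // an empty feed is always worth an alert
    }
    $ratio_pct = ($yesterday_rows / $today_rows) * 100.0;
    return abs($ratio_pct - 100.0) <= $tolerance_pct;
}
```

With roughly 50k securities, the default band tolerates a difference of only a handful of rows, so a partial file trips the alert immediately.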

Page 33: Financial Informatics_ Startup Low-cost Dataload Challeng...

load_raw_prices()

function load_raw_prices( $price_file ) {
    # read the cleaned csv, dropping line endings and empty lines
    $lines = file( $price_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
    $counter = 0;
    foreach ( $lines as $line_num => $line ) {
        $counter = $counter+1;
        $row = explode( ",", $line );
        $cusip = $row[0];
        $ric = $row[1];
        $asof_date = $row[2];
        $open = $row[3];
        $high = $row[4];
        $low = $row[5];
        $close = $row[6];
        $volume = trim( $row[7] );
        # the split adjustment column may be absent or empty
        $split_adjustment = isset( $row[8] ) ? trim( $row[8] ) : '';
        $today = date('Y-m-d');
        if ( $split_adjustment == '' ) {
            $split_adjustment = '0.00000';
        }

Page 34: Financial Informatics_ Startup Low-cost Dataload Challeng...

load_raw_prices() (cont’d)

        $query = "INSERT INTO RAW_PRICE ( CUSIP, RIC, ASOF_DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, SPLIT_FACTOR, LOAD_DATE ) VALUES ( "
            . "'" . $cusip . "',"
            . "'" . $ric . "',"
            . "'" . $asof_date . "',"
            . $open . ","
            . $high . ","
            . $low . ","
            . $close . ","
            . $volume . ","
            . $split_adjustment . ","
            . "'" . $today . "')" ;
        # put the row in the RAW_PRICE table
        sm_query( $query );
        if ( ($counter%100) == 0 ) {
            echo $counter . " lines processed.\n";
        }
    }
    echo $counter . " total lines processed.\n";
}

Page 35: Financial Informatics_ Startup Low-cost Dataload Challeng...

daily_price_clean()

function daily_price_clean( $source_file, $new_file ) {
    $lines = file( $source_file );
    # open the output once, in append mode (creates the file if needed)
    $handle = fopen( $new_file, 'a' );
    foreach ( $lines as $line_num => $line ) {
        # replace the "-9,999,401" placeholder with NULL
        $line = str_replace( "\"-9,999,401\"", "NULL", $line );
        # strip volume quotes and commas, if a quoted field is present
        $pieces = explode( "\"", $line );
        if ( isset( $pieces[1] ) ) {
            $pieces[1] = str_replace( ",", "", $pieces[1] );
        }
        $fixed_line = implode( "", $pieces );
        # re-arrange the date from m/d/yy to 20yy-mm-dd, zero-padded
        $date_repair = explode( ",", $fixed_line );
        $date_digits = explode( "/", $date_repair[2] );
        $date_repair[2] = sprintf( "20%02d-%02d-%02d", $date_digits[2], $date_digits[0], $date_digits[1] );
        $fixed_line2 = implode( ",", $date_repair );
        # write out new file
        fwrite( $handle, $fixed_line2 );
    }
    fclose( $handle );
}
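The same cleanup can also be approached with PHP's built-in str_getcsv(), which understands quoted fields, so a volume like "1,234,567" arrives as a single field instead of being split on its commas. The sketch below is an alternative under assumptions, not the presenter's code: the function name is hypothetical, and the column positions (date in column 2, m/d/yy format) are taken from the loader shown earlier.

```php
<?php
// Hypothetical line cleaner built on str_getcsv().
// Returns the fields of one csv line with the placeholder nulled,
// thousands separators removed and the date rewritten as yyyy-mm-dd.
function clean_price_line(string $line): array
{
    $fields = str_getcsv(trim($line), ',', '"', '\\');
    foreach ($fields as $i => $value) {
        if (!is_string($value)) {
            continue; // str_getcsv yields null for an empty line
        }
        if ($value === '-9,999,401') {
            $fields[$i] = null; // placeholder value from the feed
            continue;
        }
        // "1,234,567" -> "1234567"
        if (preg_match('/^-?\d{1,3}(,\d{3})+(\.\d+)?$/', $value)) {
            $fields[$i] = str_replace(',', '', $value);
        }
    }
    // m/d/yy -> 20yy-mm-dd, zero-padded
    if (isset($fields[2]) && preg_match('#^(\d{1,2})/(\d{1,2})/(\d{2})$#', $fields[2], $m)) {
        $fields[2] = sprintf('20%02d-%02d-%02d', $m[3], $m[1], $m[2]);
    }
    return $fields;
}
```

Returning an array rather than a rewritten line also makes it straightforward to bind the values to a prepared statement later, with null mapping to SQL NULL.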

Page 36: Financial Informatics_ Startup Low-cost Dataload Challeng...

load_security()

function load_security( $security_file ) {
    # read the csv, dropping line endings and empty lines
    $lines = file( $security_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
    $counter = 0;
    foreach ( $lines as $line_num => $line ) {
        $counter = $counter+1;
        $row = explode( ",", $line );
        $cusip = $row[0];
        $ric = $row[1];
        $ticker = trim( $row[2] );
        $today = date('Y-m-d');
        $query = "INSERT INTO SECURITY ( CUSIP, RIC, TICKER, CREATED_DATE ) VALUES ( "
            . "'" . $cusip . "',"
            . "'" . $ric . "',"
            . "'" . $ticker . "',"
            . "'" . $today . "')" ;

Page 37: Financial Informatics_ Startup Low-cost Dataload Challeng...

load_security() (cont’d)

        # put the row in the SECURITY table
        sm_query( $query );
        if ( ($counter%100) == 0 ) {
            echo $counter . " lines processed.\n";
        }
    }
    echo $counter . " total lines processed.\n";
}

Page 38: Financial Informatics_ Startup Low-cost Dataload Challeng...

load_prices()

function load_prices( $date ) {
    $query = "INSERT INTO PRICE
        SELECT
            S.SECURITY_ID, RP.ASOF_DATE, RP.OPEN, RP.HIGH,
            RP.LOW, RP.CLOSE, RP.VOLUME, RP.SPLIT_FACTOR,
            date(now())
        FROM
            RAW_PRICE RP,
            SECURITY S
        WHERE
            S.RIC = RP.RIC
        AND
            RP.ASOF_DATE = '" . $date . "'";
    echo $query ;
    sm_query( $query );
}

Page 39: Financial Informatics_ Startup Low-cost Dataload Challeng...

Dependency Task Scheduling

PHP and shell scripts are useful tools to download and process price data

But cron doesn’t do a very good job of keeping track in a database of when something starts, finishes, fails, fails to start

If email is broken or cron isn’t reporting correctly you may not know of problems until it’s too late

Often a layer of metadata fails b/c of failed or weird market data, a missing price can make a graph or signal look weird to customers

You can’t load prices if the ftp or feed fails

You can’t process corporate actions until you know the price

You can’t get accurate calculations against time-series if there are holes in the series

You can’t send signals or present accurate graphs if anything related to a security fails

Keeping track of failed jobs gives you a flag that can also tell your users what they’re seeing is questionable and will be corrected

You can report on a jobs list and throw alerts on failed jobs
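The dependency chain above (feed, then prices, then corporate actions) can be sketched as two small functions over a jobs list. Everything here is hypothetical: the job names, the status values 'pending', 'done' and 'failed', and the in-memory arrays stand in for whatever the JOBS table actually stores.

```php
<?php
// $dependencies maps job => list of jobs it needs.
// $status maps job => 'pending' | 'done' | 'failed' (missing means pending).

// Jobs that have not run yet and whose prerequisites are all done.
function runnable_jobs(array $dependencies, array $status): array
{
    $ready = [];
    foreach ($dependencies as $job => $needs) {
        if (($status[$job] ?? 'pending') !== 'pending') {
            continue;
        }
        $ok = true;
        foreach ($needs as $dep) {
            if (($status[$dep] ?? 'pending') !== 'done') {
                $ok = false;
                break;
            }
        }
        if ($ok) {
            $ready[] = $job;
        }
    }
    return $ready;
}

// Failed jobs plus everything downstream of them: candidates for a
// "questionable data" flag in the UI and for alerts.
function tainted_jobs(array $dependencies, array $status): array
{
    $tainted = array_keys(array_filter($status, function ($s) {
        return $s === 'failed';
    }));
    do {
        $changed = false;
        foreach ($dependencies as $job => $needs) {
            if (!in_array($job, $tainted, true) && array_intersect($needs, $tainted)) {
                $tainted[] = $job;
                $changed = true;
            }
        }
    } while ($changed);
    return $tainted;
}
```

A scheduler loop would call runnable_jobs() after each job finishes, and a report would use tainted_jobs() to decide which outputs to mark as questionable.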

Page 40: Financial Informatics_ Startup Low-cost Dataload Challeng...

Tracking variances in data quality

Price weirdness: yesterday’s price / today’s price

Row weirdness: num rows yesterday / num rows today

Range weirdness: yesterday’s average of a sum / today’s average of a sum
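The price check can be sketched per security as follows. The function name and the 0.5 to 2.0 band are assumptions picked for illustration; the slides do not give thresholds, and a genuine split will also trip a check like this, so hits should be compared against corporate actions before alerting.

```php
<?php
// Hypothetical price-variance check.
// $yesterday and $today map security_id => closing price.
// Returns security_id => ratio for suspicious moves, or null when
// today's price is missing or unusable.
function price_ratio_suspects(array $yesterday, array $today, float $low = 0.5, float $high = 2.0): array
{
    $suspects = [];
    foreach ($yesterday as $security_id => $prev_close) {
        if (!isset($today[$security_id]) || $today[$security_id] <= 0) {
            $suspects[$security_id] = null; // price failed to arrive
            continue;
        }
        $ratio = $prev_close / $today[$security_id];
        if ($ratio < $low || $ratio > $high) {
            $suspects[$security_id] = $ratio;
        }
    }
    return $suspects;
}
```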

Page 41: Financial Informatics_ Startup Low-cost Dataload Challeng...

Questions?

Acknowledgements

Starmine: Tripp, Flanzer, Foster, Breffle, Miller

Cake Financial: Reed