Our challenge for Bulkload reliability improvement


Page 1: Our challenge for Bulkload reliability  improvement

Our challenge for Bulkload reliability improvement

Satoshi Akama

July 14, 2016, Treasure Data Tech Talk 201607 #tdtech


Page 2: Our challenge for Bulkload reliability  improvement

Satoshi Akama

Embulk plugins: embulk-input-gcs, embulk-input-azure_blob_storage, embulk-output-azure_blob_storage, embulk-output-dynamodb, embulk-output-sftp

Software Engineer (Java/Scala/Ruby)

github.com/sakama/

@oreradio

Treasure Data, Inc.

Page 3: Our challenge for Bulkload reliability  improvement

Topics

Embulk plugin development
  Retry! Retry!! Retry!!!
  Exception handling
  Battle with external services' specs
  Write unit tests
  Java or JRuby?

Use Embulk at Treasure Data
  Integration tests
  Implement a new API endpoint
  Infrastructure management

Page 4: Our challenge for Bulkload reliability  improvement

We're using Embulk as our bulkload tool
Pluggable bulkload tool
Released as OSS
We're using the same version as the OSS release

Page 5: Our challenge for Bulkload reliability  improvement

A GUI interface is available. Currently the output side only :)

Page 6: Our challenge for Bulkload reliability  improvement

Data Connector (Import) - CUI: guess / preview / import

$ td connector:guess seed.yml -o load.yml

$ td connector:preview load.yml

$ td connector:issue load.yml --database td_sample_db \
    --table td_sample_table

Scheduled execution

$ td connector:create \
    daily_import \
    "10 5 * * *" \
    td_sample_db \
    td_sample_table \
    load.yml \
    --time-column created_at

GUI will come in the near future

Page 7: Our challenge for Bulkload reliability  improvement

Documents and magazines: official website, Qiita (JP only), Twitter

http://www.embulk.org/ http://qiita.com/search?q=embulk #embulk

Page 8: Our challenge for Bulkload reliability  improvement

Plugin development

Page 9: Our challenge for Bulkload reliability  improvement

Retry! Retry!! Retry!!!

…
Storage.Objects.Get getObject = client.objects().get(bucket, key);
InputStream stream = getObject.executeMediaAsInputStream();

Embulk (embulk-core) provides RetryExecutor

Most official SDKs contain retry logic, but it is not enough.

try {
    return retryExecutor()
            .withRetryLimit(3)
            .withInitialRetryWait(500)
            .withMaxRetryWait(30 * 1000)
            .runInterruptible(new Retryable<InputStream>() {
                @Override
                public InputStream call() throws InterruptedIOException, IOException
                {
                    Storage.Objects.Get getObject = client.objects().get(bucket, key);
                    return getObject.executeMediaAsInputStream();
                }
            });
}
catch (RetryGiveupException ex) {
    …
}
catch (InterruptedException ex) {
}

On failure: retry using exponential backoff
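As a rough illustration of the exponential backoff used above (a minimal sketch in plain Java, not Embulk's actual RetryExecutor; retryWithBackoff and its parameters are made up for this example):

// Minimal exponential-backoff sketch: the wait doubles after every failure, capped at a maximum.
// The parameters mirror the withRetryLimit / withInitialRetryWait / withMaxRetryWait values above.
public static <T> T retryWithBackoff(java.util.concurrent.Callable<T> task,
        int retryLimit, long initialWaitMillis, long maxWaitMillis) throws Exception
{
    long wait = initialWaitMillis;
    for (int attempt = 0; ; attempt++) {
        try {
            return task.call();
        }
        catch (Exception e) {
            if (attempt >= retryLimit) {
                throw e;  // give up after the last allowed retry
            }
            Thread.sleep(wait);
            wait = Math.min(wait * 2, maxWaitMillis);  // exponential backoff with a cap
        }
    }
}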

Page 10: Our challenge for Bulkload reliability  improvement

Java or JRuby? Embulk supports both Java-based and JRuby-based plugins.

Java-based plugins
  High performance
  Filter / Parser / Formatter / Encoder / Decoder plugins need high performance
  Some enterprise services/software provide a Java SDK
  Written in Java 7 (the MapReduce Executor needs Java 7)

JRuby-based plugins
  Easy to write
  Fine when the network is the bottleneck (like cloud services)

Page 11: Our challenge for Bulkload reliability  improvement

Exception handling to avoid infinite retries

ConfigException
The transaction method should validate all config values and throw ConfigException (or a subclass) when validation fails.

public ConfigDiff transaction(ConfigSource config, FileInputPlugin.Control control)
{
    …
    if (task.getFiles().isEmpty()) {
        throw new ConfigException("File is empty");
    }
}

DataException
The plugin should throw DataException (or a subclass) when it finds an invalid record.

…
}
catch (CsvTokenizer.InvalidFormatException | CsvTokenizer.InvalidValueException … e) {
    if (stopOnInvalidRecord) {
        throw new DataException("Invalid record");  // throw the exception if stopOnInvalidRecord is true
    }
    log.warn("Invalid record");  // show a warning if stopOnInvalidRecord is false
}
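For context, the stopOnInvalidRecord flag above usually comes from the plugin's task configuration. A minimal sketch of how such an option is typically declared in a Java-based Embulk plugin (the task interface and option name here are illustrative):

import org.embulk.config.Config;
import org.embulk.config.ConfigDefault;
import org.embulk.config.Task;

// Illustrative task interface: declares a stop_on_invalid_record option that
// defaults to false, so invalid records only log warnings unless the user opts in.
public interface PluginTask extends Task
{
    @Config("stop_on_invalid_record")
    @ConfigDefault("false")
    boolean getStopOnInvalidRecord();
}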

Page 12: Our challenge for Bulkload reliability  improvement

Battle with external services' specs

Azure Blob Storage:

String path = "/path/to/file";
String str = String.format("%06d", path.length()) + "!" + path + "!" + "000028"
        + "!" + "9999-12-31T23:59:59.9999999Z" + "!";
String encodedString = BaseEncoding.base64().encode(str.getBytes(Charsets.UTF_8));
String nextToken = "2" + "!" + encodedString.length() + "!" + encodedString;

AWS S3:

String path = "/path/to/file"; // use the path string itself as the next token

Google Cloud Storage:

String path = "/path/to/file";
byte[] utf8 = path.getBytes(Charsets.UTF_8);
byte[] encoding = new byte[utf8.length + 2];
encoding[0] = 0x0a;
encoding[1] = new Byte(String.valueOf(path.length()));
System.arraycopy(utf8, 0, encoding, 2, utf8.length);
String nextToken = BaseEncoding.base64().encode(encoding);

Examples of building the next token for each object storage. The next token is the starting point for the next request when listing the files stored in a bucket or container.
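To show where such a token goes, here is a hedged sketch of paging through an S3 bucket with the AWS SDK for Java v1, where the marker (the last key already listed) plays the role of the next token; the client setup and bucket name are placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3MarkerListingExample
{
    public static void main(String[] args)
    {
        AmazonS3 client = AmazonS3ClientBuilder.defaultClient();  // placeholder client setup
        String lastKey = null;  // S3's "next token" is simply the last key already listed

        ObjectListing listing;
        do {
            // Resume the listing right after the last key from the previous page.
            listing = client.listObjects(
                    new ListObjectsRequest().withBucketName("my-bucket").withMarker(lastKey));
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                System.out.println(summary.getKey());
                lastKey = summary.getKey();  // remember where to resume
            }
        } while (listing.isTruncated());
    }
}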

Page 13: Our challenge for Bulkload reliability  improvement

Write unit tests
We need 80% coverage to use a plugin on our platform, but it is difficult to write tests for an Embulk plugin 😞

Unit tests without a remote connection that I've seen:
  SFTP: create a Java-based virtual SFTP server on the local machine.
  DynamoDB: AWS provides a downloadable version of DynamoDB (see the sketch after this list).
  Filter / Parser / Formatter / Encoder / Decoder plugins

80% coverage is difficult without connecting to the service:
  Set credentials in environment variables.
  Use "Encryption keys" and "Encryption files" on Travis CI.
  Connect to the remote service on each test run.
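A hedged sketch of how a test can talk to the downloadable DynamoDB instead of the real service, assuming DynamoDB Local is already running on localhost:8000 (the endpoint, region, credentials, and class name are placeholders for this example):

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class DynamoDBLocalTestClient
{
    // Builds a client that points at a locally running DynamoDB Local process,
    // so unit tests never have to reach the real AWS endpoint.
    public static AmazonDynamoDB localClient()
    {
        return AmazonDynamoDBClientBuilder.standard()
                .withEndpointConfiguration(
                        new EndpointConfiguration("http://localhost:8000", "us-east-1"))
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("dummy", "dummy")))  // any non-empty values work locally
                .build();
    }
}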

Page 14: Our challenge for Bulkload reliability  improvement

Use embulk at Treasure Data

Page 15: Our challenge for Bulkload reliability  improvement

Architecture of Treasure Data

[Architecture diagram: td commands and the Web Console send requests and receive responses through a Load Balancer to the TD API (API servers) and the Bulkload API (API servers). Bulkload jobs are enqueued to PerfectQueue and dequeued by a TD worker (worker process), which submits the job (retrying if needed) and executes it with the MapReduce / Local Executor; guess/preview uses MySQL.]

Page 16: Our challenge for Bulkload reliability  improvement

TD API / Bulkload API

[Diagram: the Load Balancer routes HTTP requests/responses to the TD API (API servers) and the Bulkload API (API servers). guess/preview is processed on different API servers and is handled as a direct HTTP request/response because it needs a quick response; data import is enqueued to PerfectQueue.]

Page 17: Our challenge for Bulkload reliability  improvement

Huge data comes in

An Embulk config with thousands of columns

Huge data

Enough validation is needed in the transaction method; return clear error or warning messages from the plugin.

The plugin's retry logic is important: retry when a retryable exception happens (see the sketch below). Use the MapReduce Executor.

Reduce usage differences across instances.
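As an illustration of that retry decision, a purely hypothetical helper (not Treasure Data's code): throttling and server-side errors are usually transient and worth retrying, while authentication failures should surface immediately as a ConfigException so the job does not retry forever.

import org.embulk.config.ConfigException;

// Hypothetical helper: decide whether an HTTP error from an external service
// should be retried or reported to the user right away.
public class RetryDecision
{
    public static boolean isRetryable(int statusCode)
    {
        if (statusCode == 401 || statusCode == 403) {
            // Wrong credentials or permissions: retrying will never help.
            throw new ConfigException("Authentication failed (HTTP " + statusCode + "); check your settings");
        }
        // Throttling (429) and 5xx responses are usually transient.
        return statusCode == 429 || (statusCode >= 500 && statusCode < 600);
    }
}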

Page 18: Our challenge for Bulkload reliability  improvement

Write integration tests
Write an integration test for each connector (result output) with RSpec.

Does td connector:guess (embulk guess) work?
Does td connector:preview (embulk preview) work?
Does td connector:issue (embulk run) work as expected?

Does it work with the LocalExecutor? With the MapReduce Executor?

Does it work with filter plugins? Does scheduled execution work as expected?

for each service × many test cases

Page 19: Our challenge for Bulkload reliability  improvement

Want to improve…

The target service times out 😞

The target service returns a 50x error 😞

API limit exceeded 😞

CI failure

Long execution time

for each service × many test cases

Page 20: Our challenge for Bulkload reliability  improvement

Want to implement…
The existing API endpoints are not enough.

guess

preview

issue(run)

GUI console

CUI

It is unclear until the user runs a job (or guess, or preview) and the plugin returns a result or a ConfigException.

Are the username and password valid?

Page 21: Our challenge for Bulkload reliability  improvement

Want to implement…

[Diagram: the GUI console sends the input (host, port, username, password) to a new endpoint that checks whether it is valid before executing jobs.]

Validate before executing jobs:
improve the user experience and reduce jobs on our platform.

Page 22: Our challenge for Bulkload reliability  improvement

Infrastructure Management

Server configuration: Chef
Monitoring: Datadog
Incident resolution: PagerDuty

More reliability with the MapReduce Executor

Chef takes quite a bit of time to build a server.

Page 23: Our challenge for Bulkload reliability  improvement

Thank you!