Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector
Guglielmo Iozzia, Big Data Infrastructure Engineer @ IBM Ireland
Data Ingestion for Analytics: a real scenario
In the business area (cloud applications) to which my team belongs, there were many questions to answer, related to:
● Defect analysis
● Outage analysis
● Cyber-security
Data Ingestion: multiple sources...
● Legacy systems
● DB2
● Lotus Domino
● MongoDB
● Application logs
● System logs
● New Relic
● Jenkins pipelines
● Testing tools output
● RESTful Services
Issues
● The need to collect data from multiple sources introduces redundancy, which costs additional disk space and increases query times.
● A small team.
● Lack of skills and experience across the team (and the business area in general) in managing Big Data tools.
● Low budget.
A single tool needed...
● Design complex data flows with minimal coding and maximum flexibility.
● Provide real-time data flow statistics and metrics for each flow stage.
● Automated error handling and alerting.
● Easy to use by everyone.
● Zero downtime when upgrading the infrastructure, thanks to the logical isolation of each flow stage.
● Open Source
Streamsets Data Collector: available processors
● Base64 Field Decoder
● Base64 Field Encoder
● Expression Evaluator
● Field Converter
● JavaScript Evaluator
● JSON Parser
● Jython Evaluator
● Log Parser
● Stream Selector
● XML Parser
...and many others
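In the Data Collector these processors are configured in the UI and chained into a pipeline, not hand-coded. As a rough illustration of what a chain such as Base64 Field Decoder → JSON Parser → Field Converter does to each record, here is a plain-Python sketch (field names and the sample payload are invented for the example):

```python
import base64
import json

def process_record(record):
    """Mimic a chain of three SDC processors on a single record.
    Illustrative only: in the real Data Collector each step below
    is a separate, UI-configured stage."""
    # Base64 Field Decoder: decode the raw payload field
    decoded = base64.b64decode(record["payload"]).decode("utf-8")
    # JSON Parser: parse the decoded string into record fields
    fields = json.loads(decoded)
    # Field Converter: cast the string timestamp to an integer
    fields["ts"] = int(fields["ts"])
    return fields

# Example: a record whose payload is a Base64-encoded JSON string
raw = base64.b64encode(b'{"ts": "1467072000", "level": "ERROR"}').decode()
print(process_record({"payload": raw}))
# → {'ts': 1467072000, 'level': 'ERROR'}
```

The appeal of the tool is precisely that none of this glue code has to be written or maintained by the team.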
Streamsets DC: performance and reliability
● Two available execution modes: standalone or cluster
● Implemented in Java, so any performance best practice/recommendation for Java applications applies here
● REST services for performance monitoring available
● Rules and alerts (both metric and data)
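The REST services make it easy to feed pipeline metrics into an external dashboard or alerting job. A minimal polling sketch follows; the host, port, pipeline name, and counter keys are assumptions based on a typical deployment, so check the REST API reference bundled with your Data Collector UI for the exact URL and payload shape:

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint: host, port, and pipeline name are examples.
SDC_URL = "http://sdc-host:18630/rest/v1/pipeline/logs_to_hadoop/metrics"

def summarize_metrics(metrics):
    """Pull input/output/error record counters out of a metrics payload,
    so a cron job or dashboard can alert on error rates.
    Counter key names are assumptions -- verify against your SDC version."""
    counters = metrics.get("counters", {})
    def count(key):
        return counters.get(key, {}).get("count", 0)
    return {
        "input": count("pipeline.batchInputRecords.counter"),
        "output": count("pipeline.batchOutputRecords.counter"),
        "errors": count("pipeline.batchErrorRecords.counter"),
    }

def poll():
    # Network call, sketched only; add authentication in a real deployment.
    with urlopen(SDC_URL) as resp:
        return summarize_metrics(json.load(resp))
```

A job like this pairs naturally with the built-in metric rules: the rules alert inside the Collector, while the REST poll lets existing monitoring infrastructure see the same numbers.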
Streamsets Data Collector: security
● You can authenticate user accounts based on LDAP
● Authorization: the Data Collector provides several roles (admin, manager, creator, guest)
● You can use Kerberos authentication to connect to origin and destination systems
● Follow the usual security best practices in terms of iptables, networking, etc. for Java web applications running on Linux machines.
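As one concrete example of the iptables point, the Data Collector's web UI and REST services listen on a single port (18630 by default), so it can be restricted to an admin subnet. The subnet below is a placeholder; adapt it to your own network:

```shell
# Allow the Data Collector UI/REST port only from an admin subnet,
# and drop connections from everywhere else.
# 10.0.1.0/24 is an example subnet -- adjust to your deployment.
iptables -A INPUT -p tcp --dport 18630 -s 10.0.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 18630 -j DROP
```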
Useful Links
Streamsets Data Collector:
https://streamsets.com/product/
Thanks!
My contacts:
Linkedin: https://ie.linkedin.com/in/giozzia
Blog: http://googlielmo.blogspot.ie/
Twitter: https://twitter.com/guglielmoiozzia