Blog post with links to a web session and source code on how to use BIML to generate a Data Vault. BIML (Business Intelligence Markup Language) is an XML dialect for defining BI assets like tables, ETL flows, etc. See “Auto Generate Data Vault using Biml – Part 1 – Webinar Content” by Peter Avenent.
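To give a feel for the dialect, here is a minimal BIML-style fragment defining a single table; the schema, table, and column names are made up for illustration and the exact element set may differ between Biml versions:

```xml
<!-- Hypothetical example: a Data Vault hub table declared in Biml.
     Code generators walk this metadata to emit DDL and ETL packages. -->
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Tables>
    <Table Name="HubCustomer" SchemaName="DV.dv">
      <Columns>
        <Column Name="HubCustomerHK" DataType="Binary" Length="16" />
        <Column Name="CustomerBK" DataType="String" Length="50" />
        <Column Name="LoadDate" DataType="DateTime" />
        <Column Name="RecordSource" DataType="String" Length="100" />
      </Columns>
    </Table>
  </Tables>
</Biml>
```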
A LinkedIn discussion about “Data Vault, Data Virtualisation and agile DW” addresses current approaches to making Data Vault more agile in the Data Mart layer as well. With faster hardware and/or in-memory column-oriented databases, virtual Data Marts implemented as views are often sufficient.
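As a hedged sketch of the idea (all table and column names are hypothetical), a virtual Data Mart dimension can simply be a view joining a Data Vault hub to its current satellite rows:

```sql
-- Hypothetical Data Vault source tables: dv.hub_customer (business key)
-- and dv.sat_customer (descriptive attributes with effective dating).
CREATE VIEW dm.dim_customer AS
SELECT h.customer_bk        AS customer_key,
       s.customer_name,
       s.customer_segment
FROM   dv.hub_customer h
JOIN   dv.sat_customer s
  ON   s.hub_customer_hk = h.hub_customer_hk
WHERE  s.load_end_date IS NULL;   -- keep only the current satellite version
```

No ETL step materializes the dimension; the column store does the join at query time.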
“Archiving everything with Hadoop” by Mark Cusack on Roberto Zicari’s blog names three key features Hadoop has to provide in order to be suitable as a long-term archive: schema preservation, security and governance, and SQL access.
Mark Rittman started a three-part blog series on Hadoop ETL using MapReduce, YARN, Tez, and Spark, with examples and an overview of how the tools work:
- Going Beyond MapReduce for Hadoop ETL Pt.1 : Why MapReduce Is Only for Batch Processing
- Going Beyond MapReduce for Hadoop ETL Pt.2 : Introducing Apache YARN and Apache Tez
- Going Beyond MapReduce for Hadoop ETL Pt.3 : Introducing Apache Spark
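The series starts from the classic MapReduce model that the newer engines improve on. As a plain-Python illustration (not Hadoop code), the map, shuffle, and reduce stages of the canonical word count look like this; in real MapReduce each such pass is a separate batch job that writes to disk, which is exactly the overhead Tez and Spark avoid:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit one (word, 1) pair per word occurrence.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key, as Hadoop
    # does between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

Chaining several of these passes is what makes multi-step ETL in pure MapReduce slow; DAG engines keep intermediate data in memory instead.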
Gwen Shapira has collected several links with more information about Kafka: “Getting started with Kafka – Resources“.
ETL tools are widely used in the classical DWH because of their supposed productivity and maintenance advantages compared to manual coding. But what is the role of ETL tools if data-loading code is generated automatically? Roelant Vos shares his view in his blog article “Do we still want to automate against ETL tools?“.
Reference to the “Impala Cookbook” compiled by Cloudera’s Impala team covering schema and physical design, memory usage, query tuning basics, etc.
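Much of the cookbook’s query-tuning advice revolves around table statistics. As a hedged sketch (the table name is hypothetical), the basic Impala workflow is to compute stats so the planner can pick join order and strategy, then inspect the plan:

```sql
-- Gather table and column statistics for the planner
-- (hypothetical fact table name).
COMPUTE STATS sales_fact;

-- Inspect the query plan, including warnings about missing stats,
-- before investing in further tuning.
EXPLAIN
SELECT store_id, SUM(amount)
FROM   sales_fact
GROUP BY store_id;
```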
“Star statistician Hans Rosling takes on Ebola” by Science Magazine’s Kai Kupferschmidt. Rosling is well known for his inspiring talks featuring great visualisations.