References to blogs, tweets, discussions, etc. that caught my attention during the last week.

Data Modeling

Part 2 of the article about loading Data Vault 2.0 using SAS DI Studio. The referenced LinkedIn discussion also contains some feedback from D. Linstedt. A very comprehensive framework for loading Data Vault with SSIS can be found on Roelant Vos’ blog.

Data Architecture

Keynotes and speaker slides from the Strata & Hadoop World conference in Barcelona are online. There is an excellent and comprehensive tutorial on architectural considerations for Hadoop by Grover / Shapira / Malaska / Seidman.

In his blog article “Hermitage: Testing the ‘I’ in ACID”, Martin Kleppmann examines the isolation levels of various database systems. Weak isolation did not first appear with the rise of NoSQL DBs, BASE and the CAP theorem. It was introduced in RDBMSs long ago for better performance, but every vendor has its own definitions. He refers to a model of isolation levels that he used to implement the test suite “Hermitage” in support of his research.
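To make the idea of weak isolation concrete, here is a minimal sketch of a "lost update", one of the anomalies that weaker isolation levels such as read committed can permit. It is a toy in-memory simulation, not taken from Hermitage or from any database; the names `Store`, `increment_interleaved` and `increment_serial` are hypothetical and exist only for illustration.

```python
class Store:
    """Toy key-value store: no locking, every write is visible immediately."""

    def __init__(self):
        self.data = {"counter": 0}


def increment_interleaved(store):
    """Two transactions read the counter before either one writes it back."""
    t1_seen = store.data["counter"]      # T1 reads 0
    t2_seen = store.data["counter"]      # T2 reads 0 as well
    store.data["counter"] = t1_seen + 1  # T1 writes 1 and commits
    store.data["counter"] = t2_seen + 1  # T2 writes 1, overwriting T1's update
    return store.data["counter"]


def increment_serial(store):
    """The same two increments, executed one after the other."""
    for _ in range(2):
        seen = store.data["counter"]
        store.data["counter"] = seen + 1
    return store.data["counter"]


if __name__ == "__main__":
    print("interleaved:", increment_interleaved(Store()))
    print("serial:     ", increment_serial(Store()))
```

The interleaved run ends with one increment missing, while the serial run keeps both. Whether a real database allows the interleaved schedule depends on its isolation level and on how the vendor implements it, which is the kind of difference the article examines.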

eBay and PayPal have large and challenging DB environments. Which database best fits their various needs? A one-size-fits-all approach is not sufficient, so eBay and PayPal aim to use the most suitable DB for each use case. Iggy Fernandez reports on presentations from the NoCOUG fall conference. His report includes screenshots of MongoDB, Cassandra, HBase and Couchbase use cases.

Reference card for Apache Spark by DZone: Apache Spark Cheat Sheet

Data Storage

Nikolay Savvinov’s Oracle blog article “Excessive Commits?” addresses a case in which reducing the number of commits in a process can lead to worse performance. Waits on log buffer space increased after the number of commits was reduced, because redo generation was the actual bottleneck.

Does “alter session set container” reset session statistics in an Oracle multitenant DB? Franck Pachot examined the question in his blog article “When Oracle resets session statistics” and discovered inconsistent behaviour after setting the container: session statistics are reset but events and time model are not.

Data Tools

BigBench is a benchmark tool for running different workloads on a Hadoop cluster. Hive is the first implemented module, containing 30 mixed queries/workloads against structured, semi-structured and unstructured data. The Cloudera blog has an overview with references and a test run.