References to blogs, tweets, discussions, etc. that caught my attention during the last week.

Data Modeling

Data Modeling and NoSQL? Data Modeling gets more and more important in the “schema-less” world because a suitable data model ensures data quality, performance, and other characteristics. The slides “Data Modeling deep dive” by MongoDB Inc. illustrate four MongoDB use cases.

“Driving Keys and relationship history, one or more tables?” by Roelant Vos shows two alternative ways to implement a driving key relationship defined in a Data Vault Link table. A detailed example illustrates the approaches with one Satellite or with two (or more) Satellites.
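To make the driving key idea concrete, here is a minimal Python sketch. It does not reproduce either of the approaches from the post; it only illustrates the general principle that a change of the related key for a given driving key closes the previous relationship. The function name, the row layout, the sample keys, and the open-ended high date are all assumptions made for illustration.

```python
from datetime import date

# Assumed marker for "still valid"; not taken from the referenced post.
HIGH_DATE = date(9999, 12, 31)

def track_driving_key(events):
    """Build relationship history rows from (load_date, driving_key, related_key) events."""
    rows = []      # one dict per relationship version
    current = {}   # driving_key -> index of its currently open row
    for load_date, driving_key, related_key in sorted(events):
        idx = current.get(driving_key)
        if idx is not None:
            if rows[idx]["related_key"] == related_key:
                continue  # relationship unchanged, nothing to record
            rows[idx]["end"] = load_date  # close the superseded relationship
        rows.append({
            "driving_key": driving_key,
            "related_key": related_key,
            "start": load_date,
            "end": HIGH_DATE,
        })
        current[driving_key] = len(rows) - 1
    return rows

# Hypothetical sample data: an employee (driving key) moves between departments.
events = [
    (date(2014, 1, 1), "EMP-1", "DEPT-A"),
    (date(2014, 6, 1), "EMP-1", "DEPT-B"),
    (date(2014, 6, 1), "EMP-2", "DEPT-A"),
    (date(2014, 9, 1), "EMP-1", "DEPT-B"),
]
history = track_driving_key(events)
for row in history:
    print(row)
```

In this sketch the employee is the driving key, so a move to another department end-dates the old row instead of leaving two open relationships for the same employee.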

Data Architecture

“Cache is the new RAM” by Carlos Bueno deals with several tech cycles in the past that were hyped to solve almost any problem, e.g. sharding, NoSQL, MapReduce, etc. And what about 2014 and 2016? In-Memory/RAM again? Very worthwhile and entertaining to read with some tongue-in-cheek messages.

Data Storage

Apache Hadoop 2.6.0 has been released as the fourth major release of 2014, with nearly 900 Jira issues resolved. The announcement from A. Murthy lists some of the changes, including major topics like

  • heterogeneous storage tiers + archival storage
  • security features (e.g. transparent data at rest encryption (beta))
  • support for long-running services in YARN
  • rolling upgrades

See also the Hortonworks, Cloudera, or MapR blogs for more details.

There is and will be a lot of change around program execution in the Hadoop stack, e.g. Tez, Spark, etc. But what about HDFS data storage? Changes such as Parquet and ORCFile were rather small. In “Hadoop’s next refactoring?” Curt Monash argues that a more fundamental change in data storage would make sense, to address issues like caching and especially in-memory data exchange between programs.

Instance caging was introduced in Oracle 11gR2 as a means to limit the available CPU resources. OS Processor Group integration goes a step further and allows an instance to be linked to a named subset of the available CPUs. See Nikolay Manchev’s blog “Processor group integration in Oracle Database 12c” for a detailed description.

Data Tools

Storm and Spark are described as real-time streaming engines in the article “Storm or Spark? Choose your Real-Time weapon” by Andrew C. Oliver. There is no single best approach; criteria such as whether the real-time engine is embedded into an existing stack or run as a stand-alone solution, multi-language support, and others have to be considered.

Data Visualization

The infographic “The Internet is a zoo: The ideal length of everything online” by Mark Uzunian shows statistics about character counts, header length, etc. that attract the most attention, focusing on quantitative criteria (e.g. slides: 6min – 61 slides, blogs: 1500 words, headers: 6 words).

Data Divers

Download the free report “When Hardware Meets Software: How the Internet of Things Transforms Design and Manufacturing” published by O’Reilly.

Oracle 12c Parallel execution white paper.