References to blogs, tweets, discussions, etc. that caught my attention during the last week.

Data Modeling

“Hashing in SQL Server and Oracle for the same output” by Roelant Vos contains code examples showing how to get the same MD5 hash values in Oracle and SQL Server, so that they can be used in Data Vault models (hash keys and/or hash difference calculation).
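To illustrate why matching hashes across platforms takes some care, here is a minimal Python sketch. It is not taken from Roelant Vos's article. The helper name `hash_key`, the `|` delimiter and the trim/uppercase normalization are assumptions chosen for illustration. The point it shows is general: MD5 works on bytes, so two systems only agree when they normalize and encode the text identically before hashing.

```python
import hashlib


def hash_key(*business_keys, delimiter="|", encoding="utf-8"):
    """Return an uppercase MD5 hex digest for a (composite) business key.

    Hypothetical helper: the normalization rules below are illustrative
    assumptions, not a Data Vault standard.
    """
    # Normalize every part the same way: None becomes an empty string,
    # everything else is converted to text, trimmed and uppercased.
    parts = ["" if key is None else str(key).strip().upper() for key in business_keys]
    # Join with a delimiter so ("a", "b") and ("ab",) produce different input.
    payload = delimiter.join(parts).encode(encoding)
    return hashlib.md5(payload).hexdigest().upper()


# The same text hashed from two different byte encodings gives two
# different digests, which is one way cross-platform hashes can diverge.
same_text = "Zürich"
utf8_hash = hashlib.md5(same_text.encode("utf-8")).hexdigest()
utf16_hash = hashlib.md5(same_text.encode("utf-16-le")).hexdigest()

if __name__ == "__main__":
    print(hash_key("  customer-42 ", "CH"))
    print(utf8_hash == utf16_hash)
```

Whatever normalization and encoding are chosen, they have to be applied identically in every database that computes the key.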

Data Architecture

Justin Kestelyn blogged about “Cloudera Enterprise 5.3 is Released”. Some of the additions in Cloudera Enterprise 5.3:

  • Security enhancements like folder-level HDFS encryption and sharing of data between Hive, Impala, and others using Apache Sentry
  • Apache Spark 1.2
  • Impala 2.1 and Hue 3.7
  • Apache Flume includes an Apache Kafka channel

Data Storage

In “Hadoop’s Achilles Heel in 2015”, Nick Heudecker expects enterprise adoption of Hadoop to suffer unless Hadoop operations and development become much simpler. Hadoop skills are still rare, and the ecosystem is complex and keeps getting more complex as new components are added.

Gregory Steulet did a comprehensive performance comparison of different MySQL versions, including MariaDB and Percona. See “MySQL versions performance comparison”.

See the MapR blog article “What Kind of Hive Table is Best for Your Data?” by Jim Bates for workload-specific options to improve Hive performance. The comparison covers different storage formats (text file, RCFile, ORC) and compression types. The scripts used for the comparison are available.

Data Flow

“Getting started with Apache NiFi” by Joey Echeverria introduces Apache NiFi as a data flow system. NiFi provides a graphical user interface to design, control and monitor data flows in Hadoop for easier and faster development. The National Security Agency (NSA) transferred NiFi (“Niagarafiles”) to Apache (incubator status). This is not the NSA’s first Apache project: in 2011 it launched Accumulo, a NoSQL database based on Google’s BigTable.

Data Tools

Apache Ranger (formerly called Apache Argus or HDP Advanced Security) is part of Hortonworks’ security approach with central security policy administration. The blog post “Apache Ranger Audit Framework” by Madhan Neethiraj describes some audit options in Hortonworks HDP 2.2.

Data Visualization

“How David McCandless makes beautiful visualizations that go viral over and over” by Joseph Stromberg shows excellent visualization examples from David McCandless and an overview of his design process.

The PDF article “An Economist’s Guide to Visualizing Data” by Jonathan A. Schwabish in the Journal of Economic Perspectives [28(1): 209-234] contains a consolidated overview of visualization:

  • various diagram types (line chart, clutterplot, pie chart, etc.)
  • good and bad visualization design
  • visualization tools / resources

“Das Jahr 2014 in der Neuen Züricher Zeitung” shows NZZ newspaper article headings as interactive bubbles by category and by time.

Data Miscellaneous

The site CVE Details provides an interface to the vulnerabilities in the CVE database. Some statistics including 2014: