The book “Data Architecture: A Primer for the Data Scientist” by W.H. Inmon and D. Linstedt carries the subtitle “Big Data, Data Warehouse and Data Vault”, which summarizes the main focus of the book pretty well.

The first chapter introduces and defines structured and unstructured data. Unstructured data is further divided into repetitive and nonrepetitive data. Repetitive unstructured data (e.g. metering data, clickstream data, etc.) is mainly processed by Hadoop/NoSQL-centric tools, while nonrepetitive unstructured data (e.g. emails, documents) must be processed by textual disambiguation. Repetitive unstructured data wrongly gets most of the attention nowadays because it is rather easy to manage compared to nonrepetitive unstructured data. But the latter is the most important for data scientists to work on.

The next chapters introduce Big Data, Data Warehouse and Data Vault. All pieces are put together in the remaining chapters following “6.1 A brief history of Data Architecture”. Repetitive unstructured data is processed by distillation and filtering, while nonrepetitive unstructured data requires various contextualization techniques, e.g. acronym resolution, tagging, stop word processing, word stemming, etc.
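To make the contextualization steps a bit more tangible, here is a minimal Python sketch of three of them: acronym resolution, stop word processing and word stemming. It is not taken from the book. The lookup tables, the suffix list and the function names (`naive_stem`, `contextualize`) are assumptions made up for illustration, and the stemmer is deliberately crude compared to a real one. Tagging is left out.

```python
import re
from typing import List

# Hypothetical lookup tables, invented for this illustration only.
ACRONYMS = {"dwh": "data warehouse", "etl": "extract transform load"}
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}
SUFFIXES = ("ing", "ed", "s")  # checked in this order


def naive_stem(word: str) -> str:
    """Strip at most one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def contextualize(text: str) -> List[str]:
    """Lowercase, expand acronyms, drop stop words, then stem each word."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    result = []
    for token in tokens:
        # Acronym resolution: one token may expand into several words.
        for word in ACRONYMS.get(token, token).split():
            if word in STOP_WORDS:
                continue  # stop word processing
            result.append(naive_stem(word))  # word stemming
    return result


print(contextualize("The DWH is loading emails"))
# → ['data', 'warehouse', 'load', 'email']
```

A production pipeline would likely rely on curated taxonomies and an established stemming algorithm; the sketch only shows how the steps chain together.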

The last chapter summarizes the “composite data architecture” in terms of the timeliness of data:

  • Integration of data in the Data Warehouse / Data Vault
  • Analytics and Archival of Big Data
  • Metadata within and across environments
  • Data is detailed and granular as a system of record

The book provides an architectural, high-level overview of Big Data, DWH and Data Vault. It ends with a concise data architecture blueprint: the “composite data architecture” combines the authors’ earlier work (e.g. “DWH 2.0” by W.H. Inmon and “Data Vault” by D. Linstedt) in one architecture. It offers one view of how to combine different types of data and provides a lot of ideas to follow.

“If you are building a one-story, one-room log cabin in the forest, you don’t need much of a blueprint. But if you are building a large, complex expensive multistory building in the middle of a city, you need blueprints. There is much to be considered when it comes to building a multistoried structure in the middle of a modern city. And there is the same complexity and expense when it comes to a modern information infrastructure for technology and data.” (quotation extracted from the book, page 329)

Some negative remarks:

  • I found it strange and somewhat inconvenient that there are subchapter titles (e.g. for 1.1, 1.2, 2.1, 2.2, etc.) but never a chapter title (e.g. for 1, 2, etc.).
  • Unstructured text is discussed in detail – which is good. Image data, video data and audio data are neglected.
  • There are many illustrations to brighten up the text, but some of them are IMO too mundane.
  • Some repetitions of content.