In a recent paper, "Goods: Organizing Google's Datasets", Google describes its approach to post-hoc metadata management. GOODS (GOOgle Dataset Search) is a system that crawls internal storage systems (e.g. GoogleFS, Bigtable, DBMSes) to collect, aggregate, and store metadata. The catalog currently contains 26 billion datasets and makes the gathered metadata available to Google employees, with tools to work with this data marketplace, e.g. searching datasets, browsing dataset profile pages, and monitoring dataset dashboards. The catalog is a core part of Google's "data culture": treating data as an essential asset and sharing non-confidential data across the enterprise.

Enterprises have adopted standards for managing their code and infrastructure through tools such as source code editors, version control systems, and testing suites, yet they still do not manage their data with the same care and effort. GOODS shows how metadata management can be automated and addresses challenges such as:

  • Data is scattered across many silos within the enterprise. A single central DWH works well in theory, but not in practice for most organizations.
  • Projects rarely provide metadata; they mainly provide source code and infrastructure requirements. What metadata exists soon becomes outdated, because it is rarely updated.
  • Available datasets, especially experimental data, are usually unknown to employees.

GOODS continuously crawls the internal storage systems for new or updated datasets and versions the metadata whenever a dataset changes. Versions of the same logical dataset are clustered to reduce maintenance overhead, as sketched below.
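A minimal sketch of such version clustering, assuming that versions of a dataset differ only in volatile path segments like dates or version numbers (the path patterns and helper names are illustrative, not taken from the paper):

```python
import re
from collections import defaultdict

# Abstract volatile path segments (dates, version numbers) into placeholders,
# so e.g. daily snapshots of the same logical dataset share one cluster key.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
VERSION_RE = re.compile(r"\bv\d+\b")

def cluster_key(path: str) -> str:
    """Normalize a concrete dataset path into an abstract cluster key."""
    key = DATE_RE.sub("<date>", path)
    return VERSION_RE.sub("<version>", key)

def cluster_versions(paths: list[str]) -> dict[str, list[str]]:
    """Group concrete dataset paths under their abstract cluster key."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for path in paths:
        clusters[cluster_key(path)].append(path)
    return dict(clusters)

paths = [
    "/gfs/logs/clicks/2016-03-01/part-00",
    "/gfs/logs/clicks/2016-03-02/part-00",
    "/bt/models/ranker/v1",
    "/bt/models/ranker/v2",
]
print(cluster_versions(paths))
# {'/gfs/logs/clicks/<date>/part-00': [... two daily snapshots ...],
#  '/bt/models/ranker/<version>': [... two model versions ...]}
```

The catalog can then maintain most metadata once per cluster instead of once per version.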

Metadata in the GOODS catalog contains information about the following categories (see the sketch after this list):

  • Size, format, last modified date (basic metadata)
  • Reading/writing jobs, downstream/upstream datasets (provenance metadata)
  • Number of records, schema, key fields (content based metadata)
  • Other categories
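To make these categories concrete, here is a minimal sketch of what a catalog entry might look like; all field names are illustrative assumptions, not the actual GOODS schema:

```python
from dataclasses import dataclass, field

# Illustrative catalog entry; field names are assumptions, not GOODS's schema.
@dataclass
class DatasetMetadata:
    path: str
    # Basic metadata
    size_bytes: int = 0
    data_format: str = ""                                # e.g. "recordio"
    last_modified: str = ""                              # ISO timestamp
    # Provenance metadata
    reading_jobs: list[str] = field(default_factory=list)
    writing_jobs: list[str] = field(default_factory=list)
    upstream: list[str] = field(default_factory=list)    # datasets this one is derived from
    downstream: list[str] = field(default_factory=list)  # datasets derived from this one
    # Content-based metadata
    num_records: int | None = None
    schema: str = ""                                     # e.g. a protocol buffer name
    key_fields: list[str] = field(default_factory=list)
    # Manually provided annotations (projects may add these, see below)
    annotations: dict[str, str] = field(default_factory=dict)
```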

Some metadata is discovered directly, while other metadata is deduced by analyzing logs, source code repositories, project databases, etc. A core design requirement was that projects neither have to provide metadata nor change their code: metadata gathering happens automatically. Projects can still add information to the catalog manually. When a large number of new datasets is deployed, the schema analyzer job may take days or even weeks to process them all. A second job therefore runs alongside it, focusing only on the most important datasets and delivering results much earlier.
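A sketch of this two-tier scheduling idea; both the importance signal and the analyzer are hypothetical stand-ins, since the paper does not spell out the implementation:

```python
import heapq

def analyze_schema(path: str) -> str:
    """Stand-in for the expensive schema-inference step."""
    return f"schema({path})"

def importance(path: str) -> float:
    """Hypothetical importance signal, e.g. derived from provenance fan-out."""
    return float(len(path))  # toy stand-in

def priority_pass(new_datasets: list[str], budget: int) -> dict[str, str]:
    """Analyze only the `budget` most important new datasets first;
    the slow full pass catches up on the remainder later."""
    top = heapq.nlargest(budget, new_datasets, key=importance)
    return {path: analyze_schema(path) for path in top}
```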

The paper finally mentions future work to further improve the usefulness of the catalog for employees, e.g.:

  • Rank datasets by importance (a toy sketch of one possible heuristic follows this list)
  • Better understand the semantics of the data
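Ranking is left open in the paper; one plausible heuristic, sketched here purely as an assumption, would be to rank datasets by their fan-out in the provenance graph that the catalog already maintains:

```python
# Hypothetical heuristic: the more datasets are derived from a dataset,
# the more important it probably is.
def rank_by_fanout(downstream: dict[str, list[str]]) -> list[tuple[str, int]]:
    """`downstream` maps each dataset path to the datasets derived from it."""
    return sorted(
        ((path, len(children)) for path, children in downstream.items()),
        key=lambda item: item[1],
        reverse=True,
    )

provenance = {
    "/gfs/logs/clicks/<date>": ["/bt/features/ctr", "/bt/reports/daily"],
    "/bt/features/ctr": ["/bt/models/ranker/<version>"],
    "/bt/models/ranker/<version>": [],
}
print(rank_by_fanout(provenance))  # the clicks logs rank first with fan-out 2
```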