Enterprises have adopted standards and tooling to manage their code and infrastructure, such as source code editors, version control systems, and testing suites. Yet most still fail to manage their data with the same care and effort. GOODS shows how metadata management can be automated and addresses challenges such as:
- Data is scattered across many silos within the enterprise. A single central data warehouse (DWH) works well in theory, but not in practice for most organizations.
- Projects rarely provide metadata; they mainly deliver source code and infrastructure requirements. Any metadata that does exist quickly becomes outdated, since it is rarely maintained.
- Employees are usually unaware of which datasets are available, especially experimental data.
GOODS continuously crawls internal storage systems for new or updated datasets and versions the metadata whenever a dataset changes. Versions are clustered in order to reduce maintenance overhead.
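One way such version clustering can work is by abstracting away the parts of a dataset path that change between versions (dates, version numbers), so that daily or versioned copies of the same logical dataset collapse into one catalog entry. The following is a minimal sketch of that idea; the path patterns and function names are illustrative assumptions, not GOODS's actual implementation.

```python
import re
from collections import defaultdict

def abstract_path(path: str) -> str:
    """Replace date-like and version-like path segments with placeholders."""
    path = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", path)  # e.g. 2024-01-01
    path = re.sub(r"\d{8}", "<date>", path)              # e.g. 20240101
    path = re.sub(r"\bv\d+\b", "<version>", path)        # e.g. v1, v2
    return path

def cluster_versions(paths):
    """Group concrete dataset paths under one abstract (logical) path."""
    clusters = defaultdict(list)
    for p in paths:
        clusters[abstract_path(p)].append(p)
    return dict(clusters)

paths = [
    "/logs/clicks/2024-01-01/part-0",
    "/logs/clicks/2024-01-02/part-0",
    "/models/ranker/v1/weights",
    "/models/ranker/v2/weights",
]
# Four physical datasets collapse into two logical catalog entries.
clusters = cluster_versions(paths)
```

The catalog then only needs to maintain metadata per cluster, with the member versions listed underneath it.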
Metadata in the GOODS catalog contains information about
- Size, format, last modified date (basic metadata)
- Reading/writing jobs, downstream/upstream datasets (provenance metadata)
- Number of records, schema, key fields (content-based metadata)
- And other categories
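A catalog entry combining the categories above might look like the following sketch. The field names are my own illustrative assumptions, not the schema the paper defines.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one (logical) dataset."""
    path: str
    # Basic metadata
    size_bytes: Optional[int] = None
    file_format: Optional[str] = None
    last_modified: Optional[str] = None
    # Provenance metadata
    reading_jobs: list = field(default_factory=list)
    writing_jobs: list = field(default_factory=list)
    upstream_datasets: list = field(default_factory=list)
    downstream_datasets: list = field(default_factory=list)
    # Content-based metadata
    num_records: Optional[int] = None
    schema: Optional[str] = None
    key_fields: list = field(default_factory=list)

entry = CatalogEntry(
    path="/logs/clicks/<date>/part-0",
    size_bytes=1_048_576,
    file_format="recordio",
    upstream_datasets=["/raw/frontend_events/<date>"],
    key_fields=["user_id"],
)
```

Most of these fields start out empty and are filled in incrementally as the different crawlers and analyzers run.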
Some metadata is discovered directly, while other metadata is inferred by analyzing logs, source code repositories, project databases, etc. An initial requirement was that projects must neither have to provide metadata nor change their code; metadata gathering has to happen automatically. Projects can, however, add further information to the catalog manually. When a huge number of new datasets is deployed, the schema analyzer job may take days or even weeks to process them. Therefore a second job runs alongside it, focusing only on the most important datasets, which delivers results much earlier.
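The two-job strategy above can be sketched as a fast pass that picks the highest-importance datasets out of the backlog, while the full analyzer eventually covers everything. The importance scores and the selection function here are illustrative assumptions.

```python
import heapq

def priority_pass(datasets, importance, budget):
    """Return the `budget` most important datasets to profile first.

    The full analyzer job still processes the complete backlog;
    this fast pass just surfaces results for key datasets early.
    """
    return heapq.nlargest(budget, datasets, key=lambda d: importance.get(d, 0))

backlog = ["/logs/clicks", "/tmp/scratch", "/models/ranker", "/logs/debug"]
importance = {"/logs/clicks": 10, "/models/ranker": 7, "/logs/debug": 3}
# The fast pass profiles the two most important datasets first.
fast_batch = priority_pass(backlog, importance, budget=2)
```

`heapq.nlargest` keeps the selection O(n log k), which matters when the backlog contains millions of newly discovered datasets.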
The paper closes with future work to further improve the usefulness of the catalog for employees, e.g.
- Rank datasets for importance
- Better understand semantics of data