Data Catalogs: You Probably Need One
This past November, I had the opportunity to attend the PASS Summit in Seattle, WA. It was the largest conference I have attended, chock-full of more useful sessions than I could undertake, so I needed to pare them down to a manageable list. While perusing the list of 219 sessions, I found one dealing with data catalogs. Like many of you out there, I had heard of data catalogs and had an idea of what they were, but something about the session description made me decide to check it out.

The Tunnel…

I learned that many businesses invest enormous sums of money in analytics hardware, software, services, and talent to achieve a data-driven culture. The hope is that such an investment will have a transformative effect on the success of the business. However, many projects fall short of that reality or are abandoned, due to the three constraints affecting every project: scope, cost, and time.

A significant driver of failure is the difficulty involved in finding, interconnecting, and analyzing data in an enterprise environment to extract its value. Analysts spend considerable time looking for data across multiple locations, often interviewing one or more subject matter experts just to learn where it lives. They then need to gain confidence in the data they have discovered by testing different scenarios. Documentation is virtually non-existent, and any that gets generated as part of the effort is usually filed away in the analyst's "secret stash."

The Light!

Businesses that experience the most success in reaching a transformative state utilize some sort of data catalog. Essentially, a data catalog is a system used to inventory data assets, along with their metadata, to facilitate extracting business value by answering three main questions about data:

  1. Where can I find the data I’m looking for?
  2. What is the significance of the data?
  3. How trustworthy/reliable is it?
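To make the idea concrete, here is a minimal, hypothetical sketch of a catalog entry whose fields map directly to those three questions. All names here are invented for illustration; real products store far richer metadata:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One inventoried data asset; fields map to the three questions."""
    name: str
    location: str          # 1. Where can I find the data?
    description: str       # 2. What is its significance?
    last_refreshed: str    # 3. How fresh is it?
    endorsements: int = 0  # 3. Social signal of trustworthiness
    tags: list = field(default_factory=list)

def search(catalog, keyword):
    """Naive, search-engine-like discovery over names and descriptions."""
    kw = keyword.lower()
    return [e for e in catalog
            if kw in e.name.lower() or kw in e.description.lower()]

catalog = [
    CatalogEntry("sales_orders", "warehouse.sales.orders",
                 "One row per customer order", "2019-11-01", endorsements=12),
    CatalogEntry("hr_employees", "warehouse.hr.employees",
                 "Current employee roster", "2019-10-15", tags=["PII"]),
]

hits = search(catalog, "order")
print(hits[0].location)  # warehouse.sales.orders
```

Even this toy version shows the payoff: an analyst searches by keyword instead of interviewing subject matter experts, and the entry itself says where the data lives, what it means, and how much to trust it.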

What Makes It Good?

I learned that a good data catalog allows for search engine-like discovery of data. Candidate data can be further analyzed by viewing any business rules or definitions associated with it, as well as any "tribal" knowledge that may exist. Perhaps even a preview or profile of the data will be available. There is also an indication of the data's reliability, based on refresh frequency and a rating or endorsement system, lending a social aspect to the data catalog. Lineage can also be tracked to see where data originated, along with any transformations or calculations applied along the way to the end product. Of course, the technical metadata about the data is also included in the catalog. And let's not forget security and compliance information (GDPR, HIPAA, PCI, etc.), which can be represented by tagging data.
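The lineage and tagging ideas can also be sketched in a few lines. The sketch below is hypothetical (asset names and structure are invented): lineage is recorded as "derived from" links that can be walked back to the source, and compliance concerns become simple tag lookups:

```python
# Hypothetical catalog: lineage as parent links, compliance as tags.
catalog = {
    "raw.orders":     {"tags": [],              "derived_from": []},
    "staging.orders": {"tags": [],              "derived_from": ["raw.orders"]},
    "mart.revenue":   {"tags": [],              "derived_from": ["staging.orders"]},
    "raw.customers":  {"tags": ["GDPR", "PII"], "derived_from": []},
}

def lineage(name, catalog):
    """Walk parent links back to the original source asset."""
    chain = [name]
    parents = catalog[name]["derived_from"]
    while parents:
        chain.append(parents[0])  # single-parent chains, for simplicity
        parents = catalog[parents[0]]["derived_from"]
    return chain

def tagged(tag, catalog):
    """All assets carrying a given security/compliance tag."""
    return [n for n, meta in catalog.items() if tag in meta["tags"]]

print(lineage("mart.revenue", catalog))
print(tagged("GDPR", catalog))
```

The same two lookups answer very different audiences: lineage tells an analyst whether a number in a report can be traced to a trusted source, while tag queries tell a governance team exactly which assets fall under a regulation.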

For a data catalog to be and remain successful, it must contain more than just technical metadata, and it must be curated. Maintenance and curation of the data catalog may well become an engineer's full-time job. It must also be built with no assumption that users possess some "secret" knowledge, and must take current workflows and processes into consideration, so that adopting it is not an inconvenience. The data catalog should be populated with a business-first approach, keeping the needs of data consumers in the foreground. Getting off the ground and keeping momentum will require demonstrations, evangelism, and crowd-sourcing, engaging all stakeholders: developers, analysts, PMs, data scientists, and governance teams. Start small and build from there.

Where Can I Get One?

There are numerous commercial data catalog products available from different vendors, with a varying array of features. If you want to experience a data catalog, you can try out Microsoft's Azure Data Catalog on the Azure portal, though there is some setup involved. The Apache Software Foundation also offers the Apache Atlas framework in the open-source space.

While one session will not make me an expert by any means, it certainly piqued my curiosity and encouraged me to dig deeper into this subject. I would encourage anyone out there struggling with data projects, whether a new implementation or the day-to-day delivery of value for new requests, to do the same.

What's Your Opinion?