Thoughts on Medallion Architecture

When I first got into Data Warehousing over twenty-five years ago, nobody spoke of “Bronze”, “Silver” or “Gold” quality data, at least not in those particular terms. It’s not that the concept was any less true historically; there has always been a varying degree of enrichment going on with raw data outside the jurisdiction of the enterprise data warehouse. Nonetheless, that outside enrichment was mostly irrelevant to the data warehouse architect of the past because it was happening in some other system outside the EDW.

As an Enterprise Data Warehouse architect, my responsibility was limited to the Enterprise Data Warehouse, i.e. the “Gold” data quality tier alone. As a result, the mentality was pretty much that if it was happening in somebody else’s world (and it didn’t negatively affect my enterprise data warehouse), then I really didn’t consider it part of my jurisdiction. Part of being successful is knowing what NOT to concern yourself with so that you can focus more faithfully on the things you SHOULD be doing. My world was the data warehouse, and any other database (in other words, “Bronze” or “Silver” quality data) was just there for me to ingest into my data warehouse so I could turn it into “Gold” quality data. Remember “one version of the truth”? That’s the data warehouse. That’s the Gold Tier.

But then things changed. Businesses began to really understand how valuable their data was. They gobbled up what the data warehouse provided, but then they needed more, faster. Due to the data warehouse’s high governance, strict standards, and focus on data accuracy and integration, it took more time to develop and release new subject areas to the business. The business got understandably impatient. Then tools like Alteryx, Tableau and Power BI appeared, allowing the average business analyst to go around the data warehouse and get their own data for reporting and analytics. This end-around violated a central tenet of data warehousing: that all business intelligence reporting and analytics be sourced from the data warehouse. Now, analysts could extract from multiple systems outside the EDW and stitch that data together themselves (in theory), effectively replacing the need for IT-controlled ETL.

Now that ETL process was owned by the business, as was governance. Or the lack thereof, as we’ll see.

The business loved these tools because now, for the first time, they didn’t have to wait three months for IT to finish building out a subject area in the data warehouse before they could start getting some data to answer burning business questions. Now, they could just get a data dump from a vendor or reach out directly to the source system itself and get the raw data they needed. But that bypassed the enterprise data governance chain of command. In other words, if IT cleanses, conforms, standardizes, and integrates the raw data before it hits enterprise reporting, then going around that process strips the data of the valuable enrichment done by IT.

This was inevitable. Data warehousing, for all the good it provides, is admittedly rigid and slow to market. The business needed a faster solution, one that didn’t force them to wait weeks or months for their data, and they found it. Businesses started to trend toward lower-governance, faster-to-market reporting solutions and away from higher-governance, slower-to-market systems (like the EDW). As you can imagine, over the subsequent years this shift away from centralized governance created a lot of fragmentation. The flexibility that this so-called “citizen developer” approach provides the business can be great, but it means that the tech-savviness of the data analyst needs to be a lot higher across the board. It also comes at the cost of losing the investment in corporate standards. There are complicated formulas that are difficult to replicate, and it’s nearly impossible to keep business rules completely in sync across disparate systems. This drift can result in poor data quality, confusion and an erosion of trust in the company’s reporting solutions.

Short-term wins come with a lot of technical debt that builds up over time. The weight of this fragmentation of enterprise data governance began to put pressure on organizations. Suddenly, “one version of the truth” became difficult to achieve and the costs were becoming evident. Personally, I’ve noticed a renewed interest in Enterprise Data Governance in recent years to “right the ship”, so to speak, hence the revival of the Data Warehouse Architect role. But things have changed for us. Gone are the days of managing only Gold Tier data; our role as Enterprise Data Warehouse Architects has now expanded to include managing other data sets outside the EDW. We’re now simply “Enterprise Data Architects”, sans the “warehouse” part, but with those same data warehousing skills implied. Now the Bronze and Silver quality data aren’t just someone else’s problem, they’re our problem. This presents a need for data quality classification that didn’t exist before.

Different architects will have different approaches to defining data quality. These approaches are valid and I’m not here to dissect or debate them. Rather, I put forth my approach to data quality tiering as I define it and as I practice it on a daily basis in my own career. The concept is simple by design: bronze/silver/gold = good/better/best. It’s easy to define and it’s non-technical, making it a perfect analogy for everyone in the business to understand, technical or not. But what criteria do we use to determine data quality? Where does each tier get stored? How do we effectively communicate quality tiering as a concept to the business?

With today’s modern data platforms, we data warehouse architects no longer have “one database to rule them all”. Rather, today’s enterprise data platforms are made up of disparate data stores, underpinned by different technologies, all utilized collectively as a “virtual” single platform. For example, some of the data lives in data lakes, some in the data warehouse and some data might live in other “reporting” databases. I call this centralized, virtual collection of IT-managed data assets the Enterprise Data Platform, or EDP for short.

This means that data at a particular quality tier can reside in different physical locations and still all be considered part of a single quality tier. For example, let’s say you have different Lakehouses holding data from different source systems. Source System A raw data is stored in Lakehouse A and Source System B raw data is stored in Lakehouse B. Although these two Lakehouses are physically separated, they’re both virtually part of a single Bronze quality tier. You might have more Bronze Tier data in Hadoop, or in a traditional RDBMS. It doesn’t matter where the data is physically stored, only that it’s properly identified at the right quality tier.
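To make the idea of a single virtual tier spanning multiple physical stores concrete, here is a minimal sketch in Python. The asset names, store names, and catalog structure are my own illustration, not a standard API; the point is only that tier membership is a classification attribute, independent of where the data physically lives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataAsset:
    name: str   # logical name of the data set
    store: str  # physical platform it lives on
    tier: str   # quality tier: "bronze", "silver", or "gold"


# Hypothetical catalog: three physically separate stores,
# all belonging to the same virtual Bronze Tier.
catalog = [
    DataAsset("source_a_raw", "lakehouse_a", "bronze"),
    DataAsset("source_b_raw", "lakehouse_b", "bronze"),
    DataAsset("legacy_raw",   "hadoop",      "bronze"),
]


def assets_in_tier(catalog: list[DataAsset], tier: str) -> list[DataAsset]:
    """Return every asset in a tier, regardless of where it's stored."""
    return [a for a in catalog if a.tier == tier]


bronze = assets_in_tier(catalog, "bronze")
print(sorted({a.store for a in bronze}))
# → ['hadoop', 'lakehouse_a', 'lakehouse_b'] — three stores, one tier
```

The design choice here is that the tier lives in the catalog metadata, not in the storage platform, which is exactly why a Lakehouse, Hadoop cluster, and RDBMS can all contribute to the same Bronze Tier.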

Silver Tier data is no different; it can live on any physical platform. That includes not only data enriched and stored within the IT-controlled EDP, but also any data that has been enriched in any way by any employee of the business sitting on any storage device owned by the company. For example, IT might enrich and store data within an Oracle database. That’s Silver quality data, stored in a database owned by IT. But, Sally in accounting has an Excel spreadsheet that she uses to collect and enrich data from one of the company’s vendors. Although this data lives on Sally’s computer, that computer (and the data on it) is owned by the company. This is defined as a Silver Tier data asset; it just resides outside the company’s IT-controlled EDP.

I’ll repeat: any company-owned data, anywhere in the organization, that has been enriched in any way by the business, is part of the Silver quality tier, even if IT doesn’t actively manage it.

This definition of Silver Tier is broad by design; it’s highly flexible and helps highlight to the business how they are adding value, no matter where it’s happening. Sally’s spreadsheet might one day become owned by IT or could become an input to the EDW, but regardless it’s still a Silver Tier company data asset, no matter where it’s physically stored. (The same holds true for Bronze Tier data, in theory, but this pattern is more prevalent with business user-enriched Silver Tier data sets.)

So, what about Gold Tier data? Where does that go? That’s an easy one: it gets loaded into the Enterprise Data Warehouse. Now, being a data warehouse guy, this is the ONLY place to store Gold Tier data for me. I build an EDW wherever I go so that I have the proper platform and controls in place to deliver true Gold quality data. Could Gold Tier data be delivered outside of an EDW? In theory, yes. Some organizations will create disparate data marts instead of a single cohesive data warehouse. While I’d call these stovepipes, if that’s your best data, then it’s your Gold Tier. Not everyone can or wants to build an EDW. That’s a strategic decision that you as an architect need to make (although the business value of the EDW makes a very compelling and cogent argument for its existence). I say skip the data warehouse at your own peril.

In the end, medallion architecture should be all about accurately classifying the level of trust the business can put into the data it uses for reporting and analytics. Classifying the data we as IT provide centrally to the business as part of the EDP is a no-brainer, but it’s equally important to extend that classification out to other enterprise-owned data sets. This broad approach maintains collective awareness of these enriched assets and allows for easy classification across the entire enterprise in a language everyone can understand, without implying ownership of any given system outside the ones you as an architect own.

Keep it simple; raw data is Bronze quality. ANY enriched data, anywhere in the company, is Silver quality. And Gold quality data lives only in the EDW, where it always has.
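The three rules above can be summed up in a short classification sketch. This is my reading of the article’s rules expressed as Python, with illustrative parameter names of my own choosing, not a formal or standard definition.

```python
def quality_tier(is_enriched: bool, in_edw: bool) -> str:
    """Classify a company-owned data asset by the three simple rules:
    raw data is Bronze, any enriched data anywhere is Silver,
    and Gold quality data lives only in the EDW."""
    if in_edw:
        return "gold"
    if is_enriched:
        return "silver"
    return "bronze"


# A raw extract sitting in a lakehouse:
print(quality_tier(is_enriched=False, in_edw=False))  # → bronze
# Sally's enriched spreadsheet on her company-owned laptop:
print(quality_tier(is_enriched=True, in_edw=False))   # → silver
# Cleansed, conformed, integrated data in the warehouse:
print(quality_tier(is_enriched=True, in_edw=True))    # → gold
```

Note that the EDW check comes first: in this scheme, landing in the warehouse is what distinguishes Gold from merely enriched Silver, regardless of how much enrichment happened upstream.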

Simple wins every time.