Over the last decade I’ve read more than my fair share of hit-pieces and outright nonsense about data warehousing, most of which predicted its imminent demise. But it’s still here and still delivering on its value-add promise all these years later. Why is that?
To get to this answer, we need to take a step back in time.
I started in data warehousing back in 2000. By that point, data warehousing was maybe a decade old, give or take. It’s unthinkable by today’s standards, but we had a hell of a time getting folks to understand that data itself was inherently valuable and could be used for more than just everyday application processing.
Yeah, you read that right…the very idea of data having additional, untapped value was considered DEBATABLE. Nobody would be caught dead bleating silliness like this these days, but for all kinds of “smart” people back in the day the jury was still out.
I was using Business Objects on top of a data warehouse built on the Informix database platform. The challenge back then was database scaling, performance, and ease of use. Proper ETL tools came along and made data movement and transformation better, faster, and more reliable. New reporting tools brought semantic layers and no-code access to data. Attitudes had shifted toward recognizing value in data, but the challenges of data management remained the same. These tools addressed those challenges and started to make things easier.
Then in 2004 I remember the sales guys pimping the next big warehouse killer: federation. No more ETL pipelines to move or transform data. You just pointed the magic tool to all your databases and suddenly you could produce all your data assets by simply refreshing the report.
Yeah…right. While some tools have made advances in handling federated data, it’s never materialized as the panacea it was billed as. Meanwhile, the data warehouse kept delivering on its promises, same as it ever was.
Let’s jump to 2013. A higher-up at the company I was working for declared that we needed to remove all ETL from our systems (yes, that was a verbatim statement). Hadoop was going to accomplish this, he claimed. I never got the HOW on this (no surprise there) and it sure sounded silly at the time, but by then the salespeople had gotten ahold of the notion and it was all the rage to declare the data warehouse a dinosaur, slated for extinction. It was only a matter of time before Hadoop would supplant the data warehouse, even the relational database itself. One of these vendors even suggested all that was needed was a desktop query tool and all users would just, and I quote, “mash the data together” and produce all the answers. They sold a lot of leaders on this crap. The bloggers were just as bad, hopping on the bandwagon with click-bait articles full of cockamamie ideas that carried very little water.
All predicted the end of data warehousing, even the end of the relational database itself, but none of this happened. Go figure.
But that’s not to say that NOTHING happened. Data warehousing just didn’t go the way they predicted. Much like data federation, Hadoop was useful, but it never lived up to the hype. That’s unfair because the technology does work, just not the way it was portrayed. I’m sure the article clicks and sales contracts brought in some money for a few desperate folks for a while, but it dries up once the fallacies are sheared away. It’s not a long-term solution for success and the damage it did to the industry, to MY industry and my livelihood, wasn’t worth the few bucks these people made doing it.
Federation kinda works. It’s gotten better, but you need a smart and powerful engine behind it to figure out how to efficiently execute and merge the queries. Transformations are still needed, but without persisting data these must be performed at query time. Federation is still alive and well in tools like Power BI and Denodo, but nobody believes that it’s an ETL killer. This is good, because these tools can be useful, provided the folks using them know what they’re good for…and what they’re not.
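To make the query-time point concrete, here's a rough sketch in Python. Everything in it is made up for illustration: the two "source systems", their column names, and the function names are assumptions, not any real tool's API. It only shows that a federated query redoes the cleanup work on every request, while an ETL load does it once and persists the result.

```python
# Hypothetical source systems with inconsistent formats (illustrative data only).
CRM_ROWS = [
    {"cust_id": "001", "state": "tx"},
    {"cust_id": "002", "state": "CA"},
]
BILLING_ROWS = [
    {"customer": 1, "amount_cents": 1250},
    {"customer": 2, "amount_cents": 900},
    {"customer": 1, "amount_cents": 500},
]

# Counts how many source rows have been transformed so far.
STATS = {"transforms": 0}


def normalize_crm(row):
    """Standardize a CRM row: integer key, upper-case state code."""
    STATS["transforms"] += 1
    return {"customer_id": int(row["cust_id"]), "state": row["state"].upper()}


def normalize_billing(row):
    """Standardize a billing row: shared key name, amount in dollars."""
    STATS["transforms"] += 1
    return {"customer_id": row["customer"], "amount": row["amount_cents"] / 100}


def revenue_by_state(customers, payments):
    """Join cleaned payments to cleaned customers and total by state."""
    state_of = {c["customer_id"]: c["state"] for c in customers}
    totals = {}
    for p in payments:
        state = state_of[p["customer_id"]]
        totals[state] = totals.get(state, 0.0) + p["amount"]
    return totals


def federated_revenue_by_state():
    """Federation-style: transform the source rows at query time, every time."""
    customers = [normalize_crm(r) for r in CRM_ROWS]
    payments = [normalize_billing(r) for r in BILLING_ROWS]
    return revenue_by_state(customers, payments)


def load_warehouse():
    """ETL-style: transform once and persist the cleaned tables."""
    return {
        "customers": [normalize_crm(r) for r in CRM_ROWS],
        "payments": [normalize_billing(r) for r in BILLING_ROWS],
    }


def warehouse_revenue_by_state(warehouse):
    """Query the persisted, already-cleaned tables; no transformation needed."""
    return revenue_by_state(warehouse["customers"], warehouse["payments"])


if __name__ == "__main__":
    print(federated_revenue_by_state())
    print(warehouse_revenue_by_state(load_warehouse()))
```

Both paths return the same answer. The difference is where the transformation cost lands: on every report refresh, or once in the load.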
Hadoop works fine, but it’s hard for the average user to understand, and even with Hive it never took off as a schema-on-read data warehouse platform. It’s a wonderful processing platform though.
I mean…not all your problems are nails, so you’re gonna need more than a hammer.
In my younger days, when I had less experience and confidence, I listened to the hype. While it was worrisome for my long-term career plans as a data warehouse architect, I couldn’t find anything substantial to back up the outrageous claims. Now that enough time has passed, that’s become obvious.
With the advent of Hadoop, MPP became “cheap”. Teradata had been doing it for a couple of decades, but it wasn’t cheap. Redshift appeared, as did Microsoft’s SQL Data Warehouse, offering less expensive MPP DATABASE solutions. For data warehouses, these platforms are ideal.
So let’s jump ahead again, this time to the present. Data Lakes are now on the rise. Cloud computing has opened up all kinds of new options and new technologies. Newer MPP databases have become mature. Tools like Power BI have empowered analysts, allowing for unprecedented access to data. Most companies these days know that cloud is the future with its low maintenance burden and infinite scalability. Collaboration is easier than ever.
But even with all these newfound capabilities, there’s no “easy” button that’ll do all these things for you. There’s no one-stop-shop for all data needs, free of coding. These are myths.
AI will almost definitely enable more “easy” buttons, but for the foreseeable future data warehousing is, by and large, a human-centric endeavor. The concepts are derived from human psychology. Analysts are human. I’m not educated on futurism enough to speak intelligently about what AI may or may not accomplish over the next twenty years, but I just don’t see data warehousing becoming a push-button exercise anytime soon. Even if it does, the human elements will remain. There’s really nothing better than the dimensional model for most reporting and analytic needs. It mirrors the way people think, and that’s not changing soon.
I recently read a Harvard Business Journal article suggesting we treat data like a product. I don’t disagree (it’s something I’ve also been saying for years), but when I got to the meat of the article, all it really was was repackaged data warehousing. Sure, there were bits in there about the Data Lake, but that’s not much different than sourcing a data warehouse from an ODS or any other data store. Creating a data product for Customer is no different than a conformed Customer Dimension. The Customer Dimension IS the data product of which they speak, and that concept has been covered to death by Ralph Kimball and countless others. Data warehouse architects already understand this.
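For anyone who hasn't lived in Kimball-land, here's a minimal sketch of what a conformed Customer dimension looks like, in Python rather than SQL to keep it self-contained. The table names, columns, and sample rows are all hypothetical. The point is only that one shared, governed Customer definition serves every fact table, which is the same job a Customer "data product" is supposed to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerDim:
    """One row of a hypothetical conformed Customer dimension."""
    customer_key: int  # surrogate key owned by the warehouse
    customer_id: str   # natural key from the source system
    name: str
    segment: str


# The single, shared Customer definition (illustrative rows only).
CUSTOMER_DIM = {
    1: CustomerDim(1, "C-100", "Acme", "Enterprise"),
    2: CustomerDim(2, "C-200", "Bolt", "SMB"),
}

# Two fact tables from different business processes, both keyed to the
# same dimension. Each row is (customer_key, measure).
SALES_FACT = [(1, 500.0), (2, 120.0), (1, 250.0)]  # measure: sale amount
SUPPORT_FACT = [(1, 3), (2, 1), (2, 4)]            # measure: tickets opened


def rollup_by_segment(fact_rows):
    """Total a fact table's measure by the dimension's segment attribute."""
    totals = {}
    for customer_key, measure in fact_rows:
        segment = CUSTOMER_DIM[customer_key].segment
        totals[segment] = totals.get(segment, 0) + measure
    return totals


if __name__ == "__main__":
    # Because both facts share one Customer definition, "Enterprise" means
    # the same set of customers in both reports.
    print(rollup_by_segment(SALES_FACT))
    print(rollup_by_segment(SUPPORT_FACT))
```

Sales and support roll up by the same segments because they share the same dimension, so the two reports can be compared side by side without anyone arguing over who counts as an Enterprise customer.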
The challenge these days really isn’t about thinking of data as a product. The challenge as I see it is governance, same as it’s always been. Governance takes time; it’s tedious and slow, and that tends to give way to deadlines, budgets, and ignorance. People need the data fast; they don’t have time to standardize it. They don’t know how to efficiently store it. They have a reporting tool that can connect directly to the source, so they just do that instead. They keep data in spreadsheets rather than institutionalizing it within a centralized system. But when you’re standing in front of an executive and your number doesn’t match with Sally in accounting’s number, how do you know you’re right? How do you explain that we have two different versions of the same thing? How do you regain the trust you lost by producing bad data?
I ask…is it really cheaper? Are the shortcuts worth the havoc they wreak on our reputation as data stewards? Quick and dirty is, just as it reads, both quick and DIRTY. That dirt equates to tech debt, and that’s gonna cost, with interest.
Gone are the days of the monolithic data warehouse storing ALL of the company’s data. Now our data is spread across disparate systems; stored on different platforms and at varying quality levels. Not all data needs to be, or should be, in the data warehouse. But make no mistake about it, the body of data itself is still a single logical instance, and as such needs to be distributed across your chosen platforms in a way that reduces duplication and logic-forking, optimizes storage and compute, and remains easy to use for everyone in the organization.
I’ll say it a different way: the data needs remain the same. The tooling has changed. I’d argue that the data needs have never really fundamentally changed over the years. It isn’t a new thing for users to want data as close to real time as possible; we just couldn’t oblige. Enabling technology is only feeding a need that’s been there since the start.
Look, it’s entirely possible that the authors of that Harvard Business School article already understand all this and are just repackaging this “data product” idea from original Kimball methodology because they feel it’s more business-friendly language and won’t come across as boring old data warehousing. I was even told by one employer to never say the term “data warehouse” because folks higher on the food chain saw data warehousing as expensive and failure-prone. Expensive, maybe, but failure is earned. I suppose repackaging might hold some value, but I don’t like it. We data warehouse architects shouldn’t be afraid to stand up for the methodology and defend it against its detractors. Data warehousing has been delivering on its promises for nearly thirty years and we shouldn’t have to repackage it so that someone’s misconceptions about data warehousing can be protected.
You can’t blame data warehousing for poor management or incompetent practitioners. The methodology works. Like anything, it has pros and cons, but IT WORKS. That’s a fact that can’t be debated. Whether or not it’s right for you and whether or not you’re skilled enough to implement it yourself or are savvy enough to hire the skills is another story altogether, one which I’ll touch on in a subsequent post.
As data warehouse architects, we’ve always understood that data is a product. The challenge for us now is managing the data outside the warehouse. From a technical perspective, it’s much less difficult to manage because we don’t have to apply the same kind of exhaustive methodology that warehoused data requires. The processing burden of the data might be less expensive, but the tracking/corralling burden of this new information is pretty challenging. That’s the challenge I’ve been facing for the past decade, and the journey isn’t complete.
Data lake, ODS, HDFS, EDW…these systems all now represent our logical enterprise data platform. Where should the data live and how do these systems work together at scale? I’ll talk more about this in subsequent posts. This is where data governance shines.
Data warehousing has survived its detractors, technology changes, multiple smear campaigns, and a shift to the cloud. Through it all, the data warehouse has delivered on its promises.
I believe it will continue to do so for years to come.