20 years of Data — Where have we been, where are we going?
Here are five best practices for the next two decades.
You’ve heard the statistics. According to DOMO, Inc. (a computer software company), approximately 2.5 quintillion bytes of data are created every day. But like the national debt, the number is so large as to be inconceivable.
For example, if you counted one number every millisecond, it would take roughly 80 million years to reach 2.5 quintillion. It gives a whole new dimension to the word “exponential.”
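A quick back-of-the-envelope check of that figure, counting once per millisecond (a minimal sketch in Python; the only inputs are the constants shown):

```python
# Counting to 2.5 quintillion at one count per millisecond.
counts = 2.5e18                               # 2.5 quintillion
ms_per_year = 1000 * 60 * 60 * 24 * 365.25    # milliseconds in an average year
print(f"{counts / ms_per_year:,.0f} years")   # roughly 79 million years
```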
We didn’t get to this point overnight.
Although it often feels like we blinked and suddenly found ourselves drowning in data, with the flood waters rising faster than we can swim, we forget that this is a recent phenomenon, a 21st-century challenge. How we have managed this commodity over time may help us understand how to manage it in the future.
So, as we approach the third decade of this century, here is a quick look at where we’ve been so that we can chart new directions for where we’re going.
For years, we have been trying to create systems that house important records for customers, products, suppliers, and employees. We began by putting these invaluable records into databases.
Soon these databases grew into stores that we called data warehouses because they were large and organized in a way that let us “pick” the right dataset when we needed it.
We had data dictionaries that guided us and data curators that took care of our libraries of data.
In the first 20 years of this century, four key factors changed all of that:
1) Data continued to grow — exponentially,
2) Types of data collected changed and became more varied,
3) Cloud applications created data in different places across numerous databases and
4) Control and management of data became more and more expensive.
As the data grew, of course, we didn’t stop collecting it. In fact, we invented new sources: data from internet activity, from refrigerators, cars and cell phones and, of course, data about our bodily systems. As we began to recognize the enormity of our data volumes, we started looking for new ways and places to store it all.
Big data, as we call it, just needed to be stored, curated and managed for projects that could show, for example, how influential your websites were in moving products, or infer which customers might intend to purchase them.
Big data had some characteristics that our first collections of structured data didn’t – it came to us faster and in much different forms.
Other data might be just as big in volume, but internet transactions happen instantly and don’t always fit a column-and-row structure. We needed a different kind of system to make information like this immediately available to those analyzing websites and traffic.
Enter Hadoop and the world of big data management systems.
Soon, the IT departments saw that they had an issue with how to manage this information.
They couldn’t put Hadoop into the data warehouse environment; the access tools were different, and the usage patterns even more so.
The data was no longer structured like relational data warehouse data, organized into columns and rows. It was messy, and it needed tools very different from the data warehouse tools that fed dashboards in Tableau or Business Objects, for example.
The Hadoop systems were used to analyze information quickly, make decisions and then act on those decisions at the point where the data was collected – the website or the customer response center application.
The role of a Data Lake
So, what was the IT department to do?
They needed to keep control and protect the data, but they now had at least two structures – a data warehouse and an unstructured data source. So, IT invented the Data Lake.
A data lake is just an accumulation of all structured and unstructured data in an enterprise — put into one location.
Data lakes are important. But let’s face it, data lakes become data swamps very easily, and the proliferation of data lakes throughout an enterprise can make a company feel like Minnesota, the Land of 10,000 Lakes.
Although they are watering holes where analysts can get the data they need, data lakes encourage analysts to create their own datasets, often duplicating one another’s work within the lake or spawning data marts that are pulled outside the lake to form “ponds.”
Data lakes weren’t organized like data warehouses, so there was freedom, but also frustration. A data lake became the answer for organizations seeking corporate IT support without corporate control.
Enter data governance programs
In this environment, enter data governance programs.
Governance programs traditionally provide guidance to departments or data users for how the data should be managed for accuracy, completeness, consistency and adherence to corporate guidelines for security and privacy. In providing this governed view, these programs also began to highlight the requirement for a corporate view of entities such as customer, employee, product, vendor, supplier and partner.
This “master view” is often the translator between systems for key domains such as customer. Sales may have one view, marketing another. Master data provides THE view that each can use as a starting point.
As master data practitioners began to rely on the governance of master data rather than outright control of it, they saw the need for tools to help.
The rise of governance tools gave the master data contingent the ability to know where data originated, who was responsible for it as it flowed through the data supply chain, and how complete, accurate and valuable it was. But these tools primarily gave organizations the ability to organize something that most don’t want, or don’t think they need, organized.
These tools gave rise to a new category called “data catalogs.”
Data catalogs began to put the power of mastering data into the hands of the analyst who needs the data at the point of business analysis. This point is usually at the edge of a system, downstream from where the data is created and many transformations away from the field where it was originally entered.
Mastering data at this point means that the analyst must make the decision about the information included in the data set.
For example, does the column “agent” from the supplier system mean the same as “partner” from the sales system? Can those be combined, or must they be treated differently?
Another way that analysts at the edge must master data is in reviewing the completeness, consistency and validity of the data set being analyzed.
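To make this concrete, here is a minimal sketch in Python (using pandas) of the kinds of checks an edge analyst might run. The file names, the “agent” and “partner” columns and the status values are illustrative assumptions, not references to any particular system.

```python
import pandas as pd

# Hypothetical extracts from two systems; file names and columns are illustrative only.
suppliers = pd.read_csv("supplier_extract.csv")   # assumed to have an "agent" column
sales = pd.read_csv("sales_extract.csv")          # assumed to have a "partner" column

# Completeness: what share of a key field is actually populated?
completeness = 1 - suppliers["agent"].isna().mean()
print(f"'agent' populated in {completeness:.1%} of supplier rows")

# Consistency: do the two systems share identifiers, or are they disjoint?
overlap = set(suppliers["agent"].dropna()) & set(sales["partner"].dropna())
print(f"{len(overlap)} identifiers appear in both systems")

# Validity: flag values outside an agreed reference list (assumed here).
valid_statuses = {"active", "inactive", "pending"}
invalid = ~suppliers["status"].isin(valid_statuses)
print(f"{invalid.sum()} supplier rows have an unexpected status")
```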
The person highlighted most in this new environment for data analysis is the data architect who must ensure that the systems supporting “edge” analysts can do so quickly and can scale when the business analysts outnumber the IT-trained professionals.
In today’s environment
In today’s environment, the data architect is the orchestrator of the complete data supply chain.
The data governance professional is the regulator and auditor. The architect, meanwhile, must manage the warehouse, the data lake and what is rapidly becoming the data river: all that data in constant streaming motion, from IoT devices to web interactions.
Working with the data governance professional, the data architect ensures that systems capture valid data appropriately, making it easy to capture it completely and consistently.
In addition, the architects must provide for tools and capabilities to measure the data as it progresses through the data supply chain. Much like an oil pipeline, the data must be monitored as it flows through each checkpoint or makes its transformations from application to application, from cloud to cloud.
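As a rough sketch of what such monitoring could look like, the snippet below logs simple row-count and null-rate metrics each time a dataset passes a named checkpoint. The checkpoint names, the threshold and the commented extract/transform/load calls are hypothetical, not a prescribed standard.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)

def checkpoint(df: pd.DataFrame, name: str, max_null_rate: float = 0.05) -> pd.DataFrame:
    """Log basic health metrics for a dataset as it passes a named checkpoint."""
    null_rate = df.isna().mean().mean()   # average null rate across all columns
    logging.info("checkpoint=%s rows=%d null_rate=%.3f", name, len(df), null_rate)
    if null_rate > max_null_rate:
        logging.warning("checkpoint=%s exceeds null-rate threshold %.2f", name, max_null_rate)
    return df   # pass the data through unchanged so checkpoints can be chained

# Illustrative flow (extract/transform/load are placeholders for your own steps):
# raw = checkpoint(extract(), "extracted")
# clean = checkpoint(transform(raw), "transformed")
# checkpoint(load(clean), "loaded")
```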
This is where the data governance team and their tools are helpful.
Then, at the end of the line, the business analyst should be able to manage datasets meaningful to the business and trust that the architect and the governance team have managed the data flow effectively. Putting a “golden record” into a data warehouse or a referential database does nothing to improve business decisions if that record is not immediately available to the analyst downstream.
Data today moves and changes depending on what application it is serving, and a master record is often viewed differently from one application to the next.
Cost
And then, of course, there are the costs.
For years, we have asked executives to pay millions of dollars to master data and make it available to all systems. What we should consider now going forward is a new way of ensuring data quality and appropriateness for new kinds of analyses.
With artificial intelligence and machine learning becoming more popular, we must ensure that the data our algorithms operate against is the best we can deliver to a variety of operations.
The adage – garbage in, garbage out – has never been more appropriate. Ensuring that the data we enter, transform and use is accurate is the only way to ensure that we derive the real benefits from AI and ML.
That will take a team of people, not just master data experts, analysts or architects.
Never in the history of data collection, management and use has it been more imperative for IT, business and data professionals to collaborate.
Here are five best practices for the next two decades:
- Get comfortable with the volumes of data and take advantage of the new AI and machine learning tools that will begin to help you find previously undetected patterns in these volumes.
- Look for more and more variety in the data collected. New data streams will begin to add more color to our algorithms as we take advantage of machine learning.
- Develop stronger skills in architecting data and its flow through your systems. If you don’t have a data architect team, get one. You probably have someone managing your product supply chain; why not someone managing your data supply chain? Both are important for managing your most valuable assets.
- Create citizen analysts and give business leaders back the ability to find the information they need when they need it. The tools are there; take advantage of them. It may take some training, but invest in it.
- Take a different view of master data. Consider that the “master” of any data element is an elusive concept and must be reevaluated in today’s environment. The point at which data is used is the most important in the data supply chain. Tools that help explain the lineage of a data element will be far more valuable than those that provide THE standard or a “golden record.”
And finally remember, data is one factor in decision making. It exists only to help us make better decisions. Those decisions we make, however, still rest with the business leaders who will chart the course for the next 20 years.