The tedious work of structuring and cleaning massive amounts of data, combined with scarce Data Science talent, is not going to satisfy the growing demand for faster business insights. ML will need to relieve ETL. Entrepreneurs wanted!
Why is it still hard for organizations to get value out of data? I have spent the last decade in the “big data” industry. A lot has changed in how data pipelines, data storage, and data processing are architected and deployed in large data centers around the world. But data quality and getting data into a usable form continue to be tedious problems. And with data now residing in more complex multi- and hybrid-cloud environments, getting to the right data sets quickly has only become harder.
There is a dire need to help enterprises extract value out of data: not just by providing tools to interact with or process data, but by actually enhancing the data itself. To quote a very insightful friend of mine:
“Regular ETL with a little ML applied would go a very long way.”
The big data wave started more than a decade ago. New technology made it possible to process large data sets on commodity hardware in a reasonable time. This opened up endless possibilities and made Machine Learning a viable option for any company, not just specialized labs or Fortune 100 enterprises.
Fast forward: the world’s largest organizations are drowning in their data lakes. Business leaders have been fighting to get ahead by drawing smarter conclusions from data (customer churn, predictive pricing and maintenance, supply chain optimization, you name it). These advanced, silo-breaking use cases have forced CIOs to rethink their data strategies and architectures. The rise of the data lake was a natural consequence of IT providing speedy solutions to serve the acute needs of the business.
Another friend of mine, in pharma, once told me:
“The only way to keep up with our business demands is to pour all data into the lake and let them have at it.”
And that is not an isolated case. Many companies I have worked with over the years have kept pouring data into their lakes and achieved plenty of business value doing so, but only where savvy data science talent could be found, or where the data remained structured. Others are stuck with their lakes and are still struggling to get a real return on investment, because they lack the cross-functional individuals who understand the business problems and the data domain, and who can wrangle data and program or build models on top of it.
If an organization has indeed found a Data Scientist who sits at the intersection of deep data domain expertise, solid computer science and programming skills, and a background in statistics and ML algorithms, it should consider itself lucky. It is clear that, in most organizations, there is a huge cliff between C-level ambitions and what is actually achievable by IT and dev teams, given the tech resources on the ground.
The good news is: where there is a cliff, there is tech and there are innovation opportunities. Here are my observations and thoughts on this cliff:
Democratize Data Science through Tooling
We have seen a sprawl of Data Science tools with drag-and-drop ML modules that aim to eliminate the actual programming step. But even with those, you still need to know what to do with the data to get it into the right form, and which models to apply, to get the right outcome. In a way, these tools have a great UX but are not a replacement for the talent needed.
Bring ML to the Structured Data
Return to the basics. Stay with decades-proven Data Warehouse data, but on a modern, cloud-native DW stack (to scale), and make it easier to apply ML directly to the data stored in the warehouse. This of course comes with the downside that you probably need to narrow the scope of ML use cases, but it enables traditional Analysts to quickly grow into some, however limited, ML space.
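To make this concrete, here is a minimal sketch of what “ML on top of the warehouse” can look like in practice: pull an already-modeled table over SQL and fit a simple churn classifier on it. The connection string, the customers table, and its columns are hypothetical placeholders, not a reference to any specific product.

```python
# Sketch: apply ML directly to structured warehouse data.
# Connection string, table, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

engine = create_engine("postgresql://analyst@warehouse.example.com/analytics")

# Structured, already-modeled data: no wrangling step needed before modeling.
df = pd.read_sql(
    "SELECT tenure_months, monthly_spend, support_tickets, churned FROM customers",
    engine,
)

X = df[["tenure_months", "monthly_spend", "support_tickets"]]
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

The point is not the model itself, but that an Analyst comfortable with SQL is only a few lines away from a useful prediction when the data is already clean and structured.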
Relieve the Data Scientists with Automation
A big part of what Data Scientists spend their time on is creating new, interesting data sets that can then be analyzed, used in various business use cases (mostly dashboards), or used to train new models. What if you could automate that part? Imagine a service you can point at raw, common data sources and that hands back data that is cleaned, combined, linked, categorized, and brought into a somewhat standardized, or at least usable, state. Is there any innovation out there that eliminates wrangling, perhaps with the help of ML pattern recognition? Something as simple as a data set descriptor and categorizer, perhaps? I call it “Data as a service”.
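As a toy illustration of the “data set descriptor and categorizer” idea, the sketch below profiles each column of a raw file and emits a machine-readable description. It is purely illustrative: a real service would layer ML-based type and entity recognition on top of this, and the input file name is a placeholder.

```python
# Toy "data set descriptor": profile each column of a raw CSV and emit a
# machine-readable description. Illustrative only; a real service would add
# ML pattern recognition for richer categorization.
import json
import pandas as pd

def describe_dataset(path: str) -> dict:
    df = pd.read_csv(path)
    profile = {"rows": len(df), "columns": {}}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            kind = "numeric"
        elif s.nunique(dropna=True) <= max(20, 0.05 * len(s)):
            kind = "categorical"
        else:
            kind = "text"
        profile["columns"][col] = {
            "kind": kind,
            "missing_ratio": round(float(s.isna().mean()), 3),
            "distinct_values": int(s.nunique(dropna=True)),
        }
    return profile

if __name__ == "__main__":
    # "orders.csv" stands in for any raw, common data source.
    print(json.dumps(describe_dataset("orders.csv"), indent=2))
```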
Another path would be for model training and model subscription to become a service for companies that can’t find or afford Data Scientist teams. You stream your data through the service to train models for common use cases, or you simply subscribe to models trained on publicly available data.
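From the consumer side, such a service might look something like the sketch below: stream rows to a hosted training endpoint, then call the resulting model for scoring. The service URL, routes, and payloads are entirely hypothetical; no such API is being referenced here.

```python
# Hypothetical "models as a service" client: train on streamed rows, then
# subscribe to the resulting model for predictions. Endpoints are made up.
import requests

SERVICE = "https://models.example.com/api/v1"

# 1. Stream (batched) training rows for a common use case, e.g. churn.
rows = [{"tenure_months": 14, "monthly_spend": 52.0, "churned": 0}]
job = requests.post(f"{SERVICE}/train", json={"use_case": "churn", "rows": rows}).json()

# 2. Subscribe to the trained model (or one trained on public data) and score.
score = requests.post(
    f"{SERVICE}/models/{job['model_id']}/predict",
    json={"rows": [{"tenure_months": 3, "monthly_spend": 120.0}]},
).json()
print(score)
```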
Perhaps this is too hard to do, or only possible for very narrow domains, but I think it is time for the next evolution of Data and Data Science: Data 2.0.
Bottom line: ETL pipelines are still ETL pipelines, even though the volume of data has exploded and it now resides in a hybrid world. Yet very little innovation has gone into automatically generating interesting data sets. Not everyone wants to deal with Data Science themselves anymore; they want it pre-packaged and as a service, so why not make it simpler for these companies?
The next wave should be Data 2.0: data, data science, and models as a service. Think auto-categorized, standardized (JSON?), ML-enhanced data set generation, description, discovery, and subscription. Would you agree?