This is the third blog in a series on some of the challenges I see coming for enterprises with regard to ML, including Generative AI. If you want to read my previous blog you can find it here, or you can start from the beginning.
Two things to keep in mind:
As various Generative AI models (and other types of ML models) emerge as services accessible through APIs, they will start to get integrated, both with each other and with production systems. The future will entail an ML mesh.
A few years ago, you were a progressive company if you had 5 or 10 machine learning models running in production for business-critical workloads. That number has since increased to about 20-25. Fueled by the new hype, executives I have talked to mention numbers in the hundreds (some even envision thousands).
This is why I feel anxious. I have seen this before. I have felt these pains in earlier shifts: the migrate-to-cloud shift and the offload-to-SaaS shift. It starts with a few workloads and a few SaaS applications, but before you know it you are looking at a 100x more interconnected environment and hundreds of SaaS applications that get harder to manage and control. Costs explode. Vulnerability surface areas and the associated risk increase.
What will the ‘productionize hundreds of ML models’ aftermath look like? Let me paint a few scenarios to illustrate what I am seeing.
Increased Complexity: If one model in a hierarchy or mesh of hundreds of models (and/or outsourced ML services) suddenly starts misperforming or gets compromised, how will you know? How will you debug? How long before you actually realize? Will you immediately know to look at the right model?
Compare that with today’s complexity on this dimension: the team is small enough (five people?) to know the code, details, and data involved in all ten or so models and projects. Most of the models may not interact with each other; maybe 20-30% do, meaning you only have to debug 2-3 models as a group to figure things out.
Tomorrow: will the team have the same in-depth know-how of every model and the data involved? Can a human oversee 20-30 interacting models, easily debug the ML, and stop security vulnerabilities in time, before they cause real damage?

Side-effects of differently skilled users: As ML goes as-a-service, it widens the user audience to skill levels beyond data science. Developers and application builders assume the same serviceability as any other SaaS service and are not expected to have deep ML or model-change know-how. A simple change in the application layer or in the contributing data could have an extreme impact on compute or model size in the ML layer. How will cost controls and application performance controls become more intuitive and transparent across ML-as-a-service integrations with customer applications? How will the gaps be addressed for a less-ML-savvy audience?
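To make the cost-transparency gap concrete, here is a minimal sketch of a metering wrapper an application team might put around a remote model endpoint, so that latency and an estimated spend are visible without ML expertise. Everything here is hypothetical: the client class, the flat per-call pricing, and the fake backend are illustrative stand-ins, not a real vendor API (real services typically bill per token or per unit of compute).

```python
import time
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    """One recorded call to a (hypothetical) ML-as-a-service endpoint."""
    model: str
    latency_s: float
    est_cost_usd: float

@dataclass
class MeteredModelClient:
    """Illustrative wrapper that records latency and a rough per-call
    cost estimate for each request, so application builders can see the
    impact of their integration choices."""
    model: str
    cost_per_call_usd: float  # assumed flat pricing; purely illustrative
    records: list = field(default_factory=list)

    def predict(self, payload, backend):
        start = time.perf_counter()
        result = backend(payload)  # stand-in for the real remote API call
        latency = time.perf_counter() - start
        self.records.append(CallRecord(self.model, latency, self.cost_per_call_usd))
        return result

    def total_cost(self):
        return sum(r.est_cost_usd for r in self.records)

# Usage: a fake backend stands in for the remote model service.
client = MeteredModelClient(model="summarizer-v1", cost_per_call_usd=0.002)
for text in ["doc one", "doc two", "doc three"]:
    client.predict(text, backend=lambda p: p.upper())
print(f"{len(client.records)} calls, est. cost ${client.total_cost():.4f}")
```

The point of the sketch is the design choice, not the arithmetic: when metering lives at the integration boundary, cost and performance become visible to the application team at the moment of the change, rather than surfacing weeks later on a cloud bill.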
Know-how gaps and cost hangovers: Training ML requires certain resources and skills. Productionizing the model requires different kinds of resources and skills. The handoff between the two worlds is a growing pain and bottleneck, not just in time to market but in cost efficiency too: a lot of code needs to be rewritten, and what works in a lab environment may become too costly in production. Auto-optimization is on the rise already, and that is without ML specifically in the mix. Your cloud cost hangover will become worse with more ML in the mix (which is why I am excited about companies addressing this as well, such as CloudNatix). As a side note and comparison, an ML expert from an ML-famous enterprise told me that parts of the training of ChatGPT (including cloud costs) were estimated at >20M USD. That was not including the human effort and data preparation needed.
What I have listed above is by no means an exhaustive list of concerns related to ML at scale, but merely a few examples of the production vulnerability and scale challenges that will need to be addressed in this era of ML at scale. Ease of interoperability, and visibility and traceability of multi-model deployment health, are just a few of many related problems. We went through similar things going from 10 to hundreds of API integrations, and from fixed data center environments to a dynamic multi- and hybrid-cloud production environment. More issues will arise as the automated world goes ML-hybrid (i.e., ML, multi-ML, non-ML, and external ML-SaaS). There is lots to worry about, and therefore tremendous opportunity to innovate.
What do you think will be the hardest AI scale challenges to tackle? Please share your thoughts, and point any early-stage startups my way that you think can help with the enterprise ML headaches ahead. The next blog will most likely be my last on this particular investment thesis of mine, unless you request more. :)
Stay tuned for the final piece.