The new, more frequent releases of generally accessible generative AI models as a service have triggered intense interest among incumbent Internet leaders, VCs, and entrepreneurs alike. After a bit of ML fatigue from the big data boom, the door to once again dream about endless opportunities, and what could become real one day, has swung open. We are in a renewed and exciting wave of ML/AI hype.
As usual, new tech innovation buzz has a tendency to eventually seep into enterprise boardrooms. So, once enterprises start to emerge from their recent, unfortunate but necessary cost-cutting strategies, expect a big push on roadmaps and delivery of generative AI - for the enterprise! I feel and share the excitement of the community - how could I not - but I also feel anxiety. Enterprises have barely recovered from last decade's big undertaking to move to big data and 'self-service' data science. Few have succeeded in getting all the way. ML is barely (if at all) operationalized even in early-adopter big enterprises today. Mid- and late-blooming industries, as well as the underserved mid-market, have just dipped their toes in.
I’ve spent some brain cycles lately thinking about how to help enterprises with what is coming. How as an enterprise executive could you prepare? What will generative AI for enterprise look like? To what extent is generative AI even adoptable by enterprise as it stands today? What would be the right level of futurism vs. reality? So many questions keep me intrigued and will for some time - there are so many challenges and hence so much opportunity ahead.
What follows in this multi-part blog series is a collection of some of the challenges that I see written on the wall. Hence also my answer to the frequently received question: “If I were an entrepreneur, what company would I start today?”
Data Access
It is important to recognize that generative models (like any other ML model) still need lots of clean, representative data and lots of training, even if that training can be unsupervised or semi-supervised. It is also important to recognize that raw data comes in a vast range of types and categories and is often not organized, sorted, structured, or labeled - i.e. not ready for ML training.
The current open models are trained on the vast set of public data available on the internet. This is impressive, but a side effect of everything being open and available is that there is no real competitive moat. If you train a model on public data, the next startup or enterprise up the road can do the same. Hence, if you are building a business around generative AI trained on generally available data, only being first to market will matter.
A possible middle ground would be if you have unique industry or workflow knowledge that only a few can have, i.e. some kind of domain expertise, and can produce a winning workflow-focused user experience on top of that public data. You could also win on user adoption if you become the industry standard for that workflow.
I still believe AI over public data is a weak business - unless you are indeed first to market in a meaningful way (like OpenAI). What would be much more interesting, to me, is building generative AI (or ML in general) over proprietary data, or data that only you have access to and have the rights to use for ML purposes. In this scenario you have unique access to data that no one else has, and it would be hard for others to compete with you. That is a winning startup in my view. Proprietary data + ML = yes please.
Data Accuracy
People talk about model accuracy. That is important, but it is critical to understand that your model can only be as good as the data you train it with. There are plenty of examples where AI came to incorrect conclusions, or to what seemed like right answers based on the wrong premises (we all remember the husky-vs-wolf classifier that turned out to be identifying snow, don't we?). If the data you train the model on is incorrect or non-representative, you can end up with costly mistakes, even if the model's accuracy seems high.
For a concrete example, think of training "the next search engine" on all public internet data, where the word 'research' is used loosely and fake news is intertwined with fact. Just because it is "AI" generating the response, and we are wooed by the magic of its human-mimicking capabilities, we should not - in its current state - trust its answers more than those of a TikTok influencer. It may be good for quick lookups or for helping draft your own blog, but for education and other use cases where data reliability is key, what is the accuracy level of the answers provided? What if you make business decisions on this kind of data? How much of the publicly available data is accurate and trustworthy? And on a parallel track, how will that trustworthiness shift now that anyone can generate fake images, fake audio, and fake facts easily and quickly?
If someone can come up with ways to validate data and its “trustworthiness”, they will have a great future ahead serving enterprises. I am actively looking for them. Fake-data detection = yes please.
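Full trustworthiness validation is an open problem, but one small building block already exists: near-duplicate detection, which can flag content farms recycling the same text at scale. The sketch below is a toy illustration, not a product; the function names are mine, and the word-shingling-plus-Jaccard approach is just one classic heuristic.

```python
def shingles(text, k=5):
    """Break text into overlapping word k-grams ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_near_duplicates(docs, threshold=0.5):
    """Return index pairs of documents that are suspiciously similar."""
    sets = [shingles(d) for d in docs]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The pairwise loop is quadratic, so a real pipeline would swap in MinHash or locality-sensitive hashing, but the idea - score textual overlap, then flag suspicious clusters before training on them - stays the same.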
Data Representativeness
Another challenge with the current open large language models is that they are trained only on data from the past - real-time data inclusion is not there yet. We can see the writing on the wall that it may arrive soon, but today there is no mature technology to cover 'now'. If the source data is incomplete and not representative enough of all desired outcomes, it will lead to costly mistakes. This has been a problem with synthetic data for a while, and as far as I am aware it hasn't been solved yet. As a result, there continues to be a high risk that generative AI will be inaccurate or incomplete.
Not to repeat myself, but: if someone can come up with ways to validate data and in this case its “shape” and “representativeness”, they will have a great future ahead serving enterprises.
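To make "shape" and "representativeness" slightly less abstract: one of the simplest checks is comparing category proportions in your training sample against a reference population and flagging categories that are over-, under-, or un-represented. A minimal sketch, assuming you can label each record with a category (the function name and tolerance are mine):

```python
from collections import Counter

def representativeness_gaps(sample, reference, tolerance=0.10):
    """Compare category shares in a training sample against a reference
    population; report categories whose share deviates by more than
    `tolerance` (absolute difference in proportion)."""
    s_total, r_total = len(sample), len(reference)
    s_share = {k: v / s_total for k, v in Counter(sample).items()}
    r_share = {k: v / r_total for k, v in Counter(reference).items()}
    gaps = {}
    for cat, r in r_share.items():
        s = s_share.get(cat, 0.0)  # 0.0 means the category is missing entirely
        if abs(s - r) > tolerance:
            gaps[cat] = {"sample": round(s, 3), "reference": round(r, 3)}
    return gaps
```

Real systems would use proper statistical tests (chi-squared, KL divergence) and, crucially, a trustworthy reference distribution - which is exactly the hard part the prose above is pointing at.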
Data Quality
As a result of the hype, the intensified focus on ML in boardrooms and enterprise executive meetings will undoubtedly result in more funded ML projects. But at this early stage of the ML-driven enterprise era, we still have many bottlenecks to getting ML, generative or not, into a fully automated, CI/CD-style pipeline - from idea all the way through production maintenance cycles. I'll come back to scaling ML in one of my next blog posts, but for now let's agree that to scale you need to remove bottlenecks. One bottleneck today for any ML project is getting to a high-quality data set to train models on. It could entail labeling. It could entail quickly understanding which data sets are available and how they relate. I have seen interesting companies exploring data and its users as a graph, using ML itself to label data for ML, and other innovative ideas. More of that, please! This is not an easy problem to solve. I predict this will be one of the first bottlenecks to explode - or rather re-explode, as I already experienced it when the "big data" movement first got wind. In this new wave of enterprise ML hype it will be intensified.
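The "getting to a high-quality data set" bottleneck usually starts with something mundane: profiling what you have. As a toy illustration (field names and thresholds are mine, not from any particular tool), a first-pass quality report over a batch of records might measure missing-value rates and label coverage:

```python
def profile_quality(rows, required_fields, label_field="label"):
    """Tiny data-quality profile for a list of dict records:
    per-field missing-value rate and overall label coverage."""
    n = len(rows)
    missing = {f: sum(1 for r in rows if r.get(f) in (None, "")) / n
               for f in required_fields}
    labeled = sum(1 for r in rows if r.get(label_field) not in (None, "")) / n
    return {"missing_rate": missing, "label_coverage": labeled}
```

A report like this is what tells you whether the project's next funded sprint should go to modeling or to labeling - which is exactly where the bottleneck bites.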
Let me know your thoughts on the importance of data for the new wave of ML. In the meantime, you can look forward to my next blog post, which will continue to elaborate on my ML thesis.
Hey Eva, given your thoughts on data quality, I'm curious about your take on the data observability space and the data reliability engineering movement. I've been a fan of Monica Rogati's data science hierarchy of needs (https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007), and my current bet is on the middle layers of this space.