This is the second post in a series on challenges I see coming for enterprises with regard to ML. Feel free to read my previous post as well, which highlighted the critical innovation areas of:
Data Access
Data Accuracy
Data Representativeness
Data Provenance
Imagine being an enterprise whose business relies on data, be it your own data, data produced by your customers, or data scraped from the web. Some or all of that data then gets used in ML models to produce business value. How do you prove that your specific data contributed to the value of an ML-model-based business? How does that impact the sharing of the model's monetized value? Further, let's say you end up in a situation where you need to prove which data influenced a decision. How can you easily achieve traceability through a growing mesh of models?
The world we are heading towards will make it harder to find out where data originates and almost impossible to trace it through various pipelines, models, and outcomes. If you are a data business, or if you need data auditability, you will have to worry tenfold about keeping your data protected, and you will need to start worrying about data traceability.
Data Observability
This is a trend that has been going on for a while, but it will intensify in this new wave of ML focus. If you start applying auditability and traceability to ML-model hierarchies in order to trace the data, you will almost certainly also need to know whether the data was corrupt or missing, or whether a source suddenly shifted its metrics. You have to prove where things went wrong, quickly. At the same time, the expected increase in complexity as ML hierarchies grow to fuel more automated decision-making pipelines makes data tracing even more important. You become much more sensitive to, and much more at risk of, impacting business outcomes without being aware that the data has suddenly shifted. ML model guardrail tech and data observability will be on the rise, again.
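To make the "data has suddenly shifted" idea concrete, here is a minimal sketch of one common observability guardrail: flagging a new metric value whose z-score against recent history exceeds a threshold. The function name, the metric, and the threshold are all illustrative assumptions, not a reference to any particular tool.

```python
from statistics import mean, stdev

def detect_shift(history, new_value, z_threshold=3.0):
    """Flag a value whose z-score against recent history exceeds the threshold.

    This is a toy guardrail: real observability tools track many such
    metrics (row counts, null rates, distributions) per pipeline stage.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return new_value != mu
    return abs(new_value - mu) / sigma > z_threshold

# A stable daily row count suddenly doubles: the guardrail should fire.
daily_row_counts = [1000, 1010, 990, 1005, 995, 1002, 998]
print(detect_shift(daily_row_counts, 2000))  # sudden shift
print(detect_shift(daily_row_counts, 1003))  # within the normal range
```

The point is not the statistics but the placement: a check like this has to sit at every stage of the pipeline, so that when a model downstream misbehaves, you already know which stage shifted first.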
Copyright in a Generative AI World
Who owns generative AI content? If an image generated by generative AI needs a seed image to generate from, who owns the end image? If you don’t have rights to the seed image, do you have rights to the output image? On some level, isn’t the output image an ML-generated combination of a vast set of known images? Isn’t the output made up of pieces of existing data that were used to train the model? How different will an image need to be in order to be considered a new image? In this new world of ML-generated output, can copyright be preserved? The ML model is a black box as to where the “inspiration” came from. I predict a new era of licensing and copyright will have to come into play for ML. And again, data needs to be traceable, even through the ML model.
I think this may not be urgent for enterprises until the first major lawsuits happen, but when the claims start trickling in, a pattern similar to cyber security’s early days will certainly emerge. My prediction is that it will rapidly become hot to find companies that validate generative-AI output as truly (or sufficiently) unique, or that, on the other hand, help validate its origin(s).
A temporary workaround may be to develop features that facilitate “similar to this” internet lookups, serving the user with very basic validation means while pushing the responsibility onto end users. But for enterprises this will not be enough; they can’t make their employees responsible. Maybe there will be a tie-in with blockchain here, to really be able to trace who owns the rights to an image? We can of course also extrapolate this issue of image origin to voice, video, face, or body-movement patterns, or to the data an individual generates. So far, I have not seen anyone focused on Enterprise + copyright + blockchain in relation to generative ML just yet. Time will tell.
What are your thoughts on these topics? Is anyone already working on these tracing issues? What is truly unique - even in a human context? I would be curious to know. In the meantime, stay tuned for my next AI thesis blog post.
Data traceability and data observability are linked, in my opinion. When data observability finds a problem or an anomaly, the lineage (another name for data provenance) helps make sense of all the signals. There are lots of metrics observed in different parts of a pipeline. Lineage enables data/ML operations folks to correlate different signals by following dependency chains to root causes. So if an ML model starts producing strange results, the lineage can be used to help identify the origin, going back through the features, the data model, the ETL, the ingest, or even the source of the data.
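Following dependency chains to root causes can be sketched as a simple graph walk. The lineage graph below is entirely hypothetical (the node names are made up for illustration): each node maps to its upstream dependencies, and a breadth-first traversal from an anomalous model enumerates every upstream candidate down to the raw sources.

```python
from collections import deque

# Hypothetical lineage graph: each node maps to its upstream dependencies.
lineage = {
    "churn_model": ["feature_store"],
    "feature_store": ["etl_job"],
    "etl_job": ["ingest_raw_events", "ingest_crm_export"],
    "ingest_raw_events": ["source_clickstream"],
    "ingest_crm_export": ["source_crm"],
    "source_clickstream": [],
    "source_crm": [],
}

def trace_upstream(node, graph):
    """Breadth-first walk from an anomalous node back to its root sources."""
    seen, order = set(), []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(graph.get(current, []))
    return order

print(trace_upstream("churn_model", lineage))
```

In a real system the graph would be recorded automatically by the pipeline tooling rather than hand-written, and each node would carry the observability metrics discussed above, so the walk doubles as a root-cause checklist.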
At the end of the day, good AI/ML models require good data!