Provenance as key technology to reduce information pollution

To verify the veracity of information, it is important to assess its authenticity and authoritativeness, and tracking the origins of information is important for understanding both. So far, authenticity and authoritativeness have largely been assessed only by examining the content itself, which has been the focus of media literacy programmes. However, information pollution has made it increasingly difficult for non-experts to assess the authenticity and authoritativeness of information, a problem that is likely to become worse. This document introduces the concepts, discusses existing efforts, and finally outlines the commitment needed from governments.


The term provenance is commonly used in technical literature to describe the practice of tracking not only who and what originates information, but also who and what changes it. Conventionally, provenance is metadata that follows the information.

Consider an image, shot with a certain camera by a certain person. The camera information and an identifier of the photographer should in most cases be available as provenance metadata. If the image is then photoshopped, at least the fact that it has been photoshopped, and by whom, should be added to that metadata. If the image is then published in social or editor-run media, the provenance information should be checked to see whether it is consistent with the content of the image.
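The image example can be sketched in code. The following is a minimal illustration only, not the C2PA or PROV data model: the record fields, the agent and tool identifiers, and the function names are all assumptions made for this sketch. Each record names who did what with which tool, carries a hash of the image as it looked after that step, and points to the digest of the record before it.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProvenanceRecord:
    # All field names are hypothetical, chosen for this sketch.
    action: str        # e.g. "captured" or "edited"
    agent: str         # who performed the action
    tool: str          # what was used, e.g. a camera or an image editor
    content_hash: str  # SHA-256 of the image bytes after this action
    previous: str      # digest of the preceding record, "" for the first


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def record_digest(record: ProvenanceRecord) -> str:
    # Sorted keys so the same record always yields the same digest.
    encoded = json.dumps(asdict(record), sort_keys=True).encode("utf-8")
    return sha256_hex(encoded)


def append_record(chain, action, agent, tool, content):
    # Return a new chain with one more record linked to the last one.
    previous = record_digest(chain[-1]) if chain else ""
    record = ProvenanceRecord(action, agent, tool, sha256_hex(content), previous)
    return chain + [record]


def chain_is_consistent(chain, published_content):
    # Check the links between records, then check that the last record
    # describes the content that was actually published.
    if not chain or chain[0].previous != "":
        return False
    for earlier, later in zip(chain, chain[1:]):
        if later.previous != record_digest(earlier):
            return False
    return chain[-1].content_hash == sha256_hex(published_content)


original = b"raw image bytes"
edited = b"retouched image bytes"

chain = []
chain = append_record(chain, "captured", "photographer:alice", "camera:example-model", original)
chain = append_record(chain, "edited", "editor:bob", "software:image-editor", edited)

print(chain_is_consistent(chain, edited))    # True
print(chain_is_consistent(chain, original))  # False
```

A publisher receiving the edited image together with this chain could run the check before publication. Hash links of this kind only show that the records fit together and match the content. They do not show that the named photographer or editor really produced them, which in practice would require digital signatures or a comparable mechanism.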

To allow citizens to assess the authenticity and authoritativeness of information, there should in most cases be tools that help them verify the correctness of the provenance metadata. In general, the public should come to understand that information with verifiable provenance metadata is more trustworthy than information without it. However, attaching such metadata is clearly not desirable in all cases, because in some cases the metadata is highly sensitive. Where the information comes from a whistleblower, the metadata must be removed. In such cases, free editor-run media are important, as they can assess the veracity of the information, and the public would then need to rely on the reputation of the media to assess its trustworthiness.

Recommended reading and existing initiatives

Henry Story has written quite extensively on the role of provenance in his work on Epistemology in the Cloud. In an OpenAI-sponsored paper titled Generative Language Models and Automated Influence Operations: Emerging Threats and Potential Mitigations, the authors describe the outcome of a symposium where potential mitigations were discussed. The paper discusses provenance, noting that:

Because technical detection of AI-generated text is challenging, an alternate approach is to build trust by exposing consumers to information about how a particular piece of content is created or changed.

They also reference a prominent technical approach by industry actors, who have formed the Coalition for Content Provenance and Authenticity (C2PA) and have produced technical specifications, including a harms modelling overview.

An earlier and more academic standardization effort was undertaken under the auspices of the World Wide Web Consortium to produce the PROV family of documents. It did not see very extensive implementation, but I have used it to provide an unbroken chain of linked provenance data from individual changes to my own open source contributions, through my packaging, and on to Debian packages. Unfortunately, the data sources this relied on are now defunct.

Nevertheless, it is clear that the main challenges are not technical. Conceptually, provenance is quite well understood, and if it is deployed on a large scale, it could have a significant impact. It is therefore much more important to discuss the implications for how technology is governed.

Normative Technology Development

Whatever technology choices are made in this area will set down societal norms. This illustrates a crucial point: law, in the form of regulation, is not the only thing that sets norms. There are also cultural, religious and biological norms, as well as technological norms. As the C2PA coalition analyzed harms, they made choices as to what should be considered a potential harm. In designing a system, those choices informed their technical decisions, and thus formed norms. This is not to say that the choices they made were wrong, but they are to a great extent political choices that should have been influenced by elected representatives.

Now, given that they have significant implementation capacity, it makes sense for the companies that form this coalition to be represented too, but their representation should have been balanced by other participants that have a democratic mandate. For these participants to have an impact, they too must have implementation capacity, so that they can demonstrate that the solutions proposed by those with a democratic mandate can be implemented and make a difference.

Doing this in open ecosystems, with non-exclusionary terms so that the results become Digital Commons, is essentially what I refer to as normative technology.

With strong public engagement, democratic institutions will also gain insight into the details of the technology, which will greatly assist lawmakers, as these norms may eventually be written into law. Moreover, democratic institutions will need to assist in the dissemination of the technology throughout society, as this is probably the most difficult obstacle to provenance technology having societal impact.