The popular understanding of artificial intelligence imagines objective, highly technical and clinical scientific methods that work efficiently to the user’s advantage. We think of automated processes, large and powerful computers, vast servers in neat racks in laboratory-like buildings. One thing we seldom think of when imagining AI is racist, misogynistic and elitist world-views. What American photographer Trevor Paglen does brilliantly in his 2019 exhibition at the Barbican, From ‘Apple’ to ‘Anomaly’ (Pictures and Labels): Selections from the ImageNet dataset for object recognition, is to uncover the obscure underlying frameworks that determine the ways AI systems read images. Through a process of archaeological[1] research into the taxonomy of the ImageNet dataset – an image database with more than 14 million entries employed by Paglen and his collaborator Kate Crawford in their enquiry into machine vision – Paglen has produced an art installation that reveals the politics of classifying images.
At the entrance of Paglen’s From ‘Apple’ to ‘Anomaly’ the viewer is presented with The Treachery of Object Recognition (2019), the visual representation of an algorithm’s reading of René Magritte’s Ceci n’est pas une pomme (1964). There are frames within frames: the print is physically framed; the picture has a painted frame; and there is a series of intersecting green frames within the print produced by what we later learn to be an algorithm. These green frames have textual components that on the surface seem like obvious descriptors: ‘a large white sign’, ‘red and green apple’. The text painted in the original picture – ceci n’est pas une pomme – reminds the viewer that what is in front of us is not an apple in itself but a representation of an apple. What one line of text negates, the other affirms. At the core of this contradiction we find the unstable relationship between images and language.[2]

As the viewer of the installation moves on to the main space at the Barbican’s Curve Gallery, she encounters a mosaic of thousands of prints pinned to the wall, ending with the image of a single apple. Amongst the sea of square photographs, typological clusters gather around a number of parent keywords – ‘ham and egg’, ‘apple’, ‘spam’, ‘anomaly’ and others – attempting to establish an order in the excess of prints.
As with other bodies of work by Paglen, From ‘Apple’ to ‘Anomaly’ is concerned with making visible the otherwise hidden. This project splits open the mechanisms used by artificial intelligence models to teach themselves how to read images. Prior to the creation of this installation Paglen collaborated with Kate Crawford – co-founder and director of research at the AI Now Institute at New York University – on an extensive research project into the social implications of machine learning and artificial intelligence[3]. In order to generate machine-learning programmes, developers use databases of millions of photographs that have been individually tagged with keywords, enabling software to compare the images and eventually recognise similarities and differences. On the surface, this process appears to be clinical, methodical and objective. However, the imposition of a taxonomical order on the world of digital images soon reveals the underlying ideology, political stance and world-view with which this order is constructed.

ImageNet was created by researchers at Stanford University and includes around 14 million images. Using Amazon’s Mechanical Turk[4] (AMT) crowdsourcing tools – and becoming the largest academic user of this service in the process – ImageNet indirectly employs a workforce to tag the images with keywords, allowing the dataset to group and retrieve them. The taxonomical system that ImageNet adopted to label these enormous quantities of images was inherited from WordNet, a 1980s Princeton research project. WordNet aimed to create a semantic system of categories and subcategories of words – nouns specifically – by grouping them in nested synonym clusters or ‘synsets’ that go from the general to the particular.
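To give a rough sense of how these nested synsets are organised, the short sketch below walks one hypernym path for the noun ‘apple’, tracing the same movement from the general to the particular that the exhibition title gestures towards. The use of Python and the NLTK interface to WordNet is my own illustrative choice, not part of Paglen and Crawford’s method.

```python
# A minimal sketch of WordNet's nested noun synsets via the NLTK interface.
# Assumes nltk is installed and the corpus has been fetched with
# nltk.download('wordnet'); the library is used purely for illustration.
from nltk.corpus import wordnet as wn

# Take the first noun synset for 'apple' and print one hypernym path,
# which runs from the most general node down to the particular term.
apple = wn.synsets('apple', pos=wn.NOUN)[0]
for depth, synset in enumerate(apple.hypernym_paths()[0]):
    print('  ' * depth + synset.name() + ': ' + synset.definition())
```

On a standard WordNet installation this prints a chain of increasingly specific categories, beginning with an abstraction such as ‘entity’ and ending with ‘apple’; each step down the chain narrows what an image tagged with that synset is allowed to be.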

Housing images under reductive categories can become very problematic, particularly when it comes to classifying images of people. It is uncertain how the terms used to label the photographs in the dataset’s ‘person’ category became available to AMT operators, or who is accountable for this. According to ImageNet, the crowdsourcing tool interface presents the operator with a term, its definition and instructions to match it to relevant photographs, so it could be assumed that the terms are determined by the researchers (Li 2010). Some of the terms used to define images within the ‘person’ category include: ‘loser’, ‘non-starter’, ‘ball-buster’ and ‘slut’. Whether this reflects a research bias or a collective social response, a misogynist, racist and homophobic ideology is transferred into the taxonomy: the reductionist nature of the act of labelling images has embedded a political world-view that remains hidden. The genealogy of the semantic structure used to classify the images is not entirely transparent, and it remains unclear how offensive and often nonsensical terms found their way into the ImageNet labelling system. ImageNet’s ‘person’ category is no longer available to browse, making it difficult to research the matter further, which arguably shows how delicate and controversial the classification system is.

The taxonomical orders in which the different image hierarchies are grouped also reveal another problem: the design of the system relies on discrete results from image analysis. To give an example, the algorithm identifies – to varying levels of accuracy – the gender of the person photographed as a percentage (e.g. 80% female), excluding non-binary gender categories.[5] Similar assumptions and reductions are made about a person’s sexuality and profession, as well as their intellectual and social abilities.

Leaving aside the political implications of this structural deficiency – which are by no means negligible – it becomes evident that there is a larger problem in the form of the epistemological limits of photography. Photographs are complex and ambiguous by nature, making their reading a difficult task in the best of cases. Countless theories have sought to describe the mechanisms of photographic interpretation. Barthes, to name just one example, described photography as a ‘message without a code’ and underlined the importance of context – mostly in textual form – to anchor a photograph’s meaning (Barthes 1977: 17). Something as complicated (and uncoded) as context is virtually impossible to define using words, let alone programming parameters. The assumption that knowledge can be extracted from a photograph in mathematical terms ignores these challenges.
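What extracting knowledge from a photograph ‘in mathematical terms’ amounts to can be suggested with a toy sketch: a classifier’s answer is a probability distribution over a fixed, predefined label set, so whatever falls outside that list simply cannot be said. The labels, scores and function below are hypothetical illustrations of this general mechanism, not ImageNet’s code or Paglen’s.

```python
# A toy sketch of discrete classification: raw model scores are turned into
# percentages over a closed label set. The label set and scores are
# hypothetical and deliberately reductive.
import math

LABELS = ['male', 'female']  # a closed, binary label set

def classify(raw_scores):
    """Convert raw scores into rounded percentages over the fixed labels."""
    exps = [math.exp(s) for s in raw_scores]
    total = sum(exps)
    return {label: round(100 * e / total) for label, e in zip(LABELS, exps)}

# Whatever image produced these scores, the answer is forced into two boxes.
print(classify([0.4, 1.8]))  # e.g. {'male': 20, 'female': 80}
```

However sophisticated the model that produces the raw scores, the closed label list has decided in advance what the photograph is allowed to be.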
Foucault saw biopolitics as a framework that puts human life in order (Kelly 2014). The organisation of images of human life can be imagined as an extension of biopolitics, as it shares aspects of biopower and disciplinary power. The focus of biopower on the population at large – on demographics rather than individuals and on overall rates rather than singular cases – can be extrapolated to the mechanisms by which the dataset’s taxonomy puts the images of human life in order. Disciplinary power – understood as a form of institutional control of the body – relies on, amongst other things, hierarchical observation and normalising judgement (Foucault 1991). These two concepts also have a parallel in the dataset’s architecture. Hierarchical observation relates not to the structure of the taxonomical order but to the role of vision in the exercise of disciplinary power. Normalising judgement, on the other hand, finds its place when the photographs are interpreted and labelled. It is in this process that images are set against an imaginary ‘norm’, identified as one thing or another, and tagged with a specific keyword. Establishing this comparative ‘norm’ is arguably an essential part of all human communication, and shared meaning is usually constructed in an organic and collaborative manner. A hidden and arbitrary normalising judgement is, however, deeply problematic, particularly when the mechanisms of its implementation are unknown to the public.
ImageNet is only one of many datasets used to train AI to see images and – mainly because of its academic origin – one of the few that remain open to public scrutiny. The structure of the dataset’s ‘synsets’ can be accessed online[6] but, as stated, the ‘person’ category is now unavailable. Most of the datasets that dominate the way in which machines will read the world remain completely closed off to the public and belong to the big five technology giants: Facebook, Amazon, Apple, Microsoft and Google. Whereas the photographic archives traditionally employed for policing and population control belonged to the state, their ‘domiciliation’[7] – to borrow Derrida’s term – has now found its way into the hands of transnational corporations. The physical control of the records provides an almost exclusive agency over their interpretation. According to John Tagg, photography acquired its power as evidence because it served a particular purpose in the context of the new forms of state that emerged in the last quarter of the nineteenth century. The creation of ‘institutions of knowledge’ by industrialised states provided the mechanisms that enabled photography to function as proof. Tagg (1993) argues that photographs can stand as evidence not through natural or existential properties but through historical relations and institutional power. This institutional power, whether in the form of a state or a corporation, benefits from the acceptance of the evidential value of photographs and from controlling the modes of their interpretation. Today, the public is not only shut out of the design and construction of the systems that enable machine learning but also, often, contributes to the creation of these datasets by providing their content in the form of the data mined by the large corporations. Needless to say, the value created by the use of these large image databases does not go back to the producers of the content but to the stakeholders of these corporations.
Another way of gathering images to be mined by machines is to create them with the aid of a plethora of cameras: security systems, mobile devices, drones and so on. A large number of these images are no longer made visible to humans but remain in a state of productive invisibility that I call ‘digital latency’. Looking at different photographic processes, one can identify different stages of latency and invisibility. Film or another photosensitive support can hold an inaccessible image until it is processed, but it differs from the state of digital latency in that it cannot be read or used unless it is chemically developed. Equally, images that are created electronically need some degree of processing to be made visible to humans, but they can be read and classified by algorithms whilst remaining invisible to us. Millions of images can be generated, processed and exchanged without ever being seen by human eyes. This new currency of digital latency, where images are handled ‘in the dark’ and read as pure data, carries the potential of an unaccountable automated vision.
For a limited time – it is unfortunately now offline – Paglen made available through his site ImageNet Roulette, a platform that allowed users to upload their own images to be analysed by a neural network. The results were displayed at the users’ own risk, since the words used to tag the images would, almost invariably, prove insulting. From the images that I uploaded of myself, I was described as a ‘parrot: a copycat who does not understand the words or acts being imitated’, ‘ape’, ‘ice-skater’, ‘trumpeter or cornetist’. Amusing as it is to be described as an ape or a brass musician, it is far more problematic that there are machine-learning systems which assume that the inner characteristics of a person can be extracted from a single image. The real danger lies in the implementation of algorithms that are constantly learning to read images based on biased taxonomies that have flawed assumptions embedded within them. Examples of the implementation of machine vision can be found in insurance assessments (the insurance firm Liberty Mutual uses AI to assess car damage and the insurtech start-up Collective Health mines data to tailor policies), recruitment platforms, crowd-control systems that look for violent behaviour, and criminal profiling (Dickson 2020).

In his essay The Body and the Archive (1986), Allan Sekula describes the failed attempts at a similar use of photography for criminal profiling in the nineteenth-century practices of phrenology and physiognomy by Bertillon (identification through anthropometrics and a front-and-profile portrait composite) and Galton (physiometrics and fingerprint classification). Photographic representation led positivist social scientists to believe that a more efficient judicial process could be achieved through the identification of a criminal typology by visual means. Police archives were designed to embed photographs into criminal files, turning the bureaucratic cabinet into a tool of police enforcement. The encyclopaedic endeavour by Bertillon to create a visual typology of criminals was unsuccessful. The retrieval of records to match the image of a suspect proved impracticable, given the impossibility of accessing and reading the excess of images and filed cards within the archive. The number of entries in the cabinets surpassed the capacity of the archive operators to make sense of it all. In a 2011 essay, Fontcuberta points to the impossibility for the analogue viewer (the human) of keeping up with the excess of images produced digitally (Cadava and Nouzeilles 2013). Fontcuberta locates the surplus in the production of images in the personal over-documentation of daily life relative to the time spent reviewing these images, but also in the embedding of cameras in riot-police gear and the impossibility of reviewing the hundreds of hours of video footage that result.[8] Machine vision has overcome the limits of the analogue viewer – human vision – and responded to the excess of image production with a reductive reading.
The possibility of vision, human or otherwise, is determined by structures of institutional or state power. Foucault (1991) examines the displacement of the spectacle of physical punishment and the infliction of pain from public space to the hidden, state-managed institution of the prison in Europe and North America at the end of the eighteenth and beginning of the nineteenth centuries. In doing so, the state shifted the enforcement of punishment from publicly visible torture and execution to imprisonment, confinement and deportation. The administration of justice ceased to be a public spectacle, becoming instead a private process invisible to the population. AI judgement, in its invisibility, is not only removed from the public gaze but also displaced from a form of control that was previously exclusive to the state. This is not to say that the state has lost agency or control in this mechanism, but there is a displacement of the archives – and the tools for their analysis – to the private sector. The panoptical[9] nature of digital imaging (given that users often allow data tracking and give up the rights to commercial uses of the images they produce and distribute) can also be regarded as the next level of surveillance architecture.
Sekula is careful to distinguish between the private and public uses of photography and recognises that the public gaze works not only in a repressive way but also in an ‘honorific’ fashion (Sekula 1986). Contemporary uses of photography in the public sphere also have this double application. The ‘honorific’ gaze has shifted to aspirational and self-obsessed social media practices, whilst the repressive forms have shifted to data-mining practices. Beneath the attractive features of these technological packages lies a layer in which capitalisation practices take place. Google learns from user behaviour to provide a tailored search response, but also to position the right ad; Facebook might track our interests to suggest relevant services, but also sells our political anxieties to the highest bidder; Alexa listens to every demand, but also to our preferences, so as to suggest relevant purchases on Amazon.
Physiognomy and phrenology were based on the view that the face and the skull could show the signs of a person’s character. Implementations of AI rely on similar assumptions and move beyond facial recognition to the identification of supposed psychological tendencies, from consumer profiling to the detection of violent behaviour and crowd control. Agamben locates in the nineteenth-century implementation of Bertillon’s cards and Galton’s fingerprint records the first moment when identity was no longer linked to a social persona but to ‘naked life, a purely biological datum’ (Agamben 2011: 50). Identity was no longer a set of social relations but a set of optical and statistical anthropometrics. Documents to prove one’s identity have evolved from early printed anthropometric measurements to today’s biometric chip, which contains – one can only assume, as the bearer of the document cannot read the contents of the chip – a digital transcript of the data for machines to read. In this reduction from the social to the biometric, Agamben finds that identity recognition becomes bound to ‘the Great Machine’, commenting ‘I am not forgotten if the Great Machine has recorded my numerical or digital data’ (Agamben 2011: 53). But the ‘Great Machine’ not only recognises biometrics, it also places us within a social landscape. AI reads our ‘zoe’ (biological datum, our bare life) but it also thinks of our ‘bios’ (our way of living, our social and political space), interprets our emotions and predicts our behaviour. The ‘Great Memory’ not only remembers us; it never forgets.
Criminology emerged as a science from the idea of a criminal biotype: an organically different body with a criminal pathology (Sekula 1986). Sekula makes clear his concern about the overpromise of the optical realism that photography offered in the nineteenth century, and recognises the filing cabinet – the archive – as the central artefact in a taxonomical system of biotypes. He states that the sheer number of photographs, their excess, frustrated the promise of the archive as a system of geometrical, mathematical knowledge retrieval. With the current image banks, what could happen? Where the filing cabinet failed, could the algorithm triumph?
Footnotes
[1] Crawford and Paglen describe their methodology as ‘an archeology of datasets […] digging through the material layers, cataloguing the principles and values by which something was constructed, and analyzing what normative patterns of life were assumed, supported, and reproduced’ (Crawford and Paglen 2019).
[2] In the first pages of Camera Lucida Barthes also alludes to a Magritte painting, but with the intention of highlighting the difference between photographs and other images, since in the former he sees the referent as inseparable from the photograph. For Barthes ‘the Photograph […] has something tautological about it: a pipe, here, is always a pipe’ (Barthes 1993: 5).
[3] A written piece from their research can be accessed at https://www.excavating.ai/.
[4] Mechanical Turk is an online platform that allows piecemeal workers to perform remote click work such as tagging images; it is named after an automaton that was purported to be mechanical but in fact had a person hiding inside.
[5] Paglen made this particular issue visible in the work Machine Readable Holly Herndon (2017), in which a grid of portraits of the composer Holly Herndon is captioned with the results – in percentages – of a facial recognition algorithm’s query into her age, gender and emotional state.
[7] Derrida uses one of the etymologies of the word archive, arkheion, which means house or address, to reflect on the physical space or place where the official documents are filed. ‘It is thus, in this domiciliation, in this house arrest, that archives take place’ (Derrida 1995: 10).
[8] Fontcuberta illustrates this with an example of the reactions of the Catalonian government in relation to the demonstrations in Barcelona.
[9] Foucault, Sekula and Tagg make reference to Bentham’s panopticon, the ever-seeing central tower in a prison that would change the behaviour of the inmates by making them aware of the potential of being seen.
Pablo Antolí is a London-based Mexican artist interested in the creative tensions between the documentary and the constructed image. His work stems from an interest in history and geopolitics. He obtained an MA in Photography from London College of Communication in 2012 and has since worked on research-based projects in Mexico and Europe. Alongside his practice as a visual artist, he has worked as a visiting lecturer and was granted a bursary from the Mexican Endowment for the Arts and Culture (FONCA) in 2017-2018. He is currently a practice-based MPhil/PhD researcher at CREAM, University of Westminster, London.