As with the Clippy experiments looking into banal AI, I’ve been using PowerPoint’s Alt-Text AI on a variety of images to examine what the assemblage of Microsoft AI + user-labelled clips feeding back to the Microsoft AI ‘thinks’ about images. I have not yet been able to find out what techniques Microsoft are using, but then I’ve only just started a PhD, so perhaps my lack of rigour is not yet symptomatic of a lack of vigour in my character. Anyway, as you can see above, a person looking away from the camera is labelled by Microsoft as a person looking at the camera. If it is registering a symmetrical silhouette of a human being, is it more likely to describe them as looking at the camera because that is the most common type of image previously labelled? Do all humans look at cameras? Or do most humans upload images of humans looking at cameras, and therefore I’m just atypical in having a publicity shot (for a documentary about conspiracy theories I made) that deliberately conceals my face?
Whilst I have not yet found what methods Microsoft’s Alt-Text feature uses to label images, there are a few likely contenders. Convolutional neural networks (CNNs) are a common way to classify images. CNNs are a class of deep neural networks built around a mathematical operation, the convolution, and are apparently inspired by the way the brain processes vision. Above, a CNN used by computer vision researchers at Berkeley shows a human-readable representation of what the computer sees when given an image of Magritte’s painting The Treachery of Images. The human-readable feature visualisation handily shows an area demarcated by a bounding box, a text label, and a confidence rating (0-1, with 1 being highest). It’s pretty confident that it’s a pipe. But then it appears to have ignored Magritte’s words.
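For anyone curious what the convolution mentioned above actually does, here is a minimal sketch in plain Python. It is not Microsoft's or the Berkeley researchers' method, just the basic operation: a small grid of numbers (the kernel) slides across the image, and at each position the overlapping values are multiplied and summed. The function name `convolve2d`, the tiny image and the kernel values are all made up for illustration. A real CNN learns its kernels from labelled examples and stacks many layers of them.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    Strictly this is cross-correlation (the kernel is not flipped),
    which is what most CNN libraries call a convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply each kernel value by the pixel beneath it and sum.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x4 greyscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds where brightness increases left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```

The resulting feature map is non-zero only in the middle column, where the dark half meets the bright half. In other words, this particular kernel 'sees' a vertical edge and nothing else.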
To see how well PowerPoint’s AI identifies the painting of a pipe in Magritte’s Treachery of Images compared with the Computer Vision Group’s CNN, I put a similar image through the Alt-Text feature.
Whilst Alt-Text was less confident at identifying the image in the painting as a pipe (it registered a low confidence), it was, to my mind at least, far more descriptive, in that it registered it as a picture. It also gains some poetry, registering that it has mug-ness and pan-ness in its shape. Is this what is meant by the wisdom of the crowd (the sourcing of image labels between humans and computers)? In one sense, is this assemblage more expressive of the idea that a computer sees like a human?