What is called ‘AI’ but really isn’t

Because “artificial intelligence” and “AI” have become such potent buzzwords in business — and so many firms are trying to sell some kind of “AI” system or software or strategy to every business possible — we should all take a step back and evaluate whether there is actual AI operating in some of these systems.

That won’t always be easy to discern. If a company claims there is “AI” in its product, they are not going to divulge exactly how it works. If they want to convince you, their literature or their engineers will likely throw out a tangled net of terms that, while accurate, might not help anyone but another engineer understand what’s inside the black box.

I was thinking about this recently as I worked on assignments for an online computer science course in AI. One of the early projects was to program a tic-tac-toe game in which a human can play against “an AI.” Like a competent human, the AI can force a tie in every tic-tac-toe game; only if the human makes a mistake will the AI win. I wrote the code that enables the AI to play — that was the assignment. But I didn’t invent the code from nothing. I was taught in the course to use an algorithm called minimax. Further, I was encouraged to make my program faster by using another algorithm called alpha-beta pruning.

Illustration of alpha-beta pruning (Wikipedia, by Jez9999, GNU license)

There is no machine learning involved in those two algorithms. They are simply a time-tested way for a computer program to direct a certain kind of look-ahead in a two-player game (not only tic-tac-toe).

Don’t despair or tune out — look at the diagram and understand that the computer, through instructions in my code, is able to rapidly advance through every possible outcome in tic-tac-toe and see how to: (a) prevent a win for the opponent, and (b) win if a win is possible.
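Here is a minimal sketch, in Python, of what minimax with alpha-beta pruning can look like for tic-tac-toe. This is my own illustration, not the course’s distribution code; the board representation and function names are assumptions, but the look-ahead logic is the standard algorithm.

```python
import math

X, O, EMPTY = "X", "O", None

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    lines = [[(0,0),(0,1),(0,2)], [(1,0),(1,1),(1,2)], [(2,0),(2,1),(2,2)],
             [(0,0),(1,0),(2,0)], [(0,1),(1,1),(2,1)], [(0,2),(1,2),(2,2)],
             [(0,0),(1,1),(2,2)], [(0,2),(1,1),(2,0)]]
    for line in lines:
        vals = [board[r][c] for r, c in line]
        if vals[0] is not EMPTY and vals.count(vals[0]) == 3:
            return vals[0]
    return None

def moves(board):
    return [(r, c) for r in range(3) for c in range(3) if board[r][c] is EMPTY]

def result(board, move, player):
    new = [row[:] for row in board]
    new[move[0]][move[1]] = player
    return new

def minimax(board, player, alpha=-math.inf, beta=math.inf):
    """Return (score, best_move) for the player to move. X maximizes, O minimizes."""
    w = winner(board)
    if w == X:
        return 1, None
    if w == O:
        return -1, None
    if not moves(board):
        return 0, None                      # board full: a tie
    best_move = None
    if player == X:
        best = -math.inf
        for move in moves(board):
            score, _ = minimax(result(board, move, X), O, alpha, beta)
            if score > best:
                best, best_move = score, move
            alpha = max(alpha, best)
            if beta <= alpha:               # alpha-beta pruning: skip branches the opponent would never allow
                break
        return best, best_move
    else:
        best = math.inf
        for move in moves(board):
            score, _ = minimax(result(board, move, O), X, alpha, beta)
            if score < best:
                best, best_move = score, move
            beta = min(beta, best)
            if beta <= alpha:               # prune
                break
        return best, best_move

# The AI, playing X, picks its move on an empty board:
empty = [[EMPTY] * 3 for _ in range(3)]
print(minimax(empty, X))   # (0, (0, 0)): the best X can force against perfect play is a tie
```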

There is no magic here.

Tic-tac-toe with “AI” playing X, human playing O.

Another assignment in the same course has the students programming “an AI” that plays Minesweeper. This game is quite different from tic-tac-toe in that there is only one player, and there is hidden knowledge: The player doesn’t know where the mines are. One move at a time, the player builds knowledge about the game board.

Completed Minesweeper game, with AI playing all moves.

A human player avoids clicking on a mine by choosing squares that are next to a 0 (indicating no mines touch that square) and by marking a square as a mine when it becomes obvious that one is hidden there.

The “AI” builds knowledge in the way it is programmed to (that is the assignment). In this case, there is no pre-existing algorithm, but there are principles of logic. My code stores “knowledge” each time the AI clicks a square and a number is revealed. The knowledge is: (a) that number, and (b) the coordinates of all the surrounding squares. Thus the AI “knows” that, for example, among eight specified squares there are two mines.

If among eight specified squares there are zero mines, my code tells the AI to mark all eight of those squares as safe. My code also tells the AI that if there are any safe moves left to be made, then make a safe move. If not, make a random move. That is the only time when the AI can possibly set off a mine.
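Below is a rough Python sketch of that kind of stored knowledge and the move-selection rule. The class and function names are my own, not the course’s starter code.

```python
import random

class Sentence:
    """One piece of knowledge: among these board cells, exactly `count` are mines."""
    def __init__(self, cells, count):
        self.cells = set(cells)   # e.g., the coordinates of the 8 squares around a revealed number
        self.count = count        # the number that was revealed

    def known_safes(self):
        # If the count is zero, every cell in this sentence is safe.
        return set(self.cells) if self.count == 0 else set()

    def known_mines(self):
        # If the count equals the number of cells, every cell is a mine.
        return set(self.cells) if self.count == len(self.cells) else set()

def choose_move(safe_moves, unrevealed):
    """Make a safe move if one is known; otherwise guess at random,
    which is the only time the AI can possibly set off a mine."""
    if safe_moves:
        return next(iter(safe_moves))
    return random.choice(sorted(unrevealed))

# A revealed square shows 2, and these are its eight hidden neighbors:
s = Sentence({(0, 0), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1), (2, 2)}, 2)
print(s.known_safes())   # set(): two mines are somewhere among the eight cells
print(s.known_mines())   # set(): we can't yet say which cells they are
```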

Once again, there is no magic here.

In contrast to these two simple examples of a computer successfully playing a game, AlphaGo (which I wrote about previously) uses real AI and could not have beaten a human Go master otherwise. Some games can’t be mastered with only simple algorithms or logic — a program that is going to win needs something akin to intuition.

Programming a computer to develop and use an approximation of human intuition is what we have in today’s machine learning with deep neural networks. It’s still not magic, but it’s a lot more complicated than the kind of strictly mapped-out processes I wrote for playing tic-tac-toe or Minesweeper.


Visual Chatbot: What can AI tell you?

To see for yourself the product, or end results, of an AI system, check out the Visual Chatbot online. It’s free. It’s fun.

Screenshot of dialog with Visual Chatbot

This app invites you to upload any image of your choice. It then generates a caption for that image. As you see above, the caption is not always 100 percent accurate. Yes, there is a dog in the photo, but there is no statue. There is a live person, who happens to be a soldier and a woman.

You can then have a conversation about the photo with the chatbot. The chatbot’s answer to my first question, “What color is the dog?”, was spot-on. Further questions, however, reveal limits that persist in most of today’s image-recognition systems.

The chat is still pretty awesome, though.

Photo of a soldier and a dog indoors, probably in an airport, with a “Welcome Home” balloon. U.S. Department of Defense photo, 2015 (public domain).

The image appears in chapter 4 of Artificial Intelligence: A Guide for Thinking Humans, where author Melanie Mitchell uses it to discuss the complexity that we humans can perceive instantly in an image, but which machines are still incapable of “seeing.”

In spite of the mistakes the chatbot makes in its answers to questions about this image, it serves as a nice demonstration of how today’s chatbots do not need to follow a set script. Earlier chatbots were programmed with rules that stepped through a tree or flowchart of choices — if the human’s question contains x, then reply with y.
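To make the contrast concrete, here is a toy Python sketch of that older, scripted approach. The keywords and canned replies are invented for illustration.

```python
# A toy version of an old-school, rule-based chatbot: scripted responses
# triggered by keywords, with no learning and no model of meaning.
rules = {
    "hours": "We are open 9 a.m. to 5 p.m., Monday through Friday.",
    "refund": "Refunds are processed within 5 business days.",
}

def reply(question):
    for keyword, answer in rules.items():
        if keyword in question.lower():   # if the human's question contains x...
            return answer                 # ...then reply with y
    return "Sorry, I don't understand. Can you rephrase?"

print(reply("What are your hours on Friday?"))
```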

If you’re curious about the data, model, and code behind Visual Chatbot, you can see more info about Visual Dialog.

Below you can see some more questions I asked, with the answers from Visual Chatbot.

Screenshots of dialog with Visual Chatbot (five images)

Some of my favorite wrong answers are on the last two screens. Note that you can ask questions that call for more than a yes or no answer.


Who labels the data for AI?

In yesterday’s post, I referred to the labels that are required for supervised machine learning. To train a model — which enables an AI system to correctly identify or sort images or documents or iris flowers (and so much more) — each data record must include one or more labels. For an image of a dog, for example, the labels might be dog and Great Dane. For an iris flower, the label is the name of the exact species of that individual flower.

Nowadays there are people all around the world sitting at computers and labeling data.

In the 6-minute video above, BBC journalist Dave Lee travels to Kenya, where about 2,000 people work in a Nairobi office for Samasource, which produces training data for use in machine learning.

You’ll see exactly how every single item in one video frame is marked and tagged — this is what a vision system for a self-driving car needs if it is to avoid crashing into mailboxes or people.
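To picture what that labeling work produces, here is roughly what one labeled object in one video frame might look like as a data record. This is a generic, COCO-style example I made up; it is not Samasource’s actual format.

```python
# One labeled object in one video frame (all values are invented for illustration).
annotation = {
    "frame_id": 4021,                 # which frame of the video this label belongs to
    "category": "pedestrian",         # what the human annotator says the object is
    "bbox": [312, 190, 64, 142],      # bounding box around the object: x, y, width, height in pixels
    "annotator_id": "worker_117",     # hypothetical field recording who drew the box
}
print(annotation["category"], annotation["bbox"])
```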

In the Nairobi office, 52 percent of the workers are women. The pay is terribly low by Silicon Valley standards, but high for Kenya. Lee doesn’t gloss over this aspect of the story — in fact, it’s central to the telling.

Financial Times journalist Madhumita Murgia wrote about Samasource in July 2019. Her story also covers iMerit, a similar company with offices in Kolkata, India, as well as California and Louisiana.

“An hour of video takes eight hours to annotate. In fact, a McKinsey report from 2018 listed data labeling as the biggest obstacle to AI adoption in industry.”

—Financial Times

Some very large and widely used datasets such as ImageNet were labeled by self-employed workers for extremely low rates of pay — often through the Amazon-owned Mechanical Turk crowdsourcing website (which also offers up far worse tasks for similarly low compensation). In contrast, Samasource’s CEO Leila Janah told Murgia that the company’s pay rate is “almost quadruple” the previous income of their workers in developing countries.

Janah also pointed out that these workers are not just labeling cats and dogs. They have been trained, for example, to label diseased cells in photos of cross-sections of plants for one particular project. They are providing real human intelligence that is specialized to very particular problem sets.

Fortune journalist Jeremy Kahn wrote about other companies that also provide data-labeling services for top multinational firms. Labelbox and Scale AI have received heaps of funding from venture capitalists, but I couldn’t find any information about their workers who label the data. Is this something we should be concerned about? Probably so.

Both Samasource and iMerit are upfront about who their workers are and where they do the work (this might have changed since the spread of COVID-19 in early 2020). Are the dozens of other companies supplying labeled data to corporations and universities in the wealthy countries paying their workers a living wage?

“Often companies have a need for both general and more expert labeling and employ a combination of outsourcing firms, freelancers, and in-house experts to affix these annotations.”

—Fortune

Labelbox, in fact, doesn’t employ people who do the labeling work, according to Fortune. It provides “a tool for managing labeling projects and data across different contract labelers, who often work for large outsourcing firms.”


ImageNet and labels for data

Supervised learning is a type of machine learning in which a model is trained using labeled data. You begin with a very large collection of labeled data. (In the case of ImageNet, the data were all digital images. For the Iris Data Set, the data all refer to individual iris flowers, which can be divided into three related species. For the MNIST dataset, the data are about 70,000 images of handwritten digits, 0 through 9.)

You divide the dataset into two parts, the training data and the test data. The split might be 70/30, or 80/20. You don’t handpick which records go into which group; the split is random. Then you run the training data through the model many, many times, adjusting certain parameters in the code along the way, until the code consistently returns good results — that is, the thing the code identifies (an object in an image, an iris species, a number) matches the label (which is hidden from the code).

At that point, you have a trained model. You feed the test data set to it and see whether the accuracy rate is also high. (It’s important that none of the test data were used to train the model.) Again, the proof is in the labels.
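Here is a minimal sketch of that workflow in Python, using scikit-learn and its built-in copy of the Iris Data Set. The 70/30 split and the choice of classifier are my own; the point is only to show the train/test pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris = load_iris()                      # 150 flowers, 4 measurements each, 3 species labels

# Random 70/30 split; we don't handpick which flowers go into which group.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)             # training: the model sees these labels

# The test flowers were never used in training; their hidden labels are the proof.
predictions = model.predict(X_test)
print(accuracy_score(y_test, predictions))
```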

In a later post I will discuss how data come to be labeled. (Hint: It’s not elves.) In this post, I will discuss bad labels. Specifically, I want to highlight the work that AI researcher Kate Crawford and artist-researcher Trevor Paglen did around the famous ImageNet dataset.

In the video above, Crawford and Paglen present this work and show a lot of great examples. They also published a long article about the work, if you’d rather read than watch.

ImageNet is a huge collection of labeled images. More than 14 million images. They were labeled according to a set of categories and synonym groupings from WordNet, an English-language lexical database. The images were labeled by humans.

And that, it seems, is at the root of the problem.

Crawford and Paglen were interested in the ImageNet photos of people. Person is a category in WordNet. Within the category, there are many descriptive terms for people, such as “cheerleaders, scuba divers, welders, Boy Scouts, fire walkers, and flower girls.” So the photos of people in ImageNet are labeled with these terms. However, not all terms are neutral.

“A young man drinking beer is categorized as an ‘alcoholic, alky, dipsomaniac, boozer, lush, soaker, souse.’ A child wearing sunglasses is classified as a ‘failure, loser, non-starter, unsuccessful person.’”

—Crawford and Paglen
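You can browse those WordNet groupings yourself with the NLTK library. A quick sketch, assuming the WordNet corpus data has been downloaded:

```python
import nltk
nltk.download("wordnet", quiet=True)       # one-time download of the WordNet data
from nltk.corpus import wordnet as wn

person = wn.synset("person.n.01")          # the WordNet entry for "person"
# Print a few of the direct subcategories ("hyponyms") under person.
for synset in person.hyponyms()[:10]:
    print(synset.lemma_names()[0], "-", synset.definition())
```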

You might say, well, where’s the harm? They are only labels in a database, after all.

The ImageNet database has been used to train many convolutional neural networks used in image-recognition software.

When you feed a photo of yourself into an image-recognition application, you might be surprised at the labels that are applied to you. For example, an image of Paglen (a white man with a shaved head) was labeled as “Klansman, Ku Kluxer.”

Paglen built a web app called ImageNet Roulette so that anyone could upload a photo of themselves or a friend and see what labels were applied. (The app is no longer online.) It became clear that perfectly innocuous people in photos were being labeled as criminals or dangerous, or with racist or sexist terms.

About 952,000 of ImageNet’s 14 million images were in the person category as of 2010 (source). Many of those images — with their labels — were removed after the opening of Crawford and Paglen’s art exhibition, Training Humans, in Milan in September 2019.

ImageNet has been used to train countless image-recognition systems since 2010.

Additional information:

Leading online database to remove 600,000 images after art project reveals its racist bias (September 2019), The Art Newspaper.


GPT-3 and automated text generation

GPT-3 has to be the most-hyped AI technology of the past year. Headlines said its predecessor, GPT-2, was “too dangerous” to be released publicly. Then it was released. The world did not end.

Less than a year later, the more advanced (next generation) GPT-3 was released by OpenAI. Why are people so excited about GPT-3? See for yourself in the video below.

GPT-3 is a natural language generation (NLG) system. Given instructions about what you want, it writes original text that — in most (but not all) cases — sounds like a human wrote it. The technology could be used to rapidly write 10,000 fake user comments into a discussion forum, for example. Or 10,000 fake restaurant reviews.

Don’t worry about the first examples in the video showing GPT-3 writing computer code, if that’s not something you’re well acquainted with — it quickly moves on to show the system extracting text from long documents and writing summaries on the fly. The presenter does a good job of demonstrating the breadth and variety of tasks GPT-3 can be used for. You might be flat-out amazed.

Bear in mind that the examples shown in the video are different, separate applications of GPT-3. You don’t just install GPT-3 and it does all of those things.

Developers can apply to gain access to the GPT-3 API. This enables them to create applications that use GPT-3 but not to see or modify the actual code that makes GPT-3 work. You can view more examples of GPT-3 applications at that same link.
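For a sense of what that looks like in practice: at the time of writing, a call through OpenAI’s Python library looked roughly like the sketch below. The prompt and parameter values are my own example, and the library has changed since, so treat this as an illustration rather than a reference.

```python
import openai

openai.api_key = "YOUR_API_KEY"   # issued only after OpenAI approves your application

response = openai.Completion.create(
    engine="davinci",             # the largest GPT-3 model exposed by the API at the time
    prompt="Explain in one sentence why the sky is blue.",
    max_tokens=60,                # cap the length of the generated text
    temperature=0.7,              # higher values produce more varied output
)
print(response.choices[0].text)
```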

Another nice thing about the video above is the explanation of generative pre-training. Instead of training the GPT-3 model (or models) only with labeled data (supervised learning), the OpenAI researchers used “a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning.” The pre-training for the original GPT model included a dataset of more than 7,000 unpublished books “from a variety of genres including Adventure, Fantasy, and Romance.” Because entire books were used — instead of sentences separated from their context — the model was able to learn long-range structure.

GPT-3 used even more long-form texts for pre-training (described in a technical paper):

Above: Screenshot from “Language Models Are Few-Shot Learners,” Brown et al., July 2020

Once again we can see that tremendous advances in AI capability are made possible precisely because today’s computer hardware has the ability to run through enormous quantities of data very quickly. It’s not only that we now have billions of pages of text in digital form. It’s not just that we can store that Himalayan mountain range of data. It’s very much because processors are able to run multiple calculations simultaneously at lightning speed.

An important point about GPT-3 that’s not covered in the video: None of these applications, or GPT-3 itself, understands the meaning of the text that is being generated.

It’s going to be very easy for people to jump to conclusions about the “intelligence” of a computer system when it’s able to generate responses and explanations that are so human-like. There is no comprehension here. There is no knowledge of the world — there is only knowledge about language itself.

To learn more about how GPT-3 does what it does: GPT-3 Explained in Under 3 Minutes.


Untangling speech recognition

Dealing with language is so complicated! In this post I want to focus on speech, voice, audio — but bear in mind that text is also language, and unlike humans, a machine must be able to process text if it’s going to do anything at all with language.

The speech part of machine learning goes two ways: The machine can “hear” speech as audio (it receives audio and simultaneously creates a digital representation of it) — but to make sense of it, to use it (to find the answer to your question, for example), the machine must convert the audio into text. On the other hand, before the machine can “speak,” it needs text — and that text must be converted into digital audio. For the machine, these are not just one thing and its reverse.
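To give a feel for what that digital representation is, here is a small Python sketch using the librosa library. The file name is made up; any short recording of speech would do.

```python
import librosa

# Load a short speech recording; the result is just a long array of numbers
# (amplitude samples) plus the sampling rate.
audio, sample_rate = librosa.load("hello.wav", sr=16000)
print(audio.shape)            # e.g. (16000,) for one second of audio at 16 kHz

# Acoustic models rarely work on raw samples; a common first step is to
# convert the waveform into features such as MFCCs, one small vector
# per roughly 25-millisecond slice of sound.
features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)
print(features.shape)         # (13, number_of_time_frames)
```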

Until I began researching this, I hadn’t given any thought to accents. I had thought about the differences among languages (and I still don’t know whether it’s harder, easier or the same to train a speech-recognition system in tonal languages such as the Chinese languages, or Vietnamese, as compared with a non-tonal language such as English), but I’d never considered that a person speaking English with an accent might not be “understood” by a speech-recognition system.

Behind the Mic: The Science of Talking with Computers (2014)

This breezy video from Google (7 minutes) does a good job of conveying a bit of the actual science behind how Siri, Alexa or Google Assistant “know” what we are saying when we speak to them. Even though it’s from 2014, there’s nothing outdated (as far as I know). You can see how the machine represents the speech it takes in. Like many explanations I found, however, it kind of mushes the text part and the sound part together, leaving the viewer with a general sense of how it all works but still in the dark as to how the parts work separately. (I don’t like how they show a human brain when they talk about neural networks. That’s very misleading.)

The video provides a quick background on the development of speech recognition, which was pretty awful until just a few years ago when researchers started applying deep neural networks to the acoustics part. Just like image recognition, speech recognition got a tremendous boost from the advances in computer processing hardware that now allow immense quantities of data to be analyzed at super speed.

To get a handle on how the separate parts of a speech-recognition system work, I needed to listen to this podcast from March 2020. It’s a 50-minute interview with Catherine Breslin, a U.K. machine learning scientist who specializes in speech recognition. She worked at Amazon Alexa for four and a half years. There’s a full transcript at the same URL if you’d rather read than listen.

For speech recognition, machine learning is used to train separate models — one for acoustics, and one for language. There’s also a third piece, the lexicon, which indicates the sequence of phones (the tiniest sound segments) that make up a single word. I don’t yet understand how that part is made. (Any program that reads text aloud would need to have a lexicon.)

“So if we put these together, we have an acoustic model, which tells you from some audio which sounds are likely to be spoken at that time; the lexicon tells you how those sounds combine into words, and then the language model tells you how those words combine into sequences of words.”

—Catherine Breslin

The three pieces, Breslin explains, work together in a decoding process that produces text from speech — the most likely representation of what was said. I looked at some further technical explanations of how the decoding is done, and it resembles a system for AI analysis of game moves — giant trees, many layers, lots of nodes. What the system needs to learn is the probabilities for sounds forming words forming sentences.
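Here is a toy Python illustration of how those scores combine during decoding. The candidate transcriptions and the probabilities are invented; a real decoder searches an enormous lattice of possibilities rather than two hand-picked strings.

```python
# Toy decoding: pick the word sequence W that maximizes
# P(audio | W) * P(W), i.e., acoustic score times language-model score.
candidates = {
    "recognize speech":   {"acoustic": 0.0008, "language": 0.0200},
    "wreck a nice beach": {"acoustic": 0.0009, "language": 0.0002},
}

best = max(candidates, key=lambda w: candidates[w]["acoustic"] * candidates[w]["language"])
print(best)   # "recognize speech": the acoustic scores are close, so the language model decides
```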

Note, all this is just to get to where the machine has the text of what was said. It hasn’t yet done any analysis of what was meant. Whew.

However, apart from voice assistants like Siri and Alexa, this process by itself has tremendous value for transcription. It is used to produce transcripts of radio programs, interviews and meetings, as well as to generate subtitles for movies and videos.


Robots, and what’s not AI

Think of a robot. Do you picture a human-looking construct? Does it have a human-like face? Does it have two legs and two arms? Does it have a head? Does it walk?

It’s easy to assume that a robot that walks across a room and picks something up has AI operating inside it. What’s often obscured in viral videos is how much a human controller is directing the actions of the robot.

I am a gigantic fan of the Spot videos from Boston Dynamics. Spot is not the only robot the company makes, but for me it is the most interesting. The video above is only 2 minutes long, and if you’ve never seen Spot in action, it will blow your mind.

But how much “intelligence” is built into Spot?

The answer lies in between “very little” and “Spot is fully autonomous.” To be clear, Spot is not autonomous. You can’t just take him out of the box, turn him on, and say, “Spot, fetch that red object over there.” (I’m not sure Spot can be trained to respond to voice commands at all. But maybe?) Voice commands aside, though, Spot can be programmed to perform certain tasks in certain ways and to walk from one given location to another.

This need for additional programming doesn’t mean that Spot lacks AI, and I think Spot provides a nice opportunity to think about rule-based programming and the more flexible reinforcement-learning type of AI.

This 20-minute video from Adam Savage (of MythBusters fame) gives us a look behind the scenes that clarifies how much of what we see in a video about a robot is caused by a human operator with a joystick in hand. If you pay attention, though, you’ll hear Savage point out what Spot can do that is outside the human’s commands.

Two points in particular stand out for me. The first is that when Spot falls over, or is upside-down, he “knows” how to make himself get right-side-up again. The human doesn’t need to tell Spot he’s upside-down. Spot’s programming recognizes his inoperable position and corrects it. Watching him move his four slender legs to do so, I feel slightly creeped out. I’m also awed by it.

Given the many incorrect positions in which Spot might land, there’s no way to program this get-right-side-up procedure using set, spelled-out rules. Spot must be able to use estimations in this process — just like AlphaGo did when playing a human Go master.

The second point, which Savage demonstrates explicitly, is accounting for non-standard terrain. One of the practical uses for a robot would be to send it somewhere a human cannot safely go, such as inside a bombed-out building — which would require the robot to walk over heaps of rubble and avoid craters. The human operator doesn’t need to tell Spot anything about craters or obstacles. The instruction is “Go to this location,” and Spot’s AI figures out how to go up or down stairs or place its feet between or on uneven surfaces.

The final idea to think about here is how the training of a robot’s AI takes place. Reinforcement learning requires many, many iterations, or attempts. Possibly millions. Possibly more than that. It would take lifetimes to run through all those training episodes with an actual, physical robot.

So, simulations. Here again we see how super-fast computer hardware, with multiple processes running in parallel, must exist for this work to be done. Before Spot — the actual robot — could be tested, he existed as a virtual system inside a machine, learning over nearly endless iterations how not to fall down — and when he did fall, how to stand back up.
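Here is a generic sketch of what one simulated training episode looks like in code, using the older Gym-style interface (the environment named here is a stock physics simulation that requires Gym’s Box2D extras). This is not Boston Dynamics’ software, just the standard reinforcement-learning pattern.

```python
import gym

env = gym.make("BipedalWalker-v3")   # an off-the-shelf simulated walking robot
total_episodes = 1000                # real projects run through vastly more episodes than this

for episode in range(total_episodes):
    observation = env.reset()        # start a fresh simulated attempt
    done = False
    while not done:
        action = env.action_space.sample()   # placeholder: a trained agent's policy would choose here
        observation, reward, done, info = env.step(action)
        # A learning algorithm would use the reward signal to improve the policy,
        # for example rewarding forward progress and penalizing falls.
env.close()
```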

See more robot videos on Boston Dynamics’ YouTube channel.


Racial and gender bias in AI

Different AI systems do different things when they attempt to identify humans. Everyone has heard about face recognition (a k a facial recognition), which you might expect would return a name and other personal data about a person whose face is “seen” with a camera.

No, not always.

A system that analyzes human faces might simply try to return information about the person that you or I would tag in our minds when we see a stranger. The person’s gender, for example. That’s relatively easy to do most of the time for most humans — but it turns out to be tricky for machines.

Machines often get it wrong when trying to identify the gender of a trans person. But machines also misidentify the gender of people of color. In particular, they have a big problem recognizing Black women as women.

A short and good article about this ran in Time magazine in 2019, and the accompanying video is well worth watching. It shows various face recognition software systems at work.

Another serious problem concerns differentiating among people of Asian descent. When apartment buildings and other housing developments have installed face recognition as a security system — to open for residents and stay locked for others — the Asian residents can find themselves locked out of their own home. The doors can also open for Asian people who don’t live there.

You can find a lot of articles about this widespread and very serious problem with AI technology, including the deservedly famous mug shots test by the American Civil Liberties Union.

“While it is usually incorrect to make statements across algorithms, we found empirical evidence for the existence of demographic differentials in the majority of the face recognition algorithms we studied.”

—Patrick Grother, NIST computer scientist

So how does this happen? How do companies with almost infinite resources deploy products that are so seriously — and even dangerously — flawed?

Yesterday I wrote a little about training data for object-detection AI. To identify any image, or any part of an image, an AI system is usually trained on an immense set of images. If you want to identify human faces, you feed the system hundreds of thousands, or even millions, of pictures of human faces. If you’re using supervised learning to train the system, the images are labeled: Man, woman. Black, white. Old, young. Convicted criminal. Sex offender. Psychopath.

Who is in the images? How are those images labeled?

This is part of how the whole thing goes sideways. There’s more to it, though. Before a system is marketed, or released to the public, its developers are going to test it. They’re going to test the hell out of it. Compare this with developing an AI that plays a particular game, like Go or chess: after the system has been trained, you test it by having it play, and you see whether it can win consistently. So when developers create a face recognition system, test it extensively, and say, great, now it’s ready for the public, it’s ready for commercial use — ask yourself how they missed these glaring flaws.

Ask yourself how they missed the fact that the system can’t differentiate between various Asian faces.

Ask yourself how they missed the fact that the system identifies Black women as men.
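One concrete practice that would catch such flaws is disaggregated testing: breaking accuracy down by demographic group instead of reporting a single overall number. A small Python sketch, with invented data, shows why the overall number can hide the problem.

```python
import pandas as pd

# Hypothetical test results: one row per test image, with the group it belongs to
# and whether the system classified it correctly.
results = pd.DataFrame({
    "group":   ["lighter_male", "lighter_female", "darker_male", "darker_female"] * 100,
    "correct": [True, True, True, False] * 100,    # invented pattern for illustration
})

# Overall accuracy hides the problem; per-group accuracy exposes it.
print(results["correct"].mean())                   # 0.75 overall
print(results.groupby("group")["correct"].mean())  # darker_female: 0.0 in this toy data
```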

Fortunately, in just the past year these flaws have received so much attention that a number of large firms (Amazon, IBM, Microsoft) have pulled back on commercial deployments of face recognition technologies. Whether they will be able to build more trustworthy systems remains to be seen.

More about bias in face recognition systems:


Ask a computer to draw what it sees

If a computer can correctly identify an object (an apple, a tricycle) or an animal such as a zebra, can it produce a drawing of that object or animal? This is something most people can do, even if their drawing skills are minimal. After all, almost anyone can play Pictionary.

This 8-minute video shows us what happened when a programmer-artist reversed the process of an AI that recognizes objects and animals in digital images. I really admire the deft storytelling here.

Object recognition has improved amazingly in the past 10 years, but that does not mean these AI systems see the same way as a human does. In some cases, that might not matter at all. In other cases, it can mean the difference between life and death.

In yesterday’s post I mentioned the way a convolutional neural network (part of a machine learning system) processes an image through many stacked layers of detection units (sometimes called neurons), identifying edges and shapes that eventually lead to a conclusion that the image is likely to contain such-and-such an object, animal, or person. Today’s video shows a bit more about the training process that an AI goes through before it can perform these identifications.

Training is necessary in the type of machine learning called supervised learning. The training data (in this case, digital images of objects and animals) must be labeled in advance. That is, the system receives thousands of images labeled “tiger” before it is able to recognize a tiger in a random photo or video. If a system can identify 20 different animals, that system was trained on thousands of images of each animal.

If the system was never trained on tigers, it cannot recognize a tiger.
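In practice, those labels often come from nothing fancier than folder names. A sketch using Keras (the directory layout and class names are my own example):

```python
import tensorflow as tf

# Assumed directory layout:
#   animals/tiger/0001.jpg ...
#   animals/zebra/0001.jpg ...
# Each folder name becomes the label for every image inside it.
dataset = tf.keras.utils.image_dataset_from_directory(
    "animals",
    image_size=(224, 224),
    batch_size=32,
)
print(dataset.class_names)   # ['tiger', 'zebra']
```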

So today’s video gives us a nice glimpse into how and why that training works, and what its limitations are. What’s really fascinating to me, though, are the images produced by programmer-artist Tom White’s system.

“I have created a drawing system that allows neural networks to produce abstract ink prints that reveal their visual concepts. Surprisingly, these prints are recognized not only by the neural networks that created them, but also universally across most AI systems which have been trained to recognize the same objects.”

—Tom White

In the video, you’ll see that humans cannot recognize what the AI drew. The rendering is too abstract, too unlike what we see and what we would draw ourselves. Note what White says, though, about other AI systems: they can recognize the object in these AI-produced drawings.

This is, I think, related to what is called adversarial AI, which I’ll discuss in a future post.


How machines ‘see’

I am fascinated by image recognition. I read about how ImageNet changed the whole universe of machine “vision” in 2009 in the excellent book Artificial Intelligence: A Guide for Thinking Humans, but I’m not going to discuss ImageNet in this post. (I will get to it eventually.)

To think about how a machine sees requires us first to think about human eyes vs. cameras. The machine doesn’t have a biological eyeball and an optic nerve and a brain. The machine might have one or more cameras to allow it to take in visual information.

Whether the machine has cameras or not, the images it receives are the same: digital images, made up entirely of pixels. This is true even if the visual inputs are video. The machine will need to sample that video, taking discrete frames from it to process and analyze.

So the first thing to absorb, as you begin to understand how a machine sees, is that it receives a grid of pixels. If it’s video, then there are a lot of separate grids. If it’s one still image, there is one grid. And how does the machine process that grid? It analyzes the differences between groups of pixels.
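You can see that grid for yourself with a few lines of Python, using the Pillow and NumPy libraries (the file name is made up):

```python
from PIL import Image
import numpy as np

image = Image.open("photo.jpg")          # any digital image
pixels = np.array(image)                 # the same image as a grid of numbers

print(pixels.shape)     # e.g. (480, 640, 3): 480 rows, 640 columns, 3 color values per pixel
print(pixels[0, 0])     # the top-left pixel, e.g. [142 187 201] for red, green, blue

# "Differences between groups of pixels": a crude edge map made by comparing
# each pixel to its neighbor on the right.
gray = pixels.mean(axis=2)                       # collapse color to brightness
edges = np.abs(np.diff(gray, axis=1)) > 30       # big jumps in brightness suggest edges
print(edges.sum(), "edge pixels found")
```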

This 4-minute video, from an artist and programmer named Gene Kogan, helped me a lot.

Most people have an idea (possibly vague) of how the human brain works, with neurons kind of “wired together” in a network. When we imagine a computer neural network, most of us probably factor in that mental image of a brain full of neurons. This is both semi-accurate and wildly inaccurate.

In his video, Kogan points out that an image-recognition system uses a convolutional neural network, and this network has many, many layers.

When he’s clicking down the list in his video, Kogan is showing us what the different layers are “paying attention to” as the video is continuously chopped into one-frame segments. The mind-blowing thing (to me) is that the layers feed forward and backward to each other — ultimately producing the result he shows near the end, when he can hold a water bottle in front of his webcam, and the software says it sees a water bottle.

Screenshot of man holding water bottle and neural net evaluation of video image
Above: Screenshot from 3:10 in the video

Notice too, that “water bottle” is the machine’s top guess at that moment. Its number 2 guess is “bow tie.” Its confidence in “water bottle” is not very high, as shown by the red bar to the left of the label. However, the machine’s confidence in “water bottle” is much higher than all the other things it determines it might be seeing in that frame.
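You can reproduce a rough version of that demo with a stock pretrained network. The sketch below uses a Keras model trained on ImageNet; the file name is made up, and the exact guesses and confidence scores will differ from the ones in the video.

```python
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = MobileNetV2(weights="imagenet")     # a convolutional network pretrained on ImageNet

img = image.load_img("frame.jpg", target_size=(224, 224))   # one frame from a webcam or video
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Top 5 guesses with confidence scores, like the list shown in the video.
for _, label, confidence in decode_predictions(model.predict(x), top=5)[0]:
    print(f"{label}: {confidence:.2f}")
```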

After watching this video, I understood why super-fast graphics-processing hardware is so important to image recognition and machine vision.

In tomorrow’s post, I’m going to say a bit more about these ideas and share a completely different video that also helped me a lot in my attempt to understand how machines see.

Creative Commons License
AI in Media and Society by Mindy McAdams is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Include the author’s name (Mindy McAdams) and a link to the original post in any reuse of this content.
