"Vision, Instruction and Action" describes a sophisticated integrated system called Sonja that takes instruction, can interpret its environment visually, and can play games (in this case the video game, Amazon) on its own. Sonja integrates advances in intermediate visual processing, interactive activity, and natural language pragmatics. In demonstrating that such systems, rare in artificial intelligence, are possible, David Chapman shows how discoveries in visual psychophysics can be incorporated into AI, how complex activity can result from participation rather than plan following, and how physical contest can be used to interpret indexical instructions. Sonja is able to play a competent beginner's game of Amazon autonomously and at the same time can also make flexible use of human instructions in knowing how to kill off monsters, pick up and use tools, and find its way in a dungeon maze. It extends the author's previous work in developing a new theory of activity by addressing linguistic issues and providing a better understanding of the architecture underlying activity, incorporating many technical improvements.
Sonja also models several pragmatic issues in computational linguistics, focusing on external reference and including the processing of linguistic repairs and the use of temporal and spatial expressions. It connects language use with more detailed and realistic theories of vision and activity. In vision research, Sonja provides an implementation of a unified visual architecture and demonstrates that this architecture can support a serious theory of activity. It offers the first demonstration that various visual mechanisms previously proposed on psychophysical, neurophysiological, and speculative computational grounds can be made useful by connecting them with a natural task domain.
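As a loose illustration of the kind of intermediate visual mechanism at issue, the sketch below composes two primitive operations, a directional scan and a region-marking flood fill, over a toy occupancy grid and uses them to answer task-level questions ("is there a wall ahead?", "is the exit reachable?"). This is an assumption-laden stand-in, not Sonja's visual system: the function names, grid encoding, and example room are invented for this sketch.

```python
import numpy as np

def scan_along_ray(grid, start, step):
    """Primitive operation: walk from `start` in direction `step` and return
    the first occupied cell hit, or None if the image border is reached.
    `grid` is a toy occupancy image: 1 = obstacle, 0 = free space."""
    (r, c), (dr, dc) = start, step
    while 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]:
        if grid[r, c]:
            return (r, c)
        r, c = r + dr, c + dc
    return None

def mark_connected_region(grid, seed):
    """Primitive operation: flood-fill 'colouring' of the free-space region
    containing `seed`, returning a boolean marker image."""
    marked = np.zeros(grid.shape, dtype=bool)
    stack = [seed]
    while stack:
        r, c = stack.pop()
        if not (0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]):
            continue
        if grid[r, c] or marked[r, c]:
            continue
        marked[r, c] = True
        stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return marked

# A composed routine in service of a task: check for a wall directly ahead
# and whether the exit is reachable from the player's position.
room = np.array([[1, 1, 1, 1, 1],
                 [1, 0, 0, 0, 1],
                 [1, 0, 1, 0, 0],   # the gap on the right edge is the exit
                 [1, 0, 0, 0, 1],
                 [1, 1, 1, 1, 1]])
player, exit_cell = (3, 1), (2, 4)
wall_ahead = scan_along_ray(room, player, step=(-1, 0))       # looking "up"
exit_reachable = mark_connected_region(room, player)[exit_cell]
print(wall_ahead, exit_reachable)                             # (0, 1) True
```

The mechanisms themselves are generic; what the blurb emphasizes is that they earn their keep only when composed to answer the specific questions a task like the game actually poses.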