On Motivation
by JS
I’ve been traveling a great deal recently, first to the AAAI Fall Symposium then to EpiRob. One of the most interesting days of all this conference travel came at the tail end of the EpiRob conference, which coincided with the beginning of the workshop on intrinsic motivation called IM-CLeVeR. Part of the appeal was the presence of so many luminaries in my particular field (the reinforcement learning community was particularly well represented, as Andrew Barto and Richard Sutton were both in attendance).
I’m still compiling my notes from both conferences, and hope to distill some of those ideas into entries, but I thought I’d start with an intriguing idea from Professor Barto, that “motivation is the gradient of the value function.” In reinforcement learning, an agent tries to act so as to maximize reward over time. One approach is to compute value functions, which assign to each state of the world a value intended to estimate the future reward the agent can hope to collect from that state if it keeps acting as it currently does. Though I think this description is accurate, it is certainly horribly concise, so I’d recommend the book if you are intrigued enough to learn more.
Anyway, if an agent has a value function that represents the best it can expect in terms of future rewards from any state, then it has enough information to act optimally, since it simply needs to move to the reachable next state with the highest value. The gradient, or change in value from state to state, drives the choice of agent actions, and so can be considered a kind of motivation. We should probably complicate this further by noting that agents have to learn value functions, and that not all value functions are created equal. In fact, there’s a unique value function, the optimal value function, that represents the best an agent can do from any particular state. If the agent is gifted with this value function, then gradient as motivation makes sense. If not, then we have to weigh the need to change the value function to better approximate the optimal one against the need to follow the current value function in some greedy way.
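To make the idea concrete, here is a minimal sketch of following a value gradient. Everything in it is an assumption made for illustration rather than anything from Barto's talk: a five-state corridor with an absorbing goal at the right end, a reward of 1.0 for occupying the goal, a discount factor of 0.9, and the helper names themselves. It approximates the optimal value function by value iteration, then lets the agent step toward whichever neighboring state has the higher value.

```python
# Hypothetical toy world: a corridor of five states with a goal at the right end.
N_STATES = 5
GOAL = N_STATES - 1
GAMMA = 0.9          # discount factor (assumed value)
ACTIONS = (-1, +1)   # step left or step right


def step(state, action):
    """Deterministic transition; walls clip movement and the goal is absorbing."""
    if state == GOAL:
        return GOAL
    return min(max(state + action, 0), N_STATES - 1)


def reward(state):
    """Reward for occupying a state: 1.0 at the goal, 0.0 elsewhere."""
    return 1.0 if state == GOAL else 0.0


def value_iteration(sweeps=200):
    """Approximate the optimal value function with repeated Bellman backups."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        # The comprehension reads the old list before it is rebound.
        values = [
            reward(s) + GAMMA * max(values[step(s, a)] for a in ACTIONS)
            for s in range(N_STATES)
        ]
    return values


def value_gradient(values, state):
    """Change in value for each action: the 'motivation' to take it."""
    return {a: values[step(state, a)] - values[state] for a in ACTIONS}


def greedy_action(values, state):
    """Pick the action whose next state has the highest value."""
    gradient = value_gradient(values, state)
    return max(ACTIONS, key=lambda a: gradient[a])


def follow_gradient(values, start, max_steps=10):
    """Follow the greedy action from `start` until the goal or the step limit."""
    path = [start]
    while path[-1] != GOAL and len(path) <= max_steps:
        path.append(step(path[-1], greedy_action(values, path[-1])))
    return path


values = value_iteration()
path = follow_gradient(values, start=0)
```

In this toy world the values rise steadily toward the goal, so the agent starting at state 0 walks straight to the right. Note that the sketch hands the agent the (approximately) optimal value function up front; an agent that had to learn its values along the way would also need some exploration, which is left out here.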
[Aside: Thinking off the cuff, we can view the need to properly approximate the optimal value function as part of a kind of meta-value function, which doesn't just consider the values of world states relative to the reward function, but also considers the value of proper value function estimation. And so on down the rabbit hole...]
But all this complexity seems to avoid the tricky issue of motivation. For one thing, we, as the specifiers of the algorithm, are setting up the agent to follow value function gradients (or exploratory gradients) as a consequence of the way the problem is set up. In some sense, this is unappealing since it leaves aside any explanation for why an agent should follow value functions in the first place. Put another way, value function gradients as motivation for behavior only make sense in the context of a reward function that indicates good outcomes (if not the method of achieving those outcomes). So this just raises the question: where do rewards come from? This is a question that reinforcement learning conveniently avoids answering by assuming from the outset (at least in theory) that rewards are given.
“Where do rewards come from?” was, not surprisingly, the title of Andrew Barto’s workshop presentation, so I may very well just be recapitulating his own line of reasoning on the matter. His talk summarized a very interesting piece of work that looked at how evolution can act on reward functions that result in learning agents with better fitness. The presentation made a point about reward functions that I’ve thought of independently, but which psychologists have already enunciated in various forums. The point is this: reward signals are nothing special to the world, even though they are special to the agent. The universe does not care that you go hungry; you care that you go hungry. Drawing rewards as a distinct signal in standard reinforcement learning diagrams does not make sense. Rewards are just normal state signals with a special (internal and evolution-mediated) interpretation by the agent.
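One way to picture that last point in code: the environment hands back nothing but state, and each agent applies its own interpretation to it. This is only a sketch under invented assumptions; the `stomach_contents` variable, the numbers, and both reward functions are hypothetical and do not come from the work Barto presented.

```python
# Hypothetical sketch: the environment emits only state; each agent supplies
# its own reward interpretation. All names and numbers here are illustrative.

def environment_step(stomach_contents, ate_food):
    """Return the next observation. Note there is no reward in the output."""
    if ate_food:
        stomach_contents = min(1.0, stomach_contents + 0.5)
    else:
        stomach_contents = max(0.0, stomach_contents - 0.1)
    return {"stomach_contents": stomach_contents}


def hungry_agent_reward(observation):
    """An agent shaped (say, by evolution) to dislike an empty stomach."""
    return observation["stomach_contents"] - 1.0


def indifferent_agent_reward(observation):
    """An agent whose internal interpretation ignores hunger entirely."""
    return 0.0


observation = environment_step(stomach_contents=0.5, ate_food=False)
hungry_reward = hungry_agent_reward(observation)
indifferent_reward = indifferent_agent_reward(observation)
```

Both agents receive exactly the same observation, yet only one of them experiences it as bad. In this picture, evolution would be acting on the choice of reward function rather than on anything the environment emits.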

Comments
What you’re describing sounds very similar to game theory.
It is related, though the relationship is not always clear. In the standard reinforcement learning framework everything outside the agent is “the environment.” Game theory makes finer distinctions concerning the environment, and in particular assumes that the environment contains other opponent agents. Certain formal definitions of game theory are actually more general than formal definitions of reinforcement learning, but I generally like to think of game theory as trying to tackle the more specific problem of multi-agent interaction, instead of agent-environment interaction.