Archive for the “computer science” Category
Posts relating to research questions in computer science.
The shorthand “WTF” is common among developers. It does not actually stand for “worse than failure” but I run a clean ship here people. [Hint: let's just say that WTF is a somewhat crude interrogative. Sample usage: "Dude, WTF?" or, if you prefer, comic form.]
I just spent two days looking for a bug in my code whose fix was simply replacing
A = A + np.dot(features, features - env.gamma * newfeatures)
with
A = A + np.outer(features, features - env.gamma * newfeatures)
In my defense, I didn’t spend two full days looking for this mistake, but the process was so demoralizing that I couldn’t find much willpower to do productive work when I wasn’t actively debugging.
Takeaway lesson, triple check your linear algebra. If you get something wrong, you’re likely to have a hard time figuring out why, or worse, you may mistake the junk for the correct answer. [Aside: note that I was really burned by the fact that '+' is overloaded in this case, and so is valid for both matrix, matrix addition and scalar, matrix addition. Despite the scorch marks, I do like this feature.]
No Comments »
I compared two Python implementations for computing shortest paths in a previous post. This was all inspired by some Matlab code for computing Isomap that contained the following version of the Floyd-Warshall algorithm:
for k=1:N
D = min(D,repmat(D(:,k),[1 N])+repmat(D(k,:),[N 1]));
end
Of course, mirroring my Python comparison, I decided to recreate the Matlab version using outer products instead of repmat.
for k=1:N
D = min(D, D(:,k) * D(k,:));
end
As with my Python comparison, the implementation using outer products was faster (31.6 seconds to 59.3 seconds). The time difference between approaches was again about 30 seconds, which is very similar to the previous Python comparison.
The Matlab code ran about 10 seconds faster than the corresponding Python code in both cases, though you should probably take that result with a grain of salt as I was not careful to control for a number of possible confounding factors between the two experiments.
No Comments »
A research problem I’m working on involves Isomap, a method of dimensionality reduction that requires computing all shortest paths over relatively large graphs. In interpreted languages like Matlab this is often done through the use of the Floyd-Warshall algorithm, a simple dynamic programming approach to computing all shortest paths. In the reference Isomap code this algorithm takes only three lines:
for k=1:N
D = min(D,repmat(D(:,k),[1 N])+repmat(D(k,:),[N 1]));
end
The benefits of Floyd-Warshall are two-fold for languages like Matlab. The first is that the implementation is short and simple. The second is that the implementation can employ a number of linear algebra primitives that are highly optimized in languages like Matlab. In the code snippet above the repmat, addition, and min operations are all implemented as low-level highly optimized library calls. This means that Floyd-Warshall, despite the non-optimal runtime, is often the fastest Matlab approach to computing all shortest paths even on reasonably large problem sizes.
In my work, I’ve reimplemented Isomap in Python (which will soon appear in my algorithms gallery). My implementation of Floyd-Warshall uses NumPy for access to similarly optimized low-level linear algebra routines. Here’s my original Python code snippet:
for k in range(n):
adj = np.minimum( adj, np.add.outer(adj[:,k],adj[k,:]) )
return adj
Note that I used outer products in my code, instead of the repmat approach in the original Isomap implementation. I decided to see if using tile, the NumPy equivalent or repmat, would result in an even faster Floyd-Warshall implementation:
for k in range(n):
adj = np.minimum(adj, np.tile(adj[:,k].reshape(-1,1),(1,n)) + np.tile(adj[k,:],(n,1)))
return adj
To test these two approaches, I generated a few random graphs on 1000 nodes and computed all shortest paths several times (to account for variations in CPU usage). To my delight, it turns out that my original implementation ran an average of 30 seconds faster (42.8 seconds) than the approach using tile (72.7 seconds). Though more experimentation might be required to make this comparison definitive, I am trying to uncover possible reasons for the performance difference, and in particular why the Matlab code uses repmat instead of outer products (I have yet to compare performance of different Matlab versions of this algorithm).
4 Comments »
When pressed to explain my post on motivation the other day, in particular the idea of motivation as the gradient of the value function, I found my explanation, and indeed my own understanding, unsatisfactory. Since I like to aim for complete understanding, I thought I’d spend a little time on value function gradients and why they may serve as a technical definition of motivation.
To start we should probably have a target idea of motivation in mind if we hope to explain it in technical terms. From the OED we have:
The (conscious or unconscious) stimulus for action towards a desired goal, esp. as resulting from psychological or social factors; the factors giving purpose or direction to human or animal behaviour. Now also more generally (as a count noun): the reason a person has for acting in a particular way, a motive.
WordNet has the following:
The psychological feature that arouses an organism to action toward a desired goal; the reason for the action; that which gives purpose and direction to behavior.
So what does gradient of the value function mean? Consider the following environment and value function. The red square is the goal, where an agent receives a reward of 1.0. The agent is currently at the blue square and is trying to reach the goal. I’ve labeled each state with it’s expected discounted reward following the optimal policy of reaching the goal in the least number of states. The discount factor is 0.95.

Following a greedy policy, where the agent selects the action that brings it to the next state of highest value, would force the agent to move right (in this scenario the agent can move in the four cardinal directions unless obstructed, or not at all). Out of the two choices of actions, left or right, the right action has the higher value since it is closer to the goal and not subject to as much discounting. How would we describe the actions of this agent? We might say that the agent is moving up the gradient of the value function, since its path involves actions that seek states of higher and higher value.
Notice, however, that this characterization is tightly coupled with the way we’ve chosen to represent rewards and values. If we consider the undiscounted case, things look quite a bit different.

Now the agent’s choice of action does not matter all that much, since the agent can always get to the reward state, and the amount of time to get there does not matter. The point is not that one model of reward is more realistic than another, the point is simply that the model of reward matters, and by extension, the ability to use the gradient of value function as a meaningful proxy for reward just sort of begs the question of how (and why) rewards are the way they are.
As stated, the above definitions for motivation seem to do just as well as “gradient of the value function,” but they also seem to fail in the same way. The interesting question is not what is motivation, but why are there particular motivations.
No Comments »
I’ve been traveling a great deal recently, first to the AAAI Fall Symposium then to EpiRob. One of the most interesting days of all this conference travel came at the tail end of the EpiRob conference, which coincided with the beginning of the workshop on intrinsic motivation called IM-CLeVeR. Part of the appeal was the presence of so many luminaries in my particular field (the reinforcement learning community was particularly well represented, as Andrew Barto and Richard Sutton were both in attendance).
I’m still compiling my notes from both conferences, and hope to distill some of those ideas into entries, but I thought I’d start with an intriguing idea from Professor Barto, that “motivation is the gradient of the value function.” One approach to reinforcement learning, where an agent tries to act so as to maximize reward over time, is to compute value functions, which assign values to each state of the world that are intended to reflect estimates of future rewards agents can hope to achieve from those states acting as they are. Though I think this description is accurate, it is certainly horribly concise, so I’d recommend the book if you are intrigued enough to learn more.
Anyway, if an agent has a value function that represents the best an agent can expect in terms of future rewards for any state, then an agent has enough information to act optimally, since it simply needs to look to the next state with highest value. The gradient, or change in value from state to state, drives the choice of agent actions, and so can be considered a kind of motivation. We should probably complicate this further by noting that agents have to learn value functions, and that not all value functions are created equal. In fact, there’s a unique value function, the optimal value function, that represents the best an agent can do from any particular state. If the agent is gifted with this value function, then gradient as motivation makes sense. If not, then we have to consider the need to change the value function to better approximate the optimal, alongside the need to follow the value function in some greedy way.
[Aside: Thinking off the cuff, we can view the need to properly approximate the optimal value function as part of a kind of meta-value function, which doesn't just consider the values of world states relative to the reward function, but also considers the value of proper value function estimation. And so on down the rabbit hole...]
But all this complexity seems to avoid the tricky issue of motivation. For one thing, we, as the specifiers of the algorithm, are setting up the agent to follow value function gradients (or exploratory gradients) as a consequence of the way the problem is set up. In some sense, this is unappealing since it leaves aside any explanation for why an agent should follow value functions in the first place. Put another way, value function gradients as motivation for behavior only make sense in the context of a reward function that indicates good outcomes (if not the method of achieving those outcomes). So this just begs the question, where do rewards come from? This is a question that reinforcement learning conveniently avoids answering by assuming from the outset (at least in theory) that rewards are given.
“Where do rewards come from?” was, not surprisingly, the title of Andrew Barto’s workshop presentation, so I may very well just be recapitulating his own line of reasoning on the matter. His talk summarized a very interesting piece of work that looked at how evolution can act on reward functions that result in learning agents with better fitness. The presentation made a point about reward functions that I’ve thought of independently, but which psychologists have already enunciated in various forums. The point is this: reward signals are nothing special to the world, even though they are special to the agent. The universe does not care that you go hungry; you care that you go hungry. Drawing rewards as a distinct signal in standard reinforcement learning diagrams does not make sense. Rewards are just normal state signals with a special (internal and evolution-mediated) interpretation by the agent.
2 Comments »
In a previous post I mentioned that de Bruijn graphs are a useful way to model fixed memory agent strategies. I decided to code up a version of de Bruijn graphs using Python + NetworkX to show how this representation might be useful. You can find the source here. If we use de Bruijn graphs, we can replace the dynamic programming approach of the original paper with an application of Dijkstra’s algorithm to find the shortest (weighted) path.
The resulting running time if given this representation is actually less than the running time of the algorithm presented in the paper, but the cost of generating the representation from the original normal form game is quadradic in the number of available agent actions.
No Comments »
[ For background on game theory I recommend starting here. ]
In two player iterated games, each player chooses an action each round. The reward each player receives is given by a payout matrix, and is based on the joint action of both players. In other words, if I choose action and my opponent chooses action , then our joint reward is given by where is some payout matrix giving the reward for each possible joint action of me and my opponent.
If you are playing an agent with some fixed memory size and strategy, you may, as the opponent, want to reason about what the agent will play and, in particular, what you should play to maximize your own return every round. This is the topic of this paper.
While trying to understand the paper, and unsuccessfully trying to prove a conjecture in the paper concerning fixed strategy agents with non-trivial memory, I came across an interesting way to model the problem — de Bruijn graphs.
Since these are named graphs, you should assume that mathematicians know a great deal about them. They seem to come up in systems research as well, as the graphs have nice diameter properties that are beneficial in state based protocols and networks. The construction of a de Bruijn graph is simple. Each node is a length list of symbols. There are nodes in the graph (i.e. every possible list appears). If we view every node as a memory queue, then a directed edge occurs between two nodes if the destination node can be created from the origin node by putting a symbol at the front of the memory queue and “popping” off a symbol at the end of the memory queue. In other words, this is like remembering something recent and forgetting something old.
In the two player scenario above, our opponent is basing her actions on my previously recorded actions. The next action I take results in both some reward based on our joint actions, and an update to the opponent’s memory. In other words, we can model my actions in the game as a reward function over the de Bruijn graph that records both the state of the game in terms of the opponent’s memory, and my possible actions, which move from state to state (or node to node). The edges are weighted according to the payout matrix. This does not cause any problems since the opponent’s strategy, given the opponent’s current memory state, is entirely deterministic. As the opponent strategy changes, the edge weights change. As the opponent’s memory size changes, or the number of actions available to me changes, the underlying graph structure changes (de Bruijn graphs are parametrized by the number of symbols (e.g. my possible actions) and the memory size .
We know that the diameter of unweighted de Bruijn graphs is since to get to any node requires actions that have the effect of “overwriting memory” and thus terminating in the appropriate node. The conjecture in the paper mentioned above seems to be equivalent to the following question:
What is the length of the longest shortest path in a weighted de Bruijn graph?
This is a somewhat confusingly posed question. For instance, “length” refers to the number of hops in a path, but “shortest” means shortest in terms of the weight of the edges. For arbitrary weighting of edges, the diameter can probably be made arbitrarily large, but in the case where the edge weights are restricted to reflect a small number of rewards (relative to the size of the graph), and the rewards are somewhat evenly distributed across the edges, it seems like we can put some strong bounds on the length of shortest paths. If we have a mechanism for doing that, then we can reason about all sorts of fixed memory deterministic strategies for agents.
No Comments »
While reading Pattern Classification by Duda, Hart, and Stork, I came across what is known as the Ugly Duckling Theorem. The theorem is simple to state.
Insofar as we use a finite set of predicates that are capable of distinguishing any two objects considered, the number of predicates shared by any two such objects is constant, independent of the choice of the two objects.
You can think of the term predicate as any kind of yes or no question we may ask about the object in order to determine that object’s place in a hierarchy. For this discussion you may find it useful to keep in mind an example hierarchy, such as the biological hierarchy. If we naively try to determine how related two animals are, we might ask certain kinds of yes or no questions of each and compare the results. For example, is the animal warm blooded? Does the animal live on land?
[Aside: Biological classification is much more complex than the cartoonish example I'm putting forward as an illustrative example of the use of predicates in determining similarity. The Ugly Duckling Theorem, however, applies to any grouping scheme, including the complex evolutionary analysis that informs our modern understanding of the classification of living organisms. ]
Classifying objects into categories is a central problem in machine learning. Many of the algorithms I’ve explored on this blog were designed to make various facets of that process tractable on computers. I think it is worthwhile to explore some of the limits of what we should expect machine learning algorithms to do. For example, can we consider a simple approach to determining similarity based on predicate knowledge?
The somewhat surprising answer is no. For any finite set of predicates, any two objects will agree on a constant number of such predicates. The tricky observation that Watanabe (this theory’s originator) makes is that we also need to consider all logical relations of predicates. That is, if we use predicates A and B to determine the similarity between two objects, we ought to consider predicates constructed out of logical relations among A and B. Taking all of these together, we find that any two objects will agree on the same number of these predicates. In other words, every object is equidistant from every other object.
Clearly, we are able to measure similarity between objects, and we have machine learning algorithms that are also able to infer distinctions based on similarity measures. The way around the Ugly Duckling Theorem is to weight propositions differently. Not all propositions for comparing objects are created equal. Intuitively, the difference between plants and animals is “larger” than the difference between warm and cold-blooded animals.
One way to see how the number of shared predicates might be object independent is to consider some predicates, Red and Black, and two objects, one red and one black. Now we might intuit that these objects do not share any predicates, but in fact they do. They share the predicate “Red or Black.” It may happen that “Red or Black” is not a predicate that we would ever consider important, but its lack of importance is a property we place on the predicate, and prevents the sort of predicate counting approach that the Ugly Duckling Theorem rules out.
I still have some questions I’m trying to work through with regards to this theorem. For example, it seems like using all prepositions to determine similarity is like trying to learn a concept class consisting of all Boolean formula in N variables. I suspect we can draw some sort of equivalence between the Ugly Duckling Theorem and the idea that you have to restrict the space of possible hypotheses when trying to learn a classification function, even though the theorem applies to predicates, whereas hypothesis classes are essentially spaces of functions.
The other issue is how a weighting of propositions comes about. For us, does evolution or development provide our intrinsic propositional biases? It seems like effective similarities arise because they are useful, not because they are general. Can we capture this in a kind of improved version of the Ugly Duckling Theorem?
No Comments »
I went to a very interesting talk by Lance Fortnow on applying ideas from computational complexity to game-theory situations. The key problem from the game theory side is articulating a model of bounded rationality. Fortnow’s insight is that “efficiently computable” is the appropriate model for bounded rationality.
One idea that I hadn’t seen before is the notion of Program Equilibrium. The most common type of equilibrium in game theory is Nash equilibrium, where everything else being fixed, any change to a player’s strategy decreases that player’s reward. In program equilibrium, a player is able to spend time reasoning about the other player’s strategy (in the form of a program that the other player runs to play the game), but the time spent doing so discounts any reward received at the end of the game.
No Comments »
Decision tree learning algorithms have been around a long time but are something I have only come to appreciate lately, since, truth be told, I did not previously spend any serious time trying to understand decision tree learning.
One key benefit of learning decision trees is that the resulting tree function can be interpreted as a sequence of simple questions that can classify data. Say, for example, you want to figure out whether a loan will default or not. Loans have a number of parameters, like loan-to-value, interest rate, and credit score of the borrower. You may try to come up with a set of if-then criteria based on these parameters that lead you to the answer. This is what decision tree learning algorithms generate.
I’ve implemented a simple version, ID3, and added it to my collection of algorithms in the sidebar. You’ll note that I used Sqlite instead of trying to represent that training data using dictionaries. It turns out that a lot of the primitives that go into decision tree algorithms are naturally expressed as operations on table data (see for example the entropy calculation). Sqlite (or any RDMS) is ideally suited for this.
No Comments »
|