Web Site Metrics

by JS

When I look at the statistics for this site, I’m often startled by the number of page views I see in a given month. I’ve had months with 20,000 page views. That's certainly not a lot by any commercial metric, but for a backwater blog with approximately five regular readers (judging by the number of Google Reader subscribers), the number of page requests seems way too large.

So where do they come from? Well, I can think of a couple of sources. One, my own use of the dashboard may be generating quite a few web server requests that my stats provider is misinterpreting as legitimate traffic. Alternatively, maybe those five Google Reader subscribers are really, really interested in what I have to say. Google may also be crawling my site quite a bit.

Or maybe it’s spam.

I could probably use a service that will calculate better statistics than my host provides, but I want to consider this problem as if I only had access to the messy data that I have, and can’t change it. What to do? This is actually a fairly common dilemma in science. We’d like to have data X (the number of real live individual fans of the site) but instead we have data Y (the number of server requests). How do we go from Y to X?

Now, I could try to build a complicated model of how Y and X are related, plug in the Y data (which I have), and then read the model's estimates of the X data. A sufficiently accurate model would provide precisely what I want. I don’t have such a model, but what I do have is a proxy for how much of the site's traffic is spam: the ratio of real comments to spam comments. Of course, this underestimates the legitimate traffic, because not every visitor comments, but the ratio is useful for producing a lower bound.

You’ll be happy to know that 99.9% of comments on this site are spam. Conclude what you will (my spam overlords).
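
For the curious, here is a back-of-the-envelope sketch of that estimate in Python. It assumes (and this is a big assumption) that the mix of real and spam comments mirrors the mix of real and spam page requests. The function name and the comment counts of 1 and 999 are made up for illustration, chosen only to match the 99.9% figure above; the 20,000 page views is the monthly number mentioned earlier.

```python
def estimate_legit_views(page_views, real_comments, spam_comments):
    """Scale total page views by the share of comments that are real.

    Assumes the comment mix is a fair proxy for the traffic mix,
    which is an illustrative simplification, not a validated model.
    """
    total_comments = real_comments + spam_comments
    if total_comments == 0:
        raise ValueError("need at least one comment to form a ratio")
    legit_share = real_comments / total_comments
    return page_views * legit_share


# Hypothetical counts: 1 real comment per 999 spam comments (99.9% spam).
estimate = estimate_legit_views(20_000, real_comments=1, spam_comments=999)
print(f"Estimated legitimate page views: {estimate:.0f}")
```

Under those assumptions, a 20,000-view month works out to roughly 20 legitimate page views.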