If lots of people delete their cookies, and NAT and other technologies have pissed all over the single machine = single IP concept, then how do you reliably know the number of different people visiting a website?
Their solution is to do away with a single method and use a hierarchy of steps to determine if we have a unique visitor.
Before I detail the steps, it’s time to make the paradigm shift. Here it is:
We have been assuming that we can use a single method to identify unique individuals. We have been looking for yes-no answers and absolute numbers. We have done all the analysis within the framework of a single software system. We can’t do this any more. No single test is perfectly reliable, so we have to apply multiple tests. Some of those tests yield yes-no answers, and some of them yield probabilities, so the count of unique visitors will be a probabilistic estimate. Some of the tests depend on knowledge of IP topology, so we can’t restrict our analysis to a confined block of data analyzed by an isolated system.
In a nutshell: to determine a web metric we should apply multiple tests, not just count one thing.
The Magdalena and Thomas methodology
Each of these steps is applied in order:
1. If the same cookie is present on multiple visits, it’s the same person.
2. We next sort our visits by cookie ID and look at the cookie life spans. Different cookies that overlap in time are different users. In other words, one person can’t have two cookies at the same time.
3. This leaves us with sets of cookie IDs that could belong to the same person because they occur at different times, so we now look at IP addresses.
4. We know some IP addresses cannot be shared by one person. These are the ones that would require a person to move faster than possible. If we have one IP address in New York, then one in Tokyo 60 minutes later, we know it can’t be the same person because you can’t get from New York to Tokyo in one hour.
5. This leaves us with those IP addresses that can’t be eliminated on the basis of geography. We now switch emphasis. Instead of looking for proof of difference, we now look for combinations which indicate it’s the same person. These are IP addresses we know to be owned by the same ISP or company.
6. We can refine this test by going back over the IP address/Cookie combination. We can look at all the IP addresses that a cookie had. Do we see one of those addresses used on a new cookie? Do both cookies have the same User Agent? If we get the same pool of IP addresses showing up on multiple cookies over time, with the same User Agent, this probably indicates the same person.
7. You can also throw Flash Shared Objects (FSOs) into the mix. FSOs can’t replace cookies, but if a visitor’s browser supports them you can use FSOs to record cookie IDs. This way Flash can report to the system all the cookies a machine has held. In addition to identifying users, you can use this information to understand the cookie behavior of your Flash users and extrapolate to the rest of your visitor population.
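To make the hierarchy a bit more concrete, here is a rough Python sketch of how steps 2, 4 and 6 might be applied to a pair of cookies. This is my own illustration, not code from Magdalena and Thomas. The `Visit` record, the `compare_cookies` function, the three result labels and the 1000 km/h speed limit are all assumptions made up for the example, and it assumes each visit already carries a latitude/longitude from some IP geolocation lookup. Step 5 (same ISP or company) and step 7 (FSOs) are left out because they need data this sketch doesn't have.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

# Assumed upper bound on how fast a person can travel (roughly an airliner).
MAX_SPEED_KMH = 1000.0


@dataclass(frozen=True)
class Visit:
    """Hypothetical visit record; lat/lon come from an assumed IP geolocation lookup."""
    cookie_id: str
    timestamp: float  # seconds since the epoch
    ip: str
    lat: float
    lon: float
    user_agent: str


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    p1, p2 = radians(lat1), radians(lat2)
    dlon = radians(lon2 - lon1)
    h = sin((p2 - p1) / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def impossible_travel(v1, v2):
    """Step 4: True if nobody could get from v1's location to v2's in the time between them."""
    hours = abs(v2.timestamp - v1.timestamp) / 3600.0
    return haversine_km(v1.lat, v1.lon, v2.lat, v2.lon) > MAX_SPEED_KMH * hours


def compare_cookies(cookie_a, cookie_b, visits):
    """Return 'different', 'probably same' or 'unknown' for two cookie IDs.

    Both cookies must appear in `visits` at least once.
    """
    va = sorted((v for v in visits if v.cookie_id == cookie_a), key=lambda v: v.timestamp)
    vb = sorted((v for v in visits if v.cookie_id == cookie_b), key=lambda v: v.timestamp)
    span_a = (va[0].timestamp, va[-1].timestamp)
    span_b = (vb[0].timestamp, vb[-1].timestamp)

    # Step 2: cookies whose life spans overlap belong to different people.
    if span_a[0] <= span_b[1] and span_b[0] <= span_a[1]:
        return "different"

    # Step 4: compare the last visit of the earlier cookie with the first of the later one.
    earlier, later = (va, vb) if span_a[1] < span_b[0] else (vb, va)
    if impossible_travel(earlier[-1], later[0]):
        return "different"

    # Step 6: a shared IP address plus a shared User Agent suggests the same person.
    shared_ips = {v.ip for v in va} & {v.ip for v in vb}
    shared_agents = {v.user_agent for v in va} & {v.user_agent for v in vb}
    if shared_ips and shared_agents:
        return "probably same"

    return "unknown"


# Example: one cookie seen in New York, another in Tokyo 60 minutes later.
sample = [
    Visit("c1", 0.0, "198.51.100.7", 40.71, -74.01, "UA-1"),
    Visit("c2", 3600.0, "203.0.113.9", 35.68, 139.69, "UA-1"),
]
print(compare_cookies("c1", "c2", sample))  # prints "different"
```

Note that only the first two tests give yes-no answers. The last one is a "probably", which is exactly why the final unique visitor count ends up being an estimate rather than an absolute number.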