Inktomi (Yahoo!, AOL, MSN, etc.) banned
Red-tailed hawks nested in a sycamore tree that towered over our house at the end of Calle de los Helechos. I was 11 and at the height of my Field & Stream phase. I lusted for data, and Gray's Anatomy of the Human Body didn't have it.
<hat w3c_p3p="on">
The USG has ordered an indexing and query service to provide the USG with some number of queries received. It is argued that queries do not, in and of themselves, contain information which tends to identify an individual. This is false. My first query of the interlibrary loan service was for The Art of Falconry, the 1943 English translation of De Arte Venandi cum Avibus, the 1248 Latin work by Frederick II of Hohenstaufen, a European prince who wrote about the natural world like an Indian four centuries before Europeans enslaved Indians, and about comparative biology like Darwin six centuries before Darwin.

My request was a pleasant surprise to the librarian, who'd given me run of the library since I was in 3rd or 4th grade, and it was unique. Not simply unique for pre-teens in rural California, but unique for the entire set of all persons using the California Inter-Library Loan system that year. Some queries are inherently unique, or non-unique only within a very small universe of requestors. And that is where the claim advanced by the USG fails. Within any sample, and the USG is asking for all queries made in any 168-hour period, there are queries which are unique, and queries which are unique within some context (scope). Very few people generate queries on Iranian politicians, details of the Iraq-Iran military operations, Russian reactor export financing, or improvements in centrifuge technology, let alone sequences of "independent" queries with high correlation. Those queries might be "anonymous" if I used multiple "anonymizer" services to query Google, buried narrow queries in broader, distracting queries, and spread them out over time. Otherwise I may as well digitally sign the damn things.
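To make the uniqueness point concrete, here is a minimal Python sketch. It assumes a made-up list of query strings (none of them come from any real log) and simply reports the queries seen at most k times in the sample. The function name rare_queries and the threshold k are illustrative only, not anything a search service or the USG is known to use.

```python
from collections import Counter


def rare_queries(queries, k=1):
    """Return the queries that occur at most k times in the sample, sorted."""
    # Normalize lightly so trivial case/whitespace differences don't hide repeats.
    counts = Counter(q.strip().lower() for q in queries)
    return sorted(q for q, n in counts.items() if n <= k)


# Hypothetical one-week sample; these strings are invented for illustration.
sample = [
    "weather", "weather", "movie times", "movie times", "weather",
    "de arte venandi cum avibus 1943 translation",
]

if __name__ == "__main__":
    # With k=1, only the query that a single requestor made is reported.
    print(rare_queries(sample))
```

The common queries disappear into the crowd; the narrow one stands alone, and anyone holding the sample can see that it does.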
The USG has also ordered the same indexing and query service to provide the USG with some number of query replies. It is argued that URLs do not, in and of themselves, contain information which tends to identify an individual. This profoundly misstates the actual order and the data disclosed. The USG is not demanding some random, and unique, subset of all URLs known to the service provider; rather, the USG is demanding some universe of referents, whose non-uniqueness discloses what's "hot" and what's not within that sample universe.
If you use Google simply to find out when
</hat>
With that as a preamble, the instance of Apache that serves up Wampum now denies (403 return) these netblocks:
This has the effect of ending the indexing access to Wampum and its archives, effective January 21st, by Inktomi, a service used by Yahoo!, AOL, and MSN, all of whom have decided to be complicit with this apparently illegal invasion of the privacy of American citizens.
Next week I'll take up the issue with my peers, the other authors of the W3C's P3P standard on data collection and privacy. At the moment we're distracted with whether or not the Web Applications Working Group's "ping" (think technorati) allows for the collection of data that tends to identify individuals (of course it does).
That reminds me, blo.gs is owned by Yahoo!, so trackback pings sent to blo.gs (one of the defaults for Movable Type) are data that is provisioned to the USG, and for that reason Wampum no longer sends trackback pings to blo.gs.
If the USG were interested in the metrics for on-line porn, rather than ubiquity of governmental access to private (corporate) and personal (individual) data, its methodology would reflect that interest, rather than the ubiquity of access interest. And Gray's Anatomy of the Human Body is appropriate reference material for teens. It's how things work. Down there and everywhere else.
Comments
Revolt of the nerds!
Seriously...this is an impressive stand to take. With your own set of open source tools and ownership of the boxen, you can do this. [and thank you for listing all the versions of apache, php, mysql etc.] How could us lesser geeks fend off crawlers from cowardly indexers like Yahoo?
Will your advertisers have questions about this step? Does it have any negative impact on traffic? Not much I would imagine.
To read your exposition on how library records are in fact fairly open books for sleuthing out particular subscribers is a chilling exercise.
Posted by: greensmile | January 23, 2006 02:21 PM
Will your advertisers have questions about this step?
No. We don't do cents-per-thousand (CPM). Besides, we couldn't get NARAL/DCCC/DSCC/Emily'sList/... to do buys, so the buyers we have (1) actually cares about who reads the Koufax threads, not how many vacant robots scuttle across the floor of our digital pond.
Does it have any negative impact on traffic?
Inktomi's spider is one of the worst major SE spiders, so it is a win in bandwidth and cycles to ban the Inktomi netblocks. However, the real benefit happens when I move the "deny policy" from Apache to the host operating system packet filter (and even more so to the Cisco router's ACLs, later).
403's returned by Wampum to Inktomi spiders since noon 21/Jan: 1828 (indexing)
search.yahoo.com hits received by Wampum since 01/Jan: 1040 (query)
Note the equivalence: one day of Inktomi spidering is approximately as many hits as one month of hits from Yahoo! subscribers using Yahoo! search. Pretty expensive, neh? For every post you write that some human reads, Yahoo! had to suck it down 30 fricking times. They're amazingly inefficient and I've no idea why they are tolerated, let alone still in business.
Here is about 1k of search.yahoo.com queries received at Wampum, in order of incidence, and the Wampum post keyed to the query: link.
How could us lesser geeks fend off crawlers from cowardly indexers like Yahoo?
The "how to" is not so hard. In the vhost entry for Wampum in the file /usr/local/apache2/conf/httpd.conf there is the following:
# deny INKT CIDRs
Deny from 66.196.64.0/18
Deny from 68.142.192.0/18
Deny from 72.30.0.0/16
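As a rough illustration of what those three Deny lines cover, here is a minimal Python sketch using the standard ipaddress module. It is not part of the Apache configuration. The function name is_denied and the sample addresses are assumptions chosen for illustration; only the three CIDR blocks come from the config above.

```python
import ipaddress

# The three netblocks from the Deny lines above.
DENIED_NETBLOCKS = [
    ipaddress.ip_network("66.196.64.0/18"),
    ipaddress.ip_network("68.142.192.0/18"),
    ipaddress.ip_network("72.30.0.0/16"),
]


def is_denied(client_ip: str) -> bool:
    """Return True if client_ip falls inside any of the denied netblocks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in block for block in DENIED_NETBLOCKS)


if __name__ == "__main__":
    # Hypothetical client addresses, not taken from any log.
    for ip in ("66.196.90.1", "72.30.255.254", "192.0.2.10"):
        print(ip, "denied" if is_denied(ip) else "allowed")
```

A /18 spans 64 consecutive /24s, so the first block runs from 66.196.64.0 through 66.196.127.255; the same check is handy for grepping an access log before and after the ban.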
For bloggers whose hosting providers encourage or require them to use the .htaccess mechanism, I suppose something similar has the same effect. I'll look into it.
Posted by: EBW | January 23, 2006 08:48 PM
.htaccess version:
# With "Order allow,deny", a matching Deny overrides the Allow
Order allow,deny
Deny from 66.196.64.0/18
Deny from 68.142.192.0/18
Deny from 72.30.0.0/16
Allow from all
Posted by: Chris Clarke | January 24, 2006 10:12 AM
EBW, Chris:
You guys are inspiring. Unfortunately all inspiration ever produces in me is attempts at aphorisms:
When one shares power, one risks reducing their own powers.
When one shares knowledge one gives power yet their own power is not reduced. That is how power is multiplied in the hands of the meek.
So long and thanks for all the HTTPD settings.
Posted by: greensmile | January 24, 2006 11:32 AM