chizumatic.mee.nu

August 05, 2007

Cyveillance

Every once in a while I see some obnoxious spider crawling through a bunch of files on my server. I've got a "robots" file which is supposed to prevent that, but a lot of people out there who run crawlers aren't very principled about that kind of thing. (My robots file says, "Googlebot, MSNBot, Answer.com-bot, come on in. Everybody else, go to hell." But it doesn't help very much, since most crawlers ignore it.)
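For anyone who wants to check what a policy like that actually tells a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots file contents, the user-agent tokens, and the `allowed` helper are all assumptions for illustration; the real file on the server isn't reproduced here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the spirit of the policy described above.
# The exact tokens (and the real file) are assumptions, not the site's actual contents.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: msnbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, path: str = "/some/page.html") -> bool:
    """Return True if a crawler that honors robots.txt may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "msnbot", "SomeOtherBot"):
        print(agent, allowed(agent))
```

An empty `Disallow:` line means "nothing is off limits", while `Disallow: /` under `User-agent: *` shuts out everyone not named above it. None of this is enforced, of course; it only describes what a polite crawler would do.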

A lot of those out-of-control crawlers turn out to be in China, and I've toyed occasionally with blocking China entirely in my firewall. It would be easy to do; it's all in a couple of class A IP blocks. (Somewhere I've got a link to a list.)
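The membership test behind that kind of block is simple. A rough sketch with Python's standard `ipaddress` module follows; the network in `BLOCKED_NETWORKS` is a placeholder (a private range), not a real country allocation, since the actual list isn't given here, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder only: 10.0.0.0/8 is a private range standing in for the real
# list of /8 blocks, which is not given in the post.
BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in ("10.0.0.0/8",)]


def is_blocked(ip: str) -> bool:
    """Return True if `ip` falls inside any of the blocked networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(is_blocked("10.1.2.3"))
    print(is_blocked("192.0.2.1"))
```

A real firewall would do this match in its own rule language; the sketch only shows why a handful of /8 entries is enough to cover a very large address range.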

But I saw a different one today: [38.100.41.102]. ARIN just says that the 38 IP block belongs to "Performance Systems International". It's one of the original companies that were granted class A blocks of IPs, back in the early days of the commercial Internet. So this tells me jack.

A reverse DNS on that IP comes up blank for me. But a trace route... ah, yes. A trace route stops at 38.112.21.142, which reverse DNS's to CYVEILLANCE.demarc.cogentco.com.
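The reverse lookup step is easy to script. Here is a minimal sketch using Python's standard `socket.gethostbyaddr`; the `reverse_dns` wrapper and its `lookup` parameter are assumptions added so the function can be exercised without a live resolver.

```python
import socket


def reverse_dns(ip, lookup=socket.gethostbyaddr):
    """Return the PTR hostname for `ip`, or None if the lookup comes up blank.

    `lookup` defaults to the real resolver call; it is a parameter only so a
    stand-in can be supplied when no network is available.
    """
    try:
        hostname, _aliases, _addresses = lookup(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        # No PTR record, or the name service could not answer.
        return None


if __name__ == "__main__":
    # Results depend on live DNS and may differ from what is described above.
    for ip in ("38.100.41.102", "38.112.21.142"):
        print(ip, reverse_dns(ip))
```

A trace route has no equally small standard-library equivalent, so that part is left to the usual command-line tool.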

Cyveillance is scum. (That's my opinion, which is protected speech under the First Amendment.) They're in the business of scanning the web looking for any references to their customers' names or products which might be negative, so that the customers can issue C&D's and/or threaten with libel suits. Or at least that's how they began; apparently they've branched out into other things since then. But the foundation of their business is data scraping.

It's no wonder that Cyveillance's crawler ignored my robots file; it's exactly the pages which are protected that way which are most likely to be juicy.

Posted by: Steven Den Beste in Site Stuff at 02:10 PM

1 If I understand you correctly (and I admit to being outside my technical depth here), you are allowing Google to crawl your site, but are telling www.ask.com to "go to hell". I find this surprising, since Ask.com usually gives me much better search results than Google. Is there a specific reason for not allowing Ask.com, or did it just fall under "all the rest" when you were creating your robots file?

Posted by: Siergen at August 05, 2007 02:48 PM (IW5xv)

2 I am permitting the Googlebot, the MSNBot, and the one that identifies itself as "AskJeeves".

Posted by: Steven Den Beste at August 05, 2007 03:33 PM (+rSRq)

3 I think Ask.com *is* AskJeeves. They changed the name when they got sued.

Posted by: metaphysician at August 05, 2007 08:51 PM (hnYuE)

4 Ask.com used to be called "Teoma" while they were in beta testing. According to their web site FAQ, their crawler looks for the text "TEOMA" in the robots.txt file.

However, the FAQ also says that server logs should report "User-Agent: Mozilla/2.0 (compatible; Ask Jeeves/Teoma)" if Ask.com has crawled the site, so maybe they are affiliated with AskJeeves after all...

Posted by: Siergen at August 06, 2007 02:27 PM (IW5xv)
