• Hirom@beehaw.org · 10 hours ago

    In my experience, a portion of bots obey robots.txt, but it can be tricky to find the user-agent string a given bot actually responds to.

    So I recommend a robots.txt that not only targets specific bots, but also tells all bots to avoid specific paths/queries.

    Example for DokuWiki:

    User-agent: *
    # DokuWiki internals (templates, plugins, scripts)
    Disallow: /lib/
    # Export views and user pages
    Disallow: /_export/
    Disallow: /user/
    # Dynamic actions and old revisions (?do=, ?rev=)
    Disallow: /*?do=
    Disallow: /*&do=
    Disallow: /*?rev=
    Disallow: /*&rev=
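
    To cover the first half of that advice (targeting specific bots by name), per-bot sections can be appended. GPTBot (OpenAI), ClaudeBot (Anthropic) and CCBot (Common Crawl) are published crawler user agents; treat this list as a starting point and check your access logs for what actually shows up:

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /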
    
    • irelephant [he/him]🍭@lemm.eeOP · 9 hours ago

      Would it be possible to detect GPTBot (or similar) by their user agent, and serve them different data?

      Can they detect that?

      • froztbyte@awful.systems · 8 hours ago

        yes, you can match on the user agent and then conditionally serve them other stuff (most webservers support this). nepenthes and iocaine are the currently preferred/recommended tools for serving them bot mazes
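
        a rough sketch of the user-agent matching side, as an nginx config (both blocks go in the http context; the addresses, ports and user-agent list are placeholders, and the tarpit is assumed to be a locally running nepenthes/iocaine instance):

        # pick an upstream per request: tarpit for known AI crawlers, the real site otherwise
        map $http_user_agent $upstream {
            default      127.0.0.1:8080;   # the real wiki/app
            ~*GPTBot     127.0.0.1:8893;   # assumed local nepenthes/iocaine port
            ~*ClaudeBot  127.0.0.1:8893;
            ~*CCBot      127.0.0.1:8893;
        }

        server {
            listen 80;
            server_name wiki.example.org;

            location / {
                proxy_set_header Host $host;
                proxy_pass http://$upstream;
            }
        }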

        the thing is that the crawlers will also lie (openai definitely doesn’t publish all its own source IPs, I’ve verified this myself), and will attempt a number of workarounds (like using residential proxies too)

  • db0@lemmy.dbzer0.com · 21 hours ago

    It’s a constant cat-and-mouse game atm. Every week or so we get another flood of scraping bots, which forces us to triangulate which fucking DC IP range we need to start blocking now. If they ever start using residential proxies, we’re fucked.
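
    A minimal sketch of the range-blocking part, assuming an nginx frontend; the CIDRs here are documentation placeholders, not real scraper networks:

    server {
        listen 80;
        server_name example.org;

        # replace with whatever DC ranges show up in your logs
        deny 203.0.113.0/24;
        deny 198.51.100.0/24;
        allow all;

        # ... rest of the site config
    }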