• Hirom@beehaw.org · 10 hours ago

    In my experience, a portion of bots obey robots.txt, but it can be tricky to find the user-agent string a given bot actually responds to.

    So I recommend a robots.txt that not only targets specific bots, but also tells all bots to avoid specific paths/queries.

    Example for DokuWiki:

    User-agent: *
    # DokuWiki internals (templates, plugins, scripts)
    Disallow: /lib/
    # Export views and user pages
    Disallow: /_export/
    Disallow: /user/
    # Dynamic actions and old revisions (?do=, ?rev=)
    Disallow: /*?do=
    Disallow: /*&do=
    Disallow: /*?rev=
    Disallow: /*&rev=
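
    To cover the first half of that advice (targeting specific bots by name), per-bot sections can be appended. GPTBot (OpenAI), ClaudeBot (Anthropic) and CCBot (Common Crawl) are published crawler user agents; treat this list as a starting point and check your access logs for what actually shows up:

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /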
    
    • irelephant [he/him]🍭@lemm.eeOP · 9 hours ago

      Would it be possible to detect GPTBot (or similar) by their user agent, and serve them different data?

      Can they detect that?

      • froztbyte@awful.systems · 8 hours ago

        yes, you can match on the user agent and then conditionally serve them other stuff (most webservers support this). nepenthes and iocaine are the currently preferred/recommended tools for serving them bot mazes
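
        a rough sketch of the user-agent matching side, as an nginx config (both blocks go in the http context; the addresses, ports and user-agent list are placeholders, and the tarpit is assumed to be a locally running nepenthes/iocaine instance):

        # pick an upstream per request: tarpit for known AI crawlers, the real site otherwise
        map $http_user_agent $upstream {
            default      127.0.0.1:8080;   # the real wiki/app
            ~*GPTBot     127.0.0.1:8893;   # assumed local nepenthes/iocaine port
            ~*ClaudeBot  127.0.0.1:8893;
            ~*CCBot      127.0.0.1:8893;
        }

        server {
            listen 80;
            server_name wiki.example.org;

            location / {
                proxy_set_header Host $host;
                proxy_pass http://$upstream;
            }
        }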

        the thing is that the crawlers will also lie (openai definitely doesn’t publish all its own source IPs, I’ve verified this myself), and will attempt a number of workarounds (like using residential proxies too)

  • db0@lemmy.dbzer0.com · 21 hours ago

    It’s a constant cat-and-mouse game atm. Every week or so we get another flood of scraping bots, which forces us to triangulate which fucking DC IP range we need to start blocking now. If they ever start using residential proxies, we’re fucked.
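
    A minimal sketch of the range-blocking part, assuming an nginx frontend; the CIDRs here are documentation placeholders, not real scraper networks:

    server {
        listen 80;
        server_name example.org;

        # replace with whatever DC ranges show up in your logs
        deny 203.0.113.0/24;
        deny 198.51.100.0/24;
        allow all;

        # ... rest of the site config
    }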