TCRF has been getting DDoSed

The LLM scraper epidemic is hitting us, too, on top of a persistent DDoS threat

There’s been something of an epidemic of malicious bots on the internet these days. You may have seen a post recently titled “Please stop externalizing your costs directly into my face”, or “FOSS infrastructure is under attack by AI companies”. Those are all happening to us, too. Surprise.

I run a website called The Cutting Room Floor, a wiki about cut/unused content in games. The server has been getting hammered by a whole host of LLM scrapers, malicious bots, and other problems.


The first type, LLM scrapers, is fairly well known at this point. They operate by simply scraping every single page of your website, as fast as possible, in the worst and most naïve way possible, ignoring flags like “noindex” and “nofollow”. The naïveté matters because many sites, especially dynamically generated ones like wikis and code repositories, have very deep links to all sorts of historical views of pages — old versions, comparisons between arbitrary old versions, statistics and info pages… A scraper bot will just grab all of those without regard for their value (or cost).

The LLM scrapers largely come from cloud providers, especially low-quality ones that are rife with abuse. If you block their scraping attempts, they will simply start up a new VM on their provider of choice, do more scrapes until they’re caught, repeat ad infinitum, until you give up and black hole the entirety of PureVoltage/OVH/DigitalOcean/Amazon/etc. The particularly sophisticated ones spread out their requests across multiple IP addresses in the first place, too, making identifying them much harder.

A subtype of this is the “wannabe archiver/preservationist”. You know the type: “This website is important. I should save it. I will do this by single-handedly downloading the entire website.” We’ve had these kinds of clowns, too. In a sense, it’s similar to a bank run, right down to largely just making the problem worse for everyone.

Those bots are bad, but at least they want the content of the pages, even if it is for no noble goal. Far more insidious are the straight-out DDoS bots.

The DDoS

Here’s a bit of a graph from the TCRF server, produced by one of the analysis tools I made and use. It simply shows a running count of “how many accesses were made to the web server this second”, split across blocked accesses (403s, indicated with a “#” on the bar) and all others (indicated with “|”). This is right as one of the DDoS waves came in, and you can see things jump from TCRF’s stable “5-15/sec” to nearly 100:

This particular DDoS has been ongoing since early January in various forms, and as far as I can tell seems to be targeted at us directly. Thousands of IP addresses, all making one or two requests, all at the same time.
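For the curious, there’s nothing fancy behind that graph. The gist of the per-second tally looks something like this in Python (a sketch, not the actual tool, and the log path is an assumption):

# Rough sketch of the per-second tally: follow the nginx access log and print
# one bar per second, "#" for blocked (403) requests and "|" for everything else.
import re
import subprocess
import time
from collections import Counter

STATUS = re.compile(r'" (\d{3}) ')  # status code right after the quoted request line

tail = subprocess.Popen(["tail", "-F", "/var/log/nginx/access.log"],
                        stdout=subprocess.PIPE, text=True)
counts, second = Counter(), int(time.time())
for line in tail.stdout:
    now = int(time.time())
    if now != second:
        print(f'{second}  {"#" * counts["blocked"]}{"|" * counts["other"]}')
        counts.clear()
        second = now
    status = STATUS.search(line)
    counts["blocked" if status and status.group(1) == "403" else "other"] += 1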

These attacks typically last for a minute or two. The bots disproportionately target certain pages and features: the ones that are “expensive” to generate. For the first few months — before I’d actually pinpointed it as a DDoS — they mostly focused on making excessive calls to “Special:RecentChangesLinked”, a page that shows a page’s history and the history of every page linked to it. Simply requesting 5,000 different pages’ worth of interlinked history grinds the server to a halt, and nobody else can access a page.

When that attack was mitigated by making RecentChangesLinked login-only, the next one shifted to blasting random history and comparison pages, again with the strategy of “spend as much time executing worthless requests as possible so that nobody else can get in”. Almost all of these requests land within the same second, from a wide variety of IP addresses, almost all Chinese-based:

It’s not as useful to attack a basic content page, because the wiki will simply give you the cached version. So the bots go after these deep, dynamic pages to waste resources.

Stopping the insanity

Stopping the attack on the site has been an ongoing process, employing multiple methods.

One tool that cropped up recently and got remarkably popular is Anubis, a service that sits in front of your webapp and presents connecting users with a challenge. The challenge effectively requires a client to waste several seconds/minutes doing expensive-to-do but easy-to-verify math, under the idea that bots won’t waste time doing it (if they are even using a client capable of handling it). Anubis seems great, but isn’t really workable for TCRF for a few reasons:

  • It requires Docker, which TCRF doesn’t use, and which I don’t want to have to learn, stand up, and add to an already-exhausting maintenance load
  • It requires you to use nginx as a reverse proxy to a “web app”; TCRF uses nginx as a web server, not a web app, so there is nowhere to slot it in without standing up even more services
  • It presents its challenge to everyone

If you’re setting up a modern web app, by all means use it, but the costs of implementing it here are too high.
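As an aside, the “expensive to do, easy to verify” math that Anubis-style challenges rely on is plain proof-of-work. A rough sketch of the general idea, illustrative only and not Anubis’s actual scheme:

# Proof-of-work in a nutshell: find a nonce whose hash has N leading zero bits.
# Finding one takes roughly 2**N attempts; checking one takes a single hash.
import hashlib
import itertools

def solve(challenge: str, difficulty_bits: int = 20) -> int:
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce                    # slow part: the client burns CPU here

def verify(challenge: str, nonce: int, difficulty_bits: int = 20) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))  # fast part

The client burns a few seconds in solve(); the server spends one hash call in verify(). That asymmetry is the whole point.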

Another suggestion that has cropped up is Cloudflare. Cloudflare likely needs no introduction: they’re a content delivery network that you place in front of your web server, and as part of their services they offer anti-DDoS protection. The problems with Cloudflare mostly stem from disagreeable politics, not wanting to rely on an external third party, and (again) having to set up something else in the middle that breaks all the workflows. And, again, I’m not the biggest fan of putting an interstitial in front of every single person.

So, what do you do?

Before analyzing the logs and determining that a lot of the abuse was coming from one particular location (China), the first step was simply to limit access to the most expensive pages. Since this is a simple PHP site, I was able to do this by just inserting a bit of magic into the index page; you might be able to see it if you go directly to one of the blocked pages in an incognito tab. It’s not the best, but it’s short, simple, and (most importantly) prevents the server from wasting additional resources on the request. Logged-in users never get this message.

Your request was blocked. Try disabling any VPNs or proxies and try again. If problems persist, please get in touch via Discord (https://discord.gg/wnfBf5D) or IRC (irc.badnik.zone #tcrf, slow)
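The gist of that check, for the curious, is just an early bail-out before the wiki does any real work. The sketch below is Python purely for illustration; the real thing is a few lines of PHP, and the page markers and cookie name are made up:

# Illustrative only: block anonymous requests for expensive page types before
# the wiki does any database or parser work. Markers and cookie name are made up.
EXPENSIVE = ("Special:RecentChangesLinked", "diff=", "oldid=")

def should_block(query_string: str, cookies: dict) -> bool:
    if "wiki_session" in cookies:           # hypothetical "logged in" check
        return False
    # if it matches, send a plain 403 with the short message above and stop here
    return any(marker in query_string for marker in EXPENSIVE)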

Then: block. Block block block. So far, most of the problem has come from a handful of networks, or ASNs. They use many IPs within those networks, but they’re all still coming from the same few networks. So: block ’em all!

As a live example, one of thousands of the malicious IPs was from “219.142.153.182”. A quick look at ipinfo gives us little reason to be surprised: it originates in China, a common theme for these bots. So, simply take the AS number, “AS4847”, throw it into asn.ipinfo.app, and presto, a blocklist ready to drop into nginx. Do this about 20 times, and the DDoS starts to run out of networks it can abuse you from.
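If you keep an ASN’s prefixes in a plain text file (say, pasted from asn.ipinfo.app), turning them into something nginx can chew on is a tiny script. A sketch, with placeholder file names:

# Turn a plain list of CIDR prefixes (one per line) into an nginx include file
# full of "deny" rules. File names are placeholders.
import ipaddress
import sys

def prefixes_to_nginx(infile: str, outfile: str) -> None:
    with open(infile) as src, open(outfile, "w") as out:
        for line in src:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            net = ipaddress.ip_network(line, strict=False)   # validates the prefix
            out.write(f"deny {net};\n")

if __name__ == "__main__":
    prefixes_to_nginx(sys.argv[1], sys.argv[2])
    # then, in nginx:  include /etc/nginx/blocklists/as4847.conf;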

The full list of network blocks on TCRF is long and only getting longer, but it includes a frustrating number of Chinese-based telecoms, mobile networks, and cloud providers, especially Alibaba. Holy shit Alibaba is one of the most abusive networks I have ever seen, and I’m not lying when I say any sysadmin’s first step should be routing it to a black hole.

The advantage of all this is that, for most users, it’s completely transparent. Anyone can still use any old browser (even on Windows 98); there are no intrusive waiting pages or CPU-melting Javascript challenges. The downsides are that innocent users can (and will) get caught up in these blocks, and since not everyone is subjected to these methods, there are ways to bypass them.

But at the end of the day, it’s extremely effective, and even at 100 requests/second, blocking them is cheap enough that the attack doesn’t cause so much as a blip on the server.

And yes, there were DDoS waves hitting the server even as I was writing about it. For now, the attacks are impotent.

Here’s a Bluesky post with a video of what the access log looks like when it happens (about 15 seconds in):

i wanted to record a video of what it *normally* looks like, but of course this bozo had to fuckin interrupt it

X (@xkeeper.net) 2025-03-30T19:13:55.838Z

Running TCRF is a one-person operation. If you’d like to support us, please consider joining our Patreon. You can also support me directly via Ko-fi.

Hopefully, there’s a day in the future where this comes to an end.

Ping chart code

I mentioned in the previous post about my DIY network monitoring that I’d put the code up eventually, and that time is (or was) now. It’s still not really cleaned up, but if you want to run it, you can do that now.

It requires fping, which is a tool you should be able to install from your package manager. It’s basically ping but with slightly better output for script use.
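For a feel of how a script uses it, something like the sketch below works; the exact output format can differ a little between fping versions, and the target and timeout are just examples. The real code is in the repo below.

# One ping via fping: "-C 1 -q" prints a per-target line like "8.8.8.8 : 23.4"
# (or "-" for a lost packet) on stderr. Target and timeout are examples only.
import subprocess

def ping_once(target: str = "8.8.8.8", timeout_ms: int = 500) -> float | None:
    result = subprocess.run(
        ["fping", "-C", "1", "-q", "-t", str(timeout_ms), target],
        capture_output=True, text=True)
    rtt = result.stderr.strip().split(":")[-1].strip()
    return None if rtt == "-" else float(rtt)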

Anyway: pingchart (GitHub)

DIY Network Monitoring

I’ve been posting images on Twitter and elsewhere lately that are largely green squares with red dots on them. While I intend to write a longer post on that later, for now I’ll give a quick explanation of them.

A sample image from January 4th is below; there is also a webpage with an updated display of them.

A typical ping chart from my tool, with a large block of green pixels dotted with red. The green pixels show the ping response time (in ms), red dots are dropped packets, and each image represents one day with each pixel representing one second.
These are quite good at showing weird network behavior and other anomalies, but not good for diagnosing why.

Each pixel represents a test done on one second; each row is 300 pixels, or 5 minutes. Seconds go from left to right, top to bottom, and hours are ticked off by marks along the right edge. Faster times are denoted with darker greens, getting lighter as they approach 100ms, and turning yellow as they reach 500ms. Packets that aren’t received within half a second are marked lost, and show up in red.
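The color mapping boils down to something like this (a rough sketch, not the actual pingchart code):

# One pixel per ping: dark green when fast, lighter toward 100 ms, shading to
# yellow by 500 ms, red for a lost packet. Returns an (R, G, B) tuple.
def pixel_color(rtt_ms: float | None) -> tuple[int, int, int]:
    if rtt_ms is None:                       # no reply within half a second
        return (255, 0, 0)
    if rtt_ms <= 100:
        g = int(96 + 159 * (rtt_ms / 100))   # darker green -> lighter green
        return (0, g, 0)
    frac = min((rtt_ms - 100) / 400, 1.0)    # green -> yellow as it nears 500 ms
    return (int(255 * frac), 255, 0)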

Ideally, I wouldn’t have had to make this tool, but we’ve had three different cable technicians inspect our equipment with no improvement, and we’ve also purchased a newer modem. While the new modem helped, it’s only because it’s able to achieve even more out-of-spec power levels, and not because the old one was faulty.

Unfortunately, we haven’t managed to correlate this to anything going on inside or outside our apartment, so all we can do is monitor what’s happening for now… and use it as further evidence that the problem still isn’t fixed.

Cable modems are garbage

We’ve been having a lot of internet issues at the Romhaus, the subject of which will have to be a topic for another post.

While we’ve been trying to get it fixed, one of the things that was recommended to us by Cox was purchasing a new modem to replace our old (but still otherwise functional) Arris Surfboard, as it supported only DOCSIS 3.0 and was a few years old. We decided to replace it with a shiny new Motorola MB8600, which was the only DOCSIS 3.1 modem we could find nearby. (Unsurprisingly, it didn’t help, so that was a cool $160 we spent on not fixing the problem.)

While the old modem had no authentication, it also didn’t pretend to. There also wasn’t anything you could actually do with it — it was strictly a view-only interface, as far as I remember. The Motorola instead requires a username and password to log in. Logging in allows you to reboot the modem, restore it to factory defaults, or change the username and password.

DevTools screenshot of the top menu elements, showing that they are table cells with onclick events instead of actual links.
why

The entire interface is a mess of Javascript in place of meaningful HTML; rather than using normal links, everything is based on onclick events. The layout is built on tables within tables within tables, even in places where it isn’t useful (for example, the header of each table is itself a table, with the green-to-white transition provided by an image). There’s no reason for any of this, and it’s usually a sign of bad design further down.

And of course it is. As part of my network monitoring, I wrote a script that checks the connection status page and extracts the power levels, to help diagnose our ongoing issues. With the old modem, it was just a matter of requesting the status page; with the new one, that’s behind a login page, so I had to figure out how to have the script log itself in.

Beyond the login form being a further mess of needless JavaScript (you don’t need onBlur and onFocus if you just use the placeholder attribute like normal people), I dug through the login page to figure out how, exactly, it logged you in. And of course. Of course it’s horrible.

function btnApply() // (Called when you hit submit)
{
   var loc  = '/login_auth.html?';

   with ( document.forms[0] ) {
      loc += 'loginUsername=' + encodeUrl(loginUsername.value);
      loc += '&loginPassword=' + encode(loginPassword.value);
      loc += '&';
      var code = 'location="' + loc + '"';
	  
      eval(code);
   }
}

aaagh

What this does, in human terms:

  1. URL-encodes your username.
  2. base64-encodes your password.
  3. Glues those together into a URL: “/login_auth.html?loginUsername=(url-encoded username)&loginPassword=(base64-encoded password)&”
  4. Builds a tiny JavaScript statement that assigns that URL to location, and eval()s it to redirect you.

Yeeeep, it sends it over GET, completely in the clear. Making the script log in after finding this out was trivial.
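For illustration, here’s roughly what that boils down to. This is a reconstruction rather than my actual script, using the parameter names from the modem’s own form and the usual cable-modem default address:

# Log in the same way btnApply() does: one GET request with the username
# URL-encoded and the password base64-encoded in the query string.
# 192.168.100.1 is the usual cable-modem default; credentials are examples.
import base64
import urllib.parse
import urllib.request

def modem_login(host: str, username: str, password: str) -> None:
    user_q = urllib.parse.quote(username)
    pass_q = base64.b64encode(password.encode()).decode()
    url = f"http://{host}/login_auth.html?loginUsername={user_q}&loginPassword={pass_q}&"
    urllib.request.urlopen(url).read()

modem_login("192.168.100.1", "admin", "example-password")   # example credentials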

A snippet of code showing how the script logs into the modem by doing a simple GET request with the username/password query parameters.
And somehow this isn’t even the worst part

That’s bad enough, but there’s more. Because of course there is.

When I first started developing this, I didn’t need to log in at all. It turns out that — surprise! — logging into the modem authenticates it for everyone. If you log into it on Computer A, and then open the modem page on Computer B, you’ll be presented with the information screen, without needing to log in again. At first I thought it might be a fluke or based on IP addresses, since I had the page open in my browser as well, but after letting the script log in, I opened the modem page on a computer in another room that hadn’t been touched for hours and, well, I didn’t have to log in.

So you don’t even need to sniff the password; you just need someone to log in for you.

I hate computers.


UPDATE!!! As pointed out in the comments, the current usernames and passwords are straight up dumped in plain text in the Security tab. I can’t even make this shit up, oh my god.

Snippet of the page source for the security tab, showing the current usernames and passwords for the modem's accounts
SERIOUSLY YOU GUYS

dfasgfdvcdfllghkfdsghk

Jul, a forum

Jul’s a forum I run. If you’ve been around me for a while, you’ve probably heard of it. It’s a fork of an ancient forum, both in the sense of community (from Acmlm’s Board) and software (Acmlmboard). Both were created by (surprise) Acmlm back in early 2001, 18 years ago. The community went through a lot of… turbulence over the years, branching into several different, smaller communities, before mostly settling down around 2010.

While a full genealogy chart is a little much for the first post about it, especially at 11:40 PM, there are a lot of variants and forks of the code, as well as several attempts to recreate it. Jul runs on a version that’s more similar to the original than most modern ones, though.

As most of the internet has moved on to social media like Twitter and Facebook, forums have been left behind. They’re simpler, they’re not centralized, and they’re a lot more relaxed. Forums aren’t going to send you push notifications or flood your e-mails or shove a bunch of recommendations in your face, and for that they fall behind the “engagement” metric… but as a long-form place to organize, discuss, and just hang out? Can’t beat them.

The code Jul uses is on GitHub, though it’s still full of Jul-specific hacks-slash-features and isn’t really usable on its own. It has no installer or readme, no guides, and is, frankly, a mess. Even then, it still manages to run, over 18 years after it was first made! Pretty impressive.

One of Jul’s more notable features — shared with other Acmlmboard-likes — is the ability to make “post layouts”. Unlike typical forum signatures, post layouts (and posts themselves) can include full HTML, both before and after a post, letting you flair your posts with a touch of style. Of course, users can also disable those by default if they’re too obnoxious.

Anyway, this post was brought on because I did some updates to it, mainly in redesigning the new reply page:

New Reply page for a user not logged in, showing an invalid username/password error next to the username field
The new reply page now simply shows an error if you’re logged out and give an invalid username or password, instead of cancelling your post entirely. Of course, if you post while logged in, you never see this.

The changes are fairly minor, but hopefully make it nicer to use:

  • “Mood avatars” are a dropdown instead of a large list of radio buttons
  • Posting while logged out and putting in an invalid username/password will return you to the form, without losing your post
  • Internally, the code is cleaner and better organized
  • The post reply box can now be resized fully, instead of being locked to 800px wide
  • Post previews now use the same form as the initial reply, instead of a separate one (for some reason)
  • In the event that you aren’t able to post your reply (because e.g. the thread was closed or moved to a restricted forum), you’re given a chance to copy what you had written — you no longer instantly lose it
An error when replying caused by the thread being closed. There is a textbox under the error with the content of the post that was being written, letting the author copy and paste it elsewhere.
The days of typing a long monologue only to lose it are (hopefully) over.

It’s a neat place! You should come visit some time.

LED Sign stuff

One of the goofier projects I’ve started recently has been toying with an Adaptive Micro Systems LED sign, specifically the PPD220RED (for “personal proximity display”, and its color). I’ve wanted to own a goofy LED marquee sign for years, and after some searching I managed to snag one on eBay.

Among other things, I’ve hooked it up to a Raspberry Pi and have it showing simple weather information, sourced from OpenWeatherMap.
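The weather part is nothing special; roughly the sketch below, with the city and API key as placeholders (not the exact script on the Pi):

# Pull current conditions from OpenWeatherMap and format a short line for the
# sign. City and API key are placeholders.
import json
import urllib.parse
import urllib.request

API_KEY = "your-api-key"

def current_weather(city: str) -> str:
    url = ("https://api.openweathermap.org/data/2.5/weather?"
           + urllib.parse.urlencode({"q": city, "units": "imperial", "appid": API_KEY}))
    data = json.load(urllib.request.urlopen(url))
    return f"{data['weather'][0]['description']}, {round(data['main']['temp'])}F"

print(current_weather("Example City"))       # e.g. "clear sky, 72F"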

The sign displaying the current weather conditions and a brief forecast.

The protocol for the sign is available online, and I’ve been working on an implementation of it so I can let the internet display messages on it as well. But for now it’s mostly just a silly little toy.

First post

New year, new… thing? I’ve talked way too much about how social media sucks and yet it’s where I put way too much of my thoughts… and my journal isn’t really a good place for that, either, being that it’s more personal.

I’ve tried various blogs before — too many times, arguably — but this time hopefully I can keep things going by focusing mostly on what I’m doing, and have something to show for it beyond a link to a tweet or two.

Here’s to 2019.