I hadn’t been to an Indie Web Camp since before The Situation. It felt very good to be back. I had almost forgotten how inspiring and productive they can be.
This one had a good turnout of around twenty people. We had ourselves an excellent first day of thought-provoking sessions. Then on day two it was time to put some of those ideas into action.
A little trick I like to do on the practical day is to have two tasks to attempt: one of them quite simple, and the other more ambitious. That way, as long as I get the simpler task done, I’ll always have at least something to demo at the end of the day.
This time I attempted three bits of home improvement on my website.
Autolinking Mastodon usernames
The first problem I set myself was ostensibly the simple one. But it involved regular expressions, so then I had two problems.
That turned out to be an excellent test case. Those Icelandic characters made sure I wasn’t making unwarranted assumptions about character sets.
Here’s the regular expression I came up with. It’s not foolproof by any means. Basically it looks for
Good enough. Ship it.
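For illustration, here is a minimal Python sketch of this kind of autolinking. It is not the expression used on the site. It assumes usernames take the form `@user@domain`, that a profile lives at `https://domain/@user`, and that the input text is already HTML-safe. The names `MENTION` and `autolink_mentions` are hypothetical. Python's `\w` is Unicode-aware for `str` patterns, so Icelandic letters are matched without any special handling.

```python
import re
from html import escape

# Hypothetical pattern: "@user@domain.tld", not preceded by a word character,
# "@", "/" or "." (so email addresses and URLs are left alone).
MENTION = re.compile(r"(?<![\w@/.])@(\w+)@((?:[\w-]+\.)+\w{2,})")


def autolink_mentions(text: str) -> str:
    """Wrap Mastodon-style usernames in links. Assumes text is already HTML-safe."""

    def link(match: re.Match) -> str:
        user, domain = match.group(1), match.group(2)
        # Assumption: the profile URL follows the common https://domain/@user shape.
        url = f"https://{domain}/@{user}"
        return f'<a href="{escape(url)}">@{escape(user)}@{escape(domain)}</a>'

    return MENTION.sub(link, text)
```

A trailing full stop after the username stays outside the link, because the domain has to end in at least two word characters.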
My next task was a bit more ambitious. It involved SQL queries, something I’m slightly better at than regular expressions but that’s a very low bar.
I wanted to show related posts when you get to the end of one of my blog posts.
I’ve been tagging all my blog posts for years so that’s the mechanism I used for finding similar posts. There’s probably a clever SQL statement that could do this, but I ended up brute-forcing it a bit.
I don’t feel too bad about the hacky clunky nature of my solution, because I cache blog post pages. That means only the first person to view the blog post (usually me) will suffer any performance impacts from my clunky database queries. After that everything’s available straight from a cached file.
Let’s say you’re reading a blog post of mine that I’ve tagged with ten different keywords. I make a separate SQL query for each keyword to get all the other posts that use that tag. Then it’s a matter of sorting through all the results.
I loop through the results of each tag and apply a score to the tagged post. If the post shares one tag with the post you’re looking at, it has a score of one. If it shares two tags, it has a score of two, and so on.
I decided that for a post to be considered related, it had to share at least three tags. I also decided to limit the list of related posts to a maximum of five.
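The brute-force scoring described above can be sketched roughly like this in Python with SQLite. The `post_tags` table, its columns, and the function name `related_posts` are assumptions made for the example, not the site's actual schema or code. Ordering the results by score is also an assumption; the only stated rules are the three-tag minimum and the five-post limit.

```python
import sqlite3
from collections import Counter


def related_posts(db: sqlite3.Connection, post_id: int,
                  min_shared: int = 3, limit: int = 5) -> list[tuple[int, int]]:
    """Return up to `limit` (post_id, score) pairs sharing >= `min_shared` tags."""
    # Hypothetical schema: post_tags(post_id, tag), one row per tag on a post.
    tags = [row[0] for row in db.execute(
        "SELECT DISTINCT tag FROM post_tags WHERE post_id = ?", (post_id,))]

    scores: Counter[int] = Counter()
    for tag in tags:
        # One separate query per tag, as in the brute-force approach.
        rows = db.execute(
            "SELECT DISTINCT post_id FROM post_tags WHERE tag = ? AND post_id != ?",
            (tag, post_id))
        for (other_id,) in rows:
            scores[other_id] += 1  # one point per shared tag

    # Keep posts that clear the threshold, highest score first.
    ranked = [(pid, score) for pid, score in scores.most_common()
              if score >= min_shared]
    return ranked[:limit]
```

Running one query per tag is wasteful, but with cached pages that cost is only paid on the first view of each post.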
I was very inspired by Remy’s recent post on how he’s tackling link rot on his site. I wanted to do the same for mine.
On the first day at Indie Web Camp I led a session on link rot to gather ideas and alternative approaches. We had a really good discussion, though it’s always worth bearing in mind that there’ll never be a perfect solution. There’ll always be some false positives and some false negatives.
In the end I decided to stick with Remy’s two-pronged approach:
- a client-side script that — as a progressive enhancement — intercepts outbound links and re-routes them to
- a server-side script that redirects to the Internet Archive if the link is broken.
It’s very similar to Remy’s but with one little addition. I check to see if the clicked link is inside an h-entry and if it is, I pass on the date from the post’s
Here’s the PHP I wrote for the server-side redirector. The comments tell the story of what the code is doing:
- Check that the request is coming from my site.
- There also has to be a URL provided in the query string.
- Make a very quick curl request to get the response headers from the URL. The time limit is set to 1 second.
- If there was any error (like a time out), give up and go to the URL.
- Pick the response headers apart to get the HTTP status code.
- If the response is OK, go to the URL.
- If the response is a redirect, go around again but this time use the redirect URL.
- Construct the archive.org search endpoint.
- If we have a date, provide it. Otherwise ask for the latest snapshot.
- Ping that archive.org URL. This time there’s no time limit; this might take a while.
- If there’s an archived copy, redirect to that.
- There’s no archived copy. Give up and go to the URL anyway.
Not perfect by any means, but it works for the most common cases of link rot.
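As a rough illustration of the same flow, here is a Python sketch of the decision logic. It is not the PHP from the site. It leaves out the first two checks (the referrer and the query string), and it assumes the Wayback Machine availability endpoint at `https://archive.org/wayback/available` with a `YYYYMMDD` timestamp. The function names, the `max_hops` limit, and the choice to always fall back to the original URL are assumptions too.

```python
import http.client
import json
from urllib.parse import urlencode, urljoin, urlsplit
from urllib.request import urlopen

FETCH_ERRORS = (OSError, http.client.HTTPException, ValueError)


def head(url, timeout=1.0):
    """Return (status, Location header) for url without following redirects."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)  # quick check: 1 second
    try:
        path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def wayback_snapshot(url, date=None):
    """Ask archive.org for a snapshot URL; None if there isn't one."""
    params = {"url": url}
    if date:  # with no timestamp, the latest snapshot is requested
        params["timestamp"] = date
    # No timeout here: this lookup might take a while.
    with urlopen("https://archive.org/wayback/available?" + urlencode(params)) as r:
        data = json.load(r)
    closest = data.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None


def resolve(url, date=None, head_fn=head, snapshot_fn=wayback_snapshot, max_hops=5):
    """Decide where to send the visitor: the live URL or an archived copy."""
    target = url
    for _ in range(max_hops):
        try:
            status, location = head_fn(target)
        except FETCH_ERRORS:
            return url  # any error (like a timeout): give up and go to the URL
        if 200 <= status < 300:
            return url  # the link works
        if 300 <= status < 400 and location:
            target = urljoin(target, location)  # go around again
            continue
        break  # looks broken: try the archive
    else:
        return url  # too many redirects: give up and go to the URL
    try:
        archived = snapshot_fn(url, date)
    except FETCH_ERRORS:
        archived = None
    return archived or url  # no archived copy: go to the URL anyway
```

Passing the fetchers in as `head_fn` and `snapshot_fn` makes it possible to exercise the logic with stand-ins instead of real network requests.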
For the demo at the end of the day I went back into my archive of over 10,000 links and plucked out some old posts, like this one from December 2005. It takes a little while to do the rerouting but eventually you get to see the archived version from the same time period as when I linked to it.
The Internet Archive’s Wayback Machine really is a gift. I can’t imagine how it would be even remotely possible to try to address link rot on my site without archive.org.
I will continue to donate money to the Internet Archive and I encourage you to do the same.