fub: A photo of an ADM3A terminal (ADM3A)

If you’re Very Online like me, you will have heard about the feud that the guy who seems to be the sole owner of WordPress (both the commercial company and the open source project…) has with a large hoster of WordPress instances. The whole issue is explained in this long-read, which makes an excellent point: Matt has destroyed the trust in the whole WordPress project. And if there’s anything an open source project needs, it’s trust.

My blog uses WordPress, which I modified to have my own theme and things like user icons and moods (which I ported over from LiveJournal when I migrated).

For some time now, I have been thinking that I could also use a static site generator instead — I don’t need much in the way of fancy features, and I get almost no traffic on my blog itself anyway. Keeping WordPress up to date is a bit of a hassle, and if you don’t, you offer an attack surface to hackers…


But there are some interactive features that I value. I don’t get many comments on the blog (most interaction happens on DW), but I do want the option to receive comments and to react to them. Cross-posting to DW is also very important to me. And I am using the ActivityPub plugin for WordPress to federate my posts into the ‘fediverse’, so that people on Mastodon (or any other ActivityPub tool) can subscribe to my blog. If they comment on the entry, that is actually captured as a comment, which I think is very cool.

All those interactive elements need something to process them (and to re-create the site to reflect those changes), and I don’t think there’s any solution that does this except WordPress.


Certainly something to research a bit more urgently, what with the recent problems in that corner.


Crossposted from my blog. Comment here or at the original post.
fub: A slotmachine with three reels, that ends up on 'F U B' (slotmachine)

There were some hurdles with the mortgage, as the building we’re buying is not a standard house — so the mortgage provider needed some more guarantees. But we cleared all of those hurdles, and on May 31st we inspected the property one last time and then went to the notary to sign the papers for the official transfer of ownership.

On Tuesday evening, we had an appointment with our builder at the house, at 18:00. He spent 3.5 hours with us, going through the space, discussing possibilities etcetera. We had already done a few things, like putting up a house number — the address had changed, so there was no house number visible from the street — and putting up a mailbox.


Today, we put up some outside lights with motion sensors, as the house is at the edge of the village and there are no street lights: it’s dark there at night. Not that there is anything to steal there (yet) but we might as well get this done now.

But the most important thing we went to do there was to simulate our furniture being placed in the space. With paper painter’s tape, we put the outlines of the furniture on the floor. We had prepared for this by cutting out pieces of paper to scale and placing them on the original builder’s drawings, but doing it at 1:1 scale really allowed us to experience the ‘flow’.

In square meters, the house is equivalent to where we live now, but our current house has different rooms with walls that you can put storage against — and we need quite a bit of storage. The mover who came to look at our inventory remarked that we owned ‘more than average’. We can thin that out a bit, but we are certainly not minimalists! But in the new house, in such a large open space, you can’t put storage without making a ‘zone’ feel boxed in. So that’s going to be A Thing going forward.

But we have also decided on all the ‘hard’ infrastructure: a completely separate bedroom, and where the kitchen will go. Everything else will be ‘soft’ and can be moved around as we see fit. Who knows, we might live there for two weeks and decide to swap things around. That flexibility really feels luxurious.


As the address of the house had changed, the fibre optic cable terminating in our house was still registered to the old address, but luckily it only took a single email to rectify that. We had decided against getting a landline again — everyone we want to call us on the landline also has our mobile numbers, and the rest are spammers. And as we’re moving out of the city, the landline number would change anyway, so it feels like more trouble than it’s worth.

What also helped in that decision was that I selected a provider for internet that doesn’t provide telephone services in our area — we’ll be getting 1Gbps for less than I now pay for 100Mbps!

But reception on our mobile phones is pretty bad in the house. So I studied the map of cell phone towers for the Netherlands, and we are indeed quite far from the nearest tower on the network of our current provider. But there are two towers for another network that are closer — so we should have better reception if we switch to a provider that uses that network. That has been initiated as well, and with number portability we’ll keep our phone numbers too.


Meanwhile, we also need to clean up our (current) house so it is fit for showings. That’s a painful process, but we’ll get through that too.


Crossposted from my blog. Comment here or at the original post.
fub: A photo of an ADM3A terminal (ADM3A)

If you remember the “old web”, you remember those hand-crafted HTML webpages. Part of the design aesthetic that they all shared was these 88 x 31 pixel mini-banners. Their origin is shrouded in the mists of time, but it is theorized that they started as buttons to link back to GeoCities, though other timelines put the infamous and oft-parodied “Netscape NOW” button as the first to appear.


It is therefore no surprise that the prolific page builders on GeoCities ran with the format and created a slew of ‘buttons’ for their own use and amusement. Corporations followed suit, and the small graphical buttons took on two functions: most of the corporate ones were used to link to useful software (“Get Acrobat!”), or as part of some ritualised tribal warfare to show allegiance to either Netscape (the aforementioned “Netscape NOW” button) or Internet Explorer (“Best viewed with Internet Explorer!”).

Many ‘fan-made’ buttons upped the ante, either promoting one thing or decrying something else: “This page is anti Apple, get an IBM!”

Soon every webring (remember those?), every individual website and even individual sections of a website had their own button that you could use on your own pages to link to them (sometimes triggering heated debates about whether ‘hotlinking’ was ethically permissible).


No wonder then that GeoCities had a large variety of buttons sprinkled across its pages. booters (on kolektiva.social) has scraped the pages still available through the archive that was created before GeoCities was taken offline, and you can see all of this creativity on display (though it is by no means the only archive of 88×31 pixel buttons). Nostalgia until your eyes bleed.


Crossposted from my blog. Comment here or at the original post.
fub: A photo of an ADM3A terminal (ADM3A)

The problem

A Mastodon server keeps all entries it “knows” in its own database. That means that every entry received from every other federated social media server is stored in the database. That database can get quite big, and since someone has to pay for all of the storage, the database can’t be allowed to grow unchecked. Which is why Mastodon has a configuration setting that specifies after how many days external posts (posts that originated on other servers) are deleted from the database. It’s in Administration | Server Settings | Content retention, as the setting “Content cache retention period”. On MuBlog, this is set to 100 — so after 100 days, posts from other servers are deleted. (Posts that were made on MuBlog are not subject to this policy; they keep existing.)

When you bookmark a post, you do so on your ‘home server’, so you are actually bookmarking the copy of the post that is in your server’s database. And you can see where this is going: after the content cache retention period has expired, the local copy is removed — and with that, you have lost your bookmark as well! But the original post, on the originating server, still exists — so if you had a way to preserve your Mastodon bookmarks somewhere else, one that also records the original entry URL, you’d be safe.


Step one: Get Shaarli

I decided I wanted something to move my Mastodon bookmarks to an external bookmark manager, to preserve them. As I am a fan of self-hosted solutions, I selected Shaarli as my bookmark manager. It is written in PHP and doesn’t use an external database, which means you can just drop the files through FTP on your webhost, open the page in your browser and configure it. I installed it on my host in a separate directory.

Once it all works (play around with it for a bit to get used to it!), go to Tools | Configure and enable the REST API — this is how we will create bookmarks in Shaarli! Take note of the REST API secret: this is how we will authenticate with Shaarli.
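
To give an idea of what that secret is for: the Shaarli REST API expects a short-lived token signed with it. Below is a minimal sketch of creating a single test bookmark, based on my reading of the Shaarli API documentation — the endpoint and signing details (a JWT with an ‘iat’ claim, signed with HMAC-SHA512) should be double-checked against your Shaarli version, and the URLs and secret are of course placeholders.

  <?php
  // Minimal sketch: create one test bookmark through the Shaarli REST API.
  // Both constants are placeholders; fill in your own Shaarli URL and API secret.
  const SHAARLI_ENDPOINT = 'https://example.org/shaarli';
  const SHAARLI_SECRET   = 'your-rest-api-secret';

  function base64url(string $data): string {
      return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
  }

  // Shaarli expects a JWT signed with HMAC-SHA512, carrying an 'iat' (issued at) claim.
  function shaarli_token(string $secret): string {
      $header  = base64url(json_encode(['typ' => 'JWT', 'alg' => 'HS512']));
      $payload = base64url(json_encode(['iat' => time()]));
      $sig     = base64url(hash_hmac('sha512', "$header.$payload", $secret, true));
      return "$header.$payload.$sig";
  }

  $body = json_encode([
      'url'     => 'https://example.com/some-post',   // hypothetical test URL
      'title'   => 'Test bookmark',
      'tags'    => ['test'],
      'private' => true,
  ]);

  $ch = curl_init(SHAARLI_ENDPOINT . '/api/v1/links');
  curl_setopt_array($ch, [
      CURLOPT_POST           => true,
      CURLOPT_POSTFIELDS     => $body,
      CURLOPT_RETURNTRANSFER => true,
      CURLOPT_HTTPHEADER     => [
          'Content-Type: application/json',
          'Authorization: Bearer ' . shaarli_token(SHAARLI_SECRET),
      ],
  ]);
  echo curl_exec($ch), "\n";   // on success, Shaarli returns the created link as JSON
  curl_close($ch);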


Step two: Prepare your Mastodon account

Our script also has to authenticate itself with your Mastodon server in order to read your bookmarks, as those are not available to the public!

On the server, go to Preferences and then click on Development. Click on the button ‘New Application’. Give the application a name that makes sense to you (I used ‘Bookmark importer’), and enter the URL — you might want to use the URL where you installed Shaarli. Keep the default Redirect URI, since we don’t need a redirect. Then set the correct scopes: the things the application is allowed to do. The only scopes the application needs are read:bookmarks and write:bookmarks. I suggest you select only these (deselecting the defaults in the ‘Read’ and ‘Write’ categories, and ‘Follow’).

You now see the list of applications you have created. Click on the name of your bookmark-sync application and take note of the access token. This is how we will authenticate our script with the Mastodon server.
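
As a quick sanity check that the token works, you can list a few bookmarks directly. This is a sketch against the public Mastodon API (GET /api/v1/bookmarks); the server URL and token are placeholders for your own:

  <?php
  // Minimal sketch: list a few of your bookmarks through the Mastodon API.
  // Both constants are placeholders; use your own home server and access token.
  const MASTODON_SERVER = 'https://your.home.server';
  const MASTODON_TOKEN  = 'your-access-token';

  $ch = curl_init(MASTODON_SERVER . '/api/v1/bookmarks?limit=5');
  curl_setopt_array($ch, [
      CURLOPT_RETURNTRANSFER => true,
      CURLOPT_HTTPHEADER     => ['Authorization: Bearer ' . MASTODON_TOKEN],
  ]);
  $bookmarks = json_decode(curl_exec($ch), true) ?: [];
  curl_close($ch);

  foreach ($bookmarks as $status) {
      // 'url' is the post's address on the server where it originated.
      echo $status['url'], "\n";
  }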


Step three: The script

I have shared my script that does the syncing here. It’s in PHP, and I recommend that you try the script out from the commandline first, to make sure that it all works. Linux distributions often come with the PHP CLI (the Ubuntu package is called php-cli); Windows users will have to figure it out for themselves. Your PHP install should have at least the cURL and JSON modules included, but every self-respecting PHP version will have those out of the box. If not: good luck, figure it out.

Take the script, and fill out the four global constants at the top of the file: your Mastodon server, the Shaarli secret, the Mastodon token and the Shaarli endpoint. It is now ready to run.

The script will iterate through your bookmarks on Mastodon, create the bookmark in Shaarli, and delete the bookmark on Mastodon. Of course, you might not want to do all of that right on the first try, so there are three arguments you can give on the commandline. DONT_DELETE won’t delete the bookmarks from Mastodon, DONT_ADD won’t add the bookmarks to Shaarli and DONT_ITERATE will only take the first five bookmarks from Mastodon.
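
The skeleton of that loop, with the three flags wired in, looks roughly like the sketch below. This is not the actual script — fetch_bookmarks(), add_to_shaarli() and unbookmark() are hypothetical stand-ins for the API calls shown above (and for Mastodon’s POST /api/v1/statuses/:id/unbookmark endpoint):

  <?php
  // Sketch of the control flow only; the three helpers below are hypothetical
  // stand-ins for the Mastodon and Shaarli API calls.
  function fetch_bookmarks(): array { return []; }
  function add_to_shaarli(array $status): void { echo "would add {$status['url']}\n"; }
  function unbookmark(string $id): void { echo "would unbookmark $id\n"; }

  $dontDelete  = in_array('DONT_DELETE',  $argv);
  $dontAdd     = in_array('DONT_ADD',     $argv);
  $dontIterate = in_array('DONT_ITERATE', $argv);

  $bookmarks = fetch_bookmarks();                  // all bookmarks, pagination handled inside
  if ($dontIterate) {
      $bookmarks = array_slice($bookmarks, 0, 5);  // only the first five
  }

  foreach ($bookmarks as $status) {
      if (!$dontAdd) {
          add_to_shaarli($status);                 // create the bookmark in Shaarli
      }
      if (!$dontDelete) {
          unbookmark($status['id']);               // remove the bookmark on Mastodon
      }
  }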

I would suggest the following order:



  1. php mastobookmark.php DONT_ITERATE DONT_ADD DONT_DELETE, to check if the connection to the Mastodon server works;

  2. php mastobookmark.php DONT_DELETE, to check if the connection to Shaarli works.


Note that the bookmarks are created in Shaarli with the tag source:mastodon, plus any tags the original post might have. All of the bookmarks are also created as private bookmarks. The URL of the bookmark will be the URL of the copy on your home server. If you get around to reading the bookmarked post before it expires there, you can read it on your home server and reply, like or boost directly. The original URL of the post is also recorded, so if the entry has already expired on your home server, you can use the original link to get to the post after all. Interacting with the post will then be a bit more complicated, but still possible.
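
In other words, each bookmark that ends up in Shaarli is shaped roughly like this (a sketch with placeholder values; the exact title and description text are up to the script):

  <?php
  // Sketch of the shape of one Shaarli bookmark; all values are placeholders.
  $payload = [
      'url'         => 'https://your.home.server/@someone/112233445566778899',   // the copy on your home server
      'title'       => 'Title or first line of the bookmarked post',
      'description' => 'Original: https://their.server/@someone/112233445566778899',
      'tags'        => ['source:mastodon', 'sometag'],   // source:mastodon plus the post's own tags
      'private'     => true,
  ];
  echo json_encode($payload, JSON_PRETTY_PRINT), "\n";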


Step four: Schedule the script

When you are satisfied that everything works, you can schedule the script to run automatically. If you have web hosting (which you do, because you used it to install Shaarli), there is a good chance that you can run cronjobs (‘cron’ being the name of the server process that schedules scripts on Unix systems). You can use FTP to upload the script somewhere — I strongly advise against putting it inside your webserver’s document tree (anything under public_html on my host; it might be something else on yours, do check!), where it would be accessible through any browser — and then configure the cronjob to call it on a schedule that makes sense to you. Personally, I run it every day at 5 past midnight, but YMMV.
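
For example, the crontab entry for ‘five past midnight, every day’ would look something like this — the path is hypothetical, so use wherever you uploaded the script (and the full path to the PHP binary if your host requires it):

  5 0 * * * php /home/youruser/mastobookmark/mastobookmark.php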


Optional: Tweak Shaarli

It irked me that Shaarli does not open the links of the bookmarks in a new tab by default. I use the default template, so I edited the file tpl/default/linklist.html in my Shaarli tree. That file is not plain HTML, but a template that gets interpreted. Find the link to “{$value.real_url}” with the CSS class linklist-real-url and add target="_blank" to the link tag. This will open the bookmarked links in a new tab. Mind you, links in the description of the bookmark (which includes the link to the original server!) will not be opened in a new tab. I’m sure that’s possible with some post-processing, but I’m too lazy to find out how to do that.
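
After the edit, the opening tag of that link ends up looking roughly like this — the exact attributes differ between Shaarli versions, so treat it as an illustration rather than something to paste verbatim:

  <a href="{$value.real_url}" class="linklist-real-url" target="_blank">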


In closing

I made this script, and you are welcome to use it. I will not offer support, and if you mess things up, I won’t help you fix it. If you don’t know what you’re doing, it’s probably best not to do it. You’re welcome to suggest improvements, but I am under no obligation to implement them for you. I’m sure my coding style is horrible, I don’t care.

Good luck.


Crossposted from my blog. Comment here or at the original post.
fub: A photo of an ADM3A terminal (ADM3A)

I posted this in a thread on Mastodon, because someone was complaining loudly about “the algorithms” in search engines “these days”, and I was cranky enough to shoot off a 26-post thread about it. This is an edited version.


I’m feeling cranky so you get a post about the development of the web and how search grew along with it.

First, my credentials: I was there when a friend started the first web server at my university. I hand-crafted HTML pages for my student association. I have an MA in Cognitive Science, and for my thesis I built a search engine that used Natural Language Processing to investigate (among other things) whether using noun phrases as queries made any difference.

After I got my MA, I worked on an Esprit project (a public-private partnership to promote high-tech research, sponsored by the EU) for Document Routing. My task was to design the learning algorithms for the classification of text — incoming letters would be scanned and OCR’ed, and then an automated system would parse and classify the document and put it in the appropriate workflow. I even published about my research at SIGIR’98. (SIGIR is the conference of the Association for Computing Machinery’s Special Interest Group on Information Retrieval. It’s where the scientific work on search is presented, and even though it’s now more than 25 years ago, I still consider it one of the highlights of my career to have been able to present my research there.)

In 2000 I worked on a semantic search engine/classifier for a start-up. I added some code to the OpenMuscat (now Xapian) search engine. My name is at the top of the AUTHORS file. (Ask me sometime how the OpenMuscat code ended up on SourceForge, it’s an interesting story, but something more suited to be told over beers.)

After 2001, I stopped working in search, but it has always stayed an interest of mine. In my current job, in content services platforms, I’m taking over the search component, and it’s interesting how things have changed and how things have stayed the same. So I am confident that my knowledge of 25 years ago still roughly translates to today’s issues. If you look at the growth of the web and the growth of search, there are some interesting parallels.


In the early days, people hand-crafted HTML pages about topics that interested them. Yes, there were directories of researchers and their research, but our student association had a set of pages about the animated series Samurai Pizza Cats. Why? Well, one of the guys who had write access to the web space liked it a lot, and he put together a list of characters and episode synopses. Put that somewhere, link to it from the home page (because why not?) and any visitor could find the page.

And the site of the student association was linked from a directory of computer science student associations (there were many informal contacts) that everyone linked to. So any student looking at associations at other Dutch universities would see this page and, if they were fans too, could look at it. This was at a time when e-mail spam didn’t exist yet, so mail addresses were shared at the bottom of the page, and visitors could mail with additions or praise. And they could tell their friends, who could go there and tell their friends. And if they had pages of their own about this topic, they could link to those pages too, so that interested parties could jump from one set of pages to another. It worked, but there was a large “word of mouth” factor: lots of people with interests making pages about those interests, finding others through serendipity and slowly building out a network. This didn’t scale as more and more people started making pages.


Then the “web directories” started: a hierarchical list of topics (Entertainment -> TV -> Animation -> Samurai Pizza Cats), with each topic being a list of links with their descriptions. So if you were interested in a topic, you’d go to Yahoo, traverse the topic tree and find a list of sites with pages about that topic. You could send in suggestions, but the directory itself was curated by a (growing) list of editors. The World Wide Web was designed to be decentralised (everyone running their own webserver!), but discoverability was low. These directories were an important step in the centralisation of the web: instead of going to your own “homepage” and following the links there to the sites you usually visited, you went to a directory to find the stuff you were interested in. There were some alternatives (like webrings, remember those?), but the amount of content was too large to keep decentralised.


And at this time, these were largely hand-crafted pages made by hobbyists. But then corporations started to enter the web. At first, they too had hand-crafted content (often served through content management systems, so no hand-crafted HTML anymore) about their products and services. It wasn’t more than an online signboard: here we are, this is what we do, call us to learn more. Relatively static. If you were interested in a company, you’d go directly to their site and find the info there. The volume of pages shot up, as did the number of topics covered. Directories didn’t cut it anymore, and you needed a search engine. Engines started to download web pages in large quantities and index them. Some of this was quite rudimentary: the first version of Ilse, which would later become the biggest Dutch search engine for a time, simply ran a ‘grep’ command over a downloaded copy of the pages it crawled!


I’ll tell you a little secret: search technology kinda sucks. It suffers from the ‘keyword hypothesis’: if word X occurs in a document, then that document is “about” X. We know that this is not true, but it’s the best we have, even today. In 2000, we worked with hand-crafted search filters that could really capture whether a document was about a specific topic. We basically built an expert system for each topic, fed with expert knowledge. While that is cool, it doesn’t scale. So there are ‘tricks’ to work around the badness of the keyword hypothesis. A famous one is TF/IDF, which states that if a term occurs frequently in a document, but not that many times in other documents, then that term is a good description of the content of that document. The idea is literally 50 years old, and it is still used in modern search engines like Elastic!
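
In its textbook form, the weight of a term t in a document d is something like weight(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is how often t occurs in d, N is the number of documents in the collection, and df(t) is the number of documents that contain t. A term that shows up a lot in one document but rarely anywhere else gets a high weight.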

But of course, since HTML is structured, you can use that structure to weigh terms as well. So: words in the title or in headings are more important, words in the description meta tags are more important, words in the links that link to that page are important, etcetera.


In 1999, I attended SIGIR in Berkeley, and there was a talk by the founders of Google, which had started just a year before and had quickly overtaken AltaVista as the prime search engine, precisely because of all these ‘tricks’ (which some would call ‘algorithms’ today). They talked about their history and their plans, and of course many of the researchers were very interested in the internal workings, but it became apparent that they didn’t want to talk about that at all. PageRank would become public knowledge, because that was patented, and there was some advice on how to write a ‘good’ page that could easily be indexed. But the real internals were kept hidden, precisely because the tricks could be exploited if they became widely known.


The Google people had correctly predicted that they would become involved in a kind of ‘arms race’ with publishers who would want their pages at the top of the search results. Ironically, Google caused this to happen themselves, by going into the ad business. Having a set of popular pages with ads on them became a viable business model — ads that Google would place on your page and pay you for. But for that to make you money, you would have to be at the top of the Google results…

This is why “Search Engine Optimization Consultant” is an actual job that people are paid to do. This is why many people are trying to reverse engineer ‘the’ Google algorithm in order to manipulate their content in such a way that it will appear at the top of the Google search results so that Google will send people to their pages so that those people will see the ads that Google serves them, so that the owner of the pages gets paid by Google for providing the backdrop for the ads. This has become a vicious circle which mainly benefits Google itself. Google doesn’t really care about the quality of the pages they send you to: they care about being able to serve ads to you once you get there. Content farms don’t care about the quality of their work, they care about the money they can get from the ad providers. Quality doesn’t pay the bills, ad money does.

(Of course, there are many people who care deeply about the quality of their work! And most of them are successful in a way that makes the world a better place! But caring is not a _requirement_ for success anymore.)


So yes, search sucks these days. But if we stripped away all the ‘cleverness’ of ‘the algorithms’, we would fall back to the good ol’ days of TF/IDF and having to sift through pages and pages of low-quality search results to find that one page with the information we need. One could also argue that in some cases we are reverting to the days of word-of-mouth: what else is an article “going viral” than people pointing out something good to others?

Yes, there are search engines that do “linguistic processing”, but those tend to work on specific domains. There is no model for language that works on general use cases, and most investment seems to go towards generating rather than understanding — which makes sense, because operating a content farm can be a profitable business, after all. That justifies the investment.

Rather than the ‘tricks’ ruining search, it is Google’s business model that has ruined search. Not only their own, but everyone’s. The glut of content (which I’m happily contributing to with this post!) is what is congesting the web. As a result, I find myself using search engines less and less. I don’t need to keep up to date with the latest stuff, because that’s exhausting and it’s only a vehicle for showing me ads anyway.


Is there a point to this post? Maybe it is this: search is hard, even in ideal circumstances. Computers are really bad at understanding language and context, but great at statistical analysis — so that is what search algorithms are based on. There is no meaningful search without ‘algorithms’ — search itself is an algorithm. Getting shitty results for your search is the result of shitty input and lack of context.

And whatever you think up as the alternative, it has to work at an unimaginable scale. Statistical analysis scales; almost everything else does not. (One could point to Wikipedia as a thing that scales human curation very well, but one has to realise that Wikipedia, as big as it is, represents only a tiny fraction of all the content that exists on the web.) There might be a case for specialised search engines that cover a specific topic and use human curation, but a generic one…?


It is easy to make fun of sites like MySpace and GeoCities, but I regard that as the golden age of the web: “citizen publishers” writing about their interests and pointing to similar pages, while curated directories and search engines made high-quality content easy to discover. We have lost that because of the ad-supported model. Once again, capitalism has destroyed something that was good and brought joy.


Crossposted from my blog. Comment here or at the original post.
fub: A photo of an ADM3A terminal (ADM3A)

If you still thought that Musk was a smart man, then the real-time collapse of Twitter should cure you of that misconception. It is, in some way, impressive how little he understands of what makes Twitter unique, and how thoroughly he is dismantling the value it had through all kinds of hare-brained schemes.

But to those of us who were with LiveJournal back when it was sold to the Russians, it’s a bit of déjà vu — a platform that is important to you is changing in a way that does not align with your own goals for it. That means you have to make a choice, and in the case of LJ, my choice was to create my own blog and cross-post to LJ. (And earlier this year, I decided to start cross-posting to DW instead of LJ.)


I use Twitter much more than I use my blog — microblogging has a different energy. And I don’t get as much interaction on my blog as I do on Twitter. So yes, I am invested in Twitter as a platform. Its breakdown, and the way Musk is treating the company and its employees, does not fill me with great optimism for the long-term survival of Twitter.

One alternative is Mastodon, an open-source federated microblogging system. Through the federation, instances exchange content and users can follow and communicate cross-instance. (The cool thing is that it’s also possible to block federation with specific instances, so you can keep the fascists out.)

But for me it is not feasible to run my own instance of Mastodon on my existing hosting like I do with WordPress for my blog. The Mastodon technology stack is much more complex, and there is no easy install option — so I would have to set up everything and, more importantly, keep everything up to date. That’s not something I am equipped and/or prepared to do.


But the cool thing is that there are several places that offer managed hosting for Mastodon. So I got together with a few people and we decided to set up an instance and see how it goes. We would use it ourselves, and we could also offer accounts to those who are not able to host their own instance. Ideally, I would transfer ownership to a foundation of sorts that would foster a thriving community and ensure the longevity of the system — but for now we go with the ‘benevolent dictator’ model.

I’m now waiting for my application to be approved and the instance to be, eh, instantiated. Can’t wait!


Crossposted from my blog. Comment here or at the original post.
