Feb. 10th, 2023


I posted this in a thread on Mastodon, because someone was complaining loudly about “the algorithms” in search engines “these days”, and I was cranky enough to shoot off a 26-post thread about it. This is an edited version.


I’m feeling cranky so you get a post about the development of the web and how search grew along with it.

First my credentials: I was there when a friend started the first web server at my university. I hand-crafted HTML pages for my student association. I have an MA in Cognitive Science, and for my thesis I built a search engine that used Natural Language Processing to investigate (among other things) whether using noun phrases as queries made any difference.

After I got my MA, I worked on an Esprit project (public-private partnerships to promote high-tech research, sponsored by the EU) for Document Routing. My task was to design the learning algorithms for the classification of text — incoming letters would be scanned and OCR’ed, and then an automated system would parse and classify the document and put it in the appropriate workflow. I even published about my research at SIGIR’98. (SIGIR is the conference of the Association for Computing Machinery’s Special Interest Group on Information Retrieval. It’s where the scientific work on search is presented, and even though it’s now more than 25 years ago, I still consider it one of the highlights of my career to have been able to present my research there.)

In 2000 I worked on a semantic search engine/classifier for a start-up. I added some code to the OpenMuscat (now Xapian) search engine. My name is at the top of the AUTHORS file. (Ask me sometime how the OpenMuscat code ended up on SourceForge; it’s an interesting story, but one better told over beers.)

After 2001, I stopped working in search, but it has always stayed an interest of mine. In my current job, in content services platforms, I’m taking over the search component, and it’s interesting how things have changed and how things have stayed the same. So I am confident that my knowledge of 25 years ago still roughly translates to today’s issues. If you look at the growth of the web and the growth of search, there are some interesting parallels.


In the early days, people hand-crafted HTML pages about topics that interested them. Yes, there were directories of researchers and their research, but our student association had a set of pages about the animated series Samurai Pizza Cats. Why? Well, one of the guys who had write access to the web space liked it a lot, and he put together a list of characters and episode synopses. Put that somewhere, link to it from the home page (because why not?) and any visitor could find the page.

And the site of the student association was linked from a directory of computer science student associations (there were many informal contacts) that everyone linked to. So any student looking at associations at other Dutch universities would see this page and, if they were fans too, could have a look. E-mail spam didn’t exist yet, so mail addresses were shared at the bottom of the page, and visitors could mail in additions or praise. And they could tell their friends, who could go there and tell their friends. And if they had pages of their own about this topic, they could link to those pages too, so that interested parties could jump from one set of pages to another. It worked, but there was a large “word of mouth” factor: lots of people making pages about their interests, finding each other through serendipity and slowly building out a network. That didn’t scale as more and more people started making pages.


Then the “web directories” started: a hierarchical list of topics (Entertainment -> TV -> Animation -> Samurai Pizza Cats), with each topic being a list of links with descriptions. So if you were interested in a topic, you’d go to Yahoo, traverse the topic tree and find a list of sites with pages about that topic. You could send in suggestions, but the directory itself was curated by a (growing) list of editors. The World Wide Web was designed to be decentralised (everyone running their own webserver!), but discoverability was low. These directories were an important step in the centralisation of the web: instead of going to your own “homepage” and following the links there to the sites you usually visited, you went to a directory to find the stuff you were interested in. There were some alternatives (like Webrings, remember those?) but the amount of content was too large to keep decentralised.


And at this time, these were largely hand-crafted pages made by hobbyists. But then corporations started to enter the web. At first, they too had hand-crafted content (often served through content management systems, so no hand-crafted HTML anymore) about their products and services. It wasn’t much more than an online signboard: here we are, this is what we do, call us to learn more. Relatively static. If you were interested in a company, you’d go directly to their site and find the info there. The volume of pages shot up, as did the number of topics covered. Directories didn’t cut it anymore, and you needed a search engine. Engines started to download web pages in large quantities and index them. Some of this was quite rudimentary: the first version of Ilse, which would later become the biggest Dutch search engine for a time, simply ran a ‘grep’ over the downloaded copy of the web it had crawled!
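
That really was as crude as it sounds: no index, no ranking, just scanning every downloaded page for the query string. Something along these lines, as a sketch (the directory name and file layout are invented for illustration):

```python
from pathlib import Path

# Hypothetical directory holding a crawled copy of "the web".
CRAWL_DIR = Path("crawled_pages")

def grep_search(query):
    """Return every crawled page whose raw text contains the query.
    No index, no ranking: every search re-reads the entire crawl."""
    query = query.lower()
    hits = []
    for page in CRAWL_DIR.glob("**/*.html"):
        if query in page.read_text(errors="ignore").lower():
            hits.append(str(page))
    return hits
```

Every query rescans the whole crawl, which is exactly why this approach stopped working once the web kept growing.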


I’ll tell you a little secret: search technology kinda sucks. It suffers from the ‘keyword hypothesis’: if word X occurs in a document, then that document is “about” X. We know that this is not true, but it’s the best we have, even today. In 2000, we worked with hand-crafted search filters that could really capture whether a document was about a specific topic. We basically built an expert system for each topic, fed with expert knowledge. While that is cool, it doesn’t scale. So there are ‘tricks’ to work around the badness of the keyword hypothesis. A famous one is TF/IDF, which states that if a term occurs frequently in a document but only rarely in other documents, then that term is a good description of the content of that document. The idea is literally 50 years old, and it is still used in modern search engines like Elasticsearch!
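
To make that concrete, here is a minimal sketch of TF/IDF scoring. The toy corpus and the exact weighting formula are simplifications of my own (real engines use refinements such as BM25), but the intuition is the same:

```python
import math
from collections import Counter

# Toy corpus; in a real engine these would be crawled pages.
docs = {
    "d1": "samurai pizza cats episode guide and character list",
    "d2": "pizza recipes with cheese and tomato",
    "d3": "cats and dogs as pets",
}

tokenised = {doc_id: text.split() for doc_id, text in docs.items()}
N = len(tokenised)

def tf_idf(term, doc_id):
    """How well does `term` describe `doc_id`?
    tf: how often the term occurs in this document.
    idf: how rare the term is across the whole collection; a term
    that occurs everywhere says little about any single document."""
    tf = tokenised[doc_id].count(term)
    df = sum(1 for words in tokenised.values() if term in words)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(N / df)

def search(query):
    """Rank documents by the summed TF/IDF score of the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id in tokenised:
            scores[doc_id] += tf_idf(term, doc_id)
    return scores.most_common()

print(search("pizza cats"))
# d1 comes out on top: it is the only document matching both terms.
```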

But of course, since HTML is structured, you can use that structure to weight terms as well. So: words in the title or in headings count for more, words in the description meta tag count for more, words in the links that point to that page count for more, etcetera.
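
As an illustration of that kind of field weighting (the boost values and field names here are invented, not any real engine’s ranking formula):

```python
# Hypothetical per-field boosts -- the numbers are invented for
# illustration, not taken from any real engine.
FIELD_BOOSTS = {
    "title": 5.0,
    "headings": 3.0,
    "meta_description": 2.0,
    "anchor_text": 4.0,   # words used in links pointing *to* this page
    "body": 1.0,
}

def field_score(term, page):
    """Weighted keyword score: a hit in an important field (title,
    anchor text) counts for more than a hit in the body text."""
    term = term.lower()
    score = 0.0
    for field, boost in FIELD_BOOSTS.items():
        words = page.get(field, "").lower().split()
        score += boost * words.count(term)
    return score

page = {
    "title": "Samurai Pizza Cats episode guide",
    "headings": "Characters Episodes Trivia",
    "meta_description": "Synopses of all Samurai Pizza Cats episodes",
    "anchor_text": "pizza cats fan page",
    "body": "The show follows three cyborg cats defending Little Tokyo...",
}

print(field_score("cats", page))
# Hits in the title, meta description, anchor text and body,
# each weighted by its field's boost.
```

The anchor-text field is the interesting one: it lets other people’s descriptions of your page count towards your ranking, and it is also exactly the kind of signal that later got gamed.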


In 1999, I attended SIGIR in Berkeley, and there was a talk from the founders of Google, which had started just a year before and which had quickly overtaken AltaVista as the prime search engine, precisely because of all these ‘tricks’ (which some would call ‘algorithms’ today). They talked about their history and their plans, and of course many of the researchers were very interested in the internal workings, but it became apparent that they didn’t want to talk about that, at all. PageRank did become public knowledge, because it was patented, and there was some advice on how to write a ‘good’ page that could easily be indexed. But the real internals stayed hidden, precisely because the tricks could be exploited if they became widely known.
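
The public part, PageRank, is easy to sketch: a page is important if important pages link to it, with importance flowing along the links. A minimal power-iteration version on an invented toy link graph (the damping factor of 0.85 is the value from the original paper):

```python
# Toy link graph: page -> pages it links to (invented for illustration).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration: a page's rank is the probability that a
    'random surfer' who follows links (and occasionally jumps to a
    random page) ends up on that page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(links))
# "C" ends up with the highest rank: every other page links to it.
```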


The Google people had correctly predicted that they would become involved in a kind of ‘arms race’ with publishers who would want their pages at the top of the search results. Ironically, Google caused this to happen themselves, by going into the ad business. Having a set of popular pages with ads on them became a viable business model — ads that Google would place on your page and would pay you for. But for that to make you money, you would have to be at the top of Google’s results…

This is why “Search Engine Optimization Consultant” is an actual job that people are paid to do. This is why many people are trying to reverse engineer ‘the’ Google algorithm in order to manipulate their content in such a way that it will appear at the top of the Google search results so that Google will send people to their pages so that those people will see the ads that Google serves them, so that the owner of the pages gets paid by Google for providing the backdrop for the ads. This has become a vicious circle which mainly benefits Google itself. Google doesn’t really care about the quality of the pages they send you to: they care about being able to serve ads to you once you get there. Content farms don’t care about the quality of their work, they care about the money they can get from the ad providers. Quality doesn’t pay the bills, ad money does.

(Of course, there are many people who care deeply about the quality of their work! And most of them are successful in a way that makes the world a better place! But caring is not a _requirement_ for success anymore.)


So yes, search sucks these days. But if we stripped away all the ‘cleverness’ of ‘the algorithms’, we would default back to the good ol’ days of TF/IDF and having to sift through pages and pages of low-quality search results to find the one page that has the information we need. One could also argue that in some cases we are reverting to the days of word of mouth: what else is an article “going viral” than people pointing out something good to others?

Yes, there are search engines that do “linguistic processing”, but those tend to work on specific domains. There is no model for language that works on general use cases, and most investment seems to go towards generating rather than understanding — which makes sense, because operating a content farm can be a profitable business, after all. That justifies the investment.

Rather than the ‘tricks’ ruining search, it is Google’s business model that has ruined search. Not only their own, but everyone’s. The glut of content (which I’m happily contributing to with this post!) is what is congesting the web. As a result, I find myself using search engines less and less. I don’t need to keep up to date with the latest stuff, because that’s exhausting and it’s only a vehicle for showing me ads anyway.


Is there a point to this post? Maybe it is this: search is hard, even in ideal circumstances. Computers are really bad at understanding language and context, but great at statistical analysis — so that is what search algorithms are based on. There is no meaningful search without ‘algorithms’ — search itself is an algorithm. Getting shitty results for your search is the result of shitty input and lack of context.

And whatever you think up as the alternative, it has to work at an unimaginable scale. Statistical analysis scales; almost everything else does not. (One could point to Wikipedia as a thing that scales human curation very well, but one has to realise that Wikipedia, as big as it is, represents only a tiny fraction of all the content that exists on the web.) There might be a case for specialised search engines that cover a specific topic and use human curation, but a generic one…?


It is easy to make fun of sites like MySpace and GeoCities, but I regard that era as the golden age of the web: “citizen publishers” writing about their interests, pointing to similar pages, with high-quality content easy to discover through curated directories and search engines. We have lost that because of the ad-supported model. Once again, capitalism has destroyed something that was good and brought joy.


