
Let’s create a FOSS search engine

Since writing that post about what it would take for me to move away from Google as a search engine, I’ve been giving it some thought. Why can’t search be open-sourced?

I mean, I know why Google hasn’t opened up their repositories for public inspection: without a corner on the search market, their ads become much less valuable. Call me cynical; maybe I’ll write about that point of view another time. But really, with how popular it’s becoming to use FOSS solutions for everything these days, isn’t it about time to have an open engine? A cursory Google search (oh, irony) came up with a couple of possibilities, but unless I’ve been living under a rock, there’s no really viable open alternative out there. Sure, there are open search standards, but I’m pretty sure I’d’ve noticed a concrete implementation.

What are some of the benefits? Well, for one thing, it would have a robust API. If everything is transparent (even the crawled and indexed data), there would be no reason to keep it under lock and key. Any site would be able to request search data via the API. An immediate benefit would be far more relevant results than a SELECT * FROM table WHERE column LIKE '%term%' query could ever hope to provide.
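To make that LIKE comparison concrete, here is a minimal sketch in Python. It is not how any real engine works: the three-document corpus, the function names, and the choice of a simple TF-IDF score are all my own assumptions, picked only to show the difference between "contains the string" and "ranked by relevance."

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for a crawled index.
DOCS = {
    "a": "open source search engine with an open index",
    "b": "researching engines for model airplanes",
    "c": "how to search the web",
}

def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()

def like_match(term, docs):
    """Rough equivalent of LIKE '%term%': unranked substring hits."""
    return [doc_id for doc_id, text in docs.items() if term in text.lower()]

def tf_idf_rank(query, docs):
    """Rank documents by the summed TF-IDF score of the query terms."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n = len(docs)
    scores = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the whole word.
            df = sum(1 for w in tokens.values() if term in w)
            if df == 0 or counts[term] == 0:
                continue
            tf = counts[term] / len(words)
            idf = math.log((1 + n) / (1 + df)) + 1  # smoothed IDF
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

print(like_match("search", DOCS))        # substring hits, in storage order
print(tf_idf_rank("open search", DOCS))  # best match first
```

The substring version happily returns the model-airplane page because "researching" contains "search", and it has no opinion about which hit is best. The ranked version drops that page and puts the document that actually talks about open search first.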

Another benefit would be the weight of the community behind finding solutions in the result-ranking problem space. Results should be ranked by relevance and the quality of content, not by tricks and schemes to game the system. If the nameless masses get to decide what qualifies as “relevant content,” you’ve just gone a long way toward solving the problem. Even if content has to be curated to some extent to keep it relevant (because let’s face it: the major players on the market have spent truckloads of money on developing algorithms to do this automatically, and they still don’t always get it right), the curation can be done in a public, transparent fashion, and even crowdsourced to the userbase to some extent.

Third, we’d own the web, and our use of the information on it. For a long time, I thought I was a fan of Google’s approach of datamining and personalization. “I’ll sign away whatever I have to so that they can keep giving me awesome stuff,” was my mindset. Even now, it’s sometimes strangely appealing to have someone else decide what I find interesting… but that’s not how it should be. Yes, I’ll grant you that having personalized results over arbitrary ones is nice, but it can be taken too far: when Facebook and Google and Twitter and the rest become so good at recommending things that they know I’ll like that it takes preference over anything else, I’ve just been robbed of the opportunity to expand my worldview and experience things outside of my comfort zone.

Now, I’m not talking about a “show snuff films to your grandma”-level of out-of-comfort-zone, but consider this example: if I like Italian food, and I spend time on the internet looking up stuff about Italian food at least a couple times a week, then it should come as no shock to anyone when a personalization algorithm suggests that I check out content about Italian food.

What would be shocking, and pleasantly so, is if the personalization algorithm came up with a result that said something to the effect of, “I know you’re not usually a seafood person, but you should really check out this recipe for lobster bisque. It’s won a lot of awards, and you’d like it,” and turned out to be right. Or if not right, then at least I’d be able to learn a little more about myself: yeah, the lobster bisque might be award-winning, but either I’m not a very good chef, or I really don’t like seafood.

A lot of the responsibility for staying informed is taken from us when the media that we consume has been so heavily tailored that we only hear about or see results for things that are safely inside our comfort zone. If we choose to stay sheltered, fine, but it should be our choice. In the open web that I’m suggesting, the input sanitization that I’ve described wouldn’t occur. Filtering and sterilization, if they took place at all, would happen in a very transparent way: a summary of result customizations would be available with every query, along with the necessary tools to modify those customizations.
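Here is a rough sketch of what "a summary of result customizations with every query" might look like in code. Everything in it is hypothetical: the Customization and SearchResponse classes, the hide_seafood filter, and the search function are names I made up to illustrate the idea, not part of any existing engine.

```python
from dataclasses import dataclass, field

@dataclass
class Customization:
    """One named tweak to the results that the user can switch off."""
    name: str
    description: str
    enabled: bool = True

@dataclass
class SearchResponse:
    """Results plus a plain-language log of every customization applied."""
    results: list
    applied: list = field(default_factory=list)

def search(candidates, customizations, filters):
    """Apply only the enabled customizations and report each one that ran."""
    results = list(candidates)
    applied = []
    for custom in customizations:
        if not custom.enabled:
            continue
        before = len(results)
        keep = filters[custom.name]
        results = [r for r in results if keep(r)]
        applied.append(f"{custom.name}: removed {before - len(results)} result(s)")
    return SearchResponse(results, applied)

# Hypothetical data: raw candidates and one personalization filter.
candidates = ["pasta recipe", "lobster bisque recipe", "risotto recipe"]
hide_seafood = Customization("hide_seafood", "You rarely open seafood results")
filters = {"hide_seafood": lambda r: "lobster" not in r}

tailored = search(candidates, [hide_seafood], filters)
print(tailored.results, tailored.applied)

hide_seafood.enabled = False  # the user opts out of the filter
unfiltered = search(candidates, [hide_seafood], filters)
print(unfiltered.results, unfiltered.applied)
```

The point of the design is that the response never hides what was done to it: every filter that ran is listed alongside the results, and flipping one flag brings the lobster bisque back.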

I know that what I’ve described is a big project, and a lofty goal, but I think it would be an enormous step toward creating the kind of web where users control the information, instead of the other way around. Thoughts?
