Monday, January 03, 2011

Google vs. the Spammers

In an article entitled "Why We Desperately Need a New (and Better) Google", a professor describes the experiences of his students, trying to get past the "tropical paradise for spammers and marketers" in Google:
Almost every search takes you to websites that want you to click on links that make them money, or to sponsored sites that make Google money. There’s no way to do a meaningful chronological search.
He notes,
The problem is that content on the internet is growing exponentially and the vast majority of this content is spam. This is created by unscrupulous companies that know how to manipulate Google’s page-ranking systems to get their websites listed at the top of your search results.... This is exactly what blogger Paul Kedrosky found when trying to buy a dishwasher.... He couldn’t make head or tail of the results. Paul concluded that “the entire web is spam when it comes to major appliance reviews”.
Pretty close. Some websites offer genuine appliance reviews, albeit usually written by customers and thus sometimes unreliable or posted under the wrong make or model. You can sometimes find a decent, free resource such as Greener Choices from Consumer Reports. But most professionally conducted reviews of major appliances are not available for free, or are available only in highlight form, in press releases from the authors or in news stories that discuss the findings. Meanwhile, as the article notes, even the "big boys", one-time ethical players in the "new media" market, are mass producing junk articles:
Content creation is big business, and there are big players involved. For example, Associated Content, which produces 10,000 new articles per month, was purchased by Yahoo! for $100 million, in 2010. Demand Media has 8,000 writers who produce 180,000 new articles each month. It generated more than $200 million in revenue in 2009 and planning an initial public offering valued at about $1.5 billion. This content is what ends up as the landfill in the garbage websites that you find all over the web. And these are the first links that show up in your Google search results.
But that's not necessarily Google's fault - nor is it necessarily the result of a fault in Google. Let's imagine we're a friendly neighborhood search engine spider looking for reviews of major appliances. We find a site that is known for offering quality reviews. Oops - the content is behind a paywall, for subscribers only, and we can't see it. So we keep looking, and find an "off the top of my head" article from Associated Content that at least offers a few tips on selecting an appliance. And we find a store site that has some consumer reviews for the appliance. And we find some low-quality sites that are primarily designed to push affiliate links. Then, finally, we find some machine-generated sites that either aggregate content from other sites or present garbled but keyword-rich text accompanied by ads. Google actually does a pretty decent job of ranking the pages it can see - the real problem is that, for the most part, the content the reader most wants - the costly and labor-intensive testing and comparison of major appliances across a number of relevant variables - simply isn't available for free.

The author implicitly recognizes the value and importance of paid content, having secured premium access to LinkedIn for his students. It's interesting to me that it was when the paid service failed - "some of the [company] founders [students were to contact] didn’t have LinkedIn accounts" - that his students turned to Google. And the complaint is not that the paid service was inadequate, but that the free service didn't quickly and easily turn up information that the paid service could not provide even for a fee.

The author mentions a small search engine, Blekko, that he depicts as overcoming some of the problems faced by Google:
In addition to providing regular search capabilities like Google’s, Blekko allows you to define what it calls “slashtags” and filter the information you retrieve according to your own criteria. Slashtags are mostly human-curated sets of websites built around a specific topic, such as health, finance, sports, tech, and colleges. So if you are looking for information about swine flu, you can add “/health” to your query and search only the top 70 or so relevant health sites rather than tens of thousands spam sites. Blekko crowdsources the editorial judgment for what should and should not be in a slashtag, as Wikipedia does. One Blekko user created a slashtag for 2100 college websites. So anyone can do a targeted search for all the schools offering courses in molecular biology, for example. Most searches are like this—they can be restricted to a few thousand relevant sites. The results become much more relevant and trustworthy when you can filter out all the garbage.
Except who crowdsources the crowdsourcers? Also, you can presently build a custom search engine with Google that does pretty much the same thing. Blekko really seems to be identifying a way to switch by slashtag between custom search engines. Through its primary interface, Google is attempting to give you the results without the slashtags, and it seems unlikely that a search engine that requires users to learn slashtags in order to get meaningful results will ever be more than a niche player.
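
To make the comparison concrete, here is a minimal sketch of what a slashtag amounts to if you think of it as a switchable custom search engine: a named whitelist of sites applied to ordinary results. Everything here is an assumption for illustration - the `SLASHTAGS` lists, the function names and the sample URLs are hypothetical, and nothing about it reflects how Blekko or Google actually implement filtering.

```python
from urllib.parse import urlparse

# Hypothetical curated lists; a real slashtag would hold far more sites.
SLASHTAGS = {
    "health": {"cdc.gov", "nih.gov", "who.int"},
    "colleges": {"mit.edu", "umich.edu"},
}


def parse_query(query):
    """Split 'swine flu /health' into ('swine flu', ['health'])."""
    words, tags = [], []
    for token in query.split():
        if token.startswith("/") and len(token) > 1:
            tags.append(token[1:].lower())
        else:
            words.append(token)
    return " ".join(words), tags


def host_in(url, sites):
    """True if the URL's host is one of the sites or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in sites)


def filter_results(query, results):
    """Return the plain search terms and the results allowed by the slashtags."""
    terms, tags = parse_query(query)
    if not tags:
        return terms, list(results)  # no slashtag: nothing is filtered
    allowed = set()
    for tag in tags:
        allowed |= SLASHTAGS.get(tag, set())
    return terms, [url for url in results if host_in(url, allowed)]


if __name__ == "__main__":
    raw_results = [
        "https://www.cdc.gov/flu/swineflu/",
        "http://cheap-flu-pills.example.com/",
        "https://nih.gov/news/h1n1",
    ]
    print(filter_results("swine flu /health", raw_results))
```

The filter is only as good as whoever maintains the list, which is the "who crowdsources the crowdsourcers" problem in miniature: get a junk domain added to `SLASHTAGS["health"]` and it passes straight through.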

If Blekko "hits it big" it will be swamped with spammers, as was an earlier project of one of its co-founders, the Open Directory Project. Also, while I don't want to diminish the idea of getting rich by leveraging the free labor of the crowd, the Open Directory again suggests that, at a certain level, depending on free volunteer labor doesn't scale very well. If you've already sold your project and have moved on to your next one, that's no big deal. If you're trying to continue a project that relies upon volunteers to sort through and meaningfully categorize billions of web pages, it's a critical problem.

Further, it's not just the creation of slashtags that creates opportunity for spammers. We're told that while Google tends to attach a date to a page based upon when it first finds the page, an oversimplification but seemingly true in many cases, Blekko determines when content was created by "analyzing other information embedded in its HTML". There's an obvious reason, though, why Google has not chosen that approach - I can easily set up my server to embed in my HTML the message, "This is a fresh, new page". Spammers don't care right now because very few people use Blekko and even fewer of them are top targets for spammers. But if Blekko catches on, that approach to dating content will be quickly exploited and rendered completely useless. As the author said, "unscrupulous companies... know how to manipulate Google’s page-ranking systems" - the same is true for any popular search engine. It's not clear that Blekko would survive any degree of popularity. But then, I expect that Blekko's goal is to find a company willing to buy its technology, not to actually become the next big search engine.
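
A small sketch shows why dating a page from its own markup is so easy to game. The article doesn't say what "other information" Blekko reads, so this assumes the simplest possible rule - trust a `<meta name="date">` tag - and the names `embedded_date` and `render_page` are made up for the example.

```python
import re
from datetime import date

# Hypothetical rule: trust the first <meta name="date" content="YYYY-MM-DD"> tag.
META_DATE = re.compile(
    r'<meta\s+name="date"\s+content="(\d{4})-(\d{2})-(\d{2})"',
    re.IGNORECASE,
)


def embedded_date(html):
    """Return the date the page claims for itself, or None if it claims none."""
    match = META_DATE.search(html)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        return None  # e.g. month 13: the tag is present but not a real date


def render_page(body, claimed):
    """What a publisher's server might emit: whatever date it likes."""
    return (
        f'<html><head><meta name="date" content="{claimed.isoformat()}">'
        f"</head><body>{body}</body></html>"
    )


if __name__ == "__main__":
    stale_text = "Keyword-rich filler written years ago."
    page = render_page(stale_text, date.today())  # re-stamped on every request
    print(embedded_date(page))
```

Nothing in the page ties the claimed date to when the text was actually written, so a server that stamps today's date on every request looks perpetually fresh. A date based on when the crawler first saw the page is cruder, but it is at least something the publisher doesn't get to choose.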

Guess what else? In my non-scientific search for "appliance reviews" on Google and Blekko, I found Google's results to be better. Blekko offered more machine-generated compilations of user reviews, while (amidst similar compilations) Google came up with Good Housekeeping and Consumer Reports. A leading reason both searches included a lot of junk is that, as previously mentioned, there's not a whole lot out there for free that's not junk. My suggestion? Find a popular forum whose members are familiar with the products and product lines that interest you, post a question about the products you are thinking of buying, and see what the other members have to say.

Update: Speaking of things that make you go 'blekko'....
