Posts Tagged ‘Search Engines’
In my previous discussions of social media, channel architectures, and branding, I discussed the fact that I am manic about locking down my online brand (onlinematters) because there seems to be some relationship in the universal search engines between the number of posts/the number of sites that I post from under a specific username and how my posts rank. It is as if there is some measure of trust given to an author the more he publishes from different sites and the more people see/read/link to what he has written. I am not talking about authority given to the actual content written by the author – that is the core of search. I am talking instead about using the author's behavior and success as a content producer to change where his content ranks for any given search result on a specific search term. It is similar, in many ways, to what happened in the Vincent release where brand became a more important ranking factor. In this case, the author and the brand are synonymous and when the brand is highly valued, then those results would, under my hypothesis, be given an extra boost in the rankings.
This was an instinct call, and while I believed I had data to support the theory, I had no research showing that an underlying algorithm had actually been considered or created to measure this phenomenon in universal search.
I thus considered myself twice lucky while doing my weekly reading on the latest patents to find one that indicates someone is thinking about the issue of "author rank." On October 29th, Jaya Kawale and Aditya Pal of Yahoo! applied for a patent with the name "Method and Apparatus for Rating User Generated Content in Search Results." The abstract reads as follows:
Generally, a method and apparatus provides for rating user generated content (UGC) with respect to search engine results. The method and apparatus includes recognizing a UGC data field collected from a web document located at a web location. The method and apparatus calculates: a document goodness factor for the web document; an author rank for an author of the UGC data field; and a location rank for web location. The method and apparatus thereby generates a rating factor for the UGC field based on the document goodness factor, the author rank and the location rank. The method and apparatus also outputs a search result that includes the UGC data field positioned in the search results based on the rating factor.
Let's see if we can't put this into English comprehensible to the common search geek. Kawale and Pal want to collect data on three specific ranking factors and to combine these into a single, weighted ranking factor that is then used to influence rank ordering based on what they term "User Generated Content" or UGC. The authors note that typical ranking factors in search engines today are not suitable for ranking UGC. UGC items are fairly short, they generally do not have links to or from them (rendering back-link-based analysis unhelpful), and spelling mistakes are quite common. Thus a new set of factors is needed to adequately index and rank content from UGC.
The first issue the patent/algorithm has to deal with is defining what the term UGC includes. The patent specifically mentions "blogs, groups, public mailing lists, Q & A services, product reviews, message boards, forums and podcasts, among other types of content." The patent does not specifically mention social media sites, but those are clearly implied.
The second issue is to determine what sites should be scoured for UGC. UGC sites are not always easy to identify. An example would be a directory in which people rank references on a 5-star rating scale, where that is the only user input. Is this site easy to identify as a site with UGC? Not really, but somehow the search engine must make a decision whether this site is within its valid universe. Clearly, some mechanism for categorizing sites with UGC needs to exist, and while Kawale and Pal use the example of blog search as covering a limited universe of sites, their patent does not give any indication of how sites are to be chosen for inclusion in the crawl process.
Now we come to the ranking factors. The three specific ranking factors proposed by Kawale and Pal are:
- Document Goodness. The Document Goodness Factor is based on at least one (and possibly more) of the following attributes of the document itself: a user rating; a frequency of posts before and after the document is posted; a document's contextual affinity with a parent document; a page click/view number for the document; assets in the document; document length; length of a thread in which the document lies; and goodness of a child document.
- Author Rank. The Author Rank is a measure of the author's authority in the social media realm on a subject, and is based on one or more of the following attributes: a number of relevant posted messages; a number of irrelevant posted messages; a total number of root documents posted by the author within a prescribed time period; a total number of replies or comments made by the author; and a number of groups to which the author is a member.
- Location Rank. Location Rank is a measure of the authority of the site in the social media realm. It can be based on one or more of the following attributes: an activity rate in the web location; a number of unique users in the web location; an average document goodness factor of documents in the web location; an average author rank of users in the web location; and an external rank of the web location.
These ranking factors are not used directly as calculated. They are "normalized" for elements like document length and then combined in some mechanism to create a single UGC ranking factor.
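The patent abstract does not say how the three factors are combined, so the following is only a minimal Python sketch of one plausible approach: a weighted average of scores that are assumed to be already normalized to the range 0 to 1. The names `UgcScores`, `WEIGHTS`, `rating_factor`, and `rank_results`, and the weight values themselves, are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass
class UgcScores:
    """Three per-item scores, each assumed to be already normalized to [0, 1]."""
    document_goodness: float
    author_rank: float
    location_rank: float


# Hypothetical weights for illustration; the patent abstract publishes no values.
WEIGHTS = {"document_goodness": 0.5, "author_rank": 0.3, "location_rank": 0.2}


def rating_factor(scores: UgcScores, weights: dict = WEIGHTS) -> float:
    """Combine the three scores into one rating factor as a weighted average."""
    total_weight = sum(weights.values())
    weighted_sum = (
        weights["document_goodness"] * scores.document_goodness
        + weights["author_rank"] * scores.author_rank
        + weights["location_rank"] * scores.location_rank
    )
    return weighted_sum / total_weight


def rank_results(items: dict) -> list:
    """Return item ids ordered from highest to lowest rating factor."""
    return sorted(items, key=lambda item_id: rating_factor(items[item_id]), reverse=True)


if __name__ == "__main__":
    results = {
        "forum-reply": UgcScores(0.9, 0.2, 0.5),
        "blog-comment": UgcScores(0.6, 0.9, 0.7),
    }
    for item_id in rank_results(results):
        print(item_id, round(rating_factor(results[item_id]), 2))
```

With these made-up weights, a moderately good comment from a strong author on a strong site outranks a better comment from an unknown author, which is the kind of effect an author-based factor would produce.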
The main thing to note – and the item that caught my attention, obviously – is Author Rank. Note that it has ranking factors that correspond with those I have been hypothesizing exist in the universal search engines. That is to say, search results are not ranked only by the content on the page, but by the authority of the author who has written them, as determined by how many posts that author has made, how many sites he or she has made them on, how many groups he or she belongs to, and so on.
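To make the Author Rank idea more tangible, here is a toy Python sketch built from the attributes the patent lists (relevant and irrelevant posts, root documents, replies, group memberships). The formula is purely an assumption for illustration: the patent names the inputs but the abstract gives no equation, and `AuthorActivity` and `author_rank` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class AuthorActivity:
    relevant_posts: int
    irrelevant_posts: int
    root_documents: int  # threads started within the chosen time window
    replies: int
    groups_joined: int


def author_rank(a: AuthorActivity) -> float:
    """Toy author rank in [0, 1): share of relevant posts scaled by damped activity."""
    total_posts = a.relevant_posts + a.irrelevant_posts
    if total_posts == 0:
        return 0.0
    relevance_share = a.relevant_posts / total_posts
    # log1p damps raw volume so sheer posting count cannot dominate the score.
    volume = math.log1p(a.root_documents + a.replies + a.groups_joined)
    activity = volume / (1.0 + volume)  # squashes the damped volume into [0, 1)
    return relevance_share * activity


if __name__ == "__main__":
    focused = AuthorActivity(90, 10, 20, 60, 5)
    scattered = AuthorActivity(10, 90, 20, 60, 5)
    print(author_rank(focused) > author_rank(scattered))
```

Under this assumed formula, two authors with identical activity levels are separated by how much of what they post is relevant, so volume alone does not buy authority.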
Can I say for certain that any algorithm like this has been implemented? Absolutely not. But my next task has to be to design an experiment to see if we can detect a whiff of it in the ether. I'll keep you informed.
But of course, I don’t want to ignore the previous Vincent update – as that was the connection to post #1.
Orion first. Actually Google did not announce “Orion” – which is a search technology it purchased in 2006, along with its college-student developer Ori Allon. But my guess is that thanks to Greg Sterling’s new article containing that title, the term “Orion Release” will stick. Here’s how Danny Sullivan described the technology back in April 2006:
It sounds like Allon mainly developed an algorithm useful in pulling out better summaries of web pages. In other words, if you did a search, you’d be likely to get back extracted sections of pages most relevant to your query.
Ori himself wrote the following in his press release:
Orion finds pages where the content is about a topic strongly related to the key word. It then returns a section of the page, and lists other topics related to the key word so the user can pick the most relevant.
Google actually announced two changes:
Longer Snippets. When users input queries of more than three words, the Google results will now contain more lines of text in order to provide more information and context. As a reminder, a snippet is a search result that starts with a dark blue title and is followed by a few lines of text. Google’s research must have shown that regular-length snippets were not providing enough information to searchers to provide a clear preference for a result based on their longer search term – as their stated intent is to provide enhanced information that will improve the searcher’s ability to determine the relevance of items listed in the SERPs.
Having said this, I don’t see any difference. My slav…. I mean my 12-yo son (who has been doing keyword analysis since he was 10, so no slouch at this) ran ten tests on Google to see if we could find a difference (I won’t detail all the one- and two- vs 3+ word combinations we tried – if you want to have the list, leave a comment or send a twitter to arthurofsun and I will forward it to you). But shown below are the results for France Travel vs France Travel Guides for Northern France:
As you can see, there is absolutely no difference in snippet length for the two searches - and this was universally true across all the searches we ran. So I’m not sure – I wonder if Ori Allon, who wrote the post, could help us out on this one.
Also, I am somewhat confused. If you type in more keywords, the search engine has more information by which to determine the relevance of a result. So why would I need more information? Where I need more information is in the situation of a search with three or fewer keywords, which will return a broad set of results that I will need to filter based on the information contained in a longer snippet.
Enhanced Search Associations. The bigger enhancement – and the one that seems most likely to derive from the original Orion technology – is enhanced associations between keywords. Basically if you type in a keyword – Ori uses the example “principles of physics” – then the new algorithms understand that there are other ideas related to this that you may be interested in, like “Big Bang” or “Special Relativity.” The way Google has implemented this is to put a set of related keywords at the bottom of the first SERP, which you may click on. When you click, it returns a new set of search results based on the keyword you clicked. Why at the bottom of the first SERP? My hypothesis would be that if the searcher has gone to the bottom of the page, it means that they haven’t found what they are looking for. So this is the right place in the user experience to prompt them with related keywords that they may find more relevant to the content they are seeking.
From my perspective, this feels like the “People who liked this item also bought…” widget on most comparison shopping sites (which I know something about, having been the head of marketing for SHOP.COM.) I’m not saying there is anything wrong with this – I’m just trying to make an analogy to the type of user experience Google is trying to create.
Shown below is an example of enhanced search associations from a search on the broad term “credit derivatives in the USA”:
As I expected, the term “credit default swaps” – which is the major form of credit derivative – shows as an associated keyword. What I do not see in the list – and this surprised me – is any reference to the International Swaps and Derivatives Association (ISDA), which is the organization that has developed the standards and rules by which most derivatives are created. It does, however, show up for the search on the keyword “credit default swap.” I’d be curious to understand just exactly how the algorithm has been tuned to make trade-offs between broad concepts (i.e., credit derivatives, which is a category) and very focused concepts (i.e., credit default swap, which is a specific product). Maybe I can get Ori to opine on that as well, but most likely that comes under the category of secret sauce.
Anyway, fascinating and it certainly shows that Google continues to evolve the state of IR.
Well, I’ll just have to leave the Vincent release until tomorrow. Something else happened this morning I need to do a quick entry about. Sigh…..