Archive for November, 2009
Search Engines: Social Media, Author Rank and SEO
In my previous discussions of social media, channel architectures, and branding, I discussed the fact that I am manic about locking down my online brand (onlinematters) because there seems to be some relationship in the universal search engines between the number of posts/the number of sites that I post from under a specific username and how my posts rank. It is as if there is some measure of trust given to an author the more he publishes from different sites and the more people see/read/link to what he has written. I am not talking about authority given to the actual content written by the author – that is the core of search. I am talking instead about using the author's behavior and success as a content producer to change where his content ranks for any given search result on a specific search term. It is similar, in many ways, to what happened in the Vincent release where brand became a more important ranking factor. In this case, the author and the brand are synonymous and when the brand is highly valued, then those results would, under my hypothesis, be given an extra boost in the rankings.
This was an instinct call. While I believed I had data to support the theory, I had no research showing that an underlying algorithm had been considered or created to measure this phenomenon in universal search.
I thus considered myself twice lucky while doing my weekly reading on the latest patents to find one that indicates someone is thinking about the issue of "author rank." On October 29th, Jaya Kawale and Aditya Pal of Yahoo! applied for a patent with the name "Method and Apparatus for Rating User Generated Content in Search Results." The abstract reads as follows:
Generally, a method and apparatus provides for rating user generated content (UGC) with respect to search engine results. The method and apparatus includes recognizing a UGC data field collected from a web document located at a web location. The method and apparatus calculates: a document goodness factor for the web document; an author rank for an author of the UGC data field; and a location rank for web location. The method and apparatus thereby generates a rating factor for the UGC field based on the document goodness factor, the author rank and the location rank. The method and apparatus also outputs a search result that includes the UGC data field positioned in the search results based on the rating factor.
Let's see if we can't put this into English comprehensible to the common search geek. Kawale and Pal want to collect data on three specific ranking factors and combine these into a single, weighted ranking factor that is then used to influence the rank ordering of what they term "User Generated Content" or UGC. The authors note that the typical ranking factors in search engines today are not suitable for ranking UGC. UGC items are fairly short, they generally do not have links to or from them (rendering back-link based analysis unhelpful), and spelling mistakes are quite common. Thus a new set of factors is needed to adequately index and rank UGC.
The first issue the patent/algorithm has to deal with is defining what the term UGC includes. The patent specifically mentions "blogs, groups, public mailing lists, Q & A services, product reviews, message boards, forums and podcasts, among other types of content." The patent does not specifically mention social media sites, but those are clearly implied.
The second issue is to determine what sites should be scoured for UGC. UGC sites are not always easy to identify. An example would be a directory in which people rank references based on 5-star rating, where that is the only user input. Is this site easy to identify as a site with UGC? Not really, but somehow the search engine must make a decision whether this site is within its valid universe. Clearly, some mechanism for categorizing sites with UGC needs to exist and while Kawale and Pal use the example of blog search as covering a limited universe of sites, their patent does not give any indication of how sites are to be chosen for inclusion in the crawl process.
Now we come to the ranking factors. The three specific ranking factors proposed by Kawale and Pal are:
- Document Goodness. The Document Goodness Factor is based on at least one (and possibly more) of the following attributes of the document itself: a user rating; a frequency of posts before and after the document is posted; a document's contextual affinity with a parent document; a page click/view number for the document; assets in the document; document length; length of a thread in which the document lies; and goodness of a child document.
- Author Rank. The Author Rank is a measure of the author's authority in the social media realm on a subject, and is based on one or more of the following attributes: a number of relevant posted messages; a number of irrelevant posted messages; a total number of root documents posted by the author within a prescribed time period; a total number of replies or comments made by the author; and a number of groups of which the author is a member.
- Location Rank. Location Rank is a measure of the authority of the site in the social media realm. It can be based on one or more of the following attributes: an activity rate in the web location; a number of unique users in the web location; an average document goodness factor of documents in the web location; an average author rank of users in the web location; and an external rank of the web location.
These ranking factors are not used directly as calculated. They are "normalized" for elements like document length and then combined in some mechanism to create a single UGC ranking factor.
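To make the combination step concrete, here is a minimal sketch in Python. The patent abstract does not disclose the actual formula, so everything specific below is my own assumption for illustration: the min-max normalization, the weights, the score ranges, and the names `UgcSignals`, `normalize` and `ugc_rating`. It shows only the general shape of "normalize three factors, then combine them into one rating."

```python
from dataclasses import dataclass


@dataclass
class UgcSignals:
    """Raw scores for one piece of UGC; the scales are hypothetical."""
    document_goodness: float
    author_rank: float
    location_rank: float


def normalize(value: float, low: float, high: float) -> float:
    """Min-max scale a raw score into [0, 1], clamping out-of-range values."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def ugc_rating(signals: UgcSignals,
               weights=(0.5, 0.3, 0.2),
               ranges=((0, 10), (0, 100), (0, 100))) -> float:
    """Weighted sum of the three normalized factors (assumed weights and ranges)."""
    raw = (signals.document_goodness, signals.author_rank, signals.location_rank)
    scaled = [normalize(v, lo, hi) for v, (lo, hi) in zip(raw, ranges)]
    return sum(w * s for w, s in zip(weights, scaled))


# Two made-up pieces of UGC: a so-so post by a strong author on a decent site,
# and a better post by a little-known author on a weaker site.
posts = {
    "forum reply": UgcSignals(6.0, 80.0, 40.0),
    "blog comment": UgcSignals(8.0, 10.0, 20.0),
}
ranked = sorted(posts, key=lambda name: ugc_rating(posts[name]), reverse=True)
print(ranked)
```

With these invented numbers the forum reply outranks the blog comment even though its document goodness is lower, because the author and location scores pull it up. That is the effect I have been hypothesizing, though the real weights, if any exist, are unknown.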
The main thing to note, and the item that obviously caught my attention, is Author Rank. Note that it has ranking factors that correspond with what I have been hypothesizing exist in the universal search engines. That is to say, search results are ranked not only by the content on the page, but also by the authority of the author who has written them, as determined by how many posts that author has made, how many sites he or she has made them on, how many groups he or she belongs to, and so on.
Can I say for certain that any algorithm like this has been implemented? Absolutely not. But my next task has to be to design an experiment to see if we can detect a whiff of it in the ether. I'll keep you informed.
Technical SEO: Site Loading Times and SEO Rankings Part 2
In my last post, I discussed the underlying issues regarding site loading times and SEO rankings. What I tried to do was help the reader understand why site loading times are important from the perspective of someone designing a search engine that has to crawl billions of pages. The post also outlines a few of the structures that they would have to put in place to accurately and effectively crawl all the pages they need in a limited time with limited processing power. I also tried to show that a search engine like Google has a political and economic agenda in ensuring fast sites, not just a technical agenda. Google wants as many people/eyeballs on the web as possible, so it is to their advantage to ensure that web sites provide a good user experience. As a result, they feel quite justified in penalizing sites that do not have good speed/performance characteristics.
As you would expect, the conclusion is that if your site is hugely slow you will not get indexed and will not rank in the SERPs. What is "hugely slow"? Google has indicated that slow is a relative notion, determined by the loading times typical of sites in your geographical region. Having said that, relative or not, from an SEO perspective I wouldn't want a site whose pages take more than 10 seconds on average to load. In the sites we have tested and built, average times above approximately 10 seconds to completely load a page have had a significant impact on indexing. From a user experience (UE) perspective, there is some interesting data suggesting that the limit of visitors' patience is about 6-8 seconds. Google has studied this data, so it would probably prefer to set its threshold in that region. But I doubt it can. Many small sites are not that sophisticated, do not know these kinds of rules, and do not know how to check or evaluate their site loading times. Besides this, there are often problems with hosts that cause servers to run slowly at times. Google has to take that into account as well. So I believe that the timeout has to be substantially higher than 6-8 seconds, but 10 seconds as a crawl limit is a guess.
I have yet to see a definitive statement by anyone as to what the absolute limit is for site speed before indexing ceases altogether (if you have a reference, please post it in the comments). I'm sure that if a bot comes to a first page and it exceeds the bot's timeout threshold in the algorithm, your site won't get spidered at all. But once the bot gets by the first page, it has to do an on-going computation of average page loading times for the site to determine if the average exceeds the built-in threshold, so at least a few pages would have to be crawled in that case.
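The logic I am describing can be sketched in a few lines of Python. To be clear, this is my guess at the shape of the computation, not anything Google has published: the function name `crawl_site` and both 10-second thresholds are assumptions chosen to match the guesses above.

```python
def crawl_site(load_times, first_page_timeout=10.0, average_limit=10.0):
    """Return how many pages get crawled before the bot gives up.

    load_times: per-page load times in seconds, in crawl order.
    Both thresholds are guesses for illustration, not published values.
    """
    total = 0.0
    crawled = 0
    for seconds in load_times:
        # A first page slower than the timeout means nothing gets spidered.
        if crawled == 0 and seconds > first_page_timeout:
            return 0
        total += seconds
        crawled += 1
        # After the first page, track the running average and stop once
        # it exceeds the limit.
        if crawled > 1 and total / crawled > average_limit:
            break
    return crawled


print(crawl_site([12.0, 3.0, 3.0]))             # slow first page
print(crawl_site([4.0, 9.0, 20.0, 30.0, 2.0]))  # average creeps up
```

In the first call the bot never gets past the home page. In the second it crawls three pages before the running average (33 seconds over 3 pages) crosses the limit, so the last two pages are never seen.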
Now here's where it gets interesting. What happens between fast (let's say loading times under 1-2 seconds, which is actually pretty slow but is a number Matt Cutts indicates is OK in the video below) and the timeout limit? And how important is site speed as a ranking signal? Let's answer one question at a time.
When a site is slow but not slow enough to hit any built-in timeout limits (not tied to the number of pages), a couple of things can happen. We do know that Google allocates bot time by the number of pages on the site and the number of pages it has to index/re-index. So for a small site that performs poorly, it is likely that most of the pages will get indexed. Likely, but not a guarantee. It all depends on the cumulative time lag versus the average that a site creates. If a site is large, then you can almost guarantee that some pages will not be indexed, as the cumulative time lag will ultimately hit the threshold set by the bots for a site of that number of pages. By definition, some of your content will not get ranked and you will not get the benefit of that content in your rankings.
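A back-of-the-envelope model shows why large sites suffer more. I am assuming, purely for illustration, that the bot's time budget is a fixed base allowance plus a per-page allowance. Neither that rule nor the numbers (60 seconds, 2 seconds per page) come from Google, and `pages_indexed` is a hypothetical helper.

```python
def pages_indexed(page_count, avg_load_seconds,
                  base_budget_seconds=60.0, budget_per_page_seconds=2.0):
    """Estimate how many pages fit in the bot's time budget for a site.

    Assumed budget rule: base allowance + per-page allowance * page_count.
    """
    if avg_load_seconds <= 0:
        raise ValueError("avg_load_seconds must be positive")
    budget = base_budget_seconds + budget_per_page_seconds * page_count
    affordable = int(budget // avg_load_seconds)
    return min(page_count, affordable)


print(pages_indexed(10, 5.0))      # small, slow site
print(pages_indexed(1000, 5.0))    # large, slow site
print(pages_indexed(1000, 1.5))    # large, faster site
```

Under these assumptions the 10-page site is fully indexed even at 5 seconds per page, the 1,000-page site at the same speed gets only 412 pages in, and the same large site at 1.5 seconds per page is indexed completely.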
As an aside, by the way, there has been a lot of confusion around the <meta name="revisit-after"> tag. The revisit-after tag is typically written like this: <meta name="revisit-after" content="5 days">
This tag supposedly tells the bots how often to come back to the site to reindex this specific page (in this case 5 days). The idea is that you can improve the crawlability of your site by telling the bots not to index certain pages all the time, but only some of the time. I became aware of this tag at SMX East, when one of the "authorities" on SEO mentioned it as usable for this purpose. The trouble is that, from everything I have read, the tag is completely unsupported by any of the major engines, and was only supported by one tiny search engine (SearchBC) many years ago.
But let's say you are one of the lucky sites where the site runs slowly but all the pages do get indexed. Do Google or any of the other major search engines use the site's performance as a ranking signal? In other words, all my pages are in the index. So you would expect that they would be ranked based on the quality of their content and their authority derived from inbound links, site visits, time-on-site, and other typical ranking signals. Performance is not a likely candidate for a ranking signal and isn't important.
If you thought that, then you were wrong. Historically, Google has said, and Matt Cutts reiterates this in the video below, that site load times do not influence search rankings. But while that may be true now, it may not be in the near future. And this is where Maile's comments took me by surprise. In a small group session at SMX East 2009, Maile was asked about site performance and rankings. She indicated that for the "middle ground" sites that are indexing but loading slowly, site performance may already be used to influence rankings. Who is right, I can't say. These are both highly respected professionals who choose their words carefully.
Whatever is true, Google is sending us signals that this change is coming. Senior experts like Matt and Maile don't say these things lightly. These are well-considered and probably approved positions that they are asked to take. This is Google's way of preventing us from getting mad when the change occurs. Google has the fallback of saying "we warned you this could happen." Which from today's viewpoint means it will happen.
Conclusion: Start working on your site performance now, as it will be important for SEO rankings later.
Oh and, by the way, your user experience will just happen to be better, which is clearly the real reason to fix site performance.
And it isn't only Google that may make this change. Engineers from Yahoo! recently filed a patent with the title "Web Document User Experience Characterization Methods and Systems" which bears on this topic. Let me quote paragraph 21:
With so many websites and web pages being available and with varying hardware and software configurations, it may be beneficial to identify which web documents may lead to a desired user experience and which may not lead to a desired user experience. By way of example but not limitation, in certain situations it may be beneficial to determine (e.g., classify, rank, characterize) which web documents may not meet performance or other user experience expectations if selected by the user. Such performance may, for example, be affected by server, network, client, file, and/or like processes and/or the software, firmware, and/or hardware resources associated therewith. Once web documents are identified in this manner the resulting user experience information may, for example, be considered when generating the search results.
It does not appear that Yahoo! has implemented any aspect of this patent yet, and who knows what the Bing agreement will mean for site performance and search. But clearly this is a "problem" that the search engine muftis have set their eyes on, and I would expect that if Google does implement it, others will follow.
