About Online Matters

Archive for the ‘Technical SEO’ Category

Web Site Latency and Performance Issues – Part 6

Taking up where we left off in part 5…

In the last post, we had just moved aboutonlinematters.com to a privately-hosted standalone server and seen a substantial decrease in web site latency. Our rating in Google Page Speed had improved from better than 43% of similar websites to better than about 53%. So, great improvement. But we were still showing a lot of issues in ySlow and the Google Page Speed tool. These fell into three categories:

  • Server Configuration. This involves optimizing settings on our Apache web server: enabling gzip for file compression, configuring entity tags, adding Expires headers, turning on keep-alive, and splitting components across domains.
  • Content Compression. This involves items like compressing images, JavaScript, and CSS, specifying image sizes, and reducing the number of DOM elements.
  • Reducing External Calls. This involves combining all external CSS and JavaScript files into a single file, using cookieless domains, minimizing DNS lookups and redirects, and optimizing the order and style of scripts.

We decided to attack the web site latency issues in stages, starting with the elements that were easiest to fix (server configuration) and leaving the most difficult (reducing external calls) until last.

Server Configuration Issues

In their simplest form, server configuration issues related to web site latency have to do with settings on a site’s underlying web server, such as Apache.  For larger enterprise sites, server configuration covers a broader set of technical topics, including load balancing across multiple servers and databases as well as the use of a content delivery network.  This section covers only the former as it relates to web site latency.

With Apache (and Microsoft IIS), the server settings we care about can be managed and tracked through a page’s HTTP headers.  Thus, before we get into the settings we specifically care about, we need to have a discussion of what HTTP headers are and why they are important.

HTTP Headers

HTTP headers are an Internet protocol, or set of rules, for formatting certain types of data and instructions that are either:

  • included in a request from a web client/browser, or
  • sent by the server along with a response to a browser.

HTTP headers carry information in both directions.  A client or browser can make a request to the server for a web page or other resource, usually a file or dynamic output from a server side script.  Alternately, there are also HTTP headers designed to be sent by the server along with its response to the browser or client request.

As SEOs, we care about HTTP headers because our request from the client to the server will return information about various elements of server configuration that may impact web site latency and performance. These elements include:

  • Response status; 200 is a valid response from the server.
  • Date of the request.
  • Server details; type, configuration, and version numbers (for example, the PHP version).
  • Cookies; any cookies set on your system for the domain.
  • Last-Modified; only available if set on the server, and usually the time the requested file was last modified.
  • Content-Type; text/html is an HTML web page, text/xml an XML file.

There are two kinds of requests. A HEAD request returns only the header information from the server. A GET request returns both the header information and file content exactly as a browser would request the information. For our purposes, we only care about HEAD requests. Here is an example of a request:

Request headers sent:
HEAD / HTTP/1.0
Host: www.aboutonlinematters.com
Connection: Close

And here is what we get back in its simplest form using the Trellian Firefox Toolbar:

Response: HTTP/1.1 200 OK
Date: Sun, 04 Apr 2010 00:17:06 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.3.0
X-Pingback: http://www.aboutonlinematters.com/xmlrpc.php
Link: <http://wp.me/DbBZ>; rel=shortlink
Content-Encoding: gzip
Cache-Control: max-age=31536000
Expires: Mon, 04 Apr 2011 00:17:06 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: Chunked
Proxy-Connection: Keep-alive
x-ua-compatible: IE=EmulateIE7

Different tools will return different header information depending on the specific requests made in the calling script. For example, Live HTTP Headers, a plugin for Firefox, provides detailed header request and response information for every element on a page (it basically breaks out each GET and shows you the actual response that comes back from the server). This level of detail will prove helpful later when we undertake deep analysis to reduce external server requests. But for now, what is shown here is adequate for the purposes of our analysis.
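If you prefer the command line to a browser toolbar, the same page-level check can be done with curl, whose -I flag sends a HEAD request:

curl -I http://www.aboutonlinematters.com/

The output is essentially the response block shown above, though some headers (Content-Encoding, for example) will only appear if the request also advertises support for them via Accept-Encoding.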

For a summary of HTTP header requests and response codes, click here.  But for now, let’s get back to configuring our Apache server to reduce web site latency.

Apache Server Settings Related to Site Latency

Enabling Gzip Compression

Web site latency substantially improves when the amount of data that has to flow between the server and the browser is at a minimum.  I believe I’ve read somewhere that image requests account for 80% of the load time of most web pages, so just following good image-handling protocols for web sites (covered in a later installment) can substantially improve web site latency and page loading times.  However, manually compressing images is painful and time consuming.  Moreover, there are other types of files – Javascript and CSS are the most common – that can also be compressed.

Designers of web servers identified this problem early on and provided a built-in tool on their servers for compressing files moving between the server and the browser.  Starting with HTTP/1.1, web clients indicate support for compression by including the Accept-Encoding header in the HTTP request.

Accept-Encoding: gzip, deflate

If the web server sees this header in the request, it may compress the response using one of the methods listed by the client. The web server notifies the web client of this via the Content-Encoding header in the response.

Content-Encoding: gzip

Gzip remains the most popular and effective compression method. It was developed by the GNU project and standardized by RFC 1952. The main alternative is deflate, but it is less effective and less popular.

Gzipping generally reduces the response size by about 70%.  Approximately 90% of today’s Internet traffic travels through browsers that claim to support gzip. If you use Apache, the module for configuring gzip depends on your version: Apache 1.3 uses mod_gzip while Apache 2.x uses mod_deflate.
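On Apache 2.x, once mod_deflate is loaded, enabling compression can be as simple as one directive in the configuration file. Here is a minimal sketch; the MIME types listed are just the usual text-based candidates and should be adjusted for your own site (images are normally left out because they are already compressed formats):

AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml application/javascript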

Configuring Entity Tags

Web servers and browsers use entity tags (ETags) to determine whether a component in the browser’s cache, like an image or script (both examples of an “entity”), matches the one on the origin server. An ETag is a simple string, surrounded by quotation marks, that uniquely identifies a specific version of the component. The origin server specifies the component’s ETag using the ETag response header.

HTTP/1.1 200 OK
Last-Modified: Sun, 04 Apr 2010 00:37:48 GMT
Etag: "1896-bf9be880"
Expires: Mon, 04 Apr 2011 00:37:48 GMT

Later, if the browser has to validate a component, it uses the If-None-Match header to pass the ETag back to the origin server. If the ETags match, a 304 status code is returned.

GET http://www.aboutonlinematters.com/wp-content/plugins/web-optimizer/cache/f39a292fcf.css?1270299922 HTTP/1.1
Host: www.aboutonlinematters.com
If-Modified-Since: Sun, 04 Apr 2010 00:37:48 GMT
If-None-Match: "1896-bf9be880"
HTTP/1.1 304 Not Modified

ETags can impact site latency because they are typically constructed using attributes that make them unique to a specific server. ETags won’t match when a browser gets the original component from one server and later tries to validate that component on a different server, which is a fairly standard scenario on web sites that use a cluster of servers to handle requests. By default, both Apache and IIS embed data in the ETag that dramatically reduces the odds of the validity test succeeding on web sites with multiple servers. If the ETags don’t match, the web client doesn’t receive the small, fast 304 response that ETags were designed for.  Instead, it gets a normal 200 response along with all the data for the component.  This isn’t a problem for small sites hosted on a single server. But it is a substantial problem for sites with multiple servers using Apache or IIS with the default ETag configuration.  Web clients see higher web site latency, web servers carry a higher load, bandwidth consumption is higher, and proxies aren’t caching content efficiently.

So when a site does not benefit from the flexible validation model provided by ETags, it’s better to just remove the ETag altogether. In Apache, this is done by simply adding the following line to your Apache configuration file:

FileETag none
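Some setups also strip the header explicitly with mod_headers as a belt-and-suspenders measure; assuming that module is loaded, the extra line is:

Header unset ETag

Alternatively, a multi-server site that wants to keep ETags can set FileETag to MTime Size so the server-specific inode component is dropped and the tags match across machines.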

Expires Headers

The Expires header makes the components returned in an HTTP response cacheable. This avoids unnecessary HTTP requests on any page views after the initial visit, because components downloaded during the initial visit, for example images and script files, remain in the browser’s local cache and do not have to be downloaded on subsequent requests. Expires headers are most often used with images, but they should be used on all components including scripts, stylesheets, and Flash components.

Browsers (and proxies) use a cache to reduce the number and size of HTTP requests, making web pages load faster.  The Expires header in the HTTP response tells the client how long a component can be cached. This far future Expires header

Expires: Thu, 15 Apr 2020 20:00:00 GMT

tells the browser that this response won’t be stale until April 15, 2020.

Apache uses the ExpiresDefault directive to set an expiration date relative to the current date. So for example:

ExpiresDefault "access plus 10 years"

sets the Expires date 10 years out from the time of the request.
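Note that ExpiresDefault only takes effect when mod_expires is loaded and switched on, and you will usually want different lifetimes for different content types. A minimal sketch, with types and periods that are only examples:

ExpiresActive On
ExpiresByType image/png "access plus 1 year"
ExpiresByType text/css "access plus 1 month"
ExpiresDefault "access plus 1 week"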

Using a far future Expires header affects page views only after a user has already visited a site for the first time or when the cache has been cleared. Therefore the impact of this performance improvement depends on how often users hit your pages with a primed cache. In the case of About Online Matters, we still do not get lots of visitors, so you would expect this change to the server to have little impact on our performance and, indeed, that proved to be true.

Keep Alive Connections

The Keep-Alive extension to HTTP/1.0 and the persistent connection feature of HTTP/1.1 provide long-lived HTTP sessions that allow multiple requests to be sent over the same TCP connection. This prevents a new connection from being opened for every object on a page and instead lets multiple objects be requested and retrieved in a single session.  Each new TCP connection requires a three-way handshake and is subject to built-in congestion-control algorithms that restrict available bandwidth at the start of a connection, so making multiple requests over a single connection reduces the number of times that startup cost is paid.  As a result, in some cases, enabling keep-alive on an Apache server has been shown to produce an almost 50% speedup in latency for HTML documents with many images.  To enable keep-alive, add the following line to your Apache configuration:

KeepAlive On
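Two related directives are worth reviewing at the same time, since long-lived connections tie up server processes; the values below are illustrative starting points rather than recommendations:

KeepAliveTimeout 5
MaxKeepAliveRequests 100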

Is The Configuration Correct?

When I make these various changes to the server configuration, how can I verify they have actually been implemented?  This is where the HTTP headers come into play.  Let’s take a look at the prior response we got from www.aboutonlinematters.com when we made a HEAD request:

Response: HTTP/1.1 200 OK
Date: Sun, 04 Apr 2010 00:17:06 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.3.0
X-Pingback: http://www.aboutonlinematters.com/xmlrpc.php
Link: <http://wp.me/DbBZ>; rel=shortlink
Content-Encoding: gzip
Cache-Control: max-age=31536000
Expires: Mon, 04 Apr 2011 00:17:06 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: Chunked
Proxy-Connection: Keep-alive
x-ua-compatible: IE=EmulateIE7

The Content-Encoding, Expires/Cache-Control, and Proxy-Connection lines show that the gzip, Expires header, and keep-alive switches have been implemented on our server.  ETags won’t show in this set of responses because ETags are associated with a specific entity on a page.  They show instead in tools that provide detailed analysis of HTTP requests and responses, such as Live HTTP Headers or Charles.  No ETags should be visible in an HTTP request or response if FileETag none has been implemented.
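A quick way to re-check after each configuration change is a command-line HEAD request filtered down to the headers we care about. Assuming curl and grep are available, something like this does the job (the Accept-Encoding header is sent so the server has a reason to answer with Content-Encoding: gzip):

curl -sI -H "Accept-Encoding: gzip,deflate" http://www.aboutonlinematters.com/ | grep -iE "content-encoding|expires|cache-control|etag|connection"

Gzip, the far-future Expires date, and keep-alive should all show up, and no ETag header should appear at all once the ETag removal is in place.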

Results

We made the changes in two steps.  First we activated gzip compression and Expires headers and removed ETags.  These changes produced only a negligible improvement in overall web site latency.  Then we implemented the keep-alive setting.  Almost immediately, our site latency rating in the Google Page Speed tool improved from better than 53% of similar sites to better than 61%.

We’ll stop there for today and pick up with content compression in the next installment.


Web Site Latency and Site Performance Part 5

Well, it’s Monday. A good Monday with some interesting insights.

I will continue with tool review going forward, but I’m finding that I need to document our work on our website performance as we go along or else we lose the data from the intermediate steps, and there have already been several that have been implemented. So let me bring you up to speed.

After my last post about the site and reviewing the data from the Google Site Performance tab in Google Webmaster Tools, I was able to visualize (see the image) what was going on. As the image shows, performance jumped around substantially from mid-September, when I started the blog, until early-to-mid December. These jumps did not coincide in any major way with the debugging and latency improvements that I had been working on, except for December, around the time of my last post. That change seemed to have cut my latency in half, which was what Pingdom had shown. So perhaps I was moving in the right direction.

Things continued to improve steadily through January, even though I had not changed any further settings. This again suggested that being hosted on a shared server, and perhaps my ISP improving the performance of that server, might be the reason for the unpredictable performance changes, good or bad. But then in mid-January, I started to see a jump in latency times again.

Google Site Performance Chart for AboutOnlineMatters

At the same time, I wanted to continue debugging AboutOnlineMatters site latency and implement some of the changes from ySlow, such as gzip, entity tags, and expires headers. To do that, I needed direct access to my Apache Server. Given these two facts, I decided that it was time to remove the server as a factor and host the blog myself.

On February 6, we moved the site onto our own hosted setup. This is basically a dedicated server (we do have a few other small sites running, but they are using insignificant server resources) and I have direct access to all the configuration settings. From that time forward, as the chart shows, site latency has decreased continually until it is now close to its historical lows.

I’ll leave it there for now – following my rule of short posts. We’ll pick up the next steps I took tomorrow.


Highlights from SMX West 2010 – Part 1

I have been away from the blog for waaay too long. Not a good thing. As someone asked yesterday at the “Ask the SEOs” session at SMX West: “Is it more important to blog often, to blog long articles less frequently, or to optimize your posts for SEO in order to rank well in the SERPs for the keywords you care about?”  The answer from the panel of luminaries (Greg Boser, Bruce Clay, Rand Fishkin substituting for Rae Hoffman, Todd Friesen, Vanessa Fox, Aaron Wall, and Jill Whalen) was mixed depending on who responded.  But to me, if I had to choose only one, it would be to blog often, as that gets you crawled more frequently, broadens your keyword base, and generates more followers, because people tend to subscribe to folks who generate regular, fresh content.  So I’ve broken my own rule.

Well, let’s see if I can change that with a quick post today.  To help those who couldn’t be at SMX West these last three days, I will list some tidbits here that I found useful.   These will go all over the place, so bear with me. This is Part 1 – I will cover more of the highlights in subsequent posts.

Also, before I get started, let me say a big “thanks” to the folks at SearchEngineLand for putting on another great and informative event. It was fun and you never fail to educate me on things I hadn’t discovered for myself. I’ll see you, of course, at SMX Advanced in June.

Insights for Mobile Search

  1. Isolate mobile into its own unique campaign.
    • KPIs are lower than in traditional paid search – so isolate it to protect your other programs.
    • Isolating helps you optimize to raise KPIs of mobile specifically.
    • Adwords lets you segment your campaigns by carrier and site you are on. Cool!
  2. In mobile search there are only 5 results above the fold, and you must be in position 1 or 2 or not at all – there is a HUGE falloff after those 2 positions.  Literally near zero clicks.
  3. Mobile query lengths are shorter – no big surprise there.  Search term length example: in a study of 414 broad match queries on mobile, the average query length was 2.8 words and the longest phrase was 8 words (used 3 times).  So it is very important to run broad match on mobile campaigns.
  4. Length of query versus Click Through Rate is pretty similar between mobile and desktop search.
  5. Most popular categories for mobile right now are sports, celebrity, news, wallpapers, videos, and ringtones.
  6. Other Best Practices for Mobile
    • Reinforce mobile friendliness in ads – “mobile optimized”, “4 UR phone”, “mobile ready” – this indicates to the viewer that the site will work on their phone with the best performance.
    • Have a URL that indicates mobile friendliness – it’s another signal to the consumer that you’ve made the experience usable on mobile devices.
    • Build landing pages specifically for mobile. Test landing pages for mobile separately.
    • Simplify the experience for mobile – you have to really get the content down to an essence – we saw an example of Home Depot versus Best Buy. Home Depot’s cluttered page really made the experience unusable.

New Tools to Explore

Of course I’m going to have a section on this, toolhound that I am.  Just a list for you to explore:

  • usertesting.com. Quick testing of your user experience for landing pages.  Cool idea: run this on your competitors’ sites as well – $29/user.  I’ve used this, btw – it has drawbacks, but for testing landing pages it is very appropriate.
  • crossbrowsertesting.com. See how your page looks in various browsers easily.
  • attentionwizard.com. A free tool that shows where people are looking on your page.
  • clicktale.com.  Records sessions and gives feedback/analytics on experience.  Free for one domain and 2-page playback.
  • crazyegg.com. Simple and affordable heat mapping tools that allow you to visually understand user behavior.
  • Lynx or seobrowser.com. Dedicated browsers that allow you to see your pages the way the search engine crawlers see them.
  • Charles. An HTTP proxy / HTTP monitor / reverse proxy that enables a developer to view all of the HTTP and SSL/HTTPS traffic between their machine and the Internet.
  • Wave toolbar. Provides a mechanism for running WAVE reports directly within Firefox for debugging site issues.
  • GSiteCrawler. Another free sitemap generator for Google, Yahoo and Bing.

That’s it for today. Back to the day job.


Technical SEO: Analyzing Site Loading Times

Well, it’s after Thanksgiving and I finally get back to the blog. Feels good. This is the next installment about site performance analysis and how to deal with a site with worrisomely slow page loading times. It turns out I had a case study right under my nose: this site, the OnlineMatters blog. Recently, I showed a client my site and watched as it loaded, and loaded and….loaded. I was embarrassed but also frustrated. I had just finished my pieces on site performance and knew that this behavior was going to cause my rankings in the SERPs to drop, even before Google releases Caffeine. While I am not trying to publish this blog to a mass audience – to do that I would need to write every day – I still want to rank well on keywords I care about. Given what I do, it’s an important proof point for customers and prospects.

So I am going to take advantage of this negative and turn it into a positive for you. You will see how to perform various elements of site analysis by watching me debug this blog in near real time. Yesterday, I spent three hours working through the issues, and I am not done yet. So this first piece will take us about halfway there. But even now you can learn a lot from my struggles.

The first step was to find out just how bad the problem was. The way to do this is to use Pingdom’s Full Page Analysis tool. This tool not only tests page loading speeds but also visualizes which parts of the page are causing problems. An explanation of how to use the tool can be found here, and you should read it before trying to interpret the results for your site. Here is what I got back when I ran the test:

aboutonlinematters site performance test results in pingdom

A load time of 11.9 seconds? Ouch!  Since Pingdom runs this on their servers, the speed is not influenced by my sometimes unpredictably slow Comcast connection.

Pingdom showed I had over 93 items loading with the home page, of which the vast majority were images (a partial listing is shown below).  There were several (lines 1, 36, 39, 40, 41, 54) where a significant part of the load time was occurring during rendering (that is, after the element had been downloaded into the browser).  This is indicated by the blue part of the bar.  But in the majority of cases, the load time was mainly caused either by the time from the first request to the server until the content began downloading (the yellow portion of the bar), or by the time from downloading until rendering began (the green portion).  This suggested that

  1. I had too big a page, because the download time for all the content to the browser was very long.
  2. I might have a server bandwidth problem.

But rather than worrying about item 2, which would require a more extensive fix – either an upgrade in service or a new host – I decided to see how far I could get with some simple site fixes.

pingdom details for www.aboutonlinematters site performance analysis

 

The first obvious thing to fix was the size of the home page, which was 485 KB – very heavy.  I tend to write long (no kidding?) and add several images to posts, so it seemed only natural to reduce the number of entries on my home page below the 10 I had it set for.  I set the allowable number in WordPress to 5 entries, saved the changes, and ran the test again. 

Miracle of miracles: My page now weighed 133 KB (respectable), had 72 total objects, and downloaded in six seconds.  That was a reduction in load time by almost 50% for one simple switch.  

improved site performance for aboutonlinematters.com

Good, but not great.  My page loading time was still 6 seconds – it needed to be below 2.  So more work was needed.

If you look at the picture above, you can just make out that some of the slowest loading files – between 4 and 6 of them – were .css or JavaScript files.  Since these are files that are part of WordPress, I chose to let them go for the moment and move on to the next obvious class of files: images.  Since images usually represent 80% of page loading times, this was the next obvious place to look.  There were between 6 and 10 files, mainly .png files, that were adding substantially to download times.  Most of these were a core portion of the template I was using (e.g. header.png).  So they affected the whole site and, more importantly, they had been part of the blog before I ever made one entry.  The others were the icons in my Add-to-Any toolbar, which also showed on every post on the site.

Since I developed the template myself using Artisteer when I was relatively new to WordPress, I hypothesized that an image compression tool might make a substantial improvement for little effort. 

Fortunately, the ySlow Firefox plugin, which is a site performance analyzer we will examine in my next entry, contains Smush.it, an image compression tool created by Yahoo! that is easy to use, shows just how much bandwidth it saves, produces all the compressed files at the push of a single button, and delivers excellent output quality.

So I ran the tool (I sadly did not keep a screenshot of the first run, but a sample output is below). Smush.it reduced image sizes overall by about 8% and significantly compressed the template elements. I then downloaded the smushed images and uploaded them to the site.

image compression output for site performance from smushit

 

As you can see below, my home page was now 89.8 KB, but my load time had increased to 8.8 seconds! Note on the right of the image that several prior runs confirmed the earlier 6-second load time. So either compression did not help or some other factor was at play.

The fact is the actual rendering times had basically dropped from measurable amounts (e.g. 0.5 seconds) to milliseconds – so the smaller file sizes had improved rendering performance.  Download times had increased, once again pointing to my host.  But before going there, I wanted to see if there were any other elements on the site I could manipulate to improve performance.

More in the next post.  BTW, as I go to press this morning, my site speed was 5.1 seconds – a new, positive record.  Nothing has changed, so more and more I’m suspecting my ISP and feeling I need a virtual private server.

NOTE: Even more important – as I go to press, Google has just announced that it is adding a site performance tool to Google Webmaster Tools in anticipation of site performance becoming a ranking factor.

 


Search Engines: Social Media, Author Rank and SEO

In my previous discussions of social media, channel architectures, and branding, I mentioned that I am manic about locking down my online brand (onlinematters) because there seems to be some relationship in the universal search engines between the number of posts and the number of sites I post from under a specific username, and how my posts rank.  It is as if some measure of trust is given to an author the more he publishes from different sites and the more people see, read, and link to what he has written.  I am not talking about authority given to the actual content written by the author – that is the core of search.  I am talking instead about using the author's behavior and success as a content producer to change where his content ranks for any given search result on a specific search term.  It is similar, in many ways, to what happened in the Vincent release, where brand became a more important ranking factor.  In this case, the author and the brand are synonymous, and when the brand is highly valued, those results would, under my hypothesis, be given an extra boost in the rankings.

This was an instinct call, and while I believed I had data to support the theory, I had no research to prove that perhaps an underlying algorithm had been considered/created to measure this phenomenon in universal search. 

I thus considered myself twice lucky while doing my weekly reading on the latest patents to find one that indicates someone is thinking about the issue of "author rank."  On October 29th, Jaya Kawale and Aditya Pal of Yahoo!  applied for a patent with the name "Method and Apparatus for Rating User Generated Content in Search Results."  The abstract reads as follows:

Generally, a method and apparatus provides for rating user generated content (UGC) with respect to search engine results. The method and apparatus includes recognizing a UGC data field collected from a web document located at a web location. The method and apparatus calculates: a document goodness factor for the web document; an author rank for an author of the UGC data field; and a location rank for web location. The method and apparatus thereby generates a rating factor for the UGC field based on the document goodness factor, the author rank and the location rank. The method and apparatus also outputs a search result that includes the UGC data field positioned in the search results based on the rating factor.

Let's see if we can't put this into English comprehensible to the common search geek.  Kawale and Pal want to collect data on three specific ranking factors and combine these into a single, weighted ranking factor that is then used to influence rank ordering of what they term "User Generated Content" or UGC.  The authors note that the typical ranking factors in search engines today are not suitable for ranking UGC.  UGC items are fairly short, they generally do not have links to or from them (rendering back-link-based analysis unhelpful), and spelling mistakes are quite common.  Thus a new set of factors is needed to adequately index and rank UGC.

The first issue the patent/algorithm has to deal with is defining what the term UGC includes.  The patent specifically mentions "blogs, groups, public mailing lists, Q & A services, product reviews, message boards, forums and podcasts, among other types of content." The patent does not specifically mention social media sites, but those are clearly implied. 

The second issue is to determine what sites should be scoured for UGC.  UGC sites are not always easy to identify.  An example would be a directory in which people rank references based on 5-star rating, where that is the only user input.  Is this site easy to identify as a site with UGC?  Not really, but somehow the search engine must make a decision whether this site is within its valid universe.  Clearly, some mechanism for categorizing sites with UGC needs to exist and while Kawale and Pal use the example of blog search as covering a limited universe of sites, their patent does not give any indication of how sites are to be chosen for inclusion in the crawl process.

Now we come to the ranking factors.  The three specific ranking factors proposed by Kawale and Pal are:

  • Document Goodness.  The Document Goodness Factor is based on at least one (and possibly more) of the following attributes of the document itself: a user rating; a frequency of posts before and after the document is posted; a document's contextual affinity with a parent document; a page click/view number for the document; assets in the document; document length; length of a thread in which the document lies; and goodness of a child document. 
  • Author Rank.  The Author Rank is a measure of the author's authority in the social media realm on a subject, and is based on one or more of the following attributes: a number of relevant posted messages; a number of irrelevant posted messages; a total number of root documents posted by the author within a prescribed time period; a total number of replies or comments made by the author; and a number of groups to which the author is a member.
  • Location Rank.  Location Rank is a measure of the authority of the site in the social media realm.  It can be based on one or more of the following attributes: an activity rate in the web location; a number of unique users in the web location; an average document goodness factor of documents in the web location; an average author rank of users in the web location; and an external rank of the web location.

These ranking factors are not used directly as calculated.  They are "normalized" for elements like document length and then combined in some mechanism to create a single UGC ranking factor. 
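The patent does not spell out how the three scores get combined, but conceptually it amounts to something like a weighted blend of the normalized values. As a purely illustrative sketch (the weights are my own placeholders, not anything Kawale and Pal disclose):

RatingFactor = w1 × DocumentGoodness + w2 × AuthorRank + w3 × LocationRank

where w1, w2, and w3 reflect how heavily the engine wants to lean on each signal.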

The main thing to note – and the item that caught my attention, obviously – is Author Rank.  Note that it has ranking factors that correspond with what I have been hypothesizing exist in the universal search engines.  That is to say, search results are not ranked only by the content on the page, but by the authority of the author who has written them, as determined by how many posts that author has made, how many sites he has made them on, how many groups he or she belongs to, and so on.

Can I say for certain that any algorithm like this has been implemented?  Absolutely not.  But my next task has to be to design an experiment to see if we can detect a whiff of it in the ether.  I'll keep you informed.
