
Archive for the ‘Site Architecture’ Category

Technical SEO: Site Loading Times and SEO Rankings Part 2

In my last post, I discussed the underlying issues regarding site loading times and SEO rankings.  What I tried to do was help the reader understand why site loading times are important from the perspective of someone designing a search engine that has to crawl billions of pages.  The post also outlines a few of the structures that they would have to put in place to accurately and effectively crawl all the pages they need in a limited time with limited processing power.  I also tried to show that a search engine like Google has a political and economic agenda in ensuring fast sites, not just a technical agenda.  Google wants as many people/eyeballs on the web as possible, so it is to their advantage to ensure that web sites provide a good user experience.  As a result, they feel quite justified in penalizing sites that do not have good speed/performance characteristics.

As you would expect, the conclusion is that if your site is hugely slow you will not get indexed and will not rank in the SERPs.  What is “hugely slow”?  Google has indicated that slow is a relative notion, determined by the loading times typical of sites in your geographical region.  Having said that, relative or not, from an SEO perspective I wouldn’t want a site where pages take more than 10 seconds on average to load.  We have found from the sites we have tested and built that average load times higher than approximately 10 seconds to completely load a page have a significant impact on being indexed.  From a UE perspective, there is some interesting data that the limit on visitors’ patience is about 6-8 seconds.  Google has studied this data, so it would probably prefer to set its threshold in that region.  But I doubt it can.  Many small sites are not that sophisticated, do not know these kinds of rules, and do not know how to check or evaluate their site loading times.  Besides this, there are often problems with hosts that cause servers to run slowly at times.  Google has to take that into account as well.  So I believe the timeout has to be substantially higher than 6-8 seconds, but 10 seconds as a crawl limit is my best guess.

I have yet to see a definitive statement by anyone as to what the absolute limit is for site speed before indexing ceases altogether (if you have a reference, please post it in the comments).  I’m sure that if a bot comes to the first page and its load time exceeds the bot’s timeout threshold, your site won’t get spidered at all.  But once the bot gets past the first page, it has to do an ongoing computation of average page loading times for the site to determine whether the average exceeds the built-in threshold, so at least a few pages would have to be crawled in that case.

Now here’s where it gets interesting.  What happens between fast (let’s say load times under 1-2 seconds, which is actually pretty slow but is a number Matt Cutts, in the video below, indicates is OK) and the timeout limit?  And how important is site speed as a ranking signal?  Let’s answer one question at a time.

When a site is slow but not slow enough to hit any built-in timeout limits (not tied to the number of pages), a couple of things can happen.  We do know that Google allocates bot time by the number of pages on the site and the number of pages it has to index or re-index.  So for a small site that performs poorly, it is likely that most of the pages will get indexed.  Likely, but not guaranteed.  It all depends on how much cumulative time lag the site creates relative to that average.  If a site is large, then you can almost guarantee that some pages will not be indexed, as the cumulative time lag will ultimately hit the threshold the bots set for a site with that number of pages.  By definition, some of your content will not get indexed, and you will not get the benefit of that content in your rankings.

As an aside, there has been a lot of confusion around the <meta name="revisit-after"> tag.  The revisit-after meta tag takes this form: <meta name="revisit-after" content="5 days">.
This tag supposedly tells the bots how often to come back to the site to reindex this specific page (in this case 5 days).  The idea is that you can improve the crawlability of your site by telling the bots not to index certain pages all the time, but only some of the time.  I became aware of this tag at SMX East, when one of the “authorities” on SEO mentioned it as usable for this purpose.  The trouble is that, from everything I have read, the tag is completely unsupported by any of the major engines, and was only supported by one tiny search engine (SearchBC)  many years ago. 

But let’s say you are one of the lucky sites that runs slowly but still gets all of its pages indexed.  Do Google or any of the other major search engines use the site’s performance as a ranking signal?  In other words, all my pages are in the index, so you would expect them to be ranked based on the quality of their content and the authority derived from inbound links, site visits, time-on-site, and other typical ranking signals.  Performance, you would think, is not a likely candidate for a ranking signal and isn’t important.

If you thought that, then you were wrong.  Historically, Google has said, and Matt Cutts reiterates this in the video below, that site load times do not influence search rankings.  But while that may be true now, it may not be in the near future.  And this is where the comments from Maile Ohye of Google took me by surprise.  In a small group session at SMX East 2009, Maile was asked about site performance and rankings.  She indicated that for the “middle ground” sites that are indexing but loading slowly, site performance may already be used to influence rankings.  Who is right, I can’t say.  These are both highly respected professionals who choose their words carefully.

[Embedded video: Matt Cutts discusses site speed and search rankings]
Whatever is true, Google is sending us signals that this change is coming.  Senior experts like Matt and Maile don’t say these things lightly.  Their statements are well considered and probably approved positions that they are asked to take.  This is Google’s way of preventing us from getting mad when the change occurs; Google has the fallback of saying “we warned you this could happen.”  Which, from today’s viewpoint, means it will happen.

Conclusion: Start working on your site performance now, as it will be important for SEO rankings later. 

Oh and, by the way, your user experience will just happen to be better, which is clearly the real reason to fix site performance. 

And it isn’t only Google that may make this change.  Engineers from Yahoo! recently filed a patent with the title “Web Document User Experience Characterization Methods and Systems” which bears on this topic.  Let me quote paragraph 21:

With so many websites and web pages being available and with varying hardware and software configurations, it may be beneficial to identify which web documents may lead to a desired user experience and which may not lead to a desired user experience. By way of example but not limitation, in certain situations it may be beneficial to determine (e.g., classify, rank, characterize) which web documents may not meet performance or other user experience expectations if selected by the user. Such performance may, for example, be affected by server, network, client, file, and/or like processes and/or the software, firmware, and/or hardware resources associated therewith. Once web documents are identified in this manner the resulting user experience information may, for example, be considered when generating the search results.

It does not appear that Yahoo! has implemented any aspect of this patent yet, and who knows what the Bing agreement will mean for site performance and search.  But clearly this is a “problem” that the search engine muftis have set their sights on, and I would expect that if Google does implement it, others will follow.


.htaccess Grammar Tutorial – .htaccess Special Characters

One thing this blog promises is to provide information about anything online that someone coming new to the business of online marketing needs to know.  The whole point being: my pain is your gain.  Well, I have had some real pain lately around .htaccess file rewrite rules, and I wanted to ease that pain for others with a .htaccess grammar tutorial for beginners.

What is a .htaccess file and Why Do I Care?

A .htaccess file is a type of configuration file for Apache servers only; if you are working with Microsoft IIS, this tutorial does not apply.  There are several ways an Apache web server can be configured.  Webmasters who have write access to the Apache directories can edit the main server configuration files (especially httpd.conf) directly, which is preferable in many cases because those files allow for more powerful command structures and are processed faster than .htaccess files.

Why you and I care about .htaccess files is that many of us run in a hosted environment where we do not have access to Apache directories.  In many cases we run on a shared server with other websites.  In these cases, the only way to control the configuration of the Apache web server is to use a .htaccess file.
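
For context, whether Apache even honors a .htaccess file is decided in that main configuration.  Here is a minimal sketch of the httpd.conf side, assuming a hypothetical document root of /var/www/example:

<Directory "/var/www/example">
    # Allow .htaccess files in this directory tree to override server settings
    AllowOverride All
</Directory>

On shared hosting your provider normally sets this for you, which is exactly why the .htaccess file becomes your configuration tool of choice.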

To apply to your whole site, the .htaccess file should be placed in the root directory of the site; a .htaccess file placed in a subdirectory applies only to that directory and everything below it.

Why would I want to control the configuration of the Apache server?  Well, the most likely scenario is that you have moved pages, deleted pages, or renamed pages and you don’t want to lose the authority they have gained with the search engines, which gives you good placement in the SERPs.  You do this through what are called redirects, which tell the server that if someone requests a specific URL like http://www.onlinematters.com/oldpage.htm, it should automatically map that request to an existing URL such as http://www.onlinematters.com/seo.htm.  Another common reason to have a .htaccess file is to provide a redirect to a custom error page when someone types in a bad URL.
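
Here is a minimal sketch of what that might look like in a .htaccess file, using the example URLs above and a hypothetical /404.htm error page (the exact rules for your site will differ):

# Permanently (301) redirect the old page to its replacement
Redirect 301 /oldpage.htm http://www.onlinematters.com/seo.htm

# Serve a custom error page when a visitor requests a URL that does not exist
ErrorDocument 404 /404.htm

The first line uses mod_alias’s Redirect directive; more complex rewrites use mod_rewrite, whose special characters are the subject of the rest of this post.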

.htaccess Files are Touchy

.htaccess files are very powerful and, like most computer communications, are very exacting in the grammar they use to communicate with the Apache server.  The slightest syntax error (like a missing space) can result in severe server malfunction.  Thus it is crucial to make backup copies of everything related to your site (including any original .htaccess files) before working with your .htaccess.  It is also important to check your entire website thoroughly after making any changes.  If any errors or other problems are encountered, restore the original configuration from your backups immediately while you test your .htaccess files.

Is There a Place I Can Check the Grammar of My .htaccess File?

I asked this question at SMX West 2009 at a panel on Apache server configuration and 301 redirects (301 Redirect, How Do I Love You? Let Me Count The Ways).  The speakers were Alex Bennert, In House SEO, Wall Street Journal; Jordan Kasteler, Co-Founder, SearchandSocial.com; Carolyn Shelby from CShel; Stephan Spencer, Founder & President, Netconcepts; and Jonah Stein, Founder, ItsTheROI.  These are all serious SEO players – so they would know if anyone would.  When the question got asked, they all looked puzzled and then said "I just test it live on my staging server."  I have spent hours looking for a .htaccess grammar checker and have yet to find anything with any real horsepower.  So seemingly the only options for checking your .htaccess grammar are to test it on your staging or live server, or to find a friend or Apache guru who can review what you have done.

Basic .htaccess Character Set

We’re going to start this overview of .htaccess grammar with a review of the core character definitions (which was probably the hardest documentation I’ve had to find; you’d think everyone would start with "the letters" of the alphabet, but believe it or not, they don’t).  In the next post, we will construct basic statements with these character sets so you can see them in action – and there is a short preview example after the reference list below.  After that, we’ll move into multipage commands.

#
the # instructs the server to ignore the line. Used for comments. Each comment line requires its own #. It is good practice to use only letters, numbers, dashes, and underscores in comments, as this will help avoid potential server parsing errors.
 
[C]
Chain: instructs the server to chain the current rule with the following rule (if the current rule does not match, the chained rule is skipped).
 
[E=variable:value]
Environmental Variable: instructs the server to set the environmental variable "variable" to "value".
 
[F]
Forbidden: instructs the server to return a 403 Forbidden to the client. 
 
[G]
Gone: instructs the server to deliver a 410 Gone (no longer exists) status message.
 
[L]
Last rule: instructs the server to stop processing further rewrite rules once the current rule has been applied.
 
[N]
Next: instructs Apache to rerun the rewriting process from the first rule, using the rewritten URL as its starting point.
 
[NC]
No Case: defines any associated argument as case-insensitive. i.e., "NC" = "No Case".
 
[NE]
No Escape: instructs the server to parse output without escaping characters.
 
[NS]
No Subrequest: instructs the server to skip the rule if the current request is an internal sub-request.
 
[OR]
Or: specifies a logical "or" that ties two expressions together such that either one proving true will cause the associated rule to be applied.
 
[P]
Proxy: instructs the server to handle the request via mod_proxy.
 
[PT]
Pass Through: instructs mod_rewrite to pass the rewritten URL back to Apache for further processing.  
 
[QSA]
Append Query String: instructs the server to append the original query string to the rewritten URL rather than replacing it.
 
[R]
Redirect: instructs Apache to issue a redirect, causing the browser to request the rewritten/modified URL.
 
[S=x]
Skip: instructs the server to skip the next "x" number of rules if a match is detected.
 
[T=MIME-type]
Mime Type: declares the mime type of the target resource.
 
[]
specifies a character class, in which any character within the brackets will be a match. e.g., [xyz] will match either an x, y, or z.
 
[]+
character class in which any combination of items within the brackets will be a match. e.g., [xyz]+ will match any number of x’s, y’s, z’s, or any combination of these characters.
 
[^]
specifies not within a character class. e.g., [^xyz] will match any character that is neither x, y, nor z.
 
[a-z]
a dash (-) between two characters within a character class ([]) denotes the range of characters between them. e.g., [a-zA-Z] matches all lowercase and uppercase letters from a to z.
 
a{n}
specifies an exact number, n, of the preceding character. e.g., x{3} matches exactly three x’s.
 
a{n,}
specifies n or more of the preceding character. e.g., x{3,} matches three or more x’s.
 
a{n,m}
specifies a range of numbers, between n and m, of the preceding character. e.g., x{3,7} matches three, four, five, six, or seven x’s.
 
()
used to group characters together, thereby considering them as a single unit. e.g., (perishable)?press will match press, with or without the perishable prefix.
 
^
denotes the beginning of a regex (regex = regular expression) test string. i.e., the test string must begin with the character(s) that follow.
 
$
denotes the end of a regex (regex = regular expression) test string. i.e., the test string must end with the character(s) that precede it.
 
 ?
declares as optional the preceding character. e.g., monzas? will match monza or monzas, while mon(za)? will match either mon or monza. i.e., x? matches zero or one of x.
 
!
declares negation. e.g., “!string” matches everything except “string”.
 
.
a dot (or period) indicates any single arbitrary character.
 
-
used as the substitution in a rewrite rule, a dash instructs the server not to rewrite the URL, as in “...domain.com.* - [F]”.
 
+
matches one or more of the preceding character. e.g., G+ matches one or more G’s, while ".+" will match one or more characters of any kind.
 
*
matches zero or more of the preceding character. e.g., use “.*” as a wildcard.
 
|
declares a logical “or” operator. for example, (x|y) matches x or y.
 
\
escapes special characters ( ^ $ ! . * | ). e.g., use “\.” to indicate/escape a literal dot.
 
\.
indicates a literal dot (escaped).
 
/*
zero or more slashes.
 
.*
zero or more arbitrary characters.
 
^$
defines an empty string.
 
^.*$
the standard pattern for matching everything.
 
[^/.]
defines one character that is neither a slash nor a dot.
 
[^/.]+
defines any number of characters which contains neither slash nor dot.
 
http://
this is a literal statement — in this case, the literal character string, “http://”.
 
^domain.*
defines a string that begins with the term “domain”, which may then be followed by any number of any characters.
 
^domain\.com$
defines the exact string “domain.com”.
 
-d
tests if string is an existing directory
 
-f
tests if string is an existing file
 
-s
tests if the string is an existing file with non-zero size
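
To preview how these pieces fit together (the full walk-through comes in the next post), here is a small illustrative sketch that forces all requests onto the www hostname; the domain is hypothetical and the example assumes mod_rewrite is available:

Options +FollowSymLinks
RewriteEngine On

# If the requested host is the bare domain (case-insensitive)...
RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
# ...redirect the request, with its path captured by (.*),
# to the same path on www.domain.com and stop processing rules
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]

Note how it combines the anchors ^ and $, the escaped dot \., the grouping (.*) with its back-reference $1, and the [NC], [R], and [L] flags defined above ([R=301] simply tells [R] which redirection header code to send – see the list below).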

 

Redirection Header Codes

  • 301 – Moved Permanently
  • 302 – Moved Temporarily
  • 403 – Forbidden
  • 404 – Not Found
  • 410 – Gone
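
As a quick illustration of how these codes pair with the flags above, here is a sketch with hypothetical paths (again assuming mod_rewrite):

RewriteEngine On

# Return 410 Gone for a catalog section that has been retired for good
RewriteRule ^old-catalog/ - [G,L]

# Return 403 Forbidden for a directory that should never be served publicly
RewriteRule ^private/ - [F,L]

The dash means “do not rewrite the URL” (see the character list above); the [G] and [F] flags supply the 410 and 403 status codes respectively.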
