About Online Matters

Archive for the ‘SEO’ Category

All Atwitter About Algorithms

A slight detour in our discussion of methods of geolocation in mobile for a comment about algorithms.

Many, if not most, companies with any intelligent automation talk about their algorithms.  There’s an algorithm for optimizing pricing, an algorithm for selecting a target audience, an algorithm for beating the casino at blackjack, etc., etc.  You get the point.  Marketers almost always want to include the term in their collateral.  Why? Because the implication of the word “algorithm” today is that algorithms are hard, manipulate huge amounts of data, and require a lot of complex math, and thus by inference they make the company appear both smarter than the average bear and the owner of unique intellectual property that makes their products or services better than the next guy’s.

On my teams, the word algorithm is verboten in describing what we do.  Algorithms are tools – nothing more than step-by-step procedures for calculations.  And yes, for the technologists in the crowd, I am aware the definition can be a tad more precise.  But that is just the point – the word algorithm has been used so much and applied to such a wide range of situations (e.g., “Mark had an algorithm by which he determined which route to take to the office during rush hour”) that it has become effectively meaningless.

On my teams, we use the word model because what we do is model human behavior.  We look at data to understand how people act, what they value, what they believe.  We then hypothesize what that data means in terms of the motivations and internal beliefs/processes that lead to those behaviors.  Basically, we are data-driven virtual psychologists trying to understand what is going on in the ‘black box’ of the human mind based on what we can see – the inputs into the box and the outputs from it.

The hardest part of our job is not the math or the calculation process, but asking the right questions.  As I gain more experience in this arena, I find this is where most data scientists miss the mark.  They are so caught up in the math they forget about (or don’t understand) the real issue.  After all, guys (and it is mainly guys) who have a highly dominant left brain don’t really grok the emotion that their work is trying to uncover.  This is especially true – and this is not a sexist comment – when we are talking about the emotions of women shoppers.  And at the end of the day, it is comprehending the sentiments of a human being that we really want to understand.

For example, we see that two people go to a Starbucks every day and both drink three cups of coffee.  However, one person goes repeatedly to the same Starbucks, while the other goes to numerous ones around their city throughout the day.  What would cause that difference?  Hypothesis: one is a stay-at-home mom/worker who takes a run/walk every morning and stops for coffee; the other is a service professional moving between customer offices.  Or maybe the second person is a pizza delivery person.  Each of these hypotheses is tested with various calculations against the data and is either validated or not.  At that point we have a guess at who they might be.
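To make that concrete, here is a toy sketch (in Python, with entirely made-up visit data and an arbitrary threshold – nothing from a real system) of how one of those hypotheses might be checked against the data:

# Toy illustration only: invented visit logs, not a real model.
# Hypothesis: a "roaming" coffee drinker (service professional, delivery
# driver) visits many distinct stores; a "routine" drinker sticks to one.

visits = {
    "customer_a": ["store_12", "store_12", "store_12", "store_12"],
    "customer_b": ["store_03", "store_41", "store_07", "store_22"],
}

def classify(store_visits, distinct_threshold=2):
    # Label a customer "roaming" if they visit more distinct stores
    # than the threshold, otherwise "routine".
    return "roaming" if len(set(store_visits)) > distinct_threshold else "routine"

for customer, stores in visits.items():
    print(customer, classify(stores))  # customer_a routine, customer_b roaming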

Now we can look at other data and start making predictions about their attitudes and values.  Let’s say the person has been identified as a stay-at-home mom.  What does that tell us about them?  Well, we might guess they love being a parent enough to sacrifice some part of their career to have time with their kids.  Alternately, they might be driven by the fact that their spouse makes more money, so they are the member of the couple who has to make a career sacrifice for the financial welfare of the family.  Which means that they are rational, but also willing (with regrets) to subsume their own needs to those of others.  Either way, they might be frustrated with having to stay at home, and be responsive to an offer from a brand that shows it recognizes their frustration and offers something uniquely for them and not the other members of the family.  A spa day, for example. Or at least time to create their own relaxation time at home – because they can’t go to a spa and leave the kids at home alone (because…they can’t afford a nanny?).

So now what?  I create a model that tries to capture and predict the behavior of someone with those attitudes and values.  Yes, there is math underneath it.  Yes, there is a step-by-step procedure – an algorithm – which runs underneath it.  But I couldn’t care less about the math – that’s a tool.  The important thing is to focus on how we think the black box of the human soul is working.

Using this model, I would predict a certain type of response to an ad that reflects these values, based on prior response rates to similar ads – maybe not targeted to exactly the same psychology, but let’s say similar (without defining what ‘similar’ means in this case).  Now we run the ad and see what happens.  If the response rate exceeds my projected threshold, I will assume that my model of what is happening in the black box is right; if not, we go back to the drawing board.

So have you seen the latest Swiffer ads?  Swiffer’s value proposition: spend less time housecleaning and we give you more time for yourself.  In the ad, mom uses that gift of a home spa day that’s been sitting on the shelf.  When her kids come into the bathroom looking for her, she turns and has a cucumber mask in process.  Kids scream in fright and run out.  Mom is not happy to scare her kids, but in some ways smug because her needs came before the kids’ for once.  I would bet this ad is successful at engaging the audience just described because it appeals to their sentiments.

Let’s be clear, though.  Success does not mean I really know what is happening in the black box – how the gears are arranged, what causes them to move, how fast they move.  It’s just that whatever model I have created parallels the way the mechanics of the black box of a group of people work, so I assume I have got the model right.  But later data may prove me wrong and, with further modeling and using better algorithms as tools, I may get better and better at paralleling the real psychology.  But this is working at a very high level on a group of people.  I can never really know what is going on in the black box of any individual’s mind, and even within the group, it varies from person to person.

The algorithm is not the model.  It is a tool we use to build a model.  Nothing more; nothing less.  That’s why the term is verboten in my groups.  Our focus must always be on the person, not the tool, or else we lose sight of our customers and can only see as far as our computer screens.

 


A Primer on Geolocation and Location-Based Services: Geolocation from IP Address

 

We now take a slight turn from more exact geolocation technologies to one that is more basic – IP address-based geolocation.  IP address-based geolocation (IPG) was the earliest online geolocation technique and has been around since 1999.  It determines a user’s geographic latitude, longitude and, by inference, city, region and nation by comparing the user’s public Internet IP address with known locations of other electronically neighboring servers and routers.  While IPG is not specific to mobile, it is used in the geolocation of mobile devices by the more complex algorithms.  It is thus worth taking time to understand what it is, how it works, and how accurate it is.

What we will find is that, as a stand-alone technology, IPG cannot locate a device with any reasonable degree of accuracy.  Moreover, there is no magic to linking an IP address to a location – it must come from some type of third-party service that has manually (or semi-manually) mapped IP addresses to geolocations.  Even then, without the help of your ISP in providing more information about a device, the best you can do is the location of your ISP’s host server. However, in a later entry, we will discover that, when combined with other forms of geolocation, the IP address can be used as an extra signal to confirm location.

Overview

Every device connected to the public Internet is assigned a unique number known as an Internet Protocol (IP) address. IP addresses consist of four numbers separated by periods (also called a ‘dotted quad’) and look something like 192.168.0.1.

Since these numbers are usually assigned to Internet service providers within region-based blocks, an IP address can often be used to identify the region or country from which a computer is connecting to the Internet. An IP address can sometimes be used to show the user’s general location.  At one time ISPs issued one IP address to each user. These are called static IP addresses. Because Internet usage exploded far beyond what was envisioned in the early design of the IP standard (known as IPv4) and the number of IP addresses is limited,  ISPs moved toward allocating IP addresses in a dynamic fashion out of a pool of IP addresses using a technology called Dynamic Host Configuration Protocol or DHCP.  This dynamic allocation makes physically locating a device using an IP address tougher.
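To see the block-based idea in code, here is a minimal sketch using Python’s standard ipaddress module.  The CIDR-to-region table is invented purely for illustration (it uses the RFC 5737 documentation ranges); real services maintain large databases of such allocated blocks:

import ipaddress

# Invented mapping for illustration; real providers maintain databases of
# allocated blocks and the regions they were assigned to.
REGION_BLOCKS = {
    "192.0.2.0/24": "Example Region A",
    "198.51.100.0/24": "Example Region B",
}

def lookup_region(ip_string):
    ip = ipaddress.ip_address(ip_string)
    for cidr, region in REGION_BLOCKS.items():
        if ip in ipaddress.ip_network(cidr):
            return region
    return "unknown"

print(lookup_region("198.51.100.17"))  # -> Example Region B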

As we move forward in this discussion, an example will help us understand what is required to convert an IP address to a physical location.  Below are two different services that provide geolocation information from an IP address.

example of how two different services - Google and whatismyip - use IP address to determine a location

 

The first service is Google search.  If you type “what is my ip address” into the Google search box, a set of results is returned.  The IP address of the device from which the search was made appears at the top of those results.  On the left-hand side, Google shows it has auto-detected my location in Carmel Valley Village (actually, I am about a mile away and a few hundred feet above Carmel Valley Village).

The second service is WhatIsMyIPAddress.com, which is the first organic listing in the result set returned from the “what is my ip address” search.  In this case, the service shows “my device” as being in Salinas, California, about 20 miles away as the crow flies.  Or actually it doesn’t show my device as being in Salinas.  It shows that my ISP is Comcast Cable and that my ISP is in Salinas.

Same query.  Two different services.  Two very different results.  The reason for the difference is that Google is using multiple sources to geolocate my device (IP address, wifi-based latitude and longitude) whereas WhatIsMyIPAddress is only using DNS-based information including traceroute mapping for this particular page.  Once I approve the use of geolocation services, whatismyipaddress yields similar results to Google because it also triangulates across multiple sources.

The rest of this post will delve into why this has occurred, which involves understanding the technology used to perform IP-based address geolocation.

It’s a Ping Thing

To understand what is going on, we have to start with the most raw form of the technology underlying IP addresses, which is the TCP/IP model itself.  Our most basic tool for determining an IP address, or a host from an IP address, is the ping command.  We won’t go into how ping operates in detail here, but you can find a great overview of this at the GalaxyVisions website. However, by definition the ping command, which is an application and sits in the application layer of the TCP/IP model, reaches down directly into the internetworking layer of the TCP/IP model with an ICMP Echo Request message and receives back an IP address in an Echo Reply message.  Here is an example of what a ping looks like for 24.130.244.124:

 

What the image shows is that, through the echo request/reply, ping is able to retrieve the hostname associated with the particular IP address, which in this case is a server at comcast.net.  Note the IP address is supposedly the IP address of my computer in my house.  But it isn’t.  Instead, what is returned is the location of my ISP’s server to which my account is attached.
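If you want to do the same reverse lookup programmatically rather than through ping, a few lines of Python using the standard socket module will do it (the address below is the one from the ping example; your results will obviously differ):

import socket

ip = "24.130.244.124"  # the address from the ping example above
try:
    # Reverse DNS: map the IP back to a host name, as ping does.
    # Note this names the ISP's host, not the physical location of the device.
    hostname, aliases, addresses = socket.gethostbyaddr(ip)
    print(hostname)
except socket.herror:
    print("no reverse DNS record for", ip)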

From DNS Hostname to Location

So in the prior step, we were able to use Ping to get to a server/domain name.  The next step is to get from the server name to its location.  This is where it took me some time to understand the options available and how they work.  In this section I am going to discuss four:

  • DNS LOC Records
  • Whois Data
  • W3C Geolocation Services in Web browsers
  • Third-Party Service Providers that Map IP Address to a Physical Location

DNS LOC Records

In the Domain Name System there is an experimental standard called RFC 1876, which is named “A Means for Expressing Location Information in the Domain Name System.”  This standard defines a format for a location record (LOC record) that can be used to geolocate hosts, networks, and subnets using a DNS lookup.  You can read the standard to get all the details, but the format of the record looks like this:

Sample format of a DNS LOC record

The size is the diameter of the sphere surrounding the chosen host and the horiz pre and vert pre give the precision of the estimate.  Latitude, longitude and altitude are pretty obvious as to what they mean.
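If a zone has actually published a LOC record, you can query it yourself.  Here is a sketch that assumes the third-party dnspython package (pip install dnspython) and a domain of your choosing – most domains will simply return nothing:

import dns.resolver

def fetch_loc(domain):
    try:
        answer = dns.resolver.resolve(domain, "LOC")
        # str(rdata) returns the presentation form: latitude, longitude,
        # altitude, size, and the horizontal/vertical precision fields.
        return [str(rdata) for rdata in answer]
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return []

print(fetch_loc("example.com"))  # usually [] – few zones publish LOC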

DNS LOC has two problems.  First, it has only been defined for a few sites around the world.  “Defined” means that an ISP has manually created a LOC record for their hosts (they add the record to their DNS servers).  Second, it once again only gives the location of the host server – in this case hsd1.ca.comcast.net – not the location of my device.

So this doesn’t help us.

WhoIs

Anyone who has used the Internet extensively knows about the whois service.  This service describes the owner of a particular domain.  It is possible to gain some geolocation information about a domain from it – but that data is usually the headquarters of the owner of the domain and has almost no relation to the location of my domain host, much less my computer.  The example from Comcast:

Example of Comcast's whois entry

 

Note that there is no entry for my specific hosting server hsd1.ca.comcast.net (left image) and the information about the top-level domain shows it in Wilmington, DE (right image).  Not much help at all.
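If you want to poke at this yourself, a quick sketch that shells out to the standard whois command-line client (assuming it is installed on your machine) makes the point – the address-related lines describe the registrant, not your device:

import subprocess

def whois(domain):
    # Run the system whois client and capture its text output.
    result = subprocess.run(["whois", domain], capture_output=True, text=True)
    return result.stdout

record = whois("comcast.net")
for line in record.splitlines():
    # Pull out only the address-related lines to show how little they
    # say about any individual device.
    if any(key in line.lower() for key in ("city", "state", "country")):
        print(line.strip())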

W3C Geolocation Services in Web Browsers

The W3C geolocation specification describes an API that provides scripted access to geographical location information, such as latitude and longitude, associated with a specific device via the device’s browser.  Geolocation Services are agnostic of the underlying location information sources, which can include Global Positioning System (GPS) and location inferred from network signals such as IP address, RFID, WiFi and Bluetooth MAC addresses, and GSM/CDMA cell IDs, as well as user input.

The API allows developers to make both “one-shot” position requests and repeated position updates.  It also provides the ability to cache historic positions and use the data to add geolocation functionality to their applications. For the geeks in the room, and so newbies won’t be confused when they find more geolocation-based acronyms, Geolocation Services builds upon earlier work in the industry, including [AZALOC], [GEARSLOC], and [LOCATIONAWARE].

Note that Geolocation Services draws on third-party services – it does not do any geolocation itself from the device.  Thus, all W3C geolocation services – and this is not to minimize their value in developing online and mobile location-aware applications – are simply an aggregation tool to allow developers to draw on whatever third party sources are available to geolocate a device and feed that information into their applications.

Also note that there is nothing explicitly tying these services to the DNS record.  How do you make that connection?

Well, as the next section shows, W3C geolocation services can draw upon a service like hostip.info to get the geographic location of the host.

Third-Party Service Providers

At the end of the day, there is no “magic bullet” technology that links an IP address to a geolocation.  The only way this occurs is through a third-party service that has used numerous, usually labor-intensive and semi-manual, techniques to acquire a geolocation for an IP address.

A part of me wants to chat about Netgeo here – one of the earliest attempts to geolocate devices by their IP address.  However, Netgeo has not been maintained, and frankly I’ve covered pretty much everything it discusses in the prior sections.  But if you are interested in this bit of IP address-based geolocation history, click on the link.

Having used a number of these services, I can tell you that the majority do not do a particularly good job of geolocating a host server using an IP address, much less a specific device.  I’ll use hostip.info, as it is the most transparent.  hostip.info uses crowdsourcing to geolocate a host: developers and ISPs can enter the location of their servers into the hostip database.  The database is then freely available to anyone who wishes to use it.  Here is an example of how my location fared:

Example of hostip.info IP address-based geolocation

 

Tustin is several hundred miles to the south of my location. So as you can see, not very accurate at all.

 Circling Back to Our Example

So how do these two services handle geolocating my computer?  First, they are both using the W3C geolocation API.  What differs are the sources they use to identify a location.

Obviously Google is relatively accurate in this example, although I do not consider a one mile radius to be particularly useful for those of us who are trying to deliver fine-grained location-based services.  Google manages this through a combination of sources:

If you allow Google Chrome to share your location with a site, the browser will send local network information to Google Location Services to get an estimate of your location. The browser can then share your location with the requesting site. The local network information used by Google Location Services to estimate your location includes information about visible WiFi access points, including their signal strength; information about your local router; your computer’s IP address. The accuracy and coverage of Google Location Services will vary by location.

Google Chrome saves your location information so that it can be easily retrieved. This information is periodically updated; the frequency of updates depends on changes to your local network information.

I should add that when it comes to Android and Android Location Services, Google also uses GPS and Assisted GPS technologies for geolocation.

 Android Location Services periodically checks on your location using GPS, Cell-ID, and Wi-Fi to locate your device. When it does this, your Android phone will send back publicly broadcast Wi-Fi access points’  Service set identifier (SSID) and Media Access Control (MAC) data.

And this isn’t just how Google does it; it’s how everyone does it. It’s standard industry practice for location database vendors.

whatismyipaddress, on the other hand, is only using IP address-based geolocation from a third-party service.  This is a choice on their part because I haven’t opted in to use location-based services.  Once I do, I get the map below and my location by one source is correctly shown as Carmel Valley (it is interesting to note, as well, the different results depending on which third-party provider you use).  But this is because we are now triangulating from more than just IP address-based geolocation: whatismyipaddress.com is also using wifi-based geolocation and cell-tower triangulation via the W3C location services API to get a more accurate read.

Conclusion

Basically, after exploring IP address-based geolocation, the conclusion is that it is a non-starter for any application except those that can live with the broadest of geolocation estimates.

Next up: Assisted GPS.

 


Notes from First Day of SMX Advanced 2010

Back from SMX Advanced London, where I got a chance to speak on “SEO, Search, and Reputation Management,” and SMX Advanced 2010 in Seattle, where I got to relax and just take in the knowledge.

So here, for all who could not attend, is a summary of three of the sessions I attended on the first day of SMX Advanced 2010.  I only get so much time to blog…working guy, you know.  I’ll do my best to post the rest, but no promises.

SEO for Google versus Bing

Janet Miller, Searchmojo

  • From heatmap studies, it appears people “see” Bing and Google SERPs in pretty much the same way.  The “hotspots” are pretty similar.
  • Not surprising: average pages/visit and time on site are higher for Bing than Google – but that has always been true from my perspective
  • Bing does not currently accept video or news sitemaps.
  • On Google you can edit sitelinks in Webmaster tools, in Bing you cannot.
  • Geolocation results show pretty much the same in both sets of results.
  • One major difference:  Google shopping is free for ecommerce sites to submit; Bing only has a paid option for now.
  • Bing lets you share results (social sharing) on Facebook, Twitter, and email; Google does not.  But the sharing links point back to the images on Bing, not to the original images on your site.  You also have to grant Bing access on Facebook.
  • Bing allows “document preview” when you roll over the entry.  It will also play videos in preview mode – but only those on YouTube.  If you look at the behavior, information from the page shows up.  To optimize the presentation of that information, Bing takes information in this order:
    • H1 tag first – if title tag and h1 tag don’t match, it takes the H1 tag
    • First paragraphs of information
    • To add contact info, add that information to that page.  Bing is really good about recognizing contact information that is on a page.
      • Address
      • Phone
      • Email
      • To disable “document preview” enter the following
        • Add this meta tag to the page: <meta name="msnbot" content="nopreview">
        • Or send this HTTP response header: X-Robots-Tag: nopreview

      Rand Fishkin: Ranking Factor Correlations: Google versus Bing

      As usual, Rand brought his array of statistical knowledge to bear to compare how Bing and Google react to different ranking signals.  Here are the takeaways:

      Overall Summary of Correlations with Ranking, in Order of Importance

      Bing:
      1. Number of linking root domains
      2. An exact match of .com domain name with desired keyword
      3. Linking domains with an exact match in the TLD name
      4. Any exact match of the domain name with the desired keyword
      5. Number of inbound links

      Google:
      1. An exact match of .com domain name with desired keyword
      2. Linking domains with an exact match in the TLD name
      3. Number of linking root domains
      4. Any exact match of the domain name with the desired keyword
      5. Number of inbound links

      Domain Names as Ranking Factors

      • Exact match domains remain powerful ranking signals in both engines (anchor text could be a factor, too).
      • Hyphenated versions of domain names are less powerful, though when they do show, they appear more frequently (more times on a page) in Bing (G: 271 vs. B: 890).
      • Just having keywords in the domain name has substantial positive correlation with high rankings.
      • If you really want to rank on a keyword, make sure you get exactmatchname.com as the TLD.
      • Other exact match domains may still help, but don’t have as high correlation.
      • Keywords  in subdomains are not nearly as powerful as in root domain name (no surprise).
      • Bing may be rewarding subdomain keywords less than before (though G: 673 vs. B: 1394).
      • On alternate TLD extensions:
        • Bing appears to give substantially more weight to these than Google.
        • Matt Cutts’ claim that Google does not differentiate between .gov, .info and .edu appears accurate.
        • The .org TLD has a surprisingly high correlation with high rankings, but you can attribute this to elements of their authority – more links, more non-commercial links, less spam.
        • Don’t forget the exact match data – .com is still probably a very good thing (at least own it).
        • Shorter URLs are likely a good best practice (especially on Bing).
        • Long domains may not be ideal, but aren’t awful.

      On-Page Keyword Usage

      • Google rankings seem to be much more highly correlated with on-page keyword usage than for Bing.
      • The alt attribute of images shows significant correlation as an on-page ranking factor. (I always thought so and it’s one of the elements most SEO newbies miss.)
      • Putting keywords  in URLs is likely a best practice.
      • Everyone optimizes titles (G: 11,115 vs. B: 11,143).  Differentiating here is hard.
      • (Simplistic) on-page optimization isn’t a huge factor.
      • Raw content length (length of page and number of times the keyword is mentioned on the page) seems to have only a marginal correlation with rankings.

      Link Counts and Link Diversity

      • Links are likely still a major part of the algorithms, with Bing having a slightly higher correlation.
      • Bing may be slightly more naïve in their usage of link data than Google, but better than before.
      • Diversity of link sources remains more important than raw link quantity.
      • Many anchor text links from the same domain likely don’t add much value.
      • Anchor text links from diverse domains, however, appears highly correlated.
      • Bing seems more Google-like than in the past in handling exact match anchor links (this is a surprise!).

      Home Pages

      • Bing’s stereotype holds true: homepages are more favored in top results vs. Google.

      Twitter, Real-Time Search, and Real-Time SEO

      Steve Langville – Mint.com

      Steve had a lot of interesting points, and I thought his approach to real-time was one of the most sophisticated I had heard.

      1. One element of his strategy is what I like to call “Merchandising Real-Time Search.”  Basically, someone at Mint has a merchandising calendar of important dates/topics in consumers’ financial lives (e.g. tax time) and also watches for hot topics that could impact a consumer’s sense of money (e.g. new credit card legislation).  Mint then has a team that can create new content on that topic that is likely to generate word-of-mouth.  At that point, they push the content out and then energize their communities on Facebook, Twitter, etc. by promoting the content to them.  This generates buzz and visits back to mint.com.
      2. Mint has also created Mint Answers, its own Yahoo Answers-like site where people ask and answer questions on financial topics.  The result is a lot of user-generated content on Mint.com on critical keywords that yields high rankings in the SERPs.
      3. Mint also developed a Twitter aggregator widget around personal finance and put it in a section on their site.  Twitter’s community managers then retweeted these folks, who then signed up for @mint and began retweeting @mint tweets.  According to Steve, the amplification effect was huge.

      Danny Sullivan

      As always, Danny had some really interesting insights to add about real-time search.  I will honestly say that many times I still think Danny, like many search marketers, thinks “transactionally” about search, as compared to consumer marketers who think about having an on-going “conversation” with a customer.  (More on that notion later.)  But in this case, Danny really showed why he is known as an industry visionary:

      • Search marketing means being visible wherever someone has overtly expressed a need or desire.  It is more than web; more than keywords.  An example is mobile apps – search by another name – so I guess he agrees with Steve Jobs on that one.
      • This was uniquely insightful. Whereas normal search is a many-to-many platform where anonymous individuals post  content whose authority grows based on “good” links that are added over time, real-time search is a one-to-one platform where clearly identified people post questions or comments  and get responses.  Authority comes from the level of active engagement, not links.  I had never heard real-time described this way, and it is a succinct but very sophisticated definition of real-time search.
      • You can use conversations to identify folks interested in what you need. Not a new concept, but good to repeat.  So if you have a service that sells vacuum cleaners, search for “anyone know vacuum cleaners” and the folks who have an interest are now identified and you can respond to them.
      • Get a gift by giving a gift. That’s the fundamental currency of social media. Danny answered 42 questions from people who didn’t know him, didn’t follow him.  He got no complaints and 10 thank yous.
      • Recency versus Relevancy. Anyone doing real-time gets this – that authority can come from having high-quality information or having reasonably high quality information in a very short time frame – in other words, sometimes the recency of news makes it more worthy of attention than something older but more thought out.  Danny believes that as Twitter matures (and maybe the entire real-time search business – that wasn’t clear), relevancy is going to get a higher relative weighting, so that relevant results will get more hang time in the SERPs.

      Chris Silver-Smith

      I have trouble summarizing all of Chris’s talk – and it was a very good talk – because so much of what he talked about was covered in my notes from other speakers.  So here are the unique points from his chat:

      • You have to decide how you resource Twitter and other sites.  Questions to ask for your strategy
        • Consumers First: What are consumers saying about your site/company already? How might they use your Twitter content? Develop representative Personas of consumers who would engage with you on Twitter.
        • Time/Investment: How much time do you have to devote to Twittering? Do you devote someone to spend time daily reading/responding to Tweets?
        • Goals: What are some advantageous things you could accomplish by interacting with consumers in real-time?
        • Strategy will decide whether you hire a full-time person, part-time person, or use automation.
        • Use OAuth for API integration as it shows the application the visitor used as an appended data point
        • Convert your Google News feeds to RSS to make them easier to subscribe to by members of your community
        • A great tool for small business social media management is www.closely.com which auto-creates a social action page for every offer a company makes on Twitter and Facebook
        • Be brief but really clear in the main point of your Tweets. Include a call to action, as such tweets are retweeted at a much higher rate.

      John Shehata – Advanced Internet

      I loved John’s presentation because it confirmed many of the same conclusions I had reached about real-time search and reported on at SMX Advanced in London.  Key points:

      • The ranking factors for real-time search are very different. They include:
        • User (author) authority (My comment:  not just one site but across every site  on which the author publishes).
        • How fresh that author’s content continues to be.
        • Number of followers.
        • The quality of follows and how they act on the author’s content (is it retweeted often?  Is it stumbled?  Does someone flow it into their RSS feed?  How often?  How quickly?).
        • URL real-time resolution.
        • It is not about how many followers you have but how reputable (authoritative) your followers are.  (This is what I call Authorank and like PageRank it is passed from authoritative follower to those they follow.)
        • You earn reputation, and then you give reputation. If lots of people follow you, and then you follow someone–then even though this [new person] does not have lots of followers, his tweet is deemed valuable because his followers are themselves followed widely.
        • Other possible ranking factors:
          • Recent Activity : Google pays more attention to accounts with more activity?
          • User name: keywords in your user name might also help.
          • Age: since age plays a big role in Google search engine ranking, it’s possible that more established Twitter accounts will outrank the newer ones.
          • External links: links to your @account from (reputable) non-social media sites should boost reputation as far as Google is concerned.
          • Tweet Quantity: the more you tweet, the better chance you’ve got to be seen in Google real-time search results.
          • Ratios of followed vs follow: a close ratio between the two can raise a red flag.
          • Lists: it might also matter in how many lists you appear.

      Tactics to follow:

      • Encourage retweets by tweeting content of 120 characters or less so you can save room for the RT @ Username that is added when someone passes along your message to their followers.
      • Tools to identify hot trends: Google Hot Trends, Google Insights, Google News, Bing xRank, Surchur, Crowdeye, Oneriot.
      • Same advice as Steve Langville – plan for seasonal keyword trends.
      • Don’t update multiple accounts, reTweet instead.
      • Connect your social profiles.
      • Attract reputable, topically-related followers.
      • Write keyword-rich tweets whenever possible, without sounding spammy:
        • Do not create content with multiple buzzing terms.
        • Do not abuse shortening services for spam links.
        • Do not go overboard using Twitter #hashtags – Search Engines will eliminate your tweet from search if you use too many because it “looks bad.”
        • Spammy looking tweet streams will be eliminated from search.
        • Don’t use same IP address for different twitter accounts.

      Show Me The Links

      This was a great session with a HUGE number of ideas for getting new links.  And each person talked about a very different philosophy towards link building and their tactics reflected those philosophies.  Let’s see if I can capture them:

      Chris Bennett

      • Philosophy centers on using easily created and highly valued visual or viral content:
        • Creating Infographics – they work very well.  An example – a “where does the money go from the 2008 stimulus bill” infographic generated 29,000 links.
        • Writing guest blog posts whose content is highly viral for others.  Embed a link to your site as the source.  You give the gift of traffic to them, you get links as a gift in return.

      Arnie Kuenn

      • More traditional link building
        • 50% is content development and promotion.  The big example he used on this was the Google April Fools Day prank about Google opening an SEO shop.  It got picked up as a “real” story by Newswire 27 days after the post, went viral, and generated 800 backlinks.
        • 20% is blog post and article placement.
        • 10% is basic link development.
        • 20% is targeted link requests to those few critical high-value sites. There are NO magic bullets here – it takes creativity and just good old-fashioned hard work and persistence.  But the rewards can be substantial.

      Gil Reich

      • Use badges with your URL embedded that benefit the person who puts them on their site (e.g. a “gold star” validation).
      • Write testimonials for other folks.
      • Write on sites that want good content and can deliver an audience.
      • Answer questions on answer sites where you have the expertise.
      • Make it easy to link to you by providing the information to potential linkers.

      Roger Montti

      Focused on B2B link building tactics:

      • Backlink trolling from competitors- but also look for sites that your competitors aren’t on – you want your own authoritative link network.
      • Don’t ignore the .us TLD.  There are lots of good possible link sites with decent authority there.
      • Look at associations that provide ways to link to their members.  Search for member lists, restrict your search to .org and add in relevant keyword phrases to filter for your related groups.
      • Look at dead sites with broken links – see who is linking to them.  Once you have identified a dead internet page do a linkdomain: search on Yahoo to identify sites still linking to the dead site.
      • Free links from resources, directories, or “where to buy” sites.
      • Bloggers: cultivate alliances and relationships with other sites and blogs, particularly bloggers who like to do interviews.

      Debra Masteler

      • You have all this content that you generate as a normal part of your business.  Use it.
        • Use dapper.net to create RSS feeds of your blog content
        • Joost de Valk has a WordPress plugin at http://yoast.com/wordpress/rss-footer/ which lets you add an extra line of content to articles in your feed, defaulting to “Post from” and then a link back to your blog, with your blog’s name as its anchor text.
        • Use RSS feeds from news sources to identify media leads to speak with as part of your PR work.
        • Content syndication: podcasts, white papers, living stories, news streams and user-generated content (e.g. guest blogging) are still hot.  Infographics, short articles, individual blogs, and Wikipedia are not.
        • Widget Bait: basic widgets that you can build on widgetbox are getting somewhat passé but still have some value.   You need to do more advanced versions – information aggregation widgets seem to work very well right now.  Make people come to you to download them.
        • Microsites: the old link wheels are worthless at this point – the engines have figured those out and treat them similarly to link spam sites.  Those with good content – e.g. blogs or sites with good content – work.  One option is to buy an established site and then rebrand it.

The End of The Chasm Is Nigh – Intro

Many years ago at Stanford, I had the opportunity to work with a team of researchers related to Everett Rogers, who wrote the book Diffusion of Innovations.  That book has had huge influence in high tech, because it was the first accessible, mass-market publication to provide a working model of how new technologies achieve market acceptance.  The most famous image is the Adoption Curve (see below), which defined 5 categories of technology adopters: innovators, early adopters, early majority, late majority, and laggards.  These terms have become fundamental in high tech marketing, and you will often hear phrases like “Our initial target market are the early adopters” in marketing planning sessions.

Everett Rogers Original Adoption Curve

Since I was involved with the team that developed the Adoption Curve, it became a standard part of my repertoire as a marketer.  Like most others, it structured my views on how to approach any market for a new product innovation.

Then in 1991, along came Geoffrey Moore, a consultant with the McKenna Group, who published Crossing the Chasm. Crossing the Chasm expanded on Rogers’ diffusion of innovations model.  Moore argued that there is a chasm between the early adopters of the product (the “innovators”, or technology enthusiasts and visionaries) and the early majority, who, while appreciating a new technology, tend to be more pragmatic about its application.  As a result, the needs and purchasing decision-making of these two groups are quite different.  Since effective marketing requires selling to the needs of a specific segment, there comes a time when young companies face a “chasm” where the features and marketing that helped them gain their early followers will not work, and thus they need to adapt their business to a new set of customers and expectations.  It takes time, energy, and a lot of experimentation to find the right new model.  But in high tech businesses, especially prior to the Web, sales cycles tend to be relatively long (12-18 months is not unusual).  Given that most small companies have limited resources, the number of experimental cycles they can undertake to discover the correct new model is thus limited.  This makes the transition extremely hard – limited resources, limited time, and a lot of spinning of wheels until the right model is discovered.  It requires a lot of heavy lifting and long hours – and if you’ve ever been through this, you’ll know why Moore chose to call it “a chasm.”  It feels like a huge, almost overwhelming leap from where you are today to where you need to be tomorrow.  Even with a running start, when you take the leap to grow your company to the next level, it’s easy to miss and “fall into the chasm.”

Everett Rogers Technology Adoption Curve Adapted with The Chasm

I had been working with the technology adoption model visually in my head for almost 10 years at the time Moore published his book.  And when I saw his curve, I realized that we tend to see only what we have modeled (or had modeled by others) in our minds about how the world works.  I had been struggling with the chasm for all that time, and never saw it, even though it was staring me in the face.  I swore that the next time I experienced something that was at odds with my internal models of reality, I wouldn’t ‘ignore the data’ and would make a concerted effort to see past the limitations of my own mind.

So, Geoff, I have one for you.  For web-based businesses, the chasm is closing, and I can already see a time in the near future when it will no longer be a barrier to a company’s transition from a customer base mainly made up of innovators to a customer base of early adopters.  The End of “The Chasm” Is Nigh.  Darwin – and the real-time web – are dealing with it.

The detailed rationale in my next post.  Right now, I need to get onto my day job.


Web Site Latency and Performance Issues – Part 6

Taking up where we left off in part 5…

In the last post, we had just moved aboutonlinematters.com to a privately hosted standalone server and seen a substantial decrease in web site latency. Our rating in Google Page Speed had improved from being better than 43% of similar websites to better than about 53% of sites. So, great improvement. But we were still showing a lot of issues in YSlow and the Google Page Speed tool. These fell into three categories:

  • Server Configuration. This involves optimizing settings on our Apache web server: enabling gzip for file compression, applying entity tags, adding expires headers, turning on keep-alive,  and splitting components across domains.
  • Content Compression. This involves items like compressing images,  javascript, and css, specifying image sizes, and reducing the number of DOM elements.
  • Reducing External Calls. This involves combining all external css and javascript files into a single file, using cookieless domains, minimizing DNS lookups and redirects, as well as optimizing the order and style of scripts.

We decided to attack the web site latency issues in stages, first attacking those elements that were easiest to fix (server configuration) and leaving the most difficult to fix (reducing external calls) until last.

Server Configuration Issues

In their simplest form, server configuration issues related to web site latency have to do with settings on a site’s underlying web server, such as Apache.   For larger enterprise sites, server configuration issues cover a broader set of technical topics, including load balancing across multiple servers and databases as well as the use of a content delivery network.  This section is only going to cover the former, and not the latter, as they relate to web site latency.

With Apache (and Microsoft IIS), the server settings we care about can be managed and tracked through a page’s HTTP headers.  Thus, before we get into the settings we specifically care about, we need to have a discussion of what HTTP headers are and why they are important.

HTTP Headers

HTTP headers are an Internet protocol, or set of rules, for formatting certain types of data and instructions that are either:

  • included in a request from a web client/browser, or
  • sent by the server along with a response to a browser.

HTTP headers carry information in both directions.  A client or browser can make a request to the server for a web page or other resource, usually a file or dynamic output from a server side script.  Alternately, there are also HTTP headers designed to be sent by the server along with its response to the browser or client request.

As SEOs, we care about HTTP headers because our request from the client to the server will return information about various elements of server configuration that may impact web site latency and performance. These elements include:

  • Response status; 200 is a valid response from the server.
  • Date of request.
  • Server details; type, configuration and version numbers. For example the php version.
  • Cookies; cookies set on your system for the domain.
  • Last-Modified; this is only available if set on the server and is usually the time the requested file was last modified
  • Content-Type; text/html is a html web page, text/xml an xml file.

There are two kinds of requests. A HEAD request returns only the header information from the server. A GET request returns both the header information and file content exactly as a browser would request the information. For our purposes, we only care about HEAD requests. Here is an example of a request:

Headers Sent Request
HEAD / HTTP/1.0
Host: www.aboutonlinematters.com
Connection: Close

And here is what we get back in its simplest form using the Trellian FireFox Toolbar:

Response: HTTP/1.1 200 OK
Date: Sun, 04 Apr 2010 00:17:06 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.3.0
X-Pingback: http://www.aboutonlinematters.com/xmlrpc.php
Link: <http://wp.me/DbBZ>; rel=shortlink
Content-Encoding: gzip
Cache-Control: max-age=31536000
Expires: Mon, 04 Apr 2011 00:17:06 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: Chunked
Proxy-Connection: Keep-alive
x-ua-compatible: IE=EmulateIE7

Different tools will return different header information depending on the specific requests made in the calling script. For example, Live HTTP headers, a plugin for FireFox, provides detailed header request and response information for every element on a page (it basically breaks out each GET and shows you the actual response that comes back from the server). This level of detail will prove helpful later when we undertake deep analysis to reduce external server requests. But for now, what is shown here is adequate for the purposes of our analysis.
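You don’t need a browser plugin to do this, either.  Here is a minimal sketch using Python’s standard http.client module that issues the same HEAD request and prints whatever headers the server returns:

import http.client

conn = http.client.HTTPConnection("www.aboutonlinematters.com", 80, timeout=10)
conn.request("HEAD", "/")          # headers only, no body
response = conn.getresponse()

print(response.status, response.reason)
for name, value in response.getheaders():
    print(f"{name}: {value}")      # e.g. Content-Encoding, Expires, etc.
conn.close()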

For a summary of HTTP header requests and response codes, click here.  But for now, let’s get back to configuring our Apache server to reduce web site latency.

Apache Server Settings Related to Site Latency

Enabling Gzip Compression

Web site latency substantially improves when the amount of data that has to flow between the server and the browser is at a minimum.  I believe I’ve read somewhere that image requests account for 80% of the load time of most web pages, so just following good image-handling protocols for web sites (covered in a later installment) can substantially improve web site latency and page loading times.  However, manually compressing images is painful and time consuming.  Moreover, there are other types of files – Javascript and CSS are the most common – that can also be compressed.

Designers of web servers identified this problem early on and provided a built-in tool on their servers for compressing files moving between the server and the browser.  Starting with HTTP/1.1, web clients indicate support for compression by including the Accept-Encoding header in the HTTP request.

Accept-Encoding: gzip, deflate

If the web server sees this header in the request, it may compress the response using one of the methods listed by the client. The web server notifies the web client of this via the Content-Encoding header in the response.

Content-Encoding: gzip

Gzip remains the most popular and effective compression method. It was developed by the GNU project and standardized by RFC 1952. The only other compression format is deflate, but it’s less effective and less popular.

Gzipping generally reduces the response size by about 70%.  Approximately 90% of today’s Internet traffic travels through browsers that claim to support gzip. If you use Apache, the module configuring gzip depends on your version: Apache 1.3 uses mod_gzip while Apache 2.x uses mod_deflate.
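For reference, on Apache 2.x a typical mod_deflate setup (assuming the module is loaded) compresses the text-based content types with a directive along these lines – the exact list of MIME types is a judgment call for your site:

AddOutputFilterByType DEFLATE text/html text/plain text/xml text/css application/javascript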

Configuring Entity Tags

Web servers and browsers use Entity tags (ETags) to determine whether the component in the browser’s cache, like an image or script (which are examples of an “entity”) matches the one on the origin server. It is a simple string, surrounded by quotation marks, that uniquely identifies a specific version of the selected component/entity. The origin server specifies the component’s ETag using the ETag response header.

HTTP/1.1 200 OK
Last-Modified: Sun, 04 Apr 2010 00:37:48 GMT
Etag: "1896-bf9be880"
Expires: Mon, 04 Apr 2011 00:37:48 GMT

Later, if the browser has to validate a component, it uses the If-None-Match header to pass the ETag back to the origin server. If the ETags match, a 304 status code is returned.

GET http://www.aboutonlinematters.com/wp-content/plugins/web-optimizer/cache/f39a292fcf.css?1270299922 HTTP/1.1
Host: www.aboutonlinematters.com
If-Modified-Since: Sun, 04 Apr 2010 00:37:48 GMT
If-None-Match: "1896-bf9be880"
HTTP/1.1 304 Not Modified

ETags can impact site latency because they are typically constructed using attributes that make them unique to a specific server. ETags won’t match when a browser gets the original component from one server and later tries to validate that component on a different server, which is a fairly standard scenario on Web sites that use a cluster of servers to handle requests. By default, both Apache and IIS embed data in the ETag that dramatically reduces the odds of the validity test succeeding on web sites with multiple servers. If the ETags don’t match, the web client doesn’t receive the small, fast 304 response that ETags were designed for.  Instead,  they get a normal 200 response along with all the data for the component.  This isn’t a problem for small sites hosted on a single server. But it is a substantial problem for sites with multiple servers using Apache or IIS with the default ETag configuration.  Web clients see higher web site latency, web servers have a higher load,  bandwidth consumption is high, and proxies aren’t caching content efficiently.

So when a site does not benefit from the flexible validation model provided by Etags, it’s better to just remove the ETag altogether. In Apache, this is done by simply adding the following line to your Apache configuration file:

FileETag none

Expires Headers

The Expires header makes any components in an HTTP request cacheable. This avoids unnecessary HTTP requests on any page views after the initial visit because components downloaded during the initial visit, for example images and script files, remain in the browser’s local cache and do not have to be downloaded on subsequent requests. Expires headers are most often used with images, but they should be used on all components including scripts, stylesheets, and Flash components.

Browsers (and proxies) use a cache to reduce the number and size of HTTP requests, making web pages load faster.  The Expires header in the HTTP response tells the client how long a component can be cached. This far future Expires header

Expires: Thu, 15 Apr 2020 20:00:00 GMT

tells the browser that this response won’t be stale until April 15, 2020.

Apache uses the ExpiresDefault directive to set an expiration date relative to the current date. So for example:

ExpiresDefault "access plus 10 years"

sets the Expires date 10 years out from the time of the request.

Using a far future Expires header affects page views only after a user has already visited a site for the first time or when the cache has been cleared. Therefore the impact of this performance improvement depends on how often users hit your pages with a primed cache. In the case of About Online Matters, we still do not get lots of visitors, so you would expect that the impact of this change to the server would have little impact on our performance and, indeed, that proved to be true.

Keep Alive Connections

The Keep-Alive extension to HTTP/1.0 and the persistent connection feature of HTTP/1.1 provide long-lived HTTP sessions which allow multiple requests to be sent over the same TCP connection. This prevents an extra connection setup for every object on a page, and instead allows multiple objects to be requested and retrieved in a single HTTP session.  Each new TCP connection requires a three-way handshake and has built-in congestion control algorithms that restrict available bandwidth at the start of a session.  Making multiple requests over a single connection reduces the number of times that setup cost is paid.  As a result, in some cases, enabling keep-alive on an Apache server has been shown to result in an almost 50% speedup in latency times for HTML documents with many images.  To enable keep-alive, add the following line to your Apache configuration:

KeepAlive On

Is The Configuration Correct?

When I make these various changes to the server configuration, how can I verify they have actually been implemented?  This is where HTTP headers come into play.  Let’s take a look at the prior response we got from www.aboutonlinematters.com when we made a HEAD request:

Response: HTTP/1.1 200 OK
Date: Sun, 04 Apr 2010 00:17:06 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.3.0
X-Pingback: http://www.aboutonlinematters.com/xmlrpc.php
Link: <http://wp.me/DbBZ>; rel=shortlink
Content-Encoding: gzip
Cache-Control: max-age=31536000
Expires: Mon, 04 Apr 2011 00:17:06 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: Chunked
Proxy-Connection: Keep-alive
x-ua-compatible: IE=EmulateIE7

The Content-Encoding, Expires, and Proxy-Connection lines show that the gzip, expires headers, and keep-alive switches have been implemented on our server.  ETags won’t show in this set of responses because ETags are associated with a specific entity on a page.  They show instead in tools that provide detailed analysis of HTTP requests and responses, such as Live HTTP Headers or Charles.  No ETags should be visible in an HTTP request or response if FileETag none has been implemented.

Results

We made the changes in two steps.  First we activated gzip compression and Expires headers and removed ETags.  These changes made only a negligible difference in overall web site latency.  Then we implemented the keep-alive setting.  Almost immediately, our site latency improved in the Google Page Speed tool from being better than 53% of similar sites to being better than 61%.

We’ll stop there for today and pick up with content compression in the next installment.
