A slight detour in our discussion of methods of geolocation in mobile for a comment about algorithms.
Many, if not most, companies with any intelligent automation talk about their algorithms. There’s an algorithm for optimizing pricing, an algorithm for selecting a target audience, an algorithm for beating the casino at blackjack, etc. etc. You get the point. Marketers almost always want to include the term in their collateral. Why? Because the implication of the word “algorithm” today is that they are hard, manipulate huge amounts of data, require a lot of complex math, and thus by inference make the company appear both smarter than the average bear and the owner of unique intellectual property that makes their products or services better than the next guy’s.
On my teams, the word algorithm is verboten in describing what we do. Algorithms are tools - nothing more than step-by-step procedures for calculations. And yes, for the technologists in the crowd I am aware the definition can be a tad more precise. But that is just the point – the word algorithm has been used so much and applied to such a wide range of situations (e.g., “Mark had an algorithm by which he determined which route to take to the office during rush hour”) that it has become effectively meaningless.
On my teams, we use the word model because what we do is model human behavior. We look at data to understand how people act, what they value, what they believe. We then hypothesize what that data means in terms of the motivations and internal beliefs/processes that lead to those behaviors. Basically, we are data-driven virtual psychologists trying to understand what is going on in the ‘black box’ of the human mind based on what we can see – the inputs into the box and the outputs from it.
The hardest part of our job is not the math or the calculation process, but asking the right questions. As I gain more experience in this arena, I find this is where most data scientists miss the mark. They are so caught up in the math they forget about (or don’t understand) the real issue. After all, guys (and it is mainly guys) who have a highly dominant left brain don’t really grok the emotion that their work is trying to uncover. This is especially true, and this is not a sexist comment, when we are talking about the emotions of women shoppers. And at the end of the day, it is the sentiments of a human being that we really want to understand.
For example, we see that two people go to a Starbucks every day and both drink three cups of coffee. However, one person goes repeatedly to the same Starbucks, while the other goes to numerous ones around their city throughout the day. What would cause that difference? Hypothesis: one is a stay at home mom/worker who takes a run/walk every morning and stops for coffee; the other is a service professional moving between customer offices. Or maybe the second person is a pizza delivery person. Each of these hypotheses is tested with various calculations on the data and is either validated or not. At that point we have a guess at who they might be.
Now we can look at other data and start making predictions about their attitudes and values. Let’s say the person has been identified as a stay at home mom. What does that tell us about them? Well, we might guess they love being a parent enough to sacrifice some part of their career to have time with their kids. Alternately, they might be driven by the fact that their spouse makes more money, and so they are the member of the couple who has to make a career sacrifice for the financial welfare of the family. Which means that they are rational, but also are willing (with regrets) to subordinate their own needs to those of others. Either way, they might be frustrated with having to stay at home, and be responsive to an offer from a brand that shows it recognizes their frustration and offers them something uniquely for them and not the other members of the family. A spa day, for example. Or at least a way to create their own relaxation time at home – because they can’t go to a spa and leave the kids at home alone (because…they can’t afford a nanny?).
So now what? I create a model that tries to capture and predict the behavior of someone with those attitudes and values. Yes, there is math underneath it. Yes, there is a step-by-step procedure – an algorithm – which runs underneath it. But I couldn’t care less about the math – that’s a tool. The important thing is to focus on how we think the black box of the human soul is working.
Using this model I would predict a certain type of response to an ad that reflects these values, based on prior response rates to similar ads – maybe not targeted to exactly the same psychology but let’s say similar (without defining what ‘similar’ means in this case.) Now we run the ad and see what happens. If the response rate exceeds my projected threshold, I will assume that my model of what is happening in the black box is right; if not, we go back to the drawing board.
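For the technologists in the crowd, the check itself is trivial, which is rather the point. Here is a rough Python sketch of it. The function names, the counts and the 2% threshold are all made up for illustration; this is not the actual process or numbers used on my teams.

```python
def response_rate(responses, impressions):
    """Fraction of ad impressions that produced a response."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return responses / impressions


def model_supported(responses, impressions, projected_threshold):
    """True when the observed response rate exceeds the projected threshold.

    The threshold is an assumption supplied by the modeler, for example
    derived from prior response rates to similar ads.
    """
    return response_rate(responses, impressions) > projected_threshold


# Hypothetical campaign: 30 responses from 1,000 impressions, 2% threshold.
if model_supported(30, 1000, 0.02):
    print("Keep the model for now")
else:
    print("Back to the drawing board")
```

The hard part is everything that goes into choosing the ad and the threshold, not this comparison.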
So have you seen the latest Swiffer ads? Swiffer’s value proposition: spend less time housecleaning and we give you more time for yourself. In the ad, mom uses that gift of a home spa day that’s been sitting on the shelf. When her kids come into the bathroom looking for her, she turns and has a cucumber mask in process. Kids scream in fright; run out. Mom is not happy to scare her kids, but in some ways smug because her needs came before the kids. I would bet this ad is successful at engaging the audience just described because it appeals to their sentiments.
Let’s be clear, though. Success does not mean I really know what is happening in the black box – how the gears are arranged, what causes them to move, how fast they move. It’s just that whatever model I have created parallels the way the mechanics of the black box of a group of people work, so I assume I have got the model right. But later data may prove me wrong and, with further modeling and using better algorithms as tools, I may get better and better at paralleling the real psychology. But this is working at a very high level on a group of people. I can never really know what is going on in the black box of any individual’s mind, and even within the group, it varies from person to person.
The algorithm is not the model. It is a tool we use to build a model. Nothing more; nothing less. That’s why the term is verboten in my groups. Our focus must always be on the person, not the tool, or else we lose sight of our customers and can only see as far as our computer screens.
We now take a slight turn from more exact geolocation technologies to one that is more basic – IP address-based geolocation. IP address-based geolocation (IPG) was the earliest online geolocation technique and has been around since 1999. It determines a user’s geographic latitude, longitude and, by inference, city, region and nation by comparing the user’s public Internet IP address with known locations of other electronically neighboring servers and routers. While IPG is not specific to mobile, it is used in the geolocation of mobile devices by the more complex algorithms. It is thus worth taking time to understand what it is, how it works, and how accurate it is.
What we will find is that as a stand-alone technology, IPG cannot locate a device with any reasonable degree of accuracy. Moreover, there is no magic to linking an IP address to a location – it must come from some type of third-party service that has manually (or semi-manually) mapped IP addresses to geolocations. Even then, without the help of your ISP in providing more information about a device, the best you can do is the location of your ISP’s host server. However, in a later entry, we will discover that, when combined with other forms of geolocation, an IP address can be used as an extra signal to confirm location.
Every device connected to the public Internet is assigned a unique number known as an Internet Protocol (IP) address. IP addresses consist of four numbers separated by periods (also called a ‘dotted-quad’) and look something like 192.168.0.1.
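A dotted-quad is just a human-friendly way of writing a single 32-bit number. The small Python sketch below uses the standard library ipaddress module to show both views; the two helper function names are my own, chosen for illustration.

```python
import ipaddress


def dotted_quad_to_int(address):
    """Return the 32-bit integer behind an IPv4 dotted-quad string."""
    return int(ipaddress.IPv4Address(address))


def octets(address):
    """Return the four numbers of a dotted-quad as a list of ints (0-255)."""
    return list(ipaddress.IPv4Address(address).packed)


print(octets("192.168.0.1"))              # [192, 168, 0, 1]
print(dotted_quad_to_int("192.168.0.1"))  # 3232235521
```

Treating addresses as plain integers is what makes it cheap to check whether one falls inside a given block of addresses.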
Since these numbers are usually assigned to Internet service providers within region-based blocks, an IP address can often be used to identify the region or country from which a computer is connecting to the Internet. An IP address can sometimes be used to show the user’s general location. At one time ISPs issued one IP address to each user. These are called static IP addresses. Because Internet usage exploded far beyond what was envisioned in the early design of the IP standard (known as IPv4) and the number of IP addresses is limited, ISPs moved toward allocating IP addresses in a dynamic fashion out of a pool of IP addresses using a technology called Dynamic Host Configuration Protocol or DHCP. This dynamic allocation makes physically locating a device using an IP address tougher.
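To make the idea of region-based blocks concrete, here is a minimal Python sketch of a block lookup. The table is entirely hypothetical: the two blocks are address ranges reserved for documentation, and the region names are placeholders. A real service would load a far larger table from a registry or a commercial database.

```python
import ipaddress

# Hypothetical block-to-region table, for illustration only.
REGION_BLOCKS = {
    "198.51.100.0/24": "Region A",
    "203.0.113.0/24": "Region B",
}


def region_for(address, blocks=REGION_BLOCKS):
    """Return the region whose block contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for cidr, region in blocks.items():
        if ip in ipaddress.ip_network(cidr):
            return region
    return None


print(region_for("203.0.113.7"))  # Region B
print(region_for("192.0.2.1"))    # None, since no block in the table matches
```

Note what the lookup gives you: the region the block was assigned to, not where the device holding the address happens to be today.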
As we move forward in this discussion, an example will help us understand what is required to convert an IP address to a physical location. Below are two different services which provide geolocation information from an IP address.
The first service is Google search. If you type “what is my ip address” into the Google search box, a set of results is returned. The IP address of the device from which the search was made appears at the top of those results. On the left-hand side, Google shows it has auto-detected my location in Carmel Valley Village (actually, I am about a mile away and a few hundred feet above Carmel Valley Village).
The second service is WhatIsMyIPAddress.com, which is the first organic listing in the result set returned from the “what is my ip address” search. In this case, the service shows “my device” as being in Salinas, California, about 20 miles away as the crow flies. Or actually it doesn’t show my device as being in Salinas. It shows that my ISP is Comcast Cable and that my ISP is in Salinas.
Same query. Two different services. Two very different results. The reason for the difference is that Google is using multiple sources to geolocate my device (IP address, wifi-based latitude and longitude) whereas WhatIsMyIPAddress is only using DNS-based information including traceroute mapping for this particular page. Once I approve the use of geolocation services, whatismyipaddress yields similar results to Google because it also triangulates across multiple sources.
The rest of this post will delve into why this has occurred, which involves understanding the technology used to perform IP-based address geolocation.
It’s a Ping Thing
To understand what is going on, we have to start at the most raw form of the technology underlying IP Addresses, which is the TCP/IP model itself. Our most basic entry to determine an IP address or a host from an IP address is the ping command. We won’t go into how ping operates in detail here, but you can find a great overview of this at the GalaxyVisions website. However, by definition the ping command, which is an application and sits in the application layer of the TCP/IP model, reaches down into the internetworking layer of the TCP/IP model directly with an ICMP Echo Request message and receives back an IP address in an Echo Reply message. Here is an example of what a ping looks like for 18.104.22.168:
What the image shows is that through the echo request/reply, Ping is able to retrieve information about the hostname of the particular IP address, which in this case is a server at comcast.net. Note the IP address is supposedly the IP address of my computer in my house. But it isn’t. Instead what is returned is the location of the server of my ISP to which my account is attached.
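One detail worth spelling out: as far as I understand it, the hostname that ping displays comes from a reverse DNS lookup on the address, not from the Echo Reply itself. The Python sketch below shows the two pieces. The function names are my own, and the second function needs a live network and resolver, so treat its result as environment-dependent.

```python
import ipaddress
import socket


def reverse_dns_name(address):
    """Build the special DNS name that is queried for a reverse lookup."""
    return ipaddress.ip_address(address).reverse_pointer


def lookup_hostname(address):
    """Ask the resolver for the hostname registered for an IP address.

    Returns None when no reverse record exists or the lookup fails.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(address)
        return hostname
    except (socket.herror, socket.gaierror):
        return None


print(reverse_dns_name("192.0.2.10"))  # 10.2.0.192.in-addr.arpa
```

Either way, the answer is whatever name the owner of the address block chose to publish, which is why you end up with the ISP’s server name and not anything about the house the computer sits in.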
From DNS Hostname to Location
So in the prior step, we were able to use Ping to get to a server/domain name. The next step is to get from the server name to its location. This is where it took me some time to understand the options available and how they work. In this section I am going to discuss four:
- DNS LOC Records
- Whois Data
- W3C Geolocation Services in Web browsers
- Third-Party Service Providers that Map IP Address to a Physical Location
DNS LOC Records
In the Domain Name System there is an experimental standard called RFC 1876 which is named “A Means for Expressing Location Information in the Domain Name System.” This standard defines a format for a location record (LOC record) that can be used to geolocate hosts, networks, and subnets using the DNS lookup. You can read the standard to get all the details, but the format of the record looks like this:
The size is the diameter of the sphere surrounding the chosen host and the horiz pre and vert pre give the precision of the estimate. Latitude, longitude and altitude are pretty obvious as to what they mean.
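For those who want to see the fields in action, here is a minimal Python sketch that parses the text form of a LOC record into decimal degrees and meters. It is based on my reading of RFC 1876, including the defaults for the optional fields (1 m size, 10,000 m horizontal precision, 10 m vertical precision). The function name, the dictionary keys and the sample record are illustrative, and the parser skips most validation.

```python
def parse_loc(rdata):
    """Parse LOC record text such as '42 21 54 N 71 06 18 W -24m 30m'."""
    tokens = rdata.split()

    def take_angle(hemispheres):
        # Collect degrees [minutes [seconds]] up to the hemisphere letter.
        parts = []
        while tokens and tokens[0].upper() not in hemispheres:
            parts.append(float(tokens.pop(0)))
        if not tokens or not 1 <= len(parts) <= 3:
            raise ValueError("malformed LOC coordinate")
        hemisphere = tokens.pop(0).upper()
        parts += [0.0] * (3 - len(parts))
        degrees = parts[0] + parts[1] / 60 + parts[2] / 3600
        # South and west are expressed as negative decimal degrees.
        return -degrees if hemisphere in ("S", "W") else degrees

    latitude = take_angle(("N", "S"))
    longitude = take_angle(("E", "W"))
    if not tokens:
        raise ValueError("missing altitude")

    # Remaining fields are in meters, with an optional trailing 'm'.
    meters = [float(token.rstrip("mM")) for token in tokens]
    defaults = [1.0, 10000.0, 10.0]  # size, horiz pre, vert pre (assumed defaults)
    size, horiz_pre, vert_pre = meters[1:4] + defaults[len(meters) - 1:]

    return {
        "latitude": latitude,
        "longitude": longitude,
        "altitude_m": meters[0],
        "size_m": size,
        "horiz_pre_m": horiz_pre,
        "vert_pre_m": vert_pre,
    }


print(parse_loc("42 21 54 N 71 06 18 W -24m 30m"))
```

Notice that a missing horizontal precision falls back to 10 kilometers, which tells you how coarse the standard expects this data to be.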
DNS LOC has two problems. First, it has only been defined for a few sites around the world. “Defined” means that an ISP has manually created a LOC record for its hosts (it adds the record to its DNS servers). Second, it once again only gives the location of the host server – in this case hsd1.ca.comcast.net – not the location of my device.
So this doesn’t help us.
Whois Data
Anyone who has used the Internet extensively knows about the whois service. This service describes the owner of a particular domain. It is possible to gain some geolocation information about a domain from it – but that data is usually the headquarters of the owner of the domain and has almost no relation to the location of my domain host, much less my computer. The example from Comcast:
Note that there is no entry for my specific hosting server hsd1.ca.comcast.net (left image) and the information about the top-level domain shows it in Wilmington, DE (right image). Not much help at all.
W3C Geolocation Services in Web Browsers
W3C geolocation services describes an API that provides scripted access to geographical location information, such as latitude and longitude, associated with a specific device via the device’s browser. Geolocation Services are agnostic of the underlying location information sources which can include Global Positioning System (GPS) and location inferred from network signals such as IP address, RFID, WiFi and Bluetooth MAC addresses, and GSM/CDMA cell IDs, as well as user input.
The API allows developers to access both “one-shot” position requests and repeated position updates. It also provides the ability to cache historic positions and use the data to add geolocation functionality to their applications. For the geeks in the room and so newbies won’t be confused when they find more geolocation-based acronyms, Geolocation Services builds upon earlier work in the industry, including [AZALOC], [GEARSLOC], and [LOCATIONAWARE].
Note that Geolocation Services draws on third-party services – it does not do any geolocation itself from the device. Thus, all W3C geolocation services – and this is not to minimize their value in developing online and mobile location-aware applications – are simply an aggregation tool to allow developers to draw on whatever third party sources are available to geolocate a device and feed that information into their applications.
Also note that there is nothing explicitly tying these services to the DNS record. How do you make that connection?
Well, as the next section shows, W3C geolocation services can draw upon a service like hostip.info to get the geographic location of the host.
Third-Party Service Providers
At the end of the day, there is no “magic bullet” technology that links an IP address to a geolocation. The only way this occurs is through a third-party service that has used numerous, usually labor-intensive and semi-manual techniques to acquire a geolocation for an IP address.
A part of me wants to chat about Netgeo here – which was one of the earliest attempts to geolocate devices by their IP address. However, Netgeo has not been maintained, and frankly I’ve covered pretty much everything they discuss in the prior sections. But if you are interested in this bit of IP address-based geolocation, click on the link.
Having used a number of these services, I can tell you that the majority do not do a particularly good job of geolocating a host server using an IP address, much less a specific device. I’ll use hostip.info, as it is the most transparent. hostip.info uses crowdsourcing to geolocate a host. Developers and ISPs can enter the location of their servers into the hostip database. The database is then freely available to anyone who wishes to use it. Here is an example of how my location fared:
Tustin is several hundred miles to the south of my location. So as you can see, not very accurate at all.
Circling Back to Our Example
So how do these two services handle geolocating my computer? First, they are both using the W3C geolocation API. What differs are the sources they use to identify a location.
Obviously Google is relatively accurate in this example, although I do not consider a one mile radius to be particularly useful for those of us who are trying to deliver fine-grained location-based services. Google manages this through a combination of sources:
If you allow Google Chrome to share your location with a site, the browser will send local network information to Google Location Services to get an estimate of your location. The browser can then share your location with the requesting site. The local network information used by Google Location Services to estimate your location includes information about visible WiFi access points, including their signal strength; information about your local router; your computer’s IP address. The accuracy and coverage of Google Location Services will vary by location.
Google Chrome saves your location information so that it can be easily retrieved. This information is periodically updated; the frequency of updates depends on changes to your local network information.
I should add that when it comes to Android and Android Location Services, Google also uses GPS and Assisted GPS technologies for geolocation.
Android Location Services periodically checks on your location using GPS, Cell-ID, and Wi-Fi to locate your device. When it does this, your Android phone will send back publicly broadcast Wi-Fi access points’ Service set identifier (SSID) and Media Access Control (MAC) data.
And this isn’t just how Google does it; it’s how everyone does it. It’s standard industry practice for location database vendors.
whatismyipaddress, on the other hand, is only using the IP-based address from a third-party service. This is a choice on their part because I haven’t opted in to use location-based services. Once I do, I get the map below and my location by one source is correctly shown as Carmel Valley (it is interesting to note, as well, the different results depending on which third party provider you use). But this is because we are now triangulating not just from IP address-based geolocation. whatismyipaddress.com is also using wifi-based geolocation and cell-tower triangulation via the W3C location services API to get a more accurate read.
Basically, after exploring IP address-based geolocation, the conclusion is it is a non-starter for any application but those that can live with the broadest of geolocation options.
Next up: Assisted GPS.
Well, here we are how many months later, and it’s my first post? My friends have been ribbing me – an online guy active in social media – not updating his blog for almost a year.
Sadly, I haven’t been all that active in social media, either, until the last couple of weeks. Checked out completely. Nowhere to be found in real-time. Gone from the immediate consciousness of those I hold near and dear (or at least, in the case of Twitter, the great founts and filters of information). No longer part of the noise. Like a black hole, I might be there but no one could detect me. Tsk, tsk.
The reality is, however, that for the last year I have been heads down moving deeply into machine learning, audience segmentation, behavioral targeting and recommendation engines for mobile advertising. You try pulling all that down and back into your tool set after 25 years away from a masters in robotics (which involved what was then called adaptive learning) even as you deliver product specs, run deep analytics, and build product. I dare you. I ain’t done yet, but I’ve finally gotten to a point where I can take a breath and look up and see what’s going on in the world. However, it’s not like it used to be when it comes to my participation in social media. Deep analysis requires intense concentration (at least for me), and all the interruptions from Tweetdeck just kill my train of thought. So it’s beginning of day or end of day mostly, with an occasional day where I can just relax and range through lots of immediate information.
Sigh. The price I pay for working in an area that is intensely mathematical and has become my passion. The price I pay for reveling in the ability to build incredible products on huge data. A price, but well worth it.
So now that the mea culpa is past, what’s on the agenda for today? Given I have been working in the mobile space, it seems appropriate to start with the issue of geolocation. Geolocation data represents a relatively new input to data mining, but one that can provide a host of opportunities to identify and segment audiences. Admittedly, there are a number of services like Loopt, Foursquare, and others that make their bones on using geolocation to understand where you are and what might be of interest to you. But believe it or not, for the majority of businesses and even many technology companies, the whole idea of using geolocation data as elements of customer profiles is completely new. Many are still trying to wrap their heads around how to best set up a mobile website and integrate it into their overall marketing programs. The science of geolocation? Not even on the radar yet.
It has certainly been an eye opener for me to learn this field – it is deep, rich and complex. So I thought I would build a primer for those who, like me, had to start from scratch and understand how geolocation works and how you might use it to enhance your customer offerings. There is a lot to cover, so this will be another multipart series.
The Basics of Geolocation
Most people know that anyone with a mobile device can be geolocated. But what many people do not know is that a device can also be geolocated when you are online, through information transmitted by the device’s browser (especially Google Chrome and Firefox). The combination of these technologies provides a powerful set of tools for tightly locating a device (meaning a radius of under 200 feet) even when GPS, the most fine-grained way of locating a device, is not turned on.
The core methods by which a device can be geolocated and which we will discuss in the next sections are:
- The Global Positioning System or GPS
- IP Address
- Assisted GPS
- Network Base Station Data
- Network (or Cell Tower) Triangulation
The Global Positioning System
For those few who have never been to a James Bond flick, watched Law and Order, or seen a TomTom commercial, GPS stands for Global Positioning System. But even if you know the term “GPS” you may not know how it works. So let’s start there.
GPS is a space-based satellite navigation system that provides location and time information in all weather, anywhere on or near the Earth, where there is an unobstructed line of sight to four or more GPS satellites. It consists of 24 satellites, is maintained by the United States government and is freely accessible to anyone with a GPS receiver.
GPS is accurate to a very tight radius – current technologies can get a horizontal accuracy of ~1 meter (3 feet) and a vertical accuracy of ~1.5 meters. But GPS accuracy for most mobile phones and pads is probably on the order of a 30-50 foot radius. Garmin, a maker of navigation systems, says its devices are accurate to 15 meters, for example.
Most mobile devices have a GPS receiver built in, although it is not turned on by default due to the fact it drains batteries very quickly. This default is, in fact, the biggest hurdle to accurately geolocating a device, since GPS is by far the most accurate mechanism available.
GPS satellites transmit two low power radio signals, which travel by line of sight. As a result, they can pass through clouds, glass and plastic but will not go through most solid objects such as buildings and mountains.
A GPS signal contains three different bits of information – a pseudorandom code, ephemeris data and almanac data.
The pseudorandom code is simply an I.D. code that identifies which satellite is transmitting information.
Ephemeris data is information GPS satellites transmit about their location (current and predicted), timing and ‘health’. This data is used by GPS receivers to enable them to estimate location relative to the satellites and thus position on earth.
Almanac data tells the GPS receiver where each GPS satellite should be at any time throughout the day. Each satellite transmits almanac data showing the orbital information for that satellite and for every other satellite in the system.
Each GPS satellite is located ~12,000 miles above the Earth and makes two complete orbits every day. GPS receivers in mobile devices attempt to locate four or more of these satellites, calculate the distance to each, and then use the information to geolocate a 3D position (latitude, longitude, altitude). Once the user’s position is determined, the GPS receiver can calculate other information, such as speed, bearing, track, trip distance and much more.
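The distance calculation rests on one simple relationship: radio signals travel at the speed of light, so distance is travel time multiplied by that speed. The Python sketch below shows the arithmetic. The function names are mine, and a real receiver must also correct for clock and atmospheric errors, which this sketch ignores.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0
METERS_PER_MILE = 1609.344


def range_from_travel_time(seconds):
    """Distance in meters covered by a radio signal in the given time."""
    return SPEED_OF_LIGHT_M_PER_S * seconds


def travel_time_for_distance(meters):
    """Time in seconds a radio signal needs to cover the given distance."""
    return meters / SPEED_OF_LIGHT_M_PER_S


# A satellite directly overhead at ~12,000 miles: the signal takes
# roughly 64 milliseconds to arrive.
overhead_m = 12_000 * METERS_PER_MILE
print(travel_time_for_distance(overhead_m))

# Why timing matters: one microsecond of error is about 300 meters of range.
print(range_from_travel_time(1e-6))
```

That second number is the reason receiver clock accuracy gets its own line in any discussion of GPS error.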
The calculation is based on trilateration, which is a mathematical model for determining the absolute or relative position of points using the geometry of circles, spheres, and triangles. Unlike triangulation, which is what most people think GPS uses to fix a location, it does not involve the measurement of angles. To emphasize this, I have chosen a slightly more technical diagram to represent the concept. Note that this calculation does not just involve calculating the intersection of the three radii (point B, which is what we are geolocating) – there are also components that relate to the relative positions of the three foci of the circles.
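For readers who prefer code to geometry, here is a simplified two-dimensional Python sketch of trilateration. It assumes three known points and exact distances, and it solves for the unknown position by subtracting the circle equations from one another, which leaves two linear equations. This is an illustration of the principle only: real GPS works in three dimensions, uses a fourth satellite to deal with receiver clock error, and has to cope with noisy distances. The function name and the sample coordinates are invented.

```python
import math


def trilaterate_2d(p1, r1, p2, r2, p3, r3):
    """Locate a point from three known points and the distance to each."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3

    # Subtracting pairs of circle equations removes the squared unknowns,
    # leaving a*x + b*y = c and d*x + e*y = f.
    a, b = 2 * (x2 - x1), 2 * (y2 - y1)
    c = r1**2 - r2**2 - x1**2 + x2**2 - y1**2 + y2**2
    d, e = 2 * (x3 - x2), 2 * (y3 - y2)
    f = r2**2 - r3**2 - x2**2 + x3**2 - y2**2 + y3**2

    determinant = a * e - b * d
    if math.isclose(determinant, 0.0, abs_tol=1e-12):
        raise ValueError("reference points are collinear; no unique fix")

    x = (c * e - b * f) / determinant
    y = (a * f - c * d) / determinant
    return x, y


# Hypothetical example: the unknown point is at (3, 4).
fix = trilaterate_2d((0, 0), 5.0, (10, 0), math.sqrt(65), (0, 10), math.sqrt(45))
print(fix)
```

No angles are measured anywhere in the calculation, only distances, which is the difference from triangulation.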
Sources of GPS Signal Errors
As we start talking about GPS accuracy and the accuracy of other geolocation technology, we need to understand what types of errors can enter into each system. For GPS, there are six types of signal errors that can occur. Fortunately even with them GPS is incredibly accurate. The table below summarizes the size of the potential effect of various errors, which are then described in more detail.
|Source of Error|Size of Error|
|Multipath Effect|+/- 1 meter|
|Atmospheric Effects|+/- 5 meters|
|Receiver Clock Errors|+/- 2 meters|
|Geometry Shading|+/- 2.5 meters|
|Ephemeris Errors|+/- 1 meter|
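To get a feel for what these numbers mean together, the Python sketch below combines the table in two ways: a simple worst-case sum, and a root-sum-square, which is a common way to combine error sources when they are assumed to be independent of one another. That independence assumption is mine, not something the table itself claims, so treat the second figure as a rough guide only.

```python
import math

# Error sizes in meters, copied from the table above.
ERROR_BUDGET_M = {
    "Multipath Effect": 1.0,
    "Atmospheric Effects": 5.0,
    "Receiver Clock Errors": 2.0,
    "Geometry Shading": 2.5,
    "Ephemeris Errors": 1.0,
}


def worst_case_error(budget):
    """Every error at its maximum and all pointing the same way."""
    return sum(budget.values())


def root_sum_square_error(budget):
    """Combined error, assuming the sources are independent (an assumption)."""
    return math.sqrt(sum(value**2 for value in budget.values()))


print(worst_case_error(ERROR_BUDGET_M))                 # 11.5
print(round(root_sum_square_error(ERROR_BUDGET_M), 1))  # 6.1
```

Both figures are dominated by the atmospheric term, which is why it is singled out below as the largest potential source of error.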
Signal Multipath Errors. Signal multipath errors are caused by the GPS signal reflecting off objects such as tall buildings or other large, highly reflective surfaces before it reaches the receiver. This increases the travel time of the signal, thus introducing errors into the calculation. The resulting error typically lies in the range of a few meters.
Atmospheric Delays. Atmospheric delays represent the largest potential source of GPS signal error. Satellite signals slow as they pass through the ionosphere and troposphere. While radio signals travel with the velocity of light in outer space, their propagation in the ionosphere and troposphere is slower. In the ionosphere, a large number of electrons and positively charged ions are formed by the ionizing force of the sun. These charged particles refract the electromagnetic waves from the satellites, resulting in an elongated runtime of the signals. In the troposphere, varying concentrations of water vapor further elongate the runtime of signals. These errors are mostly corrected by calculations in the GPS receivers, since typical variations of the velocity while passing through the atmosphere are well known for standard conditions.
Receiver Clock Errors. A receiver’s built-in clock is not as accurate as the atomic clocks onboard the GPS satellites. Therefore, it may have very slight timing errors.
Ephemeris Errors. Ephemeris errors occur when a satellite incorrectly reports its position.
Too Few Visible Satellites. The more satellites a GPS receiver can “see,” the better the accuracy. Buildings, terrain, electronic interference, or sometimes even dense foliage can block signal reception.
Geometry Shading. Another factor influencing the accuracy of the reported position is “satellite geometry”. Satellite geometry describes the position of the satellites relative to each other from the view of the receiver. Ideal satellite geometry exists when the satellites are located at wide angles relative to each other. Poor geometry results when the satellites are located in a line or in a tight grouping. For example, if a receiver sees 4 satellites and all are arranged in the northwest, this leads to a “bad” geometry. In the worst case, no position determination is possible at all, when all distance determinations point to the same direction. Even if a position is determined, the error of the positions may be significant, although in practice it is usually no more than 2.5 meters. If, on the other hand, the 4 satellites are well distributed over the whole firmament the determined position will be much more accurate.
Next Installment: Geolocation using IP Address