How to Geocode a Dataset With Open Data

This is the first part of our series How to Add Postcode-Based Proximity Search With Open Data. In this part of our series, we will explain how to geocode a dataset using open data.



In this article, the first in a series about adding location to applications, I explain how to geocode a dataset using open data. Geocoding means assigning a position, expressed as coordinates, to a thing or place (e.g. a restaurant). By the end of this article, you should be able to take a dataset and add the position of each record to that dataset.



This whole process is made easier by the increasing availability of open data. Governments around the world are realising the importance of opening up their data to help drive the digital economy. In this context, that means they allow new applications and derivative datasets to be developed, unencumbered by previous licensing restrictions.



In this article series, I work with UK data because that’s where I am from. No matter where you are in the world, however, you should be able to find similar resources.



In particular, the UK’s Ordnance Survey geocodes all UK postcodes and releases them as open data, with the exception of Northern Irish postcodes, which are under a more restrictive license. The one reuse requirement is to include an attribution statement.



Whichever dataset you use, the open data geocoding maps to a specific place: a destination you could point to on a map or with a GPS coordinate. Typically, the coordinates provided match a building located roughly centrally in the postcode; a single postcode can contain tens of buildings. If you need to accurately geocode a specific building relative to its neighbors, you cannot currently do this with open data alone.



For this how-to, I use Open Postcode Geo, which is optimized for geospatial applications and is available as both a CSV file and an API. As the example project, we are making an application to help people find the pub nearest to them. I use Open Postcode Geo to geocode a list of pubs, by which I mean that I find the position of each pub. I then create an application which builds on this geocoded dataset to allow a user to find the nearest pub to a given location. Because… really, do I need to explain why someone would want to find a pub?!

Dataset to Geocode: Start with Requirements

Perhaps it’s obvious, but the first element you need in building a geolocation-based application or feature is a list of locations and data about those locations. In order to add positions, your dataset needs a postcode for each record you want to geocode. Sometimes that’s information you collect yourself, or that your company holds as proprietary data. It is likely a collection of addresses, such as the addresses of every branch in your retail company, or all the contacts in your organization.



But open data gives us access to so much more -- with varying amounts of detail. Many open datasets provide addresses but no coordinates (for example, easting and northing or latitude and longitude). Here are some examples:

It is worth noting that the data in Open Postcode Geo is updated at least once a quarter, as new postcodes are issued and old ones are retired. That likely is true no matter which data source you use for location-related data, open-data or otherwise. Restaurants open and close all the time; banks open new branches; museums regularly add to their digital collections. Whatever location-based application you design, it always needs to encompass regular data updates.

For this how-to, I use a sample dataset that I produced specifically for the purpose. It contains every premises listed in the “Pub/bar/nightclub” category of the Food Standards Agency’s rating data, which is open data released under the UK’s Open Government Licence.



The how-to therefore solves a very frequent proximity use case: Where is the nearest pub?



To keep things simple, the sample dataset contains just three fields:

  • The pub’s name
  • Address
  • Postcode

You can download the sample dataset here. If you want more (and up-to-date) pub data for real world applications, look at the Open Pubs open dataset.

Choosing a coordinate system

The Open Postcode Geo dataset uses two different coordinate systems suitable for geocoding.



The first system is eastings and northings, a simple grid reference describing the distance in meters to the east and north from an origin point in the far southwest corner of the grid, the coordinates of the origin being (0,0).



The second is latitude and longitude, an explanation of which is beyond the scope of this how-to.



Latitude and longitude have the benefit of being able to locate a point anywhere in the world; in contrast, eastings and northings are limited to a position within a specified grid. As we are concerned only with the UK, eastings and northings suit our purposes, and they are much easier to work with mathematically. Hence, for this example, I focus on eastings and northings.
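
To illustrate why grid coordinates are easy to work with: because eastings and northings are plain offsets in meters on a flat grid, the straight-line distance between two points is just Pythagoras' theorem. Here is a minimal Python sketch. The function name grid_distance and the second point are made up for illustration; 529090, 179645 is the grid position used for Buckingham Palace in the queries later in this article.

```python
import math

def grid_distance(e1, n1, e2, n2):
    """Straight-line distance in meters between two easting/northing pairs."""
    return math.hypot(e2 - e1, n2 - n1)

# Hypothetical second point: 300 m east and 400 m north of the first.
print(grid_distance(529090, 179645, 529390, 180045))  # 500.0
```

The same calculation with latitude and longitude would need spherical trigonometry, which is the main reason to prefer the grid for a UK-only application.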

Choosing CSV or API

Open Postcode Geo is available in two formats: a CSV file which you can import into your database, and an API.



If you need to geocode your data on an ongoing basis (e.g. new records that need geocoding are added to your application each month), you should consider using the API, as it is always up to date. If you have a large dataset and speed or connectivity is an issue, then the CSV file allows you to quickly geocode a large number of records. With the CSV loaded into your database, you can use an SQL join to match the postcodes in your dataset to the postcodes in Open Postcode Geo, bringing in whatever coordinates you need at the same time.
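
If you want to see the idea behind the join before setting up a database, it has a simple in-memory equivalent: build a lookup keyed on postcode, then attach coordinates to each record. The following Python sketch is only an illustration under stated assumptions. The three-column sample text, the function names and the records are all made up; the real Open Postcode Geo file has more columns, and only the SW1A 1AA coordinates are taken from this article.

```python
import csv
import io

# Made-up stand-in for a few rows (postcode, easting, northing only).
# ZZ99 9ZZ is a placeholder with no coordinates, not a real record.
POSTCODES_CSV = """SW1A 1AA,529090,179645
ZZ99 9ZZ,,
"""

def load_coordinates(csv_text):
    """Build a postcode -> (easting, northing) lookup, skipping rows without coordinates."""
    lookup = {}
    for postcode, easting, northing in csv.reader(io.StringIO(csv_text)):
        if easting and northing:
            lookup[postcode] = (int(easting), int(northing))
    return lookup

def geocode(records, lookup):
    """Attach coordinates to each record; unmatched postcodes get None, like a LEFT JOIN."""
    return [dict(r, coords=lookup.get(r["postcode"])) for r in records]

lookup = load_coordinates(POSTCODES_CSV)
records = [
    {"name": "Example Arms", "postcode": "SW1A 1AA"},  # hypothetical record
    {"name": "Nowhere Inn", "postcode": "ZZ99 9ZZ"},   # no coordinates available
]
for row in geocode(records, lookup):
    print(row)
```

Keeping unmatched records with a None position, rather than dropping them, makes it easy to spot postcodes that are mistyped or missing from the lookup.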

Database setup

The first thing to do is set up the database, as we need some tables in place into which to load our datasets. Our examples use MySQL.



If you decided to use a local copy rather than the API, you need to load Open Postcode Geo into your database. It comes either as a CSV file or formatted for a MySQL import.



For this example, I import the CSV, as the instructions are more easily adapted to other databases.



First, create a new database. (You can skip this step if you are working with an existing database.) As our example dataset is a list of pubs, and we’re building a proximity search, call the database “pub_finder”:

mysql> create database pub_finder;
Query OK, 1 row affected (0.00 sec)

mysql> use pub_finder;
Database changed

Next, create a table into which to load the Open Postcode Geo data. Here is the SQL table create statement:

CREATE TABLE `open_postcode_geo` (
  `postcode` char(8) NOT NULL,
  `status` enum('live','terminated') NOT NULL,
  `usertype` enum('small','large') NOT NULL,
  `easting` mediumint(9) DEFAULT NULL,
  `northing` mediumint(9) DEFAULT NULL,
  `positional_quality_indicator` tinyint(3) unsigned NOT NULL,
  `country` enum('England','Wales','Scotland','Northern Ireland','Channel Islands','Isle of Man') NOT NULL,
  `latitude` decimal(9,6) DEFAULT NULL,
  `longitude` decimal(9,6) DEFAULT NULL,
  `postcode_no_space` char(7) NOT NULL,
  `postcode_fixed_width_seven` char(7) NOT NULL,
  `postcode_fixed_width_eight` char(8) NOT NULL,
  `postcode_area` char(2) DEFAULT NULL,
  `postcode_district` char(4) DEFAULT NULL,
  `postcode_sector` char(6) DEFAULT NULL,
  `outcode` char(4) NOT NULL,
  `incode` char(3) NOT NULL,
  UNIQUE KEY `postcode` (`postcode`),
  UNIQUE KEY `postcode_no_space` (`postcode_no_space`),
  UNIQUE KEY `postcode_fixed_width_seven` (`postcode_fixed_width_seven`),
  UNIQUE KEY `postcode_fixed_width_eight` (`postcode_fixed_width_eight`)
);

And here is what you should see at the prompt:

mysql> CREATE TABLE `open_postcode_geo` (
    -> `postcode` char(8) NOT NULL,
    -> `status` enum('live','terminated') NOT NULL,
    -> `usertype` enum('small','large') NOT NULL,
    -> `easting` mediumint(9) DEFAULT NULL,
    -> `northing` mediumint(9) DEFAULT NULL,
    -> `positional_quality_indicator` tinyint(3) unsigned NOT NULL,
    -> `country` enum('England','Wales','Scotland','Northern Ireland','Channel Islands','Isle of Man') NOT NULL,
    -> `latitude` decimal(9,6) DEFAULT NULL,
    -> `longitude` decimal(9,6) DEFAULT NULL,
    -> `postcode_no_space` char(7) NOT NULL,
    -> `postcode_fixed_width_seven` char(7) NOT NULL,
    -> `postcode_fixed_width_eight` char(8) NOT NULL,
    -> `postcode_area` char(2) DEFAULT NULL,
    -> `postcode_district` char(4) DEFAULT NULL,
    -> `postcode_sector` char(6) DEFAULT NULL,
    -> `outcode` char(4) NOT NULL,
    -> `incode` char(3) NOT NULL,
    -> UNIQUE KEY `postcode` (`postcode`),
    -> UNIQUE KEY `postcode_no_space` (`postcode_no_space`),
    -> UNIQUE KEY `postcode_fixed_width_seven` (`postcode_fixed_width_seven`),
    -> UNIQUE KEY `postcode_fixed_width_eight` (`postcode_fixed_width_eight`)
    -> );
Query OK, 0 rows affected (0.29 sec)



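Note: This table has a unique index on each of the four full-postcode fields (postcode, postcode_no_space, postcode_fixed_width_seven and postcode_fixed_width_eight). If you know your postcodes all have the same format, you might decide you do not need to index all of these fields (or even to retain them).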
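
The reason the format matters is that a lookup only matches when both sides write the postcode the same way. As a rough illustration, this Python sketch normalises free-text input into the single-space and no-space forms. The function name normalise_postcode is hypothetical, the sketch does not attempt the two fixed-width variants, and it relies only on the fact (visible in the table definition) that the incode is always the last three characters.

```python
def normalise_postcode(raw):
    """Return (with_space, no_space) forms of a UK postcode string.

    Assumes the input is a complete postcode: the incode is the last
    three characters and everything before it is the outcode.
    """
    compact = "".join(raw.split()).upper()
    if len(compact) < 5 or len(compact) > 7:
        raise ValueError(f"unexpected postcode length: {raw!r}")
    outcode, incode = compact[:-3], compact[-3:]
    return f"{outcode} {incode}", compact

print(normalise_postcode("sw1a1aa"))      # ('SW1A 1AA', 'SW1A1AA')
print(normalise_postcode(" SW1A  1AA "))  # ('SW1A 1AA', 'SW1A1AA')
```

This only tidies case and spacing; it does not check that the postcode actually exists, which is what the lookup against Open Postcode Geo is for.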




The final step is to import the CSV. You could download the CSV to your computer and then upload it to your server. Or, if you have the means, you can download it directly to your server. For this example, I use wget:

wget https://www.getthedata.com/downloads/open_postcode_geo.csv.zip

Unzip the file:

unzip open_postcode_geo.csv.zip

Which should yield three files:

  • open_postcode_geo.csv
  • readme.txt
  • licence.txt

Now import open_postcode_geo.csv into the open_postcode_geo table you just created in the pub_finder database:

mysql>  load data infile '/path/to/open_postcode_geo.csv' into table open_postcode_geo fields terminated by ',' lines terminated by '\n';
Query OK, 2525576 rows affected (2 min 6.10 sec)

Set /path/to/open_postcode_geo.csv to the location where you unzipped open_postcode_geo.csv.



Be patient. You can see the import took over two minutes on my server. If you need to speed this up, you can create the table without the indexes and add them after the import.



In a follow-on article in this series, I explain proximity queries in more detail. But if you can’t wait, you can now issue your first proximity query against the Open Postcode Geo data:

mysql> select postcode,
    ->   sqrt(pow(abs(529090 - easting),2) + pow(abs(179645 - northing),2)) as distance
    -> from open_postcode_geo
    -> where easting is not null and northing is not null
    -> order by distance
    -> limit 10;

+----------+--------------------+
| postcode | distance           |
+----------+--------------------+
| SW1A 1AA |                  0 |
| SW1E 6LA | 123.76186811776881 |
| SW1E 6JP | 134.72935834479432 |
| SW1E 6JX | 134.72935834479432 |
| SW1E 6JY | 134.72935834479432 |
| SW1E 6LE | 145.34441853748632 |
| SW1E 6LF | 145.34441853748632 |
| SW1E 6NS | 145.34441853748632 |
| SW1E 6JR | 152.00328943809077 |
| SW1E 6WG | 152.34500319997372 |
+----------+--------------------+
10 rows in set (2.51 sec)

This query returns the 10 nearest postcodes to Buckingham Palace, with distances in metres. The response time is rather slow; later, I show ways to speed things up.
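
To make the logic of the query explicit, here is the same calculation as a short Python sketch: compute the distance from the target to every row, skip rows without coordinates, and keep the smallest few. The function name nearest and all sample rows except SW1A 1AA are made up for illustration.

```python
import heapq
import math

def nearest(points, easting, northing, k=10):
    """Return the k (distance, label) pairs closest to the given grid position.

    `points` is an iterable of (label, easting, northing); rows with a missing
    coordinate are skipped, mirroring the IS NOT NULL filters in the SQL.
    """
    distances = (
        (math.hypot(easting - e, northing - n), label)
        for label, e, n in points
        if e is not None and n is not None
    )
    return heapq.nsmallest(k, distances)

# Made-up sample rows; only SW1A 1AA's coordinates come from the article.
sample = [
    ("SW1A 1AA", 529090, 179645),
    ("POINT B", 529390, 180045),  # 500 m away
    ("POINT C", 529090, 179745),  # 100 m away
    ("NO COORDS", None, None),
]
for distance, label in nearest(sample, 529090, 179645, k=2):
    print(label, distance)
```

Like the SQL above, this visits every row before it can pick the closest ones, which is why the query takes seconds on a table with millions of postcodes.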

Loading the sample pubs dataset

If you are working with your own dataset, you can skip this step.



Start by creating a suitable table in MySQL. Here is an SQL table create statement:

CREATE TABLE `open_pubs` (
  `name` varchar(128) DEFAULT NULL,
  `address` varchar(256) DEFAULT NULL,
  `postcode` char(8) DEFAULT NULL,
  KEY `postcode` (`postcode`)
);

Here is what you should see at the prompt:

mysql> CREATE TABLE `open_pubs` (
    ->   `name` varchar(128) DEFAULT NULL,
    ->   `address` varchar(256) DEFAULT NULL,
    ->   `postcode` char(8) DEFAULT NULL,
    ->   KEY `postcode` (`postcode`)
    -> );
Query OK, 0 rows affected (0.28 sec)

Now download the sample dataset, which you can find here.

wget https://www.getthedata.com/downloads/open_pubs_2016-09.csv.zip

Again, if you need up-to-date pub data, with additional fields, use the Open Pubs dataset.



Unzip the pubs file:

unzip open_pubs_2016-09.csv.zip

Return to the MySQL prompt and load the data into the open_pubs table you created:

mysql> load data infile '/path/to/open_pubs_2016-09.csv' into table open_pubs fields terminated by ',' enclosed by '"' lines terminated by '\n';
Query OK, 54368 rows affected (1.97 sec)

Set /path/to/open_pubs_2016-09.csv to the location where you unzipped the sample data.
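
Note that this load statement adds enclosed by '"', which the postcode import did not need. The likely reason is that names and addresses can contain commas, so those fields have to be quoted. The Python sketch below shows the difference on a single made-up row in the same name, address, postcode shape; it is not taken from the sample file.

```python
import csv
import io

# Made-up row in the same shape as the sample file: name, address, postcode.
line = '"Example Arms","1 Example Street, London",SW1A 1AA\n'

naive = line.strip().split(",")                # ignores the quotes
parsed = next(csv.reader(io.StringIO(line)))   # honours the quotes

print(len(naive))  # 4: the comma inside the address splits the field
print(parsed)      # ['Example Arms', '1 Example Street, London', 'SW1A 1AA']
```

If an import puts part of an address into the postcode column, a missing enclosed by clause is the first thing to check.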



If you also loaded the Open Postcode Geo CSV file into the open_postcode_geo table, you now have the basic data required to find the nearest pub to a given point:

mysql> select open_pubs.name,
    ->   sqrt(pow(abs(529090 - open_postcode_geo.easting),2) + pow(abs(179645 - open_postcode_geo.northing),2)) as distance
    -> from open_pubs
    -> join open_postcode_geo on open_pubs.postcode = open_postcode_geo.postcode
    -> where open_postcode_geo.easting is not null and open_postcode_geo.northing is not null
    -> order by distance
    -> limit 10;

+------------------------------+--------------------+
| name                         | distance           |
+------------------------------+--------------------+
| The Phoenix Public House     |  256.7586415293554 |
| Bag O Nails                  | 284.76305940202286 |
| Cask And Glass Public House  |  285.9545418418809 |
| Colonies                     |  304.1792234851026 |
| Buckingham Arms Public House | 385.73695700567765 |
| Stage Door Public House      |   443.355387922601 |
| Adam & Eve                   |  448.6423965699185 |
| Victoria Palace Theatre      |  449.0634699015274 |
| Tiles Wine Bar               |  461.1084471141252 |
| Kings Arms Public House      |  491.1710496354605 |
+------------------------------+--------------------+
10 rows in set (1.16 sec)

So far, we have gotten the basics in place:

  • A local copy of Open Postcode Geo, against which you can look up coordinates from postcodes.
  • A local copy of the Open Pubs dataset, which includes postcodes but no coordinates.

With the tables created, it’s time to add geolocation data to the application. That’s what I cover in the next article in the series, Learning to Geocode Data for Location-Friendly Applications.



This is part 1 of our series How to Add Postcode-Based Proximity Search With Open Data. In part 2, we’ll explain three ways to geocode your data using Open Postcode Geo.
