This is the fourth part of our series How to Add Postcode-Based Proximity Search With Open Data. In part 3, we showed how to implement the common application of nearest locations to a postcode provided by a user.
So far in this series we have geocoded a dataset (that is to say, found the position of each record in the dataset), geocoded user input, and looked at the maths which allows us to find the distance between two sets of coordinates: Pythagorean Theorem.
In this article, I show you how to take a geocoded dataset and geocoded user input (a postcode), and order the dataset by proximity to the postcode. Our example application uses a dataset of pubs, hence the purpose of the application is to find the nearest pub to a location specified by the user.
Ready to put all the pieces to work? Let’s build the search function to include in the geolocation application.
Taking our example pub finder application, let’s imagine the user is planning a night out in central London. She wants to find the nearest pubs to Piccadilly Circus, which happens to have a postcode of W1J 9HP. Using the Open Postcode Geo API we can retrieve the easting and northing for that postcode:
http://api.getthedata.com/postcode/SW1Y+4QF
{
"status": "match",
"match_type": "unit_postcode",
"input": "SW1Y 4QF",
"data": {
"postcode": "SW1Y 4QF",
"status": "live",
"usertype": "large",
"easting": 529577,
"northing": 180605,
"positional_quality_indicator": 1,
"country": "England",
"latitude": "51.509525",
"longitude": "-0.134223",
"postcode_no_space": "SW1Y4QF",
"postcode_fixed_width_seven": "SW1Y4QF",
"postcode_fixed_width_eight": "SW1Y 4QF",
"postcode_area": "SW",
"postcode_district": "SW1Y",
"postcode_sector": "SW1Y 4",
"outcode": "SW1Y",
"incode": "4QF" },
"copyright": [
"Contains OS data (c) Crown copyright and database right 2016",
"Contains Royal Mail data (c) Royal Mail copyright and database right 2016",
"Contains National Statistics data (c) Crown copyright and database right 2016" ]
}
So we now have the coordinates of our search centre:
Easting=529577
Northing=180605
In the last article, we found earlier that, using the Pythagorean Theorem, we can calculate the distance between two sets of coordinates in SQL thusly:
sqrt(pow(abs([easting_a] - [easting_b]),2) + pow(abs([northing_a] - [northing_b]),2))
Where [easting_a] and [northing_a] are one set of coordinates, and [easting_b] and [northing_b] are the other set.
To build a proximity search, we need to find the distance to each record in our dataset from the search centre, and then order the results by that distance. We can do that by including the SQL code above in a more general SQL select statement:
select sqrt(pow(abs([easting_a] - [easting_b]),2) + pow(abs([northing_a] - [northing_b]),2)) as distance from [table] order by distance
Where [table] is a table of geocoded data.
Let’s fill in values for our Piccadilly Circus search, using our example open_pubs dataset. We assume that we already geocoded this dataset with eastings and northings for each record:
select sqrt(pow(abs(529577 - easting),2) + pow(abs(180605 - northing),2)) as distance from open_pubs order by distance
Here we have set [easting_a] and [northing_a] to the coordinates of our Piccadilly Circus search centre. We replace [easting_b] and [northing_b] with the names of the easting and northing fields in our open_pubs table.
If you run this query, you should retrieve all the records in the table, with distance expressed either as null (for records which have not been geocoded) or a decimal number equal to the distance in metres.
This query makes some changes to produce a more useful set of results:
select name, round(sqrt(pow(abs(529577 - easting),2) + pow(abs(180605 - northing),2))) as distance from open_pubs where easting is not null and northing is not null order by distance limit 10
With this change, we now select name as well as distance; we round distance to the nearest metre using MySQL’s round() function; we exclude any records which have either easting=null or northing=null (which would be pointless data to share with the user, because they do not lead her to a pub); and we limit the results to 10 pubs for user experience reasons.
The result should look something like the below:
mysql> select name, round(sqrt(pow(abs(529577 - easting),2) + pow(abs(180605 - northing),2))) as distance from open_pubs where easting is not null and northing is not null order by distance limit 10;
+-------------------------------+----------+
| name | distance |
+-------------------------------+----------+
| Captains Cabin Public House | 99 |
| Three Crowns Public House | 101 |
| Apollo Theatre | 103 |
| Jamies Italian | 148 |
| St James' Tavern Public House | 156 |
| Tom Cribb Public House | 160 |
| The Queen's Head | 168 |
| Prince Of Wales Theatre | 176 |
| Lyric Public House | 195 |
| Waxy O'Connors | 205 |
+-------------------------------+----------+
10 rows in set (0.08 sec)
If you didn’t geocode your dataset, but you did load the Open Postcode Geo dataset, you can achieve the same results with a SQL join:
mysql> select open_pubs.name, round(sqrt(pow(abs(529577 - open_postcode_geo.easting),2) + pow(abs(180605 - open_postcode_geo.northing),2))) as distance from open_pubs join open_postcode_geo on open_pubs.postcode = open_postcode_geo.postcode where open_postcode_geo.easting is not null and open_postcode_geo.northing is not null order by distance limit 10;
+-------------------------------+----------+
| name | distance |
+-------------------------------+----------+
| Captains Cabin Public House | 99 |
| Three Crowns Public House | 101 |
| Apollo Theatre | 103 |
| Jamies Italian | 148 |
| St James' Tavern Public House | 156 |
| Tom Cribb Public House | 160 |
| The Queen's Head | 168 |
| Prince Of Wales Theatre | 176 |
| Lyric Public House | 195 |
| Waxy O'Connors | 205 |
+-------------------------------+----------+
10 rows in set (15.77 sec)
The join is faster if the common key (in this case the postcode field) is indexed in both tables.
Optimizing performance
This may be all you need, but in some applications there is a lot of data to sort through. If you need to improve the speed of the proximity search query, there are two optimisations you can make: database indexes, and adding a bounding box to your query.
Applying database indexes
If all your data is in one table, including coordinates, you can use a covering index to speed up the query. This indexes both the fields in your where clause, and the fields you select, enabling the entire query to be carried out on the index.
Continued from page 1.
In our example, we are just selecting the name of the pub, so our covering index would look like this:
alter table open_pubs add index(easting, northing, name)
This creates an index containing easting and northing (used in the where clause and the select) and name (used in the select).
If you are joining your dataset to a table containing the Open Postcode Geo dataset, then the indexes are a bit more complicated, but they have a big impact on performance.
First let’s look at how fast this query runs with no indexes on either table:
mysql> select open_pubs.name, round(sqrt(pow(abs(529577 - open_postcode_geo.easting),2) + pow(abs(180605 - open_postcode_geo.northing),2))) as distance from open_pubs join open_postcode_geo on open_pubs.postcode = open_postcode_geo.postcode where open_postcode_geo.easting is not null and open_postcode_geo.northing is not null order by distance limit 10;
+-------------------------------+----------+
| name | distance |
+-------------------------------+----------+
| Captains Cabin Public House | 99 |
| Three Crowns Public House | 101 |
| Apollo Theatre | 103 |
| Jamies Italian | 148 |
| St James' Tavern Public House | 156 |
| Tom Cribb Public House | 160 |
| The Queen's Head | 168 |
| Prince Of Wales Theatre | 176 |
| Lyric Public House | 195 |
| Waxy O'Connors | 205 |
+-------------------------------+----------+
10 rows in set (3 hours 5 min 17.66 sec)
Wow: Over three hours!
To judge the effect of adding indexes, we add two covering indexes, one to each table. The covering indexes include the fields in the join, the fields in the where clause, and the fields we want to select. For table open_postcode_geo we join on postcode; we have easting and northing in the where clause; and we also want to select easting and northing. So our index includes these three fields:
mysql> alter table open_postcode_geo add index(postcode, easting, northing);
Query OK, 0 rows affected (37.34 sec)
For table open_pubs, we are joining on postcode; we have no fields in the where clause; and we want to select name. We also include address as we’re going to use that in a later example:
mysql> alter table open_pubs add index(postcode, name, address);
Query OK, 0 rows affected (1.36 sec)
Now let’s try the original proximity query again:
mysql> select open_pubs.name, round(sqrt(pow(abs(529577 - open_postcode_geo.easting),2) + pow(abs(180605 - open_postcode_geo.northing),2))) as distance from open_pubs join open_postcode_geo on open_pubs.postcode = open_postcode_geo.postcode where open_postcode_geo.easting is not null and open_postcode_geo.northing is not null order by distance limit 10;
+-------------------------------+----------+
| name | distance |
+-------------------------------+----------+
| Captains Cabin Public House | 99 |
| Three Crowns Public House | 101 |
| Apollo Theatre | 103 |
| Jamies Italian | 148 |
| St James' Tavern Public House | 156 |
| Tom Cribb Public House | 160 |
| The Queen's Head | 168 |
| Prince Of Wales Theatre | 176 |
| Lyric Public House | 195 |
| Waxy O'Connors | 205 |
+-------------------------------+----------+
10 rows in set (0.47 sec)
From over three hours to under a second? That is a pretty decent performance improvement!
Bounding box
If you need more performance you can improve query time further by adding a bounding box. This additional optimisation involves limiting the records in the query to those which fall into a specific geographical area, defined by a bounding box. It is most effective where you have a large dataset, usually measured in millions of addresses. You should always try adding indexes first, and then add the bounding box if you need more performance.
Let’s demonstrate the use of bounding boxes using our example coordinates for Piccadilly Circus:
Easting=529577
Northing=180605
Our existing SQL query selects every pub in the country and orders them by proximity to Piccadilly Circus. In reality, we can see from the results that we only need to look at pubs which are within 205m of Piccadilly Circus; otherwise they are unsuitable for a Saturday night pub crawl (if for no other reason than the fact that the user would not arrive at the most distant until Sunday, which would be a shame). There’s no point in retrieving data that we know we will throw away from the results. Clearly the density of pubs varies throughout the country, but for our application let’s assume that we are only interested in pubs within 1km (or 1,000m).
We can do this by defining a bounding box centered on the search coordinates, where the centre of each side of the box is 1km from the centre point described by the search coordinates. The bounding box is therefore 2km square.
In SQL we can do this by adding between statements to the where clause to limit the possible values of easting and northing, where easting should be between (529577 - 1000) and (529577 + 1000), and northing should be between (529577 - 1000) and (529577 + 1000):
where easting between 529577 - 1000 and 529577 + 1000 and northing between 180605 - 1000 and 180605 + 1000
Let’s add that clause to our original query and see what happens to performance:
mysql> select open_pubs.name, round(sqrt(pow(abs(529577 - open_postcode_geo.easting),2) + pow(abs(180605 - open_postcode_geo.northing),2))) as distance from open_pubs join open_postcode_geo on open_pubs.postcode = open_postcode_geo.postcode where open_postcode_geo.easting is not null and open_postcode_geo.northing is not null and open_postcode_geo.easting between 529577 - 1000 and 529577 + 1000 and open_postcode_geo.northing between 180605 - 1000 and 180605 + 1000 order by distance limit 10;
+-------------------------------+----------+
| name | distance |
+-------------------------------+----------+
| Captains Cabin Public House | 99 |
| Three Crowns Public House | 101 |
| Apollo Theatre | 103 |
| Jamies Italian | 148 |
| St James' Tavern Public House | 156 |
| Tom Cribb Public House | 160 |
| The Queen's Head | 168 |
| Prince Of Wales Theatre | 176 |
| Lyric Public House | 195 |
| Waxy O'Connors | 205 |
+-------------------------------+----------+
10 rows in set (0.31 sec)
As you can see, performance improves significantly. The performance gain becomes increasingly more significant as your dataset get larger.
Do keep in mind that the user may need a pub more than 1km away. Not every user is in downtown London, after all; she may be in a rural area where the nearest 10 pubs are quite some distance. Factor that requirement into the bounding boxes you implement.
Generally, following these steps gives the best performance possible:
- Geocode your data in its own table. This avoids the overhead of a join.
- Add a covering index.
- Use a bounding box.
- Use the smallest bounding box possible. This might involve varying the size of the bounding box depending on the search centre.
In this article, we ordered a list of geocoded records in a dataset by their proximity to a given position. We looked at a how to optimise the performance of the query using indexes and a bounding box. In the next article, I show how to create a functional application in PHP and MySQL, demonstrating how to find the nearest pub to a postcode supplied by the user.
This is part 4 of our series How to Add Postcode-Based Proximity Search With Open Data. In part 5, we bring together the pieces and build an application that demonstrates how they work together.