
Large Scale, Server-Side Mapping in Drupal with the Leaflet-Geocluster Stack: Part 1

Eric Paul, Senior Developer
#Drupal

On a recent project here at Phase2, we were tasked with creating a responsive, data-scalable, dense-point map in Drupal 7. The sticking point for this application was that the starting data scale of about 1.8k data points needed the ability to scale up to around 10⁴ data points. We wanted the page load time to remain under one second on the Drupal 7 application stack.

Our initial knee-jerk reaction was to question whether Drupal was the right tool for the job. We began researching ways to implement this completely in Drupal, and we found one: we ended up implementing the Leaflet-Geocluster stack.

In this two-part blog series, we’ll take a look at server-side mapping, examine the performance bottlenecks (and what we can do to mitigate them), and then take a closer look at our implementation, our pain points, and a few key application-specific customizations. The hope is that you’ll leave with enough detail and reference material to implement your own large-scale mapping application.

The end result is currently in production on the Volunteers in Service to America site. If you’d like to follow along, the code has been released under the GPL and is available on GitHub.

The Problem

HOW DO WE USE DRUPAL AND CURRENT MAPPING TECHNOLOGY TO PROVIDE A RESPONSIVE MAP APPLICATION THAT CAN SCALE UP TO 10,000 DATA POINTS AND IS USABLE ON ANY DEVICE?

The solution to this problem was twofold: first, use server-side processing to produce clusters, then amortize delivery of user-requested data over multiple workflows.

WHY IS THIS A DIFFICULT PROBLEM?

In Drupal, entities that we want to map will usually store an address (with Address Field or Location) from which geodata (latitude-longitude coordinates) is encoded into a separate field (usually Geofield). Then we either map one entity via a field formatter within an entity view mode, or we use Views to query for a list of geocoded features that act as a data feed for a map.
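To make this concrete, here’s a minimal sketch (not code from this project) of reading the stored coordinates off a geocoded node in Drupal 7. The field name field_geofield is hypothetical; the lat/lon item keys follow Geofield’s 7.x storage format:

```php
<?php
// Hypothetical example: a node carries an address field plus a Geofield
// named 'field_geofield' that holds the encoded latitude/longitude.
$node = node_load(123);
$items = field_get_items('node', $node, 'field_geofield');
if ($items) {
  // Geofield stores parsed coordinate columns alongside the raw geometry.
  $lat = $items[0]['lat'];
  $lon = $items[0]['lon'];
}
```

A Views-based map feed does essentially this for every row in its result, which is worth keeping in mind as the row count grows.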

Current high-density point strategies are mostly limited to client-side clustering, which produces a rendering similar to the following:

map of high-density points

Here, we have a usable interface where we can click to zoom in on point clusters to get at the area of interest. As we zoom in, clusters regroup themselves and clickable points start appearing. This is a very familiar interaction, and implementing it with client-side clustering works pretty well when we are mapping on the order of 10² points.

Once we get to point counts on the order of thousands, tens of thousands, and upwards, client-side clustering is no longer effective: the page must render out all of the point data before the mapping library even has a chance to transform it into a visualization, and page load time suffers accordingly.

One of the reasons this breaks down is that, in the modern world of varied, unpredictable devices consuming our content, we don’t have enough information to reason about how much client-side processing is appropriate. When we get to the scale of thousands of points per display, we need to assume the lowest common denominator and offload as much processing to the server as we can in order to minimize the client-side computational load.

To see why, let’s take a brief look under the hood.

diagram of Geofield, Views, and Leaflet

In this diagram, we are using Geofield, Views, and Leaflet to produce a map with client-side clustering. The server side is on the left, and the client (browser) side is on the right. Geofield stores the geodata, and a Views query produces either a single point or an array of points. In either case, PHP is rendering the point data one row at a time, and the client-side clustering happens after this delivery.

The reason this breaks down at larger scale is fairly logical: geocoded data is encoded in text-based formats like WKT or GeoJSON that must be parsed and processed before the map can render. Obviously, the larger the dataset, the longer this receive-decode cycle takes. Further, if point data is delivered via PHP during page load (as opposed to asynchronously with AJAX), the page cannot start rendering until all of the point data has loaded.
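As a rough, illustrative comparison (the coordinates are arbitrary), here is a single point in each of the two text encodings mentioned above. At around a hundred bytes or more per feature, multiplied by 10⁴ features plus hidden popup markup, the payload the client must download and parse quickly reaches megabytes:

```php
<?php
// One point, two text encodings. The client must download and parse every
// feature like this before the map can draw anything.
$wkt = 'POINT (10.40744 57.64911)';

$geojson = json_encode(array(
  'type' => 'Feature',
  'geometry' => array(
    'type' => 'Point',
    'coordinates' => array(10.40744, 57.64911), // GeoJSON order is lon, lat.
  ),
  'properties' => array('title' => 'Example feature'),
));
```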

Speaking in terms of sequence at any scale, the load process looks like this:

  1. Views (PHP) renders each data point as a row of output, one at a time, at page-load time.

  2. Views (PHP) renders the popup info (hidden) at page-load time.

  3. The mapping library (JS) parses the location data.

  4. The mapping library (JS) clusters the points.

  5. The mapping library (JS) renders the map.

In this single cycle of receive-decode-render, PHP delivers the raw data, and JavaScript transforms it into the visualization. At large scale, the client side shoulders the majority of the computation, and page loads become highly dependent on the efficiency of the client device. Lightly resourced devices, or even older PCs, will suffer unusable page-load times.

In order to improve performance at large scale, we want to perform clustering on the server side, so that the client-side stack sees any given cluster as a single feature. We have an idea of what the server can handle, and by offloading the more complex computations to a more predictable environment, we can normalize performance across devices.

In layman’s terms, we’re simply reducing the number of “things” the client-side browser sees on the map, i.e., several clusters vs. thousands of points.

The next question is how to implement clustering on the server side. It turns out that we could borrow from a recently-developed web service: geohash.org.

GEOHASHING

Geohashing is a fairly recent development in geolocation. The geohash algorithm was developed in 2008 by Gustavo Niemeyer while creating the Geohash web service. The service was initially developed to identify geographic points with a unique hash for use in URIs. For example, the link http://geohash.org/u4pruydqqvj uniquely identifies a location in the northern tip of Denmark.

The Geohash service simply turns latitude-longitude pairs into a hash code, appropriately named a “geohash.” Any given point can be geohashed to an arbitrary level of precision, represented by the hash length: the shorter the hash, the less precise it is, and vice versa. The Wikipedia overview of geohashing offers a good example of how the hashes are produced. For our purposes here, the important idea is that geohashing a group of points creates a “spatial index” (an abstract search index) from which it is computationally cheap to infer the relative proximity of points.
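To make the idea concrete, here is a from-scratch PHP sketch of Niemeyer’s encoding algorithm (illustrative only; this is not the Geocluster module’s code). Longitude and latitude are repeatedly bisected, the resulting bits are interleaved, and each group of five bits is mapped onto a base32 alphabet:

```php
<?php
// Encode a latitude-longitude pair as a geohash of the given length.
function geohash_encode($lat, $lon, $precision = 11) {
  $base32 = '0123456789bcdefghjkmnpqrstuvwxyz';
  $lat_range = array(-90.0, 90.0);
  $lon_range = array(-180.0, 180.0);
  $hash = '';
  $bits = 0;
  $char = 0;
  $even = TRUE; // Even bit positions encode longitude, odd ones latitude.
  while (strlen($hash) < $precision) {
    if ($even) {
      $mid = ($lon_range[0] + $lon_range[1]) / 2;
      if ($lon >= $mid) {
        $char = ($char << 1) | 1; // Point is in the upper half: emit a 1 bit.
        $lon_range[0] = $mid;
      }
      else {
        $char = $char << 1;       // Lower half: emit a 0 bit.
        $lon_range[1] = $mid;
      }
    }
    else {
      $mid = ($lat_range[0] + $lat_range[1]) / 2;
      if ($lat >= $mid) {
        $char = ($char << 1) | 1;
        $lat_range[0] = $mid;
      }
      else {
        $char = $char << 1;
        $lat_range[1] = $mid;
      }
    }
    $even = !$even;
    if (++$bits == 5) {
      // Every 5 bits become one base32 character.
      $hash .= $base32[$char];
      $bits = 0;
      $char = 0;
    }
  }
  return $hash;
}

// Niemeyer's own example: the hash in http://geohash.org/u4pruydqqvj.
echo geohash_encode(57.64911, 10.40744); // Prints "u4pruydqqvj".
```

Because the hash is built by successive bisection, truncating it simply widens the cell: any point whose hash starts with u4pru lies near the Denmark location above. That prefix property is what makes grouping points by hash prefix a cheap way to infer proximity.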

The Geocluster Module

The Geocluster module provides a Drupal implementation of the geohash algorithm that integrates with Geofield, Views GeoJSON, and Leaflet to provide server-side clustered GeoJSON map feeds. (OpenLayers could likely be swapped in; nothing has been documented toward that end, but it’s just a GeoJSON feed that needs to be consumed.)

The module is under active development and offers many opportunities for optimization. The project was originally developed by Josef Dabernig (dasjo) as a proof-of-concept for a Master’s thesis on large scale mapping in Drupal. For those interested in performance optimization, the thesis is worth a read.

In a nutshell, three server-side clustering strategies were compared against client-side clustering as a baseline, with a target of a one-second page load.

Drupal mapping query and display modules with Leaflet

Here we see that as we move from 100 points to 1,000, client-side clustering becomes a lost cause. Even clustering in PHP after the database query (known as post-query clustering) is not much help. We only start to see usable performance gains once we move to query-level clustering with MySQL or Apache Solr.

We ended up implementing MySQL clustering and were able to achieve sub-one-second page loads. At the time this application was developed, Solr clustering was still under development, and we can’t say for certain whether it scales well beyond 100,000 points.
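For a feel of what query-level clustering looks like, here is a hedged sketch in the spirit of Geocluster’s MySQL strategy, not the module’s actual query builder. It assumes Drupal 7’s field-storage naming for a hypothetical field_geofield (Geofield’s 7.x schema does include a geohash column, which Geocluster relies on), and the zoom-to-prefix-length mapping is a made-up heuristic:

```php
<?php
// Illustrative only: cluster points in MySQL by grouping on a geohash prefix.
// A shorter prefix means coarser clusters, so the prefix length is derived
// from the current zoom level.
$zoom = 5;
$prefix_length = max(1, min(12, intval($zoom / 2) + 1));

$result = db_query("
  SELECT
    SUBSTRING(field_geofield_geohash, 1, :len) AS cluster_hash,
    COUNT(*)                                   AS cluster_count,
    AVG(field_geofield_lat)                    AS center_lat,
    AVG(field_geofield_lon)                    AS center_lon
  FROM {field_data_field_geofield}
  GROUP BY cluster_hash
", array(':len' => $prefix_length));

foreach ($result as $row) {
  // Each row is now a single feature for the GeoJSON feed: one cluster with
  // a count and an averaged center point, instead of N individual markers.
}
```

Each returned row becomes a single feature in the GeoJSON feed, so the client receives tens of features per viewport instead of thousands of markers.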

Again, empirically, we know that client-side clustering starts to break down beyond a few to several hundred features, which lines up with the performance benchmarking in the plot. That makes a convenient threshold for switching from client-side clustering to server-side, query-level clustering. There is some grey area between several hundred points and 1,000, so test your use case to determine what works best for you.

Onward!

In our next post, we’ll lay out the recipe for the Leaflet-Geocluster stack and take a look at how we implemented it. We’ll cover our pain points, what we did about them, and some application-specific customizations.

Eric Paul

Senior Developer