Part 2: An Implementation
In our previous post, we suggested a real-world scenario for server-side mapping and discussed the performance challenges of large-scale mapping (10,000 points and up). We briefly looked at how the Geocluster module can help and what the expected performance characteristics could be. In this post, we’ll cover the recipe we used for the client build, walk through some pain points, and take a look at a few key application-specific customizations.
The Leaflet-Geocluster Stack: A Recipe
The main element of the stack is, obviously, Geocluster itself, but we leverage several other common and stable contrib modules in our Drupal 7 implementation. Beyond Drupal core, we use the following libraries and modules:
Geocoder (7.x-1.x): Creating geospatial coordinates from address fields.
Geofield (7.x-2.1): Storing the geospatial information.
Geocluster (7.x-1.x): The server-side clustering implementation.
Leaflet (7.x-1.1): Leaflet library integration for Drupal.
Leaflet GeoJSON (7.x-2.x): Parsing GeoJSON feeds for Leaflet map consumption and display rendering with Panels or BEAN (we used Panels).
Views (7.x-3.8): Providing map data as content listings.
Views GeoJSON (7.x-1.x): Providing views listings of geospatial data as GeoJSON feeds.
Adding Geocluster and Leaflet GeoJSON into the stack alters the data flow diagram from the previous post slightly.
In reference to the first block diagram, we add the yellow boxes. Here the original array of geodata points is pre-clustered by Geocluster and fed into Views GeoJSON. As a result, our GeoJSON collection consists of just clusters and sparse points.
This GeoJSON feed is then passed through the bounding filter, and Leaflet produces the Geocluster-driven visualization. We then let Leaflet handle the interaction with the user. In essence, all we’ve done is find an efficient way to cluster data before Leaflet sees it; the dataset is much smaller and simpler for Leaflet to process. Again, this is where the primary performance gain is realized: offloading clustering to the server.
Once the user is presented with the map, clicking on a cluster zooms in on the map by one step, and the query is re-run against the database, passed through the filters, and all points are re-clustered without a page reload via an AJAX callback. Clicking around on the production map demonstrates this user experience.
It should be pointed out that there is no point data inside any clusters or points. A cluster merely contains a count indicating the number of points in the cluster, while a single point is a rudimentary map feature containing only the UID of the associated user (any entity ID could have been used, but in our case, we were mapping users). This is another performance strategy in that we are loading much less data per feature; the trade-off is that we’ll need to build the popup data in a later step (more on this in a bit). This ended up being very effective in getting the initial view of the map to render quickly.
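To make the size difference concrete, here is a rough sketch of what such a minimal feed could look like. It is written in Python purely for illustration; the property names `cluster_count` and `uid` and the coordinates are our assumptions for this example, not necessarily what Views GeoJSON or Geocluster actually emit.

```python
import json


def cluster_feature(lon, lat, count):
    """A cluster carries only a location and a point count."""
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"cluster_count": count},  # hypothetical property name
    }


def point_feature(lon, lat, uid):
    """A single point carries only the ID needed to build its popup later."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"uid": uid},  # hypothetical property name
    }


def feature_collection(features):
    return {"type": "FeatureCollection", "features": list(features)}


# Made-up sample data: one cluster and one sparse point.
feed = feature_collection([
    cluster_feature(-97.74, 30.27, 412),
    point_feature(-77.04, 38.91, 1234),
])
print(json.dumps(feed, indent=2))
```

Neither feature type carries names, addresses, or rendered markup, which is what keeps the initial payload small.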
The Client Build: Patch Bingo
Again, the client build can be seen at the Volunteers in Service to America site, and the code has been released under the GPL and is available on GitHub. The Drush Make specification for this client build, including all the patches, libraries, and modules, can be found in the map feature repository in vista_map.make.
The Leaflet-Geocluster stack was not a perfect implementation for our use case; we wrote a decent number of patches and rolled a new branch of an existing module. This work was mostly generalizing the stack for a wider set of use cases with fewer assumptions. Some of the more generally useful contributions are covered here.
First, the data for the production deployment were sourced from a migration where the geodata was not available, so we had to do a fresh geocode of all the addresses. In order to geocode successfully via the web service of choice, we implemented a patch to the geocode-backfill drush command that allowed us to set a limit, and we called the command from within a cron job to geocode incrementally each day. Once the entire dataset was geocoded, we turned off this cron job and let profile2 saves trigger geocoding on demand (entity create or update). See the issue queue for more on geocode-backfill.
Once the points were geocoded at the field level, Geocluster needed to hook into the entity save process to generate geohashes of varying precision for the location in question. In order to support the extra geohash columns in geofield storage, a core patch was used to allow hook_field_schema_alter to be implemented. A purist might say that the Geocluster module should just store its data in its own tables, but then we’d be talking about table joins between the geofield data and the Geocluster data when determining the precision column to pull the geohash from when clustering. Our primary concern in this build is performance, so storing the data natively in geofield’s tables made sense.
Next, we needed a few adjustments to the Geohash algorithm in Geocluster to utilize Views as its query backend. The GROUP_CONCAT operator was added, along with hook_views_post_execute_query, so other modules could modify the view immediately after the query, i.e. abstractly “cluster” results. Then, we needed a little rework in Geocluster to interact with the Entity API. hook_field_attach_presave proved unreliable when geohashing took place (geohash columns came up empty); moving the routine to hook_entity_presave fixed the issue.
With the ability to cluster data points at the query level in place, the next step was to filter out points and clusters that were outside of the viewable area of any given view of the map, a.k.a. the bounding box. The primary complexity here was that our location data was coming in over a Views relationship, and the stable code in Views GeoJSON wrongly assumed that the geodata would be available in the base table of the view. A patch was written to take relationships into account when loading the location data.
Finally, with the data processing fixes in place, we just needed Panels support for map placement in our responsive theme. Leaflet itself was already a responsive implementation, so this was really just about mimicking the BEAN map output as a panel pane. We ended up rolling out a new (2.x) branch of Leaflet GeoJSON for this purpose, which also allows multiple data layers in one map.
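Conceptually, the bounding-box filter is a simple containment test. The sketch below, in Python for illustration only, applies it to already-loaded GeoJSON-style features; in the actual stack the filter runs as part of the Views query, and this simplified version assumes a viewport that does not cross the antimeridian. The function names are ours.

```python
def in_bbox(lon, lat, bbox):
    """Return True if the coordinate lies inside (west, south, east, north)."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def filter_to_viewport(features, bbox):
    """Keep only the features whose point geometry falls inside the viewport."""
    visible = []
    for feature in features:
        lon, lat = feature["geometry"]["coordinates"]  # GeoJSON order: lon, lat
        if in_bbox(lon, lat, bbox):
            visible.append(feature)
    return visible


# Made-up sample data: one feature inside the viewport, one outside.
sample = [
    {"geometry": {"type": "Point", "coordinates": [-97.7, 30.3]}},
    {"geometry": {"type": "Point", "coordinates": [2.35, 48.86]}},
]
visible = filter_to_viewport(sample, (-125.0, 24.0, -66.0, 50.0))
```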
There were a slew of other contributed patches we used in the build, plus some other libraries for UX polish. Refer to the makefile for more information on the particulars.
Primary Performance Gain: Query-Level Clustering
The primary performance gain that is provided by Geocluster is query-level clustering (as we saw in the benchmark plot in the previous post). In a nutshell, Geocluster adds a hierarchical spatial index to geofields based on the geohash algorithm (precision/length-based hashes are stored in separate columns). These geohashes–effectively the clustering metadata–are created when location entities are created or updated. When a map display is rendered, a query for points and clusters is a simple query of the spatial index (via AJAX).
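To illustrate the idea, here is a minimal sketch of geohash-prefix clustering. It is written in Python for readability and is not Geocluster’s PHP implementation: the encoder follows the standard public geohash algorithm, while `cluster_by_prefix` and the in-memory grouping stand in for what is, in the real stack, a grouped database query over the stored precision columns.

```python
from collections import defaultdict

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=12):
    """Encode a latitude/longitude pair with the standard geohash algorithm."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def cluster_by_prefix(points, length):
    """Group (lat, lon) points that share the same geohash prefix."""
    groups = defaultdict(list)
    for lat, lon in points:
        groups[geohash_encode(lat, lon, length)].append((lat, lon))
    return dict(groups)


# Made-up sample data: two nearby points and one far away.
points = [(57.64911, 10.40744), (57.64920, 10.40750), (-33.86, 151.21)]
clusters = cluster_by_prefix(points, 3)
```

A shorter prefix corresponds to a larger cell, so low zoom levels group on short prefixes and high zoom levels on longer ones. Because the hashes are computed once at save time, the per-request work reduces to grouping on a precomputed column.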
The point here is that the performance gain is due to the fact that the clustering task is simplified from PHP upwards through the stack, since the database query itself produced clusters in the first place.
This amounts to amortizing the clustered rendering over two different workflows: the query and the rendering. The end user is only subject to the query time if the display feed isn’t cached. In the case that the display feed is a cache hit, the display renders almost instantaneously.
With the basic framework in place for server-side clustering, we next needed to satisfy some decently challenging application requirements.
When Geocluster performs its query-level clustering, if two or more points are at the exact same geographic location, i.e. identical latitude and longitude, the points will automatically cluster together. When we get to full zoom, the cluster doesn’t explode; we effectively have an infinitely small cluster. In our application, we had organizations that had multiple users at identical points, so we had to find a way to discern between near-field clusters and what we refer to as “monolithic” clusters.
The algorithm to determine this is quite simple. We take the (arbitrary) first point in the cluster and use that as a reference latitude-longitude pair. We then iterate over the rest of the points in the cluster; as soon as we find a point whose latitude-longitude is NOT identical to the reference point, we know we have a non-monolithic cluster. More often than not, clusters are non-monolithic, and the iteration breaks out quickly, so this negative-logic detection does not impact performance in an appreciable way. Further, we knew beforehand that the largest monolithic cluster was on the order of hundreds of points, so even if we had to iterate over an entire cluster, we would not consume much time in the context of a page load or map refresh.
Relevant code snippets:
vista_map.module (monolith detection), lines 97-153
vista_map.js (use of the monolithic flag), line 127
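The early-exit check described above can be sketched in a few lines. This is a Python illustration of the logic, not the PHP code in vista_map.module; the function name and the representation of points as (latitude, longitude) tuples are assumptions made for the example.

```python
def is_monolithic(points):
    """Return True if every point shares the first point's exact coordinates.

    `points` is an iterable of (lat, lon) tuples. Exact equality is
    intentional: a monolithic cluster is one whose points are identical.
    """
    iterator = iter(points)
    reference = next(iterator, None)
    if reference is None:
        return False  # an empty cluster is not treated as monolithic
    for point in iterator:
        if point != reference:
            return False  # first mismatch: stop iterating immediately
    return True


office = [(38.9072, -77.0369)] * 3
mixed = [(38.9072, -77.0369), (38.9100, -77.0400), (38.9072, -77.0369)]
```

In the common non-monolithic case the loop stops at the first differing point; only a truly monolithic cluster is scanned in full.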
ON-DEMAND PIN POPUPS
Recall that in the traditional map receive-decode-render cycle, popup data is generally built per point at page-load time. Again seeking to amortize computational load across multiple workflows, we delayed the building of popup info until it was actually needed, i.e. literally on-demand. The non-cluster point data actually only contained a UID; we utilized this UID to query a View using the UID as a contextual filter.
This was a convenient implementation in that the View output was easily cacheable. Once the caches warmed up, popup rendering was nearly instantaneous. The fully-built map View export can be found in vista_map.views_default.inc.
Relevant code snippets:
vista_map.js (popup construction), lines 324-404
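The on-demand pattern amounts to lazy loading with a cache in front of it. The Python sketch below illustrates the flow only; `render_popup` is a hypothetical stand-in for the request to the View with the UID as a contextual filter, and the in-memory dictionary stands in for whatever caching layer actually serves the View output.

```python
class PopupCache:
    """Build popup markup only when a pin is first clicked, then reuse it."""

    def __init__(self, render):
        self._render = render  # callable taking a UID and returning markup
        self._cache = {}
        self.misses = 0  # how many times we had to render

    def get(self, uid):
        if uid not in self._cache:
            self.misses += 1
            self._cache[uid] = self._render(uid)
        return self._cache[uid]


def render_popup(uid):
    # Hypothetical placeholder for the View request keyed on the UID.
    return "<div class='popup'>User {}</div>".format(uid)


popups = PopupCache(render_popup)
first = popups.get(42)   # rendered on first click
second = popups.get(42)  # served from the cache
```

Nothing is rendered for pins the user never clicks, which is where the savings over building every popup at page load come from.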
One of the purposes of the map was to encourage users to make local connections. This was manifested in the requirements as a zoom-and-center behavior on the currently logged-in user’s mapped location. With the API provided by the Leaflet integration module, dynamically setting center and zoom was relatively simple.
Relevant code snippet:
vista_map.module (center and zoom reset), lines 290-351
This was implemented as an alter hook into a pane render, which was dependent on the 2.x branch of Leaflet GeoJSON that was contributed as a result of this work.
PROGRESSIVE ENHANCEMENT (GEOCLUSTER)
One of the drawbacks of utilizing Geocluster unilaterally on any map is that, even when the overall scale of the map is quite large, there are display scenarios where the order of magnitude of the displayed points is well below the empirical threshold for server-side clustering (hundreds of points and down). In these displays, a map refresh contains a small enough number of points that client-side clustering will suffice, and invoking the query-level clustering in Geocluster is simply algorithmic bloat.
One idea to mitigate this waste of computational resources is to “progressively enhance” the map: as we zoom in, switch to client-side clustering once the number of displayed points drops below a certain threshold.
Strictly speaking, this is more of a binary decision on which cluster implementation to use; utilizing both clustering methods on any given feed request doesn’t make much sense (or is possibly prohibitively complex). Thus “progressive” enhancement might be a misnomer. In any case, it is easy to envision each end of the spectrum: server-side clustering at low zoom and client-side at high zoom (this assumes we are clearly on one side or the other of the clustering threshold at each end of the spectrum).
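As a sketch of that binary decision, the Python snippet below picks a strategy from the number of points in the current viewport. The threshold value and the function name are placeholders chosen for illustration; the real cutoff would have to be determined empirically for a given deployment.

```python
# Hypothetical cutoff; the right value depends on benchmarks for the site.
SERVER_SIDE_THRESHOLD = 500


def choose_clustering(point_count, threshold=SERVER_SIDE_THRESHOLD):
    """Pick exactly one clustering strategy for a given feed request."""
    if point_count > threshold:
        return "server"  # let the database query produce the clusters
    return "client"      # send raw points and let the browser cluster them


low_zoom = choose_clustering(25000)
high_zoom = choose_clustering(120)
```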
BOUNDING BOX QUANTIZATION (LEAFLET GEOJSON)
Make data feeds cacheable by quantizing the viewable bounding box. Data feed URLs take the form:
The zoom argument is an integer, but the bbox arguments are floating point (degrees), derived from the current map dimensions in the viewport. This means it takes a long time for caches to warm up, since screen sizes vary widely based on user configuration, and thus the bbox arguments could vary effectively without bound. Quantizing these arguments would take the end-user influence out of the cacheability of the data feeds.
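One possible way to quantize, sketched here in Python, is to snap each edge of the bounding box outward to a zoom-dependent grid so that slightly different viewports produce identical feed arguments. The grid step of 360 / 2^zoom degrees is an assumption for this example, not something implemented in Leaflet GeoJSON.

```python
import math


def quantize_bbox(west, south, east, north, zoom):
    """Snap a bounding box outward to a grid so nearby viewports coincide."""
    step = 360.0 / (2 ** zoom)  # assumed grid size: one cell per zoom-level tile
    # Floor the minimum edges and ceil the maximum edges so the quantized
    # box always contains the original viewport, then clamp to world bounds.
    return (
        max(-180.0, math.floor(west / step) * step),
        max(-90.0, math.floor(south / step) * step),
        min(180.0, math.ceil(east / step) * step),
        min(90.0, math.ceil(north / step) * step),
    )


# Two slightly different viewports at the same zoom level.
box_a = quantize_bbox(-100.3, 30.1, -80.2, 44.9, 4)
box_b = quantize_bbox(-99.8, 30.6, -79.9, 44.2, 4)
```

Both calls yield the same box, so both requests would map to the same cache entry. The cost is that the feed returns somewhat more data than the viewport strictly needs.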
We had the opportunity to learn the Views API in a very detailed way through troubleshooting our pain points. In hindsight, it might have made sense to use the Entity Field Query API to construct the queries required for this application: Views itself required some patching, and the overhead incurred by implementing the queries with a GUI-configurable query builder likely could have been mitigated with queries built in code. However, we were able to meet the benchmarks we set out to achieve, so this optimization was never considered. Staying with Views also kept the solution more accessible to sitebuilders, a decent win for the Drupal community in general.
All of the patches that were written against the Leaflet-Geocluster stack were contributed back to the community. Most notably, the 2.x version of Leaflet GeoJSON implemented Panels support and added multiple data layer support for map displays. All the work towards this application was funded by Volunteers in Service to America.
Resources & References
The production deployment of the Vista Map.
Vista-Map GPL repository containing a Drupal 7 feature representing the build outlined in this post.
The Geohash article on Wikipedia contains a great overview of the Geohash algorithm.