Hacking Your Migration with Inheritance and Drush

Adam Ross, Software Architect
#Content Management | Posted

At Phase2 we love the Migrate module. It provides so much out of the box to facilitate moving content into a Drupal site that someone says “eureka!” almost every week in our developer chatroom. Varied data sources, highwater marks, instrumentation, and the many advantages of working with a well-architected, object-oriented system are just a few of its great features. The full range of functionality is not needed on every migration, but on your next project that one unused feature can easily become the lynchpin of success.

On a recent migration project we found a number of problems that needed improvement near the end of the development schedule, and were able to make a number of surprisingly deep changes in short order. In this post I’ll share how we were able to move so quickly by discussing the code techniques we used and some of the specific changes we made.

This post is very technical and assumes you have some basic familiarity with implementing migrations using the Migrate module. While there is a lot of code below, this is about technical approach rather than reusable solutions. If you’d like to do some background reading first, check out Rich Tolocka’s blog on planning your content migration, or Mike Le Du’s overview of object-oriented programming in Drupal 8.

Using Inheritance to Fix Performance

Because the Migrate module uses a nicely object-oriented plugin system, it’s easy to create a custom version of any piece of functionality without needing to duplicate code. As with any custom-built migration, it starts with your base migration class, the controller of the entire process. Let’s get started with an example—migrating Lightning into our new Drupal site.

The Lightning source system is a Web API that provides two primary access mechanisms. The first resource returns a list of items from an offset (starting item) to a maximum number of results (limit). For our convenience, it also includes some “pager data” describing the total number of items that match our criteria, even if they are not all available in the immediate API response. For each item in the list, we have a unique ID which can be used to craft the second type of API request, this one to provide all the data available for the item with that ID.

These API calls are used in the following way to list all items:

  • Make a request for the “next” item to migrate with a query of items ordered by creation date. (Creation date is something we can safely treat as “immutable”, so it will not change between requests.)
  • Request the details of this item so we can process the data for import.
  • Repeat this process until we’ve imported the total number of items as reported in the “pager data” mentioned above.
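The three steps above can be sketched in plain PHP. This is only an illustration: the in-memory $items array stands in for the remote Lightning API, and every lightning_example_* function name and the 'total' and 'ids' keys are hypothetical, since the post does not show the real response format.

```php
<?php
// Hypothetical stand-in for the Lightning API, keyed by item ID in creation
// order. The real service is reached over HTTP.
$items = array(
  101 => array('id' => 101, 'title' => 'First'),
  102 => array('id' => 102, 'title' => 'Second'),
  103 => array('id' => 103, 'title' => 'Third'),
);

/**
 * Simulates the list resource: IDs from $offset up to $limit, plus pager data.
 */
function lightning_example_list(array $items, $offset, $limit) {
  return array(
    'total' => count($items),
    'ids' => array_slice(array_keys($items), $offset, $limit),
  );
}

/**
 * Simulates the detail resource: all data for a single item ID.
 */
function lightning_example_detail(array $items, $id) {
  return $items[$id];
}

/**
 * Walks the list one item at a time until the pager total is reached.
 */
function lightning_example_crawl(array $items) {
  $imported = array();
  $offset = 0;
  do {
    // Step 1: ask for the "next" item.
    $list = lightning_example_list($items, $offset, 1);
    foreach ($list['ids'] as $id) {
      // Step 2: request the details of that item.
      $imported[] = lightning_example_detail($items, $id);
    }
    $offset++;
    // Step 3: repeat until we have covered the reported total.
  } while ($offset < $list['total']);

  return $imported;
}

$imported = lightning_example_crawl($items);
```

Each pass costs two requests against the real API (one list call, one detail call), which is where the latency discussed later comes from.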

Now that we understand how to use the data source, it’s time to start putting together the code to manage the migration. All migrations start by inheriting the functionality of the Migration class.

class LightningMigration extends Migration {

  function __construct() {
    parent::__construct();

    $this->map = new MigrateSQLMap(...);

    // Continues to define the source plugins and field mappings.
  }

}

 

Within its constructor, you can select non-standard plugins for the “map” (the class that tracks the relationship between your Drupal site and the origin system) and the “source” (the class that pulls data from the origin system). The ability to customize behaviors of your migration by surgically replacing code from the core Migrate module allows you to quickly make significant changes.

We need to create our own custom Source plugin to handle the collection of JSON data from the API. Our LightningMigrateSourceHttp class is set up to understand how to traverse the API to list items, and will dispatch the identifiers it extracts for import to an instance of a MigrateItem source handler. We will specifically use MigrateItemJSON because we need its capability of making API calls and processing JSON results.

While our main source plugin could be directly coded to use MigrateItemJSON, we’ll use dependency injection to pass in the class as a parameter. This keeps LightningMigration in control of the details of migration execution, and if we decide to swap out our MigrateItem instance for something more customized it will be a single-line change. Centralization of custom migration logic is really key to facilitate onboarding new developers, as tracing key decisions across an entire directory of classes takes much longer.

$this->source = new LightningMigrateSourceHttp($listUrl, new MigrateItemJSON($itemUrl, $httpOptions), array('id' => 'Identifier'), array('httpOptions' => $httpOptions));

 

This worked quite well for the initial import of all items, but we encountered performance problems when checking whether an item needed to be updated. In order to identify the next item for import from the Lightning API, we need to ask for the next identifier matching our criteria before we can request all of the item’s details to be processed. Crawling all the content of the Lightning API looking for items to update is neither efficient nor fast, because it introduces a lot of Internet latency (time between a request to a remote server and the response) into the process. Luckily, we already have a local source for all the nodes we might want to update: the migrate map table.

The Migrate module tracks all imported items via a “map” table automatically generated for each migration. This table has columns for the source identifier, destination identifier (e.g., nid), the current status of the item (does it need to be updated from source?), and a few other processing details. We are interested in the list of source identifiers that this table maintains, allowing us to replace web requests with local database queries.

In LightningMigrateSourceHttp we have implemented getNextRow(), which the abstract MigrateSource class uses to identify the next item for import. This is the method that issues the listing API call we wish to replace. Let’s create LightningMigrateSourceHttpFromMap, a new source class that overrides the getNextRow() method with our new logic, and swap it into our LightningMigration constructor:

$this->source = new LightningMigrateSourceHttpFromMap($listUrl, new MigrateItemJSON($itemUrl, $httpOptions), array('id' => 'Identifier'), array('httpOptions' => $httpOptions));

 

The LightningMigrateSourceHttpFromMap class behaves exactly as its parent class, except it has dropped half of its web queries and saves 1-5 seconds per item by asking Drupal’s database how to find the next piece of content to import. Our new getNextRow() logic calls the following function to identify the next item from the map table:

  /**
   * Retrieve the next row marked as needing update.
   *
   * @param int $offset
   *   Specify a number of rows to skip. Used to account for errors.
   *
   * @return stdClass|array
   *   The next map row object with needs_update == 1, or an empty array if
   *   no such row exists.
   *
   * @see MigrateSQLMap::getRowsNeedingUpdate()
   */
  protected function nextRowNeedingUpdate($offset = 0) {
    $map = Migration::currentMigration()->getMap();

    $rows = array();
    $result = $map->getConnection()->select($map->getMapTable(), 'map')
      ->fields('map')
      ->condition('needs_update', MigrateMap::STATUS_NEEDS_UPDATE)
      ->range($offset, 1)
      ->execute();

    foreach ($result as $row) {
      $rows[] = $row;
    }

    return empty($rows) ? array() : reset($rows);
  }

 

If the return value is empty, we increment an offset counter in getNextRow() and try again. This allows us to skip broken entries until we find a usable row.

We also needed to override how we extracted the source IDs for import. Both methods are custom, but they are another demonstration of clean inheritance. First, the API-driven data structure:

/**
 * Extracts the ID from the first item in a list query.
 *
 * @param $data
 *   Object containing next item data.
 *
 * @return array
 */
protected function getIDsFromJSON($data) {
  $ids = array();
  if (!empty($data->records)) {
    $ids[] = $data->records[0]->id;
  }

  return $ids;
}

Now we replace it with a method that uses the SQL result object instead of the JSON response object.

/**
 * Overrides LightningMigrateSourceHttp::getIDsFromJSON().
 */
protected function getIDsFromJSON($data) {
  return array($data->sourceid1);
}

A secondary impact of drawing our migrate IDs from the map table is the limitation that our code will only perform updates to already imported content. If we simply made this change as a direct replacement we would never be able to import new items from the source again. There may be a use case for that somewhere, but for our purposes we need both efficient updates and new item imports. It’s time to introduce some new run-time options to how we migrate Lightning.

Introducing Optional Flags with Drush

At Phase2 we use Drush to run our migrations. It’s a great way to sidestep various memory limit and automation problems you might have using the administrative UI. Migrate has fantastic Drush integration, but like any Drush command it has a specific list of options and flags it understands how to handle. That will not stop us.

We could run the Drush migration command with --strict=0 to allow us to use any flags we invent without complaint from Drush, but that is a bad practice: it creates invisible, unintuitive options that in many cases will not be learned by future developers on the project. Drush allows you to go a few steps further to add “official” flags to any command in the system. Let’s add an option to the migrate-import command, which is used to trigger migrations.

/**
 * Implements hook_drush_help_alter().
 */
function lightning_drush_help_alter(&$command) {
  if ($command['command'] == 'migrate-import') {
    $command['options']['updates-only'] = '[Lightning] Restrict migration to previously imported content.';
  }
}

 

Now that we have a way to introduce options to the system, we can go ahead and vary the migration. The code above declares the --updates-only flag. Let’s support it with a quick code hack to our LightningMigration class:

 

if (function_exists('drush_get_option') && drush_get_option('updates-only')) {
  $this->source = new LightningMigrateSourceHttpFromMap($listUrl, new MigrateItemJSON($itemUrl, $httpOptions), array('id' => 'Identifier'), array('httpOptions' => $httpOptions));
}
else {
  $this->source = new LightningMigrateSourceHttp($listUrl, new MigrateItemJSON($itemUrl, $httpOptions), array('id' => 'Identifier'), array('httpOptions' => $httpOptions));
}

Now instead of just one source plugin, we have a selection of two source plugins depending on whether the new Drush option is in use. Be careful to always check for the availability of Drush code before using any of its functions. This caution keeps the migration code compatible with the Migrate UI.
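If several custom flags end up being checked, the Drush availability test could be wrapped once in a small helper. This is a minimal sketch, not part of the original migration: the function name lightning_example_drush_option() is hypothetical, while drush_get_option() is the same Drush function used in the snippet above.

```php
<?php
/**
 * Returns a Drush option value, or $default when Drush is not loaded.
 *
 * Hypothetical helper: it only centralizes the function_exists() guard so
 * the migration keeps working when run from the Migrate UI.
 */
function lightning_example_drush_option($name, $default = FALSE) {
  if (function_exists('drush_get_option')) {
    return drush_get_option($name, $default);
  }
  return $default;
}

// Outside of Drush (for example, in the Migrate UI) this is simply FALSE.
$updates_only = lightning_example_drush_option('updates-only');
```

With a helper like this, the constructor's condition would shrink to a single call, and any later flag gets the same guard for free.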

Executing Our Migration

Now we have two different migrations in one: the first operates using our performance enhancement and will only be able to update content, and the other uses the normal behavior. For a complete process of importing new content and updating any previously imported content, we need to run two different commands.

$> drush -u 1 migrate-import LightningJSON --update --updates-only

$> drush -u 1 migrate-import LightningJSON

 

The first command marks all previously imported content as needing an update, and has the flag telling the migration to use our new database-driven logic. The second is a normal migration run, which will focus exclusively on importing new content. The latter can be considered “exclusively new” because we assume all content needing an update will complete during the first command.

This partitioning of responsibility is also good for resource limits like memory usage since we now use two different PHP processes to handle the operation. Managing a large migration requires careful management of system requirements.

Ongoing Synchronization

Unfortunately the import of new content still has a problem. Part of our use case is the ability to periodically pull down Lightning to make sure the Drupal system remains in sync with the canonical data. This means our migration needs to crawl the entire Lightning API looking for new items to import. We just went to some effort to avoid such a broad-spectrum effort, so let’s make some more changes to complete this work.

Since this is an article about migration hacks and not perfect migration designs, we won’t talk about using highwater marks, even though that is the correct approach. In fact, I look forward to retrofitting highwater into the Lightning migration in the future. Instead, let’s consider something simpler:

  • Add a --recent-changes flag to Drush.

  • If this option is present, only list content created within the last few days by adding a date condition to the API call for listing content.
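As a rough sketch of the second step, the listing URL could gain a date condition only when the flag is set. The query parameter names (sort, offset, limit, created_after), the three-day window, and the function name are all assumptions made for illustration; the post does not document the Lightning API's real parameters.

```php
<?php
/**
 * Builds a listing URL, optionally restricted to recently created items.
 *
 * Hypothetical helper. $now is injectable so the date math is easy to check.
 */
function lightning_example_list_url($base_url, $offset, $recent_only = FALSE, $days = 3, $now = NULL) {
  $now = ($now === NULL) ? time() : $now;

  $query = array(
    'sort' => 'created',
    'offset' => $offset,
    'limit' => 1,
  );
  if ($recent_only) {
    // Only list content created within the last $days days (UTC date).
    $query['created_after'] = gmdate('Y-m-d', $now - $days * 86400);
  }

  return $base_url . '?' . http_build_query($query);
}

// Full crawl versus recent changes only.
$full = lightning_example_list_url('https://api.example.com/items', 0);
$recent = lightning_example_list_url('https://api.example.com/items', 0, TRUE);
```

In the migration itself, the $recent_only argument would come from checking the new Drush option, with the same function_exists() guard used for --updates-only.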

With these tweaks in place, the migration commands to capture only recent changes look a little different:

$> drush -u 1 migrate-import LightningJSON --update --updates-only

$> drush -u 1 migrate-import LightningJSON --recent-changes

Now we have a lean migration that is only concerned with updating for the last few days of changes. The downfall of this simplified highwater system is that any series of days where this process fails will result in losing track of some changes; the actual highwater system records the dates when the process occurs and reaches back over days or weeks as needed. However, since this behavior is only used when manually triggered via --recent-changes, we can use the complete migration to fill in any gaps.

Respecting the Source Provider

If your migration process involves a lot of content and many troubleshooting cycles, you should spare a thought for the provider infrastructure. Do you have exclusive use of its database to avoid disrupting other users? In our example we are leveraging API-provided data and respectful behavior means not pummelling the API with thousands of redundant requests.

Many APIs have a request throttling or rate limit mechanism to keep server resources available to all users. For use cases like migration, where many API requests are needed, it’s not only respect but necessity that will force us to take measures to avoid excessive API usage. Local caching is a great way to avoid extra HTTP requests and be a good Internet citizen. Let’s change our code to first see if we have already cached the data before making a new API request.

You saw in the code above we passed MigrateItemJSON as a source plugin for extracting content from individual JSON records by requesting the item’s URL.

new MigrateItemJSON($itemUrl, $httpOptions)

Let’s swap that out for our own class where we can tailor the caching.

new LightningMigrateItemJSON($itemUrl, $httpOptions)

Here’s what that class might look like in practice. This works because the Lightning data updates at most once per day, and we can always clear the cache if we need a clean sweep.

 

class LightningMigrateItemJSON extends MigrateItemJSON {

  /**
   * Cache bin used to store API responses.
   *
   * @var string
   */
  const CACHE_BIN = 'cache_lightning';

  /**
   * Overrides MigrateItemJSON::loadJSONUrl().
   */
  protected function loadJSONUrl($item_url) {
    $cid = $item_url;

    // Keep the cache_get() result so the cached data can actually be read.
    $cache = cache_get($cid, self::CACHE_BIN);
    if ($cache && isset($cache->data)) {
      watchdog('lightning', 'Loaded data from cache: !url', array('!url' => $item_url), WATCHDOG_DEBUG);
      return $cache->data;
    }

    // Cache miss: let the parent implementation make the real API request.
    $data = parent::loadJSONUrl($item_url);
    if (empty($data)) {
      watchdog('lightning', 'Invalid JSON data from: !url', array('!url' => $item_url), WATCHDOG_WARNING);
      return array();
    }
    cache_set($cid, $data, self::CACHE_BIN, CACHE_TEMPORARY);

    return $data;
  }

}

The simple inheritance of MigrateItemJSON allows us to override a single method to wrap cache handling into the system. Now we can re-run the migration repeatedly while testing the process without producing an excessive number of requests to the source API. This change is focused on adding a caching layer, so we are still calling out to the parent implementation of loadJSONUrl() to make the actual API call. The ability to layer lean slices of functionality is what makes object-oriented reuse fun. (If you want to implement the above code yourself, don’t forget to also create the cache table and wire up hook_flush_caches()!)
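For reference, the cache table and hook_flush_caches() wiring usually follows the common Drupal 7 pattern sketched below. It assumes the module is named lightning (as the Drush hook earlier suggests) and that the bin is called cache_lightning; hook_schema() would normally live in lightning.install, and a site where the module is already installed would also need an update hook to create the table.

```php
<?php
/**
 * Implements hook_schema().
 *
 * Reuses core's standard cache table definition for our custom bin.
 */
function lightning_schema() {
  $schema = array();
  $schema['cache_lightning'] = drupal_get_schema_unprocessed('system', 'cache');
  $schema['cache_lightning']['description'] = 'Cache bin for Lightning API responses.';
  return $schema;
}

/**
 * Implements hook_flush_caches().
 *
 * Lets a general cache clear also empty the Lightning bin.
 */
function lightning_flush_caches() {
  return array('cache_lightning');
}
```

Registering the bin this way means a regular cache clear gives the "clean sweep" mentioned above without any custom tooling.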

This change does not need any options added, though it might be interesting to facilitate a cache bypass for spot-testing the end-to-end process. From a Drush mindset, that might be as simple as adding a check for whether the drush migrate-import option --idlist is in use and skipping the cache for the specified items.
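One way that bypass check might look, as a sketch only: the helper below is hypothetical and takes the raw --idlist value (a comma-separated list of source IDs) as a plain string, so it can be exercised without Drush. In the migration, that string would come from drush_get_option('idlist') behind the usual function_exists() guard.

```php
<?php
/**
 * Decides whether the cache should be skipped for a given source ID.
 *
 * Hypothetical helper: returns TRUE only when $item_id was explicitly
 * requested through the comma-separated $idlist string.
 */
function lightning_example_bypass_cache($item_id, $idlist) {
  if ($idlist === NULL || $idlist === '') {
    return FALSE;
  }
  // Tolerate whitespace around the commas.
  $ids = array_map('trim', explode(',', $idlist));
  return in_array((string) $item_id, $ids, TRUE);
}

// Item 42 was requested explicitly, so it should be fetched fresh.
$skip_cache = lightning_example_bypass_cache(42, '7, 42');
```

Inside loadJSONUrl(), a TRUE result would simply route straight to the parent implementation instead of consulting cache_get() first.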

A Taste of Drupal 8

Approaching your migration with an eye for all the powerful tools of object-oriented code is a great way to get a taste of Drupal 8 development. Creating strong object-oriented code architecture is a skill in its own right, but the Migrate module has been polished for years and is a great place to start.

Very often a clever migration hack is too specific to easily reuse, but if you’ve got a trick please share in the comments below! Your thought process and specific points in the code you use to customize your migration can be reused even if your use case cannot.
