Content migration is a topic with a lot of facets. We’ve already covered some important migration information on our blog:
- Drupal 8 Content Migration: A Guide For Marketers - What content should we migrate, and how do we organize and plan a migration?
- Estimating Drupal 8 Migration Scope - How long will all this take?
- The Top 5 Myths of Content Migration - Mistaken ideas, traps, gotchas, and mismanaged expectations.
- Managing Your Drupal 8 Migration - Key concepts, setting up the tools, and starting with a user migration.
- Drupal 8 Migrations: Taxonomy and Nodes - Migrate the bulk of Drupal content and classifications.
So far, readers of this series will have gotten lots of good process information, and learned how to move a Drupal 6 or 7 site into Drupal 8. This post, though, will cover what you do when your content is in some other data framework. If you haven’t read through the previous installments, I highly recommend you do so. We’ll be building on some of those concepts here.
Content Type Translation
One of the first steps of a Drupal to Drupal migration is setting up the content types in the destination site. But what do you do if you are moving to Drupal from another system? Well, you will need to do a little extra analysis in your discovery phase, but it’s very doable.
Most content management systems have at least some structure that is similar to Drupal’s node types, as well as a tag/classification/category system that is analogous to Drupal’s taxonomy. And it’s almost certain to have some sort of user account. So, the first part of your job is to figure out how all that works.
Is there only one ‘content type’, which is differentiated by some sort of tag (“Blog Post”, “Product Page”, etc.)? Well, then, each of those might be a different content type in Drupal. Are Editors and Writers stored in two different database tables? Well, you probably just discovered two different user roles, and will be putting both user types into Drupal users, but with different roles. Does your source site allow comments? That maps pretty closely to Drupal comments, but make sure that you actually want to migrate them before putting in the work! Drupal 8 Content Migration: A Guide For Marketers, one of the early posts in this series, can help you make that decision.
Most CMSs will also have a set of metadata that is pretty similar to Drupal’s: status, and so on. You should give some thought to how you will map those fields across as well. Note that author is often a reference to users, so you’ll need to consider migration order as well.
If your source data is not in a content management system (or you don’t have access to it), you may have to dig into the database directly. If you have received some or all of your content in XML, CSV, or other text-based formats, you may just have to open the files and read them to see what you are working with.
In short, your job here will be to distill the non-Drupal conventions of your source site into a set of Drupal-compatible entity types, and then build them.
Migration from CSV
CSV stands for “Comma-Separated Values”, and is a file format often used for transferring data in large quantities. If you get some of your data from a client in a spreadsheet, it’s wise to export it to CSV. This format strips all the MS Office or Google Sheets gobbledygook, and just gives you a straight block of data.
Currently, migrations of CSV files into Drupal use the Migrate Source CSV module. However, this module is being moved into core and deprecated. Check the Bring migrate_source_csv to core issue to see what the status on that is, and adjust this information accordingly.
First, know that CSV isn’t super-well structured, so each entity type will need to be a separate file. If you have a spreadsheet with multiple tabs, you will need to export each separately, as well.
Second, connecting to it is somewhat different than connecting to a Drupal database. Let’s take a look at the data and source configuration from the default example linked above.
[php]id,first_name,last_name,email,country,ip_address,date_of_birth
1,Justin,Dean,firstname.lastname@example.org,Indonesia,18.104.22.168,01/05/1955
2,Joan,Jordan,email@example.com,Thailand,22.214.171.124,10/14/1958
3,William,Ray,firstname.lastname@example.org,Germany,126.96.36.199,08/13/1962[/php]
[php]...
source:
  plugin: csv
  path: /artifacts/people.csv
  keys:
    - id
  header_row_count: 1
  column_names:
    - id: Identifier
    - first_name: 'First Name'
    - last_name: 'Last Name'
    - email: 'Email Address'
    - country: Country
    - ip_address: 'IP Address'
    - date_of_birth: 'Date of Birth'
...[/php]
Note first that this migration is using plugin: csv, instead of the d7_taxonomy_term that we’ve seen previously. This plugin is in the Migrate Source CSV module, and handles reading the data from the CSV file.
[php] path: /artifacts/people.csv[/php]
The path config, as you can probably imagine, is the path to the file you’re migrating. In this case, the file is contained within the module itself.
[php]  keys:
    - id[/php]
The keys config is an array of the columns that together form the unique ID of each row.
[php]  header_row_count: 1
  column_names:
    - id: Identifier
    - first_name: 'First Name'
    - last_name: 'Last Name'
    ...[/php]
These two configurations interact in an interesting way. If your data has a row of headers at the top, you will need to let Drupal know about it by setting a header_row_count. When you do that, Drupal will parse the header row into field ids, then move to the next line of the file for actual data parsing.
However, if you set the column_names configuration, Drupal will override the field ids created when it parsed the header row. By passing only select field ids, you can skip fields entirely without having to edit the actual data. It also allows you to specify a human-readable field name for the column of data, which can be handy for your reference, or if you’re using Drupal Migrate’s admin interface.
You really should set at least one of these for each CSV migration.
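As a rough sketch of skipping columns, the source below keeps only three of the seven columns from the people.csv example. It assumes the 8.x-2.x branch of Migrate Source CSV, where (as far as I can tell) each column_names entry is keyed by the zero-based position of the column in the file. Check the README of the version you have installed before relying on this.

[php]source:
  plugin: csv
  path: /artifacts/people.csv
  keys:
    - id
  header_row_count: 1
  # Assumption: the numeric keys are zero-based column positions in the file.
  column_names:
    0:
      id: Identifier
    3:
      email: 'Email Address'
    4:
      country: Country[/php]

With this in place, only id, email, and country would be available to the process section. The other columns stay in the file but are never read into the row.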
The process configuration will treat these field ids exactly the same as a Drupal fieldname.
Process and Destination configuration for CSV files are pretty much the same as with a Drupal-to-Drupal import, and they are run with Drush exactly the same.
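To make that concrete, here is a minimal sketch of what the process and destination sections might look like for the people.csv data above. It assumes you want each row to become a Drupal user; the choice to build the username from the first and last name is purely illustrative, and is not taken from the module’s example.

[php]process:
  # Illustrative assumption: username is "First Last".
  name:
    plugin: concat
    source:
      - first_name
      - last_name
    delimiter: ' '
  # CSV field ids are referenced just like Drupal field names.
  mail: email
  status:
    plugin: default_value
    default_value: 1
destination:
  plugin: 'entity:user'[/php]

The concat and default_value plugins ship with Drupal core, so nothing beyond the CSV source module is needed for this part.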
Migration from XML/RSS
XML is a common data storage format that presents data in a tagged structure. Many content management systems or databases have an ‘export as XML’ option. One advantage XML has over CSV is that you can put multiple data types into a single file. Of course, if you have lots of data, this advantage could turn into a disadvantage as the file size balloons! Weigh your choice carefully.
The Migrate Plus module has a data parser for XML, so if you’ve been following along with our series so far, you should already have this capability installed.
Much like CSV, you will have to connect to a file, rather than a database. RSS is a commonly used XML format, so we’ll walk through connecting to an RSS file for our example. I pulled some data from Phase2’s own blog RSS for our use, too.
[php]<?xml version="1.0" encoding="utf-8"?>
<rss ... xml:base="https://www.phase2technology.com/ideas/rss.xml">
  <channel>
    <title>Phase2 Ideas</title>
    <link>https://www.phase2technology.com/ideas/rss.xml</link>
    <description/>
    <language>en</language>
    <item>
      <title>The Top 5 Myths of Content Migration *plus one bonus fairytale</title>
      <link>https://www.phase2technology.com/blog/top-5-myths-content</link>
      <description>The Top 5 Myths of Content Migration ... </description>
      <pubDate>Wed, 08 Aug 2018 14:23:34 +0000</pubDate>
      <dc:creator>Bonnie Strong</dc:creator>
      <guid isPermaLink="false">1304 at https://www.phase2technology.com</guid>
    </item>
  </channel>
</rss>[/php]
[php]id: example_xml_articles
label: 'Import articles'
status: true
source:
  plugin: url
  data_fetcher_plugin: http
  urls: 'https://www.phase2technology.com/ideas/rss.xml'
  data_parser_plugin: simple_xml
  item_selector: /rss/channel/item
  fields:
    - name: guid
      label: GUID
      selector: guid
    - name: title
      label: Title
      selector: title
    - name: pub_date
      label: 'Publication date'
      selector: pubDate
    - name: link
      label: 'Origin link'
      selector: link
    - name: summary
      label: Summary
      selector: description
  ids:
    guid:
      type: string
destination:
  plugin: 'entity:node'
process:
  title:
    plugin: get
    source: title
  field_remote_url: link
  body: summary
  created:
    plugin: format_date
    from_format: 'D, d M Y H:i:s O'
    to_format: 'U'
    source: pub_date
  status:
    plugin: default_value
    default_value: 1
  type:
    plugin: default_value
    default_value: article[/php]
The key bits here are in the source configuration.
[php]source:
  plugin: url
  data_fetcher_plugin: http
  urls: 'https://www.phase2technology.com/ideas/rss.xml'
  data_parser_plugin: simple_xml
  item_selector: /rss/channel/item[/php]
Much like CSV’s use of the csv plugin to read a file, the XML migration does not use the d7_taxonomy_term plugin to read the data. Instead, it uses the url plugin to pull in a URL and read the data it finds there. The data_fetcher_plugin takes one of two possible values, either http or file. HTTP is for a remote source, like an RSS feed, while File is for a local file. The urls config should be pretty obvious.
data_parser_plugin specifies what PHP library to use to read and interpret the data. Possible parsers here include JSON, SOAP, XML and SimpleXML. SimpleXML’s a great library, so we’re using that here.
item_selector defines where in the XML the items we’re importing can be found. If you look at our data example above, you’ll see that the actual nodes are in rss -> channel -> item. Each node would be an item.
[php]  fields:
    ...
    - name: pub_date
      label: 'Publication date'
      selector: pubDate
    ...[/php]
Here you see one of the fields from the XML. The label is just a human-readable label for the field, while the selector is the field within the XML item we’re getting.
The name is what we’ll call a pseudo-field. A pseudo-field acts as temporary storage for data. When we get to the Process section, the pseudo-fields are treated essentially as though they were fields in a database.
We’ve seen pseudo-fields before, when we were migrating taxonomy fields in Drupal 8 Migrations: Taxonomy and Nodes. We will see why they are important here in a minute, but there’s one more important thing in source.
[php]  ids:
    guid:
      type: string[/php]
This snippet sets the guid as the unique identifier of the article we’re importing. This guarantees us uniqueness and is very important to specify.
Finally, we get to the process section.
[php]process:
  ...
  created:
    plugin: format_date
    from_format: 'D, d M Y H:i:s O'
    to_format: 'U'
    source: pub_date
  ...[/php]
So, here is where we’re using the pseudo-field we set up before. This takes the value from pubDate that we stored in the pseudo-field pub_date, does some formatting to it, and assigns it to the created field in Drupal. The rest of the fields are done in a similar fashion.
Destination is set up exactly like a Drupal-to-Drupal migration, and the whole thing is run with Drush the exact same way. Since RSS is a feed of real-time content, it would be easy to set up a cron job to run that drush command, add the --update flag, and turn this migration from a one-time content import into a regular update job that keeps your site in sync with the source.
Migration from WordPress
A common migration path is from WordPress to Drupal. Phase2 recently did so with our own site, and we have done it for clients as well. There are several ways to go about it, but our own migration used the WordPress Migrate module.
In your WordPress site, under Tools >> Export, you will find a tool to dump your site data into a customized XML format. You can also use the wp-cli tool to do it from the command line, if you like.
Once you have this file, it becomes your source for all the migrations. Here’s some good news: it’s an XML file, so working with it is very similar to working with RSS. The main difference is in how we specify our source connections.
[php]langcode: en
status: true
dependencies:
  enforced:
    module:
      - phase2_migrate
id: example_wordpress_authors
class: null
field_plugin_method: null
cck_plugin_method: null
migration_tags:
  - example_wordpress
  - users
migration_group: example_wordpress_group
label: 'Import authors (users) from WordPress WXR file.'
source:
  plugin: url
  data_fetcher_plugin: file
  data_parser_plugin: xml
  item_selector: '/rss/channel/wp:author'
  namespaces:
    wp: 'http://wordpress.org/export/1.2/'
    excerpt: 'http://wordpress.org/export/1.2/excerpt/'
    content: 'http://purl.org/rss/1.0/modules/content/'
    wfw: 'http://wellformedweb.org/CommentAPI/'
    dc: 'http://purl.org/dc/elements/1.1/'
  urls:
    - 'private://example_output.wordpress.2018-01-31.000.xml'
  fields:
    - name: author_login
      label: 'WordPress username'
      selector: 'wp:author_login'
    - name: author_email
      label: 'WordPress email address'
      selector: 'wp:author_email'
    - name: author_display_name
      label: 'WordPress display name (defaults to username)'
      selector: 'wp:author_display_name'
    - name: author_first_name
      label: 'WordPress author first name'
      selector: 'wp:author_first_name'
    - name: author_last_name
      label: 'WordPress author last name'
      selector: 'wp:author_last_name'
  ids:
    author_login:
      type: string
process:
  name:
    plugin: get
    source: author_login
  mail:
    plugin: get
    source: author_email
  field_display_name:
    plugin: get
    source: author_display_name
  field_first_name:
    plugin: get
    source: author_first_name
  field_last_name:
    plugin: get
    source: author_last_name
  status:
    plugin: default_value
    default_value: 0
destination:
  plugin: 'entity:user'
migration_dependencies: null[/php]
If you’ve been following along in our series, a lot of this should look familiar.
[php]source:
  plugin: url
  data_fetcher_plugin: file
  data_parser_plugin: xml
  item_selector: '/rss/channel/wp:author'[/php]
This section works exactly like the XML RSS example above. Instead of using http, we are using file for the data_fetcher_plugin, so it looks for a local file instead of making an HTTP request. Additionally, due to the difference in the structure of an RSS feed compared to a WordPress WXR file, the item_selector is different, but it works the same way.
[php]  namespaces:
    wp: 'http://wordpress.org/export/1.2/'
    excerpt: 'http://wordpress.org/export/1.2/excerpt/'
    content: 'http://purl.org/rss/1.0/modules/content/'
    wfw: 'http://wellformedweb.org/CommentAPI/'
    dc: 'http://purl.org/dc/elements/1.1/'[/php]
These namespace designations allow Drupal’s XML parser to understand the particular brand and format of the WordPress export.
[php]  urls:
    - 'private://example_output.wordpress.2018-01-31.000.xml'[/php]
Finally, this is the path to your export file. Note that it is in the private filespace for Drupal, so you will need to have private file management configured in your Drupal site before you can use it.
[php]  fields:
    - name: author_login
      label: 'WordPress username'
      selector: 'wp:author_login'[/php]
We’re also setting up pseudo-fields again, storing the value from wp:author_login in author_login.
Finally, we get to the process section.
[php]process:
  name:
    plugin: get
    source: author_login[/php]
So, here is where we’re using the pseudo-field we set up before. This takes the value from wp:author_login that we stored in author_login and assigns it to the name field in Drupal.
Configuration for the migration of the rest of the entities - categories, tags, posts, and pages - looks pretty much the same. The main difference is that the source will change slightly:
[php]source:
  ...
  item_selector: '/rss/channel/wp:category'[/php]
[php]source:
  ...
  item_selector: '/rss/channel/wp:tag'[/php]
[php]source:
  ...
  item_selector: '/rss/channel/item[wp:post_type="post"]'[/php]
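Pages would presumably follow the same pattern as posts. A WordPress export marks pages with a wp:post_type of page, so a sketch of the selector for a pages migration, assuming the same namespaces as in the authors example, might look like this:

[php]source:
  ...
  # Assumption: same source settings as the posts migration, filtered to pages.
  item_selector: '/rss/channel/item[wp:post_type="page"]'[/php]

Media items show up in the same export with a post type of attachment, so it is worth filtering on wp:post_type explicitly rather than selecting every item.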
And, just like our previous two examples, WordPress migrations can be run with Drush.
A cautionary tale
As we noted in Managing Your Drupal 8 Migration, it’s possible to write custom Process Plugins. Depending on your data structure, it may be necessary to write a couple to handle values in these fields. On the migration of Phase2’s site recently, after doing a baseline test migration of our content, we discovered a ton of malformed links and media entities. So, we wrote a process plugin that ran a bunch of preg_replace calls to clean up links, file paths, and code formatting in our body content. This was chained with the default get plugin like so:
[php]process:
  body/value:
    - plugin: get
      source: content
    - plugin: p2body[/php]
The plugin itself is a pretty custom bit of work, so I’m not including it here. However, a post on custom plugins for migration is in the works, so stay tuned.
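In the meantime, the chaining pattern itself can be tried out with a stock plugin. The sketch below is not our p2body plugin; it swaps in Drupal core’s callback process plugin, which passes the value through a PHP function, here trim. The content source field is the same pseudo-field name used in the snippet above.

[php]process:
  body/value:
    # Step 1: read the raw value from the source pseudo-field.
    - plugin: get
      source: content
    # Step 2: the output of step 1 is passed to PHP's trim().
    - plugin: callback
      callable: trim[/php]

Each plugin in the list receives the output of the one before it, so a custom cleanup plugin can be slotted in as one more step.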
Useful Resources and References
- Migrate Source Plugins documentation: https://www.drupal.org/docs/8/api/migrate-api/migrate-source-plugins
- Stop Waiting for Feeds Module (Use Migrate for XML/RSS): https://ohthehugemanatee.org/blog/2017/06/07/stop-waiting-for-feeds-module-how-to-import-remote-feeds-in-drupal-8/
- Migrate Plus module: https://www.drupal.org/project/migrate_plus
- Migrate Source CSV module: https://www.drupal.org/project/migrate_source_csv
If you’ve enjoyed this series so far, we think you might enjoy a live version, too! Please drop by our session proposal for Drupalcon Seattle, Moving Out, Moving In! Migrating Content to Drupal 8 and leave some positive comments.