WordPress Export Manipulation using PHP

Robert Bates, Senior Developer

In the world of content management systems (CMS), WordPress is a mainstay for bloggers, whether individuals or organizations, public or private. However, sometimes there’s a requirement to migrate WordPress content into a different CMS that provides more generalized functionality. When this happens, the core WXR export is a lifesaver.

WXR stands for WordPress eXtended RSS, which is exactly what the format is - a giant RSS dump of the site’s content, media, and tagging system. WordPress extends RSS to support all its custom elements via the “wp” namespace, which keeps the content RSS-compliant while adding what’s needed to replicate the site’s content in its entirety. The WXR format is the quickest way to move data between WordPress sites, and it doubles as an archiving tool. It also happens to work really well as a source for migrations into alternate systems...
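
To make the format a little more concrete, here is a minimal sketch that loads a hand-written, heavily trimmed WXR fragment and looks up the wp namespace. The fragment is illustrative rather than a real export, and the namespace URI shown is an assumption that depends on the WXR version your site produces.

```php
<?php
// A hand-written, trimmed-down WXR fragment (not a real export).
// The wp namespace URI is an assumption; it varies with the WXR version.
$wxr = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>Example Blog</title>
    <wp:category>
      <wp:term_id>3</wp:term_id>
      <wp:category_nicename>news</wp:category_nicename>
      <wp:category_parent></wp:category_parent>
      <wp:cat_name><![CDATA[News]]></wp:cat_name>
    </wp:category>
    <item>
      <title>Hello world</title>
      <category domain="category" nicename="news"><![CDATA[News]]></category>
      <wp:post_type>post</wp:post_type>
    </item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($wxr);

// Plain RSS elements carry no namespace; WordPress additions use the wp prefix.
$wp_ns_uri = $doc->lookupNamespaceURI('wp');
$items = $doc->getElementsByTagName('item');
$wp_cats = $doc->getElementsByTagNameNS($wp_ns_uri, 'category');

print "wp namespace: {$wp_ns_uri}\n";
print "items: {$items->length}, wp:category elements: {$wp_cats->length}\n";
```

Note the two kinds of category data: the channel-level wp:category elements define the terms, while each item references them through plain category elements.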

One Site to Rule Them All

I recently worked on a project where a new site was being built in Drupal 7, and a requirement had surfaced to migrate multiple WordPress sites into the single Drupal site, with each post being tagged in Drupal to associate it with the original source site for categorical blogrolls. We chose to leverage the WordPress category system to accomplish our goal.

Great! With a WXR export from each WordPress site and a plan of action in hand, how do we get the data into Drupal? Here is where the very functional WordPress Migrate module came into play. This module leverages the Migrate module’s new wizard UI introduced in version 2.6, and steps the admin through building a dynamic migration from an uploaded file (or a source WordPress site, if you have all the credentials), along with its field mappings and behavior. We can conveniently map the WordPress category tags to the appropriate term reference field on our target Blog content type. But here’s the catch - each export carries multiple legacy categories, where we need a single new one per WXR dump…

PHP DOM FTW!

Let’s see… XML manipulation at the element level. Ease of coding. PHP + DOM to the rescue! Of course, you could use something like the event-based XML Parser extension and wire up all kinds of parsing callbacks, but the DOM model and class hierarchy make it the quick and dirty choice for a one-off XML manipulation. First off, we need to define a class that can handle any WXR-specific features that we want to expose, so we derive a new class from DOMDocument:

    // Define custom DOMDocument class.
    class WXRDocument extends DOMDocument {
      // URI bound to the "wp" prefix in the loaded document.
      public $wp_ns_uri = '';
      // Highest term ID seen so far; new categories count up from here.
      private $term_id_max = 0;

      // Override load method so we can extract the wp namespace URI.
      public function load($filename, $options = 0) {
        $retval = parent::load($filename, $options);
        if (FALSE !== $retval) {
          // Extract wp namespace URI from document.
          $this->wp_ns_uri = $this->lookupNamespaceURI('wp');

          // Find max value of term ids.
          foreach (array('category', 'tag') as $el_name) {
            $terms = $this->getElementsByTagNameNS($this->wp_ns_uri, $el_name);
            foreach ($terms as $term) {
              $term_id = $term->getElementsByTagNameNS($this->wp_ns_uri, 'term_id');
              // Skip terms without an ID and compare as integers, not strings.
              if ($term_id->length > 0) {
                $this->term_id_max = max($this->term_id_max, (int) $term_id->item(0)->textContent);
              }
            }
          }
        }

        return $retval;
      }

      // Add new method to create new WP categories in the WXR file.
      // Array key is nicename, value is cat_name.
      public function addCategories($new_cats = array()) {
        $channels = $this->getElementsByTagName('channel');
        $channel = $channels->item(0);
        foreach ($new_cats as $nicename => $cat_name) {
          $new_cat = $this->createElementNS($this->wp_ns_uri, 'wp:category');
          $new_cat->appendChild($this->createElementNS($this->wp_ns_uri, 'wp:term_id', ++$this->term_id_max));
          $new_cat->appendChild($this->createElementNS($this->wp_ns_uri, 'wp:category_nicename', $nicename));
          $new_cat->appendChild($this->createElementNS($this->wp_ns_uri, 'wp:category_parent'));

          // Wrap the display name in CDATA so special characters survive.
          $new_cat_name = $this->createElementNS($this->wp_ns_uri, 'wp:cat_name');
          $new_cat_name->appendChild(new DOMCdataSection($cat_name));
          $new_cat->appendChild($new_cat_name);
          $channel->appendChild($new_cat);
        }
      }
    }

Our primary goal for this class is to load the WXR file, extract the namespace information, and initialize properties related to managing categories at the document level. The addCategories method lets us add any number of categories to the document without worrying about term ID collisions, so our script can focus on what we need to do, not how to do it. This will allow us to add the new categories for each site at the export level, which is analogous to managing a vocabulary in Drupal.
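
The collision avoidance boils down to finding the highest existing term ID and counting up from there. As an alternative to the nested loops in the class, here is a minimal sketch of the same scan using DOMXPath instead. The inline document and the namespace URI are assumptions made for illustration, not data from a real export.

```php
<?php
// Assumed namespace URI; a real export may use a different version.
$wp_ns = 'http://wordpress.org/export/1.2/';

$doc = new DOMDocument();
$doc->loadXML(
    '<rss xmlns:wp="' . $wp_ns . '"><channel>'
    . '<wp:category><wp:term_id>3</wp:term_id></wp:category>'
    . '<wp:tag><wp:term_id>17</wp:term_id></wp:tag>'
    . '</channel></rss>'
);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('wp', $wp_ns);

// Scan category and tag term IDs in one pass, comparing as integers.
$term_id_max = 0;
foreach ($xpath->query('//wp:category/wp:term_id | //wp:tag/wp:term_id') as $node) {
    $term_id_max = max($term_id_max, (int) $node->textContent);
}

// Any new category can safely start from the next ID.
$next_term_id = $term_id_max + 1;
print "Highest existing term ID: {$term_id_max}, next free ID: {$next_term_id}\n";
```

Whether XPath or nested getElementsByTagNameNS calls read better is a matter of taste; for a one-off script either approach should do.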

Next up, we need to be able to manipulate individual posts to remove all old category tags and add our new ones. This is also accomplished by extending a built-in PHP class, DOMElement:

    // Define custom DOMElement class.
    class WXRElement extends DOMElement {
      public function setCategory($nicename, $cat_name) {
        // Remove all existing categories. Copy the live node list to an array
        // first, since removing nodes while iterating it can skip siblings.
        $cats = iterator_to_array($this->getElementsByTagName('category'));
        foreach ($cats as $cat) {
          if ('category' == $cat->getAttribute('domain')) {
            $cat->parentNode->removeChild($cat);
          }
        }

        // Set desired category.
        $new_cat = $this->ownerDocument->createElement('category');
        $new_cat->appendChild(new DOMCdataSection($cat_name));
        $new_cat->setAttribute('domain', 'category');
        $new_cat->setAttribute('nicename', $nicename);
        $this->appendChild($new_cat);
      }
    }

The setCategory method replaces all existing categories on a single post with the one category we provide. Again, this is a convenience method to simplify our WXR manipulation code and abstract the more complex DOM manipulation away from our core script logic.
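
One thing to watch when removing nodes in a loop: the list returned by getElementsByTagName is live, so deleting elements while iterating over it directly can cause siblings to be skipped. Here is a minimal sketch of the safer pattern, which snapshots the list into an array before removing anything. The inline item element is a made-up example, not real export data.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<item>'
    . '<category domain="category" nicename="a">A</category>'
    . '<category domain="category" nicename="b">B</category>'
    . '<category domain="category" nicename="c">C</category>'
    . '<category domain="post_tag" nicename="t">T</category>'
    . '</item>'
);
$item = $doc->documentElement;

// Snapshot the live list first so removals cannot disturb the iteration.
$cats = iterator_to_array($item->getElementsByTagName('category'));
foreach ($cats as $cat) {
    // Only strip true categories; tags (domain="post_tag") are left alone.
    if ('category' === $cat->getAttribute('domain')) {
        $cat->parentNode->removeChild($cat);
    }
}

$remaining = $item->getElementsByTagName('category');
print "Remaining category elements: {$remaining->length}\n";
```

Only the post_tag entry should survive, since it shares the category element name but not the domain.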

Now that we have all our classes in place, let’s import the WXR document and register our element class:

    // Set up DOM document for manipulation.
    $xml_filename = 'blog1-export.xml';
    print "Processing {$xml_filename}\n";
    $doc = new WXRDocument();
    $doc->registerNodeClass('DOMElement', 'WXRElement');
    $doc->load($xml_filename);

OK, the WXR file has now been loaded into an instance of our new class, and the elements are all being instantiated as WXRElement objects. Now that we’ve got the DOM document prepped and ready to go, let’s add our new category to the top level in a couple of lines:
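
If registerNodeClass is new to you, here is a minimal standalone sketch of what it does. DemoElement is a hypothetical stand-in class invented for this illustration; it is not part of the migration script.

```php
<?php
// DemoElement is a hypothetical stand-in for any DOMElement subclass.
class DemoElement extends DOMElement {
    public function describe() {
        return $this->nodeName . ' handled by ' . get_class($this);
    }
}

$doc = new DOMDocument();
// Ask the document to hand back DemoElement wherever it would create a DOMElement.
$doc->registerNodeClass('DOMElement', 'DemoElement');
$doc->loadXML('<channel><item/></channel>');

// Elements fetched from the document now expose the subclass methods.
$item = $doc->getElementsByTagName('item')->item(0);
print $item->describe() . "\n";
```

This is what lets the script call custom methods directly on the elements it pulls out of the document.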

    // Add new categories.
    $new_cats = array(
      'category-blog-1' => "Blog Site 1",
    );
    $doc->addCategories($new_cats);

We could have added a plethora of categories at this point, but in this case we only needed the one new one. Now that the category has been added, let’s locate our post elements and update them:

    // Iterate over WP items looking for posts.
    $items = $doc->getElementsByTagName('item');
    foreach ($items as $item) {
      $post_type = $item->getElementsByTagNameNS($doc->wp_ns_uri, 'post_type');
      // Skip items that have no post type or are not posts (pages, attachments).
      if ($post_type->length > 0 && 'post' == $post_type->item(0)->textContent) {
        // Set new post category.
        $nicename = 'category-blog-1';
        $category = $new_cats[$nicename];
        $item->setCategory($nicename, $category);
      }
    }

Note that the code leaves room for logic that picks a $nicename value based on the post’s content; in this simplified version we only have the one category. The loop walks every item element, checks its wp:post_type child, and updates the category information only on items whose type is post. We could have added some iterator logic to the WXRDocument class and type-identification methods to the WXRElement class, but we weren’t going for OO perfection, just basic functionality and speed with convenience.
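
As a rough idea of what content-based selection could look like, here is a minimal sketch that chooses a nicename from keywords in the post title. The pick_nicename helper, the keyword rule, and the category-releases nicename are all hypothetical and were not part of the original script.

```php
<?php
// Hypothetical helper: choose a category nicename from the post title.
// $rules maps a case-insensitive keyword to a nicename; $default is the fallback.
function pick_nicename(DOMElement $item, array $rules, $default) {
    $titles = $item->getElementsByTagName('title');
    $title = $titles->length ? $titles->item(0)->textContent : '';
    foreach ($rules as $keyword => $nicename) {
        if (false !== stripos($title, $keyword)) {
            return $nicename;
        }
    }
    return $default;
}

// Made-up items standing in for posts from an export.
$doc = new DOMDocument();
$doc->loadXML(
    '<channel>'
    . '<item><title>Release notes for March</title></item>'
    . '<item><title>Team picnic</title></item>'
    . '</channel>'
);

$rules = array('release' => 'category-releases');
$picked = array();
foreach ($doc->getElementsByTagName('item') as $item) {
    $picked[] = pick_nicename($item, $rules, 'category-blog-1');
}
print implode("\n", $picked) . "\n";
```

In the real loop, the returned nicename would simply replace the hard-coded value before the call to setCategory.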

Now that we’ve added the new category to the document and updated all posts in the export to use the new category, we’re ready to save it out:

    // Save out processed XML.
    $save_filename = pathinfo($xml_filename, PATHINFO_FILENAME) . '-processed.xml';
    print "Saving {$save_filename}\n";
    $doc->save($save_filename);

We now have a modified WXR file with the proper category tagging to import into Drupal, and the WordPress Migrate module will have no problem processing it. Minor modifications to update the source filename and target category are all that’s required for subsequent runs. Luckily, we only had a few exports to process, so it went quickly after the initial script was written.
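
If you had more than a handful of exports, the per-run edits could be replaced with a small job map. This is a minimal sketch of the idea that only computes and prints the output names; the second filename and its category are hypothetical, and the actual load, update, and save calls would go inside the loop.

```php
<?php
// Hypothetical batch map: source export => array(nicename, category name).
$jobs = array(
    'blog1-export.xml' => array('category-blog-1', 'Blog Site 1'),
    'exports/blog2-export.xml' => array('category-blog-2', 'Blog Site 2'),
);

foreach ($jobs as $xml_filename => $job) {
    list($nicename, $cat_name) = $job;
    // PATHINFO_FILENAME drops both the directory and the extension, so the
    // processed file lands in the current working directory.
    $save_filename = pathinfo($xml_filename, PATHINFO_FILENAME) . '-processed.xml';
    print "{$xml_filename} -> {$save_filename} ({$nicename}: {$cat_name})\n";
}
```

Keep in mind that two exports with the same base name in different directories would overwrite each other with this naming scheme.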

Closing Notes

The WXRDocument and WXRElement classes defined above are not provided as the be-all, end-all for WXR DOM manipulation. Instead, they were a proof of concept showing how the DOM base classes in PHP can be extended to provide customized support for a specific XML format, and possibly a good starting point for an extensible WXR-friendly solution in the event requirements changed mid-project. Where possible, the methods were coded with reusability in mind and support for multi-value parameters via arrays, with single values being the edge case.

If you’ve got experience with manipulating WXR exports or extending the DOM class library, drop a comment below!
