Optimizing Referenced Content in the Database for the Pac-12 Conference

Joshua Turton, Senior Developer
#Drupal | Posted

Database optimization doesn’t always take the form of tweaks to mysql.ini, indexing another column on an overworked database, or query caching. Sometimes it’s more valuable to re-think the way you’re gathering, storing, and accessing your data in the first place. Recent work with our friends and client, Pac-12 Networks, has shown the value of this approach.

Pac-12’s recent redesign and launch of their site includes a number of new content types: EPGs (streaming videos, which play only on specific dates and times), VODs (Video on Demand, always available), Articles, Galleries, and Podcasts. However the most complicated piece of output is their Event nodes.

Events Pages

Events pages are a cornerstone of the new Pac-12.com, and must be versatile because they include and display content from other pages in four panels.

The content that appears is defined using the Reference and Node Reference modules, but not on the Event node. It made more sense for the references to appear on the content nodes (EPG, VOD, Article, Gallery, and Podcast) than on the Event nodes because EPGs and VODs are imported, and come into Drupal with event data already included. By necessity, looking up content related to a given node is more complicated and will not be done automatically on node_load by the node_reference module.

Pac-12 also defined a specific prioritization of what content should appear on event pages based on content type, taxonomy, and freshness of content. The prioritization looks like this:

A) CONTENT TAGGED TO THE EVENT ITSELF IN THE FOLLOWING ORDER:

  • EPG, if one is currently playing.
  • VOD, tagged with Highlight taxonomy term
  • VOD, tagged with Replay taxonomy term
  • VOD, untagged
  • Photo Gallery
  • Article, tagged with Preview or Recap taxonomy term
    • When more than one Pac-12 school is competing in the event, there will possibly be multiple previews / recaps.
  • Article, untagged
  • Podcast

Within any content type if there are multiple items at the same level, they’ll be sequenced in reverse chronological order.

Now, should there not be enough content that is tagged to a given event, Pac-12 also defined a way to identify default, related content to fill out those content spots.

B) CONTENT TAGGED TO ONE OR MORE OF THE PAC-12 TEAMS PARTICIPATING IN THE EVENT IN REVERSE CHRONOLOGICAL ORDER.

  • VOD
  • Photo Gallery
  • Article
  • Podcast

Here’s what a frontal approach to determining the content to display on any given event looks like:

  1. function pac12_event_hero_area_get_featured_content($event_id = NULL) {
  2. if (!empty($event_id)) {
  3. // Find related EPG node.
  4. $segs = db_query("
  5. SELECT ots.virtual_asset_id
  6. FROM {ooyala_track_segment_events} otse
  7. LEFT JOIN {ooyala_track_segment} ots ON ots.segment_id = otse.segment_id
  8. WHERE otse.event_id = :event_id
  9. AND ots.start_time <= NOW()
  10. AND ots.end_time >= NOW()
  11. ORDER BY ots.network_weight ASC
  12. LIMIT 0,1",
  13. array(':event_id' => $event_id));
  14.  
  15. $vaid = $segs->fetchObject();
  16. if (!empty($vaid)) {
  17. $query = new EntityFieldQuery();
  18. $query->entityCondition('entity_type', 'node')
  19. ->entityCondition('bundle', 'epg')
  20. ->propertyCondition('status', 1)
  21. ->fieldCondition('field_video', 'embed_code', $vaid->virtual_asset_id)
  22. ->range(0, 1);
  23. $epg = $query->execute();
  24. if (!empty($epg['node'])) {
  25. $node = array_pop($epg['node']);
  26. $node = node_load($node->nid);
  27. return $node;
  28. }
  29. }
  30. // If there's no related EPG, load related VOD.
  31. $query = new EntityFieldQuery();
  32. $query->entityCondition('entity_type', 'node')
  33. ->entityCondition('bundle', 'vod')
  34. ->propertyCondition('status', 1)
  35. ->fieldCondition('field_event', 'nid', $event_id)
  36. ->propertyOrderBy('changed', 'DESC')
  37. ->range(0, 1);
  38. $vods = $query->execute();
  39. if (!empty($vods['node'])) {
  40. $node = array_pop($vods['node']);
  41. $node = node_load($node->nid);
  42. return $node;
  43. }
  44. return NULL;
  45. }
  46. }

This wasn’t even a complete pass at the problem, as the code displayed here only retrieved EPGs if available, and VODs without consideration of the VOD tagging. Taxonomy association to VODs, plus the addition of Galleries, Articles, and Podcasts with their attendant tagging were not yet included. This doesn’t solve the problem for the additional filler content if the first pass isn’t enough either.

It was already apparent that those queries and their processing were going to be a significant source of server load under real traffic, no matter how much indexing or caching was applied.

LET’S FACE IT – THAT’S GOING TO WIND UP BEING A MASSIVE FUNCTION.

We realized that this function was actually repeatedly using a lot of server resources. It was calculating useful, usable, relatively static data that we weren’t storing. So, why not calculate it once and store it, instead of trying to calculate it many, many times on the fly?

To ameliorate this, we created an additional table called event_priority. This table has four important fields – Event ID (enid), Content ID (fnid), the Content’s “changed” value (changed), and a calculated Priority Score (priority_score).

When a content node is saved with attached events, a score is calculated for that node, then one or more records is saved in event priority table. Here’s how the score is calculated:

  1. function _pac12_event_find_priority($node = NULL) {
  2. $priority = 0;
  3.  
  4. // Setup weighting array.
  5. $weighting = array(
  6. 'vod' => 40,
  7. 'gallery' => 30,
  8. 'article' => 20,
  9. 'podcast' => 10,
  10. PAC12_EVENT_TID_HIGHLIGHT => 2,
  11. PAC12_EVENT_TID_REPLAY => 1,
  12. PAC12_EVENT_TID_PREVIEW => 2,
  13. PAC12_EVENT_TID_RECAP => 1,
  14. );
  15. if (isset($weighting[$node->type]) && $node->status && !empty($node->field_events) && is_array($node->field_events)) {
  16. // Priority by Node Type.
  17. $priority = $weighting[$node->type];
  18. // Priority by Taxonomy Term - additive.
  19. if (!empty($node->field_content_type) && is_array($node->field_content_type)) {
  20. foreach ($node->field_content_type as $tax) {
  21. if (!empty($tax[0]['tid']) && !empty($weighting[$tax[0]['tid']])) {
  22. $priority += $weighting[$tax[0]['tid']];
  23. }
  24. }
  25. }
  26. }
  27. return $priority;
  28. }

Each type is given a value and each taxonomy term an additional value. They are added together to give each content node a score. For example, a VOD tagged “Highlight” would have a priority score of 42; a “Recap” Article would be 21. This is recorded in the database with the attached Event Node ID, the Content ID, and the Content changed timestamp.

A cron job runs a drush command every few minutes that polls the EPGs for ones that should be playing right now, and updates this table to include those EPGs, giving them a priority_score of 100. It also cleans house and removes EPG priority records after their time slot is over.

When the Event page is loaded, this query:

  1. SELECT * FROM {event_priority}
  2. WHERE enid = :event_id
  3. ORDER BY priority_score DESC, changed DESC
  4. LIMIT 0,4;

retrieves all the necessary content in one, small, well-indexed, high-performing query. It’s properly ordered based on the determined business rules, and gives all the information we need to load up those nodes.

In essence, we’ve traded a little bit of storage space on the database server for a lot of intensive work on both the web server and the database server – and that’s almost always a win!

Read more about Pac-12 Networks’ redesign in Will O’Beirne’s blog post “A Quick Look at CSS Pre-processing For The Pac-12 Conference’s New Platform Theme”

Joshua Turton

Senior Developer