Recently, I revisited the publishing system we built for Thomson-Reuters' London Olympics coverage. One of the features I reviewed was the taxonomy processing in the content ingestion engine, which we built to take in content feeds from Reuters' wire service content management and routing system. When you are in the weeds of building a system, it's hard to appreciate its complexities. It was instructive to return to the site months after launch and gain a deeper appreciation for the challenges we faced in building publishing engines that processed thousands of assets per day throughout the Games.
The application of the taxonomies was a multi-layered process that progressively applied terms to the article nodes in several distinct steps:
- Sports codes (for example, "Athletics" or "Basketball") were parsed out of a series of tags in the article XML and matched against Sport records pulled from the third-party Olympic data provider. The standard Olympic codes were included when the Sport and Event records were imported into the database during development, and the article codes were mapped to these.
- In some cases, the codes were instead mapped against a local table of the alternative Sport codes that photographers used internally, so that these alternative publishing paths would result in equivalent mappings.
- Events were also included in the tags within the XML, but not always.
- The slugline was crafted to include sport, event, and match information, although only the match information was parsed out.
- Athlete associations were applied by passing the text elements (title, caption, article body, summaries) through Thomson-Reuters' OpenCalais semantic tagging engine and pulling 'people' terms from its library of terms. Any person reference returned that matched an Athlete record created from the Olympics data was applied to the article as an Athlete association.
- Countries were NOT pulled using OpenCalais, although those mappings were available - the concern was that far too many false positives would be applied for Great Britain, given that nearly every article contained references to the host country. Instead, if Athlete associations were obtained, we queried the Athlete record for the Country with which the athlete was affiliated, and applied that reference to the article.
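The layering above can be sketched in a few lines of Python. This is only an illustration of the order of operations, not the code we shipped: the lookup tables, codes, athlete names, and the Article shape are all invented for the example, and the 'people' list stands in for whatever a semantic tagger returns.

```python
from dataclasses import dataclass, field

# All codes, names, and table contents below are invented for illustration.
SPORTS = {"AT": "Athletics", "BK": "Basketball"}          # standard codes -> Sport
ALT_SPORT_CODES = {"ATH": "AT", "BASK": "BK"}             # photographers' codes -> standard codes
ATHLETES = {"Jane Example": "GBR", "John Sample": "USA"}  # athlete name -> country


@dataclass
class Article:
    sport_codes: list
    people: list  # 'people' terms returned by a semantic tagger (assumed shape)
    terms: dict = field(default_factory=dict)


def resolve_sport(code):
    """Match a code against the standard table, then fall back to the alternate table."""
    if code in SPORTS:
        return SPORTS[code]
    standard = ALT_SPORT_CODES.get(code)
    return SPORTS.get(standard) if standard else None


def apply_taxonomy(article):
    """Apply sports, then athletes, then countries derived from the athletes."""
    sports = [resolve_sport(code) for code in article.sport_codes]
    article.terms["sports"] = sorted({s for s in sports if s})
    # Only people who match a known Athlete record become associations.
    athletes = [p for p in article.people if p in ATHLETES]
    article.terms["athletes"] = athletes
    # Countries come from the matched athletes, never from the tagger directly.
    article.terms["countries"] = sorted({ATHLETES[a] for a in athletes})
    return article
```

Note that a tagger result such as "Great Britain" is simply ignored here, because it is not an Athlete record; the country term only appears when a matched athlete carries it.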
Although aspects of this process were worked out as requirements changed and evolved (in particular, we discovered relatively late that photographers were using an alternate standard for sports tagging), the system was ultimately successful because we had mapped out the process well before beginning development and understood the complexities inherent in Reuters' content model.
It seems elementary that these things would be worked out ahead of time, but requirements evolve, and sometimes you just have to roll with the changes in order to ensure the success of the project. What makes this process work is a sound content strategy.
Data Informs Process
We had many sessions where we discussed potential ways of mapping data into the system, and a number of alternatives were rejected because they left too many potential holes in the processes for managing the data. Make sure you get a look at production-level data as early as possible in the project, and make sure your technical leads have a chance to work through any potential issues with decoding and processing it. If you can see ahead of time that there are basic compatibility issues between what should be relatable data points in different third-party data feeds, then there is still time to get the data altered or, failing that, to devise work-arounds, alternative mappings, or transformations using contextual clues in the data.
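One cheap way to surface those compatibility issues early is to run a sample of production data from one feed against the reference records from another and count what fails to match. The sketch below assumes both feeds have already been parsed into lists of dictionaries; the function name and the "sport" key are hypothetical.

```python
def unmatched_values(feed_records, reference_records, key):
    """Count values of `key` in the feed that never appear in the reference data.

    Records missing the key entirely are counted under None.
    """
    reference = {r[key] for r in reference_records if key in r}
    unmatched = {}
    for record in feed_records:
        value = record.get(key)
        if value not in reference:
            unmatched[value] = unmatched.get(value, 0) + 1
    return unmatched


# Invented sample data: one feed uses "ATH" where the reference uses "AT".
feed = [{"sport": "AT"}, {"sport": "ATH"}, {"sport": "ATH"}, {}]
reference = [{"sport": "AT"}, {"sport": "BK"}]
report = unmatched_values(feed, reference, "sport")
```

A report dominated by one or two values usually points to an alternate coding standard that can be handled with a mapping table; a long tail of one-off values suggests the source data itself needs attention.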
Additional processing steps can be applied to handle systemic issues, as we did by using OpenCalais to obtain athlete associations before using the athletes to create country associations. Semantic tagging can also handle cases where you know that a key piece of information might be missing from the original article, but an educated guess can be made by seeing which subjects and terms the parsing pulls out. For example, if a set of articles is missing top-line mappings to sections within a larger news site, OpenCalais or a similar technology can tell you which vocabularies each article associates with most strongly. References to sports teams and athletes would indicate a sports article, and references to members of Congress would place it within a politics vertical.
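As a rough sketch of that kind of educated guess, assume the tagger returns (entity type, relevance) pairs for each article. The entity type names, the section mapping, and the threshold below are placeholders for illustration, not OpenCalais's actual response format.

```python
from collections import Counter

# Hypothetical mapping from tagger entity types to site sections.
SECTION_BY_ENTITY_TYPE = {
    "Athlete": "sports",
    "SportsTeam": "sports",
    "Politician": "politics",
    "Legislature": "politics",
}


def infer_section(entities, min_score=1.0):
    """Pick the section whose entity types carry the most total relevance.

    `entities` is an iterable of (entity_type, relevance) pairs. Returns None
    when nothing maps to a section or the best score is below `min_score`.
    """
    scores = Counter()
    for entity_type, relevance in entities:
        section = SECTION_BY_ENTITY_TYPE.get(entity_type)
        if section:
            scores[section] += relevance
    if not scores:
        return None
    section, score = scores.most_common(1)[0]
    return section if score >= min_score else None
```

The threshold matters: returning no guess for a weakly tagged article is usually better than filing it in the wrong section, since an unmapped article can still be caught by an editor.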
Sometimes it's simpler to accept that weaknesses in the data can be more easily handled by empowering the client with smart tools or smarter business processes. If the problems can be isolated to an easily identifiable subset of content, those articles might be routed to a group of editors whose purview includes remapping the missing metadata. If you know that there are systemic weaknesses in how taxonomies are applied in general, and that a certain percentage of articles will as a matter of course be missing terms, you can work the creation of more sophisticated taxonomy management tools into the budget to give editors more immediate access to the taxonomy. If your stakeholders decide that the incidence of bad data is best solved by leaning on their editors and writers to use their internal taxonomies consistently and correctly, they'll start laying down the law as soon as you determine with them that this is the most efficient and promising route to better online content.
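Routing the problem articles to editors can be as simple as checking each article for required vocabularies after ingestion. The following is a minimal sketch under the assumption that each article's terms are held in a dictionary keyed by vocabulary name; the required vocabularies and function names are made up for the example.

```python
# Hypothetical rule: every article should carry terms in these vocabularies.
REQUIRED_VOCABULARIES = ("sports", "athletes")


def missing_vocabularies(terms):
    """Return the required vocabularies that have no terms applied."""
    return [name for name in REQUIRED_VOCABULARIES if not terms.get(name)]


def route(articles):
    """Split {article_id: terms} into a publish list and an editor review queue.

    The review queue records which vocabularies each article is missing, so
    editors can see at a glance what needs remapping.
    """
    publish, review = [], {}
    for article_id, terms in articles.items():
        gaps = missing_vocabularies(terms)
        if gaps:
            review[article_id] = gaps
        else:
            publish.append(article_id)
    return publish, review
```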
Try to Break the Content
Key to all of this are the conversations you have with the client, where you work through their publishing workflows, their sources of data, and the intersections between the two. This needs to go beyond gathering requirements and documenting user stories - you need to try to break the system. Brainstorm about worst-case scenarios. Let the client talk about their worst fears regarding the system. Poke holes in their ideas. Let them challenge you on yours, and be prepared to walk through any implementations you have in mind. You'll be much better prepared for the unexpected if you try to narrow down the possibilities for what that might be.