Creating a Taxonomy to Drive Your AI

By Josh Bowers and Denise Parris, Co-Founders

AI-driven personalization, regardless of the statistical method used, measures the relevance of a label to a user. How, and how well, these labels are defined determines the quality and sophistication of your personalization efforts. Creating your label structure means building a taxonomy that defines the topics, themes, and ideas represented in your product and content mix.

At Pavilion, our team has spent over 10 years advancing research methods for taxonomy creation and for the systematic evaluation of content for user engagement, two skills that form the foundation of meaningful recommendations. Building a taxonomy is the starting point for organizations looking to leverage personalization as a point of competitive advantage.

Gaining a deeper understanding of the topics, themes, and ideas associated with your products, services, or content requires categorizing information based on traits, characteristics, sentiments, and other associated information. This requires training in working with qualitative data, i.e., descriptive and conceptual information. Qualitative research is the scientific method for understanding the meanings of things, contexts, situations, and actions.

However, it is exceedingly rare for someone to be trained in both qualitative and quantitative methods. It is no accident that most recommendation engines are built on quantitative data: information that can be counted, measured, and expressed in numbers, such as views, counts, and product categories (e.g., 1 = short sleeve vs. 2 = long sleeve, or 3 = insulated vs. 4 = non-insulated). The training that data scientists and developers receive for analyzing data consists of applying quantitative methods. The only training most college graduates have in understanding the meaning of text is writing a book report or a literature review on a specific topic. Asked to conduct qualitative research, a data scientist will typically count the number of times a word appears, which is not a valid method for analyzing content and deriving meaning.

Understanding deeper-level associations requires human intelligence and training in both qualitative and quantitative methods. It is the fusing of qualitative and quantitative methods that powers AI to make meaningful recommendations.

Taxonomies are where these methods meet. A taxonomy serves as a semantic layer that structures your organization’s content (i.e., assigns content to labels) and can potentially be used to evaluate content from third-party sources.

Wait… What is a semantic layer? Go ahead and Google it. Wikipedia defines it as “a business representation of corporate data that helps end users access data autonomously using common business terms.”

In AI terms, this is not the numbers, math models, or indexing codes that enable the computer to crunch the data. Instead, it is the words and phrases that define the topics, themes, and ideas from a user’s perspective. The semantic layer is a representation (i.e., an abstraction) of the meaning of the data.

Thus, the first step in building a taxonomy is identifying the core topics, themes, and ideas represented in your digital assets, products, and services, as well as in third-party data that your customers would find interesting and valuable. This requires reading and analyzing words. To help you and your team begin building a taxonomy, and to provide insight into our methodological processes at Pavilion, we provide a high-level breakdown of how to do content analysis.

Content analysis starts with an inductive process. The initial coding (the labeling of preliminary categories) is open-ended. In the second phase of coding, it becomes apparent that specific categories are more common and are related to other categories. The research thus moves from specific instances (i.e., units of analysis) to multiple instances in which the researcher finds common topics, themes, and ideas that, when combined, form a defined label representing those instances. In this process the researcher reviews all the content and generalizes the initial topics, themes, and ideas, which makes it possible to draw logical connections between them.

The Pavilion team can do all the work to build your taxonomy. However, we have learned that it is critical for your team to get hands-on in the process, both to understand the methodology and to see their knowledge and expertise infused into the AI recommendation engine. This builds trust in the method and makes it transparent, and it gives your team ownership in the success of the AI solution. Next, we present how to build a taxonomy.

Step 1: The Review

If you have previously defined user personas, those can be a great place to start. Pull out your user personas, identify who those people are, and list what they care about in bullet-point format. This is a great way to compile and identify your initial topics, themes, and ideas (i.e., higher-order concepts).

When you write a book review, the professor asks: What are the main themes of the book? These main themes are your higher-order concepts.

However, as you read, you realize that within each major theme are subthemes, which in qualitative research form your second-order concepts. The process of determining themes and subthemes in your content is called “coding” by qualitative researchers.

As you code and identify topics, themes, and ideas, a tree will often form with different levels of codes (first level, second level, and third level). These data trees start to illustrate how different concepts are associated. Your higher-level concepts should be evident in the user personas your team already created. Researchers often start with topics, themes, and ideas in mind and explore the data to see if they are correct, while other topics, themes, and ideas emerge from the data (i.e., you find and identify them as you read). We recommend starting with 3-5 high-level topics, themes, and ideas based on your user personas.
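
One way to picture such a tree of codes is as a nested mapping. The sketch below is illustrative only: the theme names are hypothetical examples for an outdoor retailer, and the `flatten` helper is an assumption made for this example, not part of any Pavilion tool.

```python
# Hypothetical three-level code tree; every theme name is made up for illustration.
code_tree = {
    "Outdoor Recreation": {                        # first-level (higher-order) concept
        "Hiking": ["Day Hikes", "Backpacking"],    # second level -> third level
        "Camping": ["Car Camping", "Backcountry Camping"],
    },
    "Travel": {
        "Road Trips": [],                          # no third-level codes yet
    },
}


def flatten(tree):
    """Return every path through the tree as a tuple of code names."""
    paths = []
    for top, subthemes in tree.items():
        if not subthemes:
            paths.append((top,))
        for sub, leaves in subthemes.items():
            if not leaves:
                paths.append((top, sub))
            for leaf in leaves:
                paths.append((top, sub, leaf))
    return paths


paths = flatten(code_tree)
for path in paths:
    print(" > ".join(path))
```

Printing the paths this way makes it easy to share the emerging hierarchy with stakeholders who were not part of the coding.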

Next, gather all the content that is available to you.

If you want to go old school, print as much as you can, buy yourself a big pack of highlighters in as many different colors as possible, and start highlighting! Assign colors to concepts and ideas, and highlight every passage representing that idea. If you want to geek out, search Google for “Open Axial Coding”, the scholarly research method you are employing.

At Pavilion we find value in going old school with clients too, as physically coding the data can make the themes, topics, and ideas tangible. There are also some very useful tools we use, such as NVivo or Dedoose, depending on the project and the size of the research team.

As you code you will discover new concepts and ideas that were not on your original list. Add them to your list and incorporate them into your coding moving forward. If you are two-thirds or more of the way through your review material and are still frequently finding new concepts, you will need to revisit the major themes you captured at the beginning of the process to evaluate whether they are still accurate and whether you need to make changes. If you do make changes, you will need to re-review the content to apply them. Given the time and effort required to organize your coding, we highly recommend working with a Pavilion researcher, who will analyze and validate the themes before beginning the analysis and can take the lead on performing their verification.
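
The two-thirds rule of thumb can be tracked with a simple log. The sketch below assumes you record, for each document in reading order, the set of codes you applied to it. The function name and the sample codes are hypothetical, and how many late-appearing codes count as "frequently" remains a judgment call.

```python
def new_codes_late_in_review(coded_docs):
    """Return codes first seen after two-thirds of the documents were reviewed.

    coded_docs: list of sets of codes, in the order the documents were read.
    """
    seen = set()
    late = []
    # Integer arithmetic avoids floating-point surprises at the cutoff.
    start = (2 * len(coded_docs)) // 3
    for index, codes in enumerate(coded_docs):
        fresh = codes - seen              # codes never applied before this document
        if index >= start:
            late.extend(sorted(fresh))
        seen |= codes
    return late


# Hypothetical coding log for six documents.
review_log = [
    {"hiking", "camping"},
    {"hiking"},
    {"camping", "travel"},
    {"travel"},
    {"hiking", "gear reviews"},
    {"camping", "trail food"},
]
late_codes = new_codes_late_in_review(review_log)
print(late_codes)
```

A long list here late in the review is the signal, described above, to revisit the major themes before continuing.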

Engage a broad circle of stakeholders to review the resulting concept list. Inside your organization, stop by people’s desks with your list of concepts and ask, “Hey, what else do our customers talk about?” To the furthest extent possible, also engage your customers in the coding process, from forming the initial codes to verification.

Step 2: The Cull

After Step 1 you should have a long list of topics, themes, and ideas. The next task is to identify which are variations of the same idea and which have a parent-child relationship. Put another way: which concepts cannot exist without another existing first?
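
As a minimal sketch of what the cull produces, the example below collapses variations into one label and records parent-child links. The code names, the alias decisions, and the `lineage` helper are all hypothetical. Deciding which codes are variations of each other remains an editorial judgment, not something the code infers.

```python
# Hypothetical output of Step 1: a raw list of codes, some of them near-duplicates.
raw_codes = ["Hiking", "Hikes", "Day Hikes", "Camping", "Campgrounds", "Trekking"]

# Variations of the same idea collapse into one canonical label.
aliases = {"Hikes": "Hiking", "Trekking": "Hiking", "Campgrounds": "Camping"}

# Child -> parent: a child concept cannot exist without its parent existing first.
parents = {
    "Day Hikes": "Hiking",
    "Hiking": "Outdoor Recreation",
    "Camping": "Outdoor Recreation",
}

culled = sorted({aliases.get(code, code) for code in raw_codes})


def lineage(concept):
    """Walk from a concept up to its top-level ancestor."""
    chain = [concept]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


print(culled)
print(" < ".join(lineage("Day Hikes")))
```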

Next, ask which of the identified themes can be defined purely by their vocabulary (i.e., a set of familiar words). Computers are terrible at inferring from context. So, if you have defined a concept that cannot be identified from the words or phrases used to describe it, a computer will be unable to identify it with the consistency you require.

Step 3: The Definition

A taxonomist’s goal is to make each definition a pure concept. For your purposes, you need to determine the following:

  • What label best describes your concept?
  • What is the specific listing of vocabulary which can be used to identify this concept?

For example, if our concept were outdoor recreation, we would specify vocabulary like: Hiking, Waterfalls, Backpacking, Trail Running, Nature Photography, Wilderness, Camp, Camping, etc.

You want to be sure to incorporate plural, future, and past tense versions of each word and other variants, for example: Hikes, Hiking, hike, hiked.

  • Ensure the language used to define concepts is unique. Once these labels are used for AI-driven personalization, sloppy definitions are what create poor recommendations.
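
A label plus its vocabulary list can be captured in a small data structure. The sketch below uses the outdoor recreation vocabulary from the example above with a few variants added. The `compile_concept` and `find_terms` helpers are hypothetical names, and whole-word, case-insensitive matching is an assumption made for this example.

```python
import re

# Vocabulary from the outdoor recreation example, with tense and plural variants.
CONCEPTS = {
    "Outdoor Recreation": [
        "hike", "hikes", "hiked", "hiking",
        "waterfall", "waterfalls",
        "backpacking", "trail running", "nature photography",
        "wilderness", "camp", "camps", "camped", "camping",
    ],
}


def compile_concept(terms):
    """Build one case-insensitive, whole-word pattern for a concept's vocabulary."""
    # Longest terms first so multi-word phrases are tried before shorter terms.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


def find_terms(text, pattern):
    """Return the vocabulary matches found in the text, lowercased."""
    return [match.group(0).lower() for match in pattern.finditer(text)]


pattern = compile_concept(CONCEPTS["Outdoor Recreation"])
matches = find_terms("We hiked to two waterfalls, then went camping.", pattern)
print(matches)
```

Matching on whole words keeps "camp" from firing on an unrelated word such as "campaign", which is one small way to keep a concept's language unique.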

Once you feel comfortable that you have completed your definition work, test the definitions by filtering the content you compiled in Step 1 with your concept definitions. This can be done by hand, but we recommend applying “automagic”. You want to verify that your content is attributed to the same concepts you identified when you conducted the review by hand. When working with Pavilion, we automatically test the definitions against your data to measure whether the correct labels are being applied, as well as the extent of cross labeling that occurs.
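
A rough version of this check can be sketched in a few lines. This is not Pavilion's automated test. It is a hypothetical illustration that assumes you have, for each content piece, the labels from your hand review and the labels produced by your definitions. Cross labeling is interpreted here as a piece receiving a label the hand review did not assign, which is an assumption about the term.

```python
def evaluate_definitions(hand_labels, auto_labels):
    """Compare hand-coded labels with labels produced by the concept definitions.

    Both arguments map a content id to a set of labels; hand_labels must be non-empty.
    Returns (agreement_rate, cross_label_rate).
    """
    total = len(hand_labels)
    agree = 0
    cross = 0
    for content_id, expected in hand_labels.items():
        produced = auto_labels.get(content_id, set())
        if produced == expected:
            agree += 1          # automatic labels match the hand coding exactly
        if produced - expected:
            cross += 1          # at least one label the hand review did not assign
    return agree / total, cross / total


# Hypothetical results for four content pieces.
hand = {
    "a": {"Outdoor Recreation"},
    "b": {"Travel"},
    "c": {"Outdoor Recreation"},
    "d": {"Travel"},
}
auto = {
    "a": {"Outdoor Recreation"},
    "b": {"Travel", "Outdoor Recreation"},
    "c": {"Outdoor Recreation"},
    "d": set(),
}
agreement_rate, cross_label_rate = evaluate_definitions(hand, auto)
print(agreement_rate, cross_label_rate)
```

Pieces that disagree are the ones worth reading again: they usually point to vocabulary that is shared between two concepts or missing from a definition.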

A question you will have to ask yourself is whether your concept names will ever be used in navigation or in a customer-facing capacity. If so, it is worth the effort to use consistent formatting and good labels so that you do not have to come back and redo the labeling in the future. You would be surprised by how much work this creates.

Step 4: Defining Assignment Criteria

Assignment criteria determine when a label is applied and are just as important as the label definition itself. Natural text is complex: it is common for the same or similar vocabulary to describe two very different concepts. Because you are going to present users with recommendations tied to your taxonomy, make every effort to ensure labels are correctly associated.

How rigorous a bar you set for assignment depends entirely on your context and data sources. For example, if you are only working with content your organization produces, you probably do not need to set a very high bar for label assignment. However, if you want to incorporate third-party content from the public web in your email campaigns, you will need to set a very high bar. Regardless of the data source, we strongly recommend avoiding straight keyword assignment. Even when working with your organization’s own content, it will result in sloppy labeling. Alternatives include:

  • Density of concepts:
    • Most useful when assigning content to major themes or subthemes. In this method we measure how frequently subthemes occur within a content piece. For a label to be assigned, the content must meet or exceed the specified density threshold.
  • Triangulation of concept identifiers:
    • Most useful when sourcing content from high-volume, low-quality sources like the public web. This method requires concept vocabulary to appear in multiple locations, such as the title and body. It can be further enhanced by also applying a concept density requirement and by running images through a computer vision service to verify that the image metadata is consistent with the concept assignment.
  • Concept associations (i.e., must have concept x and concept y):
    • This method works best when working in highly interconnected environments and where the vocabulary used for any one concept may not be particularly unique. When using this method, we require that the content possess multiple concepts which have been defined as related. Label assignment is determined by which concept has the greatest density.
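
The three methods above can be sketched as small functions. This is a simplified illustration, not Pavilion's implementation. Density is assumed here to mean vocabulary hits per word of text, the function names and sample vocabularies are hypothetical, and the image-verification step is left out.

```python
import re


def count_terms(text, vocabulary):
    """Count whole-word, case-insensitive vocabulary hits in the text."""
    body = "|".join(re.escape(term) for term in vocabulary)
    pattern = re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)
    return len(pattern.findall(text))


def density(text, vocabulary):
    """Vocabulary hits per word of text (0.0 for empty text)."""
    words = len(text.split())
    return count_terms(text, vocabulary) / words if words else 0.0


def assign_by_density(text, vocabulary, min_density):
    """Density of concepts: assign when density meets or exceeds the threshold."""
    return density(text, vocabulary) >= min_density


def assign_by_triangulation(title, body, vocabulary):
    """Triangulation: require concept vocabulary in both the title and the body."""
    return count_terms(title, vocabulary) > 0 and count_terms(body, vocabulary) > 0


def assign_by_association(text, concepts, related):
    """Concept associations: all related concepts must appear; the densest one wins."""
    scores = {name: density(text, concepts[name]) for name in related}
    if any(score == 0 for score in scores.values()):
        return None
    return max(scores, key=scores.get)


# Hypothetical vocabularies and content.
concepts = {
    "Hiking": ["hike", "hiking", "trail"],
    "Camping": ["camp", "camping", "tent"],
}
title = "Hiking and camping near the falls"
body = "We pitched the tent at dusk and started hiking the trail at dawn"

print(assign_by_density(body, concepts["Hiking"], 0.1))
print(assign_by_triangulation(title, body, concepts["Camping"]))
print(assign_by_association(body, concepts, ["Hiking", "Camping"]))
```

In practice the threshold passed as `min_density` is where the "how high a bar" decision lives: a higher value suits public web content, a lower one suits content your organization produced.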

Yes, developing a taxonomy structure to drive your AI-powered personalization efforts is a lot of work. That’s why product recommendations are driven off the catalog metadata and content recommendations off the library metadata, and why inferring recommendations between content and products is nearly unsupported. Putting in the work to define a taxonomy structure specifically for personalization is the only way to create a personalization experience that takes a holistic view of the user. There is no amount of data wizardry that can match the benefits of knowing and defining the why and what behind a user’s interaction.

At Pavilion, our research team specializes in the development of taxonomies for user engagement and in systematic review methods for driving concept definitions. Schedule a consultation and let’s discuss your project.