6 Reasons to train your Large Language Models (LLMs) with structured content
16 Jun 2023
5 mins
At this point there is no question that LLMs will radically change the way enterprises deliver content to employees, customers, partners, and regulators. While it’s clear this change is coming, it’s less clear what it means for how enterprises create and manage their content. A particular question I’m hearing a lot these days is whether enterprises still need to invest in structuring and enriching their content, or whether LLMs can generate satisfactory results from unstructured, unenriched text. After all, the base models from OpenAI, Anthropic, and others do a pretty amazing job considering they are trained on raw text and don’t explicitly consider markup such as HTML or XML tags during the initial training process.
My answer to this question is that investing in structuring and enriching your enterprise content turbocharges the results you can deliver via an enterprise LLM. It’s not an either/or question. Structured content enables LLMs to deliver on their promise.
Garbage-In, Garbage-Out
LLMs suffer from a GIGO (garbage-in, garbage-out) problem. They learn patterns and associations from the data they are trained on. If the training data contains biases, inaccuracies, or low-quality content, the model will learn and replicate those flaws in its responses. LLMs don’t have inherent knowledge or understanding; they rely on the statistical patterns present in the data they were trained on.
Why does this matter in an enterprise knowledge management scenario? Well, the two initial objections to the use of LLMs in the enterprise centered on data protection and accuracy. Let’s look at each of these.
Data protection
It became evident almost immediately that the issue of protecting sensitive enterprise data would get solved quickly. Today, offerings like Amazon Bedrock and Microsoft’s Azure OpenAI Service solve this problem within their private clouds. Enterprises can also license pre-trained foundation or base models to run on-premises, or experiment with open-source models. Data does not have to leave your enterprise perimeter.
Accuracy
For enterprises, it’s the accuracy problem that is more vexing. The tolerance for possible hallucination is much lower in an enterprise knowledge management scenario than in consumer use cases. Explaining to a customer or regulator that the false data your team provided was fabricated by an LLM is not going to fly. This is where GIGO becomes a concern, and where structuring, enriching, and curating enterprise data becomes vital.
First, what does it mean to structure and enrich content? For many enterprises, most content assets are locked up in document formats: .doc, .pdf, and so on. The document itself represents a combination of content, design, and output format. When you author content using a structured model like DITA (Darwin Information Typing Architecture), you instead develop granular, reusable content components that can easily be repurposed for different audiences, channels, form factors, and publication types. As you create these components, you can also enrich them at the component level by adding valuable metadata.
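To make this concrete, here is a minimal sketch (Python, standard library only) of what a structured, enriched component looks like to a downstream pipeline. The topic itself is a hypothetical example, but the prolog/metadata structure follows standard DITA conventions: the component carries its own audience and keyword metadata, separate from the body content.

```python
import xml.etree.ElementTree as ET

# A hypothetical DITA concept topic. The component carries its own metadata
# (audience, taxonomy keywords) in the <prolog>, separate from the body.
DITA_TOPIC = """\
<concept id="reset-password">
  <title>Resetting a user password</title>
  <prolog>
    <metadata>
      <audience type="administrator"/>
      <keywords>
        <keyword>authentication</keyword>
        <keyword>account-management</keyword>
      </keywords>
    </metadata>
  </prolog>
  <conbody>
    <p>Administrators can reset a password from the user management console.</p>
  </conbody>
</concept>
"""

root = ET.fromstring(DITA_TOPIC)

# Extract the granular pieces an indexing or fine-tuning pipeline could use.
component = {
    "id": root.get("id"),
    "title": root.findtext("title"),
    "audience": [a.get("type") for a in root.iter("audience")],
    "keywords": [k.text for k in root.iter("keyword")],
    "body": " ".join(p.text for p in root.iter("p")),
}
print(component)
```

Every component parsed this way arrives with its context attached, which is exactly what the training and retrieval scenarios below can exploit.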
You may say to yourself: okay, I understand why structuring and enriching content adds value, but it seems like a lot of work, and once my enterprise has access to a private LLM, won’t the LLM just be able to work its magic on my unstructured content? Sort of, yes, but if you care about accuracy, you are going to be disappointed. You don’t want GIGO, you want Quality-In Quality-Out, and that’s where the investment in structuring and enriching pays off, particularly for the dataset your enterprise will use to fine-tune whichever foundation or base model it adopts.
Training approaches – fine-tuning versus hybrid search
At this point, it’s worth highlighting that when someone references “training an LLM on enterprise data,” they could be referring to one of two very different approaches.
The first approach is what is known as fine-tuning. This is the approach most people have in mind when they think about “training an LLM” on enterprise data or for a specific task or domain. In fine-tuning, the enterprise’s own data is used to update the weights of the base model using a framework like PyTorch or TensorFlow. The challenge is that this requires a high degree of in-house expertise and can be expensive enough that only the largest organizations can justify it.
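As a rough illustration of what fine-tuning involves, here is a minimal sketch using the Hugging Face transformers library on top of PyTorch. The model name, the two-item corpus, and the hyperparameters are all placeholders; a real project needs a curated enterprise dataset, proper evaluation, and significant GPU capacity.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # stand-in for whichever base model the enterprise licenses
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Structured, vetted enterprise components serialized as training text.
corpus = Dataset.from_dict({"text": [
    "Topic: Resetting a user password. Audience: administrator. "
    "Administrators can reset a password from the user management console.",
    "Topic: Configuring single sign-on. Audience: administrator. "
    "SSO is configured under the identity section of the admin portal.",
]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives causal-LM training: labels are a copy of input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates the base model's weights on the enterprise corpus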
The second approach does not actually involve training a base model at all and is more accurately called hybrid search (closely related to what is often termed retrieval-augmented generation, or RAG). The approach is similar to how Bing has integrated traditional search with LLMs. In the hybrid-search model, an enterprise search engine first finds relevant enterprise data, using some form of cognitive search to identify documents (or components) that contain the user’s query terms. This enterprise information is then assembled into a prompt that is passed to the LLM, which generates a conversational response to the user’s query.
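The flow is simple enough to sketch end to end. In the sketch below, naive keyword-overlap scoring stands in for a real enterprise search engine, and call_llm is a placeholder for whichever completion endpoint the enterprise has deployed; the components and prompt wording are hypothetical.

```python
# A minimal hybrid-search sketch: retrieve vetted components, assemble a
# prompt, hand it to an LLM.

COMPONENTS = [
    {"title": "Resetting a user password",
     "body": "Administrators can reset a password from the user management console."},
    {"title": "Configuring single sign-on",
     "body": "SSO is configured under the identity section of the admin portal."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Rank components by naive term overlap with the query; a real system
    would use an enterprise search engine here."""
    terms = set(query.lower().split())
    return sorted(COMPONENTS,
                  key=lambda c: len(terms & set(c["body"].lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Assemble retrieved, vetted content into the prompt passed to the LLM."""
    context = "\n\n".join(f"{c['title']}:\n{c['body']}" for c in retrieve(query))
    return (f"Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical: swap in your LLM endpoint here

print(build_prompt("How do I reset a password?"))
```

Note that the LLM never sees raw enterprise documents; it only sees the curated context the retrieval step assembles, which is what keeps its answers grounded.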
The speed of innovation in this area makes predictions difficult, but the hybrid search model may end up being more commonly deployed than the fine-tuning approach because it’s simpler and more cost effective for many enterprises.
6 Reasons to use structured content
To return to the topic of this post, the key takeaway is that investing in properly structuring and enriching enterprise data results in substantially improved response accuracy, regardless of whether you pursue a fine-tuning or hybrid search approach.
Here are some ways that having your content in a format like DITA and enriching it with metadata connected to a taxonomy can result in better fine-tuning or hybrid search results and improved LLM performance:
- Granular Information Extraction: DITA divides content into reusable and modular components. Tagging and including metadata at the component level during training enables LLMs to learn and understand these granular units of information. By fine-tuning on structured data, the models develop the ability to extract and process specific components, resulting in better comprehension and more accurate responses. This concept also applies to the hybrid-search scenario, where the enterprise search engine can incorporate vetted component-level content in the prompt passed to the LLM.
- Structure and Context: Markup languages like DITA provide a structured way of representing information. By including markup in the training data, the language model can leverage the hierarchical relationships, semantic meaning, and contextual information encoded in the markup. This helps the model better understand and generate text that adheres to the intended structure and formatting.
- Enhanced Contextual Information: Taxonomies provide a structured way of categorizing and organizing information. By training LLMs on text tagged with metadata from a taxonomy, the models learn the relationships between different concepts, categories, and topics. This enriched contextual information helps the models generate more contextually relevant and accurate responses. Again, the same applies in hybrid search, where the contextual information helps the search engine form a more precise and accurate prompt (see the sketch after this list).
- Personalization: Tagging and including metadata at the component level in DITA also supports personalized content delivery by the LLM. By associating metadata with specific components, LLMs can adapt their responses based on user preferences, interests, or contextual information. This enables the models to generate tailored content that aligns with individual user needs. The same holds true with hybrid search, where personalization happens at the search engine’s pre-processing stage.
- Domain-Specific Knowledge: Training LLMs on text with markup that has been approved and vetted by domain experts exposes the models to specialized terminology, document structures, and formatting conventions unique to those domains. Using taxonomy and metadata familiarizes LLMs with domain-specific categorization and terminology. This leads to more informed, accurate, and precise responses when interacting with content related to that domain.
- Enhanced Document Processing: Structured content allows for the representation of complex document structures, such as tables, lists, footnotes, citations, or multimedia elements. By training on markup-rich data, LLMs can learn to interpret and process these structures more effectively. This leads to improved handling of complex document layouts, better identification of sections or subsections, and more accurate generation of content that adheres to the intended document structure.
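To show how several of these benefits combine in the hybrid-search scenario, here is a small sketch of taxonomy-filtered retrieval. The components, tags, and scoring are hypothetical; the pattern is the point: metadata narrows the candidate set to vetted, in-scope components before anything reaches the prompt, serving both personalization and domain accuracy.

```python
# Taxonomy metadata sharpens retrieval: filter on tags first, then rank.
# All components and tag values below are hypothetical examples.

TAGGED_COMPONENTS = [
    {"body": "Reset a password from the user management console.",
     "audience": "administrator", "topic": "authentication"},
    {"body": "Change your own password from the profile page.",
     "audience": "end-user", "topic": "authentication"},
    {"body": "Export the audit log as a CSV file.",
     "audience": "administrator", "topic": "compliance"},
]

def retrieve(query: str, audience: str, topic: str) -> list[dict]:
    """Narrow by taxonomy tags before scoring, so only in-scope, vetted
    components can ever reach the prompt."""
    candidates = [c for c in TAGGED_COMPONENTS
                  if c["audience"] == audience and c["topic"] == topic]
    terms = set(query.lower().split())
    return sorted(candidates,
                  key=lambda c: len(terms & set(c["body"].lower().split())),
                  reverse=True)

# An end-user asking about passwords never sees administrator-only content.
print(retrieve("how do I change my password", "end-user", "authentication"))
```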
This is a field in the hyper-innovation stage, with transformational potential for enterprises, IF the accuracy problem can be solved or mitigated. Using structured, enriched, and validated enterprise content to augment LLMs is by far the most practical approach to solving the accuracy challenge.
Investing in Quality-In will get you better Quality-Out and ultimately help deliver on the promise of LLMs to increase productivity, improve decision-making, and deliver better customer and employee experiences.
If you would like to learn more about how Tridion can help you, click here.