Gabriele Prato

EpiK-Eval and the Role of Knowledge Consolidation in Language Models
Large Language Models (LLMs) have dramatically transformed the landscape, heralding a significant shift in research focus and reshaping the industry at large. This transformative impact stems primarily from advancements in computational scale rather than from methodological innovations. While scaling up has undeniably enhanced model performance, it is crucial to recognize that several challenges remain unaddressed by scale alone. In this post, we explore one such issue that has been overlooked despite its importance: knowledge consolidation.
Knowledge Consolidation

The training corpora of LLMs are immense, containing a myriad of facts, events, topics, concepts, ideas, and various pieces of knowledge. These elements can relate to one another in many different ways, forming a complex web of dependencies.
However, LLMs do not inherently learn all these dependencies.
This limitation stems from their training objectives. LLMs are trained to predict tokens based on the context within a given text sequence, drawing on the knowledge encoded in their parameters.
As a result, if the content of one training sample does not aid in predicting the content of another, the model receives no signal to establish a relationship between them. Consider two news articles: one announcing a celebrity's wedding to take place in Miami in six months, and another, dated six months later, reporting a hurricane in Florida.
Since the content of one article doesn't directly help in predicting the content of the other, the model does not learn to relate them. Therefore, during inference, the model wouldn't associate the hurricane with the potential impact on the wedding's logistics.
This leads us to the crux of knowledge consolidation: the model is aware of both events (A and B), but it is unaware of the relationship between them.
To overcome this, the model needs to consolidate its knowledge, learning to connect related but disparate pieces of information.
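To make this concrete, here is a minimal PyTorch sketch of a standard causal language-modeling objective (our own illustration, not the EpiK-Eval training code). The loss decomposes into independent per-sequence terms, so no gradient ever ties the content of one document to another's.

```python
# Minimal sketch of causal language modeling: each sequence in the batch
# contributes its own next-token prediction terms, computed independently.
import torch.nn.functional as F

def causal_lm_loss(model, token_ids):
    """Next-token prediction loss for a batch of independent sequences.

    `model` is any decoder mapping token ids to logits; it is a stand-in
    here, not a specific architecture from the paper.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)  # (batch, seq_len - 1, vocab_size)
    # The "wedding" article and the "hurricane" article would simply add
    # their separate loss terms; nothing rewards relating the two.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```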
EpiK-Eval Benchmark

The EpiK-Eval benchmark is designed to expose the knowledge consolidation challenge in LLMs. Our primary objective is to assess LLMs' ability to utilize information distributed across different training samples.

We begin by dividing stories into smaller passages.
Next, we train one model on the unsegmented stories and another on the segmented versions.
The critical test involves querying the latter model with questions that require synthesizing information from multiple story segments.
This approach allows us to compare its performance against that of the model trained on unsegmented stories. The comparison highlights the added challenge of processing information dispersed across multiple training samples as opposed to recalling it from a single source.
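As a rough illustration of the two setups (the stories and variable names below are invented for this post, not taken from the benchmark), the segmented corpus simply replaces each story with its individual pieces, here one sentence per training sample:

```python
# Illustrative construction of the two training corpora.
import random

stories = [
    "Alice went to Luigi's. She ordered the ravioli. She also ordered a lemonade.",
    "Bob planted a tree on Monday. It rained all week. The tree grew quickly.",
]

# Unsegmented setup: one training sample per full story.
unsegmented_samples = list(stories)

# Segmented setup: one training sample per sentence, with story order lost.
segmented_samples = [
    sentence.strip() + "."
    for story in stories
    for sentence in story.split(".")
    if sentence.strip()
]
random.shuffle(segmented_samples)
```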

Unlike in multi-hop question answering, our models rely exclusively on the information encoded in their parameters, without access to the original documents.

For practical reasons1, we use self-generated short stories, each only a few sentences long, and segment them by sentence. We train the models using standard pre-training objectives, such as causal or masked language modeling. Here's an illustrative example: a story about a person visiting a restaurant and ordering items.
In our question-answer format, the model first has to recall the entire story, testing its ability to consolidate and recall the learned segments. This is followed by a reasoning statement and a conclusive answer.
To acclimate models to this format, we incorporate question-answer example pairs in the training dataset.
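Schematically, a question-answer training sample looks something like the following (the wording is our own illustration; the benchmark's exact prompts differ), with the target first restating the full story, then giving a reasoning step, then the final answer:

```python
# Hypothetical question-answer pair in the recall / reasoning / answer format.
qa_example = (
    "Question: How many items did Alice order at Luigi's?\n"
    # Recall: the model must reproduce the entire story.
    "Alice went to Luigi's. She ordered the ravioli. She also ordered a lemonade.\n"
    # Reasoning: a short statement combining the recalled facts.
    "Alice ordered the ravioli and a lemonade, so she ordered 2 items.\n"
    # Final answer.
    "Answer: 2"
)
```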
1. Ideally, we would segment longer texts, such as books, but computational constraints prevent us from doing so.
Results

Our findings reveal a significant disparity in performance between models trained on segmented stories and those trained on unsegmented stories. The following plot illustrates this difference: models trained on segmented stories (orange) underperform compared to those trained on unsegmented stories (blue), with the y-axis representing the percentage of correctly answered questions.
Breaking down our analysis into the three components—recall, reasoning, and the final answer—yields the following insights. Firstly, we assess the accuracy of recall alone. Here, a notable performance gap is observed, favoring models trained on unsegmented stories.
Next, focusing on instances where recall is accurate, we evaluate the correctness of reasoning.
The results show similar performance levels between both setups. Although models trained on segmented stories exhibit marginally better reasoning, we attribute this to variance. This is because the subset of correctly recalled answers is smaller for models trained on segmented stories, as shown in the Recall plot. Thus, we do not infer any enhancement in reasoning capabilities from segmented story training.

Finally, we analyze the subset of answers where both recall and reasoning are correct, examining the accuracy of the final answer.
Once again, performance is comparable between both setups, and the same variance argument applies.
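The conditional breakdown above can be summarized with the following sketch (the field names are ours, not the benchmark's output format): recall accuracy is measured over all answers, reasoning accuracy only over answers with correct recall, and final-answer accuracy only where both recall and reasoning are correct.

```python
# Sketch of the conditional evaluation breakdown.
def breakdown(results):
    """`results` is a list of dicts with boolean keys
    'recall_ok', 'reasoning_ok', and 'answer_ok'."""
    recalled = [r for r in results if r["recall_ok"]]
    reasoned = [r for r in recalled if r["reasoning_ok"]]
    return {
        "recall_acc": len(recalled) / len(results),
        "reasoning_acc_given_recall": len(reasoned) / max(len(recalled), 1),
        "answer_acc_given_both":
            sum(r["answer_ok"] for r in reasoned) / max(len(reasoned), 1),
    }
```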

Overall, our results indicate that the primary challenge for models trained on segmented stories lies in accurately recalling and consolidating knowledge, unlike their unsegmented counterparts. Despite both setups showing perfect memorization of training samples—as evidenced by a near-zero hallucination rate2 during training—
a different trend emerges during inference. Models trained on segmented stories exhibit a higher rate of hallucinations in their recall responses.
Interestingly, these models not only struggle to piece the story segments together but also begin introducing hallucinations within sentences they had previously memorized flawlessly. This phenomenon occurs exclusively when they are tasked with reconstructing the entire story, not when recalling individual sentences.
2. We define the hallucination rate as the number of recalled sentences that contain an error (i.e., that do not match the actual sentence in the narrative) divided by the total number of recalled sentences.
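In code, the metric from footnote 2 amounts to the following simplified sketch (it assumes recalled sentences are already aligned one-to-one with the story's sentences, which glosses over how length mismatches are handled):

```python
# Hallucination rate: recalled sentences that differ from the corresponding
# story sentence, divided by all recalled sentences.
def hallucination_rate(recalled_sentences, reference_sentences):
    errors = sum(
        rec != ref for rec, ref in zip(recalled_sentences, reference_sentences)
    )
    return errors / len(recalled_sentences)
```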
Implications and Future Directions

A pertinent question arising from our study is whether scaling up models will address the challenge of knowledge consolidation. While we observe general performance improvements with increased scale in both the segmented and unsegmented training setups, a key indicator that scale resolves knowledge consolidation would be a disproportionately higher rate of improvement for models trained on segmented stories.
Though this is possible at larger scales, and certainly warrants further investigation with larger models, the inherent limitations of the training objectives, which do not explicitly encourage learning inter-sample dependencies, lead us to believe that scale alone may not suffice. However, it is worth considering that a sufficiently large and diverse training corpus might inadvertently cover the essential dependencies, mitigating this issue in practice. This hypothesis requires deeper exploration.

Another query might be: why not employ retrieval tools to supplement model recall? While feasible, and already implemented in systems like ChatGPT, the fundamental challenge persists.
In the long term, cutting-edge retrieval systems are likely to be AI-driven rather than manually engineered. Whether it is a language model or a retrieval system, the AI needs to comprehend and navigate the intricate web of relationships among diverse data types: topics, facts, books, news articles, blog posts, papers, and so on. Given that the only ground truth is the documents themselves,
and not the inter-document relationships, any system we deploy must autonomously learn and consolidate these relationships from the dataset.
Conclusion

Our study with the EpiK-Eval benchmark highlights a significant limitation of Large Language Models (LLMs): they do not inherently learn all dependencies present in their training data, even though such dependencies are crucial for effective problem-solving. This realization underscores the need for continued research into this phenomenon. More importantly, it calls for the development of innovative methods to address this fundamental challenge. Successfully consolidating the knowledge within LLMs could greatly enhance their utility across various scales and applications.
For a comprehensive understanding of our findings and methodology, we encourage delving into the full EpiK-Eval paper. For those interested in hands-on exploration or further research, our code is available on GitHub. We welcome collaboration and innovation in this exciting field and look forward to seeing how the community advances these concepts.
BibTeX
@ARTICLE{2023arXiv231015372P,
       author = {{Prato}, Gabriele and {Huang}, Jerry and {Parthasarathi}, Prasanna and {Sodhani}, Shagun and {Chandar}, Sarath},
        title = "{EpiK-Eval: Evaluation for Language Models as Epistemic Models}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computation and Language, Computer Science - Artificial Intelligence},
         year = 2023,
        month = oct,
          eid = {arXiv:2310.15372},
        pages = {arXiv:2310.15372},
          doi = {10.48550/arXiv.2310.15372},
archivePrefix = {arXiv},
       eprint = {2310.15372},
 primaryClass = {cs.CL},
       adsurl = {https://ui.adsabs.harvard.edu/abs/2023arXiv231015372P},
      adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}