Benchmark: Text2Graph methods across multiple LLMs

12 models compared · 3 prompting methods · 19 cross-domain ontologies

Introduction

Many companies have troves of data sitting idle and unexplored. Unstructured text documents can contain important information for your company, but that knowledge is hard to access digitally or to query the way we would query a database. We can take this a step further and use a graph database, where the graph stores the relationships found in the text as additional data points. This strengthens the queryability of the data and surfaces connections that basic Natural Language Processing methods may miss.

Lettria’s Text2Graph is an innovative tool that reads your unstructured data and imports it into a graph database, which makes the data queryable and lets it serve as the knowledge source for Retrieval Augmented Generation (RAG) in generative AI chatbots. In this post, we examine the evolution of the Text2Graph models and ontologies, to be transparent about how we settled on our current system.

As we built Text2Graph, we tried many different approaches to generate the highest-quality data for our structured databases. High-quality data with strong connections is critical when structuring unstructured data. Our initial iteration used ontologies: formal representations of the relationships in the files to be examined. Our ontologies are stored in RDF files. RDF describes how a graph of objects is interconnected, typically with triples: the subject, the property (or relationship) and the object. For example, “Alice knows Bob” might be represented as follows:

```
<http://example.org/person/Alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/person/Bob> .
```

This worked quite well: it was fast to implement, and the results were independent of the ontologies provided. However, as seen above, RDF files carry a lot of overhead: extra text that is not really required to complete the job. This extra text had to be included in our prompts, leaving little room in the prompt for the documents themselves. And since the ontology had to be supplied each time, the process was slow. Finally, the resulting RDF required additional processing to ensure that it was valid.

Our next iteration of Text2Graph dropped the bulky RDF syntax and converted the RDF into JSON triples. The triple above becomes:

```
{
  "triples": [
    {
      "sub": "Alice",
      "rel": "knows",
      "obj": "Bob"
    },
    {
      ...
    }
  ]
}
```
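As a rough illustration of that conversion, the sketch below shortens each IRI in the “Alice knows Bob” triple to its last segment and emits the compact JSON shape. The helper names `local_name` and `to_compact` are hypothetical and are not part of Text2Graph; a real pipeline would likely need to handle prefixes and literals more carefully.

```python
import json


def local_name(iri: str) -> str:
    """Return the last fragment or path segment of an IRI (hypothetical helper)."""
    iri = iri.strip().strip('<>"')
    # Split on '#' first (fragment-style IRIs), then on '/' (path-style IRIs).
    return iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def to_compact(rdf_triples):
    """Convert (subject, predicate, object) IRI tuples into the compact JSON shape."""
    return {
        "triples": [
            {"sub": local_name(s), "rel": local_name(p), "obj": local_name(o)}
            for s, p, o in rdf_triples
        ]
    }


rdf_triples = [
    (
        "http://example.org/person/Alice",
        "http://xmlns.com/foaf/0.1/knows",
        "http://example.org/person/Bob",
    )
]

compact = to_compact(rdf_triples)
print(json.dumps(compact, indent=2))
```

Dropping the namespaces is what saves prompt space, at the cost of losing the information needed to tell apart two terms that share a local name.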

This greatly reduces the input size Text2Graph requires, leaving more room for the text itself. With this update, we also tried several prompting methods, from zero-shot to few-shot.

What are zero-shot and few-shot prompting?

When prompting a model, one can include examples in the prompt. In zero-shot prompting, the model performs the task without any examples. In few-shot prompting, a small number of examples are provided in the prompt. In our testing of Text2Graph, we used both manually created examples and examples generated by LLMs.
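To make the difference concrete, here is a minimal sketch of a prompt builder that produces a zero-shot prompt when no examples are passed and a few-shot prompt otherwise. The prompt wording, the `build_prompt` function, and the one-line ontology string are all assumptions made for illustration; they are not the prompts used in Text2Graph.

```python
import json


def build_prompt(ontology: str, sentence: str, examples=()) -> str:
    """Assemble an extraction prompt; an empty `examples` gives a zero-shot prompt."""
    parts = [
        "Extract triples from the sentence using only the ontology below.",
        'Answer with JSON of the form {"triples": [{"sub": "", "rel": "", "obj": ""}]}.',
        "",
        "Ontology:",
        ontology,
    ]
    # Few-shot: each example pairs a sentence with the triples expected for it.
    for example_sentence, example_triples in examples:
        parts += [
            "",
            "Example sentence: " + example_sentence,
            "Example output: " + json.dumps({"triples": example_triples}),
        ]
    parts += ["", "Sentence: " + sentence, "Output:"]
    return "\n".join(parts)


ontology = "knows(Person, Person)"  # hypothetical one-relation ontology
examples = [
    ("Alice knows Bob.", [{"sub": "Alice", "rel": "knows", "obj": "Bob"}]),
]

zero_shot = build_prompt(ontology, "Carol knows Dave.")
few_shot = build_prompt(ontology, "Carol knows Dave.", examples)
print(few_shot)
```

The only difference between the two prompts is the block of worked examples, which is why the number of shots trades prompt length against guidance for the model.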

Evaluation of Text2Graph

Fetching the right datasets

To evaluate the various models, we used the open source Text to Knowledge Graph Benchmark (Text2KGBench).

Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

⛔️ Please note that for the purposes of this benchmark, we have only used the WebNLG dataset, which has been disambiguated with regard to DBpedia.

Here is an example test sentence provided by the community behind Text2KGBench:

Test sentence: {"id": "ont_k_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

An example ontology:

The expected output would be:

{"id": "ont_k_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", "triples": [{"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"}, {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"}, {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"}]}

Evaluation methodology

We tested our new JSON-based method with three techniques for prompting an LLM: zero-shot, few-shot and few-shot-gen. We applied these methods to twelve different LLMs: GPT4, GPT4 Turbo, GPT4o, GPT4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, Claude 3.5 Sonnet, Claude 3 Opus, Mistral Large, CodeStral, DeepSeek-Coder-V2 and Triplex.

What does all of this mean?

  • Text2KGBench is a project that evaluates how well LLMs generate knowledge graphs from text, guided by an ontology. We used our Text2Graph pipeline to create the ontologies.

The “number of shots” describes how many examples are given as part of the prompt.

  • Zero Shot: no examples are given in the prompt.
  • Few Shot: 1, 5 or 8 examples are given in the prompt. These come from the sample data.
  • Few Shot (Gen): we use LLM-generated examples instead of examples extracted from the dataset.

Finally, we used 12 LLMs to see how the choice of language model affects graph creation (GPT4, GPT4 Turbo, GPT4o, GPT4o mini, Gemini 1.5 Pro, Gemini 1.5 Flash, Claude 3.5 Sonnet, Claude 3 Opus, Mistral Large, CodeStral, DeepSeek-Coder-V2 and Triplex).

We ran each of these 84 configurations (7 shot settings across 12 LLMs) on the 19 ontology datasets provided by Text2KGBench.
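The configuration count can be reproduced with a short sketch. One assumption is made here that the post does not spell out: the 7 shot settings are taken to be zero-shot plus 1, 5 and 8 examples for each of the two few-shot variants. The setting labels are illustrative only.

```python
from itertools import product

LLMS = [
    "GPT4", "GPT4 Turbo", "GPT4o", "GPT4o mini",
    "Gemini 1.5 Pro", "Gemini 1.5 Flash",
    "Claude 3.5 Sonnet", "Claude 3 Opus",
    "Mistral Large", "CodeStral", "DeepSeek-Coder-V2", "Triplex",
]

# Assumption: zero-shot, then 1/5/8 examples for few-shot and for few-shot (gen).
SHOT_SETTINGS = [("zero-shot", 0)] + [
    (method, k) for method in ("few-shot", "few-shot-gen") for k in (1, 5, 8)
]

N_ONTOLOGIES = 19  # DBpedia-WebNLG ontologies in Text2KGBench

configurations = list(product(SHOT_SETTINGS, LLMS))
total_runs = len(configurations) * N_ONTOLOGIES

print(len(SHOT_SETTINGS), "shot settings")
print(len(configurations), "configurations")
print(total_runs, "configuration/ontology pairs")
```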

What does Text2KGBench measure?

  • Precision: correct triples / predicted triples
  • Recall: correct triples / test set triples
  • F1: harmonic mean of precision and recall
  • Sub_halluc: average subject hallucination
  • Rel_halluc: average relation hallucination
  • Obj_halluc: average object hallucination
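The sketch below shows one simplified way to compute these six values for a single sentence. It is not the Text2KGBench evaluation script: it assumes exact matching of triples, treats a subject or object as hallucinated when it does not appear verbatim (case-insensitively) in the sentence, and treats a relation as hallucinated when it is not in the ontology's relation list. The predicted triples and the relation list are made up for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over sets of (sub, rel, obj) triples."""
    pred = {(t["sub"], t["rel"], t["obj"]) for t in predicted}
    ref = {(t["sub"], t["rel"], t["obj"]) for t in gold}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def hallucination_rates(predicted, sentence, relations):
    """Share of predicted triples with a hallucinated subject, relation or object."""
    n = len(predicted)
    if n == 0:
        return 0.0, 0.0, 0.0
    text = sentence.lower()
    # Simplifying assumption: plain substring matching, no stemming or normalization.
    sub_h = sum(t["sub"].lower() not in text for t in predicted) / n
    rel_h = sum(t["rel"] not in relations for t in predicted) / n
    obj_h = sum(t["obj"].lower() not in text for t in predicted) / n
    return sub_h, rel_h, obj_h


sentence = ('"The Loco-Motion" is a 1962 pop song written by American '
            "songwriters Gerry Goffin and Carole King.")
gold = [
    {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"},
]
# Hypothetical model output: two correct triples and one object not in the sentence.
predicted = [
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"},
    {"sub": "The Loco-Motion", "rel": "performer", "obj": "Little Eva"},
]
relations = {"publication date", "lyrics by", "performer"}  # illustrative subset

precision, recall, f1 = precision_recall_f1(predicted, gold)
sub_h, rel_h, obj_h = hallucination_rates(predicted, sentence, relations)
print(precision, recall, f1)
print(sub_h, rel_h, obj_h)
```

In this made-up case two of the three predicted triples are correct and two of the three gold triples are recovered, so precision, recall and F1 all come out at two thirds, while only the object hallucination rate is non-zero.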

To simplify our analysis, we’ll summarize the global averages for each of the 84 configurations across the 19 ontologies.

F1 data

The F1 score summarizes precision and recall in a single value, so it is a convenient way to compare our 7 shot settings across the 12 LLMs.

We find that across all 7 shot settings, Claude Opus performs better than the other LLMs, and markedly better for the few-shot settings with 8 examples.

Hallucinations

Hallucinations occur when the LLM produces information that is incorrect or misleading. For a data model, we want to reduce hallucinations as much as possible. We measured hallucination for all three parts of the triple: subject, relation and object. None of the configurations showed significant hallucination for the relation, the piece that connects the subject and the object.

We do see significant differences in the subject and object hallucination data.

Across all twelve LLMs, hallucination is lower with the zero-shot and few-shot methods than with the generated few-shot method. Again, Claude Opus has the lowest score, especially for the few-shot methods.

Conclusion

Our results show that the new approach for Lettria’s Text2Graph performs best with Claude Opus using the few-shot method. These configurations had the highest F1 scores (combining precision and recall) as well as the lowest hallucination scores.

We plan to continue our research into the best and most effective models and methods for Text2Graph, so that our customers get the highest-quality conversion of their unstructured data into graph databases.

To learn more about Lettria Text2Graph, book a demo with the team!
