Data-driven and High Tech

Figuring out language

Photos in this story: Shutterstock

If you want everyone on the same page in a multidisciplinary research project, it helps if all participants are using the same vocabulary. Wageningen has developed the TALK Tool for precisely such situations. In addition to fostering alignment and mutual understanding, it also stimulates discussions and provides inspiration and useful data.

Even the Old Testament contains a warning that joint endeavours are doomed to failure if the people involved don’t speak the same language or can’t understand one another. The story of the Tower of Babel and the linguistic confusion that followed remains as apt a metaphor as ever, thousands of years later.

Jump ahead a few millennia and modern times offer proof that even slight discrepancies in communication can have catastrophic results. In 1999, the Mars Climate Orbiter was destroyed in the atmosphere of the red planet, at a loss of 125 million dollars, because the two teams working on it had used different units of measurement: one metric, the other imperial.

But you don’t have to go to Mars or Mesopotamia for examples of miscommunication with major consequences, because this can occur in any research project – including in Wageningen.

Jan Top, a member of the Wageningen expertise group Food Informatics, sees this too: “We find that people often don’t understand one another, especially in multidisciplinary projects where sociologists, economists and technicians, for example, have to collaborate. Perhaps even worse is the situation where everyone thinks they are talking about the same thing but it gradually becomes clear that each participant has their own interpretation of a term.”

Top cites the term ‘delta’ as a typical example. “A mathematician will think of the Greek letter used to denote a change, while someone else will assume it refers to the mouth of a river. In this case, the participants will soon realise there has been a misunderstanding, but there are also real-life examples where people only found out afterwards that different interpretations had been used during a project.”

‘With this tool, you can already create a shared vocabulary during the introductory meeting’

Language might seem straightforward, but reality is complex. Top: “To give an example, what exactly is meant by ‘security’? It is generally associated with cybersecurity but in WUR research it often means food security. A simple word like ‘water’ can refer to drinking water, surface water or contaminated water, which are all very different things. And if you talk about ‘food producers’, who do you mean: farmers, or companies like Unilever?”

It is important to identify possible differences in interpretation as soon as possible, preferably at the start of a project. With that aim in mind, the Food Informatics expertise group has developed the TALK Tool. Using a playful approach, it lets researchers determine the terminology required for their specific project. The acronym TALK stands for Team Associations for Linking Knowledge, which basically tells you how the instrument works. The initiator Top explains: “It is essentially a very simple game that takes about twenty minutes to play. Each participant in turn submits a word they find important for the project. The TALK Tool then automatically generates words based on texts in the WUR library that are related to that term.”

According to Top, the advantage of this approach is that you eliminate the human bias you would get if the participants themselves had to identify those words. “The word that was submitted is shown in a circle in the middle of the screen, and the program automatically generates nine or so terms that are displayed around that word. The participant removes the terms that don’t actually refer to their word, and adds more appropriate terms. It is all highly interactive. If you go through this process with all the participants, you will already have created a shared vocabulary during the introductory meeting. That vocabulary not only makes clear what you mean as a group but can also help set the course for the entire project.”
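The round Top describes can be pictured in a few lines of code. What follows is only a minimal sketch of the idea, not the TALK Tool's actual implementation: the `Round` class, the `build_shared_vocabulary` function and the suggested terms are all hypothetical, and in the real tool the suggestions come from WUR library texts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Round:
    """One participant's turn: a central word plus the terms shown around it."""
    central_word: str
    terms: list[str] = field(default_factory=list)

    def remove(self, term: str) -> None:
        # The participant discards a suggestion that does not fit their meaning.
        if term in self.terms:
            self.terms.remove(term)

    def add(self, term: str) -> None:
        # The participant adds a more appropriate term of their own.
        if term not in self.terms:
            self.terms.append(term)


def build_shared_vocabulary(rounds: list[Round]) -> dict[str, list[str]]:
    """Collect every participant's curated terms into one vocabulary."""
    return {r.central_word: list(r.terms) for r in rounds}


# Hypothetical suggestions, invented for this illustration.
security = Round("security", ["food", "cyber", "supply", "nutrition"])
security.remove("cyber")       # not what this project means by 'security'
security.add("availability")

water = Round("water", ["drinking", "surface", "contaminated"])

vocabulary = build_shared_vocabulary([security, water])
print(vocabulary)
```

After one turn per participant, the dictionary holds each submitted word together with the terms the group has agreed belong to it.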

It is an important exercise, and also one that participants find fun and inspiring (which in turn increases the chance of success). However, TALK is more than just an icebreaker, and the resulting list of terms is not the end of the road either. Top: “Those words serve as a useful starting point for the vocabularies that are needed to link models and datasets to one another.”

‘We carried out an automated analysis of summaries from the WUR library’

Photo: Jurjen Poeles

Building such vocabularies is usually a very time-consuming task that involves talking to numerous experts to construct a taxonomy together. “That process can be made a lot quicker if you let an algorithm make suggestions based on existing texts. You can also link various attributes to the descriptions that help to interpret the measurements. These attributes are the metadata, an incredibly important aspect of data these days. Project financers are making such metadata a requirement as it allows the data to be understood and processed by machines as well as other people.”
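To see why such attributes matter to a machine, consider a value that carries its unit and its vocabulary term along with it. This is a minimal sketch under stated assumptions: the `Measurement` class, the `to_kilograms` function and the example figures are hypothetical and have nothing to do with the TALK Tool's own code.

```python
from dataclasses import dataclass

# Exact by definition: one international pound is 0.45359237 kg.
KILOGRAMS_PER_UNIT = {"kg": 1.0, "lb": 0.45359237}


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str   # metadata: the unit the number is expressed in
    term: str   # metadata: the vocabulary term the number describes


def to_kilograms(m: Measurement) -> Measurement:
    """Convert to kilograms, refusing to guess when the unit is unknown."""
    if m.unit not in KILOGRAMS_PER_UNIT:
        raise ValueError(f"unknown unit {m.unit!r} for {m.term!r}")
    return Measurement(m.value * KILOGRAMS_PER_UNIT[m.unit], "kg", m.term)


# Two hypothetical teams report the same kind of value in different units.
team_a = Measurement(10.0, "kg", "harvest weight")
team_b = Measurement(10.0, "lb", "harvest weight")
total = to_kilograms(team_a).value + to_kilograms(team_b).value
print(round(total, 4))
```

Because the unit travels with the number, the two values can be combined safely, and an unlabelled or unknown unit raises an error instead of silently producing a wrong total.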

Making use of existing texts seems logical, but the choice of texts matters a lot.

Top: “The underlying algorithm measures how often various words appear in the vicinity of other words, so it can work with Twitter and Google, for example.” But of course such sources are very general, and terms that are typical of the Wageningen domain will appear relatively infrequently. “Terms like ‘enzyme’ or ‘amino acid’ are more likely to turn up in WUR texts than in everyday communication. That is why we carried out an automated analysis of the summaries in our library. We used the Word2Vec method for this, a technique for analysing natural language in which numbers are used to indicate the degree of relationship between words based on their proximity to one another in a large set of texts.”
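The principle of scoring relatedness by proximity can be illustrated without any machine learning library. The sketch below is deliberately not Word2Vec, which trains a small neural network on a large corpus. As a simplified stand-in, it counts which words appear near one another and compares those counts with cosine similarity. The function names, the window size and the toy sentences are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words that appear within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_terms(word, vectors, n=9):
    """Return up to `n` words whose contexts most resemble those of `word`."""
    target = vectors.get(word, Counter())
    scores = {
        other: cosine(target, vec)
        for other, vec in vectors.items()
        if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy corpus, invented for illustration; the real analysis used library summaries.
sentences = [
    "the enzyme breaks down the protein",
    "the enzyme cleaves the protein chain",
    "an amino acid builds the protein chain",
    "the river delta floods the farmland",
]
vectors = cooccurrence_vectors(sentences)
suggestions = related_terms("enzyme", vectors, n=3)
print(suggestions)
```

On a corpus this small, common words such as ‘the’ dominate the counts, so the suggestions mean little. That is exactly why the choice and size of the source texts matter.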

‘Project financers are making metadata a requirement as it allows the data to be understood by machines as well as people’

The use of a good vocabulary not only helps ensure a smooth research process without any holdups; these days it is increasingly a requirement. “Research funds and publishers require data to be FAIR: findable, accessible, interoperable and reusable. In everyday language, that means it should be possible to find your data, understand it and reuse it. Developing vocabularies makes this much easier.”

While the application’s secret weapon is a statistical tool, the art lies in its ability to perform calculations on language. “What is interesting about our Food Informatics research group is that people assume we are real numbers nerds because we spend so much time with computers and AI, but in fact it’s far more about language than numbers. Or perhaps to put it more precisely, we focus more on the language associated with the numbers than the figures themselves.”

The TALK Tool is open source and freely available. The intention is to make it accessible to the general public, for example by publishing it on a website. “Like the Dutch-language word-guessing game semantle.be, which works on the same Word2Vec principle,” says Top. “But to be honest, I think our TALK Tool looks better.”
