Understanding LLMs and Language: A Linguist's Take
You’ve used ChatGPT. Do you know how it actually works? And what does it mean for the translation industry?
Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light.

Though wise men at their end know dark is right,
Because their words had forked no lightning they
Do not go gentle into that good night.
Dylan Thomas
I’m not an advocate of using LLMs (like ChatGPT) for tasks like translation, content creation or copywriting. I’m actually against the use of such models for tasks involving human language in general. But that doesn’t preclude me from trying to understand this technology better and in more depth. That’s why I have decided to dedicate a not-so-significant portion of my weekend to a Stanford lecture on machine learning and building large language models.
Actually, I realised that, generally, in life, I’m categorically opposed to being opposed to something one does not understand – if you catch my drift. So, I began to be categorically opposed to my being categorically opposed to the use of LLMs in creative tasks (like translation and copywriting), because I realised that I hadn’t understood how LLMs actually work.
Now that I have remedied this ignorance on my part – I’m still, and maybe even more than ever, categorically opposed to using LLMs for tasks involving high sensitivity to human language, context and semantics.
Based on the lecture I watched over the weekend, I will try to explain why and where this sentiment of mine is coming from.
Stanford University seemed like a credible-enough source for such an endeavour, though I don’t doubt there are plenty of other relevant, high-quality and credible resources that deal with the topic of LLMs – especially in this ChatGPT/AI craze period.
The aim of this article will mainly be to describe the functioning of such large language models, at least from a technical point of view – how these tools deal with language and how they use it, and maybe how this usage differs from that of humans. I will also point to the (un)usefulness of these tools in the areas of copywriting and translation, and I will try to provide some logical arguments to support my ideas and opinions (with an emphasis on “will try”).
Though I believe both Google and OpenAI offer some sort of courses on the functioning of their proprietary technology, I wanted something more on the academic side and less on the business side without the overbearing tinge of “large language models are the greatest thing to ever touch the face of the earth since sliced bread”. Not that I was looking for a take that would be inherently biased toward the other extreme; but, I guess, I was looking for something more academic, rational, balanced and technical (as my writing also seems to naturally veer more toward the academic and technical; less so to the marketing/popular/hook-kind of thing).
And technical I got; probably too technical for my brain capacity – as I am not a coder or IT guru, rather a linguist. Thus, I focused on the aspects I could understand from the linguistic point of view. I believe that people with better technical backgrounds would be able to extract much more, or maybe something different, from the lectures. So, in effect, this is a “linguist’s take” on the topic.
What are Large Language Models (LLMs)?
LLMs (Large Language Models) are neural networks. Yes, as a translator or someone working in translation, you’ve probably heard this term. Neural networks… Hmmm. Right: Google Translate and DeepL. The famous online translators also use neural networks as the base for their functioning. And ChatGPT is somewhat similar.
Language model – hmmm. That sounds like it works with language.
Fair enough. What we need to understand, though, is that LLMs don’t work with language on the same basis as humans do (obviously) – they work with language based on statistics, math and tokens. After all, math and numbers are the “language of computers”, so in this respect it only makes sense.
As the lecture mentions, LLMs work with a probability distribution over tokens – that is, how probable the occurrence of the next token is in a given sequence. In layman’s terms, we can think about it like this:
If I write the word “mother”, what is the most probable word to come after it based on a specific set of data (corpus, text) I have available and have used as a database (for training)?
So, a game of probability. Of numbers. Definitely not semantics or, god forbid, meaning.
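To make this game of probability a bit more tangible, here is a minimal counting sketch in Python – my own toy illustration, not how a real LLM is implemented (a real model uses a neural network, not a lookup table over a three-sentence corpus):

```python
# A minimal counting sketch (my illustration, not how a real LLM works): which
# word is most likely to follow "mother" in this tiny, made-up corpus?
from collections import Counter

corpus = (
    "my mother said hello . "
    "his mother said goodbye . "
    "her mother smiled ."
).split()

# Count every word that appears immediately after "mother".
followers = Counter(nxt for word, nxt in zip(corpus, corpus[1:]) if word == "mother")

total = sum(followers.values())
for word, count in followers.most_common():
    print(f"P({word!r} | 'mother') = {count / total:.2f}")
# P('said' | 'mother') = 0.67
# P('smiled' | 'mother') = 0.33
```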
So how does the probability work? It depends on the set of data you train on, but in general:
Sentences with grammatical mistakes are less likely to occur (i.e. be generated)
Sentences with logical errors are less likely to occur:
“Mouse ate the cheese” is more likely than “Cheese ate the mouse” (more likely, not impossible)
The probability of the occurrence of the next token in sequence is always calculated based on all the previous tokens (it’s sort of like a Fibonacci sequence, but not at all like it)
So, based on the first word, the probability of the second word is calculated. The probability of the third word is calculated based on the first two words, and so on. It would seem that the LLM then works with context, but this is all only on the level of statistical context – not semantic or linguistic context; though it may happen that the two coincide. The problem, however, is that the main criterion for the token distribution is still statistical, and not linguistic or semantic.
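Purely as an illustration of that autoregressive loop, here is a toy sketch in Python. The next_token_distribution function is a made-up stand-in for the neural network; a real model would condition on the entire prefix, whereas this stub only looks at the last word, precisely because it is a toy:

```python
# A toy sketch of the autoregressive loop: each new token is drawn from a
# distribution conditioned on the sequence generated so far. In a real LLM the
# whole prefix goes through a neural network; this made-up stub only looks at
# the last word, purely to keep the example short.
import random

def next_token_distribution(prefix: list[str]) -> dict[str, float]:
    # Hypothetical stand-in for the model's forward pass.
    last = prefix[-1]
    if last == "mouse":
        return {"ate": 0.7, "slept": 0.3}
    if last == "ate":
        return {"the": 0.9, "some": 0.1}
    return {"cheese": 0.6, "mouse": 0.4}

def generate(prompt: list[str], steps: int) -> list[str]:
    tokens = list(prompt)
    for _ in range(steps):
        dist = next_token_distribution(tokens)        # condition on everything so far
        words, probs = zip(*dist.items())
        tokens.append(random.choices(words, weights=probs)[0])
    return tokens

print(" ".join(generate(["the", "mouse"], steps=3)))  # e.g. "the mouse ate the cheese"
```

The point is only the shape of the loop: produce a distribution, sample a token, append it, repeat – statistics all the way down.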
So, all a large language model does, it would seem, is predict the next word in a given sequence. That would seem right, but it’s not entirely true. LLMs don’t actually work at the level of “words”.
What do they work with, then?
If not words, the next thing that came to my mind were morphemes. A morpheme is an indivisible linguistic unit that carries meaning – that would make a lot of sense. LLMs could parse the language and categorise morphemes according to their meaning and/or similarity, group them and arrive at conclusions etc.
But that’s not entirely true either. Actually, not true at all. It’s all based on statistics and “token distribution”.
So instead of semantics, words and morphemes, LLMs work with tokens. What are tokens?
One token may be one word – that would be a perfectly sound conclusion. Or one token may be one sentence – that would require much more computing power and energy, though. That’s not entirely how it works – so I will try to explain what tokenization looks like.
Language models work with language in two steps:
Tokenization. The text is split into tokens, the probability distribution over these tokens is calculated (based on the available language corpus), and the next token value in the chain of tokens is then predicted. That’s how the words and sentences that appear on screen in ChatGPT come about – behind the scenes.
Detokenization. That’s the process of converting the tokens (word parts) back into actual words and sentences. Because if we got an output in tokens, it probably wouldn’t make much sense to us, as ChatGPT literally butchers language into “edible” but arbitrary, meaning-free parts (i.e., the “tokens” in question).
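A toy round trip of those two steps might look something like this (the vocabulary here is invented for the example; real subword vocabularies contain tens of thousands of entries, and nothing guarantees the pieces line up with morphemes – here they only happen to):

```python
# A toy round trip: tokenization (text -> token ids) and detokenization
# (token ids -> text). The vocabulary is invented for this example; the pieces
# happen to look like morphemes here, but a real tokenizer gives no such guarantee.
vocab = {"un": 0, "believ": 1, "able": 2, " ": 3, "story": 4}
id_to_piece = {i: piece for piece, i in vocab.items()}

def tokenize(text: str) -> list[int]:
    ids = []
    while text:
        # Greedily take the longest known piece at the start of the remaining text.
        piece = max((p for p in vocab if text.startswith(p)), key=len)
        ids.append(vocab[piece])
        text = text[len(piece):]
    return ids

def detokenize(ids: list[int]) -> str:
    return "".join(id_to_piece[i] for i in ids)

ids = tokenize("unbelievable story")
print(ids)               # [0, 1, 2, 3, 4]
print(detokenize(ids))   # unbelievable story
```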
So, the next token in the sequence is always calculated based on the previous tokens – that’s called an auto-regressive neural language model. Auto-regressive in this case is, I believe, self-explanatory and makes sense, as the model always refers back to the previous tokens and calculates and re-calculates the probability of the next token in the sequence.
We could then ask, as does Yann Dubois of Stanford, why don’t the models work on the level of words? Or semantically sound units – morphemes?
And that’s a question of input data – the data that the model trains on. These are basically large corpora of texts, or large sets of question–answer segments (we may imagine it as a sort of translation memory, but in this case it’s all in the same language and includes questions and respective answers paired up).
These are then parsed into edible pieces – tokens – for the LLM to work with. We can’t work at the level of words because the vocabulary would be enormous and words might include typos or variants the model has never seen. Funnily enough, no one is really checking the input data – not even for typos. But I could understand that, because that’s probably a time-consuming activity.
Also, if we were to work on the level of words, it would cause trouble in languages such as Chinese – because there are no clear divisions between words.
Let’s have a look at an example of the tokenization process. It’s an iterative process which analyses and deconstructs language into smaller units. Not based on meaning or semantics, but based on distribution and occurrence of the same “structural” elements.
There are 5 steps in the process of tokenization:
1. Get a large corpus of text/language to work with – the tokenizer will map this text to token indices.
2. Assign every character a different token. So, at this point: one character equals one token.
3. Merge the most common pairs of tokens (the pairs with the highest number of occurrences) into new, single tokens.
4. Keep finding such frequent pairs and grouping them together – iterate.
5. Finish when all possible units are “distributed”, i.e. tokenized.
Once again, we can see that the model doesn’t work with morphemes or semantic units of text, but rather with structural similarities devoid of context and meaning – based on the statistical occurrence of the same elements. These might or might not coincide with morphemes or semantically sound elements.
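For the technically curious, the merging loop from the five steps above can be sketched in a few lines of Python. As far as I understand it, this is a toy version of the byte-pair-encoding idea; real tokenizers train on huge corpora and stop once they hit a fixed vocabulary size:

```python
# A minimal sketch of the merging loop (a toy version of the byte-pair-encoding
# idea): start from single characters and repeatedly fuse the most frequent
# adjacent pair of tokens into a new, longer token.
from collections import Counter

def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])   # fuse the pair into one token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")       # step 2: one character = one token
for _ in range(4):                      # steps 3-4: keep merging the most common pair
    pair = most_frequent_pair(tokens)
    tokens = merge(tokens, pair)
    print(pair, "->", tokens)
```

Notice that the merges are driven purely by frequency – whether the resulting pieces mean anything is entirely accidental.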
Training LLMs
So, first of all, LLMs tokenize the information we provide them in the form of text. But what exactly is the text we provide them with? Where does it come from? Is it written by someone or…?
We can also call this “text” data. After all, that’s all it is for the LLM. It’s raw data. And how do we obtain such data?
Let’s imagine you want to obtain data from a single web page to train your model. Here is what you need to do (a rough sketch in code follows the list):
Extract text from HTML code;
Filter undesirable content/websites (blacklist);
Deduplication – remove repeating headers and footers etc., links that point to the same content;
Remove low-quality documents/content – manually, so it’s highly subjective;
Classify data into different domains (code, books, entertainment…) – assign weights to the domains; try to overfit on high-quality data.
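A rough, purely hypothetical sketch of such a pipeline could look like the following – the blacklist and the quality score are invented stand-ins, and real pipelines are vastly larger and rely on heuristics and trained classifiers, which is exactly where the subjectivity creeps in:

```python
# A rough, hypothetical sketch of the crawl-and-clean pipeline above. The
# blacklist and the quality score are invented stand-ins; real pipelines are
# vastly larger and rely on heuristics and trained classifiers.
import re

BLACKLIST = {"https://example.com/spam"}        # hypothetical list of blocked URLs

def extract_text(html: str) -> str:
    # Step 1: crude HTML stripping; real pipelines use proper parsers.
    return re.sub(r"<[^>]+>", " ", html).strip()

def quality_score(text: str) -> float:
    # Step 4 stand-in: "quality" reduced to a trivial length heuristic, which is
    # exactly the kind of subjective shortcut discussed above.
    return min(len(text.split()) / 100, 1.0)

def clean_corpus(pages: list[dict]) -> list[str]:
    seen, kept = set(), []
    for page in pages:
        if page["url"] in BLACKLIST:            # step 2: filter undesirable sites
            continue
        text = extract_text(page["html"])
        if hash(text) in seen:                  # step 3: naive deduplication
            continue
        seen.add(hash(text))
        if quality_score(text) < 0.3:           # step 4: drop "low-quality" documents
            continue
        kept.append(text)                       # step 5 (domain weighting) left out
    return kept

sample = [{"url": "https://example.com/a", "html": "<p>" + "word " * 50 + "</p>"}]
print(len(clean_corpus(sample)))                # 1
```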
It then seems that the quality of your language model – or of the data it has at its disposal – mostly depends on a couple of things:
The amount of data (websites) you can crawl and extract the HTML codes from.
The efficiency with which you can then process this data and make sense of it.
And… What do you consider low- and high-quality content? This is extremely subjective, at least in my view, so I am quite skeptical when it comes to the data which is actually being used to “feed” these models.
So, the first step is… Data. Lots of data – preferably high-quality data, which is an extremely subjective criterion, and it seems highly unlikely that it can be managed in an efficient way that would lead to satisfactory results. But that also points to the decreasing demand for high-quality outputs when it comes to language in general – this is either the root cause of the problem; or its result; or, arguably, both.
The thing now is, I have all this data, and I have managed to somehow “stuff” it into my LLM, so it can be used to predict words. So, my model is “trained”. But how do I make it useful for the public? How do I make it ChatGPT-useful? (Of course, the degree of its usefulness is up for debate, but you know what I mean…)
This happens in the phase that’s called post-training.
Post-training LLMs
Post-training is the process of aligning instructions/questions with answers/solutions – so that the models become actually useful (ehm) for the public and are not just aggregators of data.
So, you’d take the “raw” LLM – now trained on the Internet data (dubious, but OK) and fine-tune it. How exactly would you do that?
You keep training it on the data you want it to focus on – on the desired answers you collected or obtained.
So, in theory, I should be able to train my model for translation if I have enough data – for example, in this case, it could be a bilingual database of original–translation segments (sort of a translation memory). At least that’s how I can imagine it, but I’m not entirely sure about the application of such an approach.
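Purely hypothetically, such fine-tuning data might look like a translation memory flattened into prompt–completion pairs. The format below is my own guess for illustration, not any specific provider’s API:

```python
# A purely hypothetical sketch of bilingual fine-tuning data: segment pairs
# (a bit like a translation memory) flattened into prompt/completion records
# and written out as JSON Lines. The format is my own guess for illustration.
import json

segment_pairs = [
    {"source": "Good morning.", "target": "Guten Morgen."},
    {"source": "Thank you very much.", "target": "Vielen Dank."},
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for pair in segment_pairs:
        record = {
            "prompt": f"Translate into German: {pair['source']}",
            "completion": pair["target"],
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```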
What regular LLMs are trained on, however, are pairs of questions and answers. The first data sets of such Q&A pairs are usually obtained from humans – but then this process, if it were to be scaled with humans, would become expensive and inefficient.
So, what the companies do is take a sample of such a human Q&A database and tell the model: based on this database, come up with your own questions and answers, and train on that.
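As I understand it, that bootstrapping step looks roughly like the sketch below – ask_model is a made-up stand-in for however the company queries its own partially trained model, and everything else is invented for illustration:

```python
# A heavily simplified, hypothetical sketch of the "make up your own Q&A" step
# (synthetic data generation). ask_model is a made-up stand-in for however the
# company queries its own partially trained model.
import json
import random

def ask_model(prompt: str) -> str:
    # Hypothetical stub: a real pipeline would call the LLM itself here.
    return "...model-generated text..."

human_seed = [
    {"question": "What is tokenization?",
     "answer": "Splitting text into smaller units called tokens."},
]

synthetic = []
for _ in range(3):
    example = random.choice(human_seed)
    new_question = ask_model(
        f"Here is an example question: {example['question']}\n"
        "Write one new question in a similar style."
    )
    new_answer = ask_model(f"Answer this question: {new_question}")
    synthetic.append({"question": new_question, "answer": new_answer})

print(json.dumps(synthetic, indent=2))   # the model then trains on its own output
```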
Wild.
It’s like if someone at school – let’s say you have an exam and the subject is history – told you to make up your own questions and then answer them, and you’ll be evaluated based on your performance. How that is relevant is beyond me, but I am no machine learning engineer…
So, the scaling is basically synthetic data generation – creating questions and answers for the sake of creating questions and answers. Again, the question would be what degree of oversight is applied during such a process. And if we think about it, if there is little to no oversight, this could explain the “hallucinations” of LLMs, or the completely made-up and false sets of information they can sometimes spew out.
I mean, the LLM – in the process of its training – makes up its own questions and answers, perhaps loosely based on the initial data set we provided. But we can’t really know how far it has gone astray, or what questions and answers it is actually generating. At least that’s the general feeling I got from the Stanford lecture.
This seems, at the very best, irresponsible – and at worst, quite dangerous. We’re basically denying the power and importance of language and dismissing it as mere datapoints which can be dealt with on the level of statistics instead of semantics and linguistics. And what’s more, we are putting it in the hands of a model which we can’t even check or control in a satisfactory manner.
But! There is, sometimes – as is the new trend in translation as well – a human in the loop! I’m not sure about the extent and degree of such human QA in the process of training the LLM, but there certainly is, or should be, a human involved.
But what some of the research has found is that humans are also partially to blame for LLMs providing either completely outlandish answers or answers in the form of convoluted sentences with a myriad of buzzwords in them.
Why and how? Well, because we are lazy.
In the process of human training and QA of the LLM, the human agent is supposed to rate answers based on their accuracy and label the most accurate ones, which the model will then focus on and train on.
The thing is, people don’t like to read. It’s a waste of time for them – so they don’t read the answers, and they simply almost invariably go for the longer or more complicated one. Because if it’s longer and uses big words, it must be right!
And, I believe, therein lies the secret behind ChatGPT using so many convoluted words and structures such as unleash, ensure, in this ever-changing world/landscape, leverage etc. People prefer them because they are more complicated, sound more sophisticated and thus, more credible. And the answers are longer. Irrational but probably true.
It's funny when we think about it – there are so many buzzwords because someone couldn’t be bothered to properly evaluate a set of data (text), and instead just chose the shortcut, or the option that seemed the most convenient in the first second or so.
The thing is that within this process, the human QA is then slowly phased out and replaced by LLM QA – which works based on what it has learned from the human labelling and answers. Thus, you keep repeating the same mistakes and exacerbating them even further, probably eventually arriving at more and more complicated and convoluted answers; given that that is what the model learns from the human decisions and choices.
As Yann Dubois states in his lecture, concise answers have a 40% lower preference than verbose answers (on average), which are then used to train and feed the model. That’s a huge difference, or as he called it, variance.
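To see how such a bias can propagate, here is a toy simulation: I generate preference labels where the verbose answer wins roughly 60% of the time (echoing the 40/60 split quoted from the lecture) and fit a one-parameter reward model on answer length alone. Everything apart from that split is invented for illustration:

```python
# A toy simulation of length bias in preference labels. The 60/40 preference for
# verbose answers echoes the figure quoted from the lecture; the data, the
# one-feature reward model and all other numbers are invented for illustration.
import math
import random

random.seed(0)

def simulated_label(len_a: int, len_b: int) -> int:
    """Return 1 if the simulated annotator prefers answer A (the concise one here)."""
    prefers_longer = random.random() < 0.6      # verbose answer wins ~60% of the time
    a_is_longer = len_a > len_b
    return int(a_is_longer == prefers_longer)

# Pairs of answer lengths in words: A is always concise, B always verbose.
pairs = [(random.randint(10, 40), random.randint(60, 200)) for _ in range(2000)]
labels = [simulated_label(a, b) for a, b in pairs]

# Fit a one-parameter, Bradley-Terry-style "reward model": reward = w * length.
w = 0.0
for _ in range(200):
    grad = 0.0
    for (len_a, len_b), y in zip(pairs, labels):
        p_a = 1 / (1 + math.exp(-w * (len_a - len_b)))   # P(annotator prefers A)
        grad += (y - p_a) * (len_a - len_b)              # gradient of the log-likelihood
    w += 1e-5 * grad / len(pairs)

print(f"learned length weight: {w:.5f}")   # positive: longer answers get a higher reward
```

In this toy setup the learned weight comes out positive – length alone raises the reward – and that is the signal the next round of training would then optimise for.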
Hallucinations
So, the question we might ask ourselves is: If the model trains on human answers, and answers that are based on human answers, which then undergo a QA – how come it still produces hallucinations?
This probably has to do with the fact that people sometimes answer the questions by providing a reference for their reasoning or solution. Logically, the model should analyze or parse the reference and learn from it – but that’s not what happens. Instead, the model thinks you just made it up. So, in this spirit, the model then makes up its own reference (non-existent) because it thinks that’s the way to do it. And there: we get the made-up facts and references that don’t exist. Thank you, ChatGPT? Or, rather – thank you, lazy people who trained it.
In the end, it’s all our fault.
After a certain set of such questions and answers with reasoning, the model learns what the probable preference of the human is when it comes to answers. And once again, human agents are no longer needed – at this point, we already have a statistical model for the preference of answers, so we only extrapolate it over the larger dataset (the set of questions and answers).
What’s scary is the fact that, as Yann Dubois says, “It might not perfectly model human preferences…” And we are actually pretty certain about that. Once again, similarly to translation, we are taking humans “out of the loop” in the spirit of efficiency and cutting costs, disregarding and sacrificing semantic accuracy and linguistic sensitivity.
It might be true that this approach has somehow influenced what’s going on now in the translation and localization industry as well, but I don’t want to veer far off into conjecture.
It’s a gold rush for quantity of data, not for quality. Quality is an afterthought, if a thought at all, after optimizing for efficiency and costs. The thing is, we are talking about the quality of language here, which shouldn’t be taken lightly – but, apparently, is. It’s not even dealt with as language anymore; rather as data points or data sets which can be reduced to numbers devoid of meaning, semantics and context (tokens).
Language, and the way we use it, will be increasingly dependent on data and our ability to process data – therefore, on processing power. The more data you have and can process, the larger the language model, the higher the quality, the better the performance.
If we say that language is power – true, but not in a way you’d think. Language transformed and distilled into data is power. So, the power doesn’t reside in the semantics, in the meaning – the power of words, as we could dramatically argue. But it rather resides in quantity, and the ability of machines to process it efficiently.
What does it mean for the future of, say, translation?
Translation will become a question of data. Whoever holds the largest data sets and databases will probably win the race – in the short term. Translation will no longer require “highly skilled individuals” and “linguists”, but rather highly specialized and high-quality data. These data will be obtained through – you guessed it – training LLMs with… Right – different LLMs. Will there be a human-in-the-loop? Probably, but the loop will be extremely “loose”, and the role of human oversight will be increasingly redundant.
Instead of the “translation industry”, we will get something more like a “language-generation industry” based on data and automated evaluation, with feedback loops between different LLMs and machine translation engines. The industry will probably become more concentrated, and the biggest players with the deepest pockets will be able to come up with the best computing power, analytics and performance. I’d say that companies like Meta and Google (maybe Microsoft, given the OpenAI connection) should have the upper hand; maybe we could count on DeepL to some extent, but I don’t think they have the power or the resources to catch up to the “big boys”.
The translation industry, and translation as a concept, will be in the hands of entities that have nothing to do with translation as a cognitive process, but that simply own vast amounts of data (the question of morality, and of how they got that data, is another thing…).
Conclusion
Based on the presented facts and opinions, there seems to be no logical or linguistic reason why we should use LLMs for tasks that require a high level of sensitivity to language, context and semantics – such as translation, copywriting etc. And if people actually understood how these models operate – and provided that their IQ level doesn’t inhibit them from thinking logically and drawing rational implications and conclusions – there would be no logical path where we could efficiently use large language models for tasks requiring high sensitivity to context, semantics and the meaning of language.
It seems, then, that one of the ways forward might be to yield to the seemingly self-destructive opportunities on the job market related to positions such as “AI data labelling” or “AI content evaluation” and to try to create the highest possible quality of outputs within said domains. The job, however, seems finite in nature, and doesn’t seem to contribute to the overall personal or linguistic development of the individual, or of the collective society, in any form or way. I have received numerous offers for training, assessing and labelling AI data, language, outputs and translations. The rate? 7.50/hour. I could probably make as much as a university student working a part-time job.
Language and its use will become increasingly dependent on computational power and the quality of available data – on our (and our computers’) ability to build high-quality, high-volume data sets. We are reducing the concept of language to statistics and “bits”, which can be understood by computers and maybe engineers or developers, essentially cutting out any agent who has actual linguistic and semantic understanding at the level of context and meaning. And that’s also the point where we purge language of humanity, and probably of culture, creating an isolated concept chiefly for the use of the LLM in question, and not for the use of humans. Even though humans are, for now, precisely the end users of such large language models.
So, in essence, we are forcing the replacement of human capabilities of dealing with language (which are perfectly fine and better than those of the LLM) with the learned capabilities of the LLM, which are inherently suboptimal, in order to, I guess, achieve efficiency and cut costs.
Language, when AI uses it, is based on the computational ability to predict the next most likely occurrence of a word. Our brain, however, works in a completely different way – at least that’s how I imagine we use language. First, we need to have some kind of an abstract notion within our imagination – we can usually “see” it – and then we try to find the proper words to describe such an “imagined” phenomenon, based on the extent and individual range of our vocabulary and our individual capacity to string words together. It also works the other way around: if you’re reading a book, you are basically “hallucinating” – or we may call it a guided hallucination – based on the words and the notions you ascribe to each word. If ten people read the same book, they would probably arrive at ten distinct and different hallucinations induced by the very same text.
Will this variance be neglected in the face of efficiency, data and costs? So far, it’s the most likely scenario.
In the meantime…
Do not go gentle into that good night.
Comments

Thanks for your thoughts. In the absence of a workable linguistic model of language (which linguists have struggled with for decades), AI companies have taken the number-crunching route, as you explain. The results will therefore never be as good as the best minds, though they will be good enough for many applications. The secrets of language remain to be discovered, and part of me hopes they never will be, as therein, I believe, lies the definition of what it means to be conscious, what it means to be alive. If chatbots ever get that far, we won’t have much need for human contact.
Great walk-through and, as always from you, reasoning and considerations that make sense.
About what we logically should do – when it comes to buying machines to do human work, those who buy them are not those who know what is good. In fact, they often care the least about quality; they just want to show that they have saved money.
You, who know what is needed for creating good quality, are the one who will be replaced by the machine, so nobody will ask you.
That's the kind of logic that works in business life.