In the last article we talked about how significant it would be to build an Artificial General Intelligence (AGI) and that its impact should not be underestimated. It’s the holy grail. The last invention. We then walked through how elusive this endeavor has been over centuries of research in many fields, from basic science and math all the way to electronics and computer engineering. Finally we ended by talking about how work on deep learning accelerated the progress on multiple problems that were very hard to tackle and that are now being solved at an increasing rate.
Today we will take a look in particular at the language task, from representing words as numbers all the way to chatting about numbers.
“The limits of my language mean the limits of my world.”
—Ludwig Wittgenstein
Language and intelligence are strongly intertwined. The Turing test is perhaps one of the most famous examples of that strong association.
While we know of animals that see better, smell better, run faster, solve puzzles, or have better physical control over their environment, an essential part of why we consider our intelligence unique is our human language. Language is an innate part of being human. A baby growing up with other humans picks up their language, while other animals don’t. So it’s not just a matter of learning: humans have an inherent ability for it.
Language as expressed through writing is also what helped us preserve our collective knowledge and take it to new heights. A significant part of humanity’s collective knowledge still exists in written texts, some of which may have yet to be truly decoded.
“Writing is thinking. To write well is to think clearly.”
—David McCullough
There is thus strong evidence that building machines that can effectively understand and generate human language would be a significant step towards understanding and replicating human intelligence.
But similar to other problems in AI, this has been very difficult to even come close to.
We know computers were built for number crunching. But human language does not have a natural way to be represented numerically. In fact, one of the hardest problems in the language processing field has been to transform words into numerical representations that preserve their meaning. If we could build a computer program that gives us two close numbers for two sentences that are close in meaning, we could use that program to do more number crunching on sentences.
This is not as easy as it may sound. While we could represent and process words as raw data or print them on a screen, just as we would process the pixels of an image, we could not simply measure whether two sentences are semantically equivalent. And yet being able to perform this seemingly basic operation would mean that we have a system with a good internal representation of language, and with such a system we can do much more than check sentence equivalence.
Consider this endeavor as our effort to establish an equality axiom in a first-order logic: a foundational operation in a logic system.
Today, we have systems like ChatGPT that are capable of recognizing and also generating semantically equivalent sentences as demonstrated in the picture. But how did that happen? Let’s use these two generated examples and go backwards to try to see how we can programmatically find out whether two sentences have the same meaning.
The trivial case is the visual one: if two sentences look the same, they are the same.
The cat is on the mat. = The cat is on the mat.
This only uses visual clues: what the letters look like. Computers can easily do that by comparing the internal binary representations of the two pieces of text. Even if the casing differs (say one sentence is written “The Cat is on the Mat”), we only need an extra normalization step that lowercases all letters before checking.
This hack would not work with German, for example, where all nouns are capitalized, so casing can be valuable information we shouldn’t throw away.
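As a minimal sketch of this surface-level check, the Python snippet below compares two sentences exactly and, optionally, after case normalization. The function name `same_surface` and the `ignore_case` flag are made up for this illustration.

```python
def same_surface(a: str, b: str, ignore_case: bool = False) -> bool:
    """Compare two sentences character by character, optionally ignoring case."""
    if ignore_case:
        # casefold() is Python's lowercase variant intended for caseless matching.
        a, b = a.casefold(), b.casefold()
    return a == b


print(same_surface("The cat is on the mat.", "The cat is on the mat."))  # True
print(same_surface("The Cat is on the Mat.", "The cat is on the mat."))  # False
print(same_surface("The Cat is on the Mat.", "The cat is on the mat.",
                   ignore_case=True))  # True
# A different word order defeats the check, with or without normalization.
print(same_surface("On the mat is the cat.", "The cat is on the mat.",
                   ignore_case=True))  # False
```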
But as you can surmise from our examples, this approach does not work at all as soon as the semantically equivalent sentences have different word orders. We need to start looking at the words themselves.
One technique for this is the famous bag of words approach. Here we simply split the sentence into words (for English, this means splitting on spaces and separating out punctuation marks) and compare the words in each sentence, disregarding the order in which they originally appeared. Hence: a bag of words.
So in our example sentences, we would see that both sentences have the same set of words (the, cat, is, mat, on, and “.”) and thus we can conclude that they are equivalent. But what about the sentence “The mat is on the cat”? It has the same bag of words but is obviously not semantically equivalent to “The cat is on the mat.” Even with such a short and simple sentence, it becomes clear that the bag of words approach is of limited help.
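The bag of words comparison can be sketched in a few lines of Python. This is only an illustration: the helper name `bag_of_words` and the regular expression used for tokenization are assumptions, and lowercasing is applied so that “The” and “the” count as the same word.

```python
import re


def bag_of_words(sentence: str) -> set[str]:
    # Lowercase, then extract runs of word characters or single punctuation marks.
    return set(re.findall(r"\w+|[^\w\s]", sentence.lower()))


a = bag_of_words("The cat is on the mat.")
b = bag_of_words("On the mat is the cat.")
c = bag_of_words("The mat is on the cat.")

print(sorted(a))  # ['.', 'cat', 'is', 'mat', 'on', 'the']
print(a == b)     # True
print(a == c)     # True, even though the meaning is different
```

Because a set ignores both order and repetition, the sentence with the cat and the mat swapped is indistinguishable from the original.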
One next step to take is to consider a bag of n-grams, which is just a bag of consecutive sequences of words. The n here is the length of the sequence we use: 1-grams (or unigrams) give us our original bag of words, 2-grams (or bigrams) are two-word groupings, e.g. (the cat, cat is, is on, on the, the mat), and we can also use 3-grams (or trigrams): (the cat is, cat is on, is on the, on the mat).
If we use bigrams:
The cat is on the mat -> (the cat, cat is, is on, on the, the mat)
On the mat is the cat -> (on the, the mat, mat is, is the, the cat)
Again ignoring the original order of the n-grams, if we look at the differing bigrams we find that sentence one has “cat is” while sentence two has “mat is” which is different and tells us that these two are not the same.
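The bigram comparison above can be reproduced with a short Python sketch. The helper names `tokenize` and `ngram_bag` are invented for this example, and punctuation is dropped to match the listing above.

```python
import re


def tokenize(sentence: str) -> list[str]:
    # Words only; punctuation is ignored in this sketch.
    return re.findall(r"\w+", sentence.lower())


def ngram_bag(sentence: str, n: int) -> set[tuple[str, ...]]:
    tokens = tokenize(sentence)
    # Slide a window of length n over the tokens and collect each window.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


first = ngram_bag("The cat is on the mat", 2)
second = ngram_bag("On the mat is the cat", 2)

# Bigrams that appear in only one of the two sentences.
print(sorted(" ".join(g) for g in first - second))  # ['cat is', 'is on']
print(sorted(" ".join(g) for g in second - first))  # ['is the', 'mat is']
```

Note that this comparison flags any difference in bigrams, whether or not a human reader would consider the two sentences equivalent.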
Bag of n-grams is actually an effective baseline in many language tasks, such as search and text classification. In this case one would use a count of how many times each n-gram appears instead of a binary 0 or 1 for whether it exists or does not exist in a sentence, as we did above. However, you can quickly see that it does not really capture semantics in a meaningful way (pun intended!). And we still haven’t considered the harder of the two examples, in which we have a rephrasing with “decided” vs “made the decision”. So we need something that captures more than just surface-level features.
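To illustrate the count-based variant, here is a rough Python sketch that turns each sentence into n-gram counts and compares the counts with cosine similarity. Cosine similarity is one common choice and is an assumption here, as are the function names and the third sentence, which was made up for contrast.

```python
import math
import re
from collections import Counter


def ngram_counts(text: str, n: int = 1) -> Counter:
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    # Dot product over the n-grams of a; a Counter returns 0 for missing keys.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


x = ngram_counts("The cat is on the mat", 2)
y = ngram_counts("On the mat is the cat", 2)
z = ngram_counts("The dog sat on the rug", 2)  # made-up contrast sentence

print(round(cosine(x, y), 2))  # 0.6 (three of five bigrams shared)
print(round(cosine(x, z), 2))  # 0.2 (only "on the" is shared)
```

The score rewards shared surface strings only, so a rephrasing such as “decided” vs “made the decision” would get no credit for meaning the same thing.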
A key insight into processing language is to use statistics. Sequences of letters do not have an assigned meaning by default: we cannot put together a random sequence of letters and expect it to mean something. Rather, meaning is assigned to sequences of letters when humans use those sequences frequently enough to express something. And the same sequence of letters can mean different things in different contexts and, of course, in different languages.
This continuous death and rebirth of words coming in and going out of fashion tells us that we need a system that also continuously learns from exposure to changing language in the real world. And a learning system would be faster and more scalable than keeping track of the meaning of words through manual programmatic updates.
Still, how do we actually know the meaning of a sequence of letters (i.e. a word) in a context? The answer: by using more words! Other words. This is a key insight from distributional semantics.
“You shall know a word by the company it keeps.”
—John Rupert Firth
This idea was already commonly used as an exercise in language learning: fill in the missing word (also known as a cloze test). If a person can correctly predict the missing word in a sentence, it means that they not only know the grammar and morphology of a language, but also understand how its vocabulary works in different contexts.
So we can assume that if we can build a machine that does well on that exercise we would also have a machine that has a good understanding of words in context. This is where language modeling comes in. The nice thing about this exercise is that there is an abundance of data that can be used: basically take a sentence, randomly remove a word, and then we have training problems and their solutions: a sentence missing a word and the word that is missing.
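Generating such training pairs is simple enough to sketch in Python. The function name `make_cloze_example` and the `____` placeholder are arbitrary choices for this illustration; real systems typically mask subword tokens rather than whole space-separated words.

```python
import random


def make_cloze_example(sentence: str, rng: random.Random, mask: str = "____"):
    """Remove one random word and return (masked sentence, missing word)."""
    words = sentence.split()
    position = rng.randrange(len(words))
    answer = words[position]
    masked = words[:position] + [mask] + words[position + 1:]
    return " ".join(masked), answer


rng = random.Random(0)  # fixed seed so repeated runs give the same examples
sentences = [
    "the cat is on the mat",
    "you shall know a word by the company it keeps",
]
for sentence in sentences:
    problem, solution = make_cloze_example(sentence, rng)
    print(problem, "->", solution)
```

Every sentence in a corpus can be reused many times with different words removed, which is why this kind of training data is so abundant.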
As mentioned earlier, we are still trying to find a numerical representation of words that helps us do number crunching on them with computers. This is where the concept of word embeddings comes in. A word embedding is simply a list of numbers used to represent a word. Instead of giving the word a single unique numerical identifier, we assign it a list of real numbers (a vector). It’s like mapping a word into a multidimensional space of meaning, with words closer to each other being similar in meaning.
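As a toy illustration of “closer means more similar”, the Python sketch below measures closeness between word vectors with cosine similarity. The three-dimensional vectors are hand-made for this example and do not come from any trained model; learned embeddings usually have hundreds of dimensions.

```python
import math

# Hand-made 3-dimensional vectors, invented purely for illustration.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "mat": [0.1, 0.2, 0.9],
}


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


cat_kitten = cosine(embeddings["cat"], embeddings["kitten"])
cat_mat = cosine(embeddings["cat"], embeddings["mat"])
print(cat_kitten > cat_mat)  # True: "cat" sits closer to "kitten" than to "mat"
```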
To generate these embeddings, we rely on the same concept from the last section: guessing from context. Many systems for representing words have done just that, specifically neural systems that are trained to learn these embedding values. Some word embedding models, like word2vec, generate one fixed embedding per word, while others, such as ELMo (which stands for Embeddings from Language Models), generate a different embedding for each occurrence of a word based on its surrounding context words. The latter was an important step beyond the static vector-per-word approach and towards full end-to-end language modeling of sequences of words and their representation by a full-scale deep-learning neural network. This also started a trend of naming NLP models after Sesame Street muppets.
While word embeddings are a great tool and represent a giant leap forward from our bag of n-grams approach in our pursuit of semantic equivalence of sentences, they are not completely satisfactory on their own. To use them to solve our semantic equivalence problem we still need to add, average, or compare them in some way, which turns out to be cumbersome and error-prone. But what we have learned from them can be scaled, so let’s try to use a similar approach to how we got here to directly model sequences of words and compare them to other sequences of words.
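One way to see the limitation is to average word vectors into a sentence vector, as in the Python sketch below. The two-dimensional vectors are invented for the example, and averaging is just one of the combination strategies mentioned above.

```python
import math

# Invented 2-dimensional word vectors, for illustration only.
embeddings = {
    "the": [0.1, 0.1],
    "cat": [0.9, 0.2],
    "is": [0.2, 0.3],
    "on": [0.3, 0.5],
    "mat": [0.2, 0.9],
}


def sentence_vector(sentence: str) -> list[float]:
    vectors = [embeddings[word] for word in sentence.lower().split()]
    # Average each dimension across all the words in the sentence.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


a = sentence_vector("the cat is on the mat")
b = sentence_vector("the mat is on the cat")

# Averaging ignores word order, so the two sentence vectors coincide
# (compared with a tolerance because of floating-point rounding).
print(all(math.isclose(x, y) for x, y in zip(a, b)))  # True
```

Averaging therefore brings back the same weakness as the bag of words approach: swapping the cat and the mat changes nothing.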
“You can't cram the meaning of a whole %&!$# sentence into a single $&!#* vector!”
—Raymond J. Mooney
Luckily people have been working on a similar problem for a long time. And not only to tell us if a sentence is similar to another one, but to do this across languages, too! That problem is machine translation.
Machine translation has the nice property that it forces one to think about solutions to building a system that not only needs to capture the semantics of a sequence of words, but to also regenerate that sequence into a different system of semantics. To visualize this, think of it as the problem of creating a digital mirror. Building it means you need a perfect way to capture images, and to display them, with little to no distortion, and so that the image shown also matches the perception of the beholder.
The attempt to solve the problem of machine translation, i.e. the problem of understanding a sequence and generating another sequence out of it in a different representation, has led to a lot of progress in the field of natural language processing and, more generally, in sequence-to-sequence learning.
Work on machine translation produced ideas such as splitting words into subwords, attention, and the whole neural architecture that is the T in GPT: the transformer. The same architecture is also used in self-driving car systems. More importantly, the transformer is simpler and more parallelizable than the recurrent networks that came before it, meaning it could be scaled to train much larger neural networks that do higher-fidelity language modeling. In our digital mirror example, think of it as enabling us to capture, process, and generate higher-resolution images, which are also more computationally expensive than lower-resolution ones. Thus, soon after the transformer paper was published in 2017, we saw the emergence of the first of what are now referred to as Large Language Models, the most famous of which were BERT and GPT. Both systems reached a high level of language understanding by learning from (or pre-training on) large datasets of text. The information learned from these datasets becomes the model’s internal representation of language, which can then be leveraged for different tasks, either through further domain-specific or task-specific fine-tuning or by adding layers to the neural network that are tuned specifically for those tasks.
This paradigm of simple algorithms designed to use computation to learn as much as possible from large datasets, and then to apply that general knowledge to specific downstream tasks, continues to be the state of the art in neural network models. And the bitter lesson continues to be learned. But this is not new: billion-scale n-gram language models trained on trillion-word datasets existed even before language models became neural. After all, the human brain is itself an incredibly complex computing system, the main difference being that it is natural and also incredibly energy efficient.
While the first of these large language transformers could do impressively well on various language understanding and processing tasks, especially when further trained specifically to do them, they were still not as good at doing what we ask them to do in natural language. They were trained to predict the next word, but that is not the same as being trained to talk, to simply have a chat, or to be helpful and useful.
A crucial element of any learning process is feedback. And for language learning it's no different. Large-scale training on language data imbued neural networks with language understanding and general world knowledge. Further task-specific training gave them specific skills. Neither was enough to make them more general agents of conversation or helpful assistants. Until recently.
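To make “predicting the next word” concrete, here is a deliberately tiny Python sketch that uses bigram counts instead of a neural network. The three-sentence corpus and the function names are invented for this illustration; a transformer learns a far richer version of the same objective.

```python
from collections import Counter, defaultdict

# Toy corpus, made up for this example.
corpus = [
    "the cat is on the mat",
    "the cat is asleep",
    "the dog is on the rug",
]


def train_bigram_model(sentences):
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        # Count how often each word is followed by each other word.
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, word):
    candidates = model.get(word)
    # Return the most frequent follower, or None for an unseen word.
    return candidates.most_common(1)[0][0] if candidates else None


model = train_bigram_model(corpus)
print(predict_next(model, "cat"))  # is
print(predict_next(model, "is"))   # on
print(predict_next(model, "hat"))  # None
```

A model like this continues text plausibly but has no notion of being asked to do something, which is the gap described above.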
Instruction fine-tuning emerged as a way to solve this, not by training models to do tasks by giving them input and output examples of each task, but by training them to follow instructions more generally. This means that language models would then generally do tasks instructed through natural language. It’s one level of abstraction higher. This also seems to scale well, with larger models able to generalize better to new instructions that are further away from the ones seen during training.
This mechanism is further improved by changing what the model is being optimized for. Most NLP models are trained on some objective of likelihood maximization, as in optimizing the model’s output to be as close to what it has been trained on. This works well for doing tasks and following instructions. But what happens when we want our models to be cleverer than that? To also sometimes not follow instructions? This is almost as if we were trying to give our language models wisdom. The wisdom of knowing when not to do a task. The metric or data for that seems like it needs to be a bit different.
When we train models we actually don’t just want them to generate the same output as the training data, as usually the training data is a mix of the real world’s good and bad examples. What we really want is for our models to always generate good output. And the tricky part here is that for many tasks, we cannot accurately describe what we mean by good output, which is why we use machine learning and not try to explicitly program it in software. This is especially true for many language applications because language is dynamic and subjective. So the metric for correctness can even vary from one person to the next and from one context to another. Still, there is probably some general notion of goodness that could be captured if we try to let our models learn from the judgment of experts, rather than the output of all doers.
This is where methods like reinforcement learning from human feedback (or RLHF) come in. Instead of only training models to optimize for performance on tasks, we further tune them to optimize for what a group of humans prefers as an output for that task. This is done by asking humans to rank different system outputs and modeling those preferences with a smaller machine learning model, which, in turn, is used as a trainer to tune our large language model. There is information in those preferences that is a useful signal for the model to learn from, as opposed to purely task-driven training. A preference-ranked dataset is usually also cleaner than a raw large-scale dataset.
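The preference-modeling step can be sketched with a very small stand-in. The Python example below fits one score per output from pairwise human choices using a Bradley-Terry style logistic model; this is a simplification chosen for illustration, not the method of any particular system. Real reward models are neural networks that score text, and the reinforcement learning step that tunes the language model is not shown. The output names and preference pairs are hypothetical.

```python
import math

# Hypothetical preference data: each pair is (preferred_output, rejected_output).
preferences = [
    ("answer_a", "answer_b"),
    ("answer_a", "answer_c"),
    ("answer_b", "answer_c"),
    ("answer_a", "answer_b"),
]


def fit_scores(pairs, steps=500, learning_rate=0.1):
    scores = {name: 0.0 for pair in pairs for name in pair}
    for _ in range(steps):
        for winner, loser in pairs:
            # Probability the current scores assign to the human's choice.
            p = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Gradient ascent on log p: raise the winner, lower the loser.
            scores[winner] += learning_rate * (1.0 - p)
            scores[loser] -= learning_rate * (1.0 - p)
    return scores


scores = fit_scores(preferences)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['answer_a', 'answer_b', 'answer_c']
```

The learned scores play the role of the “trainer” signal: an output that humans tend to prefer gets a higher score, and the large model would then be tuned to produce outputs that score well.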
In our quest for intelligent machines, we seem to be mainly in search of automated human intelligence. For language tasks such as writing, we just want machines to write the way we would. But in many cases, we also have higher expectations of the machine’s performance on those tasks. Our expectation is that they perform much better than the average human. We want them to perform as well as the best of us, or better. This would mean that we need our machines to learn from human experts.
Spooner: “Can a robot write a symphony? Can a robot turn a canvas into a beautiful masterpiece?”
Robot: “Can you?”
—From: I, Robot (2004)
Getting high-quality expert feedback, though, is expensive and time-consuming, which also rings true for us when we seek education, mentoring, or coaching. But experts in all fields do exist and can be found working in organizations. And in many of them, they spend a good amount of their time training others. Throughout history, an effective way for experts to share their knowledge at scale has been communication. Books, lectures, talks, and more recently, massive open online courses, podcasts, and videos are all ways in which experts’ knowledge is shared at scale. Through language. This is the way we have done it for millennia, and this is how today we are teaching machines our skills and how we do things, the human way.
We now have instant access to a lot of documented expert knowledge through the internet, but cannot possibly make the best out of it due to our limited time and focus. We have even already built this into computer systems. The main problem is that a lot of that knowledge is static, and we can mainly learn from it by acquiring it. But imagine if we instead have access to experts on demand, and instead of looking up or asking them how to solve problems, we can just ask those experts to solve problems for us. We might even be able to solve problems that we do not yet fully understand by iteratively working with artificial experts as we refine our understanding of those problems. A future where experts, not just experts’ knowledge, are abundant and available 24/7.
An essential next step to a future like that is to scale the process of collecting expert human knowledge and feedback. Ideally as part of a continuous usage and learning process, not unlike how we work with each other.
And just like we are still figuring out how to best work together, many questions remain open on the way to a future where we can build useful and highly collaborative expert AIs.
We are closing in on answers for some of these questions, but others remain elusive. One thing seems true in the short term: We still want those expert machines to speak our language, and through it, demonstrate the abilities described above.
Machines may not need to understand language the way we do to be useful, but we still need to understand things in our own language, for new knowledge and solutions to our problems to be useful to us.
We have seen that the journey from getting AI systems to understand words, to having them speak sentences, to having them actually solve problems has been a long one. It included a lot of work to turn words into crunchable numbers. (And ironically, once we developed systems that did well with words, they did not end up being particularly good with numbers!) Along the way, a common theme seemed to be giving these systems access to better, richer sources of real data, and not trying to bake our “how” into these systems but rather giving them access to learn our “why”.
The reason for that also seems to be that for many tasks that we care about, there isn’t always a clear answer, but instead preferences. And thus by getting models to actively learn through interaction with their users, we let them capture more information about humans and how language works in the real world. It seems that machines need to be as close as possible to living with us so they can be better at doing human tasks the way humans do them. All it requires is to build a model of the human world, which seems to work in the direction of just building very good compression algorithms of all the data signals we can get our hands on.
The extent to which the current systems and the methods that led to them have fundamental limitations besides efficiency, scale, and access to more and better data is still unclear. And as compute power gets even more focused on the types of workloads these transformer models require, we will see orders of magnitude of improvement in the efficiency of such systems, and consequently, their capability and ubiquity. But will this be all we need for building AI systems that are experts in many or all fields? If not, which fields won’t be amenable to such modeling? And why?
This article covered how we got to the AI systems that power today’s language technologies and how they work at a high level. In our next and final article in this series, we will take a speculative look at the future of these AI systems: how they might evolve, how our lives could change because of them, and which problems would remain open once we potentially have superintelligent machines.