TLDR
This section is an automatically generated summary of the following language model review. You can judge how well our summarization models work!
The new GPT-3 is a massive language model trained on a large portion of the text available on the internet. It has roughly 100x more parameters and ingests roughly 100x more training data than the previous generation. However, one criticism of DL language models, including GPT-3, is that they do not contain a semantic representation of language.
Introduction
In the past two years, language models have been progressing at breakneck speed. If you’ve paid any attention to deep learning news, or tech news in general, you must have heard of the latest hype by now: GPT-3 by OpenAI, a Transformer-based model taken to the extreme in size and capability.[1] For those not in the know, the new GPT-3 is a massive language model trained on a large portion of the text available on the internet. The sheer size of GPT-3 alone is astounding, weighing in at roughly 100x more parameters, and ingesting roughly 100x more training data, than the previous generation of language models.
Language Models: a history
To understand GPT-3, it’s helpful to understand a little bit about the history of language models. The language of computers is numbers. The input to all machine learning algorithms is ultimately numbers as well. The question of a language model is: how do you translate words into numbers to take full advantage of the machine learning magic?
In the early days of language models there was the bag-of-words (BOW) model. In the BOW model you simply count the number of times a given word appears in a document, giving you one number per word. This model ignores all grammar and surrounding context, putting everything into the same “bag”. A more sophisticated version of BOW is Term Frequency-Inverse Document Frequency (TF-IDF). At a high level, this is BOW normalized by how often a word occurs across all documents. The intuition is that words such as “the” and “as” will have a high frequency of occurrence in nearly all documents but will not tell you very much about the content of any one document, so they are weighted down.
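To make the difference between BOW and TF-IDF concrete, here is a minimal sketch using only the Python standard library. The function names, the whitespace tokenizer, and the three toy documents are assumptions made for illustration. The weighting used here (raw count times log(N / document frequency)) is one common textbook variant of TF-IDF; libraries typically apply additional smoothing and normalization.

```python
import math
from collections import Counter


def bag_of_words(document):
    """Count how often each lowercase, whitespace-separated token appears."""
    return Counter(document.lower().split())


def tf_idf(documents):
    """Return one {word: weight} dict per document.

    Term frequency is the raw count; inverse document frequency is
    log(N / df), where df is the number of documents containing the word.
    """
    counts = [bag_of_words(doc) for doc in documents]
    n_docs = len(documents)
    # Iterating over a Counter yields each distinct word once per document.
    doc_freq = Counter(word for c in counts for word in c)
    return [
        {word: tf * math.log(n_docs / doc_freq[word]) for word, tf in c.items()}
        for c in counts
    ]


# Toy corpus, invented for this example.
docs = [
    "the dog chased the cat",
    "the cat sat on the mat",
    "the bird sang",
]

bow = bag_of_words(docs[0])
weights = tf_idf(docs)

print(bow["the"])         # raw BOW count of "the" in the first document
print(weights[0]["the"])  # zero weight: "the" appears in every document
print(weights[0]["dog"])  # positive weight: "dog" appears in one document only
```

In this toy corpus “the” has the highest raw count in the first document, yet its TF-IDF weight drops to zero because it appears in every document, while a rarer word like “dog” keeps a positive weight.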
The next iteration of the language model was the word2vec family of models. Instead of converting each word into one number, the idea was to convert each word in your document into a vector. Two words that are similar to each other, such as “man” and “woman”, should end up in the same “neighborhood” of the vector space. This allows you to do interesting computations such as adding and subtracting words. For instance, in word2vec the following relationship holds approximately: “king” – “man” + “woman” ≈ “queen”. That is, the vector closest to the result of the arithmetic is the one for “queen”.
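The word arithmetic can be sketched with a handful of hand-made vectors. Everything below is an assumption for illustration: the two-dimensional vectors are invented (loosely, one axis for “royalty” and one for “gender”), whereas real word2vec embeddings are learned from text and have hundreds of dimensions. The sketch follows the usual convention of excluding the three input words when searching for the nearest neighbor.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c].

    The three input words are excluded from the search.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Hand-made toy embeddings, NOT learned from data.
toy_vectors = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.1),
}

result = analogy(toy_vectors, "king", "man", "woman")
print(result)
```

With these toy vectors the nearest remaining word is “queen”. With real embeddings the match is approximate rather than exact, which is why the nearest-neighbor search is needed.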
Word2vec models learned the vector embeddings of words using a shallow neural network. This worked well; however, what if you want to consider the context in which a word appears in a sentence? For instance, the word “bark” can refer to tree bark or a dog’s bark. In “don’t stop”, the word “don’t” modifies the meaning of “stop” such that the meaning is actually the opposite. Also, why would you stop at shallow learning? If you used a larger, deeper network then maybe you’d get a better model of your words. In the process, you might just be able to address the context dependency of meaning.
Some of the earlier deep learning language models made use of the recurrent neural network (RNN) architecture. This led to a significant increase in performance for language models in downstream tasks such as text autocompletion and document classification, and even opened the door to some small-scale text generation. The latest innovation in language models is the Transformer architecture. Transformers improved upon the RNN architecture and did far better at modeling long-range word interactions, as well as circumventing some problems with RNNs’ computer memory usage. Now GPT-3 has taken the Transformer concept to its furthest limits.
Is the hype merited? In many respects, yes. The things people have been able to accomplish with GPT-3 are jaw-dropping. Some examples of its accomplishments include writing a blog that hit the top of Hacker News[2], extracting structured data from unstructured data[3], writing poems[3], and even writing code given plain speech instructions/prompts[3,4]. The crazy thing about GPT-3 is that, because it’s trained on so much of the internet, it needs few, or even no, examples to accomplish the task you want it to. In the deep learning field this is referred to as few-shot or zero-shot learning. Typically, if you wanted to accomplish the same thing with the previous generation of models you would need tens of thousands of examples for each specific task. You would take a base language model, trained on Wikipedia and various books, and then fine-tune the base model with your custom dataset. GPT-3 skips all this extra work, needing only a few, or no, examples to accomplish the same things, along with some additional tasks previous models could not accomplish. For instance, GPT-2, the predecessor model to GPT-3 released by OpenAI just last year, and other models of that generation didn’t have the capability to extract data from unstructured text and put it into tables, or to generate usable code given written instructions.
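In practice, “few-shot” means the handful of examples are written directly into the text prompt given to the model, rather than being used to retrain it. The sketch below only builds such a prompt as a string and does not call any API. The function name, the “Input:”/“Output:” layout, and the translation examples are all assumptions made for illustration, not a format prescribed by OpenAI.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a plain-text prompt from an instruction, worked examples, and a query.

    `examples` is a list of (input, output) pairs. Passing an empty list
    produces a zero-shot prompt containing only the instruction and the query.
    """
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends with an open "Output:" for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical task and examples, invented for this sketch.
few_shot = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("bread", "pain")],
    "cheese",
)
zero_shot = build_few_shot_prompt("Translate English to French.", [], "cheese")

print(few_shot)
```

The model is then asked to continue the text after the final “Output:”. Contrast this with fine-tuning, where the examples would be used to update the model’s weights.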
What are some shortcomings of GPT-3? One criticism of DL language models, including GPT-3, is that they do not contain a semantic representation of language. That is, they have no understanding of the meaning behind the words they are using. A model such as GPT-3 would have no idea that, for instance, a pig flying is absurd, but it could produce that scenario because of the statistical connection between the word “pig” and other words used to describe animals, as well as the statistical connection between some words used to describe animals and the word “fly”. This leads to a tendency for these models to make up facts, contradict themselves, or describe physically impossible situations when generating text, a problem that GPT-3 is not immune to.[5,6] The authors of GPT-3 also point out that this approach may be nearing, or may already have reached, the limit of the performance that can be gained from scaling alone.[7] That is, throwing larger models, more GPUs, and more data at the problem will not necessarily yield a better language model. Instead, other training modalities may be needed.
What can GPT-3 do for your company? At the time of writing this post, GPT-3 access is limited. OpenAI has released an API that is currently in beta. When it is opened for general use, we at Mosaic are happy to consult with you on how best to take advantage of it. In the meantime, we can discuss which currently accessible technologies can be tailored to your own natural language applications.
[1] https://openai.com/blog/openai-api/
[2] https://www.theverge.com/2020/8/16/21371049/gpt3-hacker-news-ai-blog
[3] https://towardsdatascience.com/gpt-3-creative-potential-of-nlp-d5ccae16c1ab
[4] https://analyticsindiamag.com/open-ai-gpt-3-code-generator-app-building/
[5] https://www.theverge.com/21346343/gpt-3-explainer-openai-examples-errors-agi-potential
[6] https://www.forbes.com/sites/robtoews/2020/07/19/gpt-3-is-amazingand-overhyped/#70b1a3bf1b1c
[7] p. 34 of https://arxiv.org/pdf/2005.14165.pdf