1 Introduction

The Winograd Schema Challenge has been met; in fact, one might say that it has been defeated or, perhaps more dramatically, that it has had a violent death. This is a sad ending for a challenge once hailed as a leading alternative to the Turing test [24,25,26,27].

A very detailed discussion of the defeat of the Winograd Schema Challenge (WSC) has been produced by Kocijan et al. [21]. It seems that very little can be added when it comes to describing the tools for the WSC. The “autopsy” by Kocijan et al. also offers valuable insight as to the reasons why the WSC did not resist when attacked, and the reasons why the WSC was perhaps not a very good test for commonsense reasoning and for overall intelligence in the first place.

Still, it is fair to ask: What exactly killed the WSC? Was it so weak that it in fact died of some common disease while still young? Was it really murdered? If so, who did it, and what was the murder weapon? Reading the commentary around the WSC, one feels that there is no consensus on the cause of death. How could a leading substitute for the venerable Turing test die suddenly, with so little understanding of the event?

Two previous references deserve attention.

The most prominent one is the already mentioned analysis by Kocijan et al. [21], whose authors have produced substantial contributions to the WSC [20, 22]. They argue, citing several colleagues who seem to agree with them, that the problem with the WSC was that the challenge was weaker than expected in a number of ways: schemas were harder to build than expected and less meaningful than originally thought; schemas were more constrained than planned and subject to large-scale correlations that led to death at the hands of large language models such as BERT and RoBERTa. In short, they argue that the WSC was weak and that any specific challenge may be weak in capturing commonsense reasoning [21]. This is a far-reaching position, in which the large language model RoBERTa killed a challenge so weak that it almost did not deserve to live. Hence, there is no prison time for RoBERTa, but not much credit for it either.

Another analysis can be found in the original paper on the WinoGrande corpus, the corpus that was used to refine RoBERTa so as to obtain human performance [42]. Indeed, one might argue that the pair RoBERTa \(+\) WinoGrande was the murderer. Or one might argue that the WinoGrande corpus alone was the murderer, on the grounds that the real defeat came only when researchers realized that more data, even of low quality, could be used to refine a large language model to victory over the WSC. In any case, the authors of the WinoGrande corpus seem to believe that biases in corpora let language models excel in the WSC. For those authors, the Achilles heel of the WSC was that, when building a corpus of questions and answers, human subjects introduce biases that, once captured by large language models, lead to the correct resolution of Winograd schemas. They seem to argue that, if one could produce a very large unbiased corpus of Winograd schemas, then the WSC with respect to that corpus would require commonsense. So the problem was not with the challenge, but with the difficulty of building the required corpus. (Note that such difficulty was deemed by Kocijan et al. to be a defect of the WSC, so there are divergent opinions even on the meaning of the difficulty.) In any case, the position of the authors of WinoGrande seems to be that the WSC is not dead but may be in a frozen state, to be resuscitated when we find a difficult enough corpus.

We offer a different analysis. We agree with others that the RoBERTa language model, and even the WinoGrande corpus, were key elements in this story; however, we think previous analyses have not been able to explain why the WSC succumbed so easily and what exactly killed it, and we believe we have a credible answer to these questions. To develop our argument, we first describe the main elements of the WSC in Sect. 2 and then we focus on RoBERTa and the WinoGrande corpus in Sect. 3. Our solution to the puzzle surrounding the death of the WSC is elaborated in Sects. 4, 5, and 6. Our closing arguments are given in Sect. 7.

2 The Winograd Schema Challenge

The WSC consists of a set of Winograd Schemas, where each schema is a pronoun disambiguation problem such as:

The city councilmen refused the demonstrators a permit because they feared/advocated for violence. Who are “they”?

A Winograd schema contains a pronoun that refers to one of two entities (in the example, city councilmen/demonstrators); the reference depends on a particular word. Each schema can be instantiated in two ways by selecting this particular word (in the example, feared/advocated). The resulting anaphora must then be resolved for each sentence. The average human performance on the challenge ranges from 92.1% [3] to 94% [43]. Winograd schemas are devised to display no strong statistical correlations between words; hence knowledge, and more importantly commonsense knowledge, should be needed to solve them [10].
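The two-way structure just described can be made concrete with a small sketch. The field names below are ours, purely for illustration; they do not reflect any official dataset format:

```python
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    """One schema yields two pronoun-disambiguation problems."""
    template: str       # sentence with a slot for the special word
    pronoun: str        # the ambiguous pronoun
    candidates: tuple   # the two possible referents
    special_words: tuple  # the pair of words that flips the answer
    answers: tuple      # correct referent for each special word

schema = WinogradSchema(
    template=("The city councilmen refused the demonstrators a permit "
              "because they {} violence."),
    pronoun="they",
    candidates=("the city councilmen", "the demonstrators"),
    special_words=("feared", "advocated for"),
    answers=("the city councilmen", "the demonstrators"),
)

# Instantiating the schema produces the two sentences to be resolved;
# swapping the special word swaps the correct referent.
for word, answer in zip(schema.special_words, schema.answers):
    sentence = schema.template.format(word)
```

Note how a single template yields two problems whose answers are opposite, which is exactly the property that defeats shallow word-level statistics.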

Hector Levesque, who proposed the WSC, submitted to the AI community at IJCAI-16, when the challenge ran as an actual competition, that: “Doing this [solving the challenge] correctly appears to rely on having a solid base of commonsense knowledge and the ability to reason intelligently with that knowledge” [24]. In fact, in the original paper about the WSC, the following ambitious statements were made by Levesque [24]:

The claim of this paper in its strongest form might be this: with a very high probability, anything that answers correctly a series of these questions ...is thinking in the full-bodied sense we usually reserve for people. To defend this claim, however, we would have to defend a philosophical position that Turing sought to avoid with his original Turing Test. So like Turing, it is best to make a weaker claim: with a very high probability, anything that answers correctly is engaging in behaviour that we would say shows thinking in people. Whether or not a subject that passes the test is really and truly thinking is the philosophical question that Turing sidesteps. [Emphases added]

We may thus distinguish two different claims in Levesque’s proposal:

Strong (Ontological) claim:

Anything that solves the WSC is thinking in the full-bodied sense we usually reserve for people.

Weak (Pragmatic) claim:

Anything that solves the WSC is engaging in behaviour that we would say shows thinking in people.

In any case, the idea behind the WSC is to test the ability to answer commonsense questions as related to sentence comprehension. The research community followed this lead with various degrees of enthusiasm, taking the WSC as a new type of Turing test [2, 15, 32,33,34, 51].

Fig. 1. Accuracy of systems on the Winograd Challenge [6].

Machines performed poorly up to 2017. That suddenly changed with language models, especially GPT-2 (Generative Pre-trained Transformer) [38], BERT (Bidirectional Encoder Representations from Transformers) [12], and RoBERTa (Robustly optimized BERT approach) [29]. Figure 1 shows the astonishing evolution in accuracy. Today we have (extremely) large language models that can solve a variety of linguistic tasks with high accuracy [1, 46].

We can say that the defeat of the WSC took place by the end of 2019 [44]. Interestingly, high expectations were assigned to the WSC until its sudden defeat. For instance, Marcus and Davis insisted in their book for a non-specialized audience, in 2019, that the WSC is “one of the most challenging tests for machines that is currently available” [30].

The literature on the WSC is relatively small after the WinoGrande solution was widely disseminated in 2020 [42]. Variants of the WSC have been proposed, and various applications have been tried — for instance, to test gender bias or to require explanations. Such work is reviewed with exceptional acumen by Kocijan et al. [21], so we do not repeat here what they have done in a definite fashion.

3 LLMs, RoBERTa and WinoGrande: Main Suspects

A language model usually yields probabilities over tokens of a language, given other tokens [18]. Several language models resort to Markov chains and similar probabilistic tools; recently, large language models have been built by training deep neural networks with properly designed architectures. There has been a marked shift up in performance with the introduction of Large Language Models (LLMs) based on the transformer architecture [50]. In short, a transformer is a neural network model that uses attention as its main mechanism; because this mechanism lends itself to parallelization, learning techniques can rely on larger training sets. Besides enabling parallelization, the self-attention mechanism encodes contextual information. The most popular LLMs, such as BERT, T5, and GPT, are transformers. Transformer-based LLMs frequently achieve state-of-the-art results in tasks such as question answering and sentiment analysis [17, 49].
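The attention mechanism can be illustrated with a minimal sketch of single-head, unbatched scaled dot-product self-attention, the operation at the core of the transformer [50]. This is a deliberate simplification (no batching, no multiple heads, no masking), not any particular library's implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n_tokens, d_model) input embeddings.
    Wq, Wk, Wv: projection matrices of shape (d_model, d_k).
    Each output row is a context-weighted mixture of all value
    vectors, which is how contextual information gets encoded.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n, n) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ V                               # (n, d_k)

# Every token attends to every other token in one matrix product,
# which is why this computation parallelizes so well on hardware.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                  # shape (5, 4)
```

The softmax weights make each output row depend on the whole sequence at once, in contrast with the strictly left-to-right dependence of Markov-chain models.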

LLMs rose to center stage in 2018 after Trinh and Le [48] employed RNN-based models. Even though their models’ performance was relatively low (63.7%), they optimistically claimed to have a good indication that their model had a “good grasp of commonsense knowledge” [48]. They accepted that the better the performance in the challenge, the more confidently one can say the model carries commonsense knowledge.

Significant evolution on the WSC was observed after Trinh and Le’s work, as one can see in Fig. 1. A spectacular gain in performance was then reported with the publication of the WinoGrande corpus and the associated solution for the WSC (first on arXiv [44] and then at the AAAI-2020 conference [42]). At that point the WSC was defeated.

WinoGrande is a dataset of schemas inspired by the original Winograd schemas. The full version of WinoGrande contained 43,972 problems, while the debiased version contained 12,282 problems.

The WinoGrande paper presented at AAAI-2020 attained 90.1% accuracy in the Winograd Schema Challenge [43]. The content of that AAAI-2020 paper was originally uploaded in November 2019 to arXiv [42]. However, a first version of the paper, uploaded to arXiv in July 2019, presented a significantly lower performance, 77.6%, still the state of the art at that moment [44]. Comparing both versions, the sole reason for the performance difference is the adoption of RoBERTa fine-tuned with WinoGrande in the final version, in contrast to BERT fine-tuned with WinoGrande in the first version.

The WinoGrande team argued that the performance of existing systems could be due to the “extent to which spurious effects are prevalent in existing datasets, which run the risk of overestimating the true capabilities of machine intelligence on commonsense reasoning” [44]. The initial motivation for WinoGrande focused on dealing with biases that might exist in the hand-crafted Winograd schemas and that might thus unfairly help existing solutions. Such biases might be of two types [47]: (a) language-based and (b) dataset-specific. In fact, Trichelair et al. observed that at least 37 sentences in the WSC273 dataset (13.6%) are conceptually rather easy due to language associations [47]. For instance, take

In the storm, the tree fell down and crashed through the roof of my house. Now, I have to get it repaired/removed.

and note that trees are removed more often than repaired, while roofs are repaired more often than removed. On the other hand, 131 sentences (out of 273) were found to yield meaningful examples even if the candidates in the sentence were switched. This was called “switchability” [47]. As an example, take

Bob collapsed on the sidewalk. Soon he saw Carl coming to help. He was very ill/concerned.

In this sentence, Bob and Carl can be switched to obtain an equivalent example with the opposite answers. Such schemas were called “switchable”. Trichelair et al. [47] encouraged future researchers to additionally report results on the switchable dataset (both when the candidates are switched and when they are not).

The WinoGrande team also discussed dataset-specific biases, such as annotation artifacts or spurious correlations in crowdsourced datasets.

Their solution was to develop an unbiased dataset (to be robust against both types of biases discussed above [44]), so presenting “problems that are more challenging by reducing such biases, while also scaling to a significantly larger number of problems (273 to 44k) by crowdsourcing” [44].

In the second version of the WinoGrande paper, the team’s declared ambitions for the new dataset increased. The dataset would allow a “true estimation of the machine commonsense capabilities ...with 44k problems that are inspired by the original design of WSC, but modified to improve both the scale and hardness of the problems” [42, 43]. Their goal was then to minimize the bias in the original Winograd schemas. The key steps in the construction of WinoGrande were (1) a carefully designed crowdsourcing procedure, followed by (2) a novel algorithm, AFLITE, that generalizes human-detectable biases based on word occurrences to machine-detectable biases based on embedding occurrences. The key motivation for this procedure was that it is difficult for humans to write problems without accidentally inserting unwanted biases.

After training BERT [44] and RoBERTa [42] on the WinoGrande-All dataset and achieving good results, the team was somewhat skeptical about the performance. They questioned whether neural language models had successfully acquired commonsense, or whether this was just an overestimation of the true capabilities of machine commonsense:

the potential overestimation leads to another crucial question regarding potential unwanted biases that the large-scale neural language models might be exploiting, essentially solving the problems right, but for wrong reasons. ...While such biases and annotation artifacts are not apparent for individual instances, they get introduced in the dataset as problem authors subconsciously repeat similar problem-crafting strategies.

To proceed, we must examine one key element in BERT (and RoBERTa) training; that is, masked language modeling. We will then be able to understand the way commonsense is absorbed by LLMs.

4 BERT and Cloze Procedures

The Bidirectional Encoder Representations from Transformers, BERT for short, is an LLM that handles contextual information in the pretraining phase. Notable previous efforts can be cited: semi-supervised sequence learning was developed in 2015 [9], and well-designed contextual embeddings (ELMo) appeared soon after [36]. The problem with those efforts is that they used only the left context or only the right context of a word, whereas language understanding is bidirectional, as indicated by decades of psychological studies, particularly in Gestalt theory, and by philosophical discussion dating back to Frege [16]. The main idea there is that meaning lies in a sentence and not in single words. For a century now, experimenters have been reporting findings that may be interpreted as showing that language behavior depends on total context:

The results indicate that the ability to identify, learn, recognize, remember or produce any language “symbol” (element or pattern) depends heavily on the variable degrees to which it is associated with everything else by larger and meaningful (familiar) overall combinations. [31]

The authors of BERT proposed a solution for bidirectional language learning in which they masked out k% of the words during training.

The authors of BERT explicitly mentioned the psychological test called the cloze test, also known as the cloze procedure, in their commentary about BERT. In our analysis, this is an essential character in the death of the WSC, so it pays to examine cloze procedures in more detail.

A cloze procedure is a psychological tool for measuring the effectiveness of communication, introduced in 1953 by Wilson Taylor [45]. Such a procedure is based on cloze units; the word “cloze” derives from the concept of closure. Gestalt psychology defines closure as the human tendency to complete a familiar but not-quite-finished pattern, for instance seeing a broken circle as a whole one. In other words, “closure” refers to our natural tendency to mentally close gaps. The same tendency applies to language; for instance, given “Chickens cackle and — quack,” almost anyone can instantly supply “ducks” [45]. In a cloze procedure, if the word uttered by a human subject is the same as the word omitted, the subject scores one cloze unit for correctly closing the gap in the language pattern.
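The mechanics of a cloze procedure can be sketched in a few lines. The deletion rate and the scoring convention below are our own illustrative choices, not Taylor's exact protocol:

```python
import random

def make_cloze(text, rate=0.2, seed=0):
    """Mutilate a message: delete roughly `rate` of its words."""
    words = text.split()
    rng = random.Random(seed)
    n_gaps = max(1, int(rate * len(words)))
    gaps = set(rng.sample(range(len(words)), n_gaps))
    mutilated = ["____" if i in gaps else w for i, w in enumerate(words)]
    answers = [words[i] for i in sorted(gaps)]
    return " ".join(mutilated), answers

def cloze_units(guesses, answers):
    """One cloze unit per gap closed with exactly the omitted word."""
    return sum(g == a for g, a in zip(guesses, answers))

text = "Chickens cackle and ducks quack"
mutilated, answers = make_cloze(text)
# A subject who restores every deleted word scores len(answers) units.
score = cloze_units(answers, answers)
```

The receiver sees only the mutilated message and must close each gap from the surrounding context, exactly the situation the masked language model faces during pretraining.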

It should be clear that a cloze procedure is very similar to masked language modeling as performed when training BERT and similar large language models.

For our investigation to proceed, it makes sense to examine in more detail the definition of linguistic commonsense:

[T]he sentence pattern is a complex one made up of many subpatterns. One must know not only the meanings (i.e., patterns of symbol-meaning relationships) and forms (patterns of letters) of all the five words, but also the meanings of given combinations of them – plus the fact that the sentence structure seems to demand a term parallel to “cackle” but associated with ducks instead of chickens. In other words, one must guess what the mutilated sentence means as a whole, then complete its pattern to fit that whole meaning. [Emphases added] [45]

Any method that intercepts a message from a “transmitter” (writer or speaker), mutilates it, and administers it to “receivers” (readers or listeners) that then attempt to fill the missing parts, is a method that potentially yields a number of cloze units. As Taylor argued, different persons may express the same meaning in somewhat differing ways, and the same language patterns may have differing meanings for different people.

The total context of a language behavior, such as the one involved in the resolution of a cloze unit, includes everything that tends to motivate, guide, assist or hinder that behavior. As noted by Miller, all such factors are combined into a notion of commonsense:

“I heard a — bark” is likely to elicit “dog” both because that word is habitually associated with “bark” and because it fits in with past experience with noisy dogs. If the verbal context is enlarged to “For the first time, I heard a — bark,” the impulse to supply “dog” may be reduced by commonsense; the subject may ask himself: “Who is this guy that has never heard a dog? Could he be referring to some other animal?” And if the preceding sentence has mentioned a voyage to the Pribilof Islands, the reader may draw on past knowledge to produce “seal.” [31]

We emphasize the expression “reduced by commonsense” as it reflects exactly what researchers supporting the WSC argued for. Habits of expression take over most of the work of translating an individual’s meaning into an organized series of language symbols for transmission to others. Likewise, habits of reading or listening cause the reader or listener to anticipate words, almost automatically, when receiving messages.

In many ways, masked language modeling is already discussed approvingly in connection with cloze procedures by Taylor:

How can a random system play fair when some words are easier to replace than others? Obviously, one is more likely to be able to supply “an” in “A quinidine is — alkaloid isomeric ...” than to guess “$6,425” in “The city council voted — for a new swimming pool.” Yet the former example is far more difficult reading. The answer is that if enough words are struck out at random, the blanks will come to represent proportionately all kinds of words to the extent that they occur. [45]

As a digression, we note that both BERT’s masked language modeling and cloze procedures have to decide how many words to delete. BERT was trained with 15% of words randomly masked out, a number based on empirical performance: deleting too few words leads to too much training effort, while deleting too many words may eliminate too much context.
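The random-masking step just described can be sketched as follows. Token handling is deliberately simplified here: real BERT masks WordPiece subword tokens, and a fraction of selected positions are replaced with random tokens or left unchanged rather than masked:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=0.15, seed=0):
    """Select ~`rate` of positions; the model must predict the originals.

    Returns the corrupted sequence and the training targets
    (position -> original token), mirroring BERT's MLM objective.
    """
    rng = random.Random(seed)
    n = max(1, round(rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n)
    targets = {i: tokens[i] for i in positions}
    corrupted = [MASK if i in targets else t for i, t in enumerate(tokens)]
    return corrupted, targets

tokens = ("the city councilmen denied the demonstrators a permit "
          "because they feared violence").split()
corrupted, targets = mask_tokens(tokens)
# The loss is computed only at the masked positions, using context
# from BOTH sides of each gap -- the "bidirectional" part of BERT.
```

Each training step is thus literally a batch of cloze units: a mutilated text plus the words that were struck out.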

In short: BERT’s pretraining applies a sort of cloze procedure at scale; we may refer to BERT’s pretraining as going through a large number of “masked language closure” steps. In a sense, BERT’s training teaches a machine to solve a linguistic commonsense task. As Winograd schemas are actually similar to cloze units, it turns out that BERT’s training is designed to address Winograd schemas. We elaborate on the connection with cloze units in the next section.

5 Textual Entailment, and Schemas as Cloze Units

The explicit linguistic problem to be solved given a Winograd schema is anaphora resolution. However, Winograd schemas are also variants of another task, Recognizing Textual Entailment (RTE) [4, 8, 41]. RTE asks a machine to recognize whether one text fragment entails another. Textual Entailment (TE) is a linguistic phenomenon that appears everywhere in daily communication; it is the relationship between sentences or fragments of text in which one implies the other. For instance, “John bought a novel yesterday” textually entails “John bought a book.” Dagan et al. have proposed a simple yet well-accepted definition of Textual Entailment: T textually entails H if and only if a human reading T would typically infer that H is most probably true [7].
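In data terms, an RTE instance is simply a labeled pair of text fragments. A minimal sketch follows; the field names are ours, not any benchmark's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RTEInstance:
    """One Recognizing Textual Entailment problem."""
    text: str        # T, the given text
    hypothesis: str  # H, the candidate entailment
    entails: bool    # gold label: would a human reading T infer H?

example = RTEInstance(
    text="John bought a novel yesterday.",
    hypothesis="John bought a book.",
    entails=True,  # every novel is a book, so T entails H
)
```

The label encodes Dagan et al.'s human judgment ("would typically infer"), which is exactly where background knowledge enters the task.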

As we can see, solving an RTE problem qualifies as a test for grasping implicit causes in both human understanding and perception. Winograd schemas inherit this feature; more accurately, 94.6% of all existing Winograd schemas have this structure according to our own manual analysis.

However, recognizing the correct direction of causation requires a prior understanding of what the entities in the text mean. Consider the following example. The sentence “Ferrous sulfate heptahydrate is green” (T) textually entails “\(FeSO_{4}\cdot 7H_{2}O\) is green” (H) [23]; this is a valid entailment because “Ferrous sulfate heptahydrate” and “\(FeSO_{4}\cdot 7H_{2}O\)” refer, in philosophical jargon, to the same entity in the world. Although this entailment looks, at first sight, like a synonym problem, a human reader cannot tell whether T entails H by working with words alone. A piece of appropriate background knowledge is required.

This example has the same structure as the Morning Star - Evening Star puzzle that Gottlob Frege posed more than a century ago with respect to the difference between sense (Sinn) and reference (Bedeutung) [16]. Knowing that both stars are the same astronomical object (Venus) required new empirical information. As the philosopher of language W. V. Quine put it, “astronomical observation was needed, and not mere reflection on meanings, to determine the sameness of the entity in question” [37]. This sort of difficulty appears in the “I heard a — bark” example discussed before in connection with cloze procedures.

Not surprisingly, research in RTE has arrived at a definition that takes into account appropriate background knowledge:

A text T textually entails a hypothesis H relative to a group of end-users G just in case, typically, a member of G reading T would be justified in inferring the proposition expressed by H from the proposition expressed by T. [23]

The challenge here lies in the high level of dependency on contextualized information within a community. Even though the intuition behind RTE seems correct as a test for commonsense reasoning, it is hard to come up with a well-delimited test in which knowledge representation is not about general issues.

Indeed, Levesque did not take RTE as the best way of testing commonsense, because an indefinite number of valid and correct consequences may follow from a sentence within RTE [27]. The problem of inferential causality [35] is insurmountable as a practical test of machine intelligence. As an alternative, he proposed the WSC, a test that is not only feasible but also measurable.

Levesque’s innovation is the addition of a double ambiguity: a pronominal anaphora plus a coreferential ambiguity. For instance:

(information) The city councilmen denied the demonstrators a permit.

(ambiguity) Because they feared/advocated for violence.

The personal pronoun “they” may refer to either the city councilmen or the demonstrators. If we ask “who feared violence?” the answer is “the city councilmen”. Conversely, if we ask “who advocated for violence,” the answer is “the demonstrators”. Therefore, the coreferential ambiguity is based on a pair of special words that considerably shift the meaning of the sentence.

We can see that the WSC, in its attempt to move away from the complexities of RTE, arrived at a modified form of cloze test. First of all, the blank in a Winograd schema, to be filled by one of the pair of keywords, is just a mask with only one correct answer. For instance, we have:

The city councilmen denied the demonstrators a permit. Because they [MASK] violence.

To make the connection with anaphora resolution even more evident, suppose the pronoun that has to be disambiguated is also masked. We would then end up with two masks:

The city councilmen denied the demonstrators a permit. Because [MASK] [MASK] violence.

In this particular example we have a little more than 15% of the tokens masked, and we have preserved enough context for the machine to make a good guess. (As a digression, note that, by limiting the flexibility of placeholders, Winograd schemas introduce some sort of bias, a social desirability bias: the requirement that nouns be mapped correctly to keywords so as to produce what the designers of the test take as the correct and obvious alternative.)
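One common way to attack a schema with a masked language model is to substitute each candidate referent for the masked pronoun and keep whichever substitution the model scores higher. The sketch below follows that recipe, but the `score` function and the association table are our own stand-ins, purely illustrative; a real system would sum token log-probabilities from BERT or RoBERTa instead:

```python
def score(sentence, assoc):
    """Stub for a masked LM's score of a filled-in sentence.

    `assoc` maps ordered word pairs to association strengths; this
    crudely imitates the co-occurrence statistics a pretrained
    model absorbs during its masked-language-closure steps.
    """
    words = sentence.lower().split()
    return sum(assoc.get((a, b), 0.0) for a in words for b in words)

def resolve(template, candidates, assoc):
    """Substitute each candidate for [MASK]; keep the best-scoring one."""
    return max(candidates,
               key=lambda c: score(template.replace("[MASK]", c), assoc))

# Toy association table standing in for what pretraining absorbs.
assoc = {("councilmen", "feared"): 1.0,
         ("demonstrators", "advocated"): 1.0}
winner = resolve("the [MASK] feared violence",
                 ["councilmen", "demonstrators"], assoc)
# winner == "councilmen": only that filling activates an association
```

Nothing in this recipe reasons about permits or violence; it merely picks the filling that best closes the gap, which is why cloze-style pretraining transfers so directly to Winograd schemas.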

We must now examine what RoBERTa brought to the plot, and verify the effect of the WinoGrande corpus, so as to understand the role played by these suspects.

6 A Second Look at RoBERTa and WinoGrande

The large language model RoBERTa is a refined version of BERT. As indicated by the authors of RoBERTa, its improved pretraining, with more corpora and better hyperparameters, led to significant performance gains across many tasks [29]. Besides BookCorpus [52] and the English Wikipedia (the original data used to train BERT), RoBERTa was trained with the English portion of the CC-NEWS dataset, OpenWebText, and, most notably, STORIES, a dataset introduced by Trinh and Le [48] containing a subset of CommonCrawl data filtered exactly to match the style of Winograd schemas.

From the second version of the WinoGrande paper [42], we have that fine-tuning RoBERTa with DPR (the Definite Pronoun Resolution dataset) achieved 83.1% accuracy in the WSC. DPR, a commonly used dataset for fine-tuning models to solve Winograd schemas, introduced 1,886 additional Winograd schemas authored by 30 undergraduate students. That result was already state-of-the-art performance for the WSC. And then fine-tuning RoBERTa with WinoGrande, as we know, achieved 90.1% accuracy in the WSC.

Indeed, RoBERTa \(+\) DPR already produced excellent results. For instance, consider the PDP (Pronoun Disambiguation Problems) dataset. This dataset consists of 80 pronoun disambiguation problems, all formulated as multiple-choice questions in which a pronoun must be resolved to one of up to 5 (but mostly 2) options. This is clearly related to the WSC, and often used as a proxy for the WSC itself. For the PDP, RoBERTa \(+\) DPR and RoBERTa \(+\) WinoGrande had close results, 86.3% and 87.5% respectively. We can see similar results for other datasets in Table 1.

Table 1. Comparison between DPR and WinoGrande fine-tuned RoBERTa. Note that SuperGLUE-WSC is discussed at https://super.gluebenchmark.com/tasks/.

The authors of the WinoGrande corpus briefly discussed those points:

Our model is based on RoBERTa fine-tuned with WinoGrande (train and dev sets). To compare different corpora used as a resource, we also fine-tune RoBERTa on DPR (train and test sets). For hyper parameter search, we use the same grid search strategy as in Sect 4. Overall, RoBERTa fine-tuned on WinoGrande helps improve the accuracy on all the related tasks (Table 6), and performs consistently better than when RoBERTa is fine-tuned on DPR. While improvements on some related datasets (particularly WSC, PDP, and DPR) might seem expected, the significant improvement on COPA is not so. [42]

Intuitively, the larger the dataset, the better; the WinoGrande corpus is valuable because it is a large one. One should expect the performance of RoBERTa to increase with the addition of 40k+ Winograd schemas.

Our goal in this discussion is to suggest that RoBERTa was a key character in our plot, due to the masked language modeling that lies beneath it, while WinoGrande was an important element of the plot, but one that simply added learning material to the underlying infrastructure.

Before we leave this discussion, we must comment on the analysis by Elazar et al. [13]. They argue that the perceived progress in the WSC is due to flawed evaluation, artifacts, and commonsense knowledge gleaned from a supervised training set, rather than advancements in LLMs; clearly their analysis emphasizes features of the WSC that are not aligned with the analysis in this paper. While it is undeniable that flawed evaluation and artifacts have influenced the perceived progress in the WSC, evidence suggests that the contribution of pre-trained LLMs is significant and should not be disregarded. Pre-trained LLMs indeed play a significant role in improving word sense disambiguation, and substantial progress was observed from BERT to RoBERTa (note that BERT alone got 64.9% accuracy in the WSC, while RoBERTa alone reached 79.1%). Further evidence supporting this claim is the change in performance across versions of the WinoGrande paper: despite minimal changes in the WinoGrande dataset, Sakaguchi et al.’s BERT-based model achieved 77.6% in July 2019, and, with the availability of RoBERTa in November 2019, performance jumped to 90.1% [42, 44].

To conclude this section, we note that we have focused on the events that took place around 2019/2020, when the WSC was defeated. LLMs have had an explosive evolution since then; masked-modeling training is certainly not the only technique that deserves attention now. While gap-filling (cloze) training aligns with the human reading experience, where we move forward and backward through text, models such as GPT predict the next token using solely the previous tokens. The latter strategy is similar to everyday human conversation, where information is exchanged linearly over time and participants cannot go forward to examine tokens yet to be uttered. Given the surprising success of GPT-like models in question answering and dialogue, one might thus think that other mechanisms may defeat the WSC, perhaps provided that one can use really large amounts of data and really large models. But even an extraordinarily large model such as GPT-3 [5] does not deliver spectacular performance in the WSC: our testing led to a mere 67% accuracy. Things improve for the even larger GPT-4 [1], where we got an impressive 85.2% accuracy; but we do not really have any access to that LLM, so we do not know whether it has been trained on even larger sets of Winograd-like sentences. We leave the matter to future investigation.

7 Closing (Clozing?) Arguments

The Winograd Schema Challenge was killed because the cloze-style training that is used to produce many recent language models is perfectly tailored to solve Winograd schemas. Both RoBERTa and WinoGrande were important in this plot, and RoBERTa in particular was a key character as it carried the critical information needed for the killing.

But in fact the WSC was not murdered by a single language model or a corpus. Rather, the WSC died because it was supposedly designed to capture commonsense but in fact only offered a task that is in essence a version of cloze procedures, a task that masked language modeling, as employed in large language models such as BERT, is specifically designed to address.