title: Learning Objectives, Insights, and Assessments: How Specification Formats Impact Design
authors: Lee-Robbins, Elsie; He, Shiqing; Adar, Eytan
date: 2021-08-06

Despite the ubiquity of communicative visualizations, specifying communicative intent during design is ad hoc. Whether we are selecting from a set of visualizations, commissioning someone to produce them, or creating them ourselves, an effective way of specifying intent can help guide this process. Ideally, we would have a concise and shared specification language. In previous work, we have argued that communicative intents can be viewed as a learning/assessment problem (i.e., what should the reader learn and what test should they do well on). Learning-based specification formats are linked (e.g., assessments are derived from objectives), but some may more effectively specify communicative intent. Through a large-scale experiment, we studied three specification types: learning objectives, insights, and assessments. Participants, guided by one of these specifications, rated their preferences for a set of visualization designs. Then, we evaluated the set of visualization designs to assess which specification led participants to prefer the most effective visualizations. We find that while all specification types have benefits over no specification, each format has its own advantages. Our results show that learning objective-based specifications helped participants the most in visualization selection. We also identify situations in which specifications may be insufficient and assessments are vital.

Communicative visualizations are omnipresent. They exist in everything from news articles to scientific papers and from Web pages to television broadcasts. The people involved in the visualization design process must make design decisions based on their intents or goals. Unfortunately, most design guidelines do not connect communicative intent to the actual design. Instead, there is a significant amount of information on what makes communicative visualizations effective perceptually, focusing on making the visualization readable. However, perceptual readability alone does not tell us whether a visualization achieves a specific communicative goal.

There may be several stakeholders who are responsible for creating a visualization, ranging from those who have an initial idea for it, to those who create it, to those who choose the final version to publish. There may be different people for each of these steps, as in the case of a client, a contractor, and an executive editor, or one person may carry out all steps of this process from start to finish. At every step, each stakeholder would benefit from a formal language for communicative intent. For creating a visualization, a clear communicative intent specification can help the designer identify their own goals and evaluate whether their visualization achieves them. Professional designers may already know how to identify a communicative intent. Even so, they could still benefit from a framework to communicate with other stakeholders. For a client commissioning a designer, a shared language of communicative intent would help the client succinctly express their goals to the designer. Sometimes there might be several visualizations, either created by a designer or automatically generated by a computer, from which a client would need to choose in order to best fulfill their communicative goals.
In this scenario, a formal structure for communicative intent can help the client analyze the different visualization designs to choose the best one. While there might be several stakeholders involved in creating, choosing, and publishing a visualization, all of them would benefit from a formal language for communicative intent to guide their design decisions, communicate with each other, and decide between multiple visualization designs.

The framing of communicative visualization design as a learning design problem [1] is a formal structure that prioritizes higher-level intents. The stakeholder takes on the role of a teacher, explicitly defining their goals of what they want the viewer (student) to be able to do or remember after viewing the visualization. For example, the viewer will be able to recall the change in unemployment last year, or the viewer will be able to determine the optimal treatment. These learning objective statements are based on Bloom's Taxonomy [3], though many other variants are possible [5, 39]. Critically, a specification of a communicative visualization as a learning problem can lead to appropriate assessments. A stakeholder can implement an actual test (e.g., a multiple-choice question "post-test") to determine if the visualization meets the specification criteria and satisfies the communicative intents.

Even in the context of learning, there are many ways that a specification can be crafted for a communicative visualization. For example, learning objective statements (e.g., the viewer will be able to describe the reasons why Norway is a Winter Olympics powerhouse) are broad. This type of specification is high-level; it includes both the cognitive ability (e.g., the verb describe) and knowledge (e.g., the 'noun' Norway's success in the Olympics). To achieve this, the viewer may need to read many facts, or insights, that are in one or more visualizations. An alternative specification format might emphasize one or more of these lower-level insights: the high number of Olympic medals relative to population or the many skating rinks that Norway has. While insights may not be as broad as learning objectives, they may be easier to map to 'graphical facts' or annotations and might be more comfortable for a designer to work with. A third specification format might privilege the assessments themselves, that is, the tests we could use to validate learning: In the last ten Winter Olympics, how many times did Norway win the most medals? While specific, this format may help the designer select a visualization that will do best in testing. Conversely, an over-focus on specific test questions or insights may lead to visualization choices that are not effective broadly. This is the design equivalent of 'teaching to the test.'

The formats of these specifications are often transformable. For example, we could create one or more assessments for each insight or learning objective. Similarly, we can identify the insights that need to be communicated and learned to achieve a higher-level learning objective. We hypothesize that different specification forms emphasize different ways of thinking. Given these different specifications, designers or editors might make different choices. More critically, some of these choices might be better for learning outcomes. In this paper, we measure the impact of different specifications on visualization design choices.
We contrast visualization preferences when participants are exposed to different specifications in the form of learning objectives, insights, and assessments, and compare these to baseline preferences. Additionally, we run a set of assessments against the possible visualizations to determine which produce the most accurate answers. Using both the preference data (i.e., the most preferred visualization given the specification) and accuracy (i.e., the best performing visualizations), we identify when different specifications lead to better learning outcomes. Specifically, we find:

1. Adding specification information changes preferences over baseline judgments.
2. Learning objective and insight specifications guide designers to prefer better-performing visualizations.
3. Evaluations of assessments with user testing provide critical feedback on which designs are effective.

We briefly examine various ways to describe visualization intent. We focus on summarizing the pros and cons of different approaches.

A common approach in understanding communicative visualizations is a focus on the outcome rather than intent. Specifically, evaluations provide a mechanism to judge whether a visualization achieves a designated communication goal. However, these approaches often emphasize cognitive effectiveness metrics. For example, one common way to evaluate a visualization is to test how fast and accurately a user can perform a task while viewing the visualization. Because designers can use low-level perceptual task taxonomies (e.g., as described by [2]) at the beginning of their design process, these evaluation mechanisms can encompass designers' intent. Using a combination of tasks, designers can identify the visualization that is most effective for each task [35]. However, while benchmark usability tests are standard, these tasks are elementary and not representative of more complicated interactions. Furthermore, these measures may involve trade-offs between speed and accuracy [31]. These generic and simplified tasks cannot evaluate more domain-specific and higher-level goals. This limitation inspires us to identify a better mechanism for goal specification and evaluation.

There are various memory evaluations for visualizations. For example, we could simply ask: "if shown a visualization twice, would a viewer recall seeing it before?" An alternative test may target a subgoal of the visualization by asking whether the viewer remembers a particular takeaway after they have stopped looking at the visualization. Memorability is an essential part of communicative intent; designers should not be happy if the information went in one ear and out the other. Works in semiology such as [4, 29, 30] offer insights into how meaning is formed and communicated through images and argue that images should be 'memorizable.' Although this group of works offers limited analysis of communicative visualizations, they argue that designers should consider long-term engagement and retention as valuable goals. Memorability tests can shape designers' decisions and direction. For example, visual images such as chart junk may increase the memorability of a visualization [6]. Designers might also choose to add visual difficulties to a visualization to increase engagement, deeper processing, and long-term retention [25]. Our experimental design targets memorability as a key feature.

Designers can evaluate visualizations with various usability tests.
Specific techniques range from usability inspection metrics (as described by [24]) to interface-focused user studies [24] and qualitative measures such as interviews and observations [27]. Evaluations of visualizations can be based on common design guidelines and heuristics, such as "this visualization makes important information visually salient" [18, 21]. For narrative visualizations, [22] and [32] discuss the potential of using the Elicitation Interview technique to evaluate static and narrative visualizations. For visual analytic systems, process data can be collected as the user explores the data and can be used to evaluate the system [10]. Compared to generic task taxonomies, usability tests are specific to the visualization systems and their communicative intent. Therefore, they have the potential to target more complex cognitive levels. Nevertheless, they can be costly and time-consuming. Depending on the heuristics used in the tests, these evaluations might also require trained experts.

While most of these evaluation mechanisms target generic visualizations, some strategies help designers achieve communicative visualization objectives. For example, using visualization rhetoric as an analytical framework for narrative visualization can help designers make engaging, layered visualizations that have clear interpretation priorities [26]. Draw-your-own style visualizations ask viewers to chart out their expectations before seeing the data, an action that supports learning [28]. Furthermore, bringing gamification to visualization has the potential to improve engagement [13]. Nevertheless, these strategies describe how to achieve an objective rather than offering a language to specify those objectives. Overall, existing evaluation methods rarely focus on specific higher-level communicative goals. While there are strategies to improve communicative visualizations, designers might still find it challenging to describe their intent due to a lack of resources and guidelines. As a result, designers often need to evaluate communicative visualizations from the cognitive efficiency perspective.

A different approach is to frame the evaluation of visualizations as a learning-based problem, both for communicative visualizations [1] and visual analytic systems [10]. Specifically, this framework connects studies in learning science and communication theory to construct an insight-to-learning-objective taxonomy that models designers' intent.

In the visualization community, insights are commonly thought of as the goal of visualizations [8, 41]. A widely accepted definition of insight is a unit of information or knowledge that a visualization communicates [9]. An insight could be an observation of a fact, pattern, or relationship in the data [11]. Insights can vary from simple to complex along several characteristics: they can be complex, deep, qualitative, unexpected, and relevant [31]. From a communicative viewpoint, designers could consider insights as the conclusion or the primary takeaway from the graph. With insights as communicative intent, designers aim to guide the viewer to understand potentially multiple data insights. Before or during the design process, designers will know what specific insight(s) they want to communicate. There are several limitations to this type of specification. Using insights to formulate communicative intent focuses on the data and, therefore, often does not capture how the viewer will interact with the data.
Also, insights might not capture all of the different kinds of communicative goals, such as outcomes or actions that designers want the viewer to take after seeing the visualization.

An alternative to insights is learning objectives. We adapt learning objectives from Bloom's Taxonomy for describing communicative intent for visualizations [1]. Through the lens of a teaching problem, designers take on a teacher's role, explicitly defining their goals of what they want the viewer (student) to do or remember after viewing the visualization. Learning objectives have a structure of "The viewer will [verb] [noun]," where the verb specifies a cognitive change and the noun is a piece of knowledge. Bloom's taxonomy [3] provides a structure for designers to effectively articulate what they want their audience to do, at varying complexity levels.

In our study, we used three types of specifications to frame communicative intent: learning objectives, insights, and assessments. Each type reflects a different way of considering how the viewer might learn. To briefly introduce these types of specifications, we discuss a simple example of a dataset with three values of fruit sales: $500 in apple sales, $200 in pear sales, and $100 in kiwi sales. This data can be visualized as a bar chart with three groups (Figure 2). For this dataset, we can imagine that the person creating or choosing a visualization has learning objectives: LO1: the viewer will recall the amount sold of each fruit and/or LO2: the viewer will recall the differences between fruit sales. Depending on the learning objective(s) and their relative importance (and factoring in the properties of the audience), we might make different choices when selecting one of the visualizations in Figure 2. Learning objectives can build on each other (e.g., we have to recall something before we can use it for synthesis tasks). Mathematically, learning objectives may also be equivalent or overlapping.

Depending on our learning goal, we may focus on the LO1 or LO2 insights. Note that our definition of insight centers on facts or statements about data. These need not be the 'a-ha' style insights experienced in analytical visualization forms [9]. Rather, these are communicative insights, chosen by the visualization designer, that they would like the viewer to recognize as 'facts' both during and after reading a visualization. We specifically relate insights to the learning objectives in that they are statistical statements about the data that emerge from the noun portion of the learning objective. In most situations, we expect that the viewer would be able to validate these insights as true or false. In our example above, there may be multiple insights that arise from one learning objective. Thus, we can specify our intent using these statements (or some representative sub-sample).

Just as we can generate insights, we can also create assessments. For LO2, or equivalently the second set of insights (I4, I5, I6), we could write a 'test' to evaluate whether the viewer had recalled the differences between fruit sales. There are multiple possible assessments that we could use. For example, we could ask the viewer to answer "What's the difference between apple sales and kiwi sales?" Alternatively, we could ask a true or false question, "True or False: apple sales were $200 more than kiwi sales." There are also easier, but insufficient, assessments one could use, such as ranking fruits by sales.
This question would assess whether the viewer knows the ordinal rank of each fruit, which would be a first step in knowing the exact differences. Another test would be to ask a question that assesses I1: "What was the total value of apple sales?" This assessment would give the stakeholder information on what fundamental information the viewer may have learned or may be lacking, providing more detailed feedback on which parts of the visualization are falling short and allowing the designer to address any shortcomings. Just as we could create a specification of our intent using learning objectives or insight statements, we could use a set of assessments instead. Critically, each form of specification has positive and negative features.

Our simple learning objective above took the form of "the viewer will recall the amount sold of each fruit" (LO1). The template for this statement is derived from the application of Bloom's Taxonomy [1, 3]. Learning objectives have a structure of "the viewer will [verb] [noun]." Verbs correspond to one of the cognitive dimensions: remember, understand, apply, analyze, evaluate, and create (each with sub-categories and synonyms). These categories are largely hierarchical, with lower-level tasks (e.g., recall) contrasting with higher-level objectives (e.g., generate). Nouns, in our formulation, are specific. For example, "unemployment levels in California" or "the best algorithm." Nouns also fall broadly into factual, conceptual, procedural, and metacognitive knowledge. Taken together, a more sophisticated learning objective might be: LO3: The viewer will summarize the relationship between loss of employees and loss in sales across different industries due to COVID-19.

[Fig. 3. The source visualizations (reconstructed): (A) Road quality [15], (B) Phillips curve [14], and (C) COVID and employment [16].]

We have previously demonstrated how communicative visualization intents can be mapped to this taxonomy [1], showing that designers can map many of their intentions to this framework. The benefit of this approach is that learning objectives can encapsulate a wide range of lower-level goals (e.g., remembering a set of facts) through succinct high-level descriptions. However, creating objectives is difficult even with training [19, 34]. Thus, constructing objectives as specifications may not always be practical or convenient.

An alternative way of considering learning is to ignore the cognitive dimension and focus on key data "facts" or "insights." Though there are numerous ways of modeling insight [9], we are concerned with a narrower definition. In the context of communicative visualization, insights are statements about data that the designer would like the viewer to recognize and internalize. In our fruit sales example, we suggested an insight (I1) of "There was $500 in apple sales." An insight may be the "noun" portion of a learning objective but is often more atomic and more specific. For example, if we consider the objective (LO3) above, an insight form might be: There is a linear relationship between loss of employees and loss in sales across different industries due to COVID-19. Communicative insights are often easy to identify and articulate. Many visualization annotations are reflections of insights in graphical form. Thus, a visualization goal specification may consist of one or more insights. The designer can focus on this set when designing or selecting a visualization. (A minimal sketch of how such a linked specification might be encoded appears below.)
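To make the linkage between the three formats concrete, the following is a minimal sketch (our own illustration, not an artifact from the study) of how one information group from the fruit-sales example might be encoded. The class and field names are hypothetical; only the LO1 statement, the $500 apple-sales insight, and its assessment question come from the text above, while the pear and kiwi entries are implied by the example data.

```python
# A minimal, illustrative encoding of a linked specification group:
# a learning objective, its insights, and one assessment per insight.
# All names are hypothetical and not taken from the study's materials.
from dataclasses import dataclass, field

@dataclass
class Assessment:
    question: str
    answer: str

@dataclass
class Insight:
    statement: str              # a verifiable fact about the data
    assessment: Assessment      # one assessment per insight, as in the study design

@dataclass
class LearningObjective:
    verb: str                   # Bloom's cognitive dimension (e.g., "recall")
    noun: str                   # the knowledge being targeted
    insights: list = field(default_factory=list)

    def statement(self) -> str:
        return f"The viewer will {self.verb} {self.noun}."

# LO1: the viewer will recall the amount sold of each fruit.
lo1 = LearningObjective(
    verb="recall",
    noun="the amount sold of each fruit",
    insights=[
        Insight("There was $500 in apple sales.",
                Assessment("What was the total value of apple sales?", "$500")),
        Insight("There was $200 in pear sales.",
                Assessment("What was the total value of pear sales?", "$200")),
        Insight("There was $100 in kiwi sales.",
                Assessment("What was the total value of kiwi sales?", "$100")),
    ],
)

print(lo1.statement())
for insight in lo1.insights:
    print("-", insight.statement, "|", insight.assessment.question)
```

Structuring the group this way makes the transformations discussed earlier explicit: the insights are derived from the noun of the objective, and each assessment is derived from an insight.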
However, because insights are low-level and speak directly to data facts, we may require many insights in our specification to cover one learning objective. Too many insights in a specification may overwhelm the designer. Too few, and they might fixate on a narrower objective than intended. Additionally, because of their data orientation, insights do not always support other types of learning. For example, we may want a communicative visualization that teaches the viewer how to apply an algorithm (e.g., a triage flowchart at a hospital). It is more difficult to formulate that type of goal as an insight statement.

Assessments are an alternative, outcome-focused way of describing learning. A feature of learning objectives is an extensive mechanism for mapping objectives to concrete assessments [20, 33]. For example, an assessment for the objective/insight (LO3/I7) above might be: Q: What is the relationship between employment and sales? A: There exists a linear relationship where a loss of sales results in a loss of employment. Even multiple-choice-style questions can be combined to identify whether learning has been achieved. When these are insufficient, open-ended questions with a rubric may also work. In our context, assessments of this type are used to measure "program effects." That is, we are not concerned with a single viewer's performance, but rather the effect of the visualization on all the viewers (see the sketch following the research questions below). By specifying visualization goals through a list of questions, the designer may be forced to reckon with their choice relative to a very specific set of outcomes. The designer should create or select the visualization that will lead to the best assessment results. An assessment-based formulation may have both the benefits and disadvantages of focusing the designer on an outcome metric. Instructors will recognize the danger here of "teaching to the test" (or, in this case, "designing to the test"). As with insights, we may need many assessment questions to describe one learning objective or to 'triangulate' whether learning has happened.

To summarize, we note that these three specification formats are connected to each other, but vary in systematic ways. Learning objectives, which are explicit in specifying the cognitive impact, are often broader and more abstract than insights or assessments. Assessments are more specific than insights, focusing designers on only one possible question they could ask of a viewer. We expect that these differences in specification format will lead to differences in designer preferences. It is also ultimately possible that the best specification will be a mix of these approaches. The different forms of specification have different levels of granularity, ambiguity, generalizability, communicative efficiency, and other features that would make their use more or less desirable for describing intent. While these facets are all potentially interesting, we begin with a more high-level question: does communicative intent specified in one of these forms lead to better learning outcomes?

In our study, we conduct two experiments to investigate whether the type of specification has an impact on preferences for designs and, ultimately, their effectiveness. More specifically, our main research questions are:

1. Do different specifications (learning objectives, insights, assessments) affect the choice of visualization designs?
2. Are the chosen designs more effective than the alternatives?
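As a concrete illustration of the "program effect" notion mentioned above, the following minimal sketch (our own example with made-up responses, not material from the study) shows one way individual assessment responses could be aggregated into a per-visualization accuracy.

```python
# Minimal sketch: aggregate individual assessment responses into a
# per-visualization "program effect" (proportion of viewers answering
# correctly). The response records below are hypothetical.
from collections import defaultdict

# (visualization_id, viewer_id, answered_correctly)
responses = [
    ("pie_20pct", "v01", True), ("pie_20pct", "v02", True), ("pie_20pct", "v03", False),
    ("bar_chart", "v04", True), ("bar_chart", "v05", False), ("bar_chart", "v06", False),
]

totals = defaultdict(lambda: [0, 0])          # vis_id -> [num_correct, num_answered]
for vis_id, _viewer, correct in responses:
    totals[vis_id][0] += int(correct)
    totals[vis_id][1] += 1

for vis_id, (num_correct, n) in totals.items():
    print(f"{vis_id}: accuracy = {num_correct / n:.2f} (n = {n})")
```

The point of the aggregation is that the unit of analysis is the visualization, not the viewer: a design is judged by how well the population of viewers does on the assessment.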
To create materials for this experiment, we identified a set of three starting articles from which we generated visualization designs, learning objectives, insights, and assessments. We used three articles from The Economist's Graphic Detail series: "American retailers have laid off or furloughed one-fifth of their workers" (COVID) [16], "The Phillips curve may be broken for good" (Phillips Curve) [14], and "Italy spends lots fixing old roads, not enough building new ones" (Road Quality) [15]. Each of the articles is centered on one or two visualizations with a relatively short article that describes key details (see Figure 3). Because the text is mostly anchored to elements of the visualization, it becomes easier to infer the likely communicative intents and reverse-engineer plausible specifications.

For each key fact discussed in the article, we developed a learning objective, the same information expressed as one or more insights, and the corresponding assessment(s). These linked information groups of specifications all focus on communicating the same element of the visualization. By having roughly equivalent specifications, we can meaningfully compare specification types to each other in our study. Because of the lack of resources on how to formulate learning objectives and how they relate to insights and assessments, we explored possible ways to make linked groups of specifications. We ultimately decided to create specification groups by having each learning objective encompass one or more insights, with each insight having one corresponding assessment question. Because of this structure, we ended up with fewer learning objectives than insights and assessments, since one learning objective could comprise more than one insight and assessment. For the COVID retail article [16], a broad learning objective is "Summarize outlier industries/differences in sectors." The three insights in this information group are: "Health and Personal care have an expected amount of layoffs"; "Food service has more layoffs than expected"; and "Electronic stores have fewer layoffs than expected." However, for more 'basic' (i.e., simpler) learning objectives, the information between the two specification formats is more similar. For example, one of our learning objectives was "Recall that America's retailers have laid off or furloughed a fifth of employees because of COVID-19," and the single corresponding insight was "America's retailers have laid off or furloughed a fifth of employees because of COVID-19." To constrain our experiment, we created one assessment per insight. In reality, one could construct assessments that cover multiple insights simultaneously or, conversely, multiple assessments for the same insight.

After creating the set of linked information groups of specifications, we created several different visualization designs. The original visualizations from the Economist are professionally constructed (our reconstructed versions are shown in Figure 3). These visualizations provided us with a starting point for alternatives. With learning in mind, we generated multiple visualizations of varying expressiveness [30]. We created designs that we thought would be the most effective for a particular specification. For example, we designed a simple pie chart highlighting 20% for the specification "America's retailers have laid off or furloughed a fifth of employees because of COVID-19." Figure 4 illustrates the process that we used to generate the materials. Table 1 summarizes the materials generated for the three articles.
The complete instrument is available in our supplemental materials.

In our first experiment, we collected participant preferences for multiple visualization designs. Though we only had our participants choose between different visualization designs, our study may be more broadly applicable to designers who actually create different designs, weighing different design decisions based on communicative goals. Participants in our study began by reading an informed consent document and completing two qualification questions. These questions tested the participants' abilities to correctly extract information from a bar graph and a scatterplot. This ensured a level of visual literacy. After passing the qualification questions, participants read instructions and completed three practice questions with feedback. In this experiment, the visualizations were shown within-subjects and the specifications were shown between-subjects. On the next page, participants previewed all of the visualizations from one dataset. This preview was intended to encourage participants to compare the visualizations to each other in their ratings. On the following page, we showed each visualization with a specification and elicited a rating on a five-point Likert scale from "Extremely good" to "Extremely bad." Each participant rated each visualization once based on a single specification. In previous pilot iterations, we experimented with having participants rank graphs either in a pair or in a group. The advantage of rating each graph on a Likert scale instead of ranking is that we could evaluate the magnitude of differences between the graphs. Additionally, rating all of the visualizations meant that preferences could be relative to all graphs in the set.

To obtain baseline ratings of visualizations, we conducted the same experiment without providing any specification. In this condition, we simply asked "Please evaluate each visualization." Participants were free to interpret this as they chose. With the same procedure as before, participants rated each visualization on its own without a specification. With this measure, we can adjust each preference based on how much it deviates from its baseline rating. This allows us to compare the visualizations to each other by adjusting for each graph's varying aesthetics and designs.

For the preference experiment, we hosted our experiment on Qualtrics and recruited 667 participants from Amazon Mechanical Turk. We compensated participants with $1.00, and the median completion time was 6.2 minutes. For the baseline experiment of evaluating the visualizations without any specifications, we recruited 69 participants. We compensated participants with $0.45, and the median completion time was 3.4 minutes.

In contrast to the participant who decides which visualization is best (the message sender), we refer to the recipient as the viewer. To compare the participant preferences to an empirical evaluation of the effectiveness of the design, we conducted another experiment to evaluate viewer accuracy on the assessments. In order to evaluate the visualizations, we implemented three versions of the accuracy experiment. In this experiment, the visualizations were shown between-subjects. We asked participants to answer the question in one of three conditions: before they saw the visualization (no-visualization baseline), while they were viewing the visualization (readability), or after they had seen the visualization and it was taken away (memorability).
Each participant only viewed one test question and one visualization. The no-visualization baseline condition allows us to determine the question's baseline difficulty based on guessing and prior knowledge. Then, we evaluated readability accuracy for answering the assessment while viewing the visualization. Finally, we evaluated memorability accuracy, i.e., how effective the visualizations were for information that had to be remembered. For the accuracy experiment we again used Qualtrics and recruited participants from Amazon Mechanical Turk. We recruited 203 participants for the no-visualization baseline assessment ($0.10, median completion time = 1.1 minutes), 1658 participants for the readability assessment during the visualization ($0.10, median completion time = 1.4 minutes), and 1126 participants for the memorability assessment ($0.20, median completion time = 2.1 minutes).

After collecting the experimental data, we are able to return to our two main research questions. First, we investigate how different visualization specifications (learning objectives, insights, assessments) impact the choice of visualization designs. Then, we determine if the chosen visualizations are more effective than the alternatives.

[Fig. 6. The average preference for each of the specification types: no specification (baseline), learning objectives, insights, and assessments. Baseline preferences are significantly higher compared to specification preferences. Insight preferences are significantly higher compared to both learning objectives and assessments. Error bars show 95% confidence intervals. Participants preferred visualizations more when rating them with no information.]

Our analysis begins by solely looking at preferences. That is, we explore whether the different types of specifications changed preference ratings of visualizations. Recall that participants rated their preference of visualizations on a five-point Likert scale from "Extremely good" (5) to "Extremely bad" (1). In our analysis, we look at both the raw averages of the preference scores and the rank of visualizations in relation to each other.

First, we examined whether there were differences in average preference for the type of specification (learning objectives, insights, assessments, or no information). A Kruskal-Wallis rank sum test showed that preferences significantly differed by specification type, H(3) = 74.35, p < 0.001. Post-hoc Dunn's tests with a Bonferroni correction showed that preferences for the three specifications were significantly lower than the baseline of no information, p < 0.001. On average, the participants rated the visualizations lower when evaluating based on a specification compared to no specification (see Figure 6). A possible explanation for this is that the constraints imposed by specifications lead participants to be more critical of the visualizations. Additionally, preference scores from the insight specification were significantly higher than those from the learning objective specification (p = 0.03) and the assessment specification (p = 0.02).

Next, we looked at the similarity of the ranks of preferences. Even though the baseline preferences are significantly higher than the specification preferences, the visualizations may all be ranked the same. We conducted a Spearman correlation between the ranks of the visualizations for each specification type. We found that all specification types were correlated with each other. (A minimal sketch of this preference analysis appears below.)
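The preference analysis above could be reproduced roughly as follows. This is a minimal sketch under assumed file and column names, not the authors' analysis code; it uses SciPy for the Kruskal-Wallis and Spearman tests and the third-party scikit-posthocs package for Dunn's post-hoc test.

```python
# Minimal sketch of the preference analysis described above. The DataFrame
# `ratings` and its column names are assumptions, not the study's actual data.
import pandas as pd
from scipy import stats
import scikit_posthocs as sp   # third-party package providing Dunn's test

# ratings: one row per (participant, visualization) rating, with columns
#   spec_type in {"baseline", "objective", "insight", "assessment"},
#   vis_id, and rating (1-5 Likert)
ratings = pd.read_csv("preference_ratings.csv")   # hypothetical file

# Kruskal-Wallis: do ratings differ by specification type?
groups = [g["rating"].values for _, g in ratings.groupby("spec_type")]
H, p = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H = {H:.2f}, p = {p:.4f}")

# Dunn's post-hoc test with Bonferroni correction (pairwise comparisons).
print(sp.posthoc_dunn(ratings, val_col="rating", group_col="spec_type",
                      p_adjust="bonferroni"))

# Rank the visualizations by mean rating within each specification type and
# check whether the rankings agree across conditions (Spearman correlation).
mean_by_vis = ratings.groupby(["spec_type", "vis_id"])["rating"].mean().unstack()
rho, p_rho = stats.spearmanr(mean_by_vis.loc["objective"], mean_by_vis.loc["insight"])
print(f"Spearman (objective vs. insight ranks): rho = {rho:.2f}, p = {p_rho:.4f}")
```

Non-parametric tests are the natural choice here because the Likert ratings are ordinal rather than interval data.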
We investigated whether different specification types led participants to have different 'most-preferred' visualizations. For each specification group, we looked at the most-preferred visualization for each specification type. Out of the 21 groups, four groups had different most-preferred visualizations for all three specifications. Eleven groups had one specification lead to a different most-preferred visualization, while the other two specifications led to the same most-preferred visualization. Six groups had all three specifications lead to the same most-preferred visualization. See Figure 1 for an example of an information group where each specification type leads to a different most-preferred visualization. Additionally, we can contrast the baseline most-preferred visualization (i.e., the one chosen based on considerations without specifications) against the specifications' most-preferred visualizations. We find that in six cases the baseline matched no specifications (i.e., given a specification, the visualization was never most-preferred), seven times it matched one specification, five times it matched two specifications, and three times it matched all three specifications. This demonstrates the potential pitfall of a completely specification-free choice.

Just as we study the most-preferred visualization, we can also consider the least-preferred one. Out of the 21 groups, five groups had different least-preferred visualizations for all three specifications. Seven groups had one specification lead to a different least-preferred visualization, while the other two specifications led to the same least-preferred visualization. Nine groups had all three specifications lead to the same least-preferred visualization. This indicates more agreement on the least-preferred choice given specifications. When considering the least-preferred visualization from the no-specification baseline, we find a similar pattern. In five cases the least-preferred no-specification visualization did not match any of the least-preferred visualizations given a specification. That is, the specification caused participants to reassess the 'worst' visualizations. In six cases the least-preferred no-specification visualization matched one specification, and in five cases it matched two specifications. In five cases it matched all three specifications. In those few cases, the initial 'gut' preference on the worst visualization corresponded to the preference after observing the specification.

To summarize, specifications affected visualization preferences. When given no information about the communicative goal, participants rated the visualizations higher than when they were given a specification. Out of the three specification formats, participants rated visualizations higher when viewing insights compared to either learning objectives or assessments. This suggests that insights may have more of a direct mapping onto graphical features of the visualization. Even though there was a difference between preferences, we found that the ranks of designs among specification types were correlated with each other, indicating some similarities between formats.

While understanding preferences is interesting, we would ultimately like to know if specification formats lead participants to select more effective visualizations. We do so by analyzing viewer accuracy on the assessments. We investigate three evaluative conditions: baseline, readability, and memorability (see Figure 5).
Recall that baseline is the performance before seeing the visualization, readability is the performance while seeing the visualization, and memorability is the performance after the visualization has been taken away (as described in Section 4.3). We start by examining whether there were differences in average accuracy between the three conditions (baseline, readability, and memorability). An ANOVA showed that accuracy significantly differed by condition, F(2, 429) = 10.57, p < 0.001. Post-hoc tests with a Bonferroni correction showed that baseline accuracy is significantly lower than readability accuracy (p < 0.001) and memorability accuracy (p = 0.005). Memorability accuracy is not significantly different from readability accuracy (p = 0.6). Overall, the baseline accuracy for questions was approximately chance (M = 0.37, SD = 0.18). As expected, readability accuracy (M = 0.49, SD = 0.24) increased from the baseline of answering questions without the visualization. The memorability accuracy was between those two conditions (M = 0.45, SD = 0.25), though not significantly different from readability accuracy.

These results are consistent with expectations (see Figure 7). Without a visualization, viewers perform the worst. While looking at the visualization, viewers perform the best (in many cases they can decode the answer from the visualization). When we take the visualization away, viewer performance is slightly lower as viewers have to rely on their memory to answer the question (but not significantly different from when viewing the visualization). We note that performance in the memorability variant may decline as additional time or distractions are introduced.

[Fig. 7. The average accuracy for each of the conditions: no-visualization baseline, readability, and memorability. Error bars show 95% confidence intervals. Participants performed as well on the memorability assessments as they did on the readability assessments.]

While the general baseline-readability-memorability pattern was consistent, we found that the baseline difficulty of the questions without a visualization varied. The baseline difficulty ranged from an accuracy of 0.11 (more likely to answer incorrectly than chance) to 0.80 (more likely to answer correctly than chance). We use the baseline accuracy to adjust each question for its underlying difficulty. We calculate the increase (or decrease) in accuracy from the baseline to the readability or memorability accuracy. This allows us to study the specific effect that the visualization has on accuracy.

We now compare the preferences to the accuracy to understand if the specification type influences participants to prefer better or worse visualizations. In this section, accuracy is defined as the increase (or decrease) in accuracy from the no-visualization baseline to the readability or memorability accuracy, as described above. We computed Spearman's correlation coefficient for each specification between preference and change in accuracy from the baseline, for both readability and memorability accuracy. There is a significant relationship between preferences for learning objectives and change in accuracy for both the readability accuracy and the memorability accuracy (see Table 3 and Figure 8). For insights, there is a significant relationship between preferences and change in accuracy for readability accuracy, but not memorability accuracy.
For assessments, there was not a significant relationship between preferences and change in accuracy for either readability accuracy or memorability accuracy. Additionally, there is not a significant relationship between baseline preferences and accuracy. When looking at the difference between the two accuracy conditions, we find the same trends for accuracy in the memorability condition as in the readability condition, but not as strong. These correlations suggest that learning objective specifications lead people to prefer better visualization designs.

A multiple regression model predicting the change in accuracy (baseline to readability) from baseline preference, learning objective preference, insight preference, and assessment preference was significant, F(4,139) = 7.80, R² = 0.18, p < 0.001. Learning objective preferences and insight preferences significantly predict accuracy (see Table 4). Neither the assessment preferences nor the baseline preferences were significant predictors. Consistent with the correlation tests, we found similar results for a multiple regression model predicting the change in accuracy from baseline to memorability, F(4,139) = 4.91, R² = 0.12, p < 0.001. (A minimal sketch of this correlation and regression analysis appears below.)

[Table 3. Spearman's correlation coefficients between preferences and the adjusted accuracy for each specification. Adjusted accuracy is defined as the change from the no-visualization baseline to either the readability assessments or the memorability assessments. P-values are judged at the Bonferroni-adjusted significance level of p = 0.0125.]

[Table 4. Multiple regression models predicting accuracy from baseline preference, learning objective preference, insight preference, and assessment preference. The first model predicts the change from baseline accuracy to readability accuracy; the second predicts the change from baseline accuracy to memorability accuracy.]

Furthermore, we found that designer preferences for graphs do not always line up with assessments. For each question, we identified the design that had the highest accuracy gain from the baseline and looked at preferences for that design. In a few cases, participants identified the graph with the highest accuracy as the least preferred. For example, question 11 ("Q: What is the Phillips Curve? A: The trade-off between unemployment and inflation") is most correctly answered (in the readability condition) with a very minimal line chart. However, participants rated that chart as least preferred for all specifications (see Figure 9). This shows that articulating a specification is important but sometimes insufficient. Thus, while assessments may not be the best specification format, they may be important in evaluating designer choices.

Our project investigated the effects of different specification types on selecting 'better' visualizations. We found that just having a specification changed preferences over no information at all. More importantly, specifications led participants to prefer more effective visualizations, while baseline preferences did not. Our results show that it is important to define a communicative intent as part of the design process. Specifying a communicative intent can guide the designer to focus on a goal, and help them make better design choices towards this goal. In our analysis of the specifications, we found evidence that the three specification forms are similar, but may lead to significant differences in some situations.
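For readers who want to follow the accuracy analysis above step by step, the sketch below shows one way it might be assembled: adjust assessment accuracy by the per-question no-visualization baseline, correlate the gain with preferences, and fit the multiple regression. All file names, column names, and helpers are assumptions for illustration, not the authors' code or data.

```python
# Minimal, illustrative pipeline for the accuracy analysis described above.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# Hypothetical inputs:
#   vis_acc:  question, vis_id, readability, memorability   (proportion correct)
#   baseline: question, baseline                             (no-visualization accuracy)
#   prefs:    question, vis_id, baseline_pref, objective_pref, insight_pref, assessment_pref
vis_acc = pd.read_csv("visualization_accuracy.csv")
baseline = pd.read_csv("baseline_accuracy.csv")
prefs = pd.read_csv("mean_preferences.csv")

df = vis_acc.merge(baseline, on="question").merge(prefs, on=["question", "vis_id"])

# Adjusted accuracy: gain over the question's baseline difficulty.
df["readability_gain"] = df["readability"] - df["baseline"]
df["memorability_gain"] = df["memorability"] - df["baseline"]

# Spearman correlation between each specification's preference and the readability gain.
for spec in ["baseline_pref", "objective_pref", "insight_pref", "assessment_pref"]:
    rho, p = stats.spearmanr(df[spec], df["readability_gain"])
    print(f"{spec}: rho = {rho:.2f}, p = {p:.4f}")

# Multiple regression predicting the readability gain from the four preference types.
model = smf.ols("readability_gain ~ baseline_pref + objective_pref + "
                "insight_pref + assessment_pref", data=df).fit()
print(model.summary())
```

Subtracting the baseline difficulty before any correlation or regression is what isolates the visualization's contribution from what viewers could already guess or knew beforehand.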
The informational content in each is similar, so it is not surprising that they were correlated with each other (Table 3). However, the specification format can emphasize a different way of thinking about the goal. We found that preferences from learning objectives (and, to a lesser extent, insights) were the most correlated with design effectiveness (Figure 8), suggesting they may be best overall.

While our results show that specifications help people choose better visualizations, they also reveal that specifications may not be sufficient to choose the best visualization. In some cases, the most effective designs for a specification may be intuitive. However, sometimes evaluations may reveal effective design options that would not have otherwise been considered. The analysis of the most effective design option illustrates this point (Figure 9). Participants generally preferred the most effective graph. However, occasionally they were very wrong, rating the most effective graph as their least preferred graph. Incorporating this evaluative feedback into the design process could help designers choose better graphs than preferences alone.

Our study also shows that having viewers complete the assessments after viewing the visualization helps evaluate the effectiveness of the design relative to communicative intent. Through the assessments, we evaluated both readability accuracy (answering the test question while looking at the visualization) and memorability accuracy (answering the test question after the visualization has been taken away). Our results demonstrate that while they are useful as a specification, assessments may be even more valuable in the design process as they help identify less effective choices. Because assessments are derived directly from the objectives and insights, they can better ensure that the design choices made are good ones. Additionally, running multiple assessments may allow the designer to understand the trade-offs between different goals (i.e., when one visualization cannot 'maximize' effectiveness across multiple intents).

Finally, it is worth noting that in many situations designers are working in collaborative settings (e.g., with editors, journalists, clients, publishers, etc.). While an expert designer may create or select good visualizations on an independent project, a formal specification 'language' may enable better collaborations.

[Fig. 9. The visualization with the highest accuracy (for both readability and memorability assessments) was identified for each information group; we show the rank of that visualization for each of the specifications (learning objectives, insights, assessments). For information group 11, the most effective visualization was ranked as least preferred (8th out of 8) in all three specification conditions. For information group 1, only the learning objectives group rated the most effective visualization as most preferred.]

Our study also demonstrates that creating learning objectives for visualization design is a possible way to formulate communicative intent. Although the participants in our study did not create the learning objectives themselves, they could meaningfully use them to choose better designs. As we demonstrate, it is feasible to create these learning objectives. However, while feasible, generating specifications is not easy. During the process, we ourselves needed multiple iterations to arrive at satisfactory learning objectives.
Previous work in this area has also noted that using learning objectives is difficult and that the skill needs to be learned and practiced [1]. We suggest that more resources be developed to help designers, or those working with them, create learning objectives for visualizations (just as there are resources in the original domain to help educators create learning objectives in the context of classrooms). It is also worth acknowledging that many individuals work as visualization designers without any specific training (e.g., a scientist writing an article with figures). A formal specification may help guide these many 'novice designers' in building better visualizations. Even though there is a learning curve to learning objectives, we have shown an advantage to this specification form over the other types that we have studied.

Communicative visualizations are often embedded in a broader context: articles, presentations, reports. Our goal, in this first study, was to understand the effectiveness of visualizations independently of this additional context. While our specifications were extracted from the original materials, these intents were emphasized differently in the text in ways that could not be easily controlled for. Future studies can clarify the interaction between specification, context, and visualization(s).

In this study, we recruited crowd workers to rate different visualizations. The reality is that 'designers,' in our broad sense, are common. Many people must create or pick visualizations without significant training. Thus, helping them specify their intent may lead to better outcomes. However, among this population, we observed a relatively low qualification pass rate. In some sense this is encouraging, as it ensured a high level of visual literacy among our participants. However, we hope to validate our work with other populations. For example, we would like to know if those with more training or experience in design similarly benefit from specifications (and which ones). To allow for broad participation, we utilized pre-designed graphs instead of focusing on the visualization creation process. In the future, we hope to examine how different types of specifications will affect the design process. We hope to conduct in-depth qualitative design sessions to reveal more about how designers think about communicative intent and how they use it to make design choices. Finally, we hope to study further how the combination of learning objectives and assessments can not only guide communicative intent, but can also be utilized as part of the design process to inform better design choices.

Communicative intent is an essential part of the visualization design process. Without clear intent, choosing what data to show and how to show it may lead to worse choices. There are multiple ways to construct a statement of communicative intent (learning objectives, insights, assessments). In this work, we identified which specification format was most likely to change baseline preferences for the better. While we found that learning objectives may be the best specification format, we saw improvements over the baseline for the other specification forms as well. Moreover, establishing a communicative intent makes it possible to evaluate the graphs based on the goal. If designers do not have a clear idea of what the visualization should be achieving, then it is hard to meaningfully evaluate the graphs other than just by aesthetics and generic design guidelines.
We encourage designers to articulate their communicative intent to guide more effective design choices.

We thank our anonymous reviewers for their feedback. We are grateful to the NSF for their support of this work through NSF IIS-1815760.

References
[1] Communicative visualizations as a learning problem.
[2] Low-level components of analytic activity in information visualization.
[3] A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives.
[4] Semiology of Graphics: Diagrams, Networks, Maps.
[5] Evaluating the Quality of Learning: The SOLO Taxonomy (Structure of the Observed Learning Outcome).
[6] Beyond memorability: Visualization recognition and recall.
[7] The Functional Art: An Introduction to Information Graphics and Visualization. New Riders.
[8] Readings in Information Visualization: Using Vision to Think.
[9] Defining insight for visual analytics.
[10] Learning-based evaluation of visual analytic systems.
[11] Toward effective insight management in visual analytics systems.
[12] Graphics for Learning: Proven Guidelines for Planning, Designing, and Evaluating Visuals in Training Materials.
[13] Playable data: Characterizing the design space of game-y infographics.
[14] The Phillips curve may be broken for good. The Economist.
[15] Italy spends lots fixing old roads, not enough building new ones. The Economist.
[16] American retailers have laid off or furloughed one-fifth of their workers. The Economist.
[17] Show Me the Numbers: Designing Tables and Graphs to Enlighten.
[18] An heuristic set for evaluation in information visualization.
[19] Gronlund's Writing Instructional Objectives.
[20] Developing and validating multiple-choice test items.
[21] Evaluating information visualization via the interplay of heuristic evaluation and question-based scoring.
[22] The elicitation interview technique: Capturing people's experiences of data representations.
[23] Communicating with interactive articles. Distill.
[24] Usability inspection methods after 15 years of research and practice.
[25] Benefitting InfoVis with visual difficulties.
[26] Visualization rhetoric: Framing effects in narrative visualization.
[27] Grounded evaluation of information visualizations.
[28] Explaining the gap: Visualizing one's predictions improves recall and comprehension of data.
[29] An algebraic process for visualization design.
[30] Automating the design of graphical presentations of relational information.
[31] Toward measuring visualization insight.
[32] A micro-phenomenological lens for evaluating narrative visualization.
[33] Test Better, Teach Better: The Instructional Role of Assessment. Association for Supervision and Curriculum Development.
[34] Writing measurable learning objectives to aid successful online course development.
[35] Task-based effectiveness of basic visualizations.
[36] The Visual Display of Quantitative Information.
[37] Beautiful Evidence.
[38] Information Visualization: Perception for Design.
[39] Understanding by Design. Association for Supervision and Curriculum Development.
[40] The Wall Street Journal Guide to Information Graphics: The Dos and Don'ts of Presenting Data, Facts, and Figures.
[41] Understanding and characterizing insights: How do people gain insights using information visualization?