key: cord-0483718-xh21gwi3
authors: Lee, Doris Jung-Lin; Tang, Dixin; Agarwal, Kunal; Boonmark, Thyne; Chen, Caitlyn; Kang, Jake; Mukhopadhyay, Ujjaini; Song, Jerry; Yong, Micah; Hearst, Marti A.; Parameswaran, Aditya G.
title: Lux: Always-on Visualization Recommendations for Exploratory Dataframe Workflows
date: 2021-04-30
journal: nan
DOI: nan
sha: 8b7362332e30b6f1987cd06660d73780ab765073
doc_id: 483718
cord_uid: xh21gwi3

Exploratory data science largely happens in computational notebooks with dataframe APIs, such as pandas, that support flexible means to transform, clean, and analyze data. Yet, visually exploring data in dataframes remains tedious, requiring substantial programming effort for visualization and mental effort to determine what analysis to perform next. We propose Lux, an always-on framework for accelerating visual insight discovery in dataframe workflows. When users print a dataframe in their notebooks, Lux recommends visualizations to provide a quick overview of the patterns and trends and suggests promising analysis directions. Lux features a high-level language for generating visualizations on demand to encourage rapid visual experimentation with data. We demonstrate that through the use of a careful design and three system optimizations, Lux adds no more than two seconds of overhead on top of pandas for over 98% of datasets in the UCI repository. We evaluate Lux in terms of usability via a controlled first-use study and interviews with early adopters, finding that Lux helps fulfill the needs of data scientists for visualization support within their dataframe workflows. Lux has already been embraced by data science practitioners, with over 3.1k stars on GitHub.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 15, No. 3 ISSN 2150-8097. doi:10.14778/3494124.3494151

Exploratory data science is an iterative, trial-and-error process, involving many interleaved stages of data cleaning, transformation, analysis, and visualization. Data scientists typically use a dataframe library [41, 60], such as pandas [72], which offers a flexible and rich set of operators to transform, analyze, and clean tabular datasets. They manipulate dataframes within a computational notebook such as Jupyter, which offers a flexible medium to write and execute snippets of code; nearly 75% of data scientists use them every day [14]. In between these dataframe transformation operations, users visually inspect intermediate results, either by printing the dataframe, or by using a visualization library to generate visual summaries. This visual inspection is essential to validate whether the prior operations had their desired effect and to determine what needs to be done next. However, visualizing dataframes is an unwieldy and error-prone process, adding substantial friction to the fluid, iterative process of data science, for two reasons: cumbersome boilerplate code and challenges in determining the next steps.

Cumbersome Boilerplate Code. Substantial boilerplate code is necessary to simply generate a visualization from dataframes. In a formative study, we analyzed a sample of 587 publicly available notebooks from Rule et al. [61] to understand current visualization practices. A surprising number of notebooks apply a series of data processing operations to wrangle the dataframe into a form amenable to visualization, followed by a set of highly templatized visualization specification code snippets copied and pasted across the notebook. Our findings echo a recent study of 6386 GitHub notebooks [47], where visualization code was the most dominant category of duplicated code (21%). On top of the high cognitive cost of writing "glue code" to go from dataframes to visualizations [21, 76], users have to context-switch between thinking about data operations and visual elements. These barriers hinder exploratory visualizations and, as a result, users often only visualize during the "late stages of [their] workflow" [22, 44], rather than for experimenting with possible analyses, which is precisely when visualization is likely to be most useful.

Challenges in Determining Next Steps. Beyond writing code to generate a given visualization, there are challenges in determining which visualizations to generate in the first place. Dataframe APIs support datasets with millions of records and hundreds of attributes, leading to many combinations of visualizations that can be generated. The many choices make it hard for the data scientist to determine what visualization to generate to advance analysis, and existing dataframe workflows provide no automated assistance.

To address the above challenges, we introduce Lux, a seamless extension to pandas that retains its convenient and powerful API, but enhances the tabular outputs with automatically generated visualizations highlighting interesting patterns and suggesting next steps for analysis. Lux has already been adopted by data scientists from a diverse set of industries, and has gained traction in the open-source community, with around 9k monthly downloads (62k downloads in total) and over 3.1k stars on GitHub, as of November 2021. Multiple industry users have created blog posts or YouTube videos extolling the virtues of Lux [9-11, 30, 59, 78].

Contributions. Our contributions are as follows.

First, we introduce a novel, always-on framework that provides visualizations for the dataframe as it stands at any point in the workflow ( §3). This is in contrast with existing visualization specification libraries [39, 73, 77] that require users to write substantial code to generate visualizations. This multi-tiered dataframe interaction framework supports pandas' 600+ operators without compromising the ease and flexibility of data transformation and analysis ( §4).

Second, we introduce an expressive and succinct intent language powered by a formal algebra that allows users to specify their fuzzy intent at a high level. Compared to existing languages for partial specification [52, 58, 68, 79], the intent language in Lux not only allows users to create one or more visualizations but also to flexibly indicate their high-level analysis interest, without worrying about how the data elements map onto aspects of the visualization (§5).

Third, we introduce a novel recommendation system that uses automatically extracted information about dataframes to implicitly infer the appropriate visualizations to recommend. This is in contrast with most existing visualization recommendation systems, which are situated in GUI-based charting tools, whereas Lux is one of the first of such systems that is designed to fit into a programmatic dataframe workflow. In particular, we introduce two novel classes of recommendations based on dataframe structure and history specific to such workflows ( §6).

Fourth, we identify opportunities wherein we can adapt techniques from approximate query processing [27, 33] , early pruning [45, 54, 75] , caching and reuse [35, 71] , and asynchronous computation [24, 83] to provide interactive feedback, which is critical for usability; Lux adds no more than two seconds of overhead on top of medium-to-large real-world datasets ( §8).

Finally, we evaluate the interactive latency of Lux (§9) and its usability with early adopters (§10) to assess the effectiveness of this lightweight, always-on approach to visualizing dataframes.

Lux draws from work on visualization recommendation systems, visualization specification, and visual dataframe tools.

Visualization Recommendation (VisRec). To visualize data, data scientists need to subselect the aspects of data, and then define a mapping from data to graphical encodings. Interactive interfaces, such as Tableau [4, 70] and PowerBI [13], offer easy-to-use interfaces for visualization construction. Some systems suggest other possible visualizations for users to browse through, as visualization recommendations. VisRec systems can either suggest interesting portions of the data to visualize based on statistical properties [28, 43, 49, 57, 68, 74, 75] or better ways to visualize attributes that users have selected [37, 55, 56, 58, 79]. Similarly, there has been research on recommending interesting attributes or filters to avoid manual data exploration during OLAP [42, 48, 62-64, 81]. While interactive GUI-based tools have gained adoption among business analysts, they are not as widely used by data scientists with programming expertise, due to their lack of customizability and integration with the rest of the data science workflow. Lux draws on recommendation principles from this literature and explores how visualization recommendations can support a dataframe workflow. Moreover, Figure 5 outlines a novel, multi-tiered framework that Lux employs to support flexible visual and programmatic interactions with a dataframe, overcoming the limitation in expressiveness of existing GUI-based VisRec tools.

Visualization Specification (VisSpec). VisSpec frameworks codify visualization design principles and best practices to simplify the task of creating a visualization [25, 65, 66, 69, 77] . These frameworks encompass a range of abstractions depending on the degree to which users are required to specify low-level details associated with the visualization definition. For example, imperative visualization libraries, such as plotly [40] , D3 [25] , and matplotlib [39] , require users to manually compute the data associated with the graphical elements (e.g., position or size of marks) before defining the visualization characteristics. Declarative visualization languages, such as Altair [73] and Vega-Lite [65] , enable rapid specification of visualizations by applying smart defaults to synthesize low-level visualization details, so that users are not required to specify common chart components, such as axes, ticks, and labels. Lux is built on top of these imperative and declarative frameworks and synthesizes visualization code to enable users to customize as needed.

Partial specification languages, such as Draco [58] and CompassQL [79], commonly employed in VisRec systems, support reasoning based on a partial specification provided by the user and design constraints encoded in the system. A partial specification can be thought of as a "query", with the system automatically ranking a set of perceptually effective visualizations that match the query. As we will see in Section 5, the intent language in Lux is more convenient to specify than these existing languages in that it only requires users to specify data aspects of interest (or omit them entirely), instead of having to worry about visualization encodings. Lux is also more versatile in that it supports functionalities beyond visualization creation for steering the recommendations generated. That said, as a promising direction for future work, Lux could make use of Draco's sophisticated reasoning around visualization design to improve which visualizations are displayed, going beyond the rule-based heuristics in its current implementation.

Compared to imperative, declarative, and partial VisSpec frameworks, Figure 6 illustrates how Lux's intent language further reduces the specification burden on users, allowing them to provide lightweight intent as opposed to writing long code fragments for visualization; we will elaborate on this in Section 5.

Visual Data Exploration with Dataframes. Of late, dataframes have become the de facto framework for interactive data science. The comprehensive, incremental set of operators makes it easy to do sophisticated data transformation, while also allowing validation after each step. However, exploring dataframes is challenging, requiring substantial programming and analytical know-how. Many visualization tools have been developed for dataframes [1, 7, 20, 31, 67]. These tools generate summaries, covering analyses spanning missing values, outliers, attribute-level visualizations, and associated statistics. In addition, bamboolib [7], pandas-profiling [1], dataprep [67], sweetviz [31], and pandasgui [20] offer a GUI for constructing visualizations and data transformations. Unlike these existing tools, Lux lowers the barrier to visualizing dataframes by adopting an always-on approach, so that visualizations are recommended to users at all times, instead of relying on users to explicitly call external commands to plot or profile as needed.

In this example workflow, we demonstrate how always-on visualization support for dataframes accelerates exploration and discovery. We present a workflow of Alice, a public policy analyst, exploring the relationship between world developmental indicators (such as life expectancy, inequality, and wellbeing) and the country's early effort in COVID-19 response. A live demo of the example notebook can be found at http://tinyurl.com/demo-lux.

Always-on dataframe visualization. Alice opens up a Jupyter notebook and imports pandas and Lux. Using pandas's read_csv command, Alice loads the Happy Planet Index (HPI) [3] dataset of country-level data on sustainability and well-being. To get an overview, Alice prints the dataframe df and Lux displays the default pandas tabular view, as shown in Figure 1 (top, orange box). By clicking on the toggle button, Alice switches to the Lux view that displays a set of univariate and bivariate visualizations (bottom), including scatterplots, bar charts, and maps, showing an overview of the trends. Visualizations are organized into sets called actions, displayed as tabs. The one displayed currently is the Geographic action. By inspecting the Correlation tab in Figure 1 (not displayed here), she learns that there is a negative correlation between AvrgLifeExpectancy and Inequality (same chart as Figure 2 left); in other words, countries with higher levels of inequality also have a lower average life expectancy. She also examines the other tabs, which show the Distribution of quantitative attributes and the Occurrence of categorical attributes.

    import pandas as pd
    import lux
    df = pd.read_csv("hpi.csv")
    df

[...] right, Alice sees two sets of recommendations that add an attribute (Enhance) or a filter (Filter) to her intent. By looking at the colored scatterplots in the Enhance action, she learns that most G10 industrialized countries (Figure 2 center) are in the upper left quadrant of the scatterplot (low inequality, high life expectancy). In the breakdown by Region (Figure 2 right), she finds that countries in Sub-Saharan Africa (yellow points) tend to be on the bottom right, with lower life expectancy and higher inequality.

[...] showing how stricter countries (blue) corresponded to countries with higher life expectancy and lower levels of inequality. This visualization indicates that these countries have a more well-developed public health infrastructure that promoted the early pandemic response. However, Alice observes three outliers (red arrow on Figure 4 right) that seem to defy this trend. When she filters the dataframe to learn more about these countries (Figure 4 left), she finds that these correspond to Afghanistan, Pakistan, and Rwanda, countries that were praised for their early pandemic response despite limited resources [6, 8, 23]. She clicks on the visualization in the Lux widget and then on the export button, which exports the visualization from the widget to a Vis object. Alice can access the exported Vis via the df.exported property and print it as code, following which she can tweak the plotting style before sharing Figure 4.

Figure 4: The scatterplot shows a separation between countries with high and low stringency in their COVID response. By filtering the dataframe (left), we see that Afghanistan, Pakistan, and Rwanda correspond to the three outliers (red boxed) that defy the trend.

Overall, this example demonstrates the value of always-on visualization support within a dataframe workflow: the tight integration between Lux and dataframes enabled Alice to seamlessly perform data cleaning via a familiar API and notebook environment.

In this section, we propose a novel always-on framework for visual interaction with dataframes, as outlined in Figure 5. The example workflow illustrated the many flexible ways users interact with a dataframe to achieve their analytical goal. Here, we summarize these ways and contrast them with existing visualization specification approaches in dataframe workflows.

As shown in Figure 5, in both existing dataframe workflows (a) and the always-on framework (b), users can work with the dataframe API and see the table view by default (grey). To create visualizations in an existing workflow, as shown in Figure 5a, users would typically need to explicitly write visualization specification code in a library such as matplotlib or altair (orange) to create individual visualizations (blue). In our always-on framework, as shown in Figure 5b, users further inspect a dashboard of recommended visualizations, as part of a multi-tiered framework (blue), all of which is driven by a user- or system-specified intent (orange), described below.

Intent: Users can indicate aspects of the dataframe that they are interested in via a lightweight intent specification (§5). The intent drives the visualizations, actions, and dashboard. In the example, Alice indicated that she wants to learn about AvrgLifeExpectancy and Inequality; Lux displayed visualizations related to these variables. Unlike in existing visualization libraries, intent can also be system-specified, meaning that the visual display will be always-on, even if the user does not explicitly specify intent. We now describe the different layers in our always-on framework, following the notation in Figure 5.

(A) Visualization: Visualizations are created by applying the intent to a given dataframe. Each visualization, i.e., Vis, is an intent operating on a specific dataframe instance; a collection of visualizations is known as a VisList.

(B) Actions: Each action is an ordered collection of visualizations (VisList), e.g., the Correlation action plots pairwise relationships ranked by Pearson's correlation.

(C) Dashboard: A dashboard is composed of one or more actions that may be relevant to the user.

Users can either make changes to the dataframe or the intent in order to fulfill different analytical needs. Dataframe operations are exact, leveraging the expressiveness of the dataframe API. On the other hand, the intent is a high-level specification of user interest, either explicitly provided by the user or system-inferred, steering Lux's recommendations.

By working with both the intent and dataframe API, Lux supports a flexible and intuitive experience for interacting with data. Next, we describe the intent grammar that underlies the always-on framework of Lux.
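The layering described above (intent, Vis, VisList, action, dashboard) can be pictured with a few plain data classes. This is only an illustrative sketch: the class and field names below are hypothetical stand-ins chosen for this example and are not Lux's actual classes, and the two toy rows are made up.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical stand-ins for the layers in Figure 5; not Lux's real classes.

@dataclass
class Vis:
    intent: list[str]             # clauses such as "Age" or "Department=Sales"
    data: list[dict[str, Any]]    # the dataframe instance the intent is applied to

@dataclass
class Action:
    name: str
    vis_list: list[Vis] = field(default_factory=list)  # ordered (ranked) VisList

@dataclass
class Dashboard:
    actions: list[Action] = field(default_factory=list)

    def tabs(self) -> list[str]:
        """Each action is displayed as one tab of the dashboard."""
        return [action.name for action in self.actions]

# Toy data, invented for illustration.
rows = [{"Age": 31, "Education": "BSc"}, {"Age": 45, "Education": "PhD"}]
dashboard = Dashboard([
    Action("Distribution", [Vis(["Age"], rows)]),
    Action("Occurrence", [Vis(["Education"], rows)]),
])
```

Here each Vis pairs an intent with a concrete set of rows, an action is an ordered list of such Vis objects, and the dashboard simply groups actions, mirroring layers A to C.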

Figure 5: Conceptual framework for dataframe interaction. Users can make changes to anything below the dotted line (write); elements displayed to the user are shown above the dotted line (read). (a) In existing workflows, users write visualization specification code to create one or more visualizations. (b) In Lux's always-on framework, users can optionally make changes to the intent, which steers the recommended visualizations (Visualizations, Action, Dashboard).

The intent language is a lightweight, succinct means for users to declaratively specify their high-level interests. In this section, we introduce this language and its underlying grammar, and how it differs from existing approaches.

The intent grammar describes what the user is interested in within a dataframe. The intent is composed of one or more clauses, each of which is either an axis or a filter of interest.

An axis defines one or more attribute(s), mapped appropriately to a specific encoding or channel of the corresponding visualizations.

For the axis, apart from the mandatory attribute(s), specified under ⟨attribute⟩, the remaining properties are optional and can be automatically inferred.

Filters define a subset of data that the user is interested in. To specify a filter, the attribute being filtered, the operation, and the value, are required.
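For reference, the productions described in prose above can be written out as follows. This is a reconstruction from the surrounding text rather than a verbatim copy of the paper's equations: the numbering is chosen to agree with the later references to Equations 2 and 3, square brackets mark the optional axis properties (channel, aggregation, and binning, as named later in this section), and the exact set of comparison operators is an assumption.

```latex
\begin{align}
\langle Intent \rangle &\rightarrow \langle Clause \rangle^{+} \\
\langle Clause \rangle &\rightarrow \langle Axis \rangle \mid \langle Filter \rangle \nonumber \\
\langle Axis \rangle &\rightarrow \langle attribute \rangle\,
    [\langle channel \rangle]\, [\langle aggregation \rangle]\, [\langle binning \rangle] \\
\langle Filter \rangle &\rightarrow \langle attribute \rangle\,
    (= \mid \neq \mid < \mid > \mid \leq \mid \geq)\, \langle value \rangle
\end{align}
```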

Consider the simple case when ⟨attribute⟩ refers to a single attribute and ⟨value⟩ refers to a single value in Equations 2 and 3; then, an intent with multiple clauses (axis or filter) represents a user preference to see each of the axis attributes visualized, for the subset of data corresponding to the conjunction of the filters.

In the more general case, ⟨attribute⟩ can correspond to a union of attributes, or a special wildcard value ? (with an optional constraint to define the subset of attributes), while ⟨value⟩ can refer to a union of values, or a special wildcard value ?.

The use of unions in either case (as well as ?, which is implicitly a union of all alternatives) admits a disjunction of options for the axis or filter clause. If there are $n_i \geq 1$ alternatives for the $i$-th clause, we can construct a collection of $n_1 \times n_2 \times \ldots \times n_k$ visualizations, where $k$ is the number of clauses, by taking the cross-product of alternatives per clause. Constructing a collection of visualizations via partial specification of this sort has been explored in ZQL [68] and CompassQL [79].
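The cross-product construction can be sketched in a few lines of plain Python. The function name expand and the list-based encoding of unions are assumptions made for this illustration, and the column names are borrowed from the queries later in this section; this is not how Lux itself represents or expands an intent.

```python
from itertools import product

def expand(clauses, columns):
    """Expand an intent whose clauses may be unions (lists) or the wildcard "?"."""
    options = []
    for clause in clauses:
        if clause == "?":
            options.append(list(columns))   # wildcard: every attribute is an alternative
        elif isinstance(clause, list):
            options.append(clause)          # union: explicitly listed alternatives
        else:
            options.append([clause])        # a single attribute or filter
    # One visualization specification per element of the cross-product.
    return [list(combo) for combo in product(*options)]

columns = ["Age", "HourlyRate", "DailyRate", "MonthlyRate", "EducationField"]
vis_specs = expand(
    ["EducationField", ["HourlyRate", "DailyRate", "MonthlyRate"]], columns
)
```

With one alternative for the first clause and three for the second, the collection has 1 × 3 = 3 specifications. Note that this sketch does not remove degenerate combinations such as a wildcard pair that selects the same attribute twice.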

The aforementioned grammar is decoupled from our specific implementation, which uses syntactic sugar for expressing the intent in a convenient Python-based API. Users can specify an intent indicating their analysis interests or create desired visualizations by applying the intent to a specific dataframe. We note that while the focus here is describing user-specified intent, the same intent language is used by the system for generating recommendations as will be described in Section 6.

Attaching an Intent to a Dataframe. Building on the grammar described above, within Lux, a Clause can specify one or more columns (i.e., Axis) or rows (i.e., Filter) of interest. Q1. To set Age and Education as columns of interest for a given dataframe df, one can state:

Or one can also use the equivalent shortcut:

Once the intent is set, whenever df is printed, the Lux widget will use the intent to determine what visualizations to show to the user. Here, Lux would display visualizations related to attributes Age and Education from df.

We can compose Axis and Filter together, as follows. Q2. Explore the Ages for employees in the Sales Department.

    axis = "Age"
    filter = "Department=Sales"
    df.intent = [axis, filter]

Based on the specified intent, Lux not only shows the Age distribution filtered to the Sales department, but also displays a set of related visualizations, such as visualizations involving one additional attribute or one additional filter.
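To make the shorthand concrete, the sketch below shows one way a string such as "Age" or "Department=Sales" could be turned into an axis or filter clause. The Clause class here is a hypothetical stand-in written for this example (it is not lux.Clause), and the regular expression and the set of accepted operators are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Clause:
    """Hypothetical stand-in for an intent clause; not the lux.Clause class."""
    attribute: str
    filter_op: Optional[str] = None
    value: Optional[str] = None

    @property
    def is_filter(self) -> bool:
        return self.filter_op is not None

# attribute, comparison operator, value (two-character operators tried first)
_FILTER = re.compile(r"^(\w+)\s*(<=|>=|!=|=|<|>)\s*(.+)$")

def parse_clause(text: str) -> Clause:
    """Strings containing an operator become filters; anything else is an axis."""
    match = _FILTER.match(text)
    if match:
        return Clause(match.group(1), match.group(2), match.group(3))
    return Clause(text)

intent = [parse_clause(s) for s in ["Age", "Department=Sales"]]
```

Under these assumptions, the Q2 intent parses into one axis clause on Age and one filter clause restricting Department to Sales.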

In the following, we will showcase the Lux intent syntax as part of Vis and VisList, but the syntax can also be used to simply set intent as in df.intent above.

Instead of attaching an intent to a dataframe, one can use the Vis and VisList keywords to directly generate specific visualizations. Q3. Compare average Age across different Education levels.

    axis1 = lux.Clause(attribute="Age")
    axis2 = lux.Clause(attribute="Education")
    Vis([axis1,axis2],df)

Query 3 is similar to Query 1, except that the intent is immediately applied to the dataframe df to create a visualization via Vis, rather than changing the intent associated with the dataframe (to be used when the dataframe is eventually printed). Given that the intent involves one measure (Age) and one dimension (Education), Lux will display a bar chart. By default, average is the function used for aggregation.
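The choice of chart type from the data types of the axis attributes can be illustrated with a small rule table. The rules below are limited to cases mentioned in this paper (histogram for a quantitative attribute, bar chart for a categorical attribute or for one measure plus one dimension, scatterplot for two quantitative attributes); the function name and type labels are assumptions, and Lux's actual rules may cover more cases.

```python
def infer_mark(data_types):
    """Pick a chart type from the data types of the axis attributes."""
    kinds = sorted(data_types)
    if kinds == ["quantitative"]:
        return "histogram"
    if kinds == ["nominal"]:
        return "bar"
    if kinds == ["nominal", "quantitative"]:
        return "bar"       # one dimension plus one measure, aggregated per group
    if kinds == ["quantitative", "quantitative"]:
        return "scatter"
    raise ValueError(f"no rule for {kinds}")

# Query 3: Age is a measure, Education is a dimension.
q3_mark = infer_mark(["quantitative", "nominal"])
```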

Aggregation is one of three optional properties for Axis (Equation 2); the others are channel and binning. If any of these are explicitly specified, they override Lux's defaults, as in the following query. Q4. Compare the variance of MonthlyIncome based on employee Attrition.

    import numpy
    axis1 = lux.Clause("MonthlyIncome", aggregation=numpy.var)
    axis2 = "Attrition"
    Vis([axis1,axis2],df)
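The effect of the aggregation property can be sketched without any dataframe library: group the measure by the dimension, then apply either the default (mean) or the user-supplied function. The helper name aggregate and the four toy rows are invented for this illustration. statistics.pvariance is used because, like numpy.var with its default arguments, it computes the population variance.

```python
from collections import defaultdict
from statistics import mean, pvariance

def aggregate(rows, measure, dimension, aggregation=mean):
    """Group `measure` by `dimension` and reduce each group with `aggregation`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {key: aggregation(values) for key, values in groups.items()}

# Toy rows, invented for illustration.
rows = [
    {"Attrition": "Yes", "MonthlyIncome": 2000},
    {"Attrition": "Yes", "MonthlyIncome": 4000},
    {"Attrition": "No", "MonthlyIncome": 5000},
    {"Attrition": "No", "MonthlyIncome": 5000},
]
default_bars = aggregate(rows, "MonthlyIncome", "Attrition")          # mean per group
variance_bars = aggregate(rows, "MonthlyIncome", "Attrition", aggregation=pvariance)
```

Each resulting dictionary corresponds to the bar heights of one chart: the first uses the default average, the second the explicitly requested variance.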

To generate multiple visualizations, one could either set df.intent as in Section 5.2.1, which would generate a collection of visualizations related to the intent, or specify intent as an input to a VisList, as in the following query. Q5. Show how factors related to the rate of compensation differ for employees with different EducationFields.

    rates = ["HourlyRate","DailyRate","MonthlyRate"]
    VisList(["EducationField",rates],df)

Here, there is one Vis corresponding to EducationField combined with each of HourlyRate, DailyRate, and MonthlyRate. The wildcard character ?, when used as part of an Axis, can be used to enumerate over all attributes in a dataframe; constraints may be used to restrict them to a certain type. Q6. Browse through relationships between any two quantitative columns in the dataframe.

    any = lux.Clause("?",data_type = "quantitative")
    VisList([any, any],df)

This VisList corresponds to the search space for the Correlation action; the Correlation action additionally ranks and sorts each Vis in the VisList based on their Pearson's correlation score.
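A minimal sketch of this search-and-rank step follows: enumerate pairs of quantitative columns and sort them by Pearson's correlation coefficient. Ranking by the absolute value of the coefficient and skipping self-pairs are assumptions of this sketch, the column values are made up, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def correlation_action(table):
    """Score every pair of columns and rank by |r|, strongest first."""
    scored = [(a, b, pearson(table[a], table[b])) for a, b in combinations(table, 2)]
    return sorted(scored, key=lambda item: abs(item[2]), reverse=True)

# Toy quantitative columns, invented for illustration.
table = {
    "AvrgLifeExpectancy": [80, 75, 60, 55],
    "Inequality": [10, 15, 35, 40],
    "Wellbeing": [7, 5, 6, 4],
}
ranked = correlation_action(table)
```

On this toy data the strongly negative AvrgLifeExpectancy and Inequality pair is ranked first, which is the kind of relationship the Correlation tab surfaces in the example workflow.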

Filter values can also be specified as a list or via wildcards across all possible values for a fixed filter attribute. Q7. Examine Age distributions across different Countries.

The generated VisList contains histograms of Age, one each for individuals where Country is USA, Japan, Germany, and so on.

Due to the heavy cognitive cost of writing glue code to visualize their data [21, 76], users often opt to visualize in the later stages of their workflow [22, 44]. Instead, our goal with Lux's intent language is to support the use of visualization throughout the workflow; for this, users should not have to expend too much effort in thinking about what and how to visualize. Two key characteristics of our intent language support this quick and flexible programmatic specification, described next.

Versatility: Our intent language is versatile in that it serves both as a mechanism for steering recommendations (Q1-2) and as a way of directly creating visualizations on top of dataframes (Q3-7). This is unlike existing specification approaches whose sole focus is the creation of one or more visualizations. This versatility means that whenever users specify their intent, they are not committing to a pre-defined set of operations. Instead, the system leverages explicit user input (in the form of intent), as well as implicit signals to determine what to display to users.

Consider Q2, which demonstrates the versatility of intent beyond the specification of a single visualization. Here, the user simply specifies the data-specific aspects they are interested in, i.e., the attribute Age, and the Sales Department filter; these are used as cues by Lux to generate visualizations, including those that wouldn't ordinarily be picked if we were using a conventional visualization specification framework (such as those with a different filter). This versatility makes it easy for users to communicate their analysis intent even when they do not have a specific visualization in mind.

Convenience: Our intent language only requires specification of data-oriented aspects, while existing approaches also require users to specify visual encoding-oriented aspects to generate visualizations. Our minimalistic language design is intended to alleviate the common challenge in exploratory analysis where users struggle to translate their high-level data questions to exact visualization specifications [34] . Lux supports convenient specification shorthands and defaults and automatically infers the necessary details to transform user-specified intent into complete specifications.

As shown in Q3-7, where the target is one or more specific visualizations, Lux enables users to visualize their data with only a single line of code, effectively lowering the barrier to visual exploration. In Figure 6 , we outline the code required to create a single visualization based on Query 3, and compare the key differences in the required specification across various languages, including Draco [58] , matplotlib [39] , and Vega-Lite [65] . Other languages often require users to specify the field type, channel, and marks, while Lux can reason over underspecified intent. This reduces the effort required on the part of the user. 

Figure 6: Comparison between the level of specification required from Lux versus other existing approaches for Query 3.

Lux:

    Vis(["Age", "Education"],df)

matplotlib:

    bar=df.groupby("Education").mean()["Age"]
    y_pos=range(len(bar))
    plt.barh(y_pos,bar,align='center')
    plt.yticks(y_pos,list(bar.index))
    plt.xlabel('Mean of Age')
    plt.ylabel('Education')

Vega-Lite:

    {
      "mark": "bar",
      "data": {…},
      "encoding": {
        "x": {"type": "quantitative", "field": "Age", "aggregate": "average"},
        "y": {"type": "nominal", "field": "Education"}
      }
    }

Draco:

    data("…").
    encoding(e0). channel(e0,x). type(e0,quantitative). field(e0,"Age"). aggregate(e0,mean).
    encoding(e1). channel(e1,y). type(e1,nominal). field(e1,"Education").

In the previous section, we have seen how users can either attach an intent to a dataframe, or this intent can be programmatically generated as part of Lux's recommendations. We discuss the latter in this section. In Lux, an action describes a ranked list of visualization recommendations based on a predefined search space. Lux supports four major classes of actions, as summarized in Table 1. Metadata- and intent-based ones are akin to those used in past visualization recommendation systems [38, 50, 80]; see Lee et al. [50] for details. As described in Section 2, most existing VisRec systems are situated in GUI-based charting tools; our key novelty is that Lux is one of the first visualization recommendation systems designed to fit into a programmatic dataframe workflow. Specifically, here, we introduce two novel classes of recommendations specific to dataframe-based workflows, based on dataframe structure and history.

Table 1 (excerpt): Univariate vis of quantitative attributes (histogram).

Metadata-based Recommendations. Lux maintains dataframe metadata, including attribute-level statistics such as min/max and cardinality, to determine the semantic data type of each column and to automatically populate visualization settings. For example, based on data type, Lux can generate univariate and bivariate overviews. In Figure 1, the Distribution, Occurrence, Temporal, and Geographic actions provide univariate overviews of columns, while the Correlation action provides bivariate overviews of all possible pairs of quantitative attributes, ranked based on Pearson's correlation.
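As an illustration of how attribute-level statistics could drive type inference, the sketch below classifies a column as temporal, quantitative, or nominal from its values and cardinality. The function name, the type labels, and in particular the cardinality threshold of 20 are assumptions made for this example; the paper does not specify the rules Lux uses.

```python
from datetime import date

def infer_data_type(values, max_nominal_cardinality=20):
    """Guess a semantic data type from simple column statistics."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "nominal"
    cardinality = len(set(non_null))
    if all(isinstance(v, date) for v in non_null):
        return "temporal"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        # Numeric columns with few distinct values (e.g. ratings 1-5)
        # are treated like categories in this sketch.
        if cardinality > max_nominal_cardinality:
            return "quantitative"
        return "nominal"
    return "nominal"

column_types = {
    "Age": infer_data_type(list(range(18, 65))),
    "Rating": infer_data_type([1, 2, 3, 1, 2]),
    "Country": infer_data_type(["USA", "Japan", "Germany"]),
    "Date": infer_data_type([date(2020, 1, 1), date(2020, 2, 1)]),
}
```

The inferred types could then feed the univariate and bivariate overviews described above, for example a histogram for each quantitative column.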

Metadata-based recommendations have been used extensively in past visualization recommendation systems [38, 80] .

Intent-based Recommendations. Lux displays recommendations based on the user-specified intent. On printing the dataframe, Lux displays a visualization based on the user-specified intent as in Figure 2 , as the Current Visualization. In addition, Lux provides recommendations based on valuable next analysis steps starting from that visualization. For example, the Enhance action recommends visualizations formed by adding an additional attribute to the current visualization.
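The search spaces of the Enhance and Filter actions can be sketched as simple enumerations over the current intent. The function names, the string encoding of filters, and the toy rows are assumptions of this example, and the ranking that an action applies to its candidates is omitted.

```python
def used_attributes(intent):
    """Attributes already referenced by axis ("Age") or filter ("A=b") clauses."""
    return {clause.split("=")[0] for clause in intent}

def enhance(intent, columns):
    """Candidates formed by adding one attribute not already in the intent."""
    used = used_attributes(intent)
    return [intent + [col] for col in columns if col not in used]

def add_filter(intent, rows, filter_columns):
    """Candidates formed by adding one equality filter on an unused attribute."""
    used = used_attributes(intent)
    candidates = []
    for col in filter_columns:
        if col in used:
            continue
        for value in sorted({row[col] for row in rows}):
            candidates.append(intent + [f"{col}={value}"])
    return candidates

# Toy rows, invented for illustration.
rows = [
    {"AvrgLifeExpectancy": 80, "Inequality": 10, "Region": "Europe"},
    {"AvrgLifeExpectancy": 55, "Inequality": 40, "Region": "Sub-Saharan Africa"},
]
intent = ["AvrgLifeExpectancy", "Inequality"]
enhanced = enhance(intent, ["AvrgLifeExpectancy", "Inequality", "Region"])
filtered = add_filter(intent, rows, ["Region"])
```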

Structure-based recommendations. Data scientists often reshape their dataframes in ways that are more amenable to downstream analysis, modeling, or presentation. One of our key insights is that the dataframe "structure" reveals strong signals for what the users subsequently choose to visualize, thus providing implicit information on what recommendations to display automatically by Lux.

Index-based visualizations: Dataframe indexes provide a natural way to order and label dataframe rows and columns. Indexes are typically created as a result of grouping and aggregation through operations such as groupby, pivot, and crosstab. For any pre-aggregated dataframe (i.e., a dataframe resulting from an aggregation operation), Lux creates visualizations by grouping the values row- or column-wise. For example, Figure 7 displays the result of a pivot operation, where each row is visualized as a time series line chart. Lux currently only supports single-level indexes; visualization of multi-level indexes is a potential direction for future work.

Series visualizations: Series are dataframes with a single column. Lux leverages the same dataframe visualization mechanism for Series, displaying univariate, metadata-based visualizations, such as a bar chart for categorical Series and a histogram for quantitative Series. By visualizing dataframe structure, Lux provides a natural and intuitive representation of dataframes and their derivative products. These visual representations can be extended to other dataframe-derived structures (e.g., GroupBy, Offset, or Interval) to help novices learn, debug, and validate complex dataframe operations.

History-based recommendations. Apart from dataframe structure, another source of implicit information from the dataframe is the historical set of operations performed by users. For example, if the user cleans up a particular column and renames it, it is likely that they would want to visualize the same column soon thereafter. Lux displays history-based recommendations based on whether the dataframe has been filtered or aggregated in its recent history. For example, when a filtering-based operation leads to a small dataframe (such as when a head or tail is performed), Lux visualizes the previous unfiltered dataframe, since there are too few tuples for generating recommendations in the filtered dataframe. Lux also uses history to determine if an aggregation has been performed, helping identify the structure-based recommendations described earlier.

To collect this history, since Lux acts as a wrapper around pandas (described in the next section), we instrument each dataframe function, track each call with minimal overhead, and store it as part of the dataframe, instead of requiring program analysis, which is prone to false positives [84]. Given that new dataframes or intermediate objects (e.g., GroupBy, Series) are often created when the user performs an operation, Lux propagates the history over to derived objects so that the history is not lost. A key challenge in leveraging dataframe history to infer better recommendations is surfacing the inferred implicit intent in a way that is interpretable and explains the resulting recommendation choices.
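The history mechanism described above can be sketched as follows. This is a toy stand-in rather than Lux's wrapper: the TrackedTable class, the table_to_visualize helper, and the min_rows threshold are all hypothetical names and values chosen for illustration. Each derived table carries the history of its parent plus the operation that produced it, and a filter-like operation that leaves too few rows causes a fallback to the parent table.

```python
class TrackedTable:
    """Toy table (a list of row dicts) that records the operations applied to it."""

    def __init__(self, rows, history=(), parent=None):
        self.rows = list(rows)
        self.history = list(history)  # names of operations, oldest first
        self.parent = parent          # table this one was derived from

    def _derive(self, rows, op_name):
        # Propagate the history to the derived object so that it is not lost.
        return TrackedTable(rows, self.history + [op_name], parent=self)

    def head(self, n=5):
        return self._derive(self.rows[:n], "head")

    def filter(self, predicate):
        return self._derive([r for r in self.rows if predicate(r)], "filter")


def table_to_visualize(table, min_rows=10):
    """Fall back to the parent when a filter-like operation left too few rows.

    The min_rows threshold is an assumption made for this sketch.
    """
    filtering_ops = {"head", "tail", "filter"}
    too_small = len(table.rows) < min_rows
    if table.history and table.history[-1] in filtering_ops and too_small:
        if table.parent is not None:
            return table.parent
    return table


base = TrackedTable([{"x": i} for i in range(100)])
small = base.filter(lambda row: row["x"] % 2 == 0).head(3)
target = table_to_visualize(small)
```

In this example, `small` has only three rows after `head(3)`, so the sketch selects the 50-row filtered table it was derived from.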

Lux employs a client-server model, leveraging computational notebooks as a frontend client. Lux currently supports Jupyter Notebooks, Jupyter Lab, Jupyter Hub, Microsoft Visual Studio Code, and Google Colab. The ipywidgets library is used for rendering an interactive HTML widget as the cell output. Once users import Lux, they can interact with a LuxDataFrame instead of a regular pandas dataframe. LuxDataFrame acts as a wrapper around pandas, and supports all existing pandas operations, while storing additional information, such as the intent, metadata, structure, and history, for generating visual recommendations. As shown in Figure 8 , the server side logic is largely separated into two distinct layers: 1) the intent processing layer is responsible for processing intent into executable instructions, and 2) the recommendation layer is responsible for generating the displayed visualizations. To generate the visualization recommendations, as well as compute metadata that is used in various stages, the execution engine performs the required data processing and optimization, either as a series of dataframe operations in pandas or equivalently in SQL queries in relational databases ( §8). Finally, the system design is modular and extensible so that alternatives can be swapped in at different layers, e.g., Altair and Matplotlib visualization rendering libraries. 

Here, we discuss how Lux processes user intent to automatically infer missing details and determine appropriate visualization mappings. The intent processing layer parses, validates, and compiles the user's underspecified intent into complete specifications.

7.1.1 Parser and Validator. In Section 5, we saw how Axis and Filter can be used to compose Clauses; the parser parses the user-inputted strings into an internal Clause representation. Subsequently, the validator checks for any inconsistencies between user-specified Clauses and the dataframe content. To do so, it leverages the dataframe's pre-computed metadata to verify the input intent. If the user's input does not align with the data present in the dataframe, the validator provides early warnings and suggests corrections to the input intent.
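A minimal sketch of such a validation step is shown below. It assumes a hypothetical metadata layout (column name mapped to its set of unique values) and uses the standard-library `difflib.get_close_matches` to propose corrections; the text above does not specify how Lux generates its suggestions, so the matching strategy is an assumption.

```python
from difflib import get_close_matches


def validate_clause(attribute, value, metadata):
    """Return warnings for one (attribute, value) clause.

    metadata maps column name -> set of unique values (hypothetical layout).
    value is None for an axis clause without a filter value.
    """
    warnings = []
    if attribute not in metadata:
        message = f"'{attribute}' is not a column in the dataframe"
        hint = get_close_matches(attribute, list(metadata), n=1)
        if hint:
            message += f"; did you mean '{hint[0]}'?"
        warnings.append(message)
        return warnings
    if value is not None and value not in metadata[attribute]:
        message = f"'{value}' does not occur in column '{attribute}'"
        candidates = [str(v) for v in metadata[attribute]]
        hint = get_close_matches(str(value), candidates, n=1)
        if hint:
            message += f"; did you mean '{hint[0]}'?"
        warnings.append(message)
    return warnings


metadata = {"Region": {"Asia", "Europe"}, "Price": {10, 20, 30}}
bad_column = validate_clause("Regoin", None, metadata)
bad_value = validate_clause("Region", "Asai", metadata)
valid = validate_clause("Region", "Asia", metadata)
```

A misspelled column or filter value produces a warning with the closest known name, while a valid clause produces no warnings.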

7.1.2 Compiler. During intent specification, users have the ability to omit certain optional details, making them partial specifications.

Users also implicitly construct a collection of visualizations by using a union or wildcard character for Axis or Filter. Post validation, the compiler expands the Clauses into multiple visualizations and adds in defaults for the omitted details, making the Clauses complete. This transformation is performed in three steps.

1) Expand: The compiler first generates Vis objects as a cross-product of the specified Clauses, leading to a VisList containing the resulting visualization specifications.

2) Lookup: For each Vis in the VisList, Lux populates the omitted details using the dataframe's pre-computed metadata. The compiler also removes any invalid visualizations generated that are either not supported in Lux or use ineffective encodings.

3) Infer: Finally, Lux infers the visualization encodings, including the marks, channels, and transforms (sort, aggregation, binning) required for generating the visualizations. The compiler implements rule-based heuristics drawn from best practices in design [32, 56] . After intent processing, Lux can now use the complete intent specification to either generate a Vis directly or generate a set of appropriate recommendations (described next).
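The three compilation steps can be sketched as below. The clause encoding (a string for a single column, a list for a union, and "?" for a wildcard), the metadata dictionary, and the mark-selection rules are simplified assumptions for illustration and do not reproduce Lux's actual rules.

```python
from itertools import product

# Hypothetical pre-computed metadata: column name -> semantic data type.
METADATA = {
    "Price": "quantitative",
    "Rating": "quantitative",
    "Region": "nominal",
    "Date": "temporal",
}


def expand(clauses, metadata):
    """Step 1: resolve unions and wildcards, then take the cross-product."""
    options = []
    for clause in clauses:
        if clause == "?":               # wildcard: any column
            options.append(list(metadata))
        elif isinstance(clause, list):  # union of columns
            options.append(clause)
        else:                           # single column
            options.append([clause])
    return [list(combo) for combo in product(*options)]


def lookup(vis_list, metadata):
    """Step 2: attach data types and drop invalid combinations."""
    completed = []
    for attributes in vis_list:
        if len(set(attributes)) < len(attributes):
            continue  # the same column on both axes is not a useful encoding
        completed.append([(name, metadata[name]) for name in attributes])
    return completed


def infer(vis):
    """Step 3: choose a mark from the data types (illustrative rules only)."""
    types = sorted(dtype for _, dtype in vis)
    if types == ["quantitative"]:
        return "histogram"
    if types == ["nominal"]:
        return "bar"
    if types == ["quantitative", "quantitative"]:
        return "scatter"
    if types == ["nominal", "quantitative"]:
        return "bar"
    if types == ["quantitative", "temporal"]:
        return "line"
    return "unsupported"


vis_list = lookup(expand(["Price", ["Region", "Rating", "Date"]], METADATA), METADATA)
marks = [infer(vis) for vis in vis_list]
```

With these assumptions, pairing Price with the union of Region, Rating, and Date yields a bar chart, a scatterplot, and a line chart, respectively.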

As described in the framework in Figure 5, actions organize collections of views into recommendations displayed to the users. The action registry in Lux keeps track of the list of possible actions that could be applicable for generating recommendations at any point in the analysis. On initialization, Lux registers a set of default actions (described in Section 6) applied to all dataframes. Users can also register their own custom actions programmatically by writing a Python-based UDF. The UDF generates a VisList of possible visualizations and optionally scores and ranks each Vis. The custom action is "triggered" whenever the dataframe satisfies the user-specified condition on when the action is applicable; Lux then recommends visualizations based on the action.
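The registry-and-trigger pattern described above can be sketched as follows. The registry, the function names, and the "WideRange" action are hypothetical and are not Lux's API; the table is a plain dictionary of column lists to keep the example self-contained.

```python
ACTIONS = {}  # hypothetical registry: action name -> (udf, condition)


def register_action(name, udf, condition=lambda table: True):
    """Register a UDF together with the condition under which it applies."""
    ACTIONS[name] = (udf, condition)


def recommend(table):
    """Run every applicable action; return action name -> ranked (vis, score) list."""
    results = {}
    for name, (udf, condition) in ACTIONS.items():
        if condition(table):  # the action is "triggered" only when applicable
            scored = udf(table)
            results[name] = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return results


def is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)


def wide_range_udf(table):
    """Score each numeric column by its value range (an illustrative score)."""
    return [
        (name, max(values) - min(values))
        for name, values in table.items()
        if is_numeric(values)
    ]


def has_numeric_column(table):
    return any(is_numeric(values) for values in table.values())


register_action("WideRange", wide_range_udf, has_numeric_column)

recs = recommend({"a": [1, 5], "b": [0, 10], "c": ["x", "y"]})
no_recs = recommend({"c": ["x", "y"]})
```

The custom action fires only for tables with at least one numeric column and ranks the column with the widest range first.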

We now describe Lux's execution engine. We first describe the two major tasks performed by this execution engine. Then, we describe three optimizations aimed at speeding up these tasks.

We now discuss how Lux computes metadata and visualizations.

Metadata Computation: The metadata computed includes attribute-level statistics and data types. The statistics include the list of unique values, cardinality, and min/max of the attribute. The unique values are used to determine the candidates generated by a wildcard for a filter on the column, to validate filter input for the column, and to compute the cardinality. The cardinality information is used to determine the data type, while min/max is used for determining the limits on the visualization axes. Next, the execution engine infers the semantic data type based on the internal data type and cardinality information. Lux supports nominal, quantitative, geographic, and temporal data types. If the data type is misclassified, users can override the automatically-inferred data type.
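A minimal sketch of this metadata pass is given below, assuming non-empty columns stored as plain Python lists. The cardinality threshold of 20 and the rule that low-cardinality numeric columns are treated as nominal are assumptions for illustration; geographic type detection is omitted.

```python
from datetime import date


def compute_metadata(columns, nominal_max_cardinality=20):
    """Compute per-column statistics and infer a semantic data type.

    The threshold nominal_max_cardinality is an assumed value for this sketch.
    """
    metadata = {}
    for name, values in columns.items():
        unique = list(dict.fromkeys(values))  # unique values in first-seen order
        cardinality = len(unique)
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if all(isinstance(v, date) for v in values):
            data_type = "temporal"
        elif is_numeric and cardinality > nominal_max_cardinality:
            data_type = "quantitative"
        else:
            data_type = "nominal"  # strings and low-cardinality numbers
        metadata[name] = {
            "unique": unique,
            "cardinality": cardinality,
            "min": min(values),  # used for axis limits
            "max": max(values),
            "data_type": data_type,
        }
    return metadata


columns = {
    "price": list(range(100)),
    "stars": [1, 2, 3, 4, 5] * 20,
    "city": ["NY", "SF"] * 50,
    "day": [date(2021, 1, d) for d in range(1, 11)] * 10,
}
meta = compute_metadata(columns)
```

Under these assumptions, a five-valued integer rating column is classified as nominal, while a 100-valued integer column is classified as quantitative.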

Visualization Processing: After the user-specified or system-specified intent has been transformed into one or more Vis objects with a complete specification, the execution engine translates each Vis to queries responsible for processing the data required for the visualizations. First, the engine applies any filters and retrieves relevant attributes. Next, the execution engine performs different visualization-specific operations depending on the mark type. For example, to process the data for a histogram, the engine bins an attribute into fixed-sized bins and performs a count aggregation for each bin. Table 2 summarizes the relational operations that correspond to processing different visualization types.
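The histogram case can be sketched as a filter, a projection, fixed-width binning, and a count per bin. The row layout (a list of dictionaries) and the function names are assumptions for illustration.

```python
def histogram(values, num_bins=10):
    """Bin numeric values into fixed-width bins and count the values per bin."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins or 1.0  # avoid zero width for constant data
    counts = [0] * num_bins
    for value in values:
        # The maximum value is clamped into the last bin.
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(num_bins + 1)]
    return edges, counts


def process_histogram(rows, attribute, predicate=lambda row: True, num_bins=10):
    """Apply the filter, retrieve the attribute, then bin and count."""
    values = [row[attribute] for row in rows if predicate(row)]
    return histogram(values, num_bins)


rows = [{"price": p, "region": "A" if p % 2 == 0 else "B"} for p in range(10)]
edges, counts = process_histogram(
    rows, "price", predicate=lambda row: row["region"] == "A", num_bins=4
)
```

For the filtered values 0, 2, 4, 6, and 8 with four bins, the bin width is 2 and the last bin receives both 6 and 8.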

Next, we describe several optimizations aimed at minimizing the overhead incurred by Lux. Computing metadata and processing data for visualizations can be time consuming, even for a moderately-sized dataframe. Therefore, we adapt optimizations from approximate query processing [27, 33] , early pruning [45, 54, 75] , caching and reuse [35, 71] , and asynchronous computation [24, 83] , to improve the interactivity of Lux.

Intelligent workflow-based optimizations (wflow): During an analysis session, users constantly modify and operate on dataframes, which means that the metadata and associated recommendations can change throughout a session, especially during reshaping and type-modifying operations. Figure 9 shows an example of one such workflow. Thus, unlike conventional visual analytics, where metadata can be computed upfront and stays fixed throughout, here metadata needs to be constantly updated to ensure that recommendations are generated correctly. As a result, keeping the metadata "fresh" after each dataframe operation can be computationally expensive. We propose two techniques to reduce this overhead: 1) lazily compute the metadata and recommendations only when users explicitly print dataframes; 2) cache and reuse results later on in the session. Since users often intersperse dataframe printing with several dataframe operations, it is likely that the computed metadata and recommendations would be outdated before users see the results. As a result, we can delay computation and compute the metadata and recommendations only after the user has explicitly requested to print a dataframe. Each LuxDataFrame keeps track of how fresh the metadata and recommendations are and expires them when an operation makes a change to the dataframe. In particular, we leverage pandas's internal functions that are triggered when:
• the dataframe is modified in place instead of returning a new dataframe, e.g., df.dropna(inplace=True)
• columns in the dataframe are updated, either through the bracket or dot notation, e.g., df.Frac or df["val2"]=df["val"]/5
• the row or column labels are changed, e.g., df.rename(columns={"val":"value"})
Additionally, recommendations are expired when the intent is modified. On printing the dataframe, Lux recomputes metadata and generates the recommendations accordingly. This lazy strategy ensures no overhead on any non-print operations.
Future work on more intelligent, fine-grained maintenance and expiration strategies can improve system performance (e.g., only refresh metadata and recommendation relevant to a specific column instead of entire dataframe for a single column update).

Lux further memoizes the metadata and recommendations so that any subsequent prints to an unmodified dataframe do not require recomputation. Users frequently perform "non-committal" operations that do not make changes to the dataframe to be used in subsequent analyses, involving printing dataframes as intermediate results to facilitate quick experimentation and debugging. As shown in cells labeled [3] [4] [5] in Figure 9 , users may print a column, perform grouping and aggregation, or print descriptive summaries, all without modifying the dataframe. In this case, when the user revisits the original dataframe, the memoized recommendations are immediately accessible to them.
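The combination of lazy computation, expiration, and memoization can be sketched with a toy wrapper. The LazyFrame class and its methods are hypothetical stand-ins (not LuxDataFrame), and the recommendation step is a placeholder; the recompute counter exists only to make the behavior observable.

```python
class LazyFrame:
    """Toy dataframe wrapper that recomputes recommendations only when needed."""

    def __init__(self, columns):
        self._columns = {name: list(values) for name, values in columns.items()}
        self._recommendations = None  # memoized result; None means expired
        self.recompute_count = 0      # for illustration only

    def __setitem__(self, name, values):
        self._columns[name] = list(values)
        self._recommendations = None  # expire: the data changed

    def __getitem__(self, name):
        return self._columns[name]    # read-only access keeps the cache valid

    def _compute_recommendations(self):
        self.recompute_count += 1
        return [f"histogram({name})" for name in self._columns]  # placeholder

    def show(self):
        """Stand-in for printing the dataframe: compute lazily, then reuse."""
        if self._recommendations is None:
            self._recommendations = self._compute_recommendations()
        return self._recommendations


frame = LazyFrame({"val": [5, 10]})
frame["val2"] = [v / 5 for v in frame["val"]]  # no computation happens here
count_before_print = frame.recompute_count
frame.show()
frame.show()                                   # served from the memoized result
count_after_two_prints = frame.recompute_count
frame["val3"] = [0, 1]                         # expires the cache
frame.show()
count_after_update = frame.recompute_count
```

Column updates cost nothing until the next print, and repeated prints of an unmodified frame reuse the memoized recommendations.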

Note that while lazy computation and caching and reuse are well-studied in the database literature [35, 71, 82], identifying that lazy computation may be beneficial for non-committal dataframe operations is a novel insight. Similarly, recognizing that caching and reuse could be valuable when users repeatedly print the same dataframe without modifying it is our novel contribution. Combined, these observations around common dataframe usage patterns inspired our novel approach for determining when and how to expire metadata and recommendations in a dataframe workflow.

Approximate, early pruning of search space (prune): As described in Section 4, each visualization in an action is ranked based on a scoring function, computed based on the data associated with each visualization. Inspired by existing work in early pruning [45, 54, 75], Lux estimates the visualization score to speed up the retrieval of top-k visualizations for each action. We employ approximate query processing to reduce the cost by estimating the scores using sampled data. Specifically, Lux first performs a preliminary pass over the VisList to approximate the score of each visualization and then proceeds to recompute the top-k selected visualizations in a second pass to process each of the displayed visualizations exactly. Currently, Lux leverages a cached sample of the dataframe to approximate visualization scores (e.g., for a dataframe with 1M rows, approximating the correlation score by using only 30k rows), although other approximate query processing methods could be applied. Given that the prune optimization performs two passes over the VisList (first pass for pruning, followed by an exact recomputation for the top-k), the additional recomputation cost incurred can be higher than doing a single pass over the VisList. For example, dataframes that are wide or contain high-cardinality attributes can often result in actions involving large visualization search spaces. Therefore, this optimization should only be applied when the approximate savings are larger than the recomputation cost of the top-k visualizations: $N \times C_{exact} \gg N \times C_{approx} + k \times C_{exact}$, where $N$ represents the number of candidate visualizations, and $C_{exact}$ and $C_{approx}$ are the cost of computing the exact and approximate scores, respectively. Intuitively, in the ideal case where $C_{approx}$ is close to zero, $N$ needs to be at least greater than $k$ as a minimum requirement for the prune optimization to provide meaningful savings.
The cost of scoring a visualization, $C_{exact}$ and $C_{approx}$, is determined by the relational operations for extracting the required visualization data (as shown in Table 2).
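The two-pass strategy and its cost condition can be sketched as follows, writing N for the number of candidate visualizations, k for the number displayed, and C_exact and C_approx for the per-visualization cost of exact and approximate scoring. The `margin` factor standing in for "much greater than", the function names, and the mean-based score are assumptions for illustration.

```python
import random


def should_prune(n, k, c_exact, c_approx, margin=2.0):
    """Check N * C_exact >> N * C_approx + k * C_exact.

    margin is an assumed stand-in for ">>" in this sketch.
    """
    return n * c_exact > margin * (n * c_approx + k * c_exact)


def top_k_two_pass(candidates, score, rows, k, sample_size, seed=0):
    """Pass 1: rank all candidates on a sample. Pass 2: rescore the top k exactly."""
    if len(rows) > sample_size:
        sample = random.Random(seed).sample(rows, sample_size)
    else:
        sample = rows
    approx_top = sorted(candidates, key=lambda c: score(c, sample), reverse=True)[:k]
    return sorted(approx_top, key=lambda c: score(c, rows), reverse=True)


def mean_score(column, rows):
    """Illustrative scoring function: the mean of one column."""
    return sum(row[column] for row in rows) / len(rows)


# Hypothetical data with constant columns, so any sample gives the same ranking.
rows = [{"a": 1, "b": 5, "c": 3} for _ in range(1000)]
top = top_k_two_pass(["a", "b", "c"], mean_score, rows, k=2, sample_size=100)
```

With 100 candidates, k = 15, and an approximate cost of 3% of the exact cost, the condition holds under the assumed margin; with only 20 candidates it does not, because the exact recomputation of the top 15 dominates.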

Here, while the use of approximate samples to rank and identify top-k visualizations is not new [54, 75] , our use of approximation in conjunction with a cost model to determine its potential interestingness is a novel application of the technique.

Cost-based scheduling of actions (async): We find that users generally spend an average of 28 seconds skimming through the pandas table view before toggling to the Lux view. Leveraging past work on asynchronous query execution [24, 83], recommendation results can be streamed into the frontend widget as the computation for each action completes to ensure interactive responses, without having to wait for all of the actions to finish rendering. After compiling the visualizations for each action, we estimate the cost of the action as the sum of the visualization costs in the VisList, using the cost model described in our technical report [51]. This estimate is then used for scheduling the cheapest action to compute first, followed by computing the remaining actions in the background. In datasets where a few "laggard" actions dominate the overall recommendation generation (e.g., Correlation for a wide and highly quantitative dataset), the async optimization provides users with early results and returns interactive control back to the user, instead of incurring a high wait time during their analysis session.
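The scheduling idea can be sketched with the standard-library thread pool. The action names, cost values, and the stream_actions function are hypothetical; the actual cost model is the one in the cited technical report and is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def estimate_cost(vis_costs):
    """Cost of an action: the sum of its per-visualization costs."""
    return sum(vis_costs)


def stream_actions(actions):
    """Yield (name, result) pairs, delivering the cheapest action first.

    actions maps name -> (estimated_cost, compute_fn). The remaining actions
    run in background threads and are yielded as they complete.
    """
    ordered = sorted(actions, key=lambda name: actions[name][0])
    if not ordered:
        return
    first, rest = ordered[0], ordered[1:]
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(actions[name][1]): name for name in rest}
        yield first, actions[first][1]()  # cheapest action, computed right away
        for future in as_completed(futures):
            yield futures[future], future.result()


# Hypothetical actions with made-up per-visualization costs.
actions = {
    "Correlation": (estimate_cost([5, 5, 5]), lambda: "correlation charts"),
    "Distribution": (estimate_cost([1, 1]), lambda: "distribution charts"),
    "Occurrence": (estimate_cost([3]), lambda: "occurrence charts"),
}
results = list(stream_actions(actions))
```

The consumer receives the cheapest action's result immediately, while the order of the remaining results depends on when each background computation finishes.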

The idea of exploiting asynchronous execution during user wait time has been well-established [24, 83], but our work is the first to apply this technique in a visualization recommendation context, by leveraging cost estimates to prioritize cheaper-to-compute visualizations. Our cost model across different visualization types is an independent, valuable contribution.

We evaluate Lux to measure its performance on large real-world datasets and notebook sessions, along the following dimensions:
• RQ1: What is the overall performance of Lux? Can Lux achieve interactive latency during a typical dataframe workflow?
• RQ2: What is the effect of the number of columns on Lux's performance?
• RQ3: How does the approximation-based prune condition affect the quality of the recommendations relative to no approximation?
We focus on evaluating the interactive latency in this section; we describe the usability evaluation in the following section. Source code for the experiments and analysis is available online.

Data: We use two real-world datasets to evaluate the performance of Lux. The Airbnb dataset [29] contains 12 columns while the Communities [46] dataset contains 128 columns. For both datasets, we duplicated the dataset multiple times (up to 10M rows for Airbnb and up to 100k rows for Communities) to investigate the effects of scaling with the number of rows. After duplication, Airbnb exemplifies datasets with a moderate number of columns and a large number of rows, while Communities exemplifies those with a large number of columns. The upper limits on the two datasets cover around 98% of the datasets in the UCI repository [19] .

Setup: All of our experiments were conducted on a MacBook Pro with 32GB of RAM and an Intel Core i9 processor running macOS 10.15.6. The experiments were run using Python 3.7.7, pandas 1.2.1, and a version of lux-api 0.2.3 adapted for the purpose of the experiments. We used papermill [12] to programmatically execute each notebook cell. We set k = 15 for top-k and apply prune for any action where the number of visualizations exceeds k. For the sampling policy, we used cached random samples capped at 30k rows for approximating the visualization interestingness of dataframes over 30k rows (the choice of this parameter is justified in Section 9.4). For the runtimes reported, we exclude the frontend drawing time for each visualization given that it is constant and highly dependent on the chosen visualization library and frontend.

Conditions: Our experiment measures the time it takes to execute every cell in the notebook across five different conditions:
• no-opt: Baseline condition with no optimization applied, representing a naive implementation of Lux where the results are explicitly computed at the end of every cell involving a reference to the dataframe.
• wflow: Condition with the wflow optimization applied.
• wflow+prune: Both wflow and prune applied.
• all-opt: All of wflow, prune, and async applied, representing the best achievable performance.
• pandas: Condition with only pandas and without using Lux, representing the raw performance of dataframe workflows without the benefits of always-on visualizations.

To evaluate the overall performance of Lux with a dataframe-based workflow, we measured the runtime for executing an example notebook involving pandas.

Workload: The workload is based on publicly available notebooks on Kaggle for Airbnb and Communities. These notebooks follow a typical exploratory analysis of a dataframe that includes loading, transformation, cleaning, computing statistics, and machine learning. We modified these notebooks to print out dataframes and series at various points in the notebook akin to what a user would typically do for validating the results of operations. In addition, we label each cell in the notebook as either a print of a dataframe, print of a series, or neither (i.e., any non-Lux Python command) to separately measure the runtime for different cell types. Table 3 shows the breakdown of the two notebook workloads by different cell types. We define overhead as the difference in runtime between the all-opt and pandas conditions, i.e., the additional time required to support always-on visualizations via Lux.

Overall runtime: To understand the overall performance of Lux on dataframes with varying sizes, we varied the dataframe size from 10k to 10M rows. Figure 10 displays the overall runtime averaged over all cells in the notebook. We find that the best achievable performance with Lux led to a significant speedup, with up to 11X improvement in overall runtime for the Airbnb dataset (and up to 345X for Communities) compared to the no-optimization baseline.

Printing dataframes and series: We measure the performance of each cell that prints a dataframe or series to understand the overheads associated with Lux. Figure 11 shows the average time it takes for printing a dataframe for Airbnb and Communities. In particular, the overhead of Lux for each print can be determined by comparing against the cost for a print in pandas. When the dataframe contains fewer than 1M rows for Airbnb, each print incurs no more than 2 seconds in addition to pandas (in the 10M case, each print incurred an overhead of 21 seconds). For Communities, the overhead was no more than 1.5 seconds.
As shown in the sparkline visualization in Table 3 row 2, the performance for printing series follows the same pattern as that of the dataframe. However, since series only involves a single column, it effectively avoids the costly procedure of traversing through a large search space. The overhead on top of pandas is no more than 1 second for each series print even on the largest datasets.

Non-Lux operations: Across all conditions except the baseline, the runtime for non-Lux operations (Table 3 row 3) is the same, demonstrating how Lux incurs zero overhead on any Python operations in a notebook session. When compared against the baseline, Lux is over 100X faster for 10M Airbnb and over 650X faster for 100k Communities. The performance improvement for non-Lux operations demonstrates how wflow's lazy evaluation strategy avoids unnecessary computation.

We investigate how the performance of Lux varies depending on the number of columns in the dataframe. To understand the effect of the width of a dataframe, we measure the processing time for a single dataframe print (after the metadata has already been precomputed). Given the dependence of actions on data types, we leverage a synthetic dataset generated using the faker [16] library to vary the number of columns in the dataframe, while fixing the proportion of data types. The simulated dataframe contains 100k rows with 78% quantitative columns, 20% nominal columns, and 2% temporal columns. Across the quantitative columns, half of the columns are integers, while the other half are floats. For the nominal columns, we generate columns of strings with varying cardinalities chosen based on a geometric series between 1 and 10000. Figure 12 left shows the runtime for different dataframe widths. We note that the blue no-opt curve scales superlinearly with the number of columns (fitted power-law exponent of 2.53). By applying the prune and async optimizations (red), Lux effectively lowers the cost of printing a dataframe by bringing the runtime closer to linear (power=1.07).
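The "power" values reported above can be read as exponents p in a fit of the form runtime ≈ c · width^p. A minimal sketch of such a fit, using least squares on log-transformed values, is shown below; the fitting procedure and the synthetic measurements are assumptions for illustration and are not the measurements or method behind Figure 12.

```python
from math import log


def fit_power(widths, runtimes):
    """Estimate p in runtime ~ c * width**p via least squares in log-log space."""
    xs = [log(w) for w in widths]
    ys = [log(t) for t in runtimes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Synthetic, made-up runtimes that grow quadratically and linearly with width.
widths = [10, 20, 40, 80]
quadratic = [0.001 * w ** 2 for w in widths]
linear = [0.05 * w for w in widths]
p_quadratic = fit_power(widths, quadratic)
p_linear = fit_power(widths, linear)
```

An exponent near 1 indicates near-linear scaling, while an exponent above 2 indicates that doubling the width more than quadruples the runtime.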

To understand how the approximation-based prune condition affects the recommended results, we experimented with different fractional sizes of the dataframe to be used in the sample and its effect on the recommendation ranking. We compared the list of recommendations generated with and without the optimization applied. We computed Recall@15 of the top k results against the ground truth rankings. We chose recall, instead of other rank position-dependent measures, because the top-k visualizations are computed exactly and re-ranked after selection, so the metric only needs to capture how accurately the top-k visualizations are retrieved.
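For concreteness, the recall measure can be sketched as the fraction of the ground-truth top-k visualizations that also appear in the top k of the approximate ranking, ignoring their order within the top k. The function name and the example rankings are hypothetical.

```python
def recall_at_k(approx_ranking, exact_ranking, k=15):
    """Fraction of the exact top-k items that the approximate top-k retrieves."""
    truth = set(exact_ranking[:k])
    if not truth:
        return 0.0  # assumption: recall of an empty ground truth is defined as 0
    retrieved = set(approx_ranking[:k])
    return len(truth & retrieved) / len(truth)


exact = ["a", "b", "c", "d", "e", "f"]
same_set_different_order = ["b", "a", "c", "f", "d", "e"]
one_miss = ["a", "b", "f", "d", "c", "e"]

full_recall = recall_at_k(same_set_different_order, exact, k=3)
partial_recall = recall_at_k(one_miss, exact, k=3)
```

Because the selected visualizations are rescored exactly afterwards, a reordering within the top k does not lower the measure; only a missing item does.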

The recall curves in Figure 12 right show that for most actions a 10% sample (5k rows) is required to achieve over 90% recall. For the 100k Airbnb dataset, the sample requirement is around 20-40% (i.e., 20-40k rows). As a result, we chose the sampling cap in our experiment to be 30k rows to reach an average of 90.5% on the Airbnb dataset and near perfect (≥ 95%) on Communities. Compared to other actions, since Filter (light green in Figure 12 right) enumerates over data subsets, it requires more samples to ensure enough data points per stratum to achieve the same accuracy.

Lux was first released in October 2020 and has gained substantial traction in the open-source community since then. In this section, we report on a field study with existing users of Lux and lessons learned from developing Lux.

We performed a study to understand participants' initial impressions of Lux and whether they are able to use Lux effectively in a controlled setting. This study was performed remotely from October to November 2020 using lux-api 0.2.0. This study was part of a 90-minute interactive session where participants were first introduced to the basics of Lux and guided through a set of hands-on exercises on how to use Lux. The study was conducted with two focus groups: the first was a bootcamp for industry data practitioners (N=20) and the second was an online lecture for students in a graduate-level data visualization course (N=15). Both groups engaged in the same set of instructions and tasks. The instructions and tasks were made available to participants via a web link to a live Jupyter notebook. Participants were led through three notebooks in sequence. Each notebook contained examples and exercises covering the key concepts in Lux using three datasets (College [5] , Happy Planet Index [3] , and Olympics [2] ). Interactions on the Lux widget and actions performed on the notebook were logged via a custom extension [53] . The session concluded with a short survey documenting participants' experience. Due to the remote and unsupervised study setting, not all participants submitted survey responses or performed notebook operations that were logged.

Study Findings. We collected 16 survey responses (6 from bootcamp, 10 from lecture). The results were thematically coded and classified by one of the authors. In response to background questions regarding the existing exploration workflows of the participants, their concerns echoed the pain points that Lux aims to address, including difficulty in determining the "right" visualization to plot (5/16), modifying and iterating on visualizations (4/16), and determining where to begin an analysis (4/16). When asked to comment on aspects of Lux that they liked, 9/16 participants cited the ability to print and visualize dataframes as the most useful. Participants also noted how the integration of Lux with their data science workflow was seamless and intuitive. When asked to comment on aspects of Lux that they found challenging, 8/16 participants described unfamiliarity and the learning curve associated with the intent syntax. When asked about what they would like to see most in future versions, participants were most interested in improving Lux's latency on large datasets (12/16), followed by support for a wider and more useful set of recommendations (8/16) and making the intent language more customizable (7/16). At the end of the survey, 13/16 participants signed up for follow-ups and expressed interest in continuing to use Lux.

To evaluate whether participants were able to accomplish controlled tasks with Lux, we collected 23 unique logs of the participants' interaction with the notebooks. We qualitatively graded how well participants performed across the three exercises. The task success rate for the three exercises was 68% (for composing an intent indicating multiple views), 87% (for specifying a desired Vis), and 71% (for creating a VisList).

By inspecting the trace of attempts, we found that participants were, on average, able to obtain the first successful answer within their first five tries. Participants' most common mistakes involved confusion around the syntax for specifying multiple visualizations via union. Finally, participants were encouraged to try out one of the provided datasets for open-ended exploration. While participants successfully used Lux to print and visualize their dataframes, due to the setting and time constraints, their interactions with Lux were brief. The limited insight into how users perform open-ended exploration with Lux motivated the need for the following study.

From December 2020 to January 2021, we conducted semi-structured interviews with participants who used Lux in their data science work. We interviewed two industry data scientists in an insurance (P1) and retail company (P3), and a researcher in education (P2). Given that participants had extended exposure to Lux, our questions largely focused on understanding how Lux fits into their existing workflows. Before the interviews, participants used Lux over the span of 1-2 months in their professional data science work. Their usage frequency varied: P1 used Lux daily, P2 used Lux once every one or two weeks, P3 used Lux around ten times in total.

Unlike the first-use study, where participants were led through instructions dedicated to how to create Vis and VisList, field study participants learned how to use Lux on their own through tutorials and documentation on our website. We performed a walk-through of real-world notebooks in which participants had used Lux. All three participants expressed that understanding their data was a challenge during exploration. In fact, two of the participants had developed their own homegrown solutions for past projects (echoing findings from Alspaugh et al. [21]), ranging from for loops across matplotlib charts in notebooks to VBA scripts that generate plots in Excel. In their existing workflows, P1 and P2 visualized their data programmatically via matplotlib, while P3 relied largely on Tableau's GUI for creating visualizations.

On dataframe visualizations: All three participants expressed that they appreciated how the automatic visualizations provided by Lux afforded them quick insight into their dataframes without the need for code. P2 typically examines over 100 columns of data as part of an educational course survey, and stated that Lux sped up EDA at least two-fold: "it really helps speed up my exploratory analysis. If not, it will take me forever to go through these many variables." When asked about the scenarios for which they would toggle to the Lux view versus the default pandas table, most participants preferred seeing the Lux view for the purposes of EDA. Participants described how they only use the pandas table to quickly check if "the data looks okay" (P1) and rarely toggle back to it unless they observe anomalous trends in the visualizations. During the study, P2 adopted a workflow where they sampled a single row to display the pandas table in one notebook cell, then printed the Lux view in the cell below to check that the data falls in the expected ranges as displayed in the visualizations.

On dataframe intents: Participants indicated that the concept of intent was an intuitive way for steering the course of their analysis. P1 and P2 leveraged intent as a way of systematically exploring groups of variables they were interested in. To investigate their research questions, P2 listed groups of independent and dependent variables as their intent to explore each group one at a time. P1 and P3 used intent as a way of exploring predictive variables of interest, such as whether a customer purchased accessories alongside their orders, to help inform feature engineering for downstream machine learning. However, challenges in specification sometimes prevented them from making use of intent fully. In particular, P2 and P3 both described that they were interested in exploring alternative data subsets for an attribute of interest (a query that is expressible in Lux's intent language); however, they were unaware that they could specify filter intent with wildcards. Improving the API for intent specification remains an important direction for future work.

On custom actions: Participants noted how the default Lux actions largely covered the basic sets of analyses that they would typically perform on their own. While most participants were unaware that Lux supported the ability to create custom actions, at various points in the interview they described additional actions that they would find useful. For example, P3 described how they wanted to create a custom action that lists the top ten dataframe columns with the most influence over a desired predictive variable. Other participants described actions that are similar to the default Lux actions, but with a different ranking. For example, P2 was interested in categorical variables whose bar charts looked very even, since that means a value has a closer-to-equal likelihood of falling in either category, so the trend is potentially interesting.

On user-specified views: Somewhat surprisingly, the use of Vis and VisList was rarely brought up in the field study interviews. Possible explanations for their limited use include participants' unfamiliarity with these concepts and their usage of Lux in conjunction with other visualization tools. All participants used an existing visualization tool (e.g., matplotlib or Tableau) while exploring their data with Lux. As a result, they simply used their familiar tools for specific visualizations when they knew exactly what to plot. To fully leverage Vis and VisList in their work, participants often asked for ways to extend or customize the visualization type for a user-specified view. For example, P3 explained how market share data was best visualized as a top-k pie chart, while P2 was interested in examining overlaid histograms of different measures for binary variables, such as whether or not a course was open-ended. These findings indicate that increased flexibility in the intent language could give users the visualization capabilities they are familiar with when creating specified views.

Usage of Lux in data science workflows: All three participants described using Lux explicitly in the exploration stage after data loading and cleaning, but before advanced analysis or modeling. P1 and P2 used Lux in conjunction with custom matplotlib code that they repurposed for their analysis. When asked why participants did not print the dataframe for visualizations during the data transformation and cleaning phase, P1 and P3 answered that since the dataframe prints resulted in a few seconds of latency, they were hesitant to do it until they were ready to "chuck in [their] data and get the charts out" (P3). Participants also described how Lux needed to be more robust in visualizing dirty or ill-formatted data.

Summary and Limitations: Table 4 details participants' Likert scale ratings of the functionalities and benefits of Lux. The average System Usability Scale (SUS) [26] score across participants is 70/100. All three participants were interested in continuing to use Lux in their data science work. We learned that Vis and VisList are not as discoverable and easy to use as always-on dataframe visualizations. Despite the enthusiasm around Lux, we find that participants are still attached to their existing visualization tool for this functionality. They shared concerns around customizability and the inability to express their desired visualizations in Lux, pointing to the need for improving the flexibility of the intent language. Furthermore, a controlled study comparing Lux with a manual baseline approach would further quantify the expected benefits of the tool.

We summarize the implementation challenges and lessons learned from the longitudinal open-source deployment of Lux over 15 months, with over 62k downloads. Given that the nature of such engagement is largely organic, ranging from feature requests stemming from Github Issues to questions and discussions with users in our Slack community, these observations and findings will remain qualitative.

Metadata Propagation: To preserve the comprehensive array of convenient operators offered by the dataframe API, we aimed to natively support any possible pandas operations. This led to our design of LuxDataFrame as a wrapper around the pandas dataframe. However, dataframes can often get transformed to other intermediate data structures such as GroupBy, Series, or Index objects when users are working with dataframes. To this end, we extended specific pandas functions and classes to ensure that the metadata and associated information is propagated across a workflow, so that the context does not get lost in intermediate operations.
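The propagation problem described above can be illustrated with the subclassing hooks that pandas itself exposes. This is a minimal sketch under stated assumptions: `TrackedFrame` and its `intent` attribute are hypothetical names chosen for illustration, and the code is not Lux's actual LuxDataFrame implementation.

```python
import pandas as pd


class TrackedFrame(pd.DataFrame):
    """Hypothetical dataframe subclass that carries an `intent` attribute."""

    # Attributes named in _metadata are copied onto results by operations
    # that call pandas' __finalize__.
    _metadata = ["intent"]

    @property
    def _constructor(self):
        # Make pandas operations return TrackedFrame rather than DataFrame.
        return TrackedFrame


df = TrackedFrame(
    {"price": [10, 20, 30], "qty": [1, 2, 3], "region": ["a", "b", "a"]}
)
df.intent = ["price"]

# Selecting a subset of columns yields a new object that keeps the metadata.
subset = df[["price", "qty"]]
```

Not every pandas operation passes metadata along through these hooks, and the behavior can differ across pandas versions, which is consistent with the need described above to extend specific functions and classes.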

Failproofing always-on dataframe display: One of the reasons why dataframes have become popular is the ease and flexibility of being able to work with the data in a schema-free manner [60] . However, this can be problematic for generating recommendations, since the underlying operations for visualization recommendations can lead to errors. For instance, dataframes can often be ill-formatted in a way that is not amenable to visualization. Examples include dataframes with missing values and dataframes that contain mixed datatypes (a common issue when loading in spreadsheets). To this end, we provide pandas-consistent output behavior by safeguarding Lux with informative warnings and falling back to pandas upon internal errors, so that Lux always provides at least the pandas table as the default display. To allow users to effectively operate on their data, it is also crucial that the system provides users with the unmodified, consistent state of the dataframe. In other words, Lux should not modify the user's dataframe in the process of visualization, in order to maintain "What You See Is What You Get" (WYSIWYG) behavior. As a result, all recommendations in Lux are generated as views that are decoupled from the dataframe content.
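The fallback behavior described above follows a simple pattern: always build the table, attempt the recommendations, and on any internal error emit a warning and return the table alone. The sketch below illustrates that pattern in plain Python; `render_table`, `recommend_charts`, and `display` are hypothetical stand-ins, not functions from Lux or pandas.

```python
import warnings


def render_table(rows):
    """Plain tabular display; stands in for the default pandas table."""
    return "\n".join(", ".join(str(v) for v in row) for row in rows)


def recommend_charts(rows):
    """Hypothetical recommender that raises on mixed-type columns."""
    for col in zip(*rows):
        types = {type(v) for v in col if v is not None}
        if len(types) > 1:
            names = sorted(t.__name__ for t in types)
            raise TypeError(f"mixed types in column: {names}")
    return [f"histogram of column {i}" for i in range(len(rows[0]))]


def display(rows):
    """Return the table plus charts, falling back to the table on errors.

    The input rows are only read, never modified, so what the user sees
    matches the data they hold.
    """
    table = render_table(rows)
    try:
        charts = recommend_charts(rows)
    except Exception as exc:
        warnings.warn(f"Recommendations unavailable ({exc}); showing table only.")
        return {"table": table, "charts": []}
    return {"table": table, "charts": charts}
```

Catching a broad `Exception` is a deliberate choice in this sketch: for an always-on display, a missing recommendation is preferable to a failed print of the dataframe.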

Integration with Downstream Reports: One common use case for Lux is to get a quick overview of insights on a new dataset. We found that users often wanted a way to share their findings in their organization. Our initial approach to supporting downstream reporting was to allow users to select one or more visualizations and export them as matplotlib or altair code. However, this approach quickly became unsustainable when users wanted to share many visualizations from their dashboard at the same time. To support presentation and collaboration, we implemented various options for export, from static HTML reports to integration with popular libraries for creating interactive "data apps", including DataPane [15], Panel [17], and Streamlit [18] . Given that many Lux users often share their findings with business stakeholders without a Python development setup, future work might include supporting exports to readily accessible presentation formats, such as PDF or PowerPoint.

Another lesson that we learned is that ease of initial installation and setup is a primary driver impacting the adoption of tools like Lux. In a similar vein, our user surveys and online discussions suggest that the minimal API as demonstrated in our documentation and tutorials is attractive to many data practitioners.

We propose Lux, an always-on visualization framework for exploratory dataframe workflows. Lux is a lightweight wrapper that reduces the barrier to visualizing data, enabling seamless exploration and visual discovery in-situ. To provide better visualization recommendations, we make use of user-provided intent and history, as well as structural information and metadata. We extend and evaluate various optimization strategies that minimize the overhead of Lux, including approximate query processing, lazy computation, and caching and reuse. Lux's adoption over the last year and the success of the user evaluation point to its importance for dataframe workflows, steering users towards valuable insights as they ponder what to do next with their data.

120 years of olympic history: athletes and results

US Department of Education: College Scorecard data

Afghanistan: WHO mission reviews COVID-19 response. World Health Organization

COVID-19 in Pakistan: WHO fighting tirelessly against the odds. World Health Organization

Faster data exploration in Jupyter through Lux

LUX Library: Matplotlib replacer? YouTube

Power BI: Interactive data visualization BI Tools

State of Data Science and Machine Learning 2020. Kaggle

UCI Machine Learning Repository

Futzing and moseying: Interviews with professional data analysts on exploration practices

The Interactive Visualization Gap in Initial Exploratory Data Analysis

Why Rwanda Is Doing Better Than Ohio When It Comes To Controlling COVID-19. NPR

Antifreeze for large and complex spreadsheets: Asynchronous formula computation

SUS: A quick and dirty usability scale

Synopses for massive data: Samples, histograms, wavelets, sketches. Found. Trends Databases

Foresight: Rapid data exploration through guideposts

Speed up EDA With the Intelligent Lux. Medium

Show Me the Numbers: Designing Tables and Graphs to Enlighten

Approximate query processing: Taming the terabytes

How information visualization novices construct visualizations

Selection of views to materialize in a data warehouse

A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker)

Vizml: A machine learning approach to visualization recommendation

Dive: A mixed-initiative system supporting integrated data exploration workflows

Matplotlib: A 2d graphics environment

Magpie: Python at speed and scale using cloud backends

Smart Drill-Down: A New Data Exploration Operator

Trendquery: A system for interactive exploration of trends

Enterprise Data Analysis and Visualization: An Interview Study

Rapid sampling for visualizations with ordering guarantees

Analyzing UCI Crime and Communities Dataset. Kaggle

Code duplication and reuse in jupyter notebooks

Avoiding Drilldown Fallacies with VisPilot: Assisted Exploration of Data Subsets

The Case for a Visual Discovery Assistant: A Holistic Solution for Accelerating Visual Data Exploration

Deconstructing Categorization in Visualization Recommendation: A Taxonomy and Comparative Study

Always-on visualization recommendations for exploratory data science

Dziban: Balancing Agency & Automation in Visualization Design via Anchored Recommendations

Fastmatch: Adaptive algorithms for rapid discovery of relevant histogram visualizations

Automating the design of graphical presentations of relational information

Show Me: Automatic presentation for visual analysis

Expressive Time Series Querying with Hand-Drawn Scale-Free Sketches

Formalizing visualization design knowledge as constraints: Actionable and extensible models in draco

Intelligent visual data discovery with lux: A python library. Medium

Towards scalable dataframe systems

Aiding Collaborative Reuse of Computational Notebooks with

Explaining differences in multidimensional aggregates

User-adaptive exploration of multidimensional data

Discovery-Driven Exploration of OLAP Data Cubes

Vega-lite: A grammar of interactive graphics

Reactive vega: A streaming dataflow architecture for declarative interactive visualization

Effortless data exploration with zenvisage: an expressive and interactive visual analytics system

Polaris: A system for query, analysis, and visualization of multidimensional relational databases

Polaris: a system for query, analysis, and visualization of multidimensional relational databases

Intermittent query processing

The pandas development team. pandas-dev/pandas: Pandas

Altair: Interactive statistical visualizations for python

Towards visualization recommendation systems

Seedb: efficient data-driven visualization recommendations to support visual analytics

Visualization by example. Proc. ACM Program. Lang., 4(POPL)

ggplot2: Elegant Graphics for Data Analysis

Quick recommendation-based data exploration with lux. Medium

Towards a general-purpose query language for visualization recommendation

Visualizing dataflow graphs of deep learning models in tensorflow

Scorpion: Explaining Away Outliers in Aggregate Queries

Helix: Holistic optimization for accelerating iterative machine learning

Enhancing the interactivity of dataframe queries by leveraging think time

Auto-suggest: Learning-to-recommend data preparation steps using data science notebooks

We thank the anonymous reviewers for their valuable feedback. This work is supported by a Facebook Fellowship, grants IIS-2129008, IIS-1940759, and IIS-1940757 awarded by the National Science Foundation, and funds from the Alfred P. Sloan Foundation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies and organizations.